The PivotNine Blog

DataCebo Creates Synthetic Enterprise Data with Actually Useful Generative AI

30 April 2024
Justin Warren

Generative AI hype is at fever pitch, yet most examples are toys, broken, or worse. It is refreshing to find a company using generative algorithms to do something useful.

DataCebo uses generative AI to model enterprise data and then uses those models to generate synthetic datasets with production-like qualities. The company recently took in $8.5 million in seed funding to build out its vision.

“Once customers build a generative model out of their data, they can generate sample data as much as they want. It’s synthetic data which is not really connected to the real data, but it has all the same properties, including format and statistical properties,” said Kalyan Veeramachaneni, CEO and co-founder of DataCebo.

Kalyan Veeramachaneni, CEO and co-founder of DataCebo.

This synthetic data is perfect for testing, particularly in situations that are difficult to test without access to real production data. We all want to keep production data secure within our production systems, yet there are times when access to that data is important.

The traditional approach to testing with production-like data is to take live production data and process it to remove sensitive fields or mask them in various ways. Credit card numbers, social security numbers, tax and healthcare identifiers are all incredibly sensitive. Various jurisdictions have strict rules about how such data can be handled. Yet total removal prevents testing that a system uses these fields correctly. Masking, such as replacing most of the digits in a credit card number with XXXX, can break calculations that rely on the data being valid. Fake data can’t be too fake.
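
To see why, consider the checksum built into card numbers. The sketch below (illustrative Python, not anything from DataCebo) shows a Luhn check passing on a well-formed test card number and failing the instant digits are masked out.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the string passes the Luhn checksum used by payment card numbers."""
    if not number.isdigit():
        return False  # masked values such as '4242XXXXXXXX4242' fail immediately
    digits = [int(d) for d in number]
    # Double every second digit from the right, subtracting 9 when the result exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_valid("4242424242424242"))  # True: a syntactically valid test card number
print(luhn_valid("4242XXXXXXXX4242"))  # False: masking breaks the checksum
```

Any validation, fraud scoring, or reconciliation logic that assumes well-formed card numbers now fails before the interesting code paths are ever exercised.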

DataCebo’s approach promises data that isn’t real but that looks real. Very real, and real in a variety of important ways. Real enough to test even quite complex logic relating different fields to one another, such as for fraud detection. Does this phone number have an area code that matches the customer’s Manhattan address? Does the synthetic purchase history look enough like a real customer’s purchase history that we can test our algorithms won’t trigger false positives? Will our new features actually work when we launch them?
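
As a concrete, if toy, illustration of the kind of cross-field check in question (the table and field names here are hypothetical, not DataCebo’s schema), a test suite can assert relationships between synthetic fields exactly as it would against production rows:

```python
import pandas as pd

# Hypothetical synthetic customer rows, as a generator might emit them.
synthetic_customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "city":        ["Manhattan", "Manhattan", "Brooklyn"],
    "phone":       ["+1-212-555-0147", "+1-646-555-0199", "+1-718-555-0123"],
})

# Cross-field rule: Manhattan customers should carry Manhattan area codes.
MANHATTAN_AREA_CODES = {"212", "646", "332", "917"}

def area_code(phone: str) -> str:
    return phone.split("-")[1]

manhattan = synthetic_customers[synthetic_customers["city"] == "Manhattan"]
assert manhattan["phone"].map(area_code).isin(MANHATTAN_AREA_CODES).all(), \
    "synthetic data violates the area-code/city relationship the tests rely on"
```

If the generator only produced random digits, a check like this would be meaningless; the value is in synthetic data that preserves the relationships the logic under test depends on.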

While it’s possible to build test data generators that have these capabilities, it’s complex and time-consuming. Such systems also tend to be tightly coupled to the system they’re modelling. Designers need to understand the linkage between data fields to accurately model those relationships. If production changes, any downstream data processing also has to change. This can slow down releases, or kill off new functionality that would require too much expensive rework.

“Other approaches aren’t easily generalizable. With this system, you can just point to any database, or multiple tables, and we will find the connections with our product,” Veeramachaneni said. “And once it’s connected, you can build a generative model automatically. So there is not much of human involvement. There’s not much customization required when you move from one system to another.”
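
Stripped to its essentials, the workflow Veeramachaneni describes is “fit a model of the data, then sample from it.” The sketch below is a deliberately crude stand-in for that idea, fitting nothing more than a multivariate normal to two made-up numeric columns; DataCebo’s models are far more sophisticated, and this is not their algorithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Stand-in for a real production table (numeric columns only, for simplicity).
real = pd.DataFrame({
    "order_value":   rng.gamma(shape=2.0, scale=50.0, size=5_000),
    "items_in_cart": rng.poisson(lam=3.0, size=5_000) + 1,
})
real["order_value"] += real["items_in_cart"] * 20  # correlate the two columns

# "Fit a generative model": estimate the joint distribution of the columns.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# "Sample as much as you want": draw brand-new rows from the fitted model.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=5_000),
    columns=real.columns,
)

# The synthetic table mirrors the statistics without copying any real row.
print(real.corr().round(2))
print(synthetic.corr().round(2))
```

The two correlation matrices come out close, which is the property that matters here: the synthetic rows behave like the real ones without any of them being real.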

DataCebo is less about replacing human labour than about allowing these more advanced techniques to be used more often. The skilled data scientists needed for traditional approaches are rare and expensive. Tedious and repetitive work is not the sort of thing highly skilled people want to spend their days doing, especially when there are plenty of other options. By automating the tedious work no one wants to do, systems like DataCebo mean more things will get tested, and tested better.

Right now, far too many organizations do a poor job of sanitizing production data copied for testing. This places customers at greater risk of data breaches, which are already an unacceptably large and growing problem. Yet organizations also don’t test things enough, setting up a conflict of incentives where everyone loses. DataCebo suggests there is a way through, enhancing both security and robustness while also lowering costs.

This is also an all-too-rare example of generative AI deployed where it is genuinely useful. Creating extremely plausible lies is what generative AI does. It is fundamental to how the technology works. It just so happens that production-like test data is a highly plausible lie we actually want more of.

Testing is one of those boring-but-important aspects of enterprise technology. It is part of what turns amateur hacking into professional software development. Doing more and better testing is an obviously good thing that should be encouraged.