What is synthetic data and how can it help protect privacy?
The increasing prevalence of data science, coupled with a recent proliferation of privacy scandals, is driving demand for secure and accessible synthetic data.
Synthetic data is artificially generated to replicate the statistical properties of real-world data but doesn’t contain any identifiable information. This lowers the barrier to deploying data science by removing the need for vast volumes of real data, and adds security by taking personal information out of the equation.
It’s used by financial institutions to test and train their fraud detection systems against specific scenarios, by researchers working on clinical trials to create baselines for future studies, and increasingly to train machine learning models, as Waymo did by testing its autonomous vehicles on simulated roadways.
Hazy is one of the UK’s leading proponents of synthetic data. The London-based startup was founded in 2017, just as the Cambridge Analytica scandal was breaking and GDPR compliance was becoming a focus for data-driven organisations.
The founders had their own problems with privacy risks and became convinced that synthetic data could provide a solution.
“The privacy just isn’t good enough in anonymisation,” Hazy CEO Harry Keen tells Techworld. “It’s very easy to infer characteristics about someone or even identify an exact individual in an anonymised dataset, because you may very well have access to an ancillary dataset which you can cross-reference … With synthetic data it’s not a record transformation into another record – it’s literally starting from scratch and creating new people based off a generalised statistical approach.”
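Keen is describing what is often called a linkage attack. A toy sketch of the idea in Python, where the records, column names and datasets are entirely invented for illustration:

```python
import pandas as pd

# Hypothetical "anonymised" health records: direct identifiers removed,
# but quasi-identifiers (postcode, date of birth, sex) left in place
health = pd.DataFrame({
    "postcode": ["SW1A 1AA"],
    "dob": ["1980-03-14"],
    "sex": ["F"],
    "diagnosis": ["diabetes"],
})

# A publicly available ancillary dataset, e.g. an electoral roll
roll = pd.DataFrame({
    "name": ["Jane Doe"],
    "postcode": ["SW1A 1AA"],
    "dob": ["1980-03-14"],
    "sex": ["F"],
})

# A simple join on the quasi-identifiers re-identifies the individual
print(roll.merge(health, on=["postcode", "dob", "sex"]))
```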
Hazy trains a machine learning model on the raw data, which learns the patterns, trends and correlations in the underlying statistics. That model then generates artificial data points with no one-to-one relation to the real records, creating an entirely new synthetic dataset based on the characteristics of the original information.
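Hazy hasn’t published its models, but the general shape of the approach can be sketched with an off-the-shelf density estimator: learn the joint distribution of the real columns, then sample fresh rows from it. A minimal illustration using scikit-learn’s GaussianMixture, with a made-up dataset:

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Hypothetical raw dataset: age, income and balance for 10,000 customers
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.normal(45, 12, 10_000),
    "income": rng.lognormal(10.5, 0.4, 10_000),
    "balance": rng.lognormal(8.0, 1.0, 10_000),
})

# Fit a generative model to the joint distribution of the columns
model = GaussianMixture(n_components=8, random_state=0).fit(real)

# Sample brand-new records: no row corresponds to any real individual
rows, _ = model.sample(n_samples=10_000)
synthetic = pd.DataFrame(rows, columns=real.columns)

print(synthetic.describe())  # summary statistics should resemble the real data
```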
This approach has already attracted the attention of Nationwide. Hazy generated a synthetic dataset on-premise for the building society, and the synthetic copy was then stored in the cloud, where Nationwide runs analytics workloads with advanced tooling, higher compute capacity and privacy guarantees.
“The benefit of using Hazy in that sense is that they have never had to put real data into a less trusted environment,” says Keen.
Synthetic data history
Synthetic data originated in the early 1990s but has gained major traction in recent years as the benefits and risks of data science have become widely known.
In addition to reducing the risk of exposing personal information, synthetic data allows organisations to amplify smaller datasets and compete with the companies that own vast volumes of information.
Keen admits that there is always a trade-off between privacy and utility, but argues that synthetic data provides a better balance than blunter techniques such as anonymisation and masking. The challenge is ensuring that the synthetic data can match the insights generated by real information.
“If the raw data set is good at predicting someone’s salary level (you’re just doing a prediction exercise), you want to make sure your synthetic data is just as good at doing that,” says Keen.
“What we do at Hazy is we have ways of optimising your synthetic data. You don’t just literally plug it in and it generates synthetic data and says there you go; we optimise the synthetic data to make sure it’s useful for a specific use case.”
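Hazy’s optimisation pipeline is proprietary, but the underlying utility check can be approximated with the common ‘train on synthetic, test on real’ pattern: fit the same model on each dataset and compare scores on held-out real data. The model choice and metric below are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def utility_gap(real_X, real_y, synth_X, synth_y):
    """Train the same model on real and on synthetic data, then evaluate
    both on a held-out slice of real data. A small gap suggests the
    synthetic data preserves the predictive signal of the original."""
    X_train, X_test, y_train, y_test = train_test_split(
        real_X, real_y, test_size=0.3, random_state=0)

    real_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    synth_model = RandomForestClassifier(random_state=0).fit(synth_X, synth_y)

    real_auc = roc_auc_score(y_test, real_model.predict_proba(X_test)[:, 1])
    synth_auc = roc_auc_score(y_test, synth_model.predict_proba(X_test)[:, 1])
    return real_auc - synth_auc
```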
Hazy is currently focused on attracting clients in the financial sector, but plans eventually to move into any sector that holds large amounts of sensitive data, faces strict regulatory requirements and wants to run analytics in a trusted environment, from healthcare to telecommunications.
Keen believes that most of these clients’ workflows will work with synthetic data, but acknowledges that there are some areas where it needs to be deployed with extreme care, such as the outlier detection used by banks to investigate evidence of fraud.
“You have to be very careful to preserve the outlier, because one feature of the synthetic data generation process is that it can often smooth out those kinds of outliers and lose them from the dataset. If you’re building a fraud detection model, for example, in a bank, you may want to have access to real data or at least double-check the model on that,” he says.
“There are certain use cases where raw data is just going to be better, but there are other ways around that as well, because with synthetic data you can do some really clever stuff where you generate more examples of that kind of outlier and then you use it in an augmented data set.”
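Hazy hasn’t detailed how it does this, but one generic way to augment a synthetic dataset with extra outliers is to resample the rare records with a little jitter so downstream models still see enough examples of the rare behaviour. A simple sketch, in which the function and its parameters are hypothetical:

```python
import numpy as np

def augment_outliers(synthetic, is_outlier, factor=10, noise_scale=0.01, seed=0):
    """Boost rare outliers (e.g. fraud cases) in a synthetic dataset.

    synthetic  -- 2D array of synthetic records (rows = records)
    is_outlier -- boolean mask marking the rare rows to amplify
    factor     -- how many jittered copies to add per original outlier
    """
    rng = np.random.default_rng(seed)
    outliers = synthetic[is_outlier]

    # Draw resampled copies of the outlier rows
    idx = rng.integers(0, len(outliers), size=len(outliers) * factor)
    copies = outliers[idx]

    # Add small Gaussian noise scaled to each column's spread,
    # so the copies are near-duplicates rather than exact ones
    jitter = rng.normal(0, noise_scale * synthetic.std(axis=0), size=copies.shape)

    return np.vstack([synthetic, copies + jitter])
```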