Deep Generative Models are Privacy Regularisers
Regulatory Topics I
Introduction
Most financial and business data are plagued with important anomalies that can't easily be preserved by synthetic data generators. The bulk of the solutions found online are also time-agnostic, leading to low-fidelity synthetic time series. Even stand-alone, cross-sectional methods fail simple data validation checks; in census data, for example, you might produce a record of a five-year-old who already has a mortgage and dependents. As things stand, producing quality synthetic data still involves a range of preprocessing and postprocessing techniques. It is of course only a matter of time until these problems are ironed out; however, there is a host of other problems that will stick around for good.
A single synthetic data project can be broken into two steps: (1) identify the methods to generate a synthetic dataset or amend the current dataset, and (2) choose, measure, and compare similarity metrics and empirical privacy metrics across different solutions. Synthetic data can primarily be generated in three ways: hand-engineered methods, agent-based modelling, and generative models.
Methods
Hand-engineered methods (HEM) identify an underlying distribution from real data and expert opinion and seek to imitate it. The data are treated as an instantiation of a random variable, and a joint multivariate probability distribution is modelled and sampled. Discrete variables can be modelled with decision trees and Bayesian networks, spatial variables with spatial decomposition trees, and non-linear, correlated continuous variables with copulas. This is classified as a top-down modelling approach.
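To make the copula idea concrete, below is a minimal sketch, assuming a pandas DataFrame `real_df` of continuous columns (the name is illustrative): a Gaussian copula captures the dependence between variables through their rank correlations, and new rows are drawn by pushing correlated normal samples back through the empirical marginals.

```python
import numpy as np
import pandas as pd
from scipy import stats

def sample_gaussian_copula(real_df: pd.DataFrame, n_samples: int) -> pd.DataFrame:
    """Fit a Gaussian copula to continuous columns and sample new rows."""
    # 1. Map each marginal to uniform ranks, then to standard normal scores.
    uniforms = real_df.rank(pct=True).clip(1e-6, 1 - 1e-6)
    scores = pd.DataFrame(stats.norm.ppf(uniforms), columns=real_df.columns)

    # 2. The dependence structure is captured by the correlation of the scores.
    corr = scores.corr().values

    # 3. Draw correlated normals and map back through the empirical marginals.
    z = np.random.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = stats.norm.cdf(z)
    return pd.DataFrame({
        col: np.quantile(real_df[col], u[:, i])  # inverse empirical CDF per column
        for i, col in enumerate(real_df.columns)
    })
```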
Agent-based models (ABM) generate data by first establishing known agents and allowing them to interact according to prescribed rules, in the hope that this interaction ultimately produces distributional profiles similar to those of the original dataset. This is classified as a bottom-up modelling approach.
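As a rough illustration of the bottom-up idea, the toy simulation below (all names and rules are hypothetical) lets a population of customer agents follow a simple spending rule; the recorded transactions form the synthetic dataset, and their distribution emerges from the interactions rather than being modelled directly.

```python
import numpy as np

rng = np.random.default_rng(0)

class Customer:
    """Hypothetical agent: spends a noisy amount of its remaining budget each period."""
    def __init__(self, budget: float, spend_scale: float):
        self.budget = budget
        self.spend_scale = spend_scale

    def step(self) -> float:
        spend = min(self.budget, rng.exponential(self.spend_scale))
        self.budget -= spend
        return spend

# A population of agents with heterogeneous budgets and spending behaviour.
agents = [Customer(budget=rng.gamma(5, 200), spend_scale=rng.uniform(10, 50))
          for _ in range(1_000)]

# Run 30 periods; the recorded (period, agent, amount) tuples are the synthetic data.
transactions = [(t, i, agent.step())
                for t in range(30)
                for i, agent in enumerate(agents)]
```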
Generative models (SDG) are typically deep learning models used to generate synthetic data. For example, generative adversarial networks (GANs) can be used to imitate the profile of the original data. Like hand-engineered methods, generative modelling performs top-down modelling, but in a more automated manner. It is also able to model complex relationships and so generate more realistic-looking data.
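A minimal GAN training loop for tabular data might look like the sketch below, assuming PyTorch and purely illustrative layer sizes; it is not a production architecture, only the adversarial mechanic of a generator mapping noise to fake rows and a discriminator scoring real against fake.

```python
import torch
import torch.nn as nn

n_features, latent_dim = 10, 32  # assumed table width and noise dimension

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch: torch.Tensor) -> None:
    batch = real_batch.size(0)
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator: push real rows towards 1 and generated rows towards 0.
    d_loss = bce(discriminator(real_batch), torch.ones(batch, 1)) \
           + bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator score the fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```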
Regardless of the method used, the purpose is to develop statistically equivalent data with enhanced privacy.
The generated data should be private and anonymised while maintaining utility for prediction and modelling tasks. Where the purpose is to generate multiple datasets over time, the methods should be scalable and streamlined. Private synthetic data can be created in three steps: obfuscation, generation, and evaluation.
Obfuscation and De-identification
Obfuscation involves the removal/distortion of sensitive values, distributions, and relationships from the original data.
A first step is to remove all sensitive values or ensure that they are replaced with pseudonymous or hashed values. For example, postal codes can be replaced with synthetic or masked values. However, simply hashing a postal code won't be sufficient because you could still identify postal codes by matching the distribution of the dataset against public records.
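A minimal pseudonymisation sketch is shown below: a keyed (salted) hash, using Python's standard `hmac` and `hashlib` modules, prevents trivial reversal of the tokens, although, as noted, distribution-matching attacks remain possible.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # keep this key outside the shared dataset

def pseudonymise(value: str) -> str:
    # Keyed hash: identical inputs map to the same token, but the token cannot
    # be reversed without the key (a plain unsalted hash could be brute-forced).
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

postal_codes = ["SW1A 1AA", "EC1A 1BB", "SW1A 1AA"]
masked = [pseudonymise(p) for p in postal_codes]  # first and last tokens match
```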
With some additional adjustments, you can hide the distribution for individual privacy and for competitive business reasons. For example, including a customer’s postal code and the street name could expose your client demographic. A first natural solution is to perturb the value with some noise. A further option is to use aggregation methods: the data can be bucketed, rounded up, or aggregated according to some higher-level identifier. For example, instead of using postal codes, you can use city names. If a high-level identifier is not available, you can use statistical clustering methods to designate new identifiers. More sophisticated group-based strategies like l-diversity, k-anonymity, and t-closeness can also be used.
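The sketch below illustrates perturbation and aggregation on a hypothetical pandas DataFrame with `income` and `postal_code` columns; the column names, noise scale, and bucket width are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": [41_000, 55_500, 61_200, 39_800],
    "postal_code": ["SW1A 1AA", "SW1A 2BB", "EC1A 1BB", "EC1A 4CC"],
})

# Perturb: add zero-mean Laplace noise scaled to the column's spread.
df["income_noisy"] = df["income"] + rng.laplace(0, 0.05 * df["income"].std(), len(df))

# Aggregate: bucket incomes into 10k bands and coarsen postal codes to the district.
df["income_bucket"] = (df["income"] // 10_000) * 10_000
df["district"] = df["postal_code"].str.split().str[0]
```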
A third concern is variable relationships: for example, the association between city names and other variables like revenue-per-client might be sensitive business information. Here you can apply a random or selective reshuffle of the revenue-per-client column to break this relationship.
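A minimal reshuffling sketch, again on hypothetical column names: permuting the sensitive column independently of the rest preserves its marginal distribution but breaks its association with the city column.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Leeds", "Bristol", "Leeds", "York"],
    "revenue_per_client": [120.0, 310.0, 95.0, 240.0],
})

# Permute the sensitive column independently of the rest of the rows: the
# marginal distribution survives, the city-to-revenue association does not.
df["revenue_per_client"] = (
    df["revenue_per_client"].sample(frac=1.0, random_state=7).to_numpy()
)
```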
Generating Synthetic Data
Once you have masked sensitive values and removed sensitive distributions and cross-correlations, you can start using synthetic data generation techniques. At this point, the data still maintains a one-to-one match with the original records, and this has to change for additional privacy protection. The question is: how can we learn from feature values, distributions, and relationships to generate similar-looking, but entirely new, datasets? As highlighted before, there are three main methods to achieve this: hand-engineered methods, agent-based models, and generative models.
We generally start by looking at the data structure and selecting from the range of applicable model architectures and parameter choices. For example, are you dealing with time-series, cross-sectional, panel, or network data? Is the problem high-dimensional, and are you dealing with mixed data types?
SDG with Differential Privacy
HEMs and ABMs have a very low risk of revealing sensitive row-level attributes, whereas SDGs do pose some risk while generally providing enhanced utility. Although SDGs give good, though not guaranteed, privacy protection by simply attempting to imitate distributions and joint distributions, they can overfit the original data, which raises privacy concerns. For example, if GAN-based SDG models are not built correctly, they can suffer from mode collapse, which can result in some real records (real hashed records) leaking into the generated dataset. This can be alleviated by using Wasserstein GANs, weight clipping, and gradient penalties.
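For illustration, a gradient-penalty term in the WGAN-GP style could be sketched as below (PyTorch assumed); it penalises the critic's gradient norm on points interpolated between real and generated rows, which is one of the stabilisation tricks mentioned above.

```python
import torch

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor,
                     lam: float = 10.0) -> torch.Tensor:
    # Interpolate between real and generated rows.
    eps = torch.rand(real.size(0), 1)
    mixed = eps * real + (1 - eps) * fake.detach()
    mixed.requires_grad_(True)

    # Penalise deviations of the critic's gradient norm from 1 at those points.
    scores = critic(mixed)
    grads, = torch.autograd.grad(scores.sum(), mixed, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```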
Differential privacy, on the other hand, attempts to inject noise either into the dataset, the optimisation process, the model parameters, or the model outputs. Recent research has shown that injecting noise into the optimisation process of neural networks could offer the best privacy-utility trade-off. The synthetic model plus differential privacy gives mathematical privacy guarantees. The problem with noise injection is that it can lead to a large degradation in the quality of the synthetic data being produced, so the accuracy-privacy trade-off should be continuously inspected.
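A rough sketch of noise injection into the optimisation process, in the spirit of DP-SGD, is shown below (PyTorch assumed); libraries such as Opacus automate this properly, and the hyperparameters here are purely illustrative. Each per-sample gradient is clipped to a norm bound, then Gaussian noise is added before the optimiser step.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimiser,
                clip_norm: float = 1.0, noise_multiplier: float = 1.1) -> None:
    summed = [torch.zeros_like(p) for p in model.parameters()]

    # Per-sample gradients, each clipped to the norm bound before aggregation.
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add Gaussian noise calibrated to the clip norm, average, and step.
    model.zero_grad()
    for p, s in zip(model.parameters(), summed):
        noise = torch.normal(0.0, noise_multiplier * clip_norm, size=s.shape)
        p.grad = (s + noise) / len(batch_x)
    optimiser.step()
```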
Privacy Risks
For one, synthetic data has, and always will have, an element of privacy risk. The reason is quite simple: deep generative models seek to reproduce the distribution of the original data, and if a model does its job too well, i.e., overfits, then some real records could be leaked or be very close to the original records, even if only by pure chance. To help, we can add what is essentially noise to the generation process, a technique called differential privacy. Naturally, the more noise we introduce, the greater the privacy, but the worse the utility of the data. There are essentially two groups of privacy risks, attribute disclosure and presence disclosure, and neither can ever be eliminated entirely by simply using differential privacy[1].
With presence privacy, we want to make sure that we can't identify whether particular original records were used to develop the new records, and with attribute privacy, we want to make sure that if some information from the original records leaks, it can't be used to infer further information about those records. Differential privacy, i.e. noise injection, can help us minimise both groups of risk.
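One common heuristic for presence disclosure, not a formal guarantee, is to compare how close synthetic rows sit to the training data versus a holdout set; the sketch below, assuming scikit-learn and numeric arrays, flags memorisation when synthetic rows hug the training rows much more tightly than unseen rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distances(queries: np.ndarray, reference: np.ndarray) -> np.ndarray:
    # Distance from each query row to its nearest neighbour in the reference set.
    dist, _ = NearestNeighbors(n_neighbors=1).fit(reference).kneighbors(queries)
    return dist.ravel()

def presence_risk_ratio(synthetic: np.ndarray,
                        train: np.ndarray,
                        holdout: np.ndarray) -> float:
    # Values well below 1 mean the synthetic rows sit much closer to the training
    # data than to unseen data, which points towards memorisation.
    return float(np.median(nn_distances(synthetic, train)) /
                 np.median(nn_distances(synthetic, holdout)))
```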
However, deep generative models with differential privacy are still a game-changer because they finally expose a parameter that lets the modeller decide what accuracy-privacy trade-off they would be happy with; so instead of just pseudonymising a dataset, which retains a one-to-one match with the original data, you can now generate an unlimited number of one-to-many data points and introduce your preferred level of privacy as insurance. Synthetic data is therefore a data-privacy regulariser. For that reason, users of synthetic data have to establish a synthetic data production frontier and draw strict lower limits on accuracy and privacy that they would be happy with.
On top of that, synthetic data should only be the second line of defence. Most of the attention should still go into de-identification methods like masking and perturbation, not just at the row level but also at the column level; you might not want to share certain correlations with the world. These are things that have to be decided by the data owner. My advice is generally that if you are not willing to share the de-identified data with permuted columns, synthetic data won't help you.
Problems That Will Remain
It is easy to create a single synthetic dataset, but it is exponentially harder to develop a synthetic database that consists of multiple datasets, and doing so might, in fact, require expert advice and hardcoded behaviours.
It is extremely hard to explain privacy metrics to management; for that reason, the industry has converged on developing synthetic data only as a second line of defence for already de-identified data.
And even then, privacy risks remain after performing both de-identification (masking) and differentially private, deep generative synthetic data generation (noise).
Privacy risks remain because 'anonymised data' is only ever anonymous in a local sense, not a global sense: it can always be correlated with external datasets, thereby leaking personally identifiable information.
Synthetic data providers that don't provide you with privacy metrics could, in fact, be lying and simply resampling from the original data with small perturbations, which should set off warning lights.
The term 'synthetic data generation' will be misused in the future to claim that companies are storing GDPR-compliant data; I will explore this point in a future article.