Synthetic Data Generation Whitewashing and Pattern Privacy
Data Cartels: Artificial Intelligence Information Heists XII
As soon as data has entered a machine learning model, it is hard to get it out. We still do not have a good answer to a data-removal request when the data points have already been captured in a production model; some methods do exist, but they are not particularly efficient[1]. There is a plausible dystopian future in which companies first generate synthetic data before a machine learning model ever ingests it. This will be sold as a way to ‘keep client data safe’, even though it will simply be the most cost-effective way to deal with the deletion dilemma. In fact, this could all be done in a single end-to-end deep neural network pipeline, by embedding a generative adversarial network (GAN) or a variational autoencoder (VAE) ahead of the downstream model.
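To make the mechanics concrete, here is a minimal sketch of what such a pipeline could look like: a small VAE is fitted to tabular user records, and only its synthetic samples are handed to the production model. This is my own illustrative assumption of the setup, not any vendor's actual implementation; all names, dimensions and the toy training loop are hypothetical.

```python
# Hypothetical sketch: train a small VAE on real records, then train the
# downstream model only on synthetic samples drawn from it.
import torch
import torch.nn as nn


class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.decoder(z), mu, logvar


def train_and_sample(real_data: torch.Tensor, n_synthetic: int) -> torch.Tensor:
    vae = TabularVAE(real_data.shape[1])
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    for _ in range(200):  # toy training loop, far shorter than a real one
        recon, mu, logvar = vae(real_data)
        recon_loss = ((recon - real_data) ** 2).mean()
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = recon_loss + kl
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():  # the production model only ever sees these samples
        z = torch.randn(n_synthetic, vae.latent_dim)
        return vae.decoder(z)


# Usage sketch: the production model trains on train_and_sample(real, 10_000)
# instead of on `real`, so a later deletion request never maps cleanly onto
# anything the model was trained on -- which is exactly the loophole at issue.
real = torch.randn(1_000, 5)  # stand-in for real user records
synthetic = train_and_sample(real, 10_000)
```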
There are already papers from top academics suggesting that deep generative models can produce synthetic data that is GDPR compliant.[2] It is not a far stretch to imagine large corporations claiming that they do not have to respond to deletion requests under GDPR’s Right to Be Forgotten, because they no longer hold personally identifiable data. Even when corporations do implement the best models for user protection, the synthetic data will still have private patterns embedded in the generation process, all gleaned from users’ personal data; for that reason, we need an update in regulation to declare these learned distributions private too, because each individual’s data still rests within the manifold.
Arguably the world’s best research group working on synthetic data writes that “[regulation like GDPR] suffer from lack of clarity in the definitions of ‘personal data’, and ‘anonymisation’.” Regulatory bodies have to pay special attention to the research coming out of these groups. If we don’t get more specific with these definitions and, for example, state that distributions learned from users’ data are also confidential, then companies might get away with preserving our personal data indefinitely.
Deep generative models with differential privacy are designed to maximise individual privacy while preserving, as closely as possible, the overall feature patterns and relationships. This raises a very important ethical question: are we only fearful of individual record intrusion, and not pattern intrusion? Personal privacy might be preserved, but the patterns keep all their functional properties, meaning they can still be wielded to make decisions with very personal consequences. For this reason, there might be a need not just to preserve individual privacy, but also to safeguard individual pattern privacy; otherwise, synthetic data generation could circumvent the intentions of the regulations.
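A toy numerical example makes the distinction between record privacy and pattern privacy tangible. The scenario below is entirely my own assumption (a made-up income gap between two groups, with heavy per-record noise added); it is not drawn from the cited papers. The point it illustrates is that noise strong enough to obscure any single record can leave the population-level pattern, the thing decisions are actually made on, essentially untouched.

```python
# Illustrative only: per-record perturbation protects individuals, but the
# group-level pattern a lender or insurer would act on survives intact.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
group = rng.integers(0, 2, n)                       # e.g. a postcode cluster
income = 30_000 + 20_000 * group + rng.normal(0, 5_000, n)

# Heavy Laplace noise on every individual record (local-DP-style perturbation).
noisy_income = income + rng.laplace(scale=20_000, size=n)

# Any single noisy record is useless for recovering the true value...
print(abs(noisy_income[0] - income[0]))             # typically tens of thousands off

# ...but the group-level gap is almost unchanged, and still actionable.
true_gap = income[group == 1].mean() - income[group == 0].mean()
noisy_gap = noisy_income[group == 1].mean() - noisy_income[group == 0].mean()
print(round(true_gap), round(noisy_gap))            # both close to 20_000
```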
Synthetic data whitewashing hasn’t quite started yet, but it will: the incentives are in place for this to become a growing problem. More and more companies will be lured into using synthetic data generation techniques. See, for example, the following excerpt from Facteus’ website: “As data privacy laws and regulations such as GDPR, CCPA, and GLBA crowd the industry landscape, Facteus’ Synthetic Data methodology helps keep client data safe and compliant.”[3] They further claim that the risks posed by synthetic data are “None”, even though that has been shown to be false on numerous occasions[4].
It is well known that we should not rely too heavily on anonymisation techniques; the possibility of one-to-one matching of records back to individuals should make one cautious. Such information must be kept strictly confidential in accordance with the General Data Protection Regulation (GDPR). Inside a company, however, engineers who fix software bugs or data scientists who build dashboards will still require realistic-looking data. Companies do not want their employees to have access to real user data, as this would violate user privacy, so measures need to be taken to prevent insiders from reading sensitive information. At Uber, for example, a system is deployed so that employees can only access perturbed customer data[5]. Prior to deploying it, Uber reportedly mishandled customer data, allowing employees to gain unauthorised access[6]. Clearly, then, there are scenarios in which synthetic data generation methods will be of great benefit; but when they are used to collect and sell data unbeknownst to the users, they become ethically corrupt.
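For readers unfamiliar with how such internal perturbation works in principle, here is a hedged sketch of the general idea behind query-level differential privacy, in the spirit of [5] but not Uber’s actual code: an internal dashboard only ever sees aggregate results with calibrated Laplace noise, so no employee reads an exact, individually attributable value. The function name and epsilon value are illustrative assumptions.

```python
# Hypothetical sketch of the Laplace mechanism for an internal counting query.
import numpy as np


def private_count(exact_count: int, epsilon: float = 0.5) -> float:
    """Laplace mechanism for a counting query: the sensitivity of a count is 1."""
    noise = np.random.default_rng().laplace(scale=1.0 / epsilon)
    return exact_count + noise


# e.g. "how many riders cancelled in postcode X last night?"
exact = 42
print(private_count(exact))  # the analyst sees something like 40.7, never exactly 42
```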
[1] https://proceedings.neurips.cc/paper/2019/file/cb79f8fa58b91d3af6c9c991f63962d3-Paper.pdf
[2] https://sci-hub.st/10.1109/JBHI.2020.2980262
[3] https://conifer.rhizome.org/snowde/the-finance-parlour/20210107151134/https://www.facteus.com/solutions/synthetic-data/
[4] https://arxiv.org/abs/2011.07018
[5] https://arxiv.org/abs/1706.09479
[6] https://www.theverge.com/2017/8/15/16150902/uber-ftc-complaint-mishandle-privacy-data