A car is totaled when the cost to repair it exceeds its total value. By that logic, Privacy by Design legislation could soon be totaling data pipelines at some of the most powerful tech companies.
Those pipelines were developed well before the advent of more robust user privacy laws, such as the European Union’s GDPR (2018) and the California Consumer Privacy Act (2020). Their foundational architectures were therefore designed without certain privacy-preserving principles in mind, including k-anonymity and differential privacy.
But the problem extends way beyond trying to layer privacy mechanisms on top of existing algorithms. Data pipelines have become so complex and unwieldy that companies might not even know whether they are complying with regulations. As Meta engineers put it in a leaked internal document: “We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments.”
(When we asked Meta for comment, a spokesperson referred us to the company’s original response to Motherboard about the leaked document, which said, in part: “The document was never intended to capture all of the processes we have in place to comply with privacy regulations around the world or to fully represent how our data practices and controls work.”)
As governments increasingly embrace Privacy by Design (PbD) legislation, tech companies face a choice: either start from scratch or try to fix data pipelines that are old, extraordinarily complex and already non-compliant. Some computer science researchers say a fresh start is the only way to go. But for tech companies, starting over would require engineers to roll out critical data infrastructure changes without disrupting day-to-day operations — a task that’s easier said than done.
‘Open borders’ won’t cut it
Motherboard published the leaked internal document, written by Meta engineers in 2021, at the end of April. In it, an engineering team recommended data architecture changes that would help Meta comply with a wave of governments embracing the “consent regime,” one of the core principles of PbD. India, Thailand, South Korea, South Africa and Egypt were all preparing “impactful regulations” in this realm, and the paper also anticipated U.S. federal privacy regulation in 2022 and beyond. Such legislation would generally require Meta to obtain user consent before collecting data for advertisements.
The Meta engineers identified “the heart of our challenge” as a lack of “closed form systems.” Closed systems, they said, would let Meta enumerate and control all the incoming data flows. The engineers placed that in contrast with the “open borders” system that had been baked into company culture for over a decade.
Meta’s systems had grown increasingly complex and untraceable, the engineers said, citing the example of a single feature (“user_home_city_moved”) drawing from around six thousand data tables.
“These are massive pipelines with massive amounts of data feeding into many different kinds of algorithms,” Nikola Banovic, an assistant professor of computer science and engineering at the University of Michigan, told Protocol. “Because this was never a consideration to begin with, now it’s increasingly difficult to untangle things.”
The leaked document showed the frustration of internal teams tasked with overhauling systems designed in an era when everything was fair game, Banovic said. He noted that advocacy groups are now pressuring companies to design systems around end users.
“It’s not going to be easy,” Banovic said of the shift. He added that, while enhancing user privacy would be possible from a technology perspective, online behavioral advertising is fundamentally in conflict with that objective.
The challenges of tracing data flows at that scale aren’t unique to Meta, according to Hana Habib, a postdoctoral researcher at Carnegie Mellon University. “I’m sure all the major tech companies like Google and Twitter — the big tech giants — are facing this issue just because [of] the scale of their operations,” she told Protocol. Habib noted that most of the largest tech companies have faced GDPR fines.
When to say goodbye
Researchers already have a firm grasp on ways to make existing algorithms more privacy-preserving. K-anonymity, for example, requires that any combination of identifying attributes, such as hometown and employer, be shared by at least k people in a dataset, so that no individual can be singled out. Differential privacy, a standard that has been studied for more than a decade, guarantees that someone observing an algorithm’s outputs cannot reliably tell whether any particular individual’s data was included.
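To make those two terms concrete, here is a minimal Python sketch. The record fields, the helper names and the epsilon value are illustrative assumptions for this article, not anything drawn from Meta’s or Google’s systems: one function checks whether a dataset is k-anonymous over a set of quasi-identifiers, and the other answers a counting query with Laplace noise calibrated for epsilon-differential privacy.

```python
from collections import Counter
import numpy as np

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    (e.g. hometown + employer) appears in at least k records, so no
    individual can be singled out by those attributes alone."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def dp_count(values, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.
    A count has sensitivity 1 (one person's record changes it by at
    most 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative records, not real data.
records = [
    {"hometown": "Austin", "employer": "Acme", "clicked_ad": True},
    {"hometown": "Austin", "employer": "Acme", "clicked_ad": False},
    {"hometown": "Boise", "employer": "Initech", "clicked_ad": True},
]
print(is_k_anonymous(records, ["hometown", "employer"], k=2))  # False: the Boise/Initech group has only one record
print(dp_count([r["clicked_ad"] for r in records], bool, epsilon=0.5))  # noisy count near 2
```

The catch for retrofitting is that both techniques assume you can enumerate exactly which attributes and queries touch user data in the first place, the very “closed form” property the Meta engineers said their pipelines lack.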
For many years now, Big Tech engineers have studied, applied and occasionally advanced these privacy standards. Google, for instance, began collecting some Chrome usage data under differential privacy guarantees around 2014 and has since worked to extend those protections to Google Maps and Assistant. In 2018, Meta committed to differential privacy protections when it gave academics access to user data to study social media’s impact on elections. Apple published an in-depth research paper in 2017 describing its use of differential privacy for features such as emoji recommendations and lookup hints.
If you have 10 things to do and if you have resources to spend on three, which ones would you pick?
But several sources said the problem is scale and sprawl, not just technique.
“I don’t know how they can be expected to comply with privacy regulation, which stipulates that they provide notice to consumers about these aspects of their data, when they don’t really know themselves,” said Habib.
Companies often don’t have visibility into where their data is being used and stored, according to Balaji Ganesan, the CEO and co-founder of the data governance startup Privacera. Ganesan told Protocol that data scientists often copy data without communicating that to the broader organization. So when a customer requests that their data be removed — as they are entitled to do under a PbD framework — a large tech company might not even know how to do so. “The challenge is really understanding where that subject data is,” Ganesan said.
To comply with user privacy regulations, companies need to build data pipelines from the ground up, said Jane Im, a Ph.D. candidate in computer science and engineering at the University of Michigan. “If they really want to comply, they should limit the amount of data they’re collecting,” Im told Protocol.
Facebook and others are accustomed to using “massive amounts of data” in their business, Im added. “Would it be feasible for Facebook to retrain models?” she asked, wondering aloud whether users, if given the opportunity, would consent to the company “tracking so much of their users’ behavior, including off-site.”
“Since these privacy regulations have come out after these systems are built, it's hard to retrofit existing systems to match these laws, which are pretty comprehensive and seem in line with what people actually want related to digital privacy,” said Habib.
Privacy at what cost?
What’s good for privacy often isn’t good for business, but it doesn't need to be that way. As with so much in this field, the outcome depends on implementation.
“We shouldn’t be surprised that accuracy also depends on the context,” Ben Fish, an assistant professor in computer science and engineering at the University of Michigan, told Protocol. “But it is far from guaranteed that privacy techniques will make a system worse — they can make a system better.”
In the leaked document, Meta engineers said addressing the privacy challenges would “require additional multi-year investment in Ads and our infrastructure teams to gain control over how our systems ingest, process and egest data.” That effort would require roughly 600 years’ worth of engineering time assigned to related projects, the authors estimated.
The Meta document shows just how resource-intensive it can be to rework systems to be more privacy compliant. Assigning those resources is obviously costly, so the challenge for regulators is making the penalties for violators costly enough to push privacy up on the priority list.
Executives must choose between allocating resources to privacy initiatives and other business priorities, according to Ganesan. “It always boils down to, at the top level, if you have 10 things to do and if you have resources to spend on three, which ones would you pick?” he said. Ganesan said the willingness to prioritize those investments is where things fall short more than anything else.
Further complicating the investment calculus, several sources said they see the shift from open to closed systems as only a first step.
“Even questions about where should the controls for these kinds of actions be placed so they're findable, they're discoverable — so that people know that they can actually do this — is an open research question, let alone what would it take to create these massive, massive pipelines that control for user data,” said Banovic.
Then there’s the consumer side: “We need more education for users that could potentially lead to more collective action,” said Im. Most social media users don’t grasp the extent to which online behavioral advertising business models collect and monetize their data, according to several research papers she referenced. “This kind of goes back to media literacy,” Im said.