Privacy by Design laws will kill your data pipelines

The legislation could make old data pipelines more trouble than they’re worth.

Data pipelines have become so unwieldy that companies might not even know if they are complying with regulations.

Image: Andriy Onufriyenko/Getty Images

A car is totaled when the cost to repair it exceeds its total value. By that logic, Privacy by Design legislation could soon be totaling data pipelines at some of the most powerful tech companies.

Those pipelines were developed well before the advent of more robust user privacy laws, such as the European Union’s GDPR (2018) and the California Consumer Privacy Act (2020). Their foundational architectures were therefore designed without certain privacy-preserving principles in mind, including k-anonymity and differential privacy.

But the problem extends way beyond trying to layer privacy mechanisms on top of existing algorithms. Data pipelines have become so complex and unwieldy that companies might not even know whether they are complying with regulations. As Meta engineers put it in a leaked internal document: “We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments.”

(When we asked Meta for comment, a spokesperson referred us to the company’s original response to Motherboard about the leaked document, which said, in part: “The document was never intended to capture all of the processes we have in place to comply with privacy regulations around the world or to fully represent how our data practices and controls work.”)

As governments increasingly embrace Privacy by Design (PbD) legislation, tech companies face a choice: either start from scratch or try to fix data pipelines that are old, extraordinarily complex and already non-compliant. Some computer science researchers say a fresh start is the only way to go. But for tech companies, starting over would require engineers to roll out critical data infrastructure changes without disrupting day-to-day operations — a task that’s easier said than done.

‘Open borders’ won’t cut it

Motherboard published the leaked internal document, written by Meta engineers in 2021, at the end of April. In it, an engineering team recommended data architecture changes that would help Meta comply with a wave of governments embracing the “consent regime,” one of the core principles of PbD. India, Thailand, South Korea, South Africa and Egypt were all preparing “impactful regulations” in this realm, and the paper also anticipated U.S. federal privacy regulation in 2022 and beyond. Such legislation would generally require Meta to obtain user consent before collecting data for advertisements.

The Meta engineers identified “the heart of our challenge” as a lack of “closed form systems.” Closed systems, they said, would let Meta enumerate and control all the incoming data flows. The engineers placed that in contrast with the “open borders” system that had been baked into company culture for over a decade.

Meta’s systems had grown increasingly complex and untraceable, the engineers said, citing the example of a single feature (“user_home_city_moved”) drawing from around six thousand data tables.

“These are massive pipelines with massive amounts of data feeding into many different kinds of algorithms,” Nikola Banovic, an assistant professor of computer science and engineering at the University of Michigan, told Protocol. “Because this was never a consideration to begin with, now it’s increasingly difficult to untangle things.”

The leaked document showed the frustration of internal teams tasked with overhauling systems designed in an era when everything was fair game, Banovic said. He noted that advocacy groups are now pressuring companies to design systems around end users.

“It’s not going to be easy,” Banovic said of the shift. He added that, while enhancing user privacy would be possible from a technology perspective, online behavioral advertising is fundamentally in conflict with that objective.

The challenges of tracing data flows at that scale aren’t unique to Meta, according to Hana Habib, a postdoctoral researcher at Carnegie Mellon University. “I’m sure all the major tech companies like Google and Twitter — the big tech giants — are facing this issue just because [of] the scale of their operations,” she told Protocol. Habib noted that most of the largest tech companies have faced GDPR fines.

When to say goodbye

Researchers already have a firm grasp on ways to make existing algorithms more privacy-preserving. K-anonymization, for example, is a user privacy technique that ensures data is sufficiently aggregated such that no individual can be identified by combined factors such as hometown and employment. Differential privacy, a standard that has been studied for over a decade, guarantees that someone observing the outputs of an algorithm cannot know whether it included data from a particular individual.
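Both techniques can be stated precisely in a few lines of code. The following is a minimal, illustrative sketch, not any company’s implementation: `is_k_anonymous` checks the aggregation property described above, and `dp_count` adds Laplace noise to a count query, the textbook differential privacy mechanism. The function names and the choice of a count query are assumptions for illustration.

```python
import math
import random
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # k-anonymity: every combination of quasi-identifier values
    # (e.g. hometown + employer) must be shared by at least k records,
    # so no individual is singled out by those combined factors.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def dp_count(true_count, epsilon):
    # Laplace mechanism for a differentially private count: a count
    # query has sensitivity 1 (one person's presence changes it by at
    # most 1), so adding Laplace(1/epsilon) noise hides whether any
    # particular individual's data was included.
    scale = 1.0 / epsilon
    u = random.random() - 0.5  # uniform in (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

For example, a table with two identical Austin/Acme rows and one unique Boise/Globex row fails the k = 2 check, because the Boise resident is identifiable from the combination alone. Smaller epsilon values in `dp_count` mean noisier outputs and stronger privacy.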

For many years now, Big Tech engineers have studied, applied and occasionally advanced these privacy standards. Google, for instance, achieved differential privacy anonymization in Chrome around 2014 and has since worked to expand it to Google Maps and Assistant. In 2018, Meta assured differential privacy compliance when it gave academics access to user data for assessing the impact of social media on elections. Apple published an in-depth research paper in 2017 about its application of differential privacy for features such as emoji recommendations and lookup hints.

But several sources said the problem is scale and sprawl, not just technique.

“I don’t know how they can be expected to comply with privacy regulation, which stipulates that they provide notice to consumers about these aspects of their data, when they don’t really know themselves,” said Habib.

Companies often don’t have visibility into where their data is being used and stored, according to Balaji Ganesan, the CEO and co-founder of the data governance startup Privacera. Ganesan told Protocol that data scientists often copy data without communicating that to the broader organization. So when a customer then requests their data be removed — as they are entitled under a PbD framework — a large tech company might not even know how to do so. “The challenge is really understanding where that subject data is,” Ganesan said.
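Ganesan’s point can be made concrete with a hypothetical data inventory: unless every copy of a user’s data is registered somewhere, there is no complete list to work from when an erasure request arrives. The class, method names and store labels below are illustrative assumptions, not a real system’s API.

```python
class DataInventory:
    """Hypothetical registry mapping each user to every location
    holding a copy of their data."""

    def __init__(self):
        # user_id -> set of (store, table) locations
        self._locations = {}

    def record_copy(self, user_id, store, table):
        # The step Ganesan says is usually skipped: whenever a data
        # scientist copies user data, the new location must be
        # registered, or it becomes invisible to later requests.
        self._locations.setdefault(user_id, set()).add((store, table))

    def erasure_targets(self, user_id):
        # A right-to-erasure request is only as complete as this
        # list; unregistered copies are silently retained.
        return sorted(self._locations.get(user_id, set()))
```

The sketch shows why the problem is organizational as much as technical: the registry only works if every pipeline that duplicates data actually calls `record_copy`, which is exactly the discipline the “open borders” culture never enforced.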

To comply with user privacy regulations, companies need to build data pipelines from the ground up, said Jane Im, a Ph.D. candidate in computer science and engineering at the University of Michigan. “If they really want to comply, they should limit the amount of data they’re collecting,” Im told Protocol.

Facebook and others are accustomed to using “massive amounts of data” in their businesses, Im added. “Would it be feasible for Facebook to retrain models?” she asked, wondering aloud whether users would consent to “tracking so much of their users’ behavior, including off-site” if given the choice.

“Since these privacy regulations have come out after these systems are built, it's hard to retrofit existing systems to match these laws, which are pretty comprehensive and seem in line with what people actually want related to digital privacy,” said Habib.

Privacy at what cost?

What’s good for privacy often isn’t good for business, but it doesn't need to be that way. As with so much in this field, the outcome depends on implementation.

“We shouldn’t be surprised that accuracy also depends on the context,” Ben Fish, an assistant professor in computer science and engineering at the University of Michigan, told Protocol. “But it is far from guaranteed that privacy techniques will make a system worse — they can make a system better.”

In the leaked document, Meta engineers said addressing the privacy challenges would “require additional multi-year investment in Ads and our infrastructure teams to gain control over how our systems ingest, process and egest data.” That effort would require roughly 600 years’ worth of engineering time assigned to related projects, the authors estimated.

The Meta document shows just how resource-intensive it can be to rework systems to be more privacy compliant. Assigning those resources is obviously costly, so the challenge for regulators is making the penalties for violators costly enough to push privacy up on the priority list.

Executives must choose between allocating resources to privacy initiatives and other business priorities, according to Ganesan. “It always boils down to, at the top level, if you have 10 things to do and if you have resources to spend on three, which ones would you pick?” he said. Ganesan said the willingness to prioritize those investments is where things fall short more than anything else.

Further complicating the investment calculus, several sources said they see the shift from open to closed systems as only a first step.

“Even questions about where should the controls for these kinds of actions be placed so they're findable, they're discoverable — so that people know that they can actually do this — is an open research question, let alone what would it take to create these massive, massive pipelines that control for user data,” said Banovic.

Then there’s the consumer side: “We need more education for users that could potentially lead to more collective action,” said Im. Most social media users don’t grasp the extent to which online behavioral advertising business models collect and monetize their data, according to several research papers she referenced. “This kind of goes back to media literacy,” Im said.
