Why Facebook’s data-sharing project ballooned into a 2-year debacle
Facebook's colossal social science dataset is finally here, but Nathaniel Persily, the Stanford law professor who helped shepherd its release, didn't hold back about the nearly two-year slog it required:
"I'm happy to be quoted saying this: This was the most frustrating thing I've been involved in, in my life," he said.
Persily is one of the co-chairs of Social Science One, the group that partnered with Facebook on the development and release of the dataset, which includes 38 million URLs shared publicly on Facebook, as well as anonymized data about who shared them and how. The trove is aimed at researchers who want to study social media's impact on elections and democracy.
When he and his partner, Harvard political scientist Gary King, took on the project in 2018, Persily says they assumed that Facebook had already developed a process to share this data externally and had worked through the legal consequences of doing so.
They thought wrong.
Instead, Persily and King say they spent the next 20 months embroiled in tense negotiations with Facebook's engineers, communications professionals and legal teams over how to release the data in a way that both satisfies privacy considerations and makes for useful research material. They found that Facebook employees largely believed in the mission but were hamstrung by concerns about how both regulators and customers would view this type of data sharing. In the end, Persily says, the process of releasing the data said as much about Facebook as it did about the broad, and sometimes vague, laws that govern it.
One central issue for Facebook, Persily says, was complying with the General Data Protection Regulation, which went into effect in the European Union just a month after the project launched in 2018. The regulation seeks to prevent rampant data sharing by companies like Facebook, but, Persily argues, it doesn't adequately balance privacy against the need to share data with researchers who might hold those same companies accountable. And given how new the regulation is, there's still a lack of clarity about how it will be enforced in different countries throughout Europe.
"Just as we're regulating them to force them to protect privacy, we need to regulate them so they're accountable and transparent," he said, "and one aspect of an accountable and transparent regime is having someone other than people inside the firm analyze the data they have to see whether they're destroying democracy around the world."
Persily says he made three trips to Brussels to try to convince the European Commission to do something about that. The European Data Protection Supervisor recently issued guidance on this topic, noting, "Data protection obligations should not be misappropriated as a means for powerful players to escape transparency and accountability."
There is, of course, good reason for regulators — and for Facebook — to be cautious about sharing data with researchers. It was, after all, a researcher at the University of Cambridge who scraped Facebook data and sold it to the political consulting firm Cambridge Analytica before the 2016 election in the U.S., causing an international scandal.
That scandal was, in fact, the impetus for this project, according to King. In 2018, days after the Cambridge Analytica news broke, King says he got a call from Facebook CEO Mark Zuckerberg asking him to study Facebook's impact on the election. But King recalls that Zuckerberg was reluctant to give him access to all of the data he needed and allow him to publish his findings without Facebook's approval. As an alternative, King and Zuckerberg devised a plan to open data up to outside researchers and allow Social Science One to vet their research proposals.
"The Cambridge Analytica scandal was an enormous crisis for this company," King said. "We made use of the crisis. But it also made it difficult."
The Cambridge Analytica scandal forced a privacy backlash that some researchers feel didn't adequately account for the need for transparency. "Unfortunately, the overreaction to Cambridge Analytica by regulators has meant that a significant amount of legitimate academic research on how these platforms are able to be abused is not able to happen now," said Alex Stamos, Facebook's former chief security officer and the current director of the Stanford Internet Observatory. "Lots of politicians who have talked about privacy in a negative way at Facebook are also saying we want academic research to happen. They say those two things and don't realize they're not compatible."
It wasn't just the European Commission. Persily says Facebook was equally cautious about upsetting the Federal Trade Commission. In 2011, the company reached a consent decree with the FTC over its privacy practices, which the agency viewed as deceptive. Then in 2019, the FTC slapped Facebook with a $5 billion fine for violating that agreement and forced the company to commit to a new set of privacy obligations.
Because the United States has no comprehensive privacy law, Facebook's competitors don't face the same restrictions and requirements. Stamos argues it stands to reason, then, that Facebook would now take a more conservative approach to sharing data. "You probably only take those risks for actions that make you money," Stamos said. "You're probably less likely to take that risk for an action that helps someone write a paper."
A Facebook spokesperson told Protocol that abiding by regulations was only one of the company's concerns over the course of this process. "We still have a pretty big responsibility on our shoulders to think not just about the regulatory perspective, but also the good governance perspective," the spokesperson said. "We have a commitment and promise to think critically about the privacy parameters we put around these programs."
According to the spokesperson, Facebook had considered different approaches to sharing the data before Persily and King came on board. One option was to bring researchers into "clean rooms" on Facebook's own property, where they could handle the data, but only in a tightly controlled environment. That idea was scrapped because it would be geographically challenging for international researchers and wouldn't give the perception, or even the reality, that these researchers were really independent.
Once Persily and King came on board, they weighed other ideas, like limiting the number of queries any one researcher could perform on the dataset. That might minimize the risk that the data could be reidentified. But Persily says it also raised concerns that researchers might circumvent those rules by banding together and combining queries.
Finally, they settled on a technique known as "differential privacy," which is also being used by the U.S. Census Bureau to share census data. With this technique, the data would be infused with noise that prevents people from matching anonymous data with real people. Now, the question is whether that technique will weaken the integrity of the data and researchers' findings. Persily and King have released open-source software to help researchers deal with the challenges of this dataset.
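For readers curious what "infused with noise" can look like in practice, here is a minimal Python sketch of one textbook form of differential privacy, the Laplace mechanism applied to a single count. It is an illustration only, not a description of how Facebook or Social Science One built the dataset. The function names, the share count and the epsilon values are all made up for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return a count with Laplace noise added (hypothetical helper).

    epsilon:     privacy parameter; smaller values mean more noise.
    sensitivity: the most one person can change the count
                 (assumed to be 1 here).
    """
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so runs are repeatable
    true_shares = 1200       # invented number of shares for one URL
    for epsilon in (0.1, 1.0):
        released = [round(noisy_count(true_shares, epsilon, rng=rng), 1)
                    for _ in range(5)]
        print(f"epsilon={epsilon}: {released}")
```

Any single released number is off by a random amount, which is what makes it hard to tie back to one person. Averaged over many URLs or many draws, the noise tends to cancel out. That is the trade-off behind the question of how much the technique weakens researchers' findings: more noise means stronger privacy but less precise results.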
Persily and King are hardly satisfied with what they've produced so far. "We thought this would take a couple months. It took 20. And also we thought we would have way more data than this," King said.
Yet, they still view this as an important milestone for anyone who wants to know more about how information flows on the tech platform that dominates our lives. "I always thought it was worth it if we could have a day like today," Persily said. "If we could unlock the safe to some of the data in Facebook's control. Then we would begin to show this is possible."