Researcher danah boyd on how to protect the census and fix tech
"I want the reckoning to not just be on the backs of the people who were most hurt by it."
No one is paying closer attention to how the digital world is changing our lives than danah boyd.
boyd (whose name is all lowercase) is one of the foremost chroniclers of the lives we live online. Over the past decade, her writing and research have shaped how we conceive of media bias, algorithmic fairness, disinformation, and the ethics of data science. boyd is founder and president of Data & Society, partner researcher at Microsoft Research, and visiting professor at NYU. Last year, she received the Electronic Frontier Foundation's Pioneer Award and used her acceptance speech to call upon her colleagues to join her in "articulat[ing] what a better future looks like and work to make it happen. Honestly, we don't have any other choice," she said.
Today, boyd is focused on exposing and mitigating the myriad risks that have accompanied the explosion of commercial data and the acceleration of computing power. Most recently, she's been obsessed with the upcoming census. She released a report on the new and unprecedented stakes of the count, which will implement a novel technical system to protect individual privacy for the first time.
Protocol's Linda Kinstler spoke with boyd about what it will take to ensure that the census stays anonymous, what the "great reckoning" will look like for tech, and why it is still to come.
This interview has been edited and condensed for clarity.
One of the things that we're thinking about here at Protocol is, what does the tech world look like in the next decade? I went back and I looked at the provocations for big data that you and Kate Crawford published in 2011. Do you have any provocations for the next decade or what it might hold?
I think that one of the reasons why I'm doing this particular [census] project at this particular time — democracy happens to be really important to me — is that we are at a point where we think that data speaks for themselves.
We don't want to admit that, but we think that it speaks for themselves. Everybody's like, "I'm going to AI this, AI that." But how good is your data? Does your data actually stand up to that? What does it mean for data to be fit for use? How do you really know what you think you know? I'm reminded of this quote from an economist in the '60s. He said, "If you torture the data long enough, it will confess to anything."
The potential of using data in a sophisticated way has opened so many doors. But how do we make certain that this move towards data in different kinds of environments does not become an authoritarian tool to harm people? It has every potential to do that.
I think we need to simultaneously hold on to the potential for the things we delight in, that we dream of and that we hope for and design for, and the possibilities for where data and the ecosystem can be abused, it can be manipulated, it can be used to do harm. It can be used to exert power. All of those are true simultaneously.
Throughout our history, we've had turns to different forms of knowledge that have been both eye-opening and devastating. Think about the development of physics. Everybody was amazed by the possibility of what we could know and what we could do until we launched an atomic bomb. This is where I worry because I see the practices of data abuse as a way of not only directly harming but also up-ending the kinds of knowledge infrastructures, the kinds of communities, the kinds of things that allow our social fabric to stick together.
Think about this country, by which I mean the United States, and our original sin of slavery. Michelle Alexander's book, "The New Jim Crow," shows how slavery, as it was being dismantled, was helping enable the creation of Jim Crow, and as that was being dismantled, it was helping the creation of mass incarceration. That's her argument. What scares me is that as we're seeing all of this great work to try to dismantle mass incarceration, we're seeing a new data ecosystem being used and developed that is overwhelmingly punitive to communities of color.
I'm not willing to tolerate that. So how do we understand that? What are the various speed bumps we need to put in as we start to think about what are the laws that need to exist? What's the kind of infrastructure that needs to be built in order to counter that?
In the last couple of years, especially since Michelle Alexander's book and Safiya Noble's book and many others have come out to address these problems, have you seen people who are actually working with technology shift their practices in response?
You'll never meet an engineer who is trying to do something harmful. They don't wake up every morning to be like, "I'm going to screw people." That's not the mindset. And so you have to ask, how did they get to the point where their actions did harm? And this by the way, is true in every sector.
And this is my constant sort of push back in the field of journalism, which is that journalists don't wake up every day and say, "I'm going to help undo democracy." No! But what does it mean when journalists become complicit in amplifying things that actually contribute to that? You can help educate a journalist. You can help educate an engineer, but you also have to look at the broader structural conditions. The broader structural conditions within a technology company have a lot to do with certain things, like a commitment to universality. What values dominate when we're going with a universal outcome? That gets pretty dicey pretty fast.
Let's take some of Safiya's points. She looked at Google and at different autocomplete results. Google was like, "Oh, I know what we want to do, let's build this autocomplete so that people don't have to type as much because they're on a cell phone. Let's build autocomplete so that we get better data, so that they provide clarity of whether they want Subway the sandwich or subway, the train." They were like, it benefits people, it benefits us, this is great. And then all of a sudden people start autocompleting, and you realize that the minds of the many are horrible.
As Latanya Sweeney has pointed out, the internet amplifies our racist logics. Google amplifies our racist logics. It's not that Google created them. Google is amplifying them. So then, how much do we expect various corporations in this version of late-stage capitalism to be standing up and trying to make, effectively, cultural reparations? Again, that can be a value that many of us agree on, but do we think that value is compatible with late-stage capitalism? And if not, how do we work our way through that?
What do you make of the discourse around ethics in tech? Do you see it changing anything?
You now have people who are fully, deeply committed to making things more ethical. In many cases you also have buy-in from senior leaders. But what do they actually even mean by this? The way that I've been thinking of it lately is that the senior leaders are definitely focused on legal risk. They don't want any legal risk. They're probably also focused on forms of reputational risk. So they really want a form of compliance that is an articulated set of things, a big checklist that they can check off and know that they've done well. That's what they built before. So then these ethics folks come in. Now, the ethics folks that come in are now running against some of the organizational challenges because legal risk, reputational risk, compliance — that's not actually what we mean by "ethics" in the broader discourse. The first thing we have to think about is, what are the trade-offs? In order to talk about trade-offs, we have to talk about values.
How does a company have values beyond profit for shareholders? That's where things could get dicey. Many of the folks on the outside aren't even talking about trade-offs and values. They want justice. Ethics tends to encompass all of this, all the way from the world of legal risk, all the way to justice. As a result, the people on the outside are not at all satisfied by the ideas we're going to get from compliance. And these folks who are coming in are like, can we at least talk about trade-offs of values? They're not even trying to get to justice. They're just like, can we get one level down? We're going to have such contested challenges around this because we don't know how to articulate values within this form of capitalism.
When I grew up in the tech industry, some people dreamed of going IPO, but many people were really happy to have a self-sustaining company that wasn't that big of a deal. Give me a startup today that has any level of traction that's not VC-backed. In New York, I can't find a restaurant that's not VC-backed!
When you were presented with the EFF Pioneer Award in September, you said that the "great reckoning" is ahead of us. What do you hope that will look like?
My experience as a white woman coming into the tech industry was one of simultaneous privilege and opportunity. I had a computer science degree from an elite university. And I was constantly facing the endemic sexism and misogyny. I watched numerous men who made a lot of money and engaged in really bad behavior. We've been having a conversation about the lack of women in technology basically since the '70s. Before that, [technology] wasn't cool enough to be taken over by men. We haven't even started dealing with why this industry is so devastating for people of color. All of the big companies, they have trainings about diversity, inclusion and equity. They have different ways of trying to go about this. But it's so baked into aspects of the culture, and it's so accepted.
It connects to some of the media manipulation work that I did. Gamergate was a big warning sign. Gamergate was also a siren call to say that we had shifted to a point where gaming was about identity formation for men, and particularly for white men, even with women and nonbinary individuals using these systems at scale. People who developed this [mindset] started seeping into tech and expecting that the tech industry would be about a masculine identity, a male identity. And even people who were trying to do good helped allow this to happen, through frameworks of meritocracy and cultural alignments and other things that helped feed this.
So what I want out of the reckoning: I want people to be actively reflective about their contributions to this, and work systematically to right some of those wrongs. I want the reckoning to not just be on the backs of the people who were most hurt by it. We've had a lot of #MeToo conversations, but that's honestly just the beginning. There is so much more going on. I want to get to a point where we are able to have a healthy tech ecosystem, because how can you expect to be building healthy technologies if the ecosystem is so insidious?
Now I want to turn to your recent report on the census. There's been a lot of controversy ahead of the 2020 census, but from a data perspective, how is this census different from ones that have come before?
If you think about what the census is, it's our core democratic infrastructure. It's the data for democracy. We cannot have a representative government without understanding who our people are.
For starters, there's an internet self-response component to it. That's going to change how people relate to this, just like changing to paper did. We are in a highly partisan and very fraught debate over what our democracy is. There's so much fear and anxiety, and that will get amplified and go in every which direction. So how do we get a good count in that ecosystem, in that moment where everybody's scared? How do we get people confident that this data is secure and trusted and that they actually will make the count?
So, what is "differential privacy" and how is it being applied in this year's census?
The census does not make personally identifiable information public. Full stop. It is only allowed to publish statistics. Historically, what that meant is that they just didn't publish tables of really small geographies. They just couldn't.
As things got more sophisticated, they started to engage in a variety of different ways of introducing noise into the census data. They would swap families from one block to another in order to not make somebody visible.
Well, the thing is, we've gotten to a point where the sophistication of doing computational reconstruction of data means you can reconstruct individual entries. You can now do that for the full data because the computational power is there. And because we have a tremendous amount of commercial data available, you can match those reconstructed entries to commercial data and re-identify people. That's not acceptable.
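To make the reconstruction risk concrete, here is a minimal toy sketch. It assumes a hypothetical block of three residents and three invented published statistics (total of ages, median age, youngest age); none of these figures come from the Census Bureau, and real reconstruction attacks work over far larger tables with constraint solvers rather than brute force. The function name `candidates` is made up for this illustration.

```python
from itertools import combinations_with_replacement
from statistics import median


def candidates(count, total_age, median_age, max_age=110):
    """Return every sorted tuple of ages consistent with the published statistics."""
    matches = []
    # combinations_with_replacement yields sorted tuples, so each household
    # composition is considered exactly once.
    for ages in combinations_with_replacement(range(max_age + 1), count):
        if sum(ages) == total_age and median(ages) == median_age:
            matches.append(ages)
    return matches


# Hypothetical published tables for one small block (invented numbers).
possible = candidates(count=3, total_age=132, median_age=30)
print(len(possible), "age combinations fit the count, total and median")

# One more published statistic (the youngest resident's age) narrows it further.
narrowed = [ages for ages in possible if min(ages) == 8]
print(narrowed)
```

Each additional table removes candidates, and once only one combination remains, the individual ages have been recovered from statistics alone. Those recovered records are what could then be matched against commercial data.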
So, starting in 2006, the Census Bureau started evaluating whether or not it could take a set of new techniques that were coming about in computer science, and could they apply it to the census. They began an evaluation process well over 14 years ago. But as the availability of computing power and the availability of commercial data has caught up, it's become imperative that they have to change how they do disclosure avoidance. So they have been pushing forward a technique, mathematically aligned with differential privacy, that effectively is a form of encryption.
It allows them to introduce a very controllable amount of noise to the data that means that if you reconstruct the individual entrants, you can't match them against the commercial data.
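As a rough illustration of what "a very controllable amount of noise" can mean, the sketch below uses the textbook Laplace mechanism for a single count. This is an assumption chosen for clarity, not the Census Bureau's actual disclosure avoidance system; the names `laplace_noise` and `noisy_count`, the block population of 37, and the epsilon values are all hypothetical. The parameter `epsilon` is the privacy-loss budget: smaller values mean more noise and stronger protection.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    # A count changes by at most 1 when one person is added or removed,
    # so its sensitivity is 1 and the noise scale is 1 / epsilon.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
block_population = 37  # hypothetical true count for one block

for epsilon in (0.1, 1.0, 10.0):
    draws = [round(noisy_count(block_population, epsilon, rng), 1) for _ in range(5)]
    print(f"epsilon={epsilon}: {draws}")
```

The trade-off boyd describes shows up directly in the output: a small `epsilon` gives counts that wander far from 37, which protects individuals but makes small-area statistics less accurate, while a large `epsilon` stays close to the truth and protects less.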
What are the stakes behind differential privacy? What are the risks that come along with the possibility of the data not being totally anonymized?
If differential privacy and the techniques that they have built are applied to the data, and there is every expectation that they will be, you have the ability to protect the data.
So let's talk about the threats. One threat is that somebody on the outside could reconstruct and re-identify that data using commercial data. Differential privacy protects that. Another threat is that somebody could illegitimately access the whole dataset. With security, you're always concerned about any possibility [of a breach]. I will say personally, if I can breach Amazon Web Services, I can think of a lot better data that I want, starting with some bank data. So that's a threat, but the level of risk is not that high. The third and sort of most common threat that people get concerned about is that another part of the government would demand access to the data.
And here's where we have to be clear about whether or not we believe that the laws that are in place can withstand that pressure. That's obviously an open question for a lot of people.
So those are the three threats. Now let's flip it around. We want that data to be accessible, and we want that data to be usable. The usability is for the constitutional first requirement, which is the count of every state.
There's all of these other things that data is used for, like redistricting. Redistricting has a couple of different components to it. The first is just the ability to provide roughly equitable districts. You have to make certain that the districts you draw are legitimate and accepted based on a set of legal conditions, for example, the Voting Rights Act, to make certain that we have districts that are not undermining the votes of minority populations.
So one of the big questions at play right now is: How will this implementation alter all of these local and federal laws and the ways of validating and verifying them? This is a top-order concern for the Census Bureau. So what they're doing is trying to assess and evaluate the process as it unfolds. That is part of what they're doing, and they have to have all of that ready by March 31, 2021. That's a year and a half from now.
Funding allocations are also determined by different formulas that are applied to this data. And here's where it's really funny, because the Census Bureau has long produced the files but not ever evaluated all of the uses. And so one of the difficulties with them having to determine where to place the noise in this process in order to make it abide by the requirements of differential privacy is that they need to have more visibility anticipating uses. That's a lot of what's been going on right now. There's a lot of back and forth. People are frustrated because, you know, you're seeing a massive and radical change.
The big question is, will the data be usable? A lot of data users sort of have assumed that for the purposes of funding allocation, we don't have to calculate for margins of error. So those formulas don't account for margins of error. So the open question is, do they have to in this environment?
The question of how much we can assume about data makes me think of all the talk right now about improving the "explainability" of emerging technologies. Do you think that's a useful approach?
You've hit the nail on the head. This is precisely the challenge, because the Census Bureau is approaching this as, like, "We'll make the code transparent — we'll post the code!" And everybody else out there is like, "No! What's that going to do to the data?" But as we know in the battles over explainability, if you don't have the raw data, you don't necessarily know what happened. But the raw data is exactly what they're trying to protect.
Different communities have different concerns about what matters most in terms of their data. At the end of the day, what does it mean to be both public data — the public's data — but also to hold up privacy as a really imperative commitment?
It's not just imperative in terms of a value or even in terms of the law. It's also that people will not give you the data if it's not private. This is why it's so fraught: How do you hold these commitments together? Explainability, while it allows us a technical way of assessing, it only gets us so far because there is money and politics on the line here.