Twitter recently released one of its algorithms into the world — the one that controls how images are cropped in the Twitter app — and said it would pay people to find all the ways it was broken. Rumman Chowdhury and Jutta Williams, two executives on Twitter's META team, called it an "algorithmic bias bounty challenge," and said they hoped it would set a precedent for "proactive and collective identification of algorithmic harms."
The META team's job is to help Twitter (and the rest of the industry) make sure its artificial intelligence and machine-learning products are as ethically and responsibly used as they can be. What does that mean or look like in practice? Well, Twitter (and the rest of the industry) is still figuring that out. And this work, at Google and elsewhere, has led to huge internal turmoil as companies have begun to reckon more honestly with the ramifications of their own work.
Chowdhury and Williams joined the Source Code podcast to talk about how the META team works, what they hope the bias bounty challenge will accomplish and the challenges of doing qualitative research in a quantitative industry. That, and what "Chitty Chitty Bang Bang" can teach us about AI.
You can hear our full conversation on the latest episode of the Source Code podcast. Below are excerpts from our conversation, edited for length and clarity.
David Pierce: At the risk of starting with a question you could spend the whole hour answering, can you explain, at a very high level, what the problem is you're trying to solve at Twitter? And why it has proven so hard, both for you and for everyone, to figure out how to solve?
Rumman Chowdhury: Oh, that would actually take an hour, or more. Entire dissertations are being written on that topic! So, META stands for machine learning ethics, transparency and accountability. And in a nutshell, that's what we look at. There's an already well-documented history of algorithmic bias, unfairness and unintentional and intentional ethical harms. When I say intentional, there are some adversarial cases where bad things are happening. But most of the time, we are working in the space of unintended consequences.
Why this is such a big undertaking, and a team like META will never go away at any company, is that we are considering deeply ingrained social, ethical biases that have existed for quite some time. Machine learning does not create new biases, it is simply an amplification of problems that already exist. In that sense, it can seem like a very, very daunting task. We are not here to solve all of society's problems, but we will do our best, within the small slice of the universe that we can help and manage, to make sure it doesn't get reflected where we are.
Jutta Williams: Rumman and I come from different requirements and roles. And my role is to really take all this amazing learning, all these new, consequential ways of thinking, and apply them. Make them operational, put them in our products, and make sure that there's action. And so I think that for us here in META, I would say that one of the hardest things to do is to take this very nascent new learning, turn it into action and then make a visible change so that people are having a better experience.
DP: It's this very big, society-sized problem, and everybody is dealing with different versions of the same thing. And I could see a world in which you try to create a new office at the White House to do this kind of work, or you do it academically so that it's easy to share. I'm curious why you picked Twitter, but also why it felt right to do this inside of a company in the first place.
RC: It is an ecosystem of players that helps this kind of work move forward. You know, I think it is a noble goal that we are trying to algorithmically or technologically not re-create the problems and issues of the past. Given that this is such a broad undertaking, it does require an entire slew of people.
I do think that the White House should have an office, or at least a group of people who are working on this problem. I also do think academics should be well-funded to do this research. I also think that civil society should flourish. And you know, these types of organizations should be funded. And I also do think that [the] industry needs to have people!
Anna Kramer: So what are the things at a tech company, specifically, that they're empowered to do — maybe it's specific to Twitter, maybe it's industrywide — that is different from these other actors that you're talking about?
JW: I'd say that it's not either/or. I'm an "and" person. So, I used to chair standards for AI for ISSIP for the U.S. And I write to my representatives. And I am happy to speak with regulators. And I get to work inside of a tech company. So it's not that I'm doing this in opposition to or in lieu of all those other things. I think every citizen who is concerned about their own experiences and how algorithms make decisions about their experiences online should be involved in every way possible that's accessible to them. I happen to have access inside one of the tech companies, in addition to being a citizen who's affected.
Internally, we have the ability to talk to developers, to educate and to grow understanding and to see the data and understand how the data is being used very specifically. And that's a perspective you only get if you work inside of a company. And I think you can effect a lot more change when you have the ability to sit down with the people who are making the decisions and want to know how to do this better, faster and with more care.
RC: And specifically, sitting in a company, you do get access to data, to models, to the individuals building these models. You know, as META, we can only accomplish so much. Our team does not own every single model at Twitter. But we work with all those teams. And often, especially in a company like Twitter, we find that in good faith, people are trying to figure out how to solve these problems.
It's worth noting that the field of machine learning has not advanced enough where the average data scientist, the average ML engineer, really understands how to address these problems. We have only arrived at a place where it is a common enough conversation that people now are open enough to say "We should look at these problems," and "Here are the problems." And the next step is giving people the right kinds of tools and access to information and access to experts who can help them fix these problems.
It's still a contentious issue in the machine learning world. I mean, if you look at NeurIPS, they recently instituted an impact statement. It is just a statement. And even that has led to a firestorm of controversy. That is sometimes disheartening, to see that some people don't even want to take a minute to reflect on the work they are doing. No papers are being turned down because of impact statements! All that is being asked is that people give some consideration.
DP: Let's dig into that, actually. One of the strange things about listening to you explain this at the top was that it doesn't seem like the basic idea of what you're trying to do is terribly controversial. How do we make sure that machines don't make human mistakes? I can't imagine you could present that to anyone, and they would throw that back in your face. And yet, this has been a really controversial thing. What is it about this that comes across so controversial to people?
RC: You know, I don't know. I agree with you. I think we can all agree that really what we're trying to do is help companies be more thoughtful about what they're building, and how they build it, and how it impacts the world. There are plenty of other services and things that companies do, like data protection and security, that actually have a really similar remit.
JW: There's always this tension of first-to-market. And so speed is always something that we compete with. And there's this misperception that if you add thoughtfulness and you add control that you'll be slow. I think that that's wildly incorrect. I think that when people don't have to guess, and when they're not worried, they actually go a lot faster.
I used to say, why are there brakes on a car? So that you can drive fast! When you look at the evolution of braking systems, the fastest cars have the most sophisticated braking systems. And it's so that you can take corners quickly. Especially if you're a rally car driver, and you're driving on an unknown course (which is what's happening with a lot of innovation), then you don't have to worry about driving off the rails and hurting yourself or others. So I don't think that there's tension when it's done well. Bolt-on practices and reactions aren't always done well.
AK: I think this is a good moment to bring up the bug bounty program. One of the ways I'm interpreting what you're doing there is kind of addressing head-on a lot of the people who are skeptical, because you're creating a more public forum for people to talk about this, and to understand what it is that you're doing. What was the thinking behind launching this bug bounty? Where does it fit into this broader goal of changing the bigger tech community conversation around the work that you're doing?
RC: Absolutely. This is like my favorite thing to talk about at the moment. I'm very excited.
So, our algorithmic bias bounty is modeled after your traditional InfoSec vulnerability bounties and bug bounties. What we're doing is opening up a model, we've provided a rubric, and we're very clear about how submissions are going to be graded. Folks have a week to identify all the harms they can, essentially, and share their findings with us. What we are asking people to share is their code, a brief self-grading rubric, as well as a brief description of why they took the approach they did.
We have very intentionally made this program global. We really do want global perspectives. One of the critiques of tech in general, but also even the responsible ML community, is we generally have a particular type of person, we tend to live in a particular place — i.e., the Bay Area — a lot of us work in tech or tech-adjacent. And also it's a very Western-focused field. So to hear people starting to ask questions about caste-based discrimination, for example, or how might an image-cropping algorithm mis-crop somebody who's wearing a head covering, these are usually not questions that come up in a very Western setting.
So our bias bounty is open until the end of the week. We have cash prizes for folks, not just for the people who score the highest, but also for the most innovative approach and the most generalizable approach.
JW: ML is often considered a thing, but it's really 25 different things. And figuring out how to apply a control or how to do something better in every one of those parts of developing and delivering an ML algorithm, it takes specialists. And when bug bounties were first introduced to security, it was enormous. I remember an operating system launch as part of when I worked for the government, and there were over 100,000 bugs that were active in that operating system when it went live. And the company that shipped that product shut down product development for a period of time just to close bugs.
It's so big, and it's so complex, that it's very hard for any one entity to solve all the problems. We have that problem with AI in general. And it's not just a matter of perspectives, it's also the complexity of the systems. So asking for help should be rewarded. It was incredibly beneficial to the security world when we opened it up to not just adversarial thinking, but even cooperative thinking, gave people the method by which to communicate with us effectively, and then we could reward them for that work. I think that our world today is more closely aligned with the security world than people appreciate or realize, and I don't see any challenge with this being just as beneficial to the ML space as security companies were to security engineering work.
AK: How do you create user demand for this? If it's something you're going to sell as an asset, your users need to be wanting it or requesting it and knowing what they're talking about. How much of your work is around that part of the question? And then how do you go about doing that?
JW: It's such a big part of the product management role and responsibility. We're supposed to be the advocate and the ombudsman, if you will, for consumers and for people. And I don't think that users of the platform are the only people who are affected by our products. So when I say people who are impacted, it could be society at large.
So we leverage consumer experience researchers, people who do qualitative investigation. They talk to people, not just people who use our product, but also people affected by our product, and the conversation that's enabled by our product and platforms.
We have a project ongoing right now around algorithmic choice. And there's a lot of rhetoric and conversation in the industry about giving people more choices, about how algorithms make decisions that affect their experiences on platforms like ours. But we don't really know what choice means to every person. And we don't know what algorithmic choice specifically means to people.
DP: I would imagine one of the challenges of doing this kind of work in a tech setting is that it just wants to be so qualitative, but eventually you have to find ways to try to make this stuff as quantitative as possible in order to actually start to build it into products. Is that as hard as I think it would be?
JW: It's extremely hard, especially in something as esoteric as this space. What I learned about privacy is that every person thinks of that word differently. So with algorithms, you're building — I won't say a one-size-fits-all, it's a personalization algorithm — but it's built off of one construct. And so when you see somebody implement a setting or a button that gives somebody choice, but it's a choice that is applied to everybody in exactly the same quantitative way, it's not necessarily the choice that I'm talking about when I say user choice.
Most of the time, these are settings that filter something bad out, right? As opposed to adding something delightful to your experience. And I don't know that more flags and filters are really adding choice or enabling something better for people. So it's figuring out what is the right thing to do. But then turning it into code, turning it into a technical design that implements that on a very personal basis, works for people from different walks of life and still provides a safe experience? It is very complicated.
We've talked about, say, a profanity filter. You can turn on or off profanity, but what does profanity mean to you? And how do you qualify something as profane? What if you speak a different language? What if you don't consider one word to be profane? What if that's a very common word in the way that you speak English? These are all things we have to consider before we start making decisions about how algorithms apply and affect your world.
DP: Is it possible to draw that kind of baseline you're talking about in one way that works for, if not everybody, then almost everybody?
RC: I think we can reach a way of approaching it that could feasibly be generalizable. I do not think there is one rote methodology to follow. And that's the struggle.
The first question you get asked from any engineer is, "What's the checklist?" And to be fair, that's how a lot of engineering folks work. It's like, I have this checklist of things to follow. I do these steps. And it's really hard for folks to internalize that, you know, sometimes you don't pass, because the thing you might fundamentally be building is unethical or wrong or just will be an absolute disaster. That is one thing to internalize. And the second is that no, we actually do require people to be thoughtful. And then if they don't know how to answer it, raise their hand and ask the right people who can help them. That is quite difficult.
I've learned a lot working with folks in risk and compliance. It has reshaped how I understand algorithmic bias and the harms, by thinking about how people who look at risk — especially things that are less tangible, like reputational harm — think of that when doing risk calculations. A lot of that world sits in very legally wrapped language, and qualitative language that does not translate well to machine learning folks who want very clear, standard ways of saying things and doing things.
A lot of folks start with a list of questions that model owners should be asked. What I've found is that model owners want to give very precise answers. When you ask them open-ended questions, they spiral. And it's really difficult to answer! So what we have done in our assessment rubric is to state things as a statement, and ask them to assess the likelihood of this event happening, and the impact if it were to happen.
Rather than saying, "Is there bias in your model?" we make it a statement: "There is harmful bias in this model." That actually gives model owners a better place to start from. That completely reframed how we built our internal assessment tool.
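As a rough illustration of the statement-based framing described above, here is a minimal Python sketch. The 1-to-5 scales, the rule of multiplying likelihood by impact, and every name in the code are assumptions made for illustration; nothing here is taken from Twitter's internal assessment tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskStatement:
    """A risk phrased as a statement rather than an open-ended question."""
    statement: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    def __post_init__(self):
        # Reject ratings outside the assumed 1-to-5 scale.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood times impact, as in a common risk matrix.
        return self.likelihood * self.impact


def rank_by_risk(items):
    """Return statements sorted from highest to lowest assumed risk score."""
    return sorted(items, key=lambda item: item.score, reverse=True)


# Hypothetical ratings a model owner might give.
assessment = [
    RiskStatement("This model is not better than what would exist otherwise.",
                  likelihood=2, impact=4),
    RiskStatement("There is harmful bias in this model.",
                  likelihood=3, impact=5),
]

for item in rank_by_risk(assessment):
    print(f"{item.score:>2}  {item.statement}")
```

The point of the sketch is only that a statement plus two bounded ratings gives a model owner something precise to answer, where an open-ended question does not.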
AK: It also seems to me like part of your job that we're not talking about much is, when is it your job to just say, "No, this shouldn't be automated at all?" Or "No, artificial intelligence isn't useful for everything." Or is it just inevitable that eventually everything has some kind of model informing it, and fighting it is a futile effort?
RC: Our image-cropping assessment was a perfect example of us coming to a conclusion that a model wasn't the best way to do something. What we could have done is looked at our model in a bubble and said, here's where we see biases, and we're gonna go make them nice and then everything will work. But taking a step back, once we introduced the concept of representational harm (which, by the way, is what our bias bounty is specifically focusing on), we realized that the best way to enable pure representation is just to not introduce an algorithmic layer and allow people to share their photos as they are.
I was alluding earlier to our internal risk assessment, and one of the questions that I've added there actually asked the model owner, "Is this model better than what would exist otherwise?" And that is often not a question that is asked of model owners, because as long as someone can show that it's faster, or it's cool, then they're often not questioned. But we're specifically asking, what has the world been without this model that you've built? And is it actually adding a net improvement to someone's experience?
JW: Andrew Ng very famously said that ML or AI is like electricity: It's going to be everywhere and do everything. But we don't use electricity for everything, even though it's pervasively available, right? We still make a soufflé without using an electric mixer. And so my point is simply that unless you need to do something fast, unless you need to do something at scale and unless it's the right tool for the job, just because ML is available doesn't always make it the right tool.
One of my favorite movies is "Chitty Chitty Bang Bang." I don't know if it's necessary to create a big technical machine in order to fry an egg. So sometimes it's just a question of, is this the appropriate use of something that is supposed to be working at 25 horsepower, or do we actually just need to take a stroll?