Protocol | Policy

Lawmakers want humans to check runaway AI. Research shows they’re not up to the job.

Policymakers want people to oversee — and override — biased AI. But research suggests there's little evidence that humans are up to the task.


The recent trend toward requiring human oversight of automated decision-making systems runs counter to mounting research about humans' inability to effectively override AI tools.

Photo: Jackal Pan/Getty Images

There was a time, not long ago, when a certain brand of technocrat could argue with a straight face that algorithms are less biased decision-makers than human beings — and not be laughed out of the room. That time has come and gone, as the perils of AI bias have entered mainstream awareness.

Awareness of bias hasn't stopped institutions from deploying algorithms to make life-altering decisions about, say, people's prison sentences or their health care coverage. But the fear of runaway AI has led to a spate of laws and policy guidance requiring or recommending that these systems have some sort of human oversight, so machines aren't making the final call all on their own. The problem is: These laws almost never stop to ask whether human beings are actually up to the job.

"These assumptions about human oversight are playing a really critical role in justifying the use of these tools," said Ben Green, a postdoctoral scholar at the University of Michigan and an assistant professor at the Gerald R. Ford School of Public Policy. "If it doesn't work, then we're failing to get any of the protections that are seen as essential for making the system acceptable to us at all."

In a new paper, Green, who has extensively studied the use of algorithms in parole and sentencing decisions, demonstrates how the recent trend toward requiring human oversight of automated decision-making systems runs counter to mounting research about humans' inability to effectively override AI tools.

"The point is not to say: Let's just allow these algorithms to be used without the human oversight," Green said. "But if we're only comfortable with these algorithms because we have human oversight, we actually shouldn't be comfortable with these algorithms at all, because the human oversight doesn't work."

This interview has been lightly edited and condensed.

What got you thinking about this issue to begin with?

For the last several years, I've been doing experimental technical work, studying how people interact with algorithms when making predictions and decisions. A good chunk of the empirical findings I'm drawing on in the paper come from that research, which I've conducted over the last couple of years.

One of the starting points for me, several years ago, was thinking about this gap between how we evaluate algorithms — often just thinking about if they're accurate, if they're fair — and the actual mechanisms by which algorithms have impact. That is, this process where they're giving advice to a human, and then a human has to actually somehow interpret that information and decide whether and how to use it.

In doing that work, I uncovered a lot of issues in people's ability to identify errors and biases, and in how people respond to algorithms, and noticed a pretty significant disconnect between the empirical findings and the way a lot of policies talked about this.

[Policies] are essentially just saying, "Hey, well, there's a human in the loop. So it's fine to use these risk assessments when making sentencing decisions." I wanted to really dig into this and see: What do the policies actually call for? And how do they fall short? Does anything actually work?

Before we walk through your findings in this paper, let's talk a little bit about what you have discovered in your more technical research on algorithms' impact.

The first paper really looked at how introducing risk assessments alters the predictions that people make. The primary finding was that people respond to risk assessments in biased ways. People are more likely to follow a recommendation to increase their estimate of risk when evaluating Black defendants, and more likely to follow a recommendation to decrease their estimate of risk when evaluating white defendants. So, even if we were to say, "OK, this algorithm might meet certain standards of fairness," the actual impacts of these algorithms might not satisfy those constraints when you think about how humans are going to respond.

The second study was an extension of that, looking at whether people are able to evaluate the quality of algorithmic predictions. We found that they weren't. People can't really do that job, which is central to the idea of people being able to determine which recommendations from an algorithm they should work with or not.

The final piece, which was just published, was shifting from predictions to the decision-making process, and looking at how risk assessments alter the underlying decision-making process that people follow. If they're shown a risk assessment, does that actually make judges more likely to weigh risk more heavily when making decisions? We must balance the desire to reduce risk with other interests around the liberty of defendants, and so on. Are we improving the accuracy of human prediction? Or are we actually making risk a more salient feature of decision-making?

We ran an experiment to test that and found that we're more in the latter camp. We're not simply altering people's predictions of risk. We're altering how people factor risk into their decisions, and essentially prompting them to weigh risk as a more important factor when making decisions.

In the paper about human oversight of algorithms, you walk through three different ways policies are trying to introduce some level of human oversight to the deployment of AI, and you argue each way is flawed. Walk me through those three ways and their flaws.

They're all somewhat overlapping and related. The first approach is to say: If a decision is based on solely automated processing, then we're going to either prohibit it entirely or require certain rights, like the ability to request human review afterward. The most notable example of this would be the European General Data Protection Regulation, which has an article dedicated to solely automated processing.

By drawing this really strict boundary, we're failing to capture a lot of the influences of algorithms that have actually generated the most significant controversy and demonstrated injustice. Most of the decisions we're most concerned about already aren't made in a solely automated fashion. You could have a human play some relatively superficial role in the decision-making process, such that it's no longer solely automated. And if it doesn't count as solely automated decision-making, then it isn't subject to any of those regulations.

The second approach operates in some ways as a corollary to the first. It's saying: It's OK to use algorithms, as long as there's human discretion, and the human gets to make the final decisions. This is what we see, in particular, for a lot of the risk assessment tools used in the U.S.

But when you actually give people discretion to determine how they should use an algorithm, they don't do what you might want them to do with it. A lot of the research looks at how people override algorithms: How do people diverge from algorithmic predictions? And typically, they do that in suboptimal ways. People are diverging from algorithms in ways that actually make their predictions less accurate.

If the risk assessment says to detain someone, they'll generally follow that. If it says to release someone, they will override that in favor of detention much more frequently. Police who are supposed to be overseeing facial recognition predictions also do a really bad job of that. So all of the documentation we have about human oversight and human overrides suggests that they either defer to the tool when they shouldn't, or override the tools in typically detrimental ways.

The third category says: People might not understand the algorithm. So we really, really need to be sure that [the oversight] is meaningful. People should be able to understand how the algorithm works in some form that can help them determine when they should follow it or how to interpret it. The emphasis there is on explanations or algorithmic transparency.

The issue here really just builds on the issues of the second group. Yes, you can give people the ability to override the algorithm. But that doesn't necessarily help. Typically, people don't override algorithms in beneficial ways. Unfortunately, even explanations and transparency don't seem to improve things — and can actually make it worse. The explanations can make people trust the algorithm more, even if the algorithm shouldn't be trusted.

What are the alternatives, if human beings are not a sufficient safeguard?

It's not simply, "Oh, we can just turn from human oversight to something else." Human oversight plays a really fundamental role in justifying and legitimizing these tools. So we actually need to, given these failures, start from farther upstream and think about how we're even making decisions about when algorithms should be used at all.

We should be putting much more scrutiny on whether it's actually appropriate to use an algorithm in a given situation. Often, courts and policymakers will justify the use of low-quality algorithms by assuming that human review can account for their flaws, but I think we should be much more critical. And I think in many of these cases, we should be ready to say: This actually just isn't an algorithm that we trust. This isn't a decision where an algorithm is particularly well-suited to enhancing decision-making.

We should put much more of a burden on agencies to justify why it's appropriate to use an algorithm in a given situation. They should have to describe more proactively why this algorithm is going to improve decision-making or why it's appropriate to have an algorithm make this decision. And what is the quality of this algorithm? Is it actually one that we would trust with altering potentially high-stakes decisions? We just need to do much more proactive research into the actual human oversight or human-algorithm collaboration process.

Already, we're seeing policies that are calling for various types of evaluations of algorithms themselves, saying, "before you deploy the system, you have to run a test to show that the algorithm is accurate, and to show that it's fair." And I think that we should have similar types of tests that are required for the actual decision-making process. So if you're going to incorporate a pre-trial risk assessment into judicial decision making, there should be some sort of proactive assessment, not just of the pre-trial risk assessment itself, but also of how people or judges use the algorithm to make decisions.

Right now, we'll do evaluations after the fact. Two years down the line, we'll see that judges have been using this algorithm in all sorts of unexpected ways. And that's because we didn't actually properly do the homework.
