Policy

Lawmakers want humans to check runaway AI. Research shows they’re not up to the job.

Policymakers want people to oversee — and override — biased AI. But research suggests humans aren't up to the task.


The recent trend toward requiring human oversight of automated decision-making systems runs counter to mounting research about humans' inability to effectively override AI tools.

Photo: Jackal Pan/Getty Images

There was a time, not long ago, when a certain brand of technocrat could argue with a straight face that algorithms are less biased decision-makers than human beings — and not be laughed out of the room. That time has come and gone, as the perils of AI bias have entered mainstream awareness.

Awareness of bias hasn't stopped institutions from deploying algorithms to make life-altering decisions about, say, people's prison sentences or their health care coverage. But the fear of runaway AI has led to a spate of laws and policy guidance requiring or recommending that these systems have some sort of human oversight, so machines aren't making the final call all on their own. The problem is: These laws almost never stop to ask whether human beings are actually up to the job.

"These assumptions about human oversight are playing a really critical role in justifying the use of these tools," said Ben Green, a postdoctoral scholar at the University of Michigan and an assistant professor at the Gerald R. Ford School of Public Policy. "If it doesn't work, then we're failing to get any of the protections that are seen as essential for making the system acceptable to us at all."

In a new paper, Green, who has extensively studied the use of algorithms in parole and sentencing decisions, demonstrates how the recent trend toward requiring human oversight of automated decision-making systems runs counter to mounting research about humans' inability to effectively override AI tools.

"The point is not to say: Let's just allow these algorithms to be used without the human oversight," Green said. "But if we're only comfortable with these algorithms because we have human oversight, we actually shouldn't be comfortable with these algorithms at all, because the human oversight doesn't work."

This interview has been lightly edited and condensed.

What got you thinking about this issue to begin with?

For the last several years, I've been doing experimental technical work, studying how people interact with algorithms when making predictions and decisions. A good chunk of the empirical findings I'm drawing on in the paper come from research I've conducted over the last couple of years.

One of the starting points for me, several years ago, was thinking about this gap between how we evaluate algorithms — often just thinking about if they're accurate, if they're fair — and the actual mechanisms by which algorithms have impact. That is, this process where they're giving advice to a human, and then a human has to actually somehow interpret that information and decide whether and how to use it.

In doing that work, I uncovered a lot of issues with people's ability to identify errors and with biases in how people respond to algorithms, and I noticed a pretty significant disconnect between the empirical findings and the way that a lot of policies talked about this.

[Policies] are essentially just saying, "Hey, well, there's a human in the loop. So it's fine to use these risk assessments when making sentencing decisions." I wanted to really dig into this and see: What do the policies actually call for? And how do they fall short? Does anything actually work?

Before we walk through your findings in this paper, let's talk a little bit about what you have discovered in your more technical research on algorithms' impact.

The first paper really looked at how introducing risk assessments alters the predictions that people make. The primary finding was that people respond to risk assessments in biased ways. People are more likely to follow a recommendation to increase their estimate of risk when evaluating Black defendants, and more likely to decrease the risk estimate suggested by the assessment when evaluating white defendants. So, even if we were to say, "OK, this algorithm might meet certain standards of fairness," the actual impacts of these algorithms might not satisfy those constraints when you think about how humans are going to respond.

The second study was an extension of that, looking at whether people are able to evaluate the quality of algorithmic predictions. We found that they weren't. People can't really do that job, which is central to the idea of people being able to determine which recommendations from an algorithm they should work with or not.

The final piece, which was just published, was shifting from predictions to the decision-making process, and looking at how risk assessments alter the underlying decision-making process that people follow. If they're shown a risk assessment, does that actually make judges more likely to weigh risk more heavily when making decisions? We must balance the desire to reduce risk with other interests around the liberty of defendants, and so on. Are we improving the accuracy of human prediction? Or are we actually making risk a more salient feature of decision-making?

We ran an experiment to test that and found that we're more in the latter camp. We're not simply altering people's predictions of risk. We're altering how people factor risk into their decisions, and essentially prompting them to weigh risk as a more important factor when making decisions.

In the paper about human oversight of algorithms, you walk through three different ways policies are trying to introduce some level of human oversight to the deployment of AI, and you argue each way is flawed. Walk me through those three ways and their flaws.

They're all somewhat overlapping and related. The first approach is to say: If a decision is based on solely automated processing, then we're going to either prohibit it entirely or require certain rights, like the ability to request human review afterward. The most notable example of this would be the European General Data Protection Regulation, which has an article dedicated to solely automated processing.

By drawing this really strict boundary, we're failing to capture a lot of the influences of algorithms that have actually generated the most significant controversy and demonstrated injustice. Most of the decisions that we're most concerned about already aren't made in a solely automated fashion. You could have a human play some relatively superficial role in the decision-making process, such that it's no longer solely automated. And if it doesn't count as solely automated decision-making, then you aren't subject to any of those regulations.

The second approach operates in some ways as a corollary to the first. It's saying: It's OK to use algorithms, as long as there's human discretion, and the human gets to make the final decisions. This is what we see, in particular, for a lot of the risk assessment tools used in the U.S.

But when you actually give people discretion to determine how they should use an algorithm, they don't do what you might want them to do with it. A lot of the research looks at how people override algorithms: How do people diverge from algorithmic predictions? And typically, they do that in suboptimal ways. People are diverging from algorithms in ways that are actually making their predictions less accurate.

If the risk assessment says to detain someone, they'll generally follow that. If it says to release someone, they will override that in favor of detention much more frequently. Police who are supposed to be overseeing facial recognition predictions also do a really bad job of that. So all of the documentation we have about human oversight and human overrides suggests that they either defer to the tool when they shouldn't, or override the tools in typically detrimental ways.

The third category says: People might not understand the algorithm. So we really, really need to be sure that [the oversight] is meaningful. People should be able to understand how the algorithm works in some form that can help them determine when they should follow it or how to interpret it. The emphasis there is on explanations or algorithmic transparency.

The issue here really just builds on the issues of the second group. Yes, you can give people the ability to override the algorithm. But that doesn't necessarily help. Typically, people don't override algorithms in beneficial ways. Unfortunately, even explanations and transparency don't seem to improve things — and can actually make it worse. The explanations can make people trust the algorithm more, even if the algorithm shouldn't be trusted.

What are the alternatives, if human beings are not a sufficient safeguard?

It's not simply, "Oh, we can just turn from human oversight to something else." Human oversight plays a really fundamental role in justifying and legitimizing these tools. So we actually need to, given these failures, start from farther upstream and think about how we're even making decisions about when algorithms should be used at all.

We should be putting much more scrutiny on whether it's actually appropriate to use an algorithm in a given situation. Often, courts and policymakers will justify the use of low-quality algorithms by assuming that human review can account for their flaws, but I think we should be much more critical. And I think in many of these cases, we should be ready to say: This actually just isn't an algorithm that we trust. This isn't a decision where an algorithm is particularly well-suited to enhancing decision-making.

We should put much more of a burden on agencies to justify why it's appropriate to use an algorithm in a given situation. They should have to describe more proactively why this algorithm is going to improve decision-making or why it's appropriate to have an algorithm make this decision. And what is the quality of this algorithm? Is it actually one that we would trust with altering potentially high-stakes decisions? We just need to do much more proactive research on the actual human oversight or human-algorithm collaboration process.

Already, we're seeing policies that are calling for various types of evaluations of algorithms themselves, saying, "Before you deploy the system, you have to run a test to show that the algorithm is accurate, and to show that it's fair." And I think that we should have similar types of tests that are required for the actual decision-making process. So if you're going to incorporate a pre-trial risk assessment into judicial decision-making, there should be some sort of proactive assessment, not just of the pre-trial risk assessment itself, but also of how people or judges use the algorithm to make decisions.

Right now, we'll do evaluations after the fact. Two years down the line, we'll see that judges have been using this algorithm in all sorts of unexpected ways. And that's because we didn't actually properly do the homework.
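For a concrete sense of what a pre-deployment test of the algorithm itself might look like, here is a minimal, hypothetical sketch in Python (column names and the data file are illustrative assumptions, and it does not reproduce any specific tool Green studied). It reports a risk tool's overall accuracy alongside one simple group-fairness check, the false positive rate within each group. Green's further point is that an analogous evaluation would also be needed on the human side: how judges actually use the score, which a model-only test like this cannot capture.

```python
# Hypothetical sketch of a pre-deployment check on a risk tool:
# overall accuracy plus false positive rate by group on held-out outcome data.
# All column names below are illustrative assumptions, not a real tool's schema.
import pandas as pd

def evaluate_risk_tool(df: pd.DataFrame) -> dict:
    """Expects columns: 'flagged_high_risk' (0/1), 'reoffended' (0/1), 'group'."""
    report = {
        # Share of cases where the tool's flag matched the observed outcome.
        "overall_accuracy": (df["flagged_high_risk"] == df["reoffended"]).mean()
    }
    for group, rows in df.groupby("group"):
        did_not_reoffend = rows[rows["reoffended"] == 0]
        # False positive rate: people who did not reoffend but were flagged anyway.
        report[f"false_positive_rate[{group}]"] = did_not_reoffend["flagged_high_risk"].mean()
    return report

# Example (hypothetical file): evaluate_risk_tool(pd.read_csv("holdout_outcomes.csv"))
```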
