Policy

Lawmakers want humans to check runaway AI. Research shows they’re not up to the job.

Policymakers want people to oversee — and override — biased AI. But research suggests there's no evidence to prove humans are up to the task.

There was a time, not long ago, when a certain brand of technocrat could argue with a straight face that algorithms are less biased decision-makers than human beings — and not be laughed out of the room. That time has come and gone, as the perils of AI bias have entered mainstream awareness.

Awareness of bias hasn't stopped institutions from deploying algorithms to make life-altering decisions about, say, people's prison sentences or their health care coverage. But the fear of runaway AI has led to a spate of laws and policy guidance requiring or recommending that these systems have some sort of human oversight, so machines aren't making the final call all on their own. The problem is: These laws almost never stop to ask whether human beings are actually up to the job.

"These assumptions about human oversight are playing a really critical role in justifying the use of these tools," said Ben Green, a postdoctoral scholar at the University of Michigan and an assistant professor at the Gerald R. Ford School of Public Policy. "If it doesn't work, then we're failing to get any of the protections that are seen as essential for making the system acceptable to us at all."

In a new paper, Green, who has extensively studied the use of algorithms in parole and sentencing decisions, demonstrates how the recent trend toward requiring human oversight of automated decision-making systems runs counter to mounting research about humans' inability to effectively override AI tools.

"The point is not to say: Let's just allow these algorithms to be used without the human oversight," Green said. "But if we're only comfortable with these algorithms because we have human oversight, we actually shouldn't be comfortable with these algorithms at all, because the human oversight doesn't work."

This interview has been lightly edited and condensed.

What got you thinking about this issue to begin with?

For the last several years, I've been doing experimental technical work, studying how people interact with algorithms when making predictions and decisions. A good chunk of the empirical findings that I'm drawing on in the paper come from this research that I've conducted over the last couple of years.

One of the starting points for me, several years ago, was thinking about this gap between how we evaluate algorithms — often just thinking about if they're accurate, if they're fair — and the actual mechanisms by which algorithms have impact. That is, this process where they're giving advice to a human, and then a human has to actually somehow interpret that information and decide whether and how to use it.

In doing that work, I uncovered a lot of issues in people's ability to identify errors and biases, and in how people respond to algorithms, and noticed a pretty significant disconnect between the empirical findings and the way that a lot of policies talked about this.

[Policies] are essentially just saying, "Hey, well, there's a human in the loop. So it's fine to use these risk assessments when making sentencing decisions." I wanted to really dig into this and see: What do the policies actually call for? And how do they fall short? Does anything actually work?

Before we walk through your findings in this paper, let's talk a little bit about what you have discovered in your more technical research on algorithms' impact.

The first paper really looked at how introducing risk assessments alters the predictions that people make. The primary finding was that people respond to risk assessments in biased ways. People are more likely to follow a recommendation to increase their estimate of risk when evaluating Black defendants and more likely to decrease their estimate of risk suggested by the risk assessment when evaluating white defendants. So, even if we were to say, "OK, this algorithm might meet certain standards of fairness," the actual impacts of these algorithms might not satisfy those constraints when you think about how humans are going to respond.

The second study was an extension of that, looking at whether people are able to evaluate the quality of algorithmic predictions. We found that they weren't. People can't really do that job, which is central to the idea of people being able to determine which recommendations from an algorithm they should work with or not.

The final piece, which was just published, was shifting from predictions to the decision-making process, and looking at how risk assessments alter the underlying decision-making process that people follow. If they're shown a risk assessment, does that actually make judges more likely to weigh risk more heavily when making decisions? We must balance the desire to reduce risk with other interests around the liberty of defendants, and so on. Are we improving the accuracy of human prediction? Or are we actually making risk a more salient feature of decision-making?

We ran an experiment to test that and found that we're more in the latter camp. We're not simply altering people's predictions of risk. We're altering how people factor risk into their decisions, and essentially prompting them to weigh risk as a more important factor when making decisions.

In the paper about human oversight of algorithms, you walk through three different ways policies are trying to introduce some level of human oversight to the deployment of AI, and you argue each way is flawed. Walk me through those three ways and their flaws.

They're all somewhat overlapping and related. The first approach is to say: If a decision is based on solely automated processing, then we're going to either prohibit it entirely or require certain rights, like the ability to request human review afterward. The most notable example of this would be the European General Data Protection Regulation, which has an article dedicated to solely automated processing.

By drawing this really strict boundary, we're failing to capture a lot of the influences of algorithms that have actually generated the most significant controversy and demonstrated injustice. Most of the decisions that we're most concerned about are not made in a solely automated fashion already. You could have a human play some relatively superficial role in the decision-making process, such that it's no longer solely automated. And if it doesn't count as solely automated decision-making, then you aren't subject to any of those regulations.

The second approach operates in some ways as a corollary to the first. It's saying: It's OK to use algorithms, as long as there's human discretion, and the human gets to make the final decisions. This is what we see, in particular, for a lot of the risk assessment tools used in the U.S.

But when you actually give people discretion to determine how they should use an algorithm, they don't do what you might want them to do with it. A lot of the research looks at how people override algorithms: How do people diverge from algorithmic predictions? And typically, they do that in suboptimal ways. People are diverging from algorithms in ways that are actually making their predictions less accurate.

If the risk assessment says to detain someone, they'll generally follow that. If it says to release someone, they will override that in favor of detention much more frequently. Police who are supposed to be overseeing facial recognition predictions also do a really bad job of that. So all of the documentation we have about human oversight and human overrides suggests that they either defer to the tool when they shouldn't, or override the tools in typically detrimental ways.

The third category says: People might not understand the algorithm. So we really, really need to be sure that [the oversight] is meaningful. People should be able to understand how the algorithm works in some form that can help them determine when they should follow it or how to interpret it. The emphasis there is on explanations or algorithmic transparency.

The issue here really just builds on the issues of the second group. Yes, you can give people the ability to override the algorithm. But that doesn't necessarily help. Typically, people don't override algorithms in beneficial ways. Unfortunately, even explanations and transparency don't seem to improve things — and can actually make it worse. The explanations can make people trust the algorithm more, even if the algorithm shouldn't be trusted.

What are the alternatives, if human beings are not a sufficient safeguard?

It's not simply, "Oh, we can just turn from human oversight to something else." Human oversight plays a really fundamental role in justifying and legitimizing these tools. So we actually need to, given these failures, start from farther upstream and think about how we're even making decisions about when algorithms should be used at all.

We should be putting much more scrutiny on whether it's actually appropriate to use an algorithm in a given situation. Often, courts and policymakers will justify the use of low-quality algorithms by assuming that human review can account for their flaws, but I think we should be much more critical. And I think in many of these cases, we should be ready to say: This actually just isn't an algorithm that we trust. This isn't a decision where an algorithm is particularly well-suited to enhancing decision-making.

We should put much more of a burden on agencies to justify why it's appropriate to use an algorithm in a given situation. They should have to describe more proactively why this algorithm is going to improve decision-making or why it's appropriate to have an algorithm make this decision. And what is the quality of this algorithm? Is it actually one that we would trust with altering potentially high-stakes decisions? We just need to do much more proactive research of the actual human oversight or human-algorithm collaboration process.

Already, we're seeing policies that are calling for various types of evaluations of algorithms themselves, saying, "before you deploy the system, you have to run a test to show that the algorithm is accurate, and to show that it's fair." And I think that we should have similar types of tests that are required for the actual decision-making process. So if you're going to incorporate a pre-trial risk assessment into judicial decision making, there should be some sort of proactive assessment, not just of the pre-trial risk assessment itself, but also of how people or judges use the algorithm to make decisions.

Right now, we'll do evaluations after the fact. Two years down the line, we'll see that judges have been using this algorithm in all sorts of unexpected ways. And that's because we didn't actually properly do the homework.
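To make the idea of testing the decision-making process concrete, here is a minimal sketch of what such a pre-deployment check might compute. It is an illustration only, not Green's method: the `Case` record, the `evaluate` function and the eight pilot cases are hypothetical and made up for the example, and treating every case as having a single "correct" call is a strong simplifying assumption. The sketch compares the accuracy of the algorithm alone with the accuracy of the human decisions made after seeing it, and reports how often people override recommendations in each direction.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One pilot decision. All fields are hypothetical, for illustration only."""
    algo_detain: bool     # the algorithm's recommendation (True = detain)
    human_detain: bool    # the human's decision after seeing the recommendation
    detain_correct: bool  # assumed ground truth: was detention the "correct" call?


def evaluate(cases):
    """Compare the algorithm alone with the human-plus-algorithm process."""
    n = len(cases)
    algo_acc = sum(c.algo_detain == c.detain_correct for c in cases) / n
    human_acc = sum(c.human_detain == c.detain_correct for c in cases) / n

    # Split cases by what the algorithm recommended.
    release_recs = [c for c in cases if not c.algo_detain]
    detain_recs = [c for c in cases if c.algo_detain]

    # Share of "release" recommendations overridden in favor of detention.
    release_overridden = (
        sum(c.human_detain for c in release_recs) / len(release_recs)
        if release_recs else 0.0
    )
    # Share of "detain" recommendations overridden in favor of release.
    detain_overridden = (
        sum(not c.human_detain for c in detain_recs) / len(detain_recs)
        if detain_recs else 0.0
    )
    return {
        "algorithm_accuracy": algo_acc,
        "human_with_algorithm_accuracy": human_acc,
        "release_recs_overridden": release_overridden,
        "detain_recs_overridden": detain_overridden,
    }


# Made-up pilot data, not results from any study.
cases = [
    Case(True, True, True),
    Case(True, True, False),
    Case(True, True, True),
    Case(True, False, False),
    Case(False, True, False),
    Case(False, True, False),
    Case(False, False, False),
    Case(False, False, True),
]

report = evaluate(cases)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In a check like this, a human-with-algorithm accuracy below the algorithm's own, or a lopsided pattern of overrides, would be the kind of warning sign to catch before deployment rather than two years after it.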
