AI needs massive data sets to work. Meta is testing a way to do more with less.

Despite the constant deluge of content flowing into Facebook and Instagram, Meta has struggled to get enough data to train AI to spot harmful content, so it’s banking on an emerging approach.


After a terrorist attack on a mosque in Christchurch, New Zealand, was livestreamed on Facebook in 2019, Facebook’s parent company, now called Meta, outfitted London police officers with body cams while they conducted terrorism training. At the time, Meta said there wasn’t enough video data to train its artificial intelligence systems to detect and remove violent content, so it hoped the body cam project would produce more of that scarce AI training data.

A year prior to that horrific incident, the company acknowledged that it failed to keep up with inflammatory posts from extremist groups in Myanmar. Again, it said the problem was a lack of data — there wasn’t enough content in Burmese to train algorithmic moderation systems to spot more of it.

They weren’t wrong: Despite the constant deluge of content flowing into Facebook and Instagram, traditional AI approaches used by Meta and other companies need enough examples of the bad stuff to recognize it when it shows up again. A dearth of training data can plague AI systems that need large amounts of information labeled by humans in order to learn.

Enter few-shot learning, a concept that researchers across the globe have experimented with in recent years. Few-shot learning models can be trained from generic data supplemented with just a “few” pieces of labeled content.
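Conceptually, one common few-shot setup compares a new post against a handful of labeled examples in an embedding space and picks the closest class. The sketch below is a minimal, hypothetical illustration of that idea: the bag-of-words `embed` function stands in for the large pretrained encoder a production system would use, and the example texts are invented.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would use a
    # pretrained language-model encoder here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(examples):
    # Average (here: sum) of the embeddings of a class's few examples.
    total = Counter()
    for e in examples:
        total.update(embed(e))
    return total

# A "few" labeled examples per class -- the few-shot support set.
support = {
    "violating": ["vaccines change your dna", "the vaccine alters dna"],
    "benign": ["i got my vaccine appointment today", "vaccine clinic opens monday"],
}
centroids = {label: centroid(texts) for label, texts in support.items()}

def classify(text):
    # Assign the label whose centroid is closest to the new post.
    return max(centroids, key=lambda lbl: cosine(embed(text), centroids[lbl]))

print(classify("this shot will change my dna"))  # → violating
```

The point of the technique is in the support set: swapping in a few examples of a newly written policy retargets the classifier without months of data collection.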

Now, Meta plans to announce Wednesday that few-shot learning shows promise in its constant battle to weed out disinformation or other content that violates its policies on Facebook and Instagram, particularly when there isn’t enough AI training data, such as in the case of emerging subject areas or breaking news events.

Following early tests on Facebook and Instagram, the company told Protocol that the technique has helped reduce the prevalence of content such as hate speech. So far, it has only used the approach to tackle a few content areas such as “misleading or sensationalized information that likely discourages COVID-19 vaccinations, and hostile speech like bullying and harassment and violence and incitement,” said a Meta AI spokesperson. For instance, the company tested few-shot learning to identify content that promoted the debunked notion that COVID-19 vaccines change people’s DNA.

Meta said the few-shot process shortens the time it takes to train an AI system from several months to a few weeks. “Since it scales quickly, the time from policy framing to enforcement would shorten by orders of magnitude,” wrote Meta in a blog post published Wednesday. In addition to text and image content, the company said the technique also works for video by drawing on audio transcripts, on-screen text and video embeddings.
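The video case described above amounts to scoring a single fused representation built from several signals. As a loose, hypothetical sketch (the vectors and fusion-by-concatenation are invented for illustration; Meta has not published its architecture):

```python
# Hypothetical sketch: a video post is represented by concatenating
# feature vectors from its audio transcript, on-screen text (OCR) and
# the video frames themselves, then scored by one classifier.
def fuse(transcript_vec, ocr_vec, frame_vec):
    # Plain concatenation; a production system would learn the fusion.
    return transcript_vec + ocr_vec + frame_vec

def score(fused_vec, weights):
    # Toy linear scorer over the fused representation.
    return sum(f * w for f, w in zip(fused_vec, weights))

fused = fuse([0.1, 0.9], [0.3, 0.2], [0.7, 0.4])
print(len(fused))  # 6 features from three modalities
```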

The company aims to burnish its image amid relentless scrutiny from lawmakers and everyday people over its handling of abusive and false content on Facebook and Instagram. Later on Wednesday, Adam Mosseri, the head of Instagram, will answer questions from members of the Senate Commerce Committee’s consumer protection subcommittee about how the platform’s algorithmic systems fuel content that has negative effects on kids.

Google, Baidu and others research few-shot approaches

Historically, artificial intelligence and machine learning algorithms have needed vast amounts of data to train them. Feed an algorithm lots of images of bananas or AK-47s labeled as such, and it will learn to recognize them — or at least that’s the goal.

Researchers from OpenAI, Google, Baidu and academic institutions across the globe have studied few-shot learning in recent years to circumvent the need for massive datasets, and not just for removing harmful social media content. Researchers have suggested few-shot learning can be used to help discover molecular properties for drug development when data is restricted by privacy rules, or to uncover tweets related to natural disasters in the hopes of disseminating important safety information.

“Because large, labeled datasets are often unavailable for tasks of interest, solving this problem would enable, for example, quick customization of models to individual user’s needs, democratizing the use of machine learning,” wrote Google AI researchers in 2020 in a company blog post about few-shot learning.

Meta has been working on this AI problem for some time. Four years ago, for example, it revealed some details about how its AI tried to detect harmful content associated with terrorism.

“When someone tries to upload a terrorist photo or video, our systems look for whether the image matches a known terrorism photo or video,” said the company at the time. To automatically remove text-based content, the company said, “we’re currently experimenting with analyzing text that we’ve already removed for praising or supporting terrorist organizations such as ISIS and Al Qaeda so we can develop text-based signals that such content may be terrorist propaganda. That analysis goes into an algorithm that is in the early stages of learning how to detect similar posts. The machine learning algorithms work on a feedback loop and get better over time.”
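The photo- and video-matching step the company describes is essentially a lookup of incoming media against a bank of fingerprints of known violating items. A minimal sketch, with the caveat that real systems use perceptual hashes (Meta has open-sourced one, PDQ) that survive re-encoding and cropping, whereas the exact cryptographic hash below only catches byte-identical copies:

```python
import hashlib

def fingerprint(media_bytes):
    # Exact-match stand-in for a perceptual hash.
    return hashlib.sha256(media_bytes).hexdigest()

# Bank of fingerprints from media already removed as violating
# (placeholder bytes for illustration).
known_violating = {fingerprint(b"placeholder: known violating video")}

def matches_known_content(upload):
    # Re-uploads of already-flagged media match instantly.
    return fingerprint(upload) in known_violating

print(matches_known_content(b"placeholder: known violating video"))  # True
print(matches_known_content(b"an unrelated upload"))                 # False
```

Matching only handles re-uploads of known material; novel content falls through to learned classifiers, which is the gap few-shot learning is meant to narrow.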

A track record of language failures

That was then. In November, Meta pointed to a series of technical milestones that led researchers to what it called “breakthrough” exploration of applying few-shot learning to content moderation. In a blog post last month, the company showed a timeline of advancements including a process called XLM-R that trains a model in one language and then applies it to content in other languages without the need for additional training data.

The company seems confident that emerging AI techniques like few-shot learning and XLM-R will help improve how it patrols content in languages where it has faltered before, such as “low-resource languages” like Burmese.

Yet the recently leaked Facebook Papers revealed Meta’s struggles to remove harmful content in places where it hasn’t hired enough human moderators or built well-trained moderation algorithms. Meta itself has admitted publicly that its automated moderation technologies have not worked well to weed out unwanted Burmese content in Myanmar, for example. But the exposed documents also showed the company did not develop algorithms to detect hate speech in Hindi and Bengali, both among the top-ten most-spoken languages in the world.

When asked by Protocol why it believes few-shot learning works in so many languages despite past failures, the Meta spokesperson said the system was trained on more than 100 languages and incorporates techniques like XLM-R. “The nuance and semantics of language is one of the reasons why we built this technology — to be able to more quickly address content in multiple languages,” said the spokesperson. “As these underlying language and text encoders improve, Meta AI FSL will also bring the improvements to additional languages too.”

Still, a lot of testing will be required to know if these emerging approaches can work at scale.

“We are early in the use of this technology,” said the Meta spokesperson. “As we continue to mature the tech and test it across various enforcement mechanisms and problems the goal is to further increase its use and continued accuracy.”

