Google’s multicloud national AI research plan could cost $500M a year. It wants first crack at the data

All three big cloud providers — Amazon, Google and Microsoft — want in on a huge national project to build an AI research hub, but Google has specific plans. Especially when it comes to processing the data.


Google's vice president and general manager for AI and industry solutions in its cloud unit, Andrew Moore, sits on the government task force guiding the project.

Photo: The National Security Commission on Artificial Intelligence

Google has big ideas for a massive federally funded AI cloud research project, and it thinks they are worth $500 million a year. All it wants is first dibs on vast amounts of raw government data.

The company wants the U.S. government to pony up at least half a billion dollars annually to fund a giant national hub for AI research, according to Google’s response to a request for stakeholders to weigh in on the project. Already in the works, the National AI Research Resource — or NAIRR — is expected to benefit all three of the largest commercial cloud services — including Google Cloud. But Google has devised a detailed plan for how it could be built and how Google should be involved. And its vice president and general manager for AI and industry solutions in its cloud unit, Andrew Moore, sits on the government task force guiding the project.

If the company has its way, the proposed funding would not just create a new contract for Google Cloud, it could benefit other divisions under the Alphabet umbrella including its urban tech unit Sidewalk Labs.

In a bold, previously unreported proposal submitted in October to the federal task force overseeing the project, Google stated, “In order to achieve significant impact, we recommend that the [U.S. government] fund the resource at $500 million/year or more.” The resource is meant to be a repository of data and AI tools that also gives researchers access to the computing power they need to develop machine learning and other AI systems. It’s in the early stages of planning through a process led by the National Science Foundation and the White House Office of Science and Technology Policy.

Because of its focus on large-scale AI, which requires huge datasets and tons of storage and computational capacity, the resource by its nature is likely to involve the top cloud providers. The other two cloud giants, Amazon Web Services and Microsoft Azure, did not propose specific dollar figures, but both companies are eager to get in on the action.

And, although the companies engage in bruising competition to attract enterprise cloud customers, AWS, Google and Microsoft each indicated some willingness to work together to support the initiative. In the end, the project could reap dividends for the AI and cloud industry when it comes to fueling data sources, educating the next generation of much-desired tech talent and spurring increased interest in cloud and AI services from the public sector.

The push for strong commercial ties

All three companies also emphasized the benefits of constructing the research resource on a foundation enabled by commercial cloud providers as opposed to something built by the federal government.

“We believe the NAIRR should be a multi-cloud hosting platform for commercial Cloud resources (as opposed to a new Cloud platform developed by government or academia),” wrote Google, which, as the underdog of the Big Cloud triad, has the most to gain from playing nice with the competition in a multi-cloud setup.

The company highlighted the “security, operational, and energy efficiency” benefits of partnering with the cloud experts rather than building in-house. “Building a new platform from the ground up would require a huge investment of dollars and expertise, and even once built would not have the advantages brought by the scale of existing Cloud providers.”

Meanwhile, Amazon used the task force’s request for information as an opportunity to pitch its AI and cloud products and services. “As a leading cloud service provider, AWS’s compute, storage, AI/ML, and data analytics services can form the backbone of NAIRR’s shared research infrastructure,” noted the company. Sometimes AWS even ventured into sales-deck territory: “The AWS Global Cloud Infrastructure [can enable] the NAIRR to deploy application workloads across the globe in a single click,” and its pre-trained AI services “can provide ready-made intelligence,” the company wrote.

Microsoft didn’t shy away from the sales opportunity either. Plus, it had flow charts: One chart featured an AI technology stack built from Microsoft services including the Azure Open Data repository and Azure Machine Learning.

Manish Parashar, director of NSF’s Office of Advanced Cyberinfrastructure and co-chair of the task force, told Protocol it’s too early to know whether or how private sector cloud providers might be involved. However, he said there is general consensus that the data and computing service infrastructure underpinning the project will combine existing and new resources.

“This approach would take advantage of campus-level, region-level and national-level resources, creating a federated platform that connects users to a diverse set of resources and facilitates their use through educational tools and user support,” he said.

Following meetings next week and into early next year, the task force will issue reports to Congress in May and November 2022.

Over time, a hybrid approach to standing up the research resource would be the fastest and most cost-efficient, according to the Stanford Institute for Human-Centered Artificial Intelligence, which has pushed hard for a cloud-based national AI research hub. In its “Blueprint for the National Research Cloud” published in October, the institute recommended a dual investment strategy: partnering with commercial cloud providers for computing power at first, then piloting a publicly owned infrastructure built by commercial vendors but operated by the government, the model for national labs such as the massive Oak Ridge National Laboratory.

The Stanford team estimated that building a standalone public infrastructure for the research hub would be less expensive in the long run than working under a vendor contracting arrangement. According to their math, if the government were to negotiate a 10% discount with AWS to use its computing services and comparable hardware under constant usage over a five-year period, it could cost 7.5 times as much as the estimated cost of running the Summit supercomputer at Oak Ridge, the world’s second-most powerful supercomputer.

“Even in a scenario where [national research cloud] usage fluctuates dramatically, commercial cloud computing could cost 2.8 times Summit’s estimated cost,” they wrote.

All three big cloud companies mentioned existing public-private partnerships they’re involved with to enable cloud services for academic and government research. For instance, they all partner with CloudBank, an NSF-affiliated service led by several universities to provide cloud access to computer science students, and with a cloud environment for medical research overseen by the National Institutes of Health.

Microsoft even mentioned its partnerships in support of research outside the U.S. including through its government-funded, public-private “AI innovation hub” in Shanghai, China.

Mentioning a partnership with the Chinese government is notable. The task force was established by Congress in 2020 on recommendation from the National Security Commission on AI, which has pushed for billions of dollars in non-defense funding to bolster AI research in the hopes that the U.S. keeps pace with global AI development, particularly with China. The commission has referred to China as a rival in a “race” not just to win AI tech development, but to ensure AI incorporates Western “values.”

Why Google wants to manage the data

From the looks of its own submission to the task force, Google has moved well beyond the sales pitch to the project planning phase. The company’s ideas for the research resource are descriptive even beyond proposing a dollar figure.

Google wants data from the private sector as well as from state and local government sources to be fed into the system, including “some types of sensitive government data” like health, census and financial services data. And it wants researchers who don’t need computing power from the research cloud to be able to access that data. For one thing, that would ensure that researchers from commercial cloud providers like Google, AWS and Microsoft can get at the data. “Rates should be lower and subsidized by the [U.S. government] for academic and government users,” Google wrote.

Stanford’s AI Institute researchers emphasized the need to ensure that the government-funded research hub remain a resource for academic and non-profit researchers, not the private sector. Jen King, privacy and data policy fellow at the institute who helped write the paper, pointed to “the growing brain drain of AI academics into industry,” where it’s easier to access data and computing power. “My colleagues and I explored the question of whether it would make sense to open this resource to private actors, and we concluded that at least initially, doing so would pose legal and logistical issues, as well as distract from the core mission of supporting research in AI.”

Google wants to be as close as possible to the firehose of data that would flow into the research hub. “We recommend that the NAIRR co-locate an instance of Data Commons in all NAIRR clouds, which we would provide as an in-kind contribution.” Essentially what Google is proposing here is that it would manage all the data clean-up work to ensure data quality and standard formatting of countless disparate government data feeds, and it would do it for free. The process is necessary to prepare and unify data to use to train AI models. Once it’s cleaned and standardized by Google, it would sit in a common area accessible through any cloud platform connected to the research resource.

“So, for example, if a researcher wants the population, violent crime rate and unemployment rate of a county, the researcher does not have to go to three different datasets (Census, FBI and BLS), but can instead, get it from a single database, using one schema, one API,” wrote Google. “Co-locating updated versions of Data Commons with the NAIRR would therefore enable more effective use of the resource.”
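To make the “one schema, one API” idea concrete, here is a minimal sketch of what unifying the three sources in Google's example might look like. It is not the Data Commons API: the record layout, the `CountyStats` type, the `get_county_stats` function and the county code are all hypothetical, and every value is a placeholder rather than real data.

```python
from dataclasses import dataclass

# Hypothetical per-agency records keyed by a county code.
# All figures are placeholders, not real statistics.
CENSUS = {"00001": {"population": 100_000}}
FBI = {"00001": {"violent_crimes": 250}}
BLS = {"00001": {"unemployment_rate": 4.2}}


@dataclass(frozen=True)
class CountyStats:
    """One unified record per county (an assumed schema)."""
    county_code: str
    population: int
    violent_crime_rate: float  # incidents per 100,000 residents
    unemployment_rate: float   # percent


def get_county_stats(county_code: str) -> CountyStats:
    """Single lookup that hides the three underlying datasets."""
    population = CENSUS[county_code]["population"]
    crimes = FBI[county_code]["violent_crimes"]
    return CountyStats(
        county_code=county_code,
        population=population,
        violent_crime_rate=crimes * 100_000 / population,
        unemployment_rate=BLS[county_code]["unemployment_rate"],
    )


stats = get_county_stats("00001")
print(stats)
```

The sketch also shows where the decision-making power sits: whoever writes the unifying layer chooses the field names, the units and how a rate is derived from raw counts.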

But despite proposing to do the work at no cost, Google makes a point to highlight how valuable a service it is. “Cleaning a large dataset is no small feat; before making Google datasets publicly available for the open-source community, we spend hundreds of hours standardizing data and validating quality.”

Google’s proposal to do the job pro bono, said Eric Woods, research director at smart city technology research firm Guidehouse Insights, “raises the question of tech companies bearing gifts — what’s in it for them?”

Ultimately the project may not be just about the cloud business for Google. As a leader in extracting value from the world’s information, said Woods, Google could squeeze a lot of value from processing raw government data. “There is value in that data before it’s filtered that can be extracted,” he said. Particularly when it comes to sensitive data that the company may not have access to currently, it could provide new insights and help Google improve algorithms for various aspects of its business — for starters, its search and maps products.

The resource could become a huge dumping ground for regularly updated, raw data feeds from federal agencies, states, municipalities or even private entities across the country. As the official cleaning crew, Google could access information it has not been able to see before, in a form others could not see once it’s cleaned and formulated for access through a data commons. Perhaps more importantly, it could give Google the power to decide how that information is organized, labeled and formatted.

Matt Tarascio, senior vice president of artificial intelligence at consulting and research firm Booz Allen, agreed that having first-hand knowledge of data flows and what information looks like before clean-up would enhance Google’s algorithmic prowess. “There’s significant value in understanding the data streams and where the data comes from,” he said.

Having that sort of data access and decision-making power could be particularly beneficial for Google’s sibling, Alphabet-owned Sidewalk Labs, a company that uses municipal and other public and commercial data to build algorithmic tech for city governments, energy utilities, real estate developers and healthcare providers. “They would enhance their ability to understand and cleanse messy, public datasets,” said Woods. Sidewalk Labs itself proposed use of a data commons in conjunction with its failed “city of the future” experiment in Canada, Sidewalk Toronto.

If Google were to be chosen to process the data for the research hub, there are bound to be concerns about a commercial entity managing it, said Woods. “That’s exactly the debate that was going on around Sidewalk Toronto,” he said. When Google proposed using a data commons there, he said, “Others were saying, hang on, who’s ultimately got control over this?”

