Google has big ideas for a massive federally funded AI cloud research project, and it thinks they are worth $500 million a year. All it wants is first dibs on vast amounts of raw government data.
The company wants the U.S. government to pony up at least half a billion dollars annually to fund a giant national hub for AI research, according to Google’s response to a request for stakeholders to weigh in on the project. Already in the works, the National AI Research Resource — or NAIRR — is expected to benefit all three of the largest commercial cloud services — including Google Cloud. But Google has devised a detailed plan for how it could be built and how Google should be involved. And its vice president and general manager for AI and industry solutions in its cloud unit, Andrew Moore, sits on the government task force guiding the project.
If the company has its way, the proposed funding would not just create a new contract for Google Cloud; it could also benefit other divisions under the Alphabet umbrella, including its urban tech unit Sidewalk Labs.
In a bold, previously unreported proposal submitted in October to the federal task force overseeing the project, Google stated, “In order to achieve significant impact, we recommend that the [U.S. government] fund the resource at $500 million/year or more.” The resource is meant to be a repository of data, AI tools and access to the computing power that researchers need to develop machine learning and other AI systems. It’s in the early stages of planning through a process led by the National Science Foundation and the White House Office of Science and Technology Policy.
Because of its focus on large-scale AI, which requires huge datasets and tons of storage and computational capacity, the resource by its nature is likely to involve the top cloud providers. The other two cloud giants, Amazon Web Services and Microsoft Azure, did not propose specific dollar figures, but both companies are eager to get in on the action.
And, although the companies engage in bruising competition to attract enterprise cloud customers, AWS, Google and Microsoft each indicated some willingness to work together to support the initiative. In the end, the project could reap dividends for the AI and cloud industry when it comes to fueling data sources, educating the next generation of much-desired tech talent and spurring increased interest in cloud and AI services from the public sector.
The push for strong commercial ties
All three companies also emphasized the benefits of constructing the research resource on a foundation enabled by commercial cloud providers as opposed to something built by the federal government.
“We believe the NAIRR should be a multi-cloud hosting platform for commercial Cloud resources (as opposed to a new Cloud platform developed by government or academia),” wrote Google, which, as the underdog of the Big Cloud triad, has the most to gain from playing nice with the competition in a multi-cloud setup.
The company highlighted the “security, operational, and energy efficiency” benefits of partnering with the cloud experts rather than building in-house. “Building a new platform from the ground up would require a huge investment of dollars and expertise, and even once built would not have the advantages brought by the scale of existing Cloud providers.”
Meanwhile, Amazon used the task force’s request for information as an opportunity to pitch its AI and cloud products and services. “As a leading cloud service provider, AWS’s compute, storage, AI/ML, and data analytics services can form the backbone of NAIRR’s shared research infrastructure,” noted the company. Sometimes AWS even ventured into sales-deck territory: “The AWS Global Cloud Infrastructure [can enable] the NAIRR to deploy application workloads across the globe in a single click,” and its pre-trained AI services “can provide ready-made intelligence,” the company wrote.
Microsoft didn’t shy away from the sales opportunity either. Plus, it had flow charts: One chart featured an AI technology stack built from Microsoft services including the Azure Open Data repository and Azure Machine Learning.
Manish Parashar, director of NSF’s Office of Advanced Cyberinfrastructure and co-chair of the task force, told Protocol it’s too early to know whether or how private sector cloud providers might be involved. However, he said there is general consensus that the data and computing service infrastructure underpinning the project will combine existing and new resources.
“This approach would take advantage of campus-level, region-level and national-level resources, creating a federated platform that connects users to a diverse set of resources and facilitates their use through educational tools and user support,” he said.
Following meetings next week and into early next year, the task force will issue reports to Congress in May and November 2022.
Over time, a hybrid approach to standing up the research resource would be the fastest and most cost-efficient, according to the Stanford Institute for Human-Centered Artificial Intelligence, which has pushed hard for a cloud-based national AI research hub. In its “Blueprint for the National Research Cloud” published in October, the institute recommended a dual investment strategy involving partnering with commercial cloud providers for computing power at first, then piloting a publicly owned infrastructure built by commercial vendors but operated by the government, which is the model for national labs such as the massive Oak Ridge National Laboratory.
The Stanford team estimated that building a standalone public infrastructure for the research hub would be less expensive in the long run than working under a vendor contracting arrangement. According to their math, if the government were to negotiate a 10% discount with AWS to use its computing services and comparable hardware under constant usage over a five-year period, it could cost 7.5 times as much as the estimated costs to run the Summit supercomputer at Oak Ridge, the world’s second-most powerful supercomputer.
“Even in a scenario where [national research cloud] usage fluctuates dramatically, commercial cloud computing could cost 2.8 times Summit’s estimated cost,” they wrote.
All three big cloud companies mentioned existing public-private partnerships they’re involved with to enable cloud services for academic and government research. For instance, they all partner with CloudBank, an NSF-affiliated service led by several universities to provide cloud access to computer science students, as well as with a cloud environment for medical research overseen by the National Institutes of Health.
Microsoft even mentioned its partnerships in support of research outside the U.S., including through its government-funded, public-private “AI innovation hub” in Shanghai, China.
Mentioning a partnership with the Chinese government is notable. The task force was established by Congress in 2020 on the recommendation of the National Security Commission on AI, which has pushed for billions of dollars in non-defense funding to bolster AI research in the hopes that the U.S. keeps pace with global AI development, particularly with China. The commission has referred to China as a rival in a “race” not just to win AI tech development, but to ensure AI incorporates Western “values.”
Why Google wants to manage the data
From the looks of its own submission to the task force, Google has moved well beyond the sales pitch to the project planning phase. The company’s ideas for the research resource go into detail well beyond a proposed dollar figure.
Google wants data from the private sector as well as from state and local government sources to be fed into the system, including “some types of sensitive government data” like health, census and financial services data. And it wants researchers who don’t need computing power from the research cloud to be able to access that data. For one thing, that would ensure that researchers from commercial cloud providers like Google, AWS and Microsoft can get at the data. “Rates should be lower and subsidized by the [U.S. government] for academic and government users,” Google wrote.
Stanford’s AI Institute researchers emphasized the need to ensure that the government-funded research hub remain a resource for academic and non-profit researchers, not the private sector. Jen King, privacy and data policy fellow at the institute who helped write the paper, pointed to “the growing brain drain of AI academics into industry,” where it’s easier to access data and computing power. “My colleagues and I explored the question of whether it would make sense to open this resource to private actors, and we concluded that at least initially, doing so would pose legal and logistical issues, as well as distract from the core mission of supporting research in AI.”
Google wants to be as close as possible to the firehose of data that would flow into the research hub. “We recommend that the NAIRR co-locate an instance of Data Commons in all NAIRR clouds, which we would provide as an in-kind contribution.” Essentially, what Google is proposing here is that it would manage all the data clean-up work to ensure data quality and standard formatting of countless disparate government data feeds, and that it would do it for free. The process is necessary to prepare and unify data for training AI models. Once the data is cleaned and standardized by Google, it would sit in a common area accessible through any cloud platform connected to the research resource.
“So, for example, if a researcher wants the population, violent crime rate and unemployment rate of a county, the researcher does not have to go to three different datasets (Census, FBI and BLS), but can instead, get it from a single database, using one schema, one API,” wrote Google. “Co-locating updated versions of Data Commons with the NAIRR would therefore enable more effective use of the resource.”
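The unification Google describes can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample rows, the field names, the figures and the `county_record` helper are made up and do not reflect the actual schema or API of Data Commons, the Census, the FBI or the BLS.

```python
from typing import Dict, List

# Made-up placeholder rows, one list per source. Each source uses its own
# county identifier field and its own naming; all values are illustrative.
CENSUS = [{"fips": "00001", "total_pop": 1000}]
FBI = [{"county_code": "00001", "violent_per_100k": 250.0}]
BLS = [{"area_fips": "00001", "unemp_rate_pct": 4.0}]


def index_by(rows: List[dict], key: str) -> Dict[str, dict]:
    """Index a source's rows by that source's own county identifier field."""
    return {row[key]: row for row in rows}


_census = index_by(CENSUS, "fips")
_fbi = index_by(FBI, "county_code")
_bls = index_by(BLS, "area_fips")


def county_record(fips: str) -> Dict[str, object]:
    """Return one record in a single schema; a missing source yields None."""
    return {
        "county_fips": fips,
        "population": _census.get(fips, {}).get("total_pop"),
        "violent_crime_rate": _fbi.get(fips, {}).get("violent_per_100k"),
        "unemployment_rate": _bls.get(fips, {}).get("unemp_rate_pct"),
    }


print(county_record("00001"))
```

In this sketch, whoever writes the mapping decides which source fields survive and what they are called in the shared schema.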
But despite proposing to do the work at no cost, Google makes a point of highlighting how valuable the service is. “Cleaning a large dataset is no small feat; before making Google datasets publicly available for the open-source community, we spend hundreds of hours standardizing data and validating quality.”
Google’s proposal to do the job pro bono, said Eric Woods, research director at smart city technology research firm Guidehouse Insights, “raises the question of tech companies bearing gifts — what’s in it for them?”
Ultimately the project may not be just about the cloud business for Google. As a leader in extracting value from the world’s information, said Woods, Google could squeeze a lot of value from processing raw government data. “There is value in that data before it’s filtered that can be extracted,” he said. Particularly when it comes to sensitive data that the company may not have access to currently, it could provide new insights and help Google improve algorithms for various aspects of its business — for starters, its search and maps products.
The resource could become a huge dumping ground for regularly updated, raw data feeds from federal agencies, states, municipalities or even private entities across the country. As the official cleaning crew, Google could access information it has not been able to see before, in a form others could not see once it’s cleaned and formatted for access through a data commons. Perhaps more importantly, it could give Google the power to decide how that information is organized, labeled and formatted.
Matt Tarascio, senior vice president of artificial intelligence at consulting and research firm Booz Allen, agreed that having first-hand knowledge of data flows and what information looks like before clean-up would enhance Google’s algorithmic prowess. “There’s significant value in understanding the data streams and where the data comes from,” he said.
Having that sort of data access and decision-making power could be particularly beneficial for Google’s sibling, Alphabet-owned Sidewalk Labs, a company that uses municipal and other public and commercial data to build algorithmic tech for city governments, energy utilities, real estate developers and healthcare providers. “They would enhance their ability to understand and cleanse messy, public datasets,” said Woods. Sidewalk Labs itself proposed use of a data commons in conjunction with its failed “city of the future” experiment in Canada, Sidewalk Toronto.
If Google were to be chosen to process the data for the research hub, there are bound to be concerns about a commercial entity managing it, said Woods. “That’s exactly the debate that was going on around Sidewalk Toronto,” he said. When Google proposed using a data commons there, he said, “Others were saying, hang on, who’s ultimately got control over this?”