Why AI and machine learning are drifting away from the cloud

Cloud computing isn’t going anywhere, but some companies are shifting their machine learning data and models to machines they manage in-house. Adopters are spending less money and getting better performance.

Illustration: Christopher T. Fong/Protocol

A quick-service restaurant chain is running its AI models on machines inside its stores to localize delivery logistics. At the same time, a global pharma company is training its machine learning models on premises, using servers it manages by itself.

Cloud computing isn’t going anywhere, but some companies that use machine learning models and the tech vendors supplying the platforms to manage them say machine learning is having an on-premises moment. For many years, cloud providers have argued that the computing infrastructure needed for machine learning would be far too expensive and cumbersome for companies to stand up on their own, but the field is maturing.

“We still have a ton of customers who want to go on a cloud migration, but we're definitely now seeing — at least in the past year or so — a lot more customers who want to repatriate workloads back onto on-premise because of cost,” said Thomas Robinson, vice president of strategic partnerships and corporate development at MLOps platform company Domino Data Lab. Cost is actually a big driver, said Robinson, noting the hefty price of running computationally intensive deep-learning models such as GPT-3 or other large-language transformer models, which businesses today use in their conversational AI tools and chatbots, on cloud servers.

The on-prem trend is growing among big box and grocery retailers that need to feed product, distribution and store-specific data into large machine learning models for inventory predictions, said Vijay Raghavendra, chief technology officer at SymphonyAI, which works with grocery chain Albertsons. Raghavendra left Walmart in 2020 after seven years with the company in senior engineering and merchant technology roles.

“This happened after my time at Walmart. They went from having everything on-prem, to everything in the cloud when I was there. And now I think there's more of an equilibrium where they are now investing again in their hybrid infrastructure — on-prem infrastructure combined with the cloud,” Raghavendra told Protocol. “If you have the capability, it may make sense to stand up your own [co-location data center] and run those workloads in your own colo, because the costs of running it in the cloud does get quite expensive at certain scale.”

Some companies are considering on-prem setups in the model building phase, when ML and deep-learning models are trained before they are released to operate in the wild. That process requires compute-heavy tuning and testing of large numbers of parameters or combinations of different model types and inputs using terabytes or petabytes of data.

“The high cost of training is giving people some challenges,” said Danny Lange, vice president of AI and machine learning at gaming and automotive AI company Unity Technologies. The cost of training can run into millions of dollars, Lange said.

“It’s a cost that a lot of companies are now looking at saying, can I bring my training in-house so that I have more control on the cost of training, because if you let engineers train on a bank of GPUs in a public cloud service, it can get very expensive, very quickly.”

Companies shifting compute and data to their own physical servers located inside owned or leased co-located data centers tend to be on the cutting edge of AI or deep-learning use, Robinson said. “[They] are now saying, ‘Maybe I need to have a strategy where I can burst to the cloud for appropriate stuff. I can do, maybe, some initial research, but I can also attach an on-prem workload.’”

One pharmaceutical customer Domino Data Lab works with has purchased two Nvidia server clusters to manage compute-heavy image recognition models on-prem, even though it has publicized its cloud-centric strategy, Robinson said.

High cost? How about bad broadband

For some companies, a preference for running their own hardware is not just about training massive deep-learning models. Victor Thu, president at Datatron, said retailers or fast-food chains with area-specific machine learning models, used to localize delivery logistics or optimize store inventory, prefer to run ML inference workloads on their own servers inside their stores instead of passing data back and forth to run the models in the cloud.

Some customers “don’t want it in the cloud at all,” Thu told Protocol. “Retail behavior in San Francisco can be very different from Los Angeles and San Diego for example,” he said, noting that Datatron has witnessed customers moving some ML operations to their own machines, especially those retailers with poor internet connectivity in certain locations.

Model latency is a more commonly recognized reason to shift away from the cloud. Once a model is deployed, the time it takes to pass data back and forth to cloud servers is a common factor in deciding to go in-house. Some companies also avoid the cloud to make sure models respond rapidly to fresh data when operating in a mobile device or inside a semi-autonomous vehicle.

“Often the decision to operationalize a model on-prem or in the cloud has largely been a question of latency and security dictated by where the data is being generated or where the model results are being consumed,” Robinson said.

Over the years, cloud providers have overcome early perceptions that their services were not secure enough for some customers, particularly those from highly regulated industries. As big-name companies such as Capital One have embraced the cloud, data security concerns have less currency nowadays.

Still, data privacy and security do compel some companies to use on-prem systems. AiCure uses a hybrid approach in managing data and machine learning models for its app used by patients in clinical trials, said the company’s CEO, Ed Ikeguchi. AiCure keeps processes involving sensitive, personally identifiable information (PII) under its own control.

“We do much of our PII-type work locally,” Ikeguchi said. However, he said, when the company can use aggregated and anonymized data, “then all of the abstracted data will work with cloud.”

Ikeguchi added, “Some of these cloud providers do have excellent infrastructure to support private data. That said, we do take a lot of precautions on our end as well, in terms of what ends up in the cloud.”

“We have customers that are very security conscious,” said Biren Fondekar, vice president of customer experience and digital strategy at NetApp, whose customers from highly regulated financial services and health care industries run NetApp’s AI software in their own private data centers.

Big cloud responds

Even cloud giants are responding to the trend by subtly pushing their on-prem products for machine learning. AWS promoted its Outposts infrastructure for machine learning last year in a blog post, citing decreased latency and high data volume as two key reasons customers want to run ML outside the cloud.

“One of the challenges customers are facing with performing inference in the cloud is the lack of real-time inference and/or security requirements preventing user data to be sent or stored in the cloud,” wrote Josh Coen, AWS senior solutions architect, and Mani Khanuja, artificial intelligence and machine learning specialist at AWS.

In October, Google Cloud announced Google Distributed Cloud Edge to accommodate customer concerns about region-specific compliance, data sovereignty, low latency and local data processing.

Microsoft Azure has introduced products including its Azure Arc services to help customers take a hybrid approach to managing machine learning by running ML models in data centers or at the edge, and validating and debugging models on local machines, then deploying them in the cloud.

Snowflake, which is integrated with Domino Data Lab’s MLOps platform, is mulling more on-prem tools for customers, said Harsha Kapre, senior product manager at Snowflake. “I know we're thinking about it actively,” he told Protocol. Snowflake said in July that it would offer its external table data lake architecture — which can be used for machine learning data preparation — for use by customers on their own hardware.

“I think in the early days, your data had to be in Snowflake. Now, if you start to look at it, your data doesn't actually have to be technically [in Snowflake],” Kapre said. “I think it’s probably a little early” to say more, he added.

Hidden costs

As companies integrate AI across their businesses, more and more people in an enterprise are using machine learning models, which can run up costs if they do it in the cloud, said Robinson. “Some of these models are now used by applications with so many users that the compute required skyrockets and it now becomes an economic necessity to run them on-prem,” he said.

But some say the on-prem promise has hidden costs.

“The cloud providers are really, really good at purchasing equipment and running it economically, so you are competing with people who really know how to run efficiently. If you want to bring your training in-house, it requires a lot of additional cost and expertise to do,” Lange said.

Bob Friday, chief AI officer at communications and AI network company Juniper Networks, agreed.

“It’s almost always cheaper to leave it at Google, AWS or Microsoft if you can,” Friday said, adding that if a company doesn’t have an edge use-case requiring split-second decision-making in a semi-autonomous vehicle, or handling large streaming video files, on-prem doesn’t make sense.

But cost savings are there for enterprises with large AI initiatives, Robinson said. While companies with smaller AI operations may not realize cost benefits by going in-house, “at scale, cloud infrastructure, particularly for GPUs and other AI-optimized hardware, is much more expensive,” he said, alluding to Domino Data Lab’s pharmaceutical client that invested in Nvidia clusters “because the cost and availability of GPUs was not palatable on AWS alone.”

Robinson added, “Another thing to take into consideration is that AI-accelerated hardware is evolving very rapidly and cloud vendors have been slow in making it available to users.”

In the end, like the shift toward multiple clouds and hybrid cloud strategies, the machine learning transition to incorporate on-prem infrastructure could be a sign of sophistication among businesses that have moved beyond merely dipping their toes in AI.

“There's always been a bit of a pendulum effect going on,” Lange said. “Everybody goes to the cloud, then they sort of try to move back a bit. I think it's about finding the right balance.”

