Transformer networks, known colloquially to deep-learning practitioners and computer engineers as “transformers,” are all the rage in AI. Over the last few years, these models, known for their massive size, enormous training datasets and huge parameter counts — and, by extension, their high carbon footprint and cost — have won favor over other types of neural network architectures.
Some transformers, particularly large natural-language-processing models, even have names that are recognizable to people outside AI, such as GPT-3 and BERT. They’re used across audio-, video- and computer-vision-related tasks, drug discovery and more.
Now chipmakers and researchers want to make them speedier and more nimble.
“It’s interesting how fast technology for neural networks changes. Four years ago, everybody was using these recurrent neural networks for these language models and then the attention paper was introduced, and all of a sudden, everybody is using transformers,” said Bill Dally, chief scientist at Nvidia, during an AI conference held last week by Stanford’s HAI. Dally was referring to an influential 2017 Google research paper that introduced the architecture now forming the backbone of transformer networks, which relies on “attention mechanisms,” or “self-attention,” a new way of processing a model’s data inputs and outputs.
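The self-attention mechanism at the heart of that architecture can be sketched in a few lines. The following is a toy illustration with made-up dimensions, not production code: each token position computes a weighted mix of every other position’s values, with the weights derived from pairwise similarity.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # mix of value vectors

# Toy example: a "sequence" of 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (3, 4)
```

Because every position attends to every other, the computation parallelizes well, which is part of why transformers displaced the step-by-step recurrent networks Dally mentions.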
“The world pivoted in a matter of a few months and everything changed,” Dally said. To meet the growing interest in transformer use, in March the AI chip giant introduced the transformer engine in its Hopper H100 GPU to streamline transformer model workloads.
Designing transformer tech for the edge
But some researchers are pushing for even more. There’s talk not only of making compute- and energy-hungry transformers more efficient, but of eventually redesigning them to process fresh data on edge devices, without the round trip to the cloud.
In an April paper, a group of researchers from Notre Dame and China’s Zhejiang University presented a way to reduce memory-processing bottlenecks and cut computational and energy requirements. Their “iMTransformer” approach is a transformer accelerator that decreases memory-transfer needs by computing in memory, and reduces the number of operations required by caching reusable model parameters.
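The hardware details aside, the parameter-caching idea can be illustrated in software terms. The sketch below is hypothetical and not the authors’ implementation; it simply shows the payoff of keeping reusable weights resident instead of re-fetching them on every use.

```python
import numpy as np

class CachedLinear:
    """Illustrative layer that fetches its weights once and reuses them,
    rather than re-transferring them from slow memory on every call."""
    def __init__(self, weight_loader):
        self._load = weight_loader   # stands in for a costly memory transfer
        self._cached_w = None
        self.fetches = 0             # count how often we pay the transfer cost

    def __call__(self, x):
        if self._cached_w is None:   # fetch once, then reuse
            self._cached_w = self._load()
            self.fetches += 1
        return x @ self._cached_w

def load_weights():
    return np.eye(4)                 # placeholder for real model parameters

layer = CachedLinear(load_weights)
for _ in range(10):                  # ten "inference" calls...
    _ = layer(np.ones((1, 4)))
print(layer.fetches)                 # ...but only one weight fetch
```

In an accelerator, the same principle is applied in hardware: if the parameters live where the arithmetic happens, the dominant cost of shuttling weights back and forth largely disappears.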
Right now the trend is to bulk up transformers so the models get large enough to take on increasingly complex tasks, said Ana Franchesca Laguna, a computer science and engineering PhD at Notre Dame. When it comes to large natural-language-processing models, she said, “It’s the difference between a sentence or a paragraph and a book.” But, she added, “The bigger the transformers are, your energy footprint also increases.”
Using an accelerator like the iMTransformer could help to pare down that footprint, and, in the future, create transformer models that could ingest, process and learn from new data in edge devices. “Having the model closer to you would be really helpful. You could have it in your phone, for example, so it would be more accessible for edge devices,” she said.
That means IoT devices such as Amazon’s Alexa, Google Home or factory equipment maintenance sensors could process voice or other data in the device rather than having to send it to the cloud, which takes more time and more compute power, and could expose the data to possible privacy breaches, Laguna said.
IBM also introduced an AI accelerator called RAPID last year. “Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments,” wrote the company’s researchers in a paper. “The intrinsic error-resilient nature of AI workloads present[s] a unique opportunity for performance/energy improvement through precision scaling.”
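“Precision scaling” refers to running a model’s arithmetic at lower numeric precision, trading a small, tolerable error for large savings in energy and memory. A rough sketch of the idea, illustrative only and not IBM’s RAPID design, is 8-bit quantization of 32-bit weights:

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
# int8 storage is 4x smaller; the round-trip error stays small
err = np.abs(dequantize(q, s) - w).max()
```

Because neural networks tolerate this kind of noise, the quantized model usually behaves almost identically while the hardware does cheaper 8-bit math.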
Farah Papaioannou, co-founder and president at Edgeworx, said she thinks of the edge as anything outside the cloud. “What we’re seeing of our customers, they’re deploying these AI models you want to train and update on a regular basis, so having the ability to manage that capability and update that on a much faster basis [is definitely important],” she said during a 2020 Protocol event about computing at the edge.
Wanted: custom chips
Laguna uses a work-from-home analogy when thinking of the benefits of processing data for AI models at the edge.
“[Instead of] commuting from your home to the office, you actually work from home. It’s all in the same place, so it saves a lot of energy,” she said. She said she hopes research like hers will enable people to build and use transformers in a more cost- and energy-efficient way. “We want it on our edge devices. We want it smaller and smaller, and it has to be more energy efficient.”
Laguna and the other researchers she worked with tested their accelerator approach using smaller chips, and then extrapolated their findings to estimate how the process would work at a larger scale. However, Laguna said that turning the small-scale project into a reality at a larger scale will require customized, larger chips.
Ultimately, she hopes it spurs investment. A goal of the project, she said, “is to convince people that this is worthy of investing in so we can create chips so we can create these types of networks.”
That investor interest might just be there. AI is spurring investment in chips for specific use cases: according to data from PitchBook, global sales of AI chips rose 60% last year over 2020, to $35.9 billion. Around half of that total came from specialized AI chips in mobile phones.
Systems designed to operate at the edge with less memory, rather than in the cloud, could facilitate AI-based applications that respond to new information in real time, said Jarno Kartela, global head of AI Advisory at consultancy Thoughtworks.
“What if you can build systems that by themselves learn in real time and learn by interaction?” he said. “Those systems, you don’t need to run them on cloud environments only with massive infrastructure — you can run them virtually anywhere.”