
How Sonos built its voice assistant

Protocol Entertainment

Hello, and welcome to Protocol Entertainment, your guide to the business of the gaming and media industries. This Thursday, we’re exploring how Sonos built its voice assistant, and why Amazon didn’t use computer vision for its new Glow projector device. Also: Time to take a deep breath.

'Hey, Sonos'

Sonos quietly began rolling out its voice assistant to some people in the U.S. this week, days before its official June 1 launch date. Sonos Voice Control is purpose-built for music playback, and it comes with strong privacy safeguards: Unlike Alexa or the Google Assistant, it doesn’t upload any voice recordings to the cloud, but instead processes everything on the device.
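That on-device architecture can be pictured as a pipeline in which audio never reaches a network call. A minimal illustrative sketch (every function and string below is hypothetical, not Sonos code):

```python
# Illustrative on-device voice pipeline: wake word, local speech-to-text,
# then intent handling -- no network calls anywhere in the path.
# All names and logic here are stand-ins, not Sonos's implementation.

def detect_wake_word(audio_frame: bytes) -> bool:
    """Tiny stand-in for a local keyword spotter ("Hey, Sonos")."""
    return audio_frame == b"hey sonos"

def transcribe_locally(audio: bytes) -> str:
    """Stand-in for an embedded speech-recognition model."""
    return audio.decode()

def handle_utterance(audio: bytes) -> str:
    """Run recognition and intent parsing entirely on the device."""
    text = transcribe_locally(audio)
    if text.startswith("play "):
        return f"playing: {text[5:]}"
    return "sorry, I only handle music requests"

if detect_wake_word(b"hey sonos"):
    print(handle_utterance(b"play purple rain"))  # playing: purple rain
```

The privacy claim falls out of the structure: because recognition and intent parsing both run locally, there is simply no step at which a recording could be uploaded.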

I spoke with Sonos’ Sébastien Maury, the company’s senior director for Voice Experience, and Kåre Sjolander, European head of Text-to-Speech for synthetic voice specialist ReadSpeaker, to learn more about the work that went into building the assistant.

Giving the Sonos assistant a voice. Sonos teamed up with ReadSpeaker to generate the unique voice profile of its assistant, which is based on “Breaking Bad” actor Giancarlo Esposito.

  • Esposito spent around 40 hours in the studio recording thousands of sentences and phrases that were then used as training data for the voice model.
  • A lot of the recorded material was not specific to music at all. “Basically, we have a base standard script for English,” Sjolander said. “It's short sentences, longer ones, numbers. All different types of material.”
  • ReadSpeaker also included a bunch of Spanish vocabulary in Esposito’s script to improve the voice model’s pronunciation of Latin artists and songs. “He actually had a Spanish coach during the recording,” Maury said.
  • Esposito was also asked to read some Sonos assistant-specific material, but even those phrases aren’t being used 1:1. Instead, it’s all AI fodder. “You build a model of the actor’s voice, which basically should be able to say anything,” Sjolander said.
  • There’s one notable exception to this: When people summon the assistant with the “Hey, Sonos” phrase and then don’t follow up with anything, they’ll hear Esposito’s actual voice say “Yes?”
  • “We wanted to have a very specific intonation for that,” Maury said. “Like somebody that is a bit annoyed … ‘Okay, come on!’”

Making sure the assistant understands you. Having an assistant respond with a synthetic voice is only half the battle. Getting it to actually understand requests is just as important — and even more challenging if it’s done locally on the device.

  • Amazon and Google use cloud-based voice recognition for their respective assistants and actually have humans review small subsets of those recordings to improve accuracy.
  • However, there’s been some backlash against that practice, which is why Sonos decided against it. Instead, Sonos is using voice recordings from its opt-in community of beta users to train its assistant.
  • The company also partnered with outside contractors for additional recordings. “We give them some script and we gather [training] data,” Maury said.
  • Sonos plans to continuously update this data to account for new artists, weirdly pronounced song names and other edge cases.

The focus on just one use case makes things a little easier. Sonos Voice Control won’t need to tell people about the weather or their commute, and speaker owners will likely use a much more streamlined set of requests.

  • Still, it’s no walk in the park. “The music domain is actually probably the hardest,” Maury said.
  • Just think of all the artists whose names you can’t pronounce, or all the artists, bands and songs that contain the term “Alice.” Somehow, the assistant has to make sense of all of them, or people will just give up on using it.
  • Sonos does have a bit of a superpower at its disposal: The company uses the songs and artists people have favorited in its app as the default answer.
  • Instead of building one all-knowing assistant, the company effectively personalizes it for each and every listener.

“This is one of the advantages of running locally,” Maury said. “We have one [speech recognition] model per house.”
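The "Alice" problem and the per-house personalization can be combined in one simple disambiguation rule: when a query matches several catalog entities, prefer the ones this listener has favorited. A hedged sketch with made-up data, not Sonos's actual ranking logic:

```python
# Illustrative sketch: resolving an ambiguous music query ("play Alice")
# by preferring entities the listener has favorited in the app.
# Catalog, tie-breaking order and API are assumptions for illustration.

CATALOG = ["Alice Cooper", "Alice in Chains", "Alice Phoebe Lou"]

def resolve(query: str, favorites: set[str]) -> str:
    """Return the best catalog match, breaking ties with user favorites."""
    matches = [name for name in CATALOG if query.lower() in name.lower()]
    if not matches:
        raise LookupError(f"no match for {query!r}")
    # Favorited entities win; otherwise fall back to catalog order.
    favored = [m for m in matches if m in favorites]
    return (favored or matches)[0]

print(resolve("alice", favorites={"Alice in Chains"}))  # Alice in Chains
print(resolve("alice", favorites=set()))                # Alice Cooper
```

Two households asking the identical question can get different (and, for each of them, more useful) answers, which is exactly what a per-house model buys you.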

— Janko Roettgers

Computer vision is not a panacea

When I first heard about Amazon’s new kid-focused video calling device, the Amazon Glow, my mind immediately went to Osmo, which has been combining digital and physical play with its child-centric entertainment apps and accessories for years. There are even tangram sets for both, allowing kids to solve digital puzzles with physical puzzle pieces.

But after talking to some of the folks who worked on the Glow for an in-depth story on its development that was published this week, I realized that Amazon ultimately decided to take a very different approach — and the reasons for that decision show that there’s no one-size-fits-all approach when it comes to building next-generation entertainment devices.

  • Osmo uses computer vision to extend play beyond the screen. The company’s hardware includes a small clip-on mirror that redirects an iPad’s forward-facing camera view towards the table, turning it into a supervised play space.
  • “We looked at Osmo,” acknowledged Amazon Senior Hardware Engineer Martin Aalund, a founding member of the Glow team. “They're recognizing objects with their camera and tracking those objects.”
  • However, the premise of the Glow went beyond tracking objects. “We wanted an interactive screen,” Aalund said. “To actually detect when a finger is touching a surface is a lot harder.”
  • The Glow team looked at a couple different ways to make computer vision work, including using multiple cameras and tracking the shadow of a child’s finger. Nothing really seemed good enough.
  • One issue: Cameras get distracted easily. “If you put [your device] next to a window and there's a tree outside with branches that are swaying, and you have shadows moving across the playspace, all of a sudden you start detecting all these false positives,” Aalund said.

Ultimately, Amazon went with an IR sensor that can track a person’s fingers instead of a traditional RGB camera. However, Aalund readily admitted computer vision may one day provide even better results. “We started this five years ago,” he said. “We didn't have quite as powerful cameras and systems as we do today. That biased us a little bit.”
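One common way sensor-based systems reject the kind of "swaying branch" false positives Aalund describes is temporal debouncing: only report a touch when the fingertip stays within a few millimeters of the surface for several consecutive frames. The thresholds and logic below are assumptions for illustration, not the Glow's algorithm:

```python
# Illustrative sketch: debounced touch detection from per-frame
# fingertip-to-surface distances. A one-frame flicker (e.g. a passing
# shadow fooling the sensor) is ignored; a sustained press registers.
# All thresholds are hypothetical, not the Glow's actual parameters.

TOUCH_MM = 3.0      # fingertip must be this close to the surface...
HOLD_FRAMES = 4     # ...for this many frames in a row to count as a touch

def detect_touches(depths_mm: list[float]) -> list[int]:
    """Return the frame indices where a debounced touch is registered."""
    touches, streak = [], 0
    for i, depth in enumerate(depths_mm):
        streak = streak + 1 if depth <= TOUCH_MM else 0
        if streak == HOLD_FRAMES:
            touches.append(i)
    return touches

# Frame 1 is a fleeting false positive; frames 4-7 are a real press.
print(detect_touches([50, 2, 60, 55, 2, 2, 2, 2, 40]))  # [7]
```

The same filtering helps an RGB camera too, but an IR depth sensor gives you the distance measurement directly instead of having to infer it from shadows, which is the harder problem the Glow team ran into.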

— Janko Roettgers


In other news

Magic Leap is getting rid of its original headset. The company’s pivot to the enterprise is complete: Discount site Woot is selling the Magic Leap 1 headset, which used to be $2,300, for $550 this week.

Netflix is eyeing console and cloud gaming. In a lengthy survey, the company asked subscribers about their interest in playing Netflix games on TV.

Niantic is building an AR map of the world. The Pokémon Go developer has been crowdsourcing its Visual Positioning System, which allows developers to create persistent AR experiences at 30,000 locations.

Netflix layoffs disproportionately impacted people with marginalized identities. A recent round of layoffs resulted in deep cuts on social media teams set up to speak to people of color and LGBTQ+ viewers.

The metaverse gets its first in-world conference. The Meta Festival, scheduled for June 28, will include speakers from Netflix, Headspace, Paramount and others.

Roblox hires a former Zynga and Twitter exec. Nick Tornow, the former chief technology officer at Zynga, is joining Roblox as vice president of Engineering for its developer team. Tornow was previously Twitter’s platform lead.

The war in Ukraine is still straining game development. The Belarusian game developer Sad Cat Studio said on Wednesday it was delaying its upcoming Xbox exclusive Replaced to 2023, citing the ongoing conflict and the impact it’s had on staff members.

An NFT nightmare: Seth Green made headlines this week when his Bored Ape NFT was stolen and resold to a buyer who has no intention of returning it. That could complicate Green’s plans for an animated TV show using the underlying art and character of the NFT.

Take a deep breath

It’s easy to feel lost and overwhelmed in a week like this. Self-care obviously won’t solve all of our problems (for one, it doesn’t get rid of assault weapons), but taking a moment for yourself can at least help to cope with some of the feelings these senseless tragedies leave us with. One way to do that is guided meditation, which is something that VR meditation company Tripp is currently offering for free in its mobile app. Plus, Tripp recently teamed up with Niantic to soon integrate AR experiences into its mobile app, so you’ll be able to find those self-care moments anywhere.

— Janko Roettgers


Thoughts, questions, tips? Send them to Enjoy your day, see you tomorrow.
