Spotify doesn't want to just build a better way to listen to music. (Though, yes, it does want to do that.) The company has made clear over the last couple of years that its ambitions are much bigger: Spotify has invested deeply in podcasting for both creators and consumers, it has delved into the world of audiobooks, it acquired a company to build a live-audio product, and in general it wants to be the home of audio online.
If you really want to understand where Spotify is going, though, forget the music and audio industry altogether. Look at what's happening with video online. YouTube is making video searchable, discoverable and wildly lucrative; TikTok is making it social, remixable and viral. Spotify wants to do all of that, but in your headphones instead of on your screen. And that means rethinking the way the entire audio business — and tech stack — works.
Gustav Soderstrom, Spotify's chief R&D officer and chief product officer, leads a team of thousands building the future the company imagines. He joined the Source Code podcast to talk about why audio was skipped over in the evolution of technology, how Spotify is trying to balance supporting an open ecosystem with building its own stuff, and how audio changes when you treat it like software. (One thing he didn't want to talk about? Joe Rogan, and the questions the company faces about content moderation and misinformation. That's for another episode.) He also talked about Spotify's ongoing quest to figure out how to bring all that audio into a single app in a way that makes sense.
You can hear our full conversation on the latest episode of the Source Code podcast, or by clicking on the player above. Below are excerpts from our conversation, edited for length and clarity.
The first thing I want to talk about is something you said to me when we talked a year ago, which was that we kind of skipped past audio in the story of technology. That we went straight from text to photos to video and that audio kind of should have been in there in the middle. But we just jumped right past it. Why do you think that is?
I'm not sure I know exactly why. One view you could have is that it's just been under-invested in, until maybe Spotify. But most of the other mediums also moved very slowly. For example, text messaging was built around standards like SMS and MMS. And I think the gap between the inception of the idea of adding pictures to a text message and the MMS standard actually shipping and being implemented across all carriers and phones was maybe 10 or 15 years. So sure, the formula existed and the innovation cycle was there, it was just very, very slow.
A large part of that was probably because the innovation cycle was very broad, based on a decision between many players, where you eventually arrived at these lowest common denominator solutions. What happened to these other mediums is that for various reasons, at some point they kind of all became software. They got enveloped inside a single software stack, from end to end, from creator to consumer. And they started moving at the speed of software iteration, instead of the speed of standardization bodies.
And obviously, podcasts came to exist through standardization. And it was and is a fantastic thing, because that's what brought it scale. Apple and iTunes did a lot of important work in creating that. But I think that once audio is inside a single software stack — and this is what we're trying to do — it can also start developing much faster than it has.
The RSS standard was the only thing that existed for a very long time. But that was true for the others as well: There were standards around text, there were standards around photos and video and so forth. And on that note, I think the RSS standard is a great thing. And this is the reason why we make sure to stay compatible with RSS. It is in our interest that our creators get as wide distribution as possible, because what is in our creators' interest is in our interest. So we try to do that. But we also try to walk this line where the format gets better, and gets voluntarily better. If you want to use the features that only work in a fully software world, you can, but we're not forcing you to choose between the two.
The other thing I've heard from a few folks is that the music business was just such a mess for such a long time. When I asked around, people were like, "Why would any VC in their right mind touch the music business in 2009?" And I think now, obviously, that is very different, and the industry works very differently. So things, in and out of Spotify, seem like they're moving much faster now.
One way to talk about it is that the future is path-dependent, and history takes a certain path. And the music business took this one path, where there was no innovation for a long time because it was too profitable. And then piracy happened, and there was sort of nothing to lose. I think the reason that Spotify happened in Sweden and not in the U.S. was because Sweden was the worst hit. There were no revenues left to be had, so the music industry was prepared to take chances on new models.
That wasn't really the case for podcasts. There wasn't a piracy crisis for podcasts.
And the radio was doing just fine.
Exactly. So I think it's different between different formats. You're right that music was a very specific case. And everyone advised against it, because there was just so much roadkill in music startups.
Another thing that's also different from podcasts, obviously, is that music is very centralized: A few big labels control all the catalogs. So whatever product development you do has to be negotiation-based. It's not like building useful software products, where you try something and see if it works. Before you even get to try it, you have to negotiate for years. And then it turns out it doesn't work! And so when you negotiate with three, four really strong parties, you're going to get this lowest common denominator of what very powerful organizations want.
But podcasts don't have the same structure with a few labels that control everything; it's much more distributed. So for me as a software person, it was, frankly, very liberating to get to innovate similar to how a Twitter or Google would innovate, by trying things with a few creators and seeing if it works instead of first negotiating long contracts for years. So we actually move faster in podcasts than we do in music.
So let's talk about podcasting a bit. I think we've come to understand how technology and music interact, with algorithmic and editorial personalization, things that Spotify has been talking about for a long time. But it feels like we're just at the beginning of all of that stuff with all the stuff that isn't music. And you've talked a lot over the last couple of years about starting to think about audio as software. Can you just explain what that means to you as you start to think about audio not just as an MP3 track but a piece of software? What does that look like to make that shift?
When everything is wrapped in a single software stack, both the creator experience and the consumer experience, you can start treating it as a non-fixed format. You don't have to decide on exactly the feature set forevermore, right? Because you're not relying on an industry standard underneath.
When everything is wrapped in a single software stack, both the creator experience and the consumer experience, you can start treating it as a non-fixed format.
And so that means that you can do some of the things that I think seem obvious from other mediums. Like, why doesn't podcasting have comments? And now they do. As we've done video, for example, there was a standard around video and video podcasting already, but because it needed to be standardized, as a consumer you had to choose the video feed or the audio feed. You had to download the videos, consume a lot of bandwidth. When you're in the software world, it seems to make sense that you could add video dynamically: If the app is in the foreground you stream video; if it's in the background you don't.
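The foreground/background switch described here can be sketched in a few lines. This is a hypothetical sketch, not Spotify's implementation: the `Episode` fields, URLs and `pick_stream` function are invented for illustration of the per-moment decision a software client can make that a fixed feed standard can't.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    audio_url: str
    video_url: Optional[str] = None  # not every episode has a video feed

def pick_stream(episode: Episode, app_in_foreground: bool) -> str:
    """Choose a rendition per moment instead of per subscription.

    Under the old standard, a listener subscribed to either the audio
    feed or the video feed up front; a software client can switch
    dynamically as the app moves between foreground and background.
    """
    if episode.video_url and app_in_foreground:
        return episode.video_url  # app visible: stream the video
    return episode.audio_url      # backgrounded or audio-only: save bandwidth

ep = Episode(audio_url="ep1-audio.mp4", video_url="ep1-video.mp4")
print(pick_stream(ep, app_in_foreground=True))   # ep1-video.mp4
print(pick_stream(ep, app_in_foreground=False))  # ep1-audio.mp4
```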
So it turns from a fixed thing into quote-unquote "just code." And with code, you can do whatever you want. You can A/B test things, like you normally do in software, so you can learn and iterate much faster both as a company, but also as a creator. So you get all the benefits, I think, of something that is dynamic versus something that is fixed.
And as an analogy, it's interesting that music is actually still very much fixed, because it's centrally created and distributed as a standard object to all DSPs. And also, music honestly hasn't moved that much as a format in the last 100 years. It's now stereo, but that's about it.
That's where, like, Kanye's Stem player is super interesting, right? It's sort of the thing you're talking about: What if I take this thing, and I'm able to break it into its component parts and then reorganize it or remix it or do different things rather than just saying, "Here is the file." Is that a decent comp for what you're talking about?
I think that's exactly what I mean. I think it's super exciting. I'm so glad that Kanye is trying this. The problem for him is that in order to get to participate and try this, you literally have to buy a different player. Because you can't distribute the stuff he's trying through the existing system, because it's standards-based. It's not software-based. So I think you see the exact same creator need from Kanye, for example, wanting to innovate. Why should music be the only format that doesn't ever get better? It just seems unreasonable. And I think it's for structural reasons.
Why should music be the only format that doesn't ever get better?
Help me understand what you gain in having control over the audio file. Because I think that the debate has always been, there's this open RSS-based system that's very useful. And it just puts out a feed and everybody can access it. And there's an open ecosystem that's very good! The downside of it is you're just handed an MP3 file with a bunch of crappy metadata, and that's all you can do with it.
But we're also getting better at speech-to-text, there's AI software that can turn a song into its various stems relatively successfully. So what do you gain by actually having control over the whole end-to-end process, as opposed to just being handed this high-fidelity MP3 file and having to figure out what to do with it?
So you can imagine all kinds of things. In music, you just mentioned what Kanye is doing, which you could let people work with the music and re-create it and so forth. And that's a combination of a technical problem — you can't do that in the MP3 format, you would need the stems — but it's also very much a business model problem. That derivative work, who owns that? Who gets paid for it? But someday, someone is going to solve that. I hope it's us.
On the podcast side, it's different, because you don't have the same structure that you have to pre-negotiate everything. You can actually work with individual creators and try stuff, and some creators want to use them and others don't.
For example, we can add video dynamically so the user doesn't have to choose. We can change the bitrate dynamically, depending on your bandwidth, which the MP3 also doesn't allow. We can do advertising that is much more effective for the creator because it can be real-time and targeted, instead of burnt-in. Which, you know, in video was a step change for creators in how they can monetize.
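Dynamic ad insertion, as opposed to a burnt-in ad, amounts to resolving ad breaks at playback time rather than baking the ad into the file. A minimal sketch, with a made-up `AdServer` class standing in for whatever real-time targeting service would actually run behind the scenes:

```python
class AdServer:
    """Stand-in for a real-time ad-targeting service (hypothetical API)."""
    def pick_ad(self, listener_id: str) -> str:
        # a real server would run targeting/auction logic here
        return f"targeted-ad-for-{listener_id}"

def build_playlist(segments: list, ad_server: AdServer, listener_id: str) -> list:
    """Resolve ad breaks at request time instead of burning ads in.

    A burnt-in ad is part of `segments` forever; an AD_BREAK marker
    lets the server pick a current, targeted ad per listener per play.
    """
    return [ad_server.pick_ad(listener_id) if seg == "AD_BREAK" else seg
            for seg in segments]

episode = ["intro", "AD_BREAK", "interview", "AD_BREAK", "outro"]
print(build_playlist(episode, AdServer(), "listener-42"))
```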
You can also work on expanding the show notes concept, where creators would like a lot more functionality than the RSS show notes allow for. I personally think that paid podcasting will continue to grow and become a big thing, too. That's another thing you can take from almost all other mediums: There's a huge chunk of free text and a huge chunk of paid text, a huge chunk of free video and a huge chunk of paid video. We started investing in supporting paid audio some time ago, both through something called the Spotify Open Access Platform, and also, if you're an Anchor creator, you can add paid podcasts.
People already hacked RSS to do that with something called private RSS, where you were given private links that you promised not to share, because they were personal to you. And so you saw the creator need, and you saw the user need, and people tried to hack the standard to make it work. When it's all software, we can let you have your free episodes and then the paid episodes after that in the same feed; you don't need to subscribe to something new. We can let creators play around with business models: Do they want to charge a subscription? Do they want to charge per episode a la carte? Because it's just software, we can theoretically do anything.
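Mixing free and paid episodes in one feed comes down to an entitlement check at serving time. A toy sketch, assuming invented field names and an `entitlements` set standing in for whatever real payment check a platform would run:

```python
def visible_episodes(feed: list, entitlements: set) -> list:
    """Filter one mixed feed down to what a given listener can play.

    `entitlements` is a set of show IDs the listener has paid for.
    Free and paid episodes live in the same feed, so there is nothing
    new to subscribe to when a show adds a paid tier.
    """
    return [ep for ep in feed
            if not ep["paid"] or ep["show_id"] in entitlements]

feed = [
    {"title": "Episode 1 (free)", "show_id": "show-a", "paid": False},
    {"title": "Episode 2 (paid)", "show_id": "show-a", "paid": True},
]
print([ep["title"] for ep in visible_episodes(feed, {"show-a"})])  # paying listener sees both
print([ep["title"] for ep in visible_episodes(feed, set())])       # free listener sees one
```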
That's a pretty convincing case for why it makes sense to try and bring all of this stuff in house. But there's also real upside to this decentralized, standardized podcasting system. To your messaging example, it's objectively true that all of these systems are better than SMS, right? It also sucks that there's 50 systems. I have 50 inboxes. The way that we message is both way better, and fundamentally broken. And there's a world in which, if we let audio and podcasting fragment the same way, it could be the same sort of thing: There are 10 great apps, but I have to use 10 apps. How do you strike that balance? Is there a balance there?
I think there is, and you're completely right. I think text messaging is a great example: If you use iMessage, you get a richer experience, but they've struck the balance where I can still message with my Android friends.
In this case, I think Apple realized that the value of the service is relative to the number of the people you can communicate with, and so you need to keep that as big as possible. And that's exactly how we think about podcasts as well. The value for the creator is the amount of listeners you can get, right? So we try to maximize that. Then, I think we feel that it's fair for us to try to make our experience as good as possible. And if the creator wants to voluntarily add these features, because they think it brings value — whether they get more statistics or they could get user feedback — then I think that's fair and great. But what we don't want to do is to force them to choose between the two.
And this goes for everything we do, actually. I mean, we've been pretty public about the fact that we think that platforms should be more open. So we kind of have to live up to that as well.
We've been pretty public about the fact that we think that platforms should be more open. So we kind of have to live up to that as well.
On the discovery side, I think the listener experience of podcasts has never been as good as it should be. And it seems like a big part of improving that would be starting to understand the actual content of shows: the same way we're getting good at understanding, like, "This is what kind of guitar lick this is, and this is what this singer sounds like, and if I like these kinds of drums, I probably like those kinds of drums." Podcasting so far has not been nearly as sophisticated in helping people understand what I like and why, and what else I might like as a result. How are we doing on the road toward that? Is that even the road that we're on?
It's definitely the road that we're on. I would agree with you that we're not there yet, and it should be much, much better than it is. This is something that someone like me would always say, but I do think we're on the cusp of getting to a very different experience for podcasts as well. And I think there are a few ways to think about it. One is, if you look at music before streaming, the discovery problem was basically the same, right? Most people followed artists, and they consumed music in albums. And a lot of people still do.
But what happened was, once you have this flat-rate access to music, you could start putting together sessions of these objects along new dimensions. And that was a massive boon for discovery: People started doing all kinds of things for music. Sleep music playlists, stuff that you would never do at 99 cents per three minutes. So all of that innovation happened.
And I think you will see the same thing in podcasts. People are very much subscribing to hosts that they listen to, and personalities. And I think they will always continue doing that. And I think it's a good thing. But I think if you look at something like YouTube or Netflix, you see where I think the future should be: that you have these shows that you love and follow, but you also get recommendations for individual episodes, or even parts of an episode, from something similar.
As I said before, I think you can look at the other mediums as sort of a cheat sheet for where audio is going. So that requires exactly what you said, an understanding of the audio. And we do this, obviously, through a lot of machine learning. There are these large language models like transformers that are getting very good at understanding the contents of a podcast. They're also understanding the sentiments, they're understanding the hosts, they're even starting to be able to summarize podcasts. For podcasts that don't have show notes, you can generate them. There are other technologies that we're looking at, like graph neural networks, which look not at the podcast itself but at the audience type for a podcast, and infer similarity from that. So there are many technologies that are getting very powerful.
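The model details here are Spotify's own, but recommendation on top of learned "understanding" generally reduces to nearest-neighbor search over embedding vectors. A toy sketch with made-up three-dimensional vectors; a real system would get high-dimensional embeddings from models like the transformer- and graph-based ones mentioned above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy "topic" embeddings for episodes (values invented for illustration)
library = {
    "history-show-ep1": [0.9, 0.1, 0.0],
    "history-show-ep2": [0.8, 0.2, 0.1],
    "true-crime-ep7":   [0.0, 0.1, 0.9],
}

def recommend(listener_vec, library, k=2):
    """Return the k episodes whose embeddings best match a taste vector."""
    ranked = sorted(library,
                    key=lambda name: cosine(listener_vec, library[name]),
                    reverse=True)
    return ranked[:k]

print(recommend([0.85, 0.15, 0.05], library))  # the two history episodes
```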
I think the problem now is actually the discovery format itself. Because the big difference between podcasts and music is that you had these three-minute objects you could stack after each other into a good session. You don't really have those objects in podcasts, the objects are like an hour. That's not a great discovery format. So I think one of the tricks is to understand how you either summarize the podcast or pick up part of the podcast, and then how do you program it? What's the format of discovering this?
I think the understanding is there. It's actually the discovery consumption format that isn't there yet. So that's something that we're working really hard on. I can't share yet exactly how we're doing it. But I think that's the key now.
My great frustration with Spotify as an app is sometimes there's just too much going on all in one place. I've gone back and forth a million times about whether it feels like a good idea to have my audiobooks and my podcasts and my music all together. I don't imagine ever wanting to do all three of those things in sequence. Then you throw live in, which feels like yet another thing.
I buy the logic of them being one app, just because people don't download apps. And it's hard to get people to download new apps and try new things. But they could just be a bunch of totally separate experiences inside the Spotify app, Snapchat-style, where you have different panes for different things. But you seem to be trying to figure out a way for them all to live together. Why?
It's a great question. And there are two answers. One is, as you mentioned, it is hard to get people to install applications, so if you want the innovations that we build to reach lots of people and to be successful, integrating them is a benefit. But that's kind of a Spotify distribution benefit.
The question is, what's the benefit for the consumer? And what we saw when people started hacking the system and uploading podcasts, and also uploading audiobooks as music, was that we saw people voluntarily using them. That's when we said, maybe there is convenience here. So we took on this sort of contrarian view, at least at the time, which is: What if the user interface could adapt to the content, instead of the user having to switch apps for the content? It's certainly much harder as a design challenge. The risk is obvious: It's just confusing. It's not the best of two worlds, it's the worst of two worlds.
So we took on the harder challenge, because it was exciting for us, our user data told us it was interesting and it was also the strategy that made sense for us to do. And if we couldn't pull it off, obviously, people wouldn't want to use it. So we would fail. And with the data we have, statistically, it looks like it's working, because we're growing fast. However, we're far from the perfect place where we want to be. I think we've come pretty far in being able to adapt between music and podcasts, so that it feels like you're listening to podcasts, you get the right controls and so forth. And all these weird cases: like you have a queue of music, but then you queue a podcast, what happens? All those take an insane amount of time.
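The "right controls" idea amounts to a simple dispatch on content type. A sketch for illustration only; the control names and the `controls_for` function are invented, not Spotify's API:

```python
def controls_for(content_type: str) -> list:
    """Adapt playback controls to the content instead of the app.

    Music wants track skipping and shuffle; spoken audio wants
    fine-grained seeking and playback speed.
    """
    if content_type == "music":
        return ["prev_track", "play_pause", "next_track", "shuffle"]
    # podcasts, audiobooks and other spoken audio
    return ["seek_back_15s", "play_pause", "seek_forward_15s", "speed"]

print(controls_for("music"))
print(controls_for("podcast"))
```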
And actually, you see other companies starting to follow. On YouTube, when you listen to certain types of content, you get scrub controls instead of controls that skip the full episode. So I think others are following us in having the app adapt to the content instead of the user having to switch apps for different types of content. So I think we're on the right track. If you're Apple, maybe you think differently about it, because you can pre-install your app. So you don't have a distribution challenge, right? But we don't have that luxury.
The other thing I would say is that if you can actually integrate them, there is a lot of content that has music and talk. And obviously, the big problem is that most podcasters can't legally put music in their podcast, because they would need to build a music service and license it all. So there's an obvious but in reality really complicated thing that we built with this music-and-talk format, which I think was a clear innovation on podcasts. So there you have something that couldn't be done if they were separate apps. And you could have some argument around video as well: Should video be a separate app? But it seems like most podcasters want to be able to go between the video and the audio. So it doesn't make sense to have that as a separate app either.
So it's a combination of strategies. But I think a strategy that goes against what users want is just going to be a failed strategy.