
The internet is splitting apart. The Internet Archive wants to save it all forever.

The Internet Archive has grand ambitions for preserving the internet. But in order to do that, Big Tech has to stay out of the way.


Brewster Kahle, the founder of the Internet Archive, worries about how the splintering internet could end a golden age for the Internet Archive.

Photo: Internet Archive

The internet's first librarian likes to reminisce. The early internet is like a fantasy for the founder of the Internet Archive, a place he returns to over and over again in conversation when questions about the present turn dark or depressing. Brewster Kahle might know more about the early years of the web than anyone else.

He has occasion to talk about the Archive's beginnings perhaps more than he should these days. Discussing its future can at times be grim, or, at the very least, uncertain. The glories of the Wayback Machine, the petabytes of data capturing every day of human existence online in warehouses scattered across the world, the smooth system of crawlers marching from my Twitter to the homepage for the Russian government to Clubhouse in China — in the grand scheme of history, all of this could be an ephemeral golden age.

The so-called balkanization of the internet isn't just a theoretical problem for the Internet Archive. If internet firewalls stay up in China, Iran and Russia, new content continues to move mostly behind paywalls and passwords, and U.S. political leaders decide it's finally time for Section 230 to go, the crawlers whose simple formulas have preserved the last few decades for future historians might not do the same for more than the next few decades.

"There are more and more walled gardens where you can't go. We just have crawlers going at a crazy scale, and they can get blocked just like anybody can get blocked," said Jefferson Bailey, the Archive's director of web archiving and data services.

Even so, until someone or something fundamentally changes the rules of the web, the Internet Archive will keep doing what it's been doing since 1996: preserving every fragment of text you or I are ever likely to read. Tech's walled gardens might make it harder to get a perfect picture, but the small team of librarians, digital archivists and software engineers at the Internet Archive plans to keep bringing the world the Wayback Machine, the Open Library, the Software Archive, etc., until the end of time. Literally.

The balkanization of the internet

When Kahle was a student at MIT in the early '80s, he used a professor's ID to break into the Harvard Law library to access cases for a project. If there was a moment in his lifetime that encapsulated the closed nature of access to information before the internet, it was that.

But today, anyone can find the information he needed back then without so much as a library card. "Usually, things are very closed and locked down. Historically, this is a very rare moment," he said.

That could soon change, however. "Are we at risk of locking down? Yes, absolutely," he said. The Internet Archive is currently blocked in China, and occasionally in Russia, India and Turkey as well, and that's just at the whim of national governments that have the tools to enforce a block. According to Kahle and Bailey, corporations are just as capable of fracturing the web in ways that make it harder to access and archive; even "user lock-in" to a specific browser and its products could one day create internet bubbles, and then walls, based on the products people pay for.

"The Facebooks and the Googles are taking over, and they want to make money," Bailey said. The more people act on the internet behind a password and the more the web becomes corporate, the more the open internet ethos fades away from the public consciousness, easing the way toward that splintering that Kahle fears.

"That's a strategic concern for everyone. Of course, it impacts archiving, too," Bailey said. The archive does its best to capture Twitter, Tumblr, Instagram, YouTube, Vimeo, Facebook and others. Facebook is the hardest, because the company is archiving-unfriendly in general, according to Bailey. But in reality, if any of these social companies decided they wanted to stop the Internet Archive from doing its job, they probably could, he said.

"We're embedded in the community," Bailey said. "At the end of the day, we're just a library."

Kahle fears that the eventual "walling" of the internet could develop in an incongruous place: from tech companies eager for regulation that would cement their own status by stifling future innovation. For example, almost any proposed change to Section 230, which protects website owners from legal liability for content created and posted by their users, would destroy the delicate legal framework that protects the Internet Archive's work (as well as Wikipedia and user-contributed projects), according to Kahle. Facebook's Mark Zuckerberg is among the many tech leaders to express support for a rewrite.

And tech companies, book publishers and even the music industry have lobbied to limit, change or even remove general copyright fair use exceptions, as well as specific copyright and use exemptions for libraries. Changes to these laws could (accidentally or intentionally, depending on who you ask) make it much harder for people to share their creative work online, and for groups like the Internet Archive to save it.

"Why are they doing this? Some people say it's money. But when you have oligarchies, it's really about protecting against new entrants in the market," Kahle said. At the end of the day, large companies have adapted to the current legal regimes, and they have the money and technical know-how to be able to advocate for stricter regulations that would allow them to preserve their monopolies while changing or limiting fair-use protections.

How the Internet Archive decides what to archive

Until the day these more existential problems firm into something Kahle can fight with more than words, the Internet Archive's day-to-day struggle is preserving the constantly changing web. Web pages have an average lifespan of about 90 days before they change or disappear, so the Archive needs to capture each page at least once every 90 days to preserve a full picture of the web over time.
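The 90-day rule of thumb can be pictured as a simple scheduling check. The following is a minimal sketch, not the Archive's actual scheduler: the function name, the capture log and the URLs are hypothetical, and the only figure taken from the reporting is the 90-day interval.

```python
from datetime import date, timedelta

# Assumed recapture interval, based on the 90-day average page lifespan.
RECAPTURE_INTERVAL = timedelta(days=90)


def pages_due_for_capture(last_captured, today):
    """Return URLs whose latest capture is at least 90 days old.

    `last_captured` maps a URL to the date of its most recent snapshot,
    or to None if the page has never been captured.
    """
    due = []
    for url, captured_on in last_captured.items():
        if captured_on is None or today - captured_on >= RECAPTURE_INTERVAL:
            due.append(url)
    # Never-captured pages sort first, followed by the stalest captures.
    return sorted(due, key=lambda u: last_captured[u] or date.min)


# Hypothetical capture log, for illustration only.
capture_log = {
    "https://example.org/": date(2021, 1, 5),
    "https://example.org/news": date(2021, 3, 20),
    "https://example.org/new-page": None,
}
due_pages = pages_due_for_capture(capture_log, today=date(2021, 4, 10))
```

In this toy log, the never-captured page and the page last saved in early January come back as due, while the page saved three weeks earlier does not.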

The archivists employ three main strategies to capture most of what might be important for future historians. Bailey wouldn't guess exactly what percentage of the web they manage to preserve — "I'd look like an idiot," he said — because no one really can guess the size or scale of the internet. (Don't get there in your head, if you can avoid it. How would you even measure: by data size? Number of objects? Number of distinct URLs?) "There's no use being anxious over what's outside your control," he said.

The archivists start by considering the entirety of the web and seeking out the most important fraction. They capture a shallow outline of the entire internet (every single URL and associated homepage that's accessible), and then they dive deep into as many pages as possible for the top 5 million or so most-visited websites. This creates a fairly flat, bird's-eye view of the internet.

To get a more three-dimensional picture, they seek other signals of importance, ranging from news aggregators to the entirety of a national domain (like Cuba, France, Somalia, etc.) when there is an important event, and even every single YouTube URL ever shared on Twitter (they can't capture all of YouTube, but at least they can capture the videos people deem important enough to share elsewhere).

And finally, other institutions can use the Internet Archive to build their own archiving services, usually creating specialized collections around topics like human rights or bioengineering. All of those collections are then copied back into the Wayback Machine, which is the publicly accessible version of the web archive.
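The first two strategies described above can be sketched as a crawl plan. This is an illustration under stated assumptions rather than a description of the Archive's crawler: the function, its parameters and the depth values are invented for the example, and the third strategy (partner-built collections) is not modeled.

```python
def plan_crawl_depth(all_sites, traffic_rank, signal_urls,
                     top_n=5_000_000, shallow_depth=0, deep_depth=5):
    """Assign a crawl depth to each known site and collect extra captures.

    - Every site gets a shallow capture (homepage only, depth 0).
    - Sites ranked within the `top_n` most visited get a deeper crawl.
    - URLs surfaced by other signals (news aggregators, links shared on
      social media) are queued as individual captures.

    `deep_depth=5` is an arbitrary illustrative value.
    """
    plan = {site: shallow_depth for site in all_sites}
    for site in all_sites:
        rank = traffic_rank.get(site)
        if rank is not None and rank <= top_n:
            plan[site] = deep_depth
    # Deduplicate signal-driven URLs so each is captured once.
    extra_captures = sorted(set(signal_urls))
    return plan, extra_captures


# Hypothetical inputs, with top_n shrunk to 2 so the effect is visible.
sites = ["a.example", "b.example", "c.example"]
ranks = {"a.example": 1, "b.example": 3}
shared = ["https://video.example/watch?v=1", "https://video.example/watch?v=1"]
plan, extras = plan_crawl_depth(sites, ranks, shared, top_n=2)
```

Only the top-ranked site receives the deep crawl here; the other two keep the shallow, homepage-only capture, and the duplicated shared link is queued once.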

Abbie Grotke, the web archiving team lead at the Library of Congress, has been involved in this work in one way or another for over 20 years. The Library of Congress's own archive is one of the special collections built in collaboration with Bailey, and it contains about 2.4 petabytes and over 18 billion objects, ranging from U.S. government websites to the most culturally important memes. Grotke has given her life to preserving the internet for the Library of Congress.

The work itself is technically an enormous task, but it boils down to one simple goal. "We're just trying to capture changes over time," she said.


Brewster Kahle is the internet's first librarian. Photo: Internet Archive


The Library of Congress began capturing websites in 2000, focusing mostly on political collections and on at-risk websites that might be taken down before they can be captured. "We're always sort of worried about, are we collecting everything we need to be collecting? Is there something we're missing?" said Amber Paranick, one of the Library of Congress's reference librarians. But this problem isn't that different because it's digital: "That's always the dilemma of the librarian."

The web archive alone is about 45 petabytes (45,000 terabytes), and the Internet Archive itself is about double that size (the group has other collections, like a huge database of educational films, music and even long-gone software programs).

It's impossible to conceptualize actually usable, accessible data at that scale, let alone make it text-searchable. So while the Archive has some projects to use machine learning to identify some images, like pictures of horses, Bailey likes to think about the odd, unimaginable applications that have emerged and how they foretell grander uses in the future.

The Wayback Machine has evolved to play an important role in patent litigation, for example. People fighting over patent ownership look for what's called "prior art," which indicates who might have first thought of a product. In one case, when two people were disputing who first created a specific design for hubcap rims, one was able to prove their ownership by finding an old website that had been archived in the Wayback Machine.
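Finding the archived copy of a page closest to a given date, as a prior-art search would, can be sketched against the Wayback Machine's public availability endpoint. This is a minimal sketch: the helper names are hypothetical, the sample response is made up for illustration, and the endpoint and field names reflect the public API as commonly documented, so check them against current documentation before relying on them. The code only builds the query and parses a response; it makes no network call.

```python
import json
from urllib.parse import urlencode

# Public availability endpoint (assumed stable; verify before use).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(page_url, timestamp=None):
    """Build a query URL asking for the snapshot closest to `timestamp`.

    `timestamp` is a string of digits such as "20050101" (YYYYMMDD).
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


query = build_availability_query("example.com", "20050101")

# Made-up response body in the documented shape, for illustration only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20050103000000",
            "url": "http://web.archive.org/web/20050103000000/http://example.com/",
        }
    }
})
snapshot = closest_snapshot(sample_response)
```

The timestamp on the returned snapshot is what matters in a dispute over who published something first: it records when the crawler saved that version of the page.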

And there are other use cases, too: The people building open-source translation tools at Mozilla have found the Internet Archive's collection of websites in multiple languages useful for training those tools. There is very little printed or digitized material that has large amounts of the same text in two languages, but many official websites do, which can help build quality translation tools for "minor languages," such as English-Swahili translation, according to Bailey.

The future of our histories

When I asked Kahle how he thinks about preserving today for historians centuries away, he grew philosophical. He sent links in the Zoom chat, first to the Google doc for a book he wrote, then a Nation piece, then a long blog post he wrote in 2015. By the time we hung up the call, I had piles of reading material, most of it dense, most of it dated.

There's value to all of this history, he told me. "What we're able to do now is know about your individual history. We're able to get to the specificity of the historical record. Which I think is going to really be engaging in 100 years' time. What would you give for a video of your great-grandmother? It would just give you this ballast, it would give you an anchoring, that we right now lack," he said. "We're living in the perpetual present, and that is dangerous." Kahle believes our history makes us better people, and gives us better knowledge. But history isn't financially lucrative.

Social media companies want us to focus on tomorrow, not on the posts we made a year ago. Publishers do, too. HarperCollins is suing the archive to try to prevent it from sharing out-of-print books in its digital library, arguing that publicly sharing out-of-print books is a massive violation of copyright laws. While at first it might seem odd that publishers would care about books that aren't in print anymore, for companies whose business depends on people buying new things, archiving so that people can focus on the past is not in their financial interest.

"They are erasing the past through every legal and political means they can," Kahle said.

If the balkanization of the internet can be prevented, the Internet Archive could transform the way we learn about larger historical moments, Kahle said. History books and historians are limited to a few textual works, mostly by the powerful people of the time. With the Internet Archive, the everyday history will become suddenly accessible to those studying our time. Imagine if each of us could look back on our great-grandparents and know what they said or thought at age 15, and then 25, and 50. The Archive would allow that.

The Archive could also force historians to become professional data miners. "There will be a lot of these comparison studies at a much larger scale in the future — every tweet from every president in 30 years. Longitudinal analysis could be done with petabytes of data," Bailey said. The research questions themselves may not change much; they will just stretch over bigger timelines and larger comparisons.

"We're in the process of building macroscopes," Kahle said.

Caught in a golden age

More than 1 million people use the Internet Archive every day. Most of them seek out the Wayback Machine, but people also read the digitized books in the Archive's Open Library, or watch movies from its huge collection of public domain films.

"We love the dreamers, the people who come to this new medium with their ideas. The dreams are important to archive, whatever happens," Kahle said. Despite the existential threats to his work and to the values of the open internet, Kahle wants to be hopeful.

"Those who want to monopolize the internet are very well-funded. We need to communicate and deliver the value of openness. Am I optimistic we can do that? I'd say yes. But it's based on an enormous number of people wanting it to happen," he said.

"Some believe that people will only do things if you pay them, others that people are just sheep," Kahle said. "None of that is true. They may not be interested in the same things, but when we look at what people produce on the internet, if it's about the things they care about … They'll prove you wrong in a nanosecond."
