The internet is splitting apart. The Internet Archive wants to save it all forever.

The Internet Archive has grand ambitions for preserving the internet. But in order to do that, Big Tech has to stay out of the way.

The internet's first librarian likes to reminisce. The early internet is like a fantasy for the founder of the Internet Archive, a place he returns to over and over again in conversation when questions about the present turn dark or depressing. Brewster Kahle might know more about the early years of the web than anyone else.

He has occasion to talk about the Archive's beginnings perhaps more than he should these days. Discussing its future can at times be grim, or, at the very least, uncertain. The glories of the Wayback Machine, the petabytes of data capturing every day of human existence online in warehouses scattered across the world, the smooth system of crawlers marching from my Twitter to the homepage for the Russian government to Clubhouse in China — in the grand scheme of history, all of this could be an ephemeral golden age.

The so-called balkanization of the internet isn't just a theoretical problem for the Internet Archive. If internet firewalls stay up in China, Iran and Russia, new content continues to move mostly behind paywalls and passwords, and U.S. political leaders decide it's finally time for Section 230 to go, the crawlers whose simple formulas have preserved the last few decades for future historians might not do the same for more than the next few decades.

"There are more and more walled gardens where you can't go. We just have crawlers going at a crazy scale, and they can get blocked just like anybody can get blocked," said Jefferson Bailey, the Archive's director of web archiving and data services.

But even still, until someone or something fundamentally changes the rules of the web, the Internet Archive will keep doing what it's been doing since 1996: preserving every fragment of text you or I are ever likely to read. Tech's walled gardens might make it harder to get a perfect picture, but the small team of librarians, digital archivists and software engineers at the Internet Archive plan to keep bringing the world the Wayback Machine, the Open Library, the Software Archive, etc., until the end of time. Literally.

The balkanization of the internet

When Kahle was a student at MIT in the early '80s, he used a professor's ID to break into the Harvard Law library to access cases for a project. If there was a moment in his lifetime that encapsulated the closed nature of access to information before the internet, it was that.

But today, anyone can find the information he needed back then without so much as a library card. "Usually, things are very closed and locked down. Historically, this is a very rare moment," he said.

That could soon change, however. "Are we at risk of locking down? Yes, absolutely," he said. The Internet Archive is currently blocked in China, and occasionally as well in Russia, India and Turkey, and that's just at the whim of nation-state governments that have the tools to make that work. According to Kahle and Bailey, corporations are just as capable of fracturing the web in ways that make it harder to access and archive; even "user lock-in" to a specific browser and products could one day create internet bubbles, and then walls, based on the products people pay for.

"The Facebooks and the Googles are taking over, and they want to make money," Bailey said. The more people act on the internet behind a password and the more the web becomes corporate, the more the open internet ethos fades away from the public consciousness, easing the way toward that splintering that Kahle fears.

"That's a strategic concern for everyone. Of course, it impacts archiving, too," Bailey said. The archive does its best to capture Twitter, Tumblr, Instagram, YouTube, Vimeo, Facebook and others. Facebook is the hardest, because the company is archiving-unfriendly in general, according to Bailey. But in reality, if any of these social companies decided they wanted to stop the Internet Archive from doing its job, they probably could, he said.

"We're embedded in the community," Bailey said. "At the end of the day, we're just a library."

Kahle fears that the eventual "walling" of the internet could develop in an incongruous place: from tech companies eager for regulation that would cement their own status by stifling future innovation. For example, almost any proposed change to Section 230 — which protects website owners from legal liability for content created and posted by its users — would destroy the delicate legal framework that protects the Internet Archive's work (as well as Wikipedia and user-contributed projects), according to Kahle. Facebook's Mark Zuckerberg is among the many tech leaders to express support for a rewrite.

And tech companies, book publishers and even the music industry have lobbied to limit, change or even remove general copyright fair use exceptions, as well as specific copyright and use exemptions for libraries. Changes to these laws could (accidentally or intentionally, depending on who you ask) make it much harder for people to share their creative work online, and for groups like the Internet Archive to save them.

"Why are they doing this? Some people say it's money. But when you have oligarchies, it's really about protecting against new entrants in the market," Kahle said. At the end of the day, large companies have adapted to the current legal regimes, and they have the money and technical know-how to be able to advocate for stricter regulations that would allow them to preserve their monopolies while changing or limiting fair-use protections.

How the Internet Archive decides what to archive

Until the day these more existential problems firm into something Kahle can fight with more than words, the Internet Archive's day-to-day struggle is preserving the constantly changing web. Web pages have an average lifespan of about 90 days before they change or disappear, so the Archive needs to capture each page at least once every 90 days to preserve a full picture of the web over time.
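To make the 90-day rule concrete, here is a minimal sketch of how a recapture check could work. It is an illustration only: the function names and the sample URLs are hypothetical, and the article does not describe how the Archive actually schedules its crawls.

```python
from datetime import date, timedelta

# Assumption: 90 days is the average page lifespan quoted in the article.
# The Archive's real scheduling policy is not described there.
RECAPTURE_INTERVAL = timedelta(days=90)


def is_due(last_captured: date, today: date,
           interval: timedelta = RECAPTURE_INTERVAL) -> bool:
    """Return True when a page has gone a full interval without a capture."""
    return today - last_captured >= interval


def due_pages(last_captures: dict[str, date], today: date) -> list[str]:
    """List the URLs that are due for recapture, most overdue first."""
    due = [url for url, seen in last_captures.items() if is_due(seen, today)]
    return sorted(due, key=lambda url: last_captures[url])


if __name__ == "__main__":
    # Hypothetical capture log: URL -> date of the most recent snapshot.
    captures = {
        "https://example.org/": date(2021, 1, 1),
        "https://example.com/news": date(2021, 3, 20),
    }
    print(due_pages(captures, today=date(2021, 4, 15)))
```

With the sample log above, only the page last captured on Jan. 1 is flagged, because the other was captured less than 90 days before the check.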

The archivists employ three main strategies to capture most of what might be important for future historians. Bailey wouldn't guess exactly what percentage of the web they manage to preserve — "I'd look like an idiot," he said — because no one really can guess the size or scale of the internet. (Don't get there in your head, if you can avoid it. How would you even measure: by data size? Number of objects? Number of distinct URLs?) "There's no use being anxious over what's outside your control," he said.

The archivists start by considering the entirety of the web and seeking out the most important fraction. They capture a shallow outline of the entire internet (every single URL and associated homepage that's accessible), and then they dive deep into as many pages as possible for the top 5 million or so most-visited websites. This creates a fairly flat, bird's-eye view of the internet.

To get a more three-dimensional picture, they seek other signals of importance, ranging from news aggregators to the entirety of a national domain (like Cuba, France, Somalia, etc.) when there is an important event, and even every single YouTube URL ever shared on Twitter (they can't capture all of YouTube, but at least they can capture the videos people deem important enough to share elsewhere).

And finally, other institutions can use the Internet Archive to build their own archiving services, usually creating specialized collections around topics like human rights or bioengineering. All of those collections are then copied back into the Wayback Machine, which is the publicly accessible version of the web archive.
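The first two strategies amount to a rule for how deep a crawler should go on each site. The sketch below is one way to express that rule, not the Archive's actual software: the `Site` type, the `flagged` field and the depth values are hypothetical, and only the "top 5 million or so" cutoff comes from the description above.

```python
from dataclasses import dataclass

# The cutoff reflects the "top 5 million or so" figure in the article.
# The depth values are illustrative assumptions.
DEEP_CRAWL_CUTOFF = 5_000_000
SHALLOW_DEPTH = 0  # capture the homepage only
DEEP_DEPTH = 5     # follow links several hops into the site


@dataclass(frozen=True)
class Site:
    domain: str
    popularity_rank: int   # 1 = most visited
    flagged: bool = False  # another signal of importance, e.g. shared widely elsewhere


def crawl_depth(site: Site, cutoff: int = DEEP_CRAWL_CUTOFF) -> int:
    """Pick a crawl depth: deep for popular or flagged sites, shallow otherwise."""
    if site.popularity_rank <= cutoff or site.flagged:
        return DEEP_DEPTH
    return SHALLOW_DEPTH


if __name__ == "__main__":
    for site in (
        Site("popular.example", 120),
        Site("obscure.example", 40_000_000),
        Site("newsworthy.example", 40_000_000, flagged=True),
    ):
        print(site.domain, crawl_depth(site))
```

Every site gets at least the shallow capture, which matches the bird's-eye outline, while popularity and the extra signals decide where the crawler spends more effort.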

Abbie Grotke, the web archiving team lead at the Library of Congress, has been involved in this work in one way or another for over 20 years. The Library of Congress's own archive is one of the special collections built in collaboration with Bailey, and it contains about 2.4 petabytes and over 18 billion objects, ranging from U.S. government websites to the most culturally important memes. Grotke has given her life to preserving the internet for the Library of Congress.

The work itself is technically an enormous task, but it boils down to one simple goal. "We're just trying to capture changes over time," she said.


The Library of Congress began capturing websites in 2000, focusing mostly on political collections and on at-risk websites that might be taken down before they can be captured. "We're always sort of worried about, are we collecting everything we need to be collecting? Is there something we're missing?" said Amber Paranick, one of the Library of Congress's reference librarians. But this problem isn't that different because it's digital: "That's always the dilemma of the librarian."

The web archive alone is about 45 petabytes (45,000 terabytes), and the Internet Archive as a whole is about double that size; the group has other collections, like a huge database of educational films, music and even long-gone software programs.

It's impossible to conceptualize actually usable, accessible data at that scale, let alone make it text-searchable. So while the Archive has some projects to use machine learning to identify some images, like pictures of horses, Bailey likes to think about the odd, unimaginable applications that have emerged and how they foretell grander uses in the future.

The Wayback Machine has evolved to play an important role in patent litigation, for example. People fighting over patent ownership look for what's called "prior art," which indicates who might have first thought of a product. In one case, when two people were disputing who first created a specific design for hubcap rims, one was able to prove their ownership by finding an old website that had been archived in the Wayback Machine.
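A lookup like that can also be done programmatically. The sketch below uses the Wayback Machine's public availability endpoint, which returns the archived snapshot closest to a requested date. The helper names are hypothetical, and the response handling assumes the JSON shape described in the Archive's public documentation (an `archived_snapshots` object holding a `closest` entry).

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str) -> str:
    """Build a query for the snapshot closest to `timestamp` (format YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


def closest_snapshot(response: dict) -> Optional[tuple[str, str]]:
    """Return (snapshot_url, capture_timestamp), or None if nothing is archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(page_url: str, timestamp: str) -> Optional[tuple[str, str]]:
    """Query the live service. Requires network access."""
    with urlopen(availability_url(page_url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20050101")` would ask for the capture nearest to Jan. 1, 2005. The capture timestamp only shows when the Archive saved the page; whether that settles a dispute is a legal question.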

There are other use cases, too: The people building open-source translation tools at Mozilla have found the Internet Archive's collection of websites in multiple languages useful for training those tools. There is very little printed or digitized material that has large amounts of the same text in two languages, but many official websites do, which can help build quality translation tools for "minor languages," such as English-Swahili translation, according to Bailey.

The future of our histories

When I asked Kahle how he thinks about preserving today for historians centuries away, he grew philosophical. He sent links in the Zoom chat, first to the Google Doc for a book he wrote, then a Nation piece, then a long blog post he wrote in 2015. By the time we hung up the call, I had piles of reading material, most of it dense, most of it dated.

There's value to all of this history, he told me. "What we're able to do now is know about your individual history. We're able to get to the specificity of the historical record. Which I think is going to really be engaging in 100 years' time. What would you give for a video of your great-grandmother? It would just give you this ballast, it would give you an anchoring, that we right now lack," he said. "We're living in the perpetual present, and that is dangerous." Kahle believes our history makes us better people, and gives us better knowledge. But history isn't financially lucrative.

Social media companies want us to focus on tomorrow, not on the posts we made a year ago. Publishers do, too. HarperCollins is suing the archive to try to prevent it from sharing out-of-print books in its digital library, arguing that publicly sharing out-of-print books is a massive violation of copyright laws. While at first it might seem odd that publishers would care about books that aren't in print anymore, for companies whose business depends on people buying new things, archiving so that people can focus on the past is not in their financial interest.

"They are erasing the past through every legal and political means they can," Kahle said.

If the balkanization of the internet can be prevented, the Internet Archive could transform the way we learn about larger historical moments, Kahle said. History books and historians are limited to a few textual works, mostly by the powerful people of the time. With the Internet Archive, the everyday history will become suddenly accessible to those studying our time. Imagine if each of us could look back on our great-grandparents and know what they said or thought at age 15, and then 25, and 50. The Archive would allow that.

The Archive could also force historians to become professional data miners. "There will be a lot of these comparison studies at a much larger scale in the future — every tweet from every president in 30 years. Longitudinal analysis could be done with petabytes of data," Bailey said. The research questions themselves may not change much; they will just stretch over bigger timelines and larger comparisons.

"We're in the process of building macroscopes," Kahle said.

Caught in a golden age

More than 1 million people use the Internet Archive every day. Most of them seek out the Wayback Machine, but people also read the digitized books in the Archive's Open Library, or watch movies from the huge archive of public domain films.

"We love the dreamers, the people who come to this new medium with their ideas. The dreams are important to archive, whatever happens," Kahle said. Despite the existential threats to his work and to the values of the open internet, Kahle wants to be hopeful.

"Those who want to monopolize the internet are very well-funded. We need to communicate and deliver the value of openness. Am I optimistic we can do that? I'd say yes. But it's based on an enormous number of people wanting it to happen," he said.

"Some believe that people will only do things if you pay them, others that people are just sheep," Kahle said. "None of that is true. They may not be interested in the same things, but when we look at what people produce on the internet, if it's about the things they care about … They'll prove you wrong in a nanosecond."
