By Tom Krazit

Protocol | Enterprise

Fault-finding, not firefighting: Why observability is the new monitoring

Understanding software performance is an extremely important — and complex — undertaking for the modern enterprise. Simply watching the meter no longer works.

There's a lot to keep track of in modern software.

Image: Alexander Sinn/Kwamina2

No unhappy complex system is alike: Each is unhappy in its own way. A growing line of business in software development, observability seeks to understand how and why modern software applications and teams become unhappy in order to set them on a path toward happiness, uptime and profit.

An evolution of monitoring software — which became popular during the rise of Web 2.0 applications and spawned companies such as Splunk, Datadog, New Relic and SolarWinds — observability takes the idea of simply watching IT systems a step further. While it's helpful to have dashboards that let administrators determine the health and performance of their applications at a glance, observability advocates believe what modern businesses really need are tools that help them understand the root cause of software issues.

"To help me build better software, you can't just do everything reactively anymore," Bill Staples, president of New Relic, told Protocol. "If you work reactively in today's cloud environment, you're firefighting constantly."

The idea is that it's better for software developers to understand exactly which part of their code is causing a problem, and why, than to rely on alerts that flag problems but require painstaking analysis to identify the cause. The time saved matters: Forward-thinking software organizations move very quickly these days, preferring to deploy small changes to their code frequently rather than big changes at a slower pace.

"If it takes you two months to ship your code, you're probably not high enough to ride this ride," said Charity Majors, co-founder and CTO of Honeycomb, one of several startups pushing the boundaries of this emerging field.

But it's not just coding: Observability tools can also help companies understand how their people are performing, and how the structure of their organization might be causing more problems than it is solving.

"The tech metrics don't mean anything without understanding the pressures that people are under when they are building the systems," said Nora Jones, co-founder and CEO of startup Jeli.

Watching, waiting, commiserating

Administrators have been monitoring the performance of computers since the first was plugged into a wall. But the modern concept of application performance management started to come together alongside the wave of enterprise software innovation that came out of the Great Recession.

As new SaaS tools started to become some of the most important operational tools inside businesses, performance — always important — took on new meaning. Customers increasingly had expectations for how software delivered over the web should perform, and as lots of these application vendors built their services atop metered cloud computing platforms from AWS and others, they had to be very aware of how much computing resource they were consuming on the back end.

The growing consensus around the value of frequent deployment meant that software developers needed tools to quickly measure the impact of those changes so they could pull back a change that introduces a new problem, said John Allspaw, co-founder of Adaptive Capacity Labs, who played key roles at Yahoo's Flickr and later Etsy during the period in which monitoring became table stakes.
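The measure-then-pull-back loop described here can be sketched in a few lines. This is a minimal illustration, not any vendor's tooling: the function name `should_roll_back`, the `(errors, requests)` window shape and the 2-percentage-point tolerance are all assumptions made for the example.

```python
def error_rate(window):
    """Return the error rate for an (errors, requests) window."""
    errors, requests = window
    # An empty window carries no evidence of failure, so treat it as 0.0.
    return errors / requests if requests else 0.0


def should_roll_back(before, after, max_increase=0.02):
    """Decide whether a deploy made things worse.

    `before` and `after` are hypothetical (errors, requests) tuples
    measured in equal-length windows on either side of the deploy.
    `max_increase` is an assumed tolerance, not an industry standard.
    """
    return error_rate(after) - error_rate(before) > max_increase


# A deploy that pushes errors from 0.5% to 5% of requests is flagged.
print(should_roll_back((5, 1000), (50, 1000)))
# A deploy that barely moves the error rate is not.
print(should_roll_back((5, 1000), (6, 1000)))
```

Small, frequent deploys make a check like this more useful, because each window contains only one change to blame.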

"There was a period of time where some companies totally got [the idea of frequent deployment], and can't imagine working a different way and other companies can't even imagine why you would even try to deploy more than once a day," he said. That latter group gets the benefits of continuous deployment now, he said, which has lifted the fortunes of companies like JFrog, CircleCI and CloudBees, which have all built businesses around making software pipelines more efficient..

But the early monitoring tools used to study software once it was deployed were passive, and didn't provide a complete picture of how an application was performing.

"Maybe 10 years ago, the way things would work is developers would write the application and then hand it off to an IT pro, who would probably deploy it onto a server in a data center," Staples said. "Meanwhile, they hope that the system keeps up and running and the IT pro will tell them if something breaks — otherwise, they just go on to the next feature."

A decade later, that approach will not fly. Companies no longer separate software development from operations: a shift known as DevOps, which calls for closer cooperation between the teams and forces developers to be more aware of the impact of their changes.

One of the big risks of making changes to a monolithic application was the chance that you could cause a difficult-to-detect problem in a completely unrelated part of the app. Microservices changed that, allowing developers to break their applications down into lots of smaller pieces that can be operated and tweaked separately. At the same time, cloud adopters started moving toward deploying their applications in containers, which meant they could be deployed across a wide range of servers.

Something more sophisticated was required to understand how all of that was working.

Disturbance in the system

"Observability is all about looking at [the application]; shipping the code and looking through instrumentation to know if it is doing what I expected it to do," Majors said.

Honeycomb's approach was modeled after the notion of control theory in mechanical engineering, she said. Its tools give operations engineers a way to build instrumentation into their code to flag problems as they happen, allowing them to discover exactly where something has gone wrong rather than seeing poor end-user performance and digging through the code to find the problem.
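To make "building instrumentation into the code" concrete, here is a minimal sketch of the general idea. It is not Honeycomb's API: the `instrumented` helper, the in-memory `EVENTS` list and the `charge_card` function are all hypothetical names invented for the example. Each unit of work emits one structured event that records what ran, how long it took and whether it failed.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would ship events to a backend.
EVENTS = []


@contextmanager
def instrumented(name, **fields):
    """Record one structured event per unit of work, including failures."""
    event = {"name": name, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more fields while the work runs
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = type(exc).__name__
        raise  # instrumentation observes the failure, it does not hide it
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000.0
        EVENTS.append(event)


def charge_card(user_id, amount):
    """Illustrative business function wrapped in instrumentation."""
    with instrumented("charge_card", user_id=user_id) as event:
        event["amount"] = amount
        if amount <= 0:
            raise ValueError("amount must be positive")
        return "charged"
```

Because the event carries the function name and the fields attached at the call site, a failure points at the code path that produced it instead of showing up only as a slow page for the end user.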

Usually software organizations aren't dealing with issues like the massive outage that took down Slack last week: Most software incidents are minor, Staples said.

"When things fail, they don't fail completely. You see drop-off rates, you see errors fail for 10 percent of users," he said. "What you do as an engineer is you're constantly using those signals: to know where to go invest more, whether it's improving the feature to get the customer through a trouble spot that's slowing them down and keeping them on, whether it's scaling a part of the system to increase the performance of that component or other things."

Fixing those problems quickly, rather than spending time debugging a poorly performing application, can give software teams more time to focus on improving their products.
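The partial failures Staples describes, where errors hit a fraction of users rather than everyone, show up as per-component error rates and not as a single up-or-down signal. The sketch below assumes a hypothetical event shape (dictionaries with `name` and `status` keys) and an arbitrary 5% threshold; neither comes from any specific product.

```python
from collections import defaultdict


def error_rates(events):
    """Return the fraction of failed events for each component name."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event["name"]] += 1
        if event["status"] == "error":
            errors[event["name"]] += 1
    return {name: errors[name] / totals[name] for name in totals}


def degraded(events, threshold=0.05):
    """List components whose error rate meets or exceeds the threshold.

    The 5% default is an assumption chosen for illustration.
    """
    rates = error_rates(events)
    return sorted(name for name, rate in rates.items() if rate >= threshold)


# Example: checkout fails for 1 in 10 requests while search is healthy.
sample = (
    [{"name": "checkout", "status": "ok"}] * 9
    + [{"name": "checkout", "status": "error"}]
    + [{"name": "search", "status": "ok"}] * 20
)
print(degraded(sample))
```

A signal like this tells an engineer which component to look at first, which is the point of narrowing a problem down before debugging the whole application.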

"Code is like food; it rots," Majors said. Stopping that rot as quickly as possible, before it drags down an entire system, can prevent bigger outages down the road that cost companies money.

But it's not just the code that needs observing; it's the people.

Jones is a veteran of high-performing software teams at Slack, Netflix and Jet.com. Yet even within companies at the forefront of software development practices, organizational structures can make as much of an impact on healthy applications as coding practices, she said.

At one of those companies (she declined to share which), a disproportionately large number of performance issues happened within a short period of time each year, and the company was having a hard time figuring out what was causing the problem. Turns out, those problematic periods came just after its annual promotions cycle, during which engineers had scrambled to ship as much code as they could in a short period of time to hit their goals for the year.

"It wasn't their fault. It was the system that was created at the company," Jones said. "Understanding that these promotion cycles were being correlated to an increase in incidents, because people were trying to get things done really quickly, actually incentivized the company to completely restructure how they did their promotion cycles, which led to this kind of stuff not happening as much."

Insights like that led Jones to found Jeli, which allows companies to evaluate and monitor how their organizational structures map against their coding practices. The company just raised a $4 million seed round to build out tools for that type of customer.

Infinite runway

The promise of observability tools is preventative maintenance: Not only will you be able to see and react to problems faster than current monitoring tools allow, but you'll also be able to glean insights from that data in a way that helps protect against future problems yet to rear their head.

The surge of interest in this space from upstarts and traditional monitoring companies has lots of ideas flying fast and furious, but it will take some time before that promise is met, according to Allspaw.

"We have enough problems with the known unknowns," he said. "The runway to make progress on that is as close to infinite as we can get."

Still, 25 years into the internet revolution, we've come to expect certain levels of performance and reliability from our web and mobile applications. Big organizations like AWS, Google, Netflix and others are well down the observability road inside their own companies. And now the tools and companies that will bring those insights to the rest of us are starting to get traction.
