The PivotNine Blog

LightStep Adds Distributed Tracing For All

04 March 2019

Justin Warren

Distributed tracing company LightStep has released a new product based on the same technology as its high-end [x]PM product. Called simply LightStep Tracing, the tool is aimed at individual teams rather than entire organisations, and helps them to understand exactly what's going on in their complex modern applications.
Mere mortals struggle to comprehend the complexity of modern microservice-based applications, which has given rise to new tools that help humans understand what's happening as millions of transactions flow across their systems every minute. Distributed tracing helps admins understand the path an individual transaction takes through a microservices application.
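To make the idea of a transaction's "path" concrete, here is a minimal sketch of how a single trace might be modelled: a set of spans, each pointing at the span that called it. The `Span` class, its field names, and the service names are hypothetical and chosen for illustration only; they are not LightStep's or OpenTracing's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """One unit of work in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str


def call_path(spans: List[Span]) -> List[Tuple[int, str, str]]:
    """Return (depth, service, operation) tuples in depth-first order from the root."""
    # Index spans by their parent so children can be looked up directly.
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    path = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            path.append((depth, span.service, span.operation))
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return path


# Illustrative data: one "buy" click fanning out across four made-up services.
spans = [
    Span("a", None, "web", "POST /buy"),
    Span("b", "a", "cart", "load_cart"),
    Span("c", "a", "payments", "charge"),
    Span("d", "c", "fraud", "score"),
]

for depth, service, operation in call_path(spans):
    print("  " * depth + f"{service}: {operation}")
```

Real tracing systems also record timing, tags, and a trace identifier on each span; those are left out here to keep the parent/child structure visible.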

Ben Sigelman, CEO and co-founder of LightStep

“It used to be possible to enumerate all the ways your system could fail,” says LightStep CEO Ben Sigelman. That's no longer true as the constantly changing nature of the always-on Internet meets the complex interconnectedness of modern software systems.

Clicking ‘buy’ on a website today could mean touching dozens of services running on hundreds of servers, with networks and containers and service meshes and any number of other systems involved in a delicate and complex web that has to work properly for that simple act of purchasing to succeed. If it doesn't work, or is slow, the customer frustration is real, as is the lost revenue.

The rise of open-source projects such as OpenTracing, one of numerous Cloud Native Computing Foundation projects (and co-created by Sigelman), has led to more widespread adoption of distributed tracing by builders of modern software.

“We often find prospective customers have implemented tracing already,” says Sigelman. “And often it's OpenTracing, which is easy to use.” But the value of tracing comes from the insights it provides into system behaviour, which requires more sophisticated analysis than the basic open-source projects provide.

“Traces are just raw data. The value comes from using traces as raw materials to do more advanced analysis,” says Sigelman. “We want to save you the trouble of looking at traces.”

LightStep Tracing takes the same technological underpinnings as the [x]PM product and packages them for use by smaller teams, rather than entire organisations. It's delivered as Software-as-a-Service managed by LightStep, freeing teams to concentrate on using tracing rather than administering tracing infrastructure. The same high-end architecture is important, as it allows Tracing to collect the same 100% coverage tracing data as [x]PM, but without needing the additional satellites common to the higher-end [x]PM deployments.

LightStep Tracing aggregates information from the raw trace data it collects to identify trends and correlations that are difficult to find when manually inspecting individual traces. This helps developers and operators to rapidly detect anomalous behaviour and locate its actual cause, rather than wasting time chasing down and eliminating many possible causes.
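As a rough illustration of why aggregation helps, the sketch below compares how long each service takes in slow transactions versus fast ones and ranks services by the difference. This is not LightStep's method, which the article does not describe. It is a deliberately naive stand-in, and it assumes each trace has already been reduced to a mapping of service name to milliseconds spent, with services called one after another so the total is a simple sum. The function name, the threshold, and the sample numbers are all made up.

```python
from collections import defaultdict
from statistics import mean


def rank_suspects(traces, slow_ms):
    """Rank services by how much longer they take in slow traces than in fast ones."""
    slow, fast = defaultdict(list), defaultdict(list)
    for trace in traces:
        # Assumption: services run sequentially, so total latency is the sum.
        total = sum(trace.values())
        bucket = slow if total >= slow_ms else fast
        for service, ms in trace.items():
            bucket[service].append(ms)

    # Only score services seen in both groups, so the comparison is meaningful.
    scores = {
        service: mean(slow[service]) - mean(fast[service])
        for service in slow
        if service in fast
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative data: two normal transactions and two slow ones.
traces = [
    {"web": 20, "cart": 30, "payments": 50},
    {"web": 22, "cart": 28, "payments": 55},
    {"web": 21, "cart": 31, "payments": 400},
    {"web": 19, "cart": 29, "payments": 380},
]

ranked = rank_suspects(traces, slow_ms=200)
print(ranked[0][0])  # prints "payments"
```

No single trace in this sample proves anything on its own; the pattern only shows up when the slow and fast groups are compared, which is the kind of work that is tedious to do by eye.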

“The future isn't about applications, it's about services,” says Sigelman, and I'm inclined to agree. Modern applications often make use of multiple online services connected over a network. The increasing use of service mesh technologies to connect disparate services together further complicates the task of maintaining a mental model of exactly how everything connects. If we add in ephemeral Functions-as-a-Service offerings such as AWS Lambda, the challenge only increases.

I'm a big fan of grepping through log files and putting DEBUG statements into my code, but this approach doesn't scale well. Distributed tracing doesn't replace other methods, it augments them, providing another way of looking at and understanding complex systems. Debug logs may well help to locate and fix a problem with an individual service, but first you need to know which service is to blame for a performance issue.

A trusted system that can reliably point to the problem, using data that everyone on the team can see, helps to avoid the issues that plague infrastructure groups where the database team blames the storage team, who blame the network, who blame the firewalls, who blame DNS. A solid overview of the system that shows its dynamic nature, and that we can easily drill into to locate areas for further investigation while avoiding red herrings, can only be a boon to operations and development teams alike.

It's certainly worth a look.

This article first appeared on Forbes.com.