Paper: Distributed Systems Tracing With Dapper
Dapper is Google's system for tracing requests in distributed systems.
A nice feature of Dapper is what the authors call ubiquity: Dapper builds on top of the common RPC frameworks used at Google, so you get basic tracing out of the box for pretty much every service, with no special configuration required. The out-of-the-box traces contain basic, commonly used data, and you can enrich them with application-specific details by writing some custom code.
Here's an overview of how Dapper works:
- A unique trace ID is generated for every request in the front end service.
- Google's RPC framework passes the trace ID across the different services that are processing the request.
- Each service dumps trace data to local logs. Sampling is heavy (configurable per service, usually around 0.1%) to keep the cost of tracing low.
- Dapper's collectors pick up the logs from each service, join them together, and write them to Bigtable (Google's NoSQL datastore), where they are queryable by trace ID.
- A set of tools allows querying the trace data, either one trace at a time or in bulk.
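The steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Dapper's actual implementation: `TraceContext`, `start_trace`, `record`, `frontend`, `backend`, and `collect` are all hypothetical names, the "RPC" is a plain function call, the local logs are in-memory lists, and a dictionary keyed by trace ID stands in for Bigtable. The sketch also assumes the sampling decision is made once at the front end and travels with the trace ID.

```python
import random
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceContext:
    """Hypothetical context that travels with every call in a request."""
    trace_id: str
    sampled: bool


def start_trace(sample_rate, rng=random):
    """Front end: create a trace ID and decide once whether to sample."""
    return TraceContext(uuid.uuid4().hex, rng.random() < sample_rate)


def record(logs, service, ctx, message):
    """Append a record to the service's local log, but only for sampled traces."""
    if ctx.sampled:
        logs[service].append(
            {"trace_id": ctx.trace_id, "service": service, "message": message}
        )


def backend(logs, ctx):
    record(logs, "backend", ctx, "looked up row")


def frontend(logs, ctx):
    record(logs, "frontend", ctx, "received request")
    backend(logs, ctx)  # stand-in for an RPC that carries ctx along


def collect(logs):
    """Collector: join per-service logs into one store keyed by trace ID."""
    store = defaultdict(list)
    for records in logs.values():
        for rec in records:
            store[rec["trace_id"]].append(rec)
    return store
```

Calling `frontend(logs, start_trace(0.001))` with `logs = defaultdict(list)` would record roughly one request in a thousand, and `collect(logs)[trace_id]` then returns every record for that request across both services.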
Systems like Dapper that allow us to peer into the details of complicated systems are enormously powerful. For example, the paper talks about how Dapper enabled the AdWords team to optimize their backend by finding unnecessary requests along the critical path.
Also interesting were Dapper's shortcomings:
"Our model implicitly assumes that various subsystems will perform work for one traced request at a time. In some cases it is more efficient to buffer a few requests before performing an operation on a group of requests at once (coalescing of disk writes is one such example). In such instances, a traced request can be blamed for a deceptively large unit of work."
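To make the coalescing problem concrete, here is a minimal sketch, assuming a hypothetical `CoalescingWriter` that buffers writes and flushes once `batch_size` writes are pending. The class and its `blamed_bytes` bookkeeping are made up for illustration. The flush runs while the last request is being handled, so all of the flushed work is attributed to that request's trace ID.

```python
from collections import defaultdict


class CoalescingWriter:
    """Hypothetical writer that batches writes from many requests."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
        # trace_id -> bytes of flush work attributed to that trace
        self.blamed_bytes = defaultdict(int)

    def write(self, trace_id, payload):
        self.pending.append(payload)
        if len(self.pending) >= self.batch_size:
            flushed = sum(len(p) for p in self.pending)
            self.pending.clear()
            # The flush happens inside the triggering request, so that
            # request is charged for everyone's buffered bytes.
            self.blamed_bytes[trace_id] += flushed
```

With `batch_size=3` and three requests each writing 10 bytes, the third request's trace is charged 30 bytes of disk work while the first two are charged nothing.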
In its current form, Dapper also doesn't seem to work well for workloads where a single request is very compute- or data-intensive (i.e. services with very low QPS):
"However, off-line data intensive workloads, such as those that fit the MapReduce model, can also benefit from better performance insight. In such cases, we need to associate a trace id with some other meaningful unit of work, such as a key (or range of keys) in the input data, or a MapReduce shard."