We've gone through our various databases, caches, web servers, daemons, and despite some increased traffic activity across the board, all systems are running nominally. The truth is we're not sure what's happening.

The quote and image above are from the much talked-about Twitter blog post. Alex Payne has a longer post detailing their architecture issues here (Techmeme discussion).

The problem Twitter's dev team is doing right now in diagnosing their problems is something I've been thinking about for some time. As a programmer, your top diagnostic tools are probably your debugger, your profiler and a logging system. Now, take a look at the simplified architecture diagram I created below of a typical high traffic web application (note my complete lack of artistic skill).
image

Notice anything wrong? The tools that we just spoke about help us only on each individual node. They do a terrible job of traversing program flow across machines. Given that any self-respecting service is running on multiple machines, most dev teams find out the hard way that tracking down distributed bugs is very hard. Take Twitter for example - by their own admission, their problems lie not in the boxes themselves but in the arrows between them. This problem is compounded due to the fact that your app code is a very small part of this system. It may be possible to build extensive logging into your code but having to trace values across memcached/Cacheman and MySql/Sql Server and your load balancer is non-trivial, if not impossible.

I think there's a tremendous opportunity in solving this problem. Unfortunately, I don't know what exact shape or form it will take (I do have a few guesses). Here's my wishlist for what such a solution should have

  • Distributed debugging I want the ability to seamlessly debug from one machine to another, moving between web frontend code, SQL stored procedures, any messaging infrastructure and back. For places where I can't debug, I want to see all data coming and coming out.
  • Establishing causality For post-mortem distributed debugging, you are always trying to establish causality - what event lead to what other event. Such a system will log enough data to establish causality relationships between various events in the system. For a particular request or stack trace, I should be able to go back and look at code flow across machines for that request.
  • Distributed profiling Anyone who has maintained an public facing site has seen unexplained slowdowns which are hard to debug. In the very early days of Popfly, we used to see an unexplained slowdown everyday at lunchtime (we used to joke about the servers going out to lunch). A tool that constantly monitors server performance and in case of a slowdown, shows you where time is spent across the system will be critical. Tools like Nagios do a great deal of monitoring for you but I haven't seen any tool do the latter.

I haven't found any current system which does all the above to my liking. I'm writing some code in my spare time to get some of the above but it is a daunting task. The closest I've seen anyone come is the XTrace project from Berkeley's RAD labs. The following excerpt from the paper explains how it works

A user or operator invokes X-Trace when initiating an application task (e.g., a webrequest), by inserting X-Trace metadata with a task identifier in the resulting request. This metadata is then propagated down to lower layers through protocol interfaces (which may need to be modified to carry X-Trace metadata), and also along all recursive requests that result from the original task. This is what makes X-Trace comprehensive; it tags all network operations resulting from a particular task with the same task identifier.

The problem with systems like X-Trace is that they require extensive modifications to database servers, your web server, your cache server, etc. Such a task is not easy, especially given the wide array of choices you have in each of those. However, X-Trace is promising and I know that they've carried out some large scale tests with good results.

Does anyone have pointers to tools and solutions which can help with the above?


#