[Sans] Call for Datacenter Management/Troubleshooting Scenarios

Fri Sep 3 13:34:45 CEST 2010

Hi,

I am writing to the list to ask for input from real sysadmins on
datacenter management and troubleshooting scenarios.

The context is a research paper on a comprehensive datacenter tracing
framework I am writing together with Microsoft Research Silicon Valley.

The framework allows you to write concise queries that can execute over
the whole datacenter and trace detailed information across all layers of
the software stack down to the operating system kernel. Tracepoints are
automatically deployed and trace data is automatically aggregated and
stored with low overhead for (real-time) presentation.

In the paper, we would like to include case studies of datacenter
management and troubleshooting scenarios that really occurred and that
could have been handled more easily with such a tracing framework.

Examples of cases I am looking for are (all of these really occurred):

* Troubleshooting misconfigurations or software bugs. E.g., a DHCP
server is overloaded because an unknown host in the datacenter is
generating lots of traffic through a misconfigured Ethernet bridge
driver.

* Detecting known problems. E.g., an application periodically hangs
because glibc tries to communicate with a credentials caching daemon
that doesn't respond.

* Analyzing the impact of changes. E.g., I have to find out which
applications are affected by the kernel patch I just installed because
something is running mysteriously slowly.

Cases where multiple hosts were involved and the answer wasn't obvious
are especially interesting. If something really crazy happened, I'd
like to hear about it as well. I'd be very glad to hear about some of
your experiences.

Thank you!
Simon