Monitoring vs. profiling in production
Any serious production application is accompanied by monitoring and alerting. Without them, downtime and performance issues go unnoticed for longer, which hurts any organization.
However, knowing that there is a problem is just one step towards solving it. Solving performance and availability issues requires knowing the root cause of the problem. Monitoring may give us general hints about the root cause, e.g. memory or CPU usage, but it will most likely miss the deep-dive, application-level information.
On the other hand, we have a full arsenal of profiling/debugging tools, but these are only usable at development time.
So why do we still have the split between uptime/availability monitoring and the deep-dive profiling tools needed for root cause analysis, if both are actually needed in a production environment?
Traditionally, it was considered acceptable to conduct root cause analysis in a development environment. These are some of the reasons:
- The split between development and operations assumed that root cause analysis and troubleshooting were the developers’ concern, while uptime/availability monitoring was the operations team’s business.
- The cost of bad performance was not known, or not considered, to be as high as it is today.
- Profiling and debugging tools are mostly available on dev environments out of the box.
- One monitoring tool, e.g. Nagios, can cover a big part of the common technology stack. Everyone wanted to manage fewer tools.
On the other hand, it is harder to find problems in development or staging environments, because they differ from production in resources, traffic, configuration, etc.
The age of DevOps, continuous delivery and container-based environments
Modern development processes and practices both allow and require full use of production profiling tools, because:
- Interests are better aligned in DevOps teams, which means there is no point in splitting responsibility for uptime and responsibility for application performance between developers and operations teams.
- The cost of downtime and bad performance is extremely high now. Slow web apps may even see their search engine ranking suffer. This makes a low mean time to repair (MTTR) a priority for an organization.
- It is tricky to instrument container-based environments on demand. SSH access to a particular machine may not be available, and it may even be impossible to locate the host in the first place.
- Faster releases with a sharp focus on features may allow more performance issues to slip into production.
- Software-as-a-service monitoring and profiling solutions free DevOps teams from setting up and managing dashboards on premises, which in turn allows them to use more and better-suited solutions.
What a modern monitoring/profiling tool for production environments should be
StackImpact is a production profiler and monitor for the Go language, which we can look at as one example of how profiling in production environments could be implemented. Here is how, for example, it presents the CPU hot spots for a production Go application.
In a nutshell, the agent running inside of the application reports regular and anomaly-triggered profiles of CPU usage, memory allocations, I/O, lock contention, and other aspects of application execution. These along with bottleneck traces and runtime metrics are presented in the Dashboard in a historically comparable form.