Monitoring vs. profiling in production
Any serious production application is accompanied by monitoring and alerting. Without them, downtime and performance issues go unnoticed for longer, which hurts any organization.
However, knowing that there is a problem is just one step towards solving it. Solving performance and availability issues requires knowing the root cause of the problem. Monitoring may give us general hints about the root cause, such as memory or CPU pressure, but it will most likely lack deep-dive, application-level information.
On the other hand, we have a full arsenal of profiling and debugging tools, but these are traditionally used only at development time.
So why do we still have the split between uptime/availability monitoring and deep-dive profiling tools needed for problem root cause analysis if both are actually needed in the production environment?
Traditionally, it was considered acceptable to conduct root cause analysis in development environments. These are some of the reasons:
- The split between development and operations assumed that root cause analysis and troubleshooting are developers’ responsibilities, while uptime/availability monitoring is operations’ business.
- The cost of bad performance was not known or considered to be as critical as it is now.
- Profiling and debugging tools are mostly available on dev environments out of the box.
- One monitoring tool, e.g. Nagios, can cover a big part of the common technology stack. Everyone wanted to manage fewer tools.
On the other hand, it is more challenging to find problems in development or staging environments because they are different from production in terms of resources, traffic, configuration, etc.
The age of DevOps, continuous delivery and container-based environments
Modern development processes and practices both allow and require full use of production profiling tools because:
- The interests are better aligned in DevOps teams, which means there is no point in dividing uptime responsibility and application performance responsibility between developers and operations teams.
- The cost of downtime and bad performance is extremely high now. Slow web apps may even see their search engine ranking suffer. This makes a low mean time to repair (MTTR) a priority for any organization.
- It is tricky to instrument container-based environments on demand. SSH may not be available to log in to a particular machine, or it may even be impossible to locate a host in the first place.
- Faster releases with a sharp focus on features may allow more performance issues to slip into production.
- Software-as-a-service monitoring and profiling solutions free DevOps teams from setting up and managing dashboards on premises, which in turn allows them to use more and better-suited solutions.
What a modern monitoring/profiling tool for production environments should be
StackImpact is a production profiler and monitor for Go, which we can look at as one way that profiling in production environments could be implemented. For example, here is how it presents the CPU hot spots for a production Golang application.
In a nutshell, the agent running inside the application reports regular and anomaly-triggered profiles of CPU usage, memory allocations, I/O, lock contention and other aspects of application execution. These, along with bottleneck traces and runtime metrics, are presented in the Dashboard in a historically comparable form.
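To make the idea of regular and anomaly-triggered profiles more concrete, here is a minimal sketch built only on Go's standard runtime/pprof package. It is not StackImpact's agent or API. The function names (isAnomaly, captureCPUProfile), the goroutine count as the watched metric, the 2x threshold and the tick intervals are all assumptions chosen for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// isAnomaly reports whether current exceeds the mean of history by more than
// the given factor. The rule and the factor are illustrative assumptions,
// not how any particular product detects anomalies.
func isAnomaly(history []float64, current, factor float64) bool {
	if len(history) == 0 {
		return false // nothing to compare against yet
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	return current > mean*factor
}

// captureCPUProfile records a CPU profile of this process for the given
// window and returns the raw bytes in pprof format.
func captureCPUProfile(window time.Duration) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // for example, CPU profiling is already active
	}
	time.Sleep(window)
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	var history []float64
	for tick := 0; tick < 6; tick++ {
		// Hypothetical health metric: the number of live goroutines.
		current := float64(runtime.NumGoroutine())

		regular := tick%3 == 0 // scheduled profile every third tick
		anomalous := isAnomaly(history, current, 2.0)

		if regular || anomalous {
			reason := "regular"
			if anomalous {
				reason = "anomaly"
			}
			profile, err := captureCPUProfile(100 * time.Millisecond)
			if err != nil {
				fmt.Println("profiling failed:", err)
			} else {
				// A real agent would upload the profile; here we only report its size.
				fmt.Printf("%s profile captured: %d bytes\n", reason, len(profile))
			}
		}

		history = append(history, current)
		time.Sleep(50 * time.Millisecond)
	}
}
```

Keeping the profiling window short and the schedule sparse is what makes this kind of sampling cheap enough to leave running in production. A real agent would also cover memory, I/O and lock contention, and would send the results to a backend instead of printing them.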