Continuous Performance Profiling

Whether a program runs in exponential, linear, logarithmic or constant time is critical for many use cases. Even when an algorithm is deliberately designed to fall into a certain complexity class, there are multiple reasons why the running program might not behave that way. An underlying library, the OS or even the hardware can be the root cause of a performance problem.

Performance profiling has been a part of software development since its early days. It is essential for optimizing and fixing a program’s time and space complexity as well as any bottlenecks caused by third-party dependencies. A performance issue without an execution profile is like an error without a stack trace: getting to the root cause takes a lot of manual work.

Call graphs

The profiler’s output is usually structured as some form of call graph, depending on the type of profile. For a CPU profile, it is usually a tree with the stack frames of function calls as branches and the number of samples as values. Such a profile immediately reveals hot spots, i.e. the function calls that were found (sampled) on the CPU most often. Similarly, a memory allocation profile shows how many bytes were allocated and not released, and by which function call.

Other types of sampling profilers will provide similar information about blocking calls, i.e. calls waiting for an event (e.g. mutex or even asynchronous calls).

A CPU call graph may look like this:

Screenshot
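The aggregation behind such a call graph can be sketched in a few lines of Python. This is only an illustration, not the output format of any particular profiler: the `Node` class, the `build_call_tree` and `hottest_path` helpers and the sample stacks are all hypothetical, and each sample is assumed to be a list of function names ordered from the outermost frame to the innermost.

```python
class Node:
    """One stack frame in the call tree, with the number of samples seen in it."""

    def __init__(self, name):
        self.name = name
        self.samples = 0
        self.children = {}

    def child(self, name):
        # Create the child frame on first use, then reuse it.
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]


def build_call_tree(stack_samples):
    """Merge sampled stacks (outermost frame first) into a single tree."""
    root = Node("root")
    for stack in stack_samples:
        root.samples += 1
        node = root
        for frame in stack:
            node = node.child(frame)
            node.samples += 1
    return root


def hottest_path(root):
    """Follow the most-sampled child at every level to find the hot spot."""
    path = []
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda n: n.samples)
        path.append((node.name, node.samples))
    return path


# Hypothetical samples, as a sampling profiler might record them.
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "render"],
    ["main", "gc"],
]

tree = build_call_tree(samples)
for name, count in hottest_path(tree):
    print(f"{name}: {count} samples")
```

In this made-up data, following the most-sampled branch leads from `main` through `handle_request` to `parse_json`, which is the kind of hot spot the tree view makes visible at a glance.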

Profiling cloud applications

The era of horizontally scalable, data-intensive cloud applications deployed on FaaS, PaaS, IaaS or bare metal introduces an even greater need for profilers, since the performance of a single instance running locally on a developer’s machine no longer reflects its behavior in a large-scale data center deployment. The scale and usage of a production application, its data volume, traffic patterns and configuration will expose inefficiencies and issues in the code that are not detectable in a development or testing environment.

Traditional application performance management and monitoring products tried to address cloud applications by monitoring and tracing certain business-specific workloads and by introducing on-demand and automatic remote profiling capabilities.

Continuous vs. on-demand performance profiling

The problem is that on-demand or automatic profiling only allows post factum analysis. It might be helpful in the case of a performance regression or a problem-driven optimization, but it doesn’t provide the basis for continuous performance improvements. Assuming the application is evolving, failing to address its performance continuously will result in gradual performance regression.

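Continuous profiling is not triggered by an event or a human. The idea is that it is “always” active on a small subset of application processes. Because only a few processes are profiled at any given time, the profiling overhead averaged over the whole deployment is even lower than the overhead on a single profiled process.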
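The overhead argument can be made concrete with a small sketch. The selection rule and all numbers below are assumptions chosen for illustration, not measurements or the behavior of any specific product: each process independently decides whether to profile itself with a fixed probability, and a profiled process is assumed to pay a fixed relative overhead.

```python
import random


def should_profile(rng, fraction):
    """Decide whether this process belongs to the profiled subset."""
    return rng.random() < fraction


def fleet_overhead(fraction_profiled, overhead_while_profiling):
    """Average overhead across all processes, profiled or not."""
    return fraction_profiled * overhead_while_profiling


# Hypothetical numbers: 5% of processes profiled, 4% overhead on each of them.
FRACTION = 0.05
OVERHEAD = 0.04

rng = random.Random(42)  # seeded so the simulation is repeatable
processes = 10_000
profiled = sum(should_profile(rng, FRACTION) for _ in range(processes))

print(f"profiled processes: {profiled} of {processes}")
print(f"average fleet overhead: {fleet_overhead(FRACTION, OVERHEAD):.2%}")
```

Under these assumptions the average cost across the deployment is 0.05 × 0.04 = 0.2%, even though each profiled process individually pays 4%.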

The most obvious benefits of such an approach are:

  • Constant access to various current and historical performance profiles for troubleshooting and optimization.
  • Ability to historically compare profiles and locate regression causes with line-of-code precision.
  • Locating hot code or libraries across the whole infrastructure, where a single fix would benefit all applications.
  • Availability of pre-crash profile history for post mortem analysis.
  • No risk of crashing the application by invoking an on-demand profiler against a suffering or failing application, which is ironically the main use case for on-demand profiling.
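Comparing historical profiles, as mentioned in the list above, can be sketched as a diff of two sample counts. The profile format here is an assumption: each profile is a plain dictionary from a hypothetical `file:line function` key to the number of samples, and the `normalize` and `regressions` helpers and the threshold are made up for this example.

```python
def normalize(profile):
    """Convert raw sample counts into each entry's share of the profile."""
    total = sum(profile.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in profile.items()}


def regressions(before, after, threshold=0.05):
    """Return entries whose share of samples grew by more than `threshold`."""
    old = normalize(before)
    new = normalize(after)
    deltas = {
        name: new.get(name, 0.0) - old.get(name, 0.0)
        for name in set(old) | set(new)
    }
    grown = [(name, delta) for name, delta in deltas.items() if delta > threshold]
    return sorted(grown, key=lambda item: item[1], reverse=True)


# Hypothetical CPU profiles taken before and after a release.
last_week = {
    "app.py:42 parse_json": 200,
    "app.py:87 render": 300,
    "db.py:15 db_query": 500,
}
today = {
    "app.py:42 parse_json": 600,
    "app.py:87 render": 150,
    "db.py:15 db_query": 250,
}

for name, delta in regressions(last_week, today):
    print(f"{name}: share of samples changed by {delta:+.0%}")
```

Normalizing to shares rather than comparing raw counts keeps the comparison meaningful when the two profiles contain a different total number of samples.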

A prime example of a large-scale continuous profiling system is Google-Wide Profiling (GWP), which profiles almost every server and application at Google. Refer to the GWP paper for the full details.

In turn, StackImpact enables continuous performance profiling for anyone: developers, small businesses or large enterprises. It currently supports Go, Node.js and Python applications and can profile CPU usage, memory allocations, blocking calls and async calls. It also provides contextual information such as errors and multiple runtime metrics.

Historically comparable CPU profiles from an application over a selected period of time:

Screenshot