Go Profiler Internals

Profilers give application developers critical insight into program execution, helping them resolve performance issues and locate memory leaks, thread contention, and more. Although the hot-code stack traces and call graphs that profilers generate are usually self-explanatory, it is sometimes necessary to understand how profilers work underneath in order to infer more detail about an application from its profiles.

This article gives a conceptual overview of Go’s CPU, memory and block profiler implementations, aiming to help developers better understand their programs.

CPU profiler

Like many other CPU profilers, Go’s CPU profiler uses the SIGPROF signal to record code execution statistics. Once registered, the signal is delivered at a specified time interval. Unlike ordinary timers, the timer here advances only while the CPU is busy executing the process. The signal interrupts code execution, which makes it possible to see what code was running at that moment.

When the pprof.StartCPUProfile function is called, a SIGPROF signal handler is registered to be called every 10 ms by default, which is referred to internally as a rate of 100 Hz. On Unix/Linux, it uses the setitimer(2) system call to set the signal timer.
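As a minimal sketch of how a program might drive this from user code, the example below starts the CPU profiler, runs some busy work, and stops the profiler again. The helper names burnCPU and profileCPU are hypothetical and exist only for illustration; pprof.StartCPUProfile and pprof.StopCPUProfile are the real entry points in runtime/pprof.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// burnCPU (hypothetical helper) keeps one core busy for roughly ms
// milliseconds so that SIGPROF ticks have running code to interrupt.
func burnCPU(ms int) int {
	deadline := time.Now().Add(time.Duration(ms) * time.Millisecond)
	n := 0
	for time.Now().Before(deadline) {
		n++
	}
	return n
}

// profileCPU (hypothetical helper) runs work with the CPU profiler
// enabled and returns the size in bytes of the encoded profile.
func profileCPU(work func()) (int, error) {
	var buf bytes.Buffer
	// Enables CPU profiling; samples are taken at the default rate.
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, err
	}
	work()
	// Stops sampling and flushes the remaining profile data to buf.
	pprof.StopCPUProfile()
	return buf.Len(), nil
}

func main() {
	size, err := profileCPU(func() { burnCPU(200) })
	if err != nil {
		fmt.Println("could not start CPU profile:", err)
		return
	}
	fmt.Printf("collected %d bytes of profile data\n", size)
}
```

In a real program the profile would usually be written to a file and inspected later rather than kept in a memory buffer.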

On every invocation, the signal handler unwinds the stack starting from the current PC value, producing a stack trace made up of the PC values of the interrupted function and its callers. The hit count of that stack trace is then incremented. The result is a list of distinct stack traces, each with the number of times it was seen.
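The counting step can be illustrated in ordinary user code. The sketch below is not how the runtime does it (the real work happens inside a signal handler with its own data structures); it only assumes that a stack trace is a fixed-size array of PCs used as a map key. The names stackKey, hitCounts, record, leaf, pathA, pathB and collect are all hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxDepth bounds the number of PCs kept per stack trace (assumption).
const maxDepth = 32

// stackKey is a fixed-size array of PCs so it can serve as a map key.
type stackKey [maxDepth]uintptr

// hitCounts maps each distinct stack trace to the number of times seen.
type hitCounts map[stackKey]int

// record captures the caller's stack and increments its hit count.
func (h hitCounts) record() {
	var key stackKey
	// Skip runtime.Callers and record itself.
	runtime.Callers(2, key[:])
	h[key]++
}

// The directives below keep each function as a real frame so the two
// call paths are easy to tell apart.

//go:noinline
func leaf(h hitCounts) { h.record() }

//go:noinline
func pathA(h hitCounts) { leaf(h) }

//go:noinline
func pathB(h hitCounts) { leaf(h) }

// collect reaches leaf three times via pathA and twice via pathB.
func collect() hitCounts {
	h := hitCounts{}
	for i := 0; i < 3; i++ {
		pathA(h)
	}
	for i := 0; i < 2; i++ {
		pathB(h)
	}
	return h
}

func main() {
	h := collect()
	total := 0
	for _, n := range h {
		total += n
	}
	fmt.Printf("%d distinct stack traces, %d hits in total\n", len(h), total)
}
```

Because the two paths pass through different callers, they produce two different keys, and each key accumulates its own count.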

After the profiler is stopped with pprof.StopCPUProfile, the stack trace PCs are symbolized into readable stack frames by the go tool pprof command, provided that it has access to the executable file. If the profiler is initiated remotely, the tool sends a request containing all PCs found in the profile to the running program and expects the corresponding stack frame information in return. If the profiler is initiated by the StackImpact agent, the PCs are symbolized in the same process using the runtime.FuncForPC function, and the profile is then reported to the dashboard.
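In-process symbolization with runtime.FuncForPC might look roughly like the sketch below. The helpers currentStack and symbolize are hypothetical, and the output format is an arbitrary choice. For code that must report inlined frames accurately, the standard library's runtime.CallersFrames is generally the better fit; FuncForPC is used here only because it is the function named above.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentStack (hypothetical helper) returns the PCs of the calling stack.
func currentStack() []uintptr {
	pcs := make([]uintptr, 32)
	// Skip only runtime.Callers itself, so the trace starts at currentStack.
	n := runtime.Callers(1, pcs)
	return pcs[:n]
}

// symbolize (hypothetical helper) turns PCs into readable frame descriptions.
func symbolize(pcs []uintptr) []string {
	frames := make([]string, 0, len(pcs))
	for _, pc := range pcs {
		// PCs from runtime.Callers are return addresses; subtracting 1
		// yields an address inside the call instruction itself.
		fn := runtime.FuncForPC(pc - 1)
		if fn == nil {
			frames = append(frames, "unknown")
			continue
		}
		file, line := fn.FileLine(pc - 1)
		frames = append(frames, fmt.Sprintf("%s (%s:%d)", fn.Name(), file, line))
	}
	return frames
}

func main() {
	for _, frame := range symbolize(currentStack()) {
		fmt.Println(frame)
	}
}
```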

The end result is that we now know how many times each stack trace was seen executing on the CPU. Now this set of stack traces with corresponding hit counts can be represented in many different ways: as a call graph, a flame graph, top calls, and more.

Memory allocation profiler

The memory allocation profiler samples heap allocations and makes it possible to see which function calls allocate memory, how much, and how often. Tracking every allocation and unwinding its stack trace would be extremely expensive, so a sampling technique is used.

To sample only a fraction of allocations, the profiler generates random numbers that follow an exponential distribution. The generated numbers define the distance between samples in terms of allocated memory size: only an allocation that crosses the next random sampling point is sampled.

A sampling rate controls how large a fraction of allocations is sampled, i.e. the average sampling distance. The sampling rate defines the mean of the exponential distribution and is specified by the runtime.MemProfileRate variable, which defaults to 512 KB.

The CDF of the exponential distribution, $1 - e^{-\lambda x}$, is used to generate the sampling distance, where $\lambda^{-1}$ is the mean and $x$ is the random value. The value of $x$ is calculated by setting the CDF equal to a uniformly distributed random number $u$ and solving for $x$, which gives $x = -\ln(1 - u) / \lambda$. This value of $x$ is then used as a random sampling distance.
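The inverse-CDF step can be written down directly. The sketch below assumes floating-point arithmetic and the standard math/rand generator, which is a simplification: the runtime uses its own fast random source and approximations. Solving the CDF for x with mean 1/λ gives x = -mean · ln(1 - u), which is what the hypothetical function sampleDistance computes; averageDistance is an equally hypothetical helper that checks the mean empirically.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sampleDistance (hypothetical) maps a uniform value u in [0, 1) to an
// exponentially distributed distance in bytes with the given mean.
func sampleDistance(u, mean float64) float64 {
	// Solve u = 1 - exp(-x/mean) for x.
	return -mean * math.Log(1-u)
}

// averageDistance (hypothetical) draws n distances from a seeded
// generator and returns their average, which should approach mean.
func averageDistance(n int, mean float64, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	sum := 0.0
	for i := 0; i < n; i++ {
		sum += sampleDistance(rng.Float64(), mean)
	}
	return sum / float64(n)
}

func main() {
	const rate = 512 * 1024 // default sampling rate in bytes
	fmt.Printf("median distance: %.0f bytes\n", sampleDistance(0.5, rate))
	fmt.Printf("average distance: %.0f bytes\n", averageDistance(200000, rate, 1))
}
```

Individual distances vary widely, but their average converges to the sampling rate, which is why the rate can be read as the mean number of bytes allocated between samples.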

Since the memory allocation profiler is always active, the list of stack traces with the corresponding sampled allocation information can be requested at any time. Dividing each sampled allocation size by its sampling probability, which is calculated using the CDF, scales the sampled allocations up to an estimate of what would have been recorded if all allocations had been counted.
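A rough sketch of this scaling step follows. It assumes that the sampling probability of a stack trace is evaluated at the average allocation size of that stack trace, i.e. 1 - exp(-avgSize/rate); that assumption, the function name scaleSample, and the integer truncation are choices made for this illustration rather than a statement of the exact runtime code.

```go
package main

import (
	"fmt"
	"math"
)

// scaleSample (hypothetical) estimates the unsampled allocation count and
// total size from the sampled count and size of one stack trace.
func scaleSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// Every allocation was sampled, so nothing needs scaling.
		return count, size
	}
	avgSize := float64(size) / float64(count)
	// Probability that an allocation of avgSize bytes is sampled,
	// taken from the exponential CDF with mean rate.
	p := 1 - math.Exp(-avgSize/float64(rate))
	scale := 1 / p
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	const rate = 512 * 1024
	// Ten sampled allocations of 1 KB each.
	count, size := scaleSample(10, 10*1024, rate)
	fmt.Printf("estimated %d allocations, %d bytes\n", count, size)
}
```

Small allocations are rarely sampled, so each sampled one stands for many; allocations much larger than the rate are sampled almost always and are scaled by a factor close to one.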

As with the CPU profile, once the stack traces with the corresponding allocated memory sizes and allocation counts are ready, they are symbolized into readable stack frames with source file names, line numbers and function names.
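For completeness, here is one way a program might request the memory profile at an arbitrary moment, using pprof.WriteHeapProfile from runtime/pprof. The helpers allocate and heapProfile and the sink variable are hypothetical, and the explicit runtime.GC call is a choice made for this sketch so that recent allocations have a chance to appear in the output.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// sink keeps allocations reachable so they stay on the heap.
var sink [][]byte

// allocate (hypothetical) performs n heap allocations of size bytes each.
func allocate(n, size int) {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, size))
	}
}

// heapProfile (hypothetical) returns the encoded memory profile.
func heapProfile() ([]byte, error) {
	// Run a collection first so the profile reflects recent allocations.
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	allocate(1000, 64*1024)
	data, err := heapProfile()
	if err != nil {
		fmt.Println("could not write heap profile:", err)
		return
	}
	fmt.Printf("heap profile: %d bytes\n", len(data))
}
```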

Block profiler

The block profiler samples blocking calls on synchronization primitives, such as channel operations and mutex waits. It is useful for understanding where in the program the most waiting happens. It is even more useful when multiple profiles are compared, showing exactly which call has slowed down.

Again, it would be extremely inefficient to time all blocking calls, so sampling is applied. The sampling rate is set using the runtime.SetBlockProfileRate function and defines the average distance between samples in nanoseconds spent blocked. In other words, the profiler aims to sample one call per rate nanoseconds of blocking time, so calls much shorter than the rate are sampled only occasionally.

This is achieved by testing whether the call duration exceeds the rate or exceeds a random number modulo the rate. In other words, the probability of sampling a call grows linearly with its duration, reaching 100% for calls equal to or longer than the rate.
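The sampling decision can be sketched as a small predicate. The version below assumes that the duration and the rate are expressed in the same unit and takes the random number as a parameter to keep the function deterministic; shouldSample and sampledFraction are hypothetical names, not runtime functions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldSample (hypothetical) reports whether a blocking call that lasted
// duration should be recorded, given the sampling rate and a non-negative
// random number r.
func shouldSample(duration, rate, r int64) bool {
	if rate <= 0 {
		// The profiler is switched off.
		return false
	}
	if duration >= rate {
		// Calls at least as long as the rate are always sampled.
		return true
	}
	// Shorter calls are sampled with probability of about duration/rate.
	return r%rate <= duration
}

// sampledFraction (hypothetical) estimates the sampling probability for a
// given duration by repeating the decision n times with a seeded generator.
func sampledFraction(duration, rate int64, n int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	hits := 0
	for i := 0; i < n; i++ {
		if shouldSample(duration, rate, rng.Int63()) {
			hits++
		}
	}
	return float64(hits) / float64(n)
}

func main() {
	const rate = 1000
	for _, d := range []int64{100, 500, 1000} {
		fmt.Printf("duration %4d: sampled fraction %.3f\n", d, sampledFraction(d, rate, 100000, 1))
	}
}
```

Running the loop for several durations shows the linear growth described above: the sampled fraction tracks duration divided by rate until it saturates at one.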

The profiler is started by setting the rate to a value greater than 0 and stopped by setting it to a value less than or equal to 0. The profile is accumulated over the periods of activity and can be requested at any time. Once the stack traces with the corresponding wait times and call counts are ready, they are symbolized into readable stack frames with source file names, line numbers and function names.
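A small sketch of switching the block profiler on and off around a blocking operation is shown below. It uses runtime.SetBlockProfileRate and pprof.Lookup("block") from the standard library; the helper blockedStacks is hypothetical, and a rate of 1 is chosen here only so that every blocking event is recorded and the example behaves predictably.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks (hypothetical) enables the block profiler, waits on a
// channel, and returns the number of stack traces in the block profile.
func blockedStacks() int {
	// A rate of 1 records every blocking event.
	runtime.SetBlockProfileRate(1)
	// A rate of 0 switches the profiler off again.
	defer runtime.SetBlockProfileRate(0)

	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		done <- struct{}{}
	}()
	// The receive blocks until the goroutine above sends.
	<-done

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Printf("block profile holds %d stack trace(s)\n", blockedStacks())
}
```

The same profile can be written out in full with pprof.Lookup("block").WriteTo for inspection with the pprof tool.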

About the author: Dmitri Melikyan is the founder of StackImpact. His main interests are complexity and performance problems and designing methods and tools to understand and solve them.