Go Profiler Internals

Profilers give application developers critical execution insights that help resolve performance issues and locate memory leaks, thread contention and more. Although the hot-code stack traces and call graphs generated by profilers are usually self-explanatory, it is sometimes necessary to understand how profilers work underneath in order to infer more detail about an application from the generated profiles.

This article gives a conceptual overview of Go’s CPU, memory and block profiler implementations, with the aim of helping developers better understand their programs.

CPU profiler

Like many other CPU profilers, Go’s CPU profiler uses the SIGPROF signal to record code execution statistics. Once registered, the signal is delivered at a specified time interval. Unlike ordinary timers, the time here advances only while the CPU is busy executing the process. The signal interrupts code execution, making it possible to see which code was interrupted, i.e. which code was executing at that moment.

When the pprof.StartCPUProfile function is called, a SIGPROF signal handler is registered to be called every 10 milliseconds by default, which is referred to internally as a rate of 100 Hz. On Unix/Linux, the runtime uses the setitimer(2) system call to set the signal timer.

On every signal invocation, the handler traces back the execution by unwinding the stack from the current PC (program counter) value, resulting in a stack trace made up of the PC values of the calling functions. The hit count for that stack trace is then incremented, producing a list of stack traces, each with the number of times it was seen.

After the profiler is stopped with pprof.StopCPUProfile, the stack trace PCs are symbolized to readable stack frames by the Go pprof tool, provided that it has access to the executable file. If the profiler is initiated remotely, the Go pprof tool sends a request containing all PCs found in the profile to the running program, expecting the corresponding stack frame information in return. If the profiler is initiated by the StackImpact Go agent, the PCs are symbolized in the same process using the runtime.FuncForPC function, and the profile is then reported to the Dashboard.
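In-process symbolization can be sketched as follows. The example uses runtime.CallersFrames, a standard library relative of runtime.FuncForPC that also handles inlined calls; this is a substitution made for the illustration and not necessarily what any particular agent does. The frameInfo type and the captureStack and symbolize helpers are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// frameInfo is a hypothetical, simplified view of one symbolized frame.
type frameInfo struct {
	Function string
	File     string
	Line     int
}

// captureStack returns the raw PCs of the caller's stack.
func captureStack() []uintptr {
	pcs := make([]uintptr, 32)
	// skip=2 skips runtime.Callers and captureStack itself.
	n := runtime.Callers(2, pcs)
	return pcs[:n]
}

// symbolize converts raw PCs into function names, files and line numbers.
func symbolize(pcs []uintptr) []frameInfo {
	var out []frameInfo
	if len(pcs) == 0 {
		return out
	}
	frames := runtime.CallersFrames(pcs)
	for {
		f, more := frames.Next()
		out = append(out, frameInfo{Function: f.Function, File: f.File, Line: f.Line})
		if !more {
			break
		}
	}
	return out
}

func main() {
	for _, f := range symbolize(captureStack()) {
		fmt.Printf("%s\n\t%s:%d\n", f.Function, f.File, f.Line)
	}
}
```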

The end result is that we now know how many times each stack trace was seen executing on the CPU. Now this set of stack traces with corresponding hit counts can be represented in many different ways: as a call graph, flame graph, top calls, and more.

Memory allocation profiler

The memory allocation profiler samples heap allocations and makes it possible to see which function calls allocate how much memory and how often. Tracking every allocation and unwinding a stack trace would be extremely expensive; therefore, a sampling technique is used.

To sample only a fraction of allocations, the sampling relies on random number generation that follows an exponential distribution. The generated numbers define the distance between samples in terms of allocated memory size. Only allocations that cross the next random sampling point are sampled.

To control the fraction of allocations that are sampled, i.e. the average sampling distance, a sampling rate is used. The sampling rate defines the mean of the exponential distribution, with a default value of 512 KB, specified by the runtime.MemProfileRate variable.

The CDF of the exponential distribution, F(x) = 1 - e^(-λx), is used to generate the sampling distance, where 1/λ is the mean and x is the distance. x is calculated by setting the CDF equal to a uniformly distributed random number u in [0, 1) and solving for x, which gives x = -ln(1 - u) / λ. The value of x is then used as the random sampling distance.

Since the memory allocation profiler is always active, the list of stack traces with the corresponding sampled allocation info can be requested at any time. Dividing each sampled allocation size by its sampling probability, calculated using the CDF, scales the sampled allocations as if no sampling had been applied and all allocations had been counted.
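The scaling step can be sketched as below, assuming the sampling probability of an allocation of a given size is 1 - e^(-size/rate), as the CDF suggests. The function name scaleSample and the example numbers are invented for the illustration; the real pprof code may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

// scaleSample estimates the true allocation count and size represented by
// the sampled values for one stack trace. rate is the mean sampling
// distance in bytes.
func scaleSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// Every allocation was sampled, so no scaling is needed.
		return count, size
	}
	avgSize := float64(size) / float64(count)
	// Probability that an allocation of avgSize bytes gets sampled.
	p := 1 - math.Exp(-avgSize/float64(rate))
	scale := 1 / p
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	const rate = 512 * 1024
	// Ten sampled 64-byte allocations stand for many unsampled ones.
	count, size := scaleSample(10, 10*64, rate)
	fmt.Printf("estimated %d allocations, %d bytes\n", count, size)
}
```

Small allocations are rarely sampled, so each sample is scaled up heavily, while allocations much larger than the rate are sampled almost always and are left nearly unchanged.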

Similar to the CPU profile, once the stack traces with corresponding allocated memory size and allocation count are ready, they will be symbolized to readable stack frames with source file names, line numbers and function names.

Block profiler

The block profiler samples blocking calls, whether they are channel, mutex, network or file system waits. It is useful for understanding where in the program the most waiting happens. It is even more useful when multiple profiles are compared, showing exactly which call has slowed down.

Again, it would be extremely inefficient to time all blocking calls; therefore, sampling is applied. The profiler always samples calls that take longer than the specified sampling rate, and samples shorter calls only with some probability. The rate is set using the runtime.SetBlockProfileRate function and defines the average distance between samples in nanoseconds. In other words, the profiler aims to sample one call per period of blocked time specified by the rate.

This is achieved by testing whether the call duration exceeds the rate, or whether a random number modulo the rate is no greater than the call duration. In other words, the probability of sampling a call grows linearly with its duration, reaching 100% for calls equal to or longer than the rate.

The profiler is started by setting the rate to a value greater than 0 and stopped by setting it to a value less than or equal to 0. The profile accumulates over the periods in which the profiler is active and can be requested at any time. Once the stack traces with corresponding wait times and call counts are ready, they are symbolized to readable stack frames with source file names, line numbers and function names.