Docs

Overview

StackImpact is a performance profiling and monitoring service for production Go (Golang) applications. It gives developers continuous visibility with line-of-code precision into application performance, such as CPU, memory and I/O hot spots as well as execution bottlenecks, allowing them to optimize applications and troubleshoot issues before they impact customers.

Features

  • Automatic hot spot profiling for CPU, memory allocations, network, system calls and lock contention.
  • Automatic bottleneck tracing for HTTP handlers and HTTP clients.
  • Error and panic monitoring.
  • Health monitoring including CPU, memory, garbage collection and other runtime metrics.
  • Alerts on hot spot anomalies.
  • Multiple account users for team collaboration.

Learn more on the features page (with screenshots).

The StackImpact agent reports performance information to the Dashboard running as SaaS.

Supported platforms and languages

Linux, OS X and Windows. Go version 1.5+.

Getting started with Go profiling

Create StackImpact account

Sign up for a free account at stackimpact.com.

Installing the agent

Install the Go agent by running

go get github.com/stackimpact/stackimpact-go

Then import the package github.com/stackimpact/stackimpact-go in your application.
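
For reference, the import looks like this; the package is referred to as stackimpact in code, as shown in the basic example below:

import "github.com/stackimpact/stackimpact-go"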

Configuring the agent

Initialization

Start the agent by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

agent := stackimpact.NewAgent()
agent.Start(stackimpact.Options{
  AgentKey: "agent key here",
  AppName: "MyGoApp",
})

Other initialization options:

  • AppVersion (Optional) Sets application version, which can be used to associate profiling information with the source code release.
  • AppEnvironment (Optional) Used to differentiate applications in different environments.
  • HostName (Optional) Overrides the reported host name; by default, the OS hostname is used.
  • Debug (Optional) Enables debug logging.
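
Putting the options together, an initialization using the optional settings could look like the following sketch; the values shown are placeholders:

agent := stackimpact.NewAgent()
agent.Start(stackimpact.Options{
  AgentKey: "agent key here",
  AppName: "MyGoApp",
  AppVersion: "1.0.0",
  AppEnvironment: "production",
  HostName: "web-1", // placeholder; overrides the OS hostname
  Debug: true, // enables debug logging
})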

Basic example

package main

import (
    "fmt"
    "net/http"

    "github.com/stackimpact/stackimpact-go"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello world!")
}

func main() {
    agent := stackimpact.NewAgent()
    agent.Start(stackimpact.Options{
        AgentKey: "agent key here",
        AppName: "Basic Go Server",
        AppVersion: "1.0.0",
        AppEnvironment: "production",
    })

    // Use MeasureHandlerFunc or MeasureHandler to additionally measure HTTP request execution time.
    http.HandleFunc(agent.MeasureHandlerFunc("/", handler))
    http.ListenAndServe(":8080", nil)
}

Measuring code segments

The use of the Segment API is optional.

To measure the execution time of arbitrary parts of the application, the Segment API can be used. The agent continuously watches segment execution time and initiates profiling when anomalies are detected.

// Starts measurement of execution time of a code segment.
// To stop measurement, call Stop on the returned Segment object.
// After calling Stop, the segment is recorded, aggregated and
// reported with regular intervals.
segment := agent.MeasureSegment("Segment1")
defer segment.Stop()

// A helper function to measure HTTP handler execution by wrapping http.Handle method parameters.
// Usage example:
//   http.Handle(agent.MeasureHandler("/some-path", someHandler))
pattern, wrappedHandler := agent.MeasureHandler(pattern, handler)

// A helper function to measure HTTP handler function execution by wrapping http.HandleFunc method parameters.
// Usage example:
//   http.HandleFunc(agent.MeasureHandlerFunc("/some-path", someHandlerFunc))
pattern, wrappedHandlerFunc := agent.MeasureHandlerFunc(pattern, handlerFunc)
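
As an illustration of MeasureSegment, a segment can wrap an arbitrary block of work; the processBatch function below is hypothetical, and agent is the started agent from the basic example:

func processBatch(items []string) {
    // Measure the whole batch as one segment.
    segment := agent.MeasureSegment("ProcessBatch")
    defer segment.Stop()

    for _, item := range items {
        // ... application-specific work for each item ...
        _ = item
    }
}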

Monitoring errors

The use of the Error API is optional.

To monitor errors and panics with stack traces, the error recording API can be used.

// Aggregates and reports errors with regular intervals.
agent.RecordError(someError)

// Aggregates and reports panics with regular intervals.
defer agent.RecordPanic()

// Aggregates and reports panics with regular intervals. This function also
// recovers from panics.
defer agent.RecordAndRecoverPanic()
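
For illustration, these calls could be combined in a worker goroutine as in the hypothetical sketch below; doSomething and its error are placeholders, and agent is the started agent from the basic example:

func worker() {
    // Report and recover from any panic in this goroutine.
    defer agent.RecordAndRecoverPanic()

    // doSomething is a hypothetical operation that may fail.
    if err := doSomething(); err != nil {
        // The error is aggregated and reported with regular intervals.
        agent.RecordError(err)
    }
}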

Analyzing performance data in the Dashboard

Once your application is restarted, you can start observing regular and anomaly-triggered CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

Troubleshooting

To enable debug logging, add Debug: true to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.

Reference

Hot spot profiling

Profile recording and reporting by the agent

Each profiling report represents a series of profiles recorded by the agent regularly or on application activity and anomalies, such as a rapid change in a metric relevant for the profile type. Regular recording intervals are normally a few minutes long. The recording duration of a profile is limited to a few seconds, depending on the profile type and the associated overhead.

Historical profile grouping

Reports are shown for a selected application and time frame. The default view is Timeline, which presents profiles from multiple subsequent sources, e.g. machines or containers, in a single time sequence. If multiple sources report profiles simultaneously, e.g. when the application is scaled to multiple machines or containers, not all profiles will be visible; it is also possible to select a particular source only.

Additionally, a timeframe can be selected to filter recorded profiles.

Profile history

The profile chart shows a key measurement, e.g. total or max, for each recorded profile over time. By clicking on the measurement point in the chart, a profile for the selected time point will be shown as an expandable call tree. Every call (stack frame) in the call tree shows its own share of the total measurement as well as the trend based on previous values of the call found in previous profiles.

Profile context

The profile context, which is displayed as a list of tags, reflects the application environment and the state at which the profile was recorded. The following entries are possible:

  • Host name – the host name of the host, instance or container the application is running on. The value is obtained from the system.
  • Runtime type – a language or platform, e.g. Go.
  • Runtime version – a version of the language or platform.
  • Application version – can be defined in the agent initialization statement by the developer.
  • Build ID – a prefix of an SHA1 value of the program.
  • Run ID – a unique ID for every (re)start of the application.
  • Agent version – version of the agent that recorded the profile.

CPU usage profile

CPU usage profiles are recorded by Go’s built-in sampling profiler. In the Dashboard, a profile is represented by a call tree with nodes corresponding to function calls. Each node’s value represents the percentage of absolute time the call was executing while the profile was being recorded. The percentage is a best-effort estimate of absolute execution time, calculated from the number of cores available to the process and the profiler's sampling rate. Additionally, the number of samples for each call is provided.

There can be many root causes of high CPU usage. Some of them are:

  • Algorithm complexity, i.e. the code has high time complexity. For example, it performs exponentially more steps relative to the size of the data it is processing.
  • Extensive garbage collection caused by too many objects being allocated and released.
  • Infinite or tight loops.

Memory allocation profile

Memory profiles are recorded by reading the current heap allocation statistics. Each node in the memory allocation profile call tree represents a line of code where memory was allocated and not released after garbage collection. The value of a node is the number of bytes allocated by a function call or by some of its nested calls, which allocate memory at some point. If a node has multiple child nodes, the node's value is the total of its children's values. The number of samples, which is shown next to the allocated size, corresponds to the number of allocated objects.

A single profile is not a good indicator of a memory leak, since memory can be released shortly after the allocation statistics are read. A better indication of a memory leak is a continuous increase of allocated memory at a single call node relative to its previous readings. Different types of memory leaks may manifest themselves over different timeframes.

Memory leaks can have different root causes. Some of them are:

  • An allocated object stays reachable through a pointer longer than intended, e.g. the pointer has a wider scope than necessary (see the sketch after this list).
  • A pointer is assigned to another pointer that is never released, similar to the previous point.
  • Unintended allocation of memory, e.g. in a loop.
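
As a rough illustration of the first point, the hypothetical Go snippet below keeps every allocation reachable from a long-lived package-level variable, so the memory is never garbage collected:

// cache lives for the lifetime of the process, so everything
// appended to it stays reachable and is never garbage collected.
var cache [][]byte

func handleRequest(payload []byte) {
    // Unintended retention: a copy of every payload is kept,
    // so allocated memory grows with the number of requests.
    buf := make([]byte, len(payload))
    copy(buf, payload)
    cache = append(cache, buf)
}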

Network, system, lock and channel wait profiles

Wait time profiles represent a call tree, where each node is a function call that waits for an event. The value is the aggregated waiting time of a call during one second. It can be greater than one second, because the same call can wait for an event simultaneously in different goroutines/threads. Events can be network reads and writes, system calls and mutex waits as well as channel synchronization. The number of samples, which is shown next to the wait time, corresponds to the number of executions of the function call.

Bottleneck profiling

Bottleneck profiles are recorded, reported and represented identically to hot spot profiles, except that the values of function calls in the call trees are not aggregate values over time, but 95th percentiles of the call execution times during the recording period. The number of samples is the count of call executions seen by the profiler.

Unlike hot spot profiles, bottleneck profiles are built around a specific type of functionality, for example HTTP handlers. The call tree node values are the times the calls spent waiting on blocking events such as I/O, system calls, mutexes, etc.

HTTP handler profile

The HTTP handler bottleneck profile includes all function calls related to HTTP request execution that were waiting for some blocking event.

HTTP client profile

The HTTP client bottleneck profile includes all function calls related to outgoing HTTP client requests that were waiting for some blocking event.

Database client profile

The database client bottleneck profile includes all function calls related to database client commands that were waiting for some blocking event. Currently supported clients are SQL packages that implement the standard interfaces, as well as the most popular MongoDB and Redis packages.

Segment measurements

Measuring the execution time of custom code segments is possible using the agent's API. The agent will aggregate and report the execution time of a segment, which is the 95th percentile of all instances of the same segment during a 60-second time period.

A helper wrapper for measuring the execution time of HTTP handler functions is also available.

Measurement charts will be available in the Dashboard's Bottlenecks section under Segments.

Error monitoring

The agent provides an API for reporting errors. When used, the Events -> Errors section will contain error profile reports for different types of errors. Each report is a collection of error profiles for a sequence of sources (Timeline), e.g. hosts or containers, or for a single source over an adjustable period of time.

A chart shows the number of total errors of a particular type over time. Clicking on a point will select an error profile corresponding to the selected time.

An error profile is a call tree, where each branch is an error stack trace and each node is an error stack frame. The value of a node indicates the number of times an error has occurred in a 60-second time period.

Health monitoring

The agent reports various metrics related to application execution, runtime, operating system, etc. Measurements are taken every 60 seconds. The following metrics are supported for Golang applications (a sketch of related Go runtime APIs follows the list):

  • CPU
    • CPU usage – percentage of total time the CPU was busy. This is a best effort to calculate absolute CPU usage based on the number of cores available to the process.
    • CPU time – similar to CPU usage, but not converted to percentage.
  • Memory
    • Allocated memory – total number of bytes allocated and not garbage collected.
    • Mallocs – number of malloc operations per measurement interval.
    • Frees – number of free operations per measurement interval.
    • Lookups – number of pointer lookup operations per measurement interval.
    • Heap objects – number of heap objects.
    • Heap non-idle – heap space in bytes currently used.
    • Heap idle – heap space in bytes currently unused.
    • Current RSS – resident set size, i.e. the portion of process memory held in RAM.
    • Max RSS – peak resident set size during application execution.
  • Garbage collection
    • Number of GCs – number of garbage collection cycles per measurement interval.
    • GC CPU fraction – fraction of CPU used by garbage collection.
    • GC total pause – amount of time garbage collection took during measurement interval.
  • Runtime
    • Number of goroutines – number of currently running goroutines.
    • Number of cgo calls – number of cgo calls made per measurement interval.
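
Many of these values correspond to counters exposed by Go's standard runtime package. The sketch below only illustrates how comparable numbers can be read directly; it is not the agent's implementation, and the agent additionally converts cumulative counters into per-interval values:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    fmt.Println("Allocated memory:", m.Alloc)   // bytes allocated and not yet garbage collected
    fmt.Println("Mallocs:", m.Mallocs)          // cumulative malloc count
    fmt.Println("Frees:", m.Frees)              // cumulative free count
    fmt.Println("Heap objects:", m.HeapObjects) // number of live heap objects
    fmt.Println("Heap idle:", m.HeapIdle)       // idle heap spans, in bytes
    fmt.Println("Number of GCs:", m.NumGC)      // cumulative GC cycles
    fmt.Println("Goroutines:", runtime.NumGoroutine())
    fmt.Println("Cgo calls:", runtime.NumCgoCall())
}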

Anomaly alerts

Whenever there is an anomaly in a hot spot profile, a notification can be sent to an email address, webhook URL or Slack Incoming Webhook.

Anomalies are detected by comparing the last few profiles with the baseline of the same profile over the last 24-hour interval.

Agent overhead

The agent overhead is measured to be less than 1% for applications under high load.