Docs

Overview

StackImpact is a performance profiler for production applications. It gives developers a continuous and historical view of application performance with line-of-code precision, including CPU, memory allocation and blocking call hot spots, as well as execution bottlenecks, errors and runtime metrics.

Features

  • Continuous hot spot profiling for CPU, memory allocations, network, system calls and lock contention.
  • Continuous bottleneck tracing for HTTP handlers and HTTP clients.
  • Error and panic monitoring.
  • Health monitoring including CPU, memory, garbage collection and other runtime metrics.
  • Alerts on hot spot anomalies.
  • Multiple account users for team collaboration.

Learn more on the features page (with screenshots).

The StackImpact agent reports performance information to the Dashboard, which runs as SaaS.

Getting started with Go profiling

Create StackImpact account

Sign up for a free account at stackimpact.com.

Supported environment

Linux, OS X and Windows. Go version 1.5+.

Installing the agent

Install the Go agent by running

go get github.com/stackimpact/stackimpact-go

And import the package github.com/stackimpact/stackimpact-go into your application.

Configuring the agent

Initialization

Start the agent by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

agent := stackimpact.Start(stackimpact.Options{
    AgentKey: "agent key here",
    AppName: "MyGoApp",
})

All initialization options:

  • AgentKey (Required) The access key for communication with the StackImpact servers.
  • AppName (Required) A name to identify and group application data. Typically, a single codebase, deployable unit or executable module corresponds to one application.
  • AppVersion (Optional) Sets application version, which can be used to associate profiling information with the source code release.
  • AppEnvironment (Optional) Used to differentiate applications in different environments.
  • HostName (Optional) Overrides the host name reported by the agent; by default, the OS hostname is used.
  • ProxyAddress (Optional) Proxy server URL to use when connecting to the Dashboard servers.
  • AutoProfiling (Optional) If set to false, disables the default automatic profiling and reporting; agent.Profile() should be used instead (see the sketch after this list). Useful for environments without support for timers or background tasks.
  • Debug (Optional) Enables debug logging.
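
A minimal sketch of the last two options, assuming AutoProfiling and Debug are boolean fields as listed above; with automatic profiling disabled, profiling is triggered manually via agent.Profile():

agent := stackimpact.Start(stackimpact.Options{
    AgentKey:      "agent key here",
    AppName:       "MyGoApp",
    AutoProfiling: false, // disable automatic profiling and reporting
    Debug:         true,  // enable debug logging
})

// Profile the relevant execution interval manually.
span := agent.Profile()
// ... code to be profiled ...
span.Stop()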

Basic example

package main

import (
    "fmt"
    "net/http"

    "github.com/stackimpact/stackimpact-go"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello world!")
}

func main() {
    agent := stackimpact.Start(stackimpact.Options{
        AgentKey: "agent key here",
        AppName: "Basic Go Server",
        AppVersion: "1.0.0",
        AppEnvironment: "production",
    })

    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Manual profiling

The use of manual profiling is optional.

Manual profiling is suitable for repeating code, such as request or event handlers. By default, the agent starts and stops profiling automatically. In order to make sure the agent profiles the most relevant execution intervals, the agent.Profile() method can be used.

// Use this method to instruct the agent to start and stop
// profiling. It does not guarantee that any profiler will be
// started. The decision is made by the agent based on the
// overhead constraints. The method returns a Span object, on
// which the Stop() method should be called.
span := agent.Profile()
defer span.Stop()
// A helper function to profile HTTP handler execution by wrapping 
// http.Handle method parameters.
// Usage example:
//   http.Handle(agent.ProfileHandler("/some-path", someHandler))
pattern, wrappedHandler := agent.ProfileHandler(pattern, handler)
// A helper function to profile HTTP handler function execution 
// by wrapping http.HandleFunc method parameters.
// Usage example:
//   http.HandleFunc(agent.ProfileHandlerFunc("/some-path", someHandlerFunc))
pattern, wrappedHandlerFunc := agent.ProfileHandlerFunc(pattern, handlerFunc)
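
Putting this together, a minimal sketch of manual profiling around a repeating unit of work; the jobs channel, Job type and process function are hypothetical placeholders, and the *stackimpact.Agent parameter type is an assumption about what stackimpact.Start returns:

func worker(agent *stackimpact.Agent, jobs <-chan Job) {
    for job := range jobs {
        // Ask the agent to profile this iteration; it decides whether
        // any profiler is actually started.
        span := agent.Profile()
        process(job)
        span.Stop()
    }
}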

Measuring code segments

The use of Segment API is optional.

To measure the execution time of arbitrary parts of the application, the Segment API can be used. The agent continuously watches segment execution time and initiates profiling when anomalies are detected.

// Starts measurement of execution time of a code segment.
// To stop measurement, call Stop on returned Segment object.
// After calling Stop, the segment is recorded, aggregated and
// reported with regular intervals.
segment := agent.MeasureSegment("Segment1")
defer segment.Stop()
// A helper function to measure HTTP handler execution by wrapping http.Handle method parameters.
// Usage example:
//   http.Handle(agent.MeasureHandler("/some-path", someHandler))
pattern, wrappedHandler := agent.MeasureHandler(pattern, handler)
// A helper function to measure HTTP handler function execution by wrapping http.HandleFunc method parameters.
// Usage example:
//   http.HandleFunc(agent.MeasureHandlerFunc("/some-path", someHandlerFunc))
pattern, wrappedHandlerFunc := agent.MeasureHandlerFunc(pattern, handlerFunc)
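
For illustration, a minimal sketch that measures only one step of a handler; queryDatabase is a hypothetical placeholder and agent is the value returned by stackimpact.Start:

func handler(w http.ResponseWriter, r *http.Request) {
    // Measure only the database step of this handler.
    segment := agent.MeasureSegment("Database query")
    result := queryDatabase(r.Context())
    segment.Stop()

    fmt.Fprintf(w, "%v", result)
}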

Monitoring errors

The use of Error API is optional.

To monitor errors and panics with stack traces, the error recording API can be used.

// Aggregates and reports errors with regular intervals.
agent.RecordError(someError)
// Aggregates and reports panics with regular intervals.
defer agent.RecordPanic()
// Aggregates and reports panics with regular intervals. This function also
// recovers from panics.
defer agent.RecordAndRecoverPanic()
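
A minimal sketch combining these calls in an HTTP handler; doWork is a hypothetical placeholder and agent is the value returned by stackimpact.Start:

func handler(w http.ResponseWriter, r *http.Request) {
    // Report and recover from any panic raised in this handler.
    defer agent.RecordAndRecoverPanic()

    if err := doWork(r); err != nil {
        // Aggregate and report the error.
        agent.RecordError(err)
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    fmt.Fprintf(w, "OK")
}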

Analyzing performance data in the Dashboard

Once your application is restarted, you can start observing regular and anomaly-triggered CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

Troubleshooting

To enable debug logging, add Debug: true to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.

Getting started with Node.js profiling

Create StackImpact account

Sign up for a free account at stackimpact.com.

Supported environment

  • Linux, OS X or Windows. Node.js v4.0.0 or higher.
  • CPU profiler is disabled by default for Node.js v7.0.0 and higher due to a memory leak in the underlying V8 CPU profiler. To enable it, add cpuProfilerDisabled: false to startup options.
  • Allocation profiler supports Node.js v6.0.0 and higher. The allocation profiler is disabled by default, since V8's heap sampling is still experimental and has been seen to result in segmentation faults. To enable it, add allocationProfilerDisabled: false to startup options.
  • Async profiler supports Node.js v8.1.0 and higher.

Installing the agent

Install the Node.js agent by running

npm install stackimpact

And import the package in your application

const stackimpact = require('stackimpact');

Configuring the agent

Start the agent in the main thread by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

let agent = stackimpact.start({
  agentKey: 'agent key here',
  appName: 'MyNodejsApp'
});

All initialization options:

  • agentKey (Required) The API key for communication with the StackImpact servers.
  • appName (Required) A name to identify and group application data. Typically, a single codebase, deployable unit or executable module corresponds to one application.
  • appVersion (Optional) Sets application version, which can be used to associate profiling information with the source code release.
  • appEnvironment (Optional) Used to differentiate applications in different environments.
  • hostName (Optional) Overrides the host name reported by the agent; by default, the OS hostname is used.
  • autoProfiling (Optional) If set to false, disables automatic profiling and reporting. agent.profile() should be used instead. Useful for environments without support for timers or background tasks.
  • debug (Optional) Enables debug logging.
  • cpuProfilerDisabled, allocationProfilerDisabled, asyncProfilerDisabled, errorProfilerDisabled (Optional) Disables respective profiler when true.
  • includeAgentFrames (Optional) Set to true to include agent stack frames in profiles.

Manual profiling

Optional

Use agent.profile() to instruct the agent when to start and stop profiling. The agent decides whether and which profiler is activated. Normally, this method should be used in repeating code, such as request or event handlers. If autoProfiling is disabled, this method will also periodically report the profiling data to the Dashboard. Usage example:

const span = agent.profile();

// your code here

span.stop(() => {
    // stopped
});

If no callback is provided, the stop() method returns a promise.

Shutting down the agent

Use agent.destroy() to stop the agent if necessary, e.g. to allow the application to exit.

Analyzing performance data in the Dashboard

Once your application is restarted, you can start observing continuous CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

Troubleshooting

To enable debug logging, add debug: true to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.

Getting started with Python profiling

Create StackImpact account

Sign up for a free account at stackimpact.com.

Supported environment

  • Linux, OS X or Windows. Python version 2.7, 3.4 or higher.
  • Memory allocation profiler and some GC metrics are only available for Python 3.
  • CPU and Time profilers only support Linux and OS X.
  • Time (blocking call) profiler supports threads and gevent.

Installing the agent

Install the Python agent by running

pip install stackimpact

And import the package in your application

import stackimpact

Configuring the agent

Start the agent in the main thread by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

agent = stackimpact.start(
    agent_key = 'agent key here',
    app_name = 'MyPythonApp')

All initialization options:

  • agent_key (Required) The access key for communication with the StackImpact servers.
  • app_name (Required) A name to identify and group application data. Typically, a single codebase, deployable unit or executable module corresponds to one application.
  • app_version (Optional) Sets application version, which can be used to associate profiling information with the source code release.
  • app_environment (Optional) Used to differentiate applications in different environments.
  • host_name (Optional) Overrides the host name reported by the agent; by default, the OS hostname is used.
  • debug (Optional) Enables debug logging.
  • cpu_profiler_disabled, allocation_profiler_disabled, block_profiler_disabled, error_profiler_disabled (Optional) Disables respective profiler when True.
  • include_agent_frames, include_system_frames (Optional) Set to True to include agent and/or system stack frames in profiles.
  • auto_destroy (Optional) Set to False to disable the agent's exit handlers. If necessary, call destroy() to gracefully shut down the agent.

Analyzing performance data in the Dashboard

Once your application is restarted, you can start observing continuous CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

Troubleshooting

To enable debug logging, add debug = True to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.

Reference

Hot spot profiling

Profile recording and reporting by the agent

Each profiler report represents a series of profiles recorded continuously by the agent. Depending on the profiler and the associated overhead, the agent schedules the profile recording for optimal results.

Historical profile grouping

Reports are shown for a selected application and time frame. The default view is Timeline, which presents profiles from multiple subsequent sources, e.g. machines or containers, in a single time sequence. If multiple sources report profiles simultaneously, e.g. when the application is scaled to multiple machines or containers, not all profiles will be visible. It is also possible to select a single source.

Additionally, a timeframe can be selected to filter recorded profiles.

Profile history

The profile chart shows a key measurement, e.g. total, max or rate, for each recorded profile over time. By clicking on a measurement point in the chart, the profile for the selected time point is shown as an expandable call tree. Every call (stack frame) in the call tree shows its own share of the total measurement, as well as a trend based on the call's values in previous profiles.

Profile context

The profile context, which is displayed as a list of tags, reflects the application environment and the state at which the profile was recorded. The following entries are possible:

  • Host name – the host name of the host, instance or container the application is running on. The value is obtained from the system.
  • Runtime type – a language or platform, e.g. Go or Python.
  • Runtime version – a version of the language or platform.
  • Application version – can be defined in the agent initialization statement by the developer.
  • Build ID – a prefix of the SHA1 hash of the program.
  • Run ID – a unique ID for every (re)start of the application.
  • Agent version – version of the agent that recorded the profile.

CPU usage profile

CPU usage profiles are recorded by sampling profilers. In the Dashboard, a profile is represented as a call tree with nodes corresponding to function calls. Each node's value represents the percentage of absolute time the call was executing while the profile was being recorded. The percentage is a best-effort estimate of absolute execution time, calculated using the number of cores available to the process and the profiler's sampling rate. Additionally, the number of samples for each call is provided.
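
As a rough illustration of this calculation (the exact formula is an assumption here, not taken from the Dashboard): if a call is present in 200 samples taken at a 10 ms sampling interval during a 10-second recording on a process with 4 available cores, its estimated share of absolute execution time is 200 × 10 ms / (10 s × 4) = 5%.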

There can be many root causes of high CPU usage. Some of them are:

  • Algorithmic complexity, i.e. the code has high time complexity. For example, it performs exponentially more steps as the size of the data it processes grows.
  • Extensive garbage collection caused by too many objects being allocated and released.
  • Infinite or tight loops.

Memory allocation profile

Memory profiles are recorded by reading current heap allocation statistics. Each node in the memory allocation profile call tree represents a line of code where memory was allocated and not yet released after garbage collection. The value of the node is the number of bytes allocated by a function call or by some of its nested calls. If a node has multiple child nodes, the node's value is the total of its children's values. The number of samples, which is shown next to the allocated size, corresponds to the number of allocated objects. Some agents, e.g. the Python agent, report the allocation rate.

A single profile is not a good indicator of a memory leak, since memory can be released shortly after the allocation statistics were read. A better indication of a memory leak is a continuous increase of allocated memory at a single call node relative to its previous readings. Different types of memory leaks may manifest themselves over different timeframes.

Memory leaks can have different root causes. Some of them are:

  • The pointer to which an object is assigned after allocation stays unreleased, e.g. it has the wrong scope.
  • A pointer is assigned to another pointer that is not released, similar to the previous point.
  • Unintended allocation of memory, e.g. in a loop (illustrated in the sketch below).
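
For illustration, a minimal Go sketch of the last pattern, where a long-lived package-level slice (hypothetical here) keeps every allocation reachable, so it is never garbage collected:

var cache [][]byte // long-lived package-level variable (hypothetical)

func handleRequest(payload []byte) {
    // Each call allocates a copy and appends it to the long-lived slice;
    // the copies stay reachable and are never garbage collected.
    buf := make([]byte, len(payload))
    copy(buf, payload)
    cache = append(cache, buf)
}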

Time profile

Blocking or async call profiles represent a call tree, where each node is a function call that waits for an event. The value is the aggregated waiting time of a call during one second. It can be greater than one second, because the same call can wait for events in parallel. Events can be network reads and writes, system calls, mutex waits, etc. The number of samples, which is shown next to the wait time, corresponds to the number of executions of the sampled function calls.

Bottleneck profiling

Bottleneck profiles are recorded, reported and represented identically to hot spot profiles, except that the values of function calls represent execution duration percentiles or averages. Depending on the profile, the duration can represent a blocking or asynchronous operation.

Segment measurements

Only available in Go

Measuring the execution time of custom code segments is possible using the agent's API. The agent will aggregate and report the execution time of a segment, which is the 95th percentile of all instances of the same segment during a 60-second time period.

A helper wrapper for measuring the execution time of HTTP handler functions is also available.

Measurement charts will be available in the Dashboard's Bottlenecks section under Segments.

Error monitoring

The agent provides an API for reporting errors. When used, the Errors section will contain error profile reports for different types of errors. Each report is a collection of error profiles for a sequence of sources (Timeline), e.g. hosts or containers, or for a single source, over an adjustable period of time.

A chart shows the number of total errors of a particular type over time. Clicking on a point will select an error profile corresponding to the selected time.

An error profile is a call tree, where each node is an error stack frame. The value of a node indicates the number of times an error has occurred in a 60-second time period.

Health monitoring

The agents report various metrics related to application execution, runtime, operating system, etc. Measurements are taken every 60 seconds.

Go application metrics

  • CPU
    • CPU usage – percentage of total time the CPU was busy. This is a best effort to calculate absolute CPU usage based on the number of cores available to the process.
    • CPU time – similar to CPU usage, but not converted to percentage.
  • Memory
    • Allocated memory – total number of bytes allocated and not garbage collected.
    • Mallocs – number of malloc operations per measurement interval.
    • Frees – number of free operations per measurement interval.
    • Lookups – number of pointer lookup operations per measurement interval.
    • Heap objects – number of heap objects.
    • Heap non-idle – heap space in bytes currently used.
    • Heap idle – heap space in bytes currently unused.
    • VM Size – virtual memory size.
    • Current RSS – resident set size that is the portion of process memory held in RAM.
    • Max RSS – peak resident set size during application execution.
  • Garbage collection
    • Number of GCs – number of garbage collection cycles per measurement interval.
    • GC CPU fraction – fraction of CPU used by garbage collection.
    • GC total pause – amount of time garbage collection took during measurement interval.
  • Runtime
    • Number of goroutines – number of currently running goroutines.
    • Number of cgo calls – number of cgo calls made per measurement interval.

Node.js application metrics

  • CPU
    • CPU usage – a percentage of total time the CPU was busy.
  • Memory
    • Total heap size – total heap size, including idle size.
    • Used heap size – a set of metrics representing heap size as well as heap space sizes for code space, new space, old space, map space and large objects.
    • C++ objects – memory usage by C++ objects bound to JavaScript objects.
    • RSS – resident set size that is the portion of process memory held in RAM.
  • Garbage collection
    • GC cycles – number of garbage collection cycles.
    • GC time – time spent performing garbage collection.
  • Runtime
    • Event loop I/O stage – time spent in event loop I/O stage.
    • Event loop ticks – number of event loop ticks.

Python application metrics

  • CPU
    • CPU usage – percentage of total time the CPU was busy. This is a best effort to calculate absolute CPU usage based on the number of cores available to the process.
    • CPU time – similar to CPU usage, but not converted to percentage.
  • Memory
    • VM Size – virtual memory size.
    • Current RSS – resident set size that is a portion of process memory held in RAM.
    • Max RSS – peak resident set size during application execution.
  • Garbage collection
    • Collected objects – number of objects collected by the garbage collector (Python 3).
    • Uncollected objects – number of objects not yet collected.
    • Uncollectable objects – number of objects that cannot be collected (Python 3).
    • Collections – number of garbage collection cycles (Python 3).
  • Runtime
    • Active threads – number of active threads.

Application footprint

The Footprint section gives a cross-application view of resource consumption, broken down by application, showing how much CPU and memory each application, including all of its processes, consumes over time.

The CPU footprint is calculated by multiplying the number of application processes by a single process's CPU usage, expressed as a percentage of the total infrastructure.

The memory footprint is calculated by multiplying the number of application processes by the process RSS.
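
For example, following the two formulas above (the numbers are made up for illustration): an application scaled to 8 processes, where each process uses 3% of the total infrastructure CPU and holds 512 MB RSS, has a CPU footprint of 8 × 3% = 24% and a memory footprint of 8 × 512 MB = 4 GB.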

Anomaly detection

StackImpact continuously observes the hot spot and error profiles reported by the agents from each application in order to detect changes that are worth looking at.

An anomaly alert notification can be sent to an endpoint when an anomaly is detected. Alert endpoints can be added in the Configuration section. An endpoint can be an email address, a webhook URL or a Slack Incoming Webhook.

Agent overhead

The agent overhead is measured to be less than 1% for applications under high load. For applications that are horizontally scaled to multiple processes, StackImpact agents are only active on a small, adjustable subset of the processes at any point in time, so the total overhead is much lower.