Docs

    Overview

    StackImpact is a performance profiler for production applications. It gives developers a continuous and historical view of application performance with line-of-code precision, which includes CPU, memory allocation and blocking call hot spots, as well as execution bottlenecks, errors and runtime metrics.

    Features

    • Automatic hot spot profiling for CPU, memory allocations, network, system calls and lock contention.
    • Automatic bottleneck tracing for HTTP handlers and HTTP clients.
    • Error and panic monitoring.
    • Health monitoring including CPU, memory, garbage collection and other runtime metrics.
    • Alerts on hot spot anomalies.
    • Multiple account users for team collaboration.

    Learn more on the features page (with screenshots).

    The StackImpact agent reports performance information to the Dashboard, which runs as SaaS.

    Getting started with Go profiling

    Create StackImpact account

    Sign up for a free account at stackimpact.com.

    Supported environment

    Linux, OS X and Windows. Go version 1.5+.

    Installing the agent

    Install the Go agent by running

    go get github.com/stackimpact/stackimpact-go
    

    And import the package github.com/stackimpact/stackimpact-go into your application.
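
    In code, the import looks like this:

    import "github.com/stackimpact/stackimpact-go"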

    Configuring the agent

    Initialization

    Start the agent by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

    agent := stackimpact.Start(stackimpact.Options{
        AgentKey: "agent key here",
        AppName: "MyGoApp",
    })
    

    All initialization options:

    • AgentKey (Required) The access key for communication with the StackImpact servers.
    • AppName (Required) A name to identify and group application data. Typically, a single codebase, deployable unit or executable module corresponds to one application.
    • AppVersion (Optional) Sets application version, which can be used to associate profiling information with the source code release.
    • AppEnvironment (Optional) Used to differentiate applications in different environments.
    • HostName (Optional) Overrides the reported host name. By default, the OS hostname is used.
    • ProxyAddress (Optional) Proxy server URL to use when connecting to the Dashboard servers.
    • Debug (Optional) Enables debug logging.
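
    For reference, here is a sketch combining all of the options in one initialization call; the key, names, host and proxy values are placeholders:

    agent := stackimpact.Start(stackimpact.Options{
        AgentKey: "agent key here",
        AppName: "MyGoApp",
        AppVersion: "1.0.0",
        AppEnvironment: "production",
        HostName: "web-1", // placeholder; the OS hostname is used by default
        ProxyAddress: "http://proxy.example:3128", // placeholder proxy URL
        Debug: true,
    })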

    Basic example

    package main
    
    import (
        "fmt"
        "net/http"
    
        "github.com/stackimpact/stackimpact-go"
    )
    
    func handler(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello world!")
    }
    
    func main() {
        agent := stackimpact.Start(stackimpact.Options{
            AgentKey: "agent key here",
            AppName: "Basic Go Server",
            AppVersion: "1.0.0",
            AppEnvironment: "production",
        })
    
        // use MeasureHandlerFunc or MeasureHandler to additionally measure HTTP request execution time.
        http.HandleFunc(agent.MeasureHandlerFunc("/", handler)) 
        http.ListenAndServe(":8080", nil)
    }
    

    Measuring code segments

    The use of the Segment API is optional.

    To measure the execution time of arbitrary parts of the application, the Segment API can be used. The agent continuously watches segment execution time and initiates profiling when anomalies are detected.

    // Starts measurement of execution time of a code segment.
    // To stop measurement, call Stop on returned Segment object.
    // After calling Stop, the segment is recorded, aggregated and
    // reported with regular intervals.
    segment := agent.MeasureSegment("Segment1")
    defer segment.Stop()
    
    // A helper function to measure HTTP handler execution by wrapping http.Handle method parameters.
    // Usage example:
    //   http.Handle(agent.MeasureHandler("/some-path", someHandler))
    pattern, wrappedHandler := agent.MeasureHandler(pattern, handler)
    
    // A helper function to measure HTTP handler function execution by wrapping http.HandleFunc method parameters.
    // Usage example:
    //   http.HandleFunc(agent.MeasureHandlerFunc("/some-path", someHandlerFunc))
    pattern, wrappedHandlerFunc := agent.MeasureHandlerFunc(pattern, handlerFunc)
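
    Putting it together, here is a minimal sketch of measuring a unit of work as a named segment; processOrder is an illustrative function name, not part of the API, and the agent value is the one returned by stackimpact.Start:

    func processOrder(agent *stackimpact.Agent) {
        // Measure the execution time of this function as the "ProcessOrder" segment.
        segment := agent.MeasureSegment("ProcessOrder")
        defer segment.Stop()

        // ... application logic to be measured ...
    }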
    

    Monitoring errors

    The use of the Error API is optional.

    To monitor errors and panics with stack traces, the error recording API can be used.

    // Aggregates and reports errors with regular intervals.
    agent.RecordError(someError)
    
    // Aggregates and reports panics with regular intervals.
    defer agent.RecordPanic()
    
    // Aggregates and reports panics with regular intervals. This function also
    // recovers from panics.
    defer agent.RecordAndRecoverPanic()
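
    For example, here is a minimal sketch guarding a goroutine; doWork is a hypothetical function standing in for application logic:

    go func() {
        // Report the panic and recover, so the goroutine doesn't crash the process.
        defer agent.RecordAndRecoverPanic()

        if err := doWork(); err != nil {
            // Aggregate and report the error with its stack trace.
            agent.RecordError(err)
        }
    }()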
    

    Analyzing performance data in the Dashboard

    Once your application is restarted, you can start observing regular and anomaly-triggered CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

    Troubleshooting

    To enable debug logging, add Debug: true to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.
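
    For example, in the Go agent:

    agent := stackimpact.Start(stackimpact.Options{
        AgentKey: "agent key here",
        AppName: "MyGoApp",
        Debug: true,
    })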

    Getting started with Python profiling

    Create StackImpact account

    Sign up for a free account at stackimpact.com.

    Supported environment

    • Linux, OS X or Windows. Python version 2.7, 3.4 or higher.
    • Memory allocation profiler and some GC metrics are only available for Python 3.
    • CPU and Time profilers only support Linux and OS X.
    • Time (blocking call) profiler supports threads and gevent.

    Installing the agent

    Install the Python agent by running

    pip install stackimpact
    

    And import the package in your application

    import stackimpact
    

    Configuring the agent

    Start the agent in the main thread by specifying the agent key and application name. The agent key can be found in your account's Configuration section.

    agent = stackimpact.start(
        agent_key = 'agent key here',
        app_name = 'MyPythonApp')
    

    All initialization options:

    • agent_key (Required) The access key for communication with the StackImpact servers.
    • app_name (Required) A name to identify and group application data. Typically, a single codebase, deployable unit or executable module corresponds to one application.
    • app_version (Optional) Sets application version, which can be used to associate profiling information with the source code release.
    • app_environment (Optional) Used to differentiate applications in different environments.
    • host_name (Optional) Overrides the reported host name. By default, the OS hostname is used.
    • debug (Optional) Enables debug logging.
    • cpu_profiler_disabled, allocation_profiler_disabled, block_profiler_disabled, error_profiler_disabled (Optional) Disables the respective profiler when True.
    • include_agent_frames, include_system_frames (Optional) Set to True to include agent and/or system stack frames in profiles; they are excluded by default.
    • auto_destroy (Optional) Set to False to disable the agent's exit handlers. If necessary, call destroy() to gracefully shut down the agent.
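
    For reference, here is a sketch combining several options in one start call; the key and names are placeholders:

    agent = stackimpact.start(
        agent_key = 'agent key here',
        app_name = 'MyPythonApp',
        app_version = '1.0.0',
        app_environment = 'production',
        host_name = 'web-1', # placeholder; the OS hostname is used by default
        debug = True)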

    Analyzing performance data in the Dashboard

    Once your application is restarted, you can start observing continuous CPU, memory, I/O, and other hot spot profiles, execution bottlenecks as well as process metrics in the Dashboard.

    Troubleshooting

    To enable debug logging, add debug = True to startup options. If the debug log doesn't give you any hints on how to fix a problem, please report it to our support team in your account's Support section.
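
    For example:

    agent = stackimpact.start(
        agent_key = 'agent key here',
        app_name = 'MyPythonApp',
        debug = True)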

    Reference

    Hot spot profiling

    Profile recording and reporting by the agent

    Each profiler report represents a series of profiles recorded by the agent continuously. Depending on the profiler and its associated overhead, the agent schedules profile recording for optimal results.

    Historical profile grouping

    Reports are shown for a selected application and time frame. The default view is Timeline, which presents profiles from multiple subsequent sources, e.g. machines or containers, in a single time sequence. If multiple sources report profiles simultaneously, e.g. when the application is scaled to multiple machines or containers, not all profiles will be visible. It is also possible to select a single source.

    Additionally, a timeframe can be selected to filter recorded profiles.

    Profile history

    The profile chart shows a key measurement, e.g. total, max or rate, for each recorded profile over time. Clicking on a measurement point in the chart shows the profile for the selected time point as an expandable call tree. Every call (stack frame) in the call tree shows its own share of the total measurement, as well as a trend based on the values of the same call in previous profiles.

    Profile context

    The profile context, which is displayed as a list of tags, reflects the application environment and the state at which the profile was recorded. The following entries are possible:

    • Host name – the host name of the host, instance or container the application is running on. The value is obtained from the system.
    • Runtime type – the language or platform, e.g. Go or Python.
    • Runtime version – a version of the language or platform.
    • Application version – can be defined in the agent initialization statement by the developer.
    • Build ID – a prefix of the SHA1 hash of the program.
    • Run ID – a unique ID for every (re)start of the application.
    • Agent version – version of the agent that recorded the profile.

    CPU usage profile

    CPU usage profiles are recorded by sampling profilers. In the Dashboard, a profile is represented as a call tree with nodes corresponding to function calls. Each node’s value is the percentage of absolute time the call was executing while the profile was recorded. The percentage is a best-effort estimate of absolute execution time, calculated using the number of cores available to the process and the profiler's sampling rate. Additionally, the number of samples for each call is provided.

    There can be many root causes of high CPU usage. Some of them are:

    • Algorithmic complexity, i.e. code with high time complexity, e.g. performing exponentially more steps as the size of the processed data grows.
    • Extensive garbage collection caused by too many objects being allocated and released.
    • Infinite or tight loops.

    Memory allocation profile

    Memory profiles are recorded by reading current heap allocation statistics. Each node in the memory allocation profile call tree represents a line of code where memory was allocated and not yet released after garbage collection. The value of a node is the number of bytes allocated by a function call or by some of its nested calls. If a node has multiple child nodes, the node's value is the total of its children's values. The number of samples, which is shown next to the allocated size, corresponds to the number of allocated objects. Some agents, e.g. Python, report the allocation rate.

    A single profile is not a good indicator of a memory leak, since memory can be released shortly after the allocation statistics were read. A better indication of a memory leak is a continuous increase of allocated memory at a single call node relative to its previous readings. Different types of memory leaks may manifest themselves over different timeframes.

    Memory leaks can have different root causes. Some of them are:

    • The pointer to which an object is assigned after allocation stays unreleased, e.g. it has the wrong scope.
    • A pointer is assigned to another pointer that is not released, similar to the previous point.
    • Unintended allocation of memory, e.g. in a loop.

    Time profile

    Blocking or async call profiles represent a call tree, where each node is a function call that waits for an event. The value is the aggregated waiting time of a call during one second. It can be greater than one second, because the same call can wait for events in parallel. Events can be network reads and writes, system calls, mutex waits, etc. The number of samples, which is shown next to the wait time, corresponds to the number of executions of the sampled function calls.

    Bottleneck profiling

    Bottleneck profiles are recorded, reported and represented identically to hot spot profiles, except that the values of function calls represent execution duration percentiles or averages. Depending on the profile, the duration can represent a blocking or asynchronous operation.

    Segment measurements

    Only available in Go

    Measuring the execution time of custom code segments is possible using the agent's API. The agent aggregates and reports the execution time of a segment as the 95th percentile of all instances of the same segment during a 60-second time period.

    A helper wrapper for measuring the execution time of HTTP handler functions is also available.

    Measurement charts will be available in the Dashboard's Bottlenecks section under Segments.

    Error monitoring

    The agent provides an API for reporting errors. When used, the Errors section will contain error profile reports for different types of errors. Each report is a collection of error profiles for a sequence of sources (Timeline), e.g. hosts or containers, or for a single source, over an adjustable period of time.

    A chart shows the number of total errors of a particular type over time. Clicking on a point will select an error profile corresponding to the selected time.

    An error profile is a call tree, where each node is an error stack frame. The value of a node indicates the number of times an error has occurred in a 60-second time period.

    Health monitoring

    The agents report various metrics related to application execution, runtime, operating system, etc. Measurements are taken every 60 seconds.

    Supported metrics for Go applications:

    • CPU
      • CPU usage – percentage of total time the CPU was busy. This is a best effort to calculate absolute CPU usage based on the number of cores available to the process.
      • CPU time – similar to CPU usage, but not converted to percentage.
    • Memory
      • Allocated memory – total number of bytes allocated and not garbage collected.
      • Mallocs – number of malloc operations per measurement interval.
      • Frees – number of free operations per measurement interval.
      • Lookups – number of pointer lookup operations per measurement interval.
      • Heap objects – number of heap objects.
      • Heap non-idle – heap space in bytes currently used.
      • Heap idle – heap space in bytes currently unused.
      • VM Size – virtual memory size.
      • Current RSS – resident set size that is a portion of process memory held in RAM.
      • Max RSS – peak resident set size during application execution.
    • Garbage collection
      • Number of GCs – number of garbage collection cycles per measurement interval.
      • GC CPU fraction – fraction of CPU used by garbage collection.
      • GC total pause – amount of time garbage collection took during measurement interval.
    • Runtime
      • Number of goroutines – number of currently running goroutines.
      • Number of cgo calls – number of cgo calls made per measurement interval.

    Supported metrics for Python applications:

    • CPU
      • CPU usage – percentage of total time the CPU was busy. This is a best effort to calculate absolute CPU usage based on the number of cores available to the process.
      • CPU time – similar to CPU usage, but not converted to percentage.
    • Memory
      • VM Size – virtual memory size.
      • Current RSS – resident set size that is a portion of process memory held in RAM.
      • Max RSS – peak resident set size during application execution.
    • Garbage collection
      • Collected objects – number of objects collected by the garbage collector (Python 3).
      • Uncollected objects – number of objects that are not yet collected.
      • Uncollectable objects – number of objects that cannot be collected (Python 3).
      • Collections – number of garbage collection cycles (Python 3).
    • Runtime
      • Active threads – number of active threads.

    Anomaly detection

    StackImpact continuously observes the hot spot and error profiles reported by the agents from each application in order to detect changes that are worth looking at.

    In case of an anomaly, an alert notification can be sent to an endpoint. Alert endpoints can be added in the Configuration section. The endpoint can be an email address, webhook URL or Slack Incoming Webhook.

    Agent overhead

    The agent overhead is measured to be less than 1% for applications under high load.