Monitoring

Introduction

This page covers the steps needed to improve the observability of an SSV node.

The SSV node is instrumented with OpenTelemetry metrics, and the default exporter is Prometheus. You can use Grafana to explore the monitoring dashboard. We also recommend monitoring your Execution and Beacon nodes separately, for maximum visibility into your operations.

Requirements

The provided dashboards use pod as a label to template across the Grafana panels. This is useful if you are running more than one SSV node as it allows you to see performance metrics for each SSV node.

The following is an example of a Prometheus configuration that scrapes metrics from your SSV node. Note that Prometheus adds an instance label to the metrics by default, but not the pod label, so you will need to add pod yourself.

The example below includes two options (as comments) for adding the pod label:

global:
  scrape_interval:     10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: ssv
    metrics_path: /metrics
    static_configs:
      - targets:
          - "localhost:15000"
    relabel_configs:
      ## Two options to choose from:
      ## 1. If you want a static name, for example because you are running
      ## multiple instances of ssv-node or want to set a specific name, uncomment these:
      # - target_label: pod
      #   replacement: ssv-node-1

      ## 2. To use the address from the targets list (here localhost:15000)
      ## as the pod label, uncomment these:
      # - source_labels: [__address__]
      #   target_label: pod
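
If you run several SSV nodes, an alternative to relabeling is to attach the pod label to each target directly in static_configs. The following is a minimal sketch rather than an official configuration: the host names ssv-node-1.example and ssv-node-2.example are hypothetical placeholders, and it assumes both nodes expose metrics on the default port 15000.

```yaml
global:
  scrape_interval: 10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: ssv
    metrics_path: /metrics
    static_configs:
      # One entry per SSV node; each target gets its own static pod label.
      - targets: ["ssv-node-1.example:15000"]  # hypothetical host
        labels:
          pod: ssv-node-1
      - targets: ["ssv-node-2.example:15000"]  # hypothetical host
        labels:
          pod: ssv-node-2
```

With this layout the pod variable in the dashboards should list ssv-node-1 and ssv-node-2 as separate entries.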

Monitoring Setup

Using an existing monitoring stack

If you are already running a monitoring stack, the way you scrape metrics from your SSV node has not changed. Make sure that your Prometheus instance scrapes metrics from your SSV node on MetricsAPIPort (set in your SSV node's configuration file, 15000 by default).

Once again, the dashboards rely on the pod label, which is common on Kubernetes. If you are running your SSV node on a different platform, you will need to either add the pod label yourself, as in the Prometheus example above, or change the dashboards to use a label that your setup provides.

Kubernetes

If you are running your SSV node on Kubernetes, you will need to ensure that your Prometheus instance is scraping metrics from your SSV node. For this, we often recommend a project called kube-prometheus-stack.

Due to the dynamic nature of Kubernetes, a common pattern is to use a ServiceMonitor or PodMonitor to scrape metrics from your SSV node. kube-prometheus-stack by default adds the pod label to the metrics, so dashboards should work out of the box.

You can read more on ServiceMonitors and PodMonitors in the Prometheus Operator documentation.
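
As an illustration, a PodMonitor for an SSV node could look roughly like the sketch below. It is not taken from the SSV documentation: the app: ssv-node pod label, the container port name metrics, and the release: kube-prometheus-stack label are all assumptions that you should replace with the values used in your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ssv-node
  labels:
    # Assumption: the Helm release is named "kube-prometheus-stack".
    # With the chart's default settings, only monitors carrying the
    # release label are picked up by Prometheus.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: ssv-node        # hypothetical label on your SSV node pods
  podMetricsEndpoints:
    - port: metrics        # hypothetical name of the container port for MetricsAPIPort
      path: /metrics
      interval: 10s
```

By default a PodMonitor only selects pods in its own namespace, so create it in the namespace where the SSV node runs, or add a namespaceSelector.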

Using Dashboard

If you are running your own Grafana instance, it listens on port 3000 by default, so its address will look like http://1.2.3.4:3000, where 1.2.3.4 is the public IP address of your server. In this case, you need to expose TCP port 3000 so that you can access the dashboard.

Alternatively, you can use Grafana Cloud and connect it to your Prometheus instance. In that case, you will have to expose TCP port 9090 (the default Prometheus port) so that Grafana can access your metrics.

Operational Dashboard

Download the dashboard's .json file from the top of the Dashboard Runbook section.

Metrics Index

All of the metrics from the dashboard are described on the Metrics Index page.

Quick note on Prometheus and Grafana

Interpreting Rates

Some panels in our default dashboard may look confusing at first glance; for example, how is it possible that my SSV node has connected to 34.7 peers over the last minute? Why does it have decimal points?

This is where the Prometheus rate() function comes in. rate() calculates the average per-second rate of increase of a metric over a given time window.

For example, rate(ssv_network_peers_connected[1m]) looks at the samples collected over the last minute and returns the average increase per second over that window. We sometimes multiply the result by 60 to get a per-minute figure. Because the value is an average, and Prometheus extrapolates it to the edges of the window, many panels show decimal points for metrics that, in theory, should be whole numbers.

Calculating Rates

Rates are a powerful way to see how a metric changes over time, but they require a time interval over which to calculate the rate. Rather than hardcoding intervals, we use $__rate_interval throughout our dashboards. You can read more about $__rate_interval in Grafana's post. This, however, means that the scrape interval configured on your Prometheus data source in Grafana must match Prometheus's actual scrape interval. You can change it in the Grafana UI by opening the Prometheus data source settings and editing the value under Interval behaviour. A mismatch between the two will result in incorrect rate calculations and dashboards that are not displayed correctly.

The default scrape interval for Prometheus is 1 minute, whereas the default interval for a Prometheus data source in Grafana is 15 seconds. If you use the example Prometheus configuration from the Requirements section, which scrapes every 10 seconds, set the data source interval to 10s.
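
If you manage Grafana through provisioning files rather than the UI, the same setting can be expressed in the data source definition. The sketch below assumes Prometheus is reachable at http://localhost:9090 and scrapes every 10 seconds, as in the example configuration earlier on this page; the data source name is arbitrary.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # assumption: Prometheus runs on the same host
    jsonData:
      # Should match scrape_interval in your Prometheus configuration.
      timeInterval: 10s
```

Grafana reads files like this from its provisioning/datasources directory on startup; the exact path depends on how Grafana was installed.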
