2021-11-09

Monitoring the RED metrics in a Go application with Prometheus and Grafana

What are the RED metrics?

Metrics are numeric measurements recorded as time series that describe the state of a system. Common types are counters, gauges, and histograms. Their observations are aggregated and reported to a monitoring system, where aggregate behavior can be explored, usually through dashboards and alerts.

Prometheus is one of the most widely used open-source tools for collecting and querying metrics.

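RED refers to a small set of metrics for monitoring request-driven services. The method was popularized by Tom Wilkie and is closely related to the Four Golden Signals described in Google's SRE book. It stands for: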

Requests per second: the amount of demand a system is handling.
Error rate: the share of requests that fail.
Duration of request: the amount of time it takes for a request to be processed.

Go implementation

This is an example implementation of the RED metrics in a Go HTTP service that uses chi as the router.

We can start by defining the metrics we want to measure and their labels. Labels give us granular insight into metrics by letting us group them. In this case we group by request method, request path, and response code:

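package metrics

import (
	"net/http"
	"strconv"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

const (
	labelMethod = "method"
	labelPath   = "path"
	labelCode   = "code"
)

type Metrics struct {
	requestsReceived *prometheus.CounterVec
	requestDuration  *prometheus.HistogramVec
}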

There is no dedicated metric for the error rate; we will derive it later from the requests received metric and the code label.

Then we create a Register function that defines the metrics and their metadata and returns a struct whose methods operate on them:
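To make that calculation concrete, here is a minimal sketch in plain Go of the arithmetic behind the error rate: requests with a 5xx code divided by all requests. The function errorPercent and the sample counts are illustrative assumptions, not part of the service; in this post the real calculation happens in a PromQL query rather than in application code.

```go
package main

import "fmt"

// errorPercent returns the share of requests whose status code is 5xx,
// as a percentage of all requests. The map key is the HTTP status code and
// the value is the number of requests counted for that code.
func errorPercent(requestsByCode map[int]float64) float64 {
	var total, failed float64
	for code, n := range requestsByCode {
		total += n
		if code >= 500 && code <= 599 {
			failed += n
		}
	}
	if total == 0 {
		return 0 // no traffic, so nothing to report
	}
	return failed * 100 / total
}

func main() {
	// Hypothetical request counts grouped by the code label.
	counts := map[int]float64{200: 90, 404: 5, 500: 4, 503: 1}
	fmt.Printf("%.1f%%\n", errorPercent(counts))
}
```

Note that 4xx responses count as handled requests here, not as errors, which matches filtering on codes that start with 5.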

func Register() *Metrics {
	reqReceived := promauto.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "http",
			Name:      "requests_total",
			Help:      "Total number of requests received.",
		},
		[]string{labelMethod, labelPath, labelCode},
	)
	reqDuration := promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "http",
			Name:      "request_duration_seconds",
			Help:      "Duration of a request in seconds.",
		},
		[]string{labelMethod, labelPath, labelCode},
	)

	return &Metrics{
		requestsReceived: reqReceived,
		requestDuration:  reqDuration,
	}
}
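
The HistogramOpts above do not set Buckets, so the client library falls back to its default boundaries (prometheus.DefBuckets). As a rough illustration of what a histogram records, the following stdlib-only sketch counts observations into cumulative buckets. The histogram type and its methods are hypothetical and much simpler than the library's implementation; only the bucket boundaries are taken from the library defaults.

```go
package main

import "fmt"

// defaultBuckets mirrors the upper bounds (in seconds) of prometheus.DefBuckets.
var defaultBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// histogram is a hypothetical, simplified stand-in for a Prometheus histogram.
type histogram struct {
	bounds []float64 // upper bounds, sorted ascending
	counts []uint64  // cumulative counts; the last entry is the +Inf bucket
	sum    float64   // sum of all observed values
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe counts v in every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // the +Inf bucket counts every observation
	h.sum += v
}

func main() {
	h := newHistogram(defaultBuckets)
	for _, seconds := range []float64{0.003, 0.04, 0.3, 12} {
		h.observe(seconds)
	}
	for i, b := range h.bounds {
		fmt.Printf("le=%g count=%d\n", b, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d sum=%g\n", h.counts[len(h.bounds)], h.sum)
}
```

A request slower than the largest finite bound (10 seconds here) only shows up in the +Inf bucket, so services with long requests may want custom boundaries.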

Next, we add one method that returns the handler exposing all registered metrics, including the default ones Prometheus gathers for a Go app, another that increments the request counter, and another that observes the duration of each request:

func (m *Metrics) Default() http.Handler {
	return promhttp.Handler()
}

func (m *Metrics) IncRequests(method, path string, code int) {
	m.requestsReceived.WithLabelValues(method, path, strconv.Itoa(code)).Inc()
}

func (m *Metrics) ObsDuration(method, path string, code int, duration float64) {
	m.requestDuration.WithLabelValues(method, path, strconv.Itoa(code)).Observe(duration)
}

At this point we can create an endpoint where all the metrics will be available, and a middleware function that gathers those metrics for all handlers. httpsnoop captures the status code and duration of each request, and chi gives us the route pattern that matched it. When there are dynamic path parameters, this yields the generic path, such as /api/foo/{id}, which limits the cardinality of the path label:

func main() {
	r := chi.NewRouter()
	m := metrics.Register() // named m to avoid shadowing the metrics package

	r.Route("/api", func(r chi.Router) {
		// Record RED metrics for every handler under /api.
		r.Use(metricsMiddleware(m))
		r.Post("/foo", fooHandler)
		r.Get("/bar", barHandler)
	})
	// Expose the collected metrics for Prometheus to scrape.
	r.Get("/metrics", m.Default().ServeHTTP)

	if err := http.ListenAndServe(":8080", r); err != nil {
		log.Fatalln(err)
	}
}

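func metricsMiddleware(m *metrics.Metrics) func(h http.Handler) http.Handler {
	return func(h http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Run the wrapped handler while capturing its status code and duration.
			snoop := httpsnoop.CaptureMetrics(h, w, r)
			// Read the route pattern only after the handler has run: chi builds it
			// up during routing, so it is incomplete before that point.
			path := chi.RouteContext(r.Context()).RoutePattern()

			m.IncRequests(r.Method, path, snoop.Code)
			m.ObsDuration(r.Method, path, snoop.Code, snoop.Duration.Seconds())
		})
	}
}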

With this, all requests and responses under /api are monitored, and the resulting metrics are exposed at the /metrics endpoint, which we can configure our Prometheus server to scrape.

Once we have metrics in our Prometheus server, we can use Grafana to make sense of the data with dashboards.

For each metric we can create a panel and write a query using PromQL.

Requests per second:

sum(rate(http_requests_total[1m])) by (method, path)
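
As a rough intuition for what rate() computes, the sketch below derives a per-second rate from a series of counter samples, treating a drop in value as a counter reset. The sample type, the simpleRate function, and the sample values are hypothetical. The real PromQL function additionally extrapolates to the edges of the time window, which is left out here.

```go
package main

import "fmt"

// sample is one scraped counter value with its timestamp in seconds.
type sample struct {
	t float64
	v float64
}

// simpleRate returns the average per-second increase of a counter across the
// given samples. A decrease between two samples is treated as a counter reset,
// meaning the counter restarted from zero (for example after a redeploy).
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0 // a rate needs at least two samples
	}
	var increase float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i].v - samples[i-1].v
		if delta < 0 {
			delta = samples[i].v // reset: count everything since zero
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].t - samples[0].t
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// Hypothetical scrapes of http_requests_total every 15 seconds.
	window := []sample{{0, 100}, {15, 130}, {30, 160}, {45, 190}, {60, 220}}
	fmt.Println(simpleRate(window), "requests per second")
}
```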

Error rate:

sum(irate(http_requests_total{code=~"5.."}[1m])) by (method, path) / sum(irate(http_requests_total[1m])) by (method, path) * 100

Request duration (99th, 95th, and 50th percentiles of successful requests):

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{code=~"2.."}[1m])) by (le, method, path))

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{code=~"2.."}[1m])) by (le, method, path))

histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{code=~"2.."}[1m])) by (le, method, path))