All these (Dropwizard) metrics

Let's look at metrics again, this time from the perspective of their definition and usability. It was not easy for me to understand the types of metrics and how to read them. Which metrics are useful and therefore worth paying attention to? This blog post focuses on Dropwizard metrics, with a sample application for a bit of practice.

Why measure?

The motivation to monitor applications’ behaviour is well described in a talk given by Coda Hale: Metrics Metrics Everywhere (video, presentation). Allow me to quote what I think is the most important:

Our code generates business value when it runs, not when we write it.

We need to know what our code does when it runs.

We can’t do this unless we measure it.

Coda Hale, Metrics Metrics Everywhere

This is why we should measure what we write and run (mainly in production).

How to measure?

Since it is a Dropwizard application (or service) that I would like to monitor, I have created a sample application so that metrics can be demonstrated in action.
I will use Datadog's free plan to visualise the metrics; however, this is not the only option. The diagram below illustrates the flow:

  • a customer uses a web application that communicates with a Dropwizard service over HTTP
  • the Dropwizard application gathers metrics and sends them to Datadog using the metrics-datadog library by coursera
  • a developer uses a browser to view the collected metrics
Working with Dropwizard metrics

What to measure?

The previous two parts were easy: we're now (hopefully!) motivated, have an awful lot of metrics, and can visualise them. But what exactly should we be looking at?
Let's take a deeper dive into metrics!

Accessing metrics with Metrics Servlet

Dropwizard exposes a metrics servlet by default. The easiest way to access it is by navigating to the admin servlet – its default port is 8081, and the servlet is registered under the root path. There you can find a link to the metrics. The full URL is the following (provided the application runs on localhost):

http://localhost:8081/metrics?pretty=true

More about the admin servlet can be found in the documentation.
The great thing about the servlet is that it exposes the metrics as JSON. Using it is a good first step in analysing your metrics and checking that they really work, even before sending them further. The JSON contains a section for each metric type:

  • gauges
  • counters
  • histograms
  • meters
  • timers

Types of metrics and metrics’ annotations

The sample project contains a UserResource with a logIn() method. Without metric annotations from the com.codahale.metrics.annotation package, that method would not be visible in the metrics. Some metrics would still be updated – like the total number of requests: io.dropwizard.jetty.MutableServletContextHandler.requests or io.dropwizard.jetty.MutableServletContextHandler.2xx-responses. However, we would have no information about how many times this particular method has been called or how long it took to respond.
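As a sketch, annotating the sample project's logIn() might look like the following (the JAX-RS path and return type here are assumptions; @Timed comes from com.codahale.metrics.annotation):

```java
import com.codahale.metrics.annotation.Timed;

import javax.ws.rs.POST;
import javax.ws.rs.Path;

// Hypothetical shape of the sample project's resource.
@Path("/users")
public class UserResource {

    // @Timed registers a timer named after the class and method, e.g.
    // com.techarchnotes.resource.UserResource.logIn
    @POST
    @Path("/login")
    @Timed
    public String logIn() {
        // authentication logic would go here
        return "ok";
    }
}
```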

Gauges

A gauge is the simplest metric type. It just returns a value.
Quite a lot of gauges are available in Dropwizard out of the box. As the definition suggests, they return the current value of a property, for example jvm.memory.heap.used, which reflects the heap currently used by the application. The list is really long, including:

io.dropwizard.jetty.MutableServletContextHandler.percent-*
jvm.gc.*
jvm.memory.*
jvm.threads.*
org.eclipse.jetty.util.thread.QueuedThreadPool.*
and others
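You can also register your own gauges. A minimal sketch, assuming the com.codahale.metrics dependency is on the classpath (the registry and metric names are illustrative, not from the sample project):

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class GaugeExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Queue<String> jobs = new ConcurrentLinkedQueue<>();

        // A gauge simply reports a current value on demand -- here, the queue depth.
        registry.register("jobs.queue.size", (Gauge<Integer>) jobs::size);

        jobs.add("job-1");
        jobs.add("job-2");

        Gauge<?> gauge = registry.getGauges().get("jobs.queue.size");
        System.out.println(gauge.getValue()); // prints 2
    }
}
```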

Counters

A counter is a simple incrementing and decrementing 64-bit integer.
A counter has only a single field: count.
The @Counted annotation can be used on a resource's method, and I would expect it to create a counter metric for the annotated method; however, I haven't been able to observe that in the metrics JSON. The next annotation – @Metered – produces a counter anyway, so it can be used instead.
Jetty also provides a few standard counters:

io.dropwizard.jetty.MutableServletContextHandler.active-dispatches
io.dropwizard.jetty.MutableServletContextHandler.active-requests
io.dropwizard.jetty.MutableServletContextHandler.active-suspended

You can find more information about these in this blog post.
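Counters can also be used programmatically via the MetricRegistry. A minimal sketch (the metric name is illustrative), mirroring what active-requests tracks:

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public class CounterExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // e.g. tracking requests currently in flight, like Jetty's active-requests
        Counter activeRequests = registry.counter("active-requests");

        activeRequests.inc(); // a request arrives
        activeRequests.inc(); // another one
        activeRequests.dec(); // one completes

        System.out.println(activeRequests.getCount()); // prints 1
    }
}
```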

Histograms

A histogram measures the distribution of values in a stream of data.
Dropwizard doesn't offer histograms for its default metrics; however, if you're using HikariCP, you can see what a histogram actually looks like. It contains the following metrics:

count
max
mean
min
p50
p75
p95
p98
p99
p999
stddev

More about histograms and what they actually measure (the quantiles p50–p999, for example) can be found in the Dropwizard Metrics documentation.
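For a feel of how those fields are produced, here is a minimal sketch of feeding values into a histogram and reading its snapshot (the metric name is illustrative; the default reservoir makes the percentiles approximate):

```java
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;

public class HistogramExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Histogram sizes = registry.histogram("response-sizes");

        // record a stream of values, e.g. response sizes in bytes
        for (int i = 1; i <= 100; i++) {
            sizes.update(i);
        }

        Snapshot snapshot = sizes.getSnapshot();
        System.out.println(snapshot.getMedian());         // around 50 (p50)
        System.out.println(snapshot.get95thPercentile()); // around 95
        System.out.println(snapshot.getMax());            // 100
    }
}
```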

Meters

A meter measures the rate at which a set of events occurs.
Adding the @Metered annotation to the logIn() method adds the following meter: com.techarchnotes.resource.UserResource.logIn. Meters contain the following information:

"com.techarchnotes.resource.UserResource.logIn" : {
  "count" : 4,
  "m1_rate" : 0.06012873091736082,
  "m5_rate" : 0.013058921239443452,
  "m15_rate" : 0.004413705625605309,
  "mean_rate" : 0.02941400687290523,
  "units" : "events/second"
}

The metric is self-explanatory – what we can observe are events per second, gathered for different rates: 1 minute, 5 minutes and 15 minutes, as well as the mean rate. The metric gathers data for the lifetime of the application, and the three recent rates (1, 5, 15 minutes) are computed using the so-called exponentially weighted moving average (EWMA) algorithm. Dropwizard offers default meters as well:

ch.qos.logback.core.Appender.all
ch.qos.logback.core.Appender.debug
ch.qos.logback.core.Appender.error
ch.qos.logback.core.Appender.info
ch.qos.logback.core.Appender.trace
ch.qos.logback.core.Appender.warn
io.dropwizard.jetty.MutableServletContextHandler.1xx-responses
io.dropwizard.jetty.MutableServletContextHandler.2xx-responses
io.dropwizard.jetty.MutableServletContextHandler.3xx-responses
io.dropwizard.jetty.MutableServletContextHandler.4xx-responses
io.dropwizard.jetty.MutableServletContextHandler.5xx-responses
io.dropwizard.jetty.MutableServletContextHandler.async-dispatches
io.dropwizard.jetty.MutableServletContextHandler.async-timeouts
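A meter can also be driven programmatically; a minimal sketch (the metric name is illustrative), which is essentially what @Metered does for each call:

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Meter logins = registry.meter("logins");

        // mark() records one event; the m1/m5/m15 rates are derived from these marks
        logins.mark();
        logins.mark(3); // record three events at once

        System.out.println(logins.getCount());    // prints 4
        System.out.println(logins.getMeanRate()); // events/second since creation
    }
}
```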

Timers

A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence.
Adding the @Timed annotation to a resource's method creates a timer for it. The timer has the following metrics:

com.techarchnotes.resource.UserResource.logIn: {
  count: 1380,
  max: 0.001977389,
  mean: 0.00006600207323088171,
  min: 0.000016481,
  p50: 0.000047935,
  p75: 0.00007088400000000001,
  p95: 0.000145658,
  p98: 0.00021545600000000002,
  p99: 0.000308214,
  p999: 0.0015237900000000001,
  stddev: 0.00008797390381482896,
  m15_rate: 1.403797940574774,
  m1_rate: 8.80946625842269,
  m5_rate: 3.6139579488499476,
  mean_rate: 1.9912307032423369,
  duration_units: "seconds",
  rate_units: "calls/second"
}

Rates (m1, m5, m15, mean) and count are elements of Meter and relate to calls per second. The rest of the values relate to Histogram metrics and reflect the duration of a method run in seconds. Again, Dropwizard offers a few Timers out of the box:

io.dropwizard.jetty.MutableServletContextHandler.connect-requests
io.dropwizard.jetty.MutableServletContextHandler.delete-requests
io.dropwizard.jetty.MutableServletContextHandler.dispatches
io.dropwizard.jetty.MutableServletContextHandler.get-requests
io.dropwizard.jetty.MutableServletContextHandler.head-requests
io.dropwizard.jetty.MutableServletContextHandler.move-requests
io.dropwizard.jetty.MutableServletContextHandler.options-requests
io.dropwizard.jetty.MutableServletContextHandler.other-requests
io.dropwizard.jetty.MutableServletContextHandler.post-requests
io.dropwizard.jetty.MutableServletContextHandler.put-requests
io.dropwizard.jetty.MutableServletContextHandler.requests
io.dropwizard.jetty.MutableServletContextHandler.trace-requests
org.eclipse.jetty.server.HttpConnectionFactory.8080.connections
org.eclipse.jetty.server.HttpConnectionFactory.8081.connections
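Timers can also be used by hand around any block of code. A minimal sketch (the metric name and sleep are illustrative); timing a block like this is roughly what @Timed does for each request:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class TimerExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        Timer timer = registry.timer("logIn");

        // Timer.Context measures the duration between time() and close()/stop()
        try (Timer.Context ignored = timer.time()) {
            Thread.sleep(25); // simulated work
        }

        System.out.println(timer.getCount());                   // prints 1
        System.out.println(timer.getSnapshot().getMax() / 1e6); // duration in ms, roughly 25
    }
}
```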

Useful subset of metrics

There are a lot of metrics available, so which ones should be used?
I would divide them into two categories: request-related and JVM-related, since those are the two largest groups.

Request-related metrics

In this group there are mainly Timers and Meters, as well as the three Counters mentioned earlier. In order not to send every quantile or rate, I would suggest limiting the list to the following:

COUNT
RATE_1_MINUTE
RATE_5_MINUTE
RATE_15_MINUTE
MEAN
MAX
P95
P99

My suggested list of metrics would be the following:

Counters
io.dropwizard.jetty.MutableServletContextHandler.active-dispatches
io.dropwizard.jetty.MutableServletContextHandler.active-requests
io.dropwizard.jetty.MutableServletContextHandler.active-suspended

Meters
io.dropwizard.jetty.MutableServletContextHandler.2xx-responses
io.dropwizard.jetty.MutableServletContextHandler.4xx-responses
io.dropwizard.jetty.MutableServletContextHandler.5xx-responses

Timers
io.dropwizard.jetty.MutableServletContextHandler.delete-requests
io.dropwizard.jetty.MutableServletContextHandler.get-requests
io.dropwizard.jetty.MutableServletContextHandler.post-requests
io.dropwizard.jetty.MutableServletContextHandler.put-requests
io.dropwizard.jetty.MutableServletContextHandler.requests
org.eclipse.jetty.server.HttpConnectionFactory.8080.connections
org.eclipse.jetty.server.HttpConnectionFactory.8081.connections

JVM-related metrics

This group contains mainly Gauges. I would pick the following to start with:

Gauges
garbage collection:
jvm.gc.PS-MarkSweep.count
jvm.gc.PS-MarkSweep.time
jvm.gc.PS-Scavenge.count
jvm.gc.PS-Scavenge.time
memory:
jvm.memory.heap.max
jvm.memory.heap.used
jvm.memory.non-heap.max
jvm.memory.non-heap.used
jvm.memory.total.max
jvm.memory.total.used
threads:
jvm.threads.blocked.count
jvm.threads.count
jvm.threads.daemon.count
jvm.threads.deadlock.count
jvm.threads.new.count
jvm.threads.runnable.count
jvm.threads.waiting.count

Analysing results

To visualise what a Meter looks like in Dropwizard, let's look at the UserResource.logIn() requests per second that have been sent to Datadog and compare them with the actual Gatling report. The lines in Datadog represent the 1m, 5m and 15m rates respectively, with the 1m rate approximating the actual number of requests fastest of the three:

Gatling report: Actual requests / second
Datadog chart: Collected requests per second 1m, 5m and 15m rates

What does this mean? The 1m rate reflects the most recent trend in your application, while the 5m and 15m rates cover longer time intervals, so a higher request frequency would have to be sustained for longer before these two go up. Remember that these values are a weighted average over the interval they span, hence the value grows more slowly for the longer interval (15m) and faster for the shorter one (1m).
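To make that intuition concrete, here is a plain-Java sketch of the EWMA update behind the m1/m15 rates. It is not the library code, just the idea: the 5-second tick matches Dropwizard's implementation, while starting from zero is a simplification.

```java
public class EwmaSketch {
    static final double TICK_SECONDS = 5.0; // Dropwizard's EWMA ticks every 5 seconds

    // smoothing factor for a given averaging window in minutes
    static double alpha(double windowMinutes) {
        return 1 - Math.exp(-TICK_SECONDS / (windowMinutes * 60));
    }

    // EWMA after `ticks` ticks of a steady `rate` events/second, starting from 0
    static double simulate(double windowMinutes, int ticks, double rate) {
        double ewma = 0;
        double a = alpha(windowMinutes);
        for (int i = 0; i < ticks; i++) {
            ewma += a * (rate - ewma);
        }
        return ewma;
    }

    public static void main(String[] args) {
        int oneMinute = 12; // 12 ticks of 5 seconds = 1 minute of steady traffic
        System.out.printf("m1=%.2f m15=%.2f%n",
                simulate(1, oneMinute, 10.0),
                simulate(15, oneMinute, 10.0));
        // the 1m rate climbs toward 10 much faster than the 15m rate
    }
}
```

After one minute of a steady 10 requests/second, the 1m rate is already above 6 while the 15m rate is still under 1, which matches the lag visible in the Datadog chart.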

Summary

Which metrics to use depends on your application; however, the set proposed above should cover a lot. If your application uses the Hikari connection pool, for example, you can set up its metrics with Dropwizard as well – have a look at the documentation.
