Troubleshoot cluster traffic issues with metrics. Read on to learn how to read, interpret, and act on your cluster metrics reports.
Navigating to Metrics
The Metrics dashboard is located in each cluster’s dashboard. Log into Bonsai, click on your cluster, and click on the metrics tab.
Time Window Selector
Use this selector to choose between 4 window sizes: 1. The last 1 hour (1h) 2. The last 24 hrs (1d) 3. The last 7 days (7d) 4. The last 28 days (28d)
Click on the arrows to go back or forth in time within the same time window size.
UTC and Local Timezone Toggle
Click on the timezone to toggle between displaying the graph timestamps in UTC time or your local timezone.
You can drill down to smaller windows of time on any graph by clicking and dragging to select a time range.
More information doesn’t necessarily mean more clarity. When something happens to your traffic and cluster responses, it’s important to know how to see your metrics and draw conclusions
Below we’ll cover what each graph displays and some examples of what they will look like given certain use cases, like high-traffic or clusters in different states (normal, or experiencing downtime). We’ll start with the most information-dense graph: the request heatmap.
Request Heatmap (Count & Duration)
This graph reveals how fast requests are. Each column in the graph represents a “slice” of time. Each row, or “bucket”, in the slice represents duration speed. The ‘hotter’ a bucket is colored, the more requests there are in that bucket. To further help visualize the difference in the quantity of requests for each bucket, every slice of time can be viewed as a histogram on hover.
This heatmap displays a healthy cluster with a lot of traffic, and some slow request times (toward the top), but a majority of it is occurring below 40ms.
Here we have a cluster with very little traffic. It’s important to note that the ‘heat’ color of every bucket is determined relative to the other data in the graph - so a side-by-side comparison of two request heatmaps using color won’t be accurate.
This graph is straightforward: it shows the number of requests handled by the cluster at a given time.
Request Duration Percentiles
The request duration graph, similar to the request heatmap, shows a distribution of request speed based on three percentiles of the requests in that time slice: 50%, 95%, and 99%. This is helpful in determining where the bulk of your requests sit in terms of speed, and how slow the outliers are.
Queue time is the total amount of time requests were “queued” or paused at our load balancing layer. Ideally, the queue time is 0, but in the event that you send many requests in parallel, our load balancer will queue up requests while waiting for executing requests to finish. This is part of our Quality of Service layer.
Concurrency shows the number of requests that are happening at the same time. Since clusters are limited on concurrency, this can be an important one to keep an eye on. When you reach your plan’s max concurrency, you will notice queue time start to consistently increase.
This graph shows the amount of data crossing the network - going into the cluster (shown in green), and coming from the cluster (in blue).
We expect most bandwidth graphs to look something like the graph below — a relatively higher count of ‘from client’ data compared to ‘to client’ data. These bars show the relationship of read and write data; the ‘To Client’ data coming from write - or indexing - requests, and the ‘From Client’ data the result of read requests.
The relationship between green to blue bars in this graph really depends on your use-case. A staging cluster, for example, might see a larger ratio of Write:Read data. It’s important to note that this graph deals exclusively in data - a high-traffic cluster will probably see a lot of data coming “From” the cluster, but a low-traffic cluster with very complicated queries and large request bodies will also have a larger “From Client” data than would otherwise be expected. Therefore, it’s helpful to look at request counts to get a feeling for the average ‘size’ of a request.
This graph can do two things:
- It will confirm that responses are successful: 2xx responses. This means that everything is moving along well and requests are formed correctly.
- In the less positive case, it can be a debugging tool that can help figure out where any buggy behavior is coming from. In general, 4xx requests are the result of a malformed query from your app or some client, whilst a 5xx request is from our end.
It’s important to note while reading this graph, that 5xx requests don’t necessarily mean that your cluster is down. A common situation on shared-tier architecture is a cluster that’s getting throttled by a noisy neighbor who’s taking up a lot of resources on the server. This can interrupt some (but not all) normal behavior on your cluster, resulting in a mix of 2xx and 5xx requests.
Tolerance for a few 5xx requests every now and then should be expected with any cloud service. We’re committed to getting all production clusters a 99.99% uptime (i.e. expected 0.001% 5xx responses), and we often have a track record of 4 9’s and higher.
We have a lot of people that are very sensitive to 5xx requests. In these cases, it’s usually best to be on a higher plan or a dedicated setup. Reach out to us at firstname.lastname@example.org if this is something your team needs.
Business and Enterprise Metrics
Clusters running on Business and Enterprise grade clusters will have access to some additional metrics. If you would like to get access to these metrics for your cluster, please reach out and our team will walk you through the process.
System load is the average number of processes waiting on a resource within a given period of time. This is often reported in 1, 5 and 15 minute windows. The Bonsai dashboard shows the system load average over the past minute.
It is helpful to think of system load as how saturated a node is with tasks; as long as the node's load average is lower than the number of its available CPUs, the node is able to handle all of its work without getting backed up. If the load average is larger than the number of its available CPUs, that means that some tasks are being scheduled. When tasks are delayed like this, performance suffers, and the performance impact is correlated to how high the load average gets.
Elasticsearch utilizes a number of thread pools for handling various tasks. These pools are also backed by a queue, so that if an tasks is created and a thread is not available to execute it, the task is queued until a thread becomes available. This metric shows the total number of tasks sitting in an Elasticsearch queue.
It's important to note that this is not the same as a request queue. A single request can result in multiple tasks being created within Elasticsearch. It is also important to note that these queues have a finite length. If the queue is full and an additional task is created, Elasticsearch will reject it with a message like "rejected execution (queue capacity 50)" and an HTTP 429 response.
This metric shows the number of _bulk requests that have been processed over time. Bulk requests are an efficient way to insert or update data in your cluster. Naturally, your payload sizes need to be more than 1-2 documents in order to get the benefits of bulk updates. Usually batches of 50-500 are ideal.
This metric shows the number of search requests that have been processed by Elasticsearch. It's important to distinguish between user searches and shard searches. A user search is performed by your application (often in response to some user action or search in the app), and may translate into multiple shard searches. This is true if your indices have multiple primary shards, or if you're searching across multiple indices. A query for the top X results will be passed to all relevant shards; each shard will perform the search and return the top X results. The coordinating node is then responsible for collating those results, sorting, and returning the top X.
Elasticsearch does utilize a thread pool specifically for searches, so depending on the types of searches you're running, and the volume of search traffic, it's possible to get a message like "rejected execution (queue capacity 50)" and an HTTP 429 response.
Elasticsearch is a Java-based search engine that runs in the Java Virtual Machine (JVM). The JVM has a special area of memory where objects are stored, called "heap space." This space is periodically garbage-collected, meaning objects that are no longer in use are destroyed to free up space in memory.
This metric shows the percentage of heap space that is currently occupied by Elasticsearch's objects and arrays. Lower is ideal, because high heap usage generally means more frequent, and longer-lasting garbage collection pauses, which manifests as slower performance and higher latency.
JVM Young GC Time
This metric shows the amount of time in milliseconds that the JVM spent reclaiming memory from new or short-lived objects. This type of garbage collection is expected and shouldn't lead to HTTP 503 or 504 errors. However, if it is chronic and frequent, it may lead to slow performance.
JVM Old GC Time
This metrics shows the amount of time in milliseconds the JVM spent reclaiming memory from long-surviving objects. The JVM periodically pauses the application so it can free up heap space, which means some operations are stopped for a period of time. This can result in perceived slow response times, and in some extreme cases can lead to system restarts and HTTP 503 and 504 responses.
CPU IO Wait
This metric shows the percentage of the time that the CPU(s) waited on IO operations. This means that an operation requested IO (like reading or writing to disk) and then had to wait for the system to complete the request. A certain amount of IOWait is expected for any IO operation, and usually it's on the order of nanoseconds. This isn't indicative of a problem on its own.
Excessive wait times are problematic though. It is usually correlated to high system load and a high volume of updates. Sometimes that can be addressed through hardware scaling or better throttling of updates. In rare cases it can indicate a problem with the hardware itself, like an SSD drive failing.
This metric shows the percentage of the time that the CPU(s) were executing code in user space. The user space is where all code runs, outside of the operating system's kernel. On Bonsai, this space is primarily dedicated to Elasticsearch, so the metric is roughly the amount of time the CPU spent processing instructions by the Elasticsearch code.
The metric can vary widely between clusters, based on application, hardware and use case. There is not necessarily an ideal value or range for this metric. However, large spikes or long periods of high processing times can manifest as poor performance. It often indicates that the hardware is not able to keep up with the demands of the application, although it can sometimes indicate a problem with the hardware itself.