Navigating to Metrics
Time Window Selector
- (1h) last 1 hour
- (1d) last 24 hours
- (7d) last 7 days
- (28d) last 28 days
UTC and Local Timezone Toggle
Requests (Counts & Duration) heat map
Request Duration Percentiles
Proxy Queue time
- It will confirm that responses are successful: 2xx responses. This means that everything is moving along well and requests are formed correctly.
- In the less positive case, it can be a debugging tool that can help figure out where any buggy behavior is coming from. In general, 4xx requests are the result of a malformed query from your app or some client, whilst a 5xx request is from our end.
We have a lot of people that are very sensitive to 5xx requests. In these cases, it’s usually best to be on a higher plan or a single tenant plan. Reach out to us at firstname.lastname@example.org if this is something your team needs.
Business and Enterprise Plans - Additional Metrics
Clusters running on Business and Enterprise grade clusters will have access to some additional metrics. If you would like to get access to these metrics for your cluster, please reach out and our team will walk you through the process.
System load is the average number of processes waiting on a resource within a given period of time. This is often reported in 1, 5 and 15 minute windows. The Bonsai dashboard shows the system load average over the past minute.
It is helpful to think of system load as how saturated a node is with tasks; as long as the node's load average is lower than the number of its available CPUs, the node is able to handle all of its work without getting backed up. If the load average is larger than the number of its available CPUs, that means that some tasks are being scheduled. When tasks are delayed like this, performance suffers, and the performance impact is correlated to how high the load average gets.
Elasticsearch utilizes a number of thread pools for handling various tasks. These pools are also backed by a queue, so that if an tasks is created and a thread is not available to execute it, the task is queued until a thread becomes available. This metric shows the total number of tasks sitting in an Elasticsearch queue.
It's important to note that this is not the same as a request queue. A single request can result in multiple tasks being created within Elasticsearch. It is also important to note that these queues have a finite length. If the queue is full and an additional task is created, Elasticsearch will reject it with a message like "rejected execution (queue capacity 50)" and an HTTP 429 response.
This metric shows the number of _bulk requests that have been processed over time. Bulk requests are an efficient way to insert or update data in your cluster. Naturally, your payload sizes need to be more than 1-2 documents in order to get the benefits of bulk updates. Usually batches of 50-500 are ideal.
This metric shows the number of search requests that have been processed by Elasticsearch. It's important to distinguish between user searches and shard searches. A user search is performed by your application (often in response to some user action or search in the app), and may translate into multiple shard searches. This is true if your indices have multiple primary shards, or if you're searching across multiple indices. A query for the top X results will be passed to all relevant shards; each shard will perform the search and return the top X results. The coordinating node is then responsible for collating those results, sorting, and returning the top X.
Elasticsearch does utilize a thread pool specifically for searches, so depending on the types of searches you're running, and the volume of search traffic, it's possible to get a message like "rejected execution (queue capacity 50)" and an HTTP 429 response.
Elasticsearch is a Java-based search engine that runs in the Java Virtual Machine (JVM). The JVM has a special area of memory where objects are stored, called "heap space." This space is periodically garbage-collected, meaning objects that are no longer in use are destroyed to free up space in memory.
This metric shows the percentage of heap space that is currently occupied by Elasticsearch's objects and arrays. Lower is ideal, because high heap usage generally means more frequent, and longer-lasting garbage collection pauses, which manifests as slower performance and higher latency.
JVM Young GC Time
This metric shows the amount of time in milliseconds that the JVM spent reclaiming memory from new or short-lived objects. This type of garbage collection is expected and shouldn't lead to HTTP 503 or 504 errors. However, if it is chronic and frequent, it may lead to slow performance.
JVM Old GC Time
This metrics shows the amount of time in milliseconds the JVM spent reclaiming memory from long-surviving objects. The JVM periodically pauses the application so it can free up heap space, which means some operations are stopped for a period of time. This can result in perceived slow response times, and in some extreme cases can lead to system restarts and HTTP 503 and 504 responses.
CPU IO Wait
This metric shows the percentage of the time that the CPU(s) waited on IO operations. This means that an operation requested IO (like reading or writing to disk) and then had to wait for the system to complete the request. A certain amount of IOWait is expected for any IO operation, and usually it's on the order of nanoseconds. This isn't indicative of a problem on its own.
Excessive wait times are problematic though. It is usually correlated to high system load and a high volume of updates. Sometimes that can be addressed through hardware scaling or better throttling of updates. In rare cases it can indicate a problem with the hardware itself, like an SSD drive failing.
This metric shows the percentage of the time that the CPU(s) were executing code in user space. The user space is where all code runs, outside of the operating system's kernel. On Bonsai, this space is primarily dedicated to Elasticsearch, so the metric is roughly the amount of time the CPU spent processing instructions by the Elasticsearch code.
The metric can vary widely between clusters, based on application, hardware and use case. There is not necessarily an ideal value or range for this metric. However, large spikes or long periods of high processing times can manifest as poor performance. It often indicates that the hardware is not able to keep up with the demands of the application, although it can sometimes indicate a problem with the hardware itself.