{"_id":"59fb653574b36d0010d1b434","project":"5633ebff7e9e880d00af1a53","version":{"_id":"5633ec007e9e880d00af1a56","project":"5633ebff7e9e880d00af1a53","__v":18,"createdAt":"2015-10-30T22:15:28.105Z","releaseDate":"2015-10-30T22:15:28.105Z","categories":["5633ec007e9e880d00af1a57","5633f072737ea01700ea329d","5637a37d0704070d00f06cf4","5637cf4e7ca5de0d00286aeb","564503082c74cf1900da48b4","564503cb7f1fff210078e70a","567af26cb56bac0d0019d87d","567afeb8802b2b17005ddea0","567aff47802b2b17005ddea1","567b0005802b2b17005ddea3","568adfffcbd4ca0d00aebf7e","56ba80078cf7c9210009673e","574d127f6f075519007da3d0","574fde60aef76a0e00840927","57a22ba6cd51b22d00f623a0","5a062c15a66ae1001a3f5b09","5b032aef01b0ff00038d8d5e","5b032af62d18d700038a7042"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"category":{"_id":"567b0005802b2b17005ddea3","pages":["567b001117368a0d009a6e10","567b00307c40060d005603e7","567b039a7c40060d005603ec"],"project":"5633ebff7e9e880d00af1a53","version":"5633ec007e9e880d00af1a56","__v":3,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-12-23T20:11:49.377Z","from_sync":false,"order":2,"slug":"best-practices","title":"Best Practices & Tools"},"user":"5633ec9b35355017003ca3f2","__v":0,"parentDoc":null,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2017-11-02T18:34:29.172Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":7,"body":"Troubleshooting issues with traffic to your clusters starts with metrics. Read on to learn how to read, interpret, and act on your cluster metrics reports. \n[block:api-header]\n{\n  \"title\": \"Navigating to Metrics\"\n}\n[/block]\nThe Metrics dashboard is located in each cluster's dashboard. Log into Bonsai, click on your cluster, and click on the metrics tab. 
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/778f4a3-Webr_Logs_Virginia_webr_logs_virginia_2788396288_Metrics.png",
        "Webr Logs Virginia  webr logs virginia 2788396288    Metrics.png",
        2992,
        2670,
        "#78b0ab"
      ],
      "caption": "The metrics tab on the cluster dashboard."
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Metrics Utilities"
}
[/block]

[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/d3057ca-Screen_Shot_2017-11-02_at_1.47.21_PM.png",
        "Screen Shot 2017-11-02 at 1.47.21 PM.png",
        858,
        186,
        "#3c3d44"
      ]
    }
  ]
}
[/block]
## Time Window Selector

Use this selector to choose between four window sizes:
1. The last hour (1h)
2. The last 24 hours (1d)
3. The last 7 days (7d)
4. The last 28 days (28d)

## Time Scrubber

Click the arrows to move back or forward in time within the same window size.

## UTC and Local Timezone Toggle

Click the timezone to toggle the graph timestamps between UTC and your local timezone.

## Highlighting

You can drill down to smaller windows of time on any graph by clicking and dragging to select a time range.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/b404854-metrics_-_highlight.gif",
        "metrics - highlight.gif",
        1592,
        746,
        "#284b91"
      ],
      "caption": ""
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Metrics Overview"
}
[/block]
More information doesn’t necessarily mean more clarity.
When something happens to your traffic and cluster responses, it’s important to know how to read your metrics and draw conclusions.

Below we'll cover what each graph displays, along with examples of what the graphs look like in certain use cases, such as high traffic or clusters in different states (normal, or experiencing downtime). We’ll start with the most information-dense graph: the request heatmap.

[block:api-header]
{
  "title": "Request Heatmap (Count & Duration)"
}
[/block]
This graph reveals how fast requests are. Each column in the graph represents a “slice” of time. Each row, or “bucket”, in the slice represents a range of request durations. The ‘hotter’ a bucket is colored, the more requests fall into that bucket. To further help visualize the difference in the quantity of requests for each bucket, every slice of time can be viewed as a histogram on hover.

**Example 1**
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/a87cbef-Screen_Shot_2017-11-02_at_2.29.05_PM.png",
        "Screen Shot 2017-11-02 at 2.29.05 PM.png",
        3272,
        794,
        "#451965"
      ]
    }
  ]
}
[/block]
This heatmap displays a healthy cluster with a lot of traffic. There are some slow request times (toward the top), but the majority of requests complete in under 40ms.


**Example 2**
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/90d1a1e-Screen_Shot_2017-11-02_at_2.29.37_PM.png",
        "Screen Shot 2017-11-02 at 2.29.37 PM.png",
        3288,
        812,
        "#d7d7d6"
      ]
    }
  ]
}
[/block]
Here we have a cluster with very little traffic.
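To make the slice-and-bucket structure concrete, here is a minimal sketch in plain Python of how request durations roll up into heatmap cells. The log data, slice width, and bucket edges are made up for illustration; Bonsai's actual bucketing is internal to the dashboard.

```python
from collections import Counter

# Hypothetical request log: (unix_timestamp_seconds, duration_ms) pairs.
requests = [
    (1000, 12), (1002, 35), (1003, 180), (1061, 8),
    (1065, 22), (1119, 40), (1125, 9), (1130, 310),
]

SLICE_SECONDS = 60                             # width of one column (time slice)
BUCKET_EDGES_MS = [10, 20, 40, 80, 160, 320]   # duration bucket upper edges

def bucket_for(duration_ms):
    """Index of the first bucket whose upper edge covers this duration."""
    for i, edge in enumerate(BUCKET_EDGES_MS):
        if duration_ms <= edge:
            return i
    return len(BUCKET_EDGES_MS)                # slower than the largest edge

# Each heatmap cell counts the requests in one (time slice, bucket) pair;
# cells with higher counts render 'hotter'.
heatmap = Counter((ts // SLICE_SECONDS, bucket_for(ms)) for ts, ms in requests)
```

Hovering a slice in the dashboard shows exactly this kind of per-slice histogram: the counts of one column, broken out by duration bucket.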
It’s important to note that the ‘heat’ color of each bucket is determined relative to the other data in the same graph, so a side-by-side color comparison of two request heatmaps won't be accurate.


[block:api-header]
{
  "title": "Request Counts"
}
[/block]
This graph is straightforward: it shows the number of requests handled by the cluster at a given time.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/e0832cc-Screen_Shot_2017-11-02_at_2.30.15_PM.png",
        "Screen Shot 2017-11-02 at 2.30.15 PM.png",
        1622,
        472,
        "#6cccad"
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Request Duration Percentiles"
}
[/block]
The request duration graph, like the request heatmap, shows a distribution of request speed, based on three percentiles of the requests in each time slice: 50%, 95%, and 99%. This is helpful in determining where the bulk of your requests sit in terms of speed, and how slow the outliers are.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/f51c034-Screen_Shot_2017-11-02_at_3.29.49_PM.png",
        "Screen Shot 2017-11-02 at 3.29.49 PM.png",
        1612,
        484,
        "#64baad"
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Queue Time"
}
[/block]
Queue time is the total amount of time requests were “queued”, or paused, at our load balancing layer. Ideally, queue time is 0, but if you send many requests in parallel, our load balancer will queue up requests while waiting for executing requests to finish.
This is part of our Quality of Service layer.


[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/be463cd-Screen_Shot_2017-11-02_at_2.49.12_PM.png",
        "Screen Shot 2017-11-02 at 2.49.12 PM.png",
        1624,
        462,
        "#98abb0"
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Concurrency"
}
[/block]
Concurrency shows the number of requests happening at the same time. Since clusters are limited on concurrency, this is an important one to keep an eye on. When you reach your plan’s maximum concurrency, you will notice queue time start to consistently increase.

[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/a8d5c2f-Screen_Shot_2017-11-02_at_2.49.37_PM.png",
        "Screen Shot 2017-11-02 at 2.49.37 PM.png",
        1622,
        466,
        "#45b3c4"
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "title": "Bandwidth"
}
[/block]
This graph shows the amount of data crossing the network: going into the cluster (shown in green), and coming from the cluster (in blue).

We expect most bandwidth graphs to look something like the graph below: a relatively higher amount of ‘From Client’ data compared to ‘To Client’ data. These bars show the relationship of read and write data: the ‘From Client’ data comes from write (indexing) requests, and the ‘To Client’ data is the result of read requests.

[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/0a82cc8-Screen_Shot_2017-11-02_at_2.49.55_PM.png",
        "Screen Shot 2017-11-02 at 2.49.55 PM.png",
        1658,
        492,
        "#548ab3"
      ]
    }
  ]
}
[/block]
The ratio of green to blue bars in this graph depends on your use case. A staging cluster, for example, might see a larger ratio of write data to read data.
It’s important to note that this graph deals exclusively in data: a high-traffic cluster will probably see a lot of data coming from the cluster, but a low-traffic cluster with very complicated queries and large request bodies can also have more ‘From Client’ data than would otherwise be expected. Therefore, it’s helpful to look at request counts to get a feeling for the average ‘size’ of a request.

[block:api-header]
{
  "title": "Response Codes"
}
[/block]

This graph can do two things:

  * It will confirm that responses are successful: 2xx responses mean that everything is moving along well and requests are formed correctly.
  * In the less positive case, it can be a debugging tool to help figure out where any buggy behavior is coming from. In general, 4xx responses are the result of a malformed query from your app or a client, while 5xx responses indicate a problem on our end.

[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/55fc596-Screen_Shot_2017-11-03_at_5.36.17_PM.png",
        "Screen Shot 2017-11-03 at 5.36.17 PM.png",
        1608,
        380,
        "#8acaae"
      ]
    }
  ]
}
[/block]
It's important to note while reading this graph that 5xx responses don’t necessarily mean your cluster is down. A common situation on shared-tier architecture is a cluster being throttled by a noisy neighbor taking up a lot of resources on the server. This can interrupt some (but not all) normal behavior on your cluster, resulting in a mix of 2xx and 5xx responses.

Tolerance for the occasional 5xx response should be expected with any cloud service. We’re committed to 99.99% uptime for all production clusters (i.e., an expected 0.01% rate of 5xx responses), and we often have a track record of four 9’s and higher.
See our uptime history here:

* [https://bonsai.io/blog](https://bonsai.io/blog)

Some users are especially sensitive to 5xx responses. In those cases, it’s usually best to be on a higher plan or a dedicated setup. Reach out to us at [support@bonsai.io](mailto:support@bonsai.io) if this is something your team needs.
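If your application needs to ride out the occasional transient 5xx, a client-side retry with exponential backoff usually suffices. Here is a minimal sketch in Python; the cluster URL is a made-up placeholder, and authentication is omitted for brevity.

```python
import time
import urllib.error
import urllib.request

# Hypothetical cluster URL; substitute your own (auth omitted for brevity).
CLUSTER_URL = "https://my-cluster-1234.us-east-1.bonsaisearch.net"

def backoff_delays(retries, base=0.5):
    """Sleep schedule between retries: 0.5s, 1s, 2s, ... (exponential)."""
    return [base * (2 ** i) for i in range(retries)]

def get_with_retry(path, retries=3):
    """GET a cluster endpoint, retrying transient 5xx responses.

    4xx errors are re-raised immediately: they usually indicate a
    malformed request, so retrying won't help.
    """
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(CLUSTER_URL + path) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == retries:
                raise                  # client error, or out of retries
            time.sleep(delays[attempt])
```

Keep the retry budget small: if 5xx responses persist beyond a few attempts, that is a signal to check the response codes graph rather than retry harder.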
