{"_id":"5a8fae0368264c001f20cc10","category":{"_id":"5a8fae0368264c001f20cc03","version":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-12-23T20:11:49.377Z","from_sync":false,"order":5,"slug":"best-practices","title":"Elasticsearch Tips & Tricks"},"user":"5633ec9b35355017003ca3f2","parentDoc":null,"project":"5633ebff7e9e880d00af1a53","version":{"_id":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":2,"createdAt":"2018-02-23T06:00:34.961Z","releaseDate":"2018-02-23T06:00:34.961Z","categories":["5a8fae0268264c001f20cc01","5a8fae0268264c001f20cc02","5a8fae0368264c001f20cc03","5a8fae0368264c001f20cc04","5a8fae0368264c001f20cc05","5a8fae0368264c001f20cc06","5a8fae0368264c001f20cc07","5a8fae0368264c001f20cc08","5a8fae0368264c001f20cc09","5abaa7eb72d6dc0028a07bf3"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"2.0.0","version":"2.0"},"__v":0,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-01-05T02:37:28.774Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":4,"body":"Managing shards is a basic skill for any Elasticsearch user. Shards carry system overhead and potentially stale data. Keeping your cluster clean by pruning old shards can both improve performance and reduce your server costs.\n\nThis guide will make frequent references to the Elasticsearch API using the command line tool `curl`. Interacting with your own cluster can also be done via `curl`, or via a web browser. You can also use the Interactive Console in your cluster dashboard. For more information about the Interactive Console, check on the documentation relevant to you:\n\n* [Bonsai.io Users](doc:managing-your-cluster#interactive-console) \n* [Heroku Users](doc:bonsai-elasticsearch-dashboard#interactive-console) \n* [Manifold Users](doc:managing-your-cluster-1#interactive-console)\n\nIf you're just starting out, check out our illustrated [Shard Primer](doc:what-are-shards-and-replicas), and read over the [Elasticsearch Core Concepts](doc:basic-glossary) documentation.\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Deleting Unneeded Indices\"\n}\n[/block]\nThere are some cases where one or more indices are created on a cluster for testing purposes, and are not actually being used for anything. These will count towards the shard limits; if you're getting overage notifications, then you should delete these indices.\n\nThere are also some clients that will use [aliases](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to roll out changes to mappings. 
## Use a Different Sharding Scheme

It's possible that, for some reason, one or more indices were created with far more shards than necessary. For example, a check of [/_cat/indices](https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html) shows something like this:

```
GET /_cat/indices
green open test1           5 2 0 0  318b  159b
green open test2           5 2 0 0  318b  159b
```

That 5x2 shard scheme results in 15 shards per index (5 primaries, each with 2 replicas), or 30 in total for this cluster. This is probably far more capacity than these indices need. Choosing a more conservative shard scheme of 1x1 would reduce this cluster's usage from 30 shards down to 4.

Unfortunately, the number of primary shards cannot be changed once an index has been created. To fix this, you will need to manually create a new index with the desired shard scheme and reindex the data. If you have not read [The Ideal Elasticsearch Index](https://bonsai.io/blog/ideal-elasticsearch-cluster/), it has some really nice information on capacity planning and sizing. Check out the sections on Intelligent Sharding and Benchmarking for some tips on what scheme makes the most sense for your particular use case.
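As a rough sketch of the process (the index names and the 1x1 scheme here are illustrative), you could create the replacement index and copy the data over with the `_reindex` API, which is available in Elasticsearch 2.3 and later; on older versions you would reindex from your application instead:

```bash
# Create a replacement index with a 1x1 scheme
curl -XPUT "$BONSAI_URL/test1_v2" -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

# Copy the documents from the old index into the new one
curl -XPOST "$BONSAI_URL/_reindex" -H 'Content-Type: application/json' -d '{
  "source": { "index": "test1" },
  "dest":   { "index": "test1_v2" }
}'

# Once the new index has been verified, delete the old one
curl -XDELETE "$BONSAI_URL/test1"
```

Pointing an alias at the new index before deleting the old one also makes it easy to roll back if anything looks wrong after the switch.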
## Reduce Replication

For most use cases, a single replica is perfectly sufficient for redundancy and load capacity. If any of your indices have been created with more than one replica, you can reduce the count to free up shards. An index with more than one replica might look like this:

```
GET /_cat/indices?v
health status index    pri rep docs.count docs.deleted store.size pri.store.size
green  open   test1    5   2   0          0            318b       159b
green  open   test2    5   2   0          0            318b       159b
```

Notice that the `rep` column shows a 2. That means there are actually three copies of the data: one primary shard and two replicas. Replicas are a multiplier against primary shards, so if an index has a 5x2 configuration (5 primary shards with 2 replicas), reducing replication to 1 frees up five shards, not just one. See the [Shard Primer](doc:what-are-shards-and-replicas) for more details.

Fortunately, reducing the replica count is as simple as sending a small JSON body to the `_settings` endpoint:

```
PUT /test1,test2/_settings -d '{"index":{"number_of_replicas":1}}'
{"acknowledged":true}

GET /_cat/indices
green open test1           5 1 0 0  318b  159b
green open test2           5 1 0 0  318b  159b
```

That simple request shaved 10 shards off of this cluster's usage.
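Via `curl`, the same update would look something like this (again assuming the cluster URL lives in `$BONSAI_URL`):

```bash
# Drop both indices down to a single replica
curl -XPUT "$BONSAI_URL/test1,test2/_settings" -H 'Content-Type: application/json' -d '{
  "index": { "number_of_replicas": 1 }
}'
```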
> **Replication == Availability and Redundancy**
>
> It might seem like a good money-saving idea to simply set all replicas to 0 so as to fit as many indices into your cluster as possible. However, this is not advisable. It means your primary data has no live backup, and if a node in your cluster goes offline, data loss is basically guaranteed.
>
> Data can be restored from a snapshot, but this is messy and not a great failover plan. The snapshot could be an hour or more old, and any updates to your data since then either need to be reindexed or are lost for good. Additionally, the outage will last much longer.
>
> Having replication of at least 1 mitigates all of these problems.

## Data Collocation

Another way to reduce shard usage is to use [aliases and custom routing rules](https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html) to collocate different data models onto the same group of shards.

What is data collocation? Many Elasticsearch clients use an index-per-model paradigm as the default for data organization. This is analogous to, say, a Postgres database with a table for each type of data being indexed.

Sharding this way makes sense most of the time, but in some rare cases users may benefit from putting all of their data into a single namespace. In the Postgres analogy, this would be like putting all of the data into a single table instead of a table for each model. An attribute (i.e., a table column) is then used to filter searches by class.

For example, you might have a cluster with three indices: `videos`, `images` and `notes`. If each of these has a conservative 1x1 sharding scheme, that requires 6 shards. But this data could potentially be compacted down into a single index, `production`, where the mapping has a `type` field of some kind to indicate whether a document belongs to the `video`, `image` or `note` class.

The latter configuration with the same 1x1 scheme would only require two shards (one primary, one replica) instead of six:

![reduce-shards10.jpg](https://files.readme.io/i0UA2yiGS8eR7U4zMUVj_reduce-shards10.jpg)

There are several major downsides to this approach. One is that field name collisions become an issue. For example, if two models each have a field called `published`, but one defines it as a boolean and the other as a datetime, the mapping will have a conflict and one of the fields will need to be renamed.

Another downside is that it is a fairly large refactor for most users, and may be more trouble than simply upgrading the plan. Overriding the default behavior in the application's Elasticsearch client may require forking the code and relying on other hacks and workarounds.

There are other drawbacks as well. Data collocation is mentioned here as a _possibility_, and one that only works for certain users. It is by no means a recommendation.
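Purely as an illustration of the idea, a collocated setup built on filtered aliases might be sketched like this. The index, alias, and field names are hypothetical, and custom routing is omitted since it only matters once there is more than one primary shard:

```bash
# A single physical index holds every model (1 primary, 1 replica)
curl -XPUT "$BONSAI_URL/production" -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
}'

# Filtered aliases let each model search "its own" index while sharing shards
curl -XPOST "$BONSAI_URL/_aliases" -H 'Content-Type: application/json' -d '{
  "actions": [
    { "add": { "index": "production", "alias": "videos", "filter": { "term": { "type": "video" } } } },
    { "add": { "index": "production", "alias": "images", "filter": { "term": { "type": "image" } } } },
    { "add": { "index": "production", "alias": "notes",  "filter": { "term": { "type": "note"  } } } }
  ]
}'
```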
