{"_id":"568b2c687a42220d004982bf","category":{"_id":"567b0005802b2b17005ddea3","pages":["567b001117368a0d009a6e10","567b00307c40060d005603e7","567b039a7c40060d005603ec"],"project":"5633ebff7e9e880d00af1a53","version":"5633ec007e9e880d00af1a56","__v":3,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-12-23T20:11:49.377Z","from_sync":false,"order":2,"slug":"best-practices","title":"Best Practices & Tools"},"user":"5633ec9b35355017003ca3f2","parentDoc":null,"project":"5633ebff7e9e880d00af1a53","version":{"_id":"5633ec007e9e880d00af1a56","project":"5633ebff7e9e880d00af1a53","__v":16,"createdAt":"2015-10-30T22:15:28.105Z","releaseDate":"2015-10-30T22:15:28.105Z","categories":["5633ec007e9e880d00af1a57","5633f072737ea01700ea329d","5637a37d0704070d00f06cf4","5637cf4e7ca5de0d00286aeb","564503082c74cf1900da48b4","564503cb7f1fff210078e70a","567af26cb56bac0d0019d87d","567afeb8802b2b17005ddea0","567aff47802b2b17005ddea1","567b0005802b2b17005ddea3","568adfffcbd4ca0d00aebf7e","56ba80078cf7c9210009673e","574d127f6f075519007da3d0","574fde60aef76a0e00840927","57a22ba6cd51b22d00f623a0","5a062c15a66ae1001a3f5b09"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"__v":17,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-01-05T02:37:28.774Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":3,"body":"For people new to Elasticsearch, shard creation can be a bit of a mystery. Shards are not created automatically, and can sometimes be added or removed through the Elasticsearch API. If you find yourself dealing with too many shards, the first step to reducing them is to figure out where they're coming from.\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Where do shards come from?\"\n}\n[/block]\nA little bit of background: whenever you create an index on a cluster, that index will be composed of shards. A shard is a [Lucene index](http://lucene.apache.org/), and the main component responsible for storing and retrieving documents. Shards play one of two roles: primary or replica. Primary shards are a logical partitioning of the data in the index and are fixed at the time that the index is created. Replica shards are extra copies used for redundancy or to handle extra search traffic, and can be added and removed on demand.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/aGeQei5AQ4SSpfwz6UyF_reduce-shards09.jpg\",\n        \"reduce-shards09.jpg\",\n        \"2732\",\n        \"2048\",\n        \"#428fd1\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nYou can specify how many primary shards and replicas are used when creating a new index.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"PUT /my_index/_settings\\n{\\n  \\\"number_of_replicas\\\": 2\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n Replicas are a multiplier on the primary shards, and the total is calculated as primary * (1+replicas). 
In other words, if you create an index with 3 primary shards and 2 replicas, you will have 9 total shards, not 5 or 6.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/37qyng8pSBCHjJMEHIZw_reduce-shards07.png\",\n        \"reduce-shards07.png\",\n        \"2400\",\n        \"1324\",\n        \"#3c8ccc\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Measuring your cluster’s index and shard usage\"\n}\n[/block]\nElasticsearch offers some API endpoints to explore the state of your indices and shards. The `_cat` APIs are helpful for human interaction. You can view your index states by visiting `/_cat/indices`,  which will show index names, primary shards and replicas. You can also inspect individual shard states and statistics by visiting `/_cat/shards`. See example output below:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\n$ curl -s https://user:password:::at:::bonsai-12345.bonsai.io/_cat/indices?v\\nhealth status index  pri rep docs.count docs.deleted store.size pri.store.size \\ngreen  open   images   1   0          0            0       130b           130b \\ngreen  open   videos   1   0          0            0       130b           130b \\ngreen  open   notes    1   0          0            0       130b           130b \\n\\n$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/shards?v\\nindex  shard pri rep state   docs store ip              node      \\nimages 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \\nnotes  0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \\nvideos 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Strategies for reducing shard counts\"\n}\n[/block]\n**1. Delete unneeded indices**\n\nIf you look at `/_cat/indices` and notice some indices that aren't needed, you can simply delete them to free up shards. Commonly, this could include test data, or old versions of data that has since been reindexed. For example:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"curl -s -XDELETE https://user:password@bonsai-12345.bonsai.io/my_old_data\",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\n**2. Reduce replication**\n\nFor most use-cases, a single replica is perfectly sufficient for redundancy and load capacity. If any of your indices have been created with more than one replica (perhaps a default in your ES client?), you can reduce it on the fly to free up shards. Recall that replicas are a multiplier against primary shards, so if you have a 5×2 configuration, reducing replication to 1 will free up five shards, not just one.\n\nA `curl` command like this should do the trick:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\ncurl -s -XPUT https://user:pass@slug-12345.bonsai.io/production -d '{\\n  \\\"settings\\\": {\\n    \\\"number_of_replicas\\\":1 \\n}}'\",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"info\",\n  \"body\": \"Note: on Bonsai, a minimum replication of 1 is recommended to keep your data available during regular system maintenance. There are also some conditions that lead to automated restarts of Elasticsearch, which can interrupt write processes and cause data corruption of a primary shard. 
Without replication in place, any data on the affected shard is lost and the index will need to be completely recreated.\"\n}\n[/block]\n**3. Consolidate your data**\n\nIt's actually quite common (unfortunately) for many Elasticsearch clients to adopt an index per model paradigm as the default for data organization.  Sharding this way can be a premature optimization unless your scaling trajectory is well understood (feel free to [contact us](mailto:support@bonsai.io) to chat more about this). In practice, small sets of data probably won't benefit from any kind of data partitioning. It’s perfectly reasonable to treat an index as analogous to, say, a Postgres database.\n\nElasticsearch even supports this kind of internal partitioning with their index “[types](https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping.html)” feature. Really types are just a special metadata field that are routable on the HTTP request path. For example, you might have a cluster that looks like this:\n\n  *  `/videos/video`\n  *  `/images/image`\n  *  `/notes/note`\n\nEach of these indices is composed of a single type. Even if each index has a 1x1 sharding configuration (1 primary, 1 replica), that's 6 shards in use on the cluster, regardless of the number of documents involved. From a capacity planning standpoint, it's probably over-provisioned.\n\nAn alternative approach would be to change the mappings to consolidate those types within a single index:\n\n  *  `/production/video`\n  *  `/production/image`\n  *  `/production/note`\n\nThe latter configuration would only require two shards (one primary, one replica) instead of six. Depending on the total size of the index, this may be just fine from a performance perspective, and in many cases can be superior.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/i0UA2yiGS8eR7U4zMUVj_reduce-shards10.jpg\",\n        \"reduce-shards10.jpg\",\n        \"2400\",\n        \"1938\",\n        \"#b7170c\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Optimizing your mappings\"\n}\n[/block]\nIf you'd like to go with Option 3, \"Consolidate your data,\" then there are a couple of ways to go about it. The solution strongly depends on what you're using to connect to Elasticsearch.\n\nThis guide will cover some common situations, but should by no means be considered exhaustive. If your situation isn't covered here, you’ll want to refer to the documentation for your Elasticsearch client of choice. Hopefully the code below can illustrate some guiding principles. (And [let us know](mailto:support@bonsai.io) if you have some better examples to share in your language of choice!)\n\n**1. Using the Elasticsearch API directly**\nIn this case, you can control exactly how indices are created. You can simply[ create an index](http://docs.bonsai.io/docs/creating-your-first-index), use Elasticsearch's [PUT mapping API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html) to create the types as needed, then reindex your data. 
Something like this should work:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Create an index called \\\"production\\\" with some blank types:\\ncurl -s -XPUT https://user:pass@slug-12345.bonsai.io/production -d '{\\n  \\\"mappings\\\": {\\n    \\\"video\\\": {\\n      \\\"properties\\\": {}\\n    },\\n    \\\"image\\\": {\\n      \\\"properties\\\": {}\\n    },\\n    \\\"note\\\": {\\n      \\\"properties\\\": {}\\n    }\\n  }\\n}'\",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\nYou now have an index with 3 types and can index data appropriately. Documents will be sent to endpoints like `/production/video` instead of `/videos/video`. The nice thing about this is that Elasticsearch supports dynamic mappings, so you don’t need to explicitly create any fields -- Elasticsearch will do that for you when you add documents.\n\n**2. Faking it with aliases**\nThe official Elasticsearch guidebook has an interesting section on designing for scale called \"[Faking Index per User with Aliases](https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html),\" which is adaptable to this context. Here, you would create a single index and several types manually, then use aliases to reroute requests accordingly. This works really well if your client is opinionated in favor of index per model, and is inflexible with alternate configurations.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\n# Create an index called \\\"production\\\" with some blank types:\\ncurl -s -XPUT https://user:pass@slug-12345.bonsai.io/production -d '{\\n  \\\"mappings\\\": {\\n    \\\"video\\\": {\\n      \\\"properties\\\": {}\\n    },\\n    \\\"image\\\": {\\n      \\\"properties\\\": {}\\n    },\\n    \\\"note\\\": {\\n      \\\"properties\\\": {}\\n    }\\n  }\\n}'\\n\\n# Set up the aliases. We'll do the one for the \\\"video\\\" model:\\ncurl -s -XPUT https://user:pass@slug-12345.bonsai.io/production/_alias/videos -d '{\\n  \\\"routing\\\": \\\"videos\\\",\\n  \\\"filter\\\": {\\n    \\\"term\\\": {\\n      \\\"_type\\\": \\\"video\\\"\\n    }\\n  }\\n}'\",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\nThis method fakes out the client; the client will send documents to `/videos/video`, and Elasticsearch will quietly route those updates to  `/production/video`.\n\nThe upside to this approach is that it will work for any client that is opinionated in favor of index per model. The downside is that it requires manual intervention. That is, if you later add a new model to be indexed, you'll need to remember to manually update the index mappings and create a new alias. Otherwise things might break.\n\n**3. Using the official Rails client**\nBy default, the Elasticsearch Rails will default to an index-per-model paradigm. Fortunately it's still flexible enough to allow developers to combine multiple models into a single index. The technique is to collect the settings and mappings specified in your models, merge them together, and pass the results into a combined `indices.create` call.\n\nTo start, make sure your models to be indexed are defined with an index name:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"class Video\\n  include Elasticsearch::Model\\n\\n  index_name 'production'\\n  \\n  # You can set up the shards however you want, but remember that\\n  # number_of_replicas must be >=1\\n  settings index: { number_of_shards: 1, number_of_replicas: 1 } do\\n    mappings do\\n      # Example field:\\n      indexes :title, analyzer: 'english'\\n      # ...  
whatever other fields you want\\n    end\\n  end\\n  \\nend\",\n      \"language\": \"ruby\"\n    }\n  ]\n}\n[/block]\nEnsure that all models to be indexed follow this same format. When that's done, you can use a custom rake task to index all of the models on to a single index. [This gist](https://gist.github.com/ghilead/6291718b0e97232a09f7) provides a great example of how to do it (based on [this conversation](https://github.com/elastic/elasticsearch-rails/issues/80)).","excerpt":"","slug":"reducing-shard-usage","type":"basic","title":"Reducing Shard Usage"}

# Reducing Shard Usage


For people new to Elasticsearch, shard creation can be a bit of a mystery. Shards are created whenever an index is created, and replica shards can later be added or removed through the Elasticsearch API. If you find yourself dealing with too many shards, the first step to reducing them is to figure out where they're coming from.

## Where do shards come from?

A little bit of background: whenever you create an index on a cluster, that index will be composed of shards. A shard is a [Lucene index](http://lucene.apache.org/), and the main component responsible for storing and retrieving documents. Shards play one of two roles: primary or replica. Primary shards are a logical partitioning of the data in the index and are fixed at the time the index is created. Replica shards are extra copies used for redundancy or to handle extra search traffic, and can be added and removed on demand.

![reduce-shards09.jpg](https://files.readme.io/aGeQei5AQ4SSpfwz6UyF_reduce-shards09.jpg)

You can specify how many primary shards and replicas are used when creating a new index. For example, this creates an index with 3 primary shards and 2 replicas:

```json
PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```

Replicas are a multiplier on the primary shards, and the total is calculated as `primaries * (1 + replicas)`. In other words, if you create an index with 3 primary shards and 2 replicas, you will have 9 total shards, not 5 or 6.

![reduce-shards07.png](https://files.readme.io/37qyng8pSBCHjJMEHIZw_reduce-shards07.png)

## Measuring your cluster's index and shard usage

Elasticsearch offers some API endpoints to explore the state of your indices and shards. The `_cat` APIs are designed for human inspection. You can view your index states by visiting `/_cat/indices`, which will show index names, primary shard counts, and replica counts. You can also inspect individual shard states and statistics by visiting `/_cat/shards`. See the example output below:

```bash
$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/indices?v
health status index  pri rep docs.count docs.deleted store.size pri.store.size
green  open   images   1   0          0            0       130b           130b
green  open   videos   1   0          0            0       130b           130b
green  open   notes    1   0          0            0       130b           130b

$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/shards?v
index  shard pri rep state   docs store ip              node
images 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man
notes  0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man
videos 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man
```
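If you just want a raw total, the headerless `_cat/shards` output prints one line per shard, so counting lines gives a cluster-wide shard count. A quick sketch, assuming the same cluster URL and credentials as above:

```bash
# Without ?v there is no header row; each line is one shard, so the
# line count is the total number of primaries plus replicas.
curl -s https://user:password@bonsai-12345.bonsai.io/_cat/shards | wc -l
```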
## Strategies for reducing shard counts

**1. Delete unneeded indices**

If you look at `/_cat/indices` and notice some indices that aren't needed, you can simply delete them to free up shards. Commonly, these include test data, or old versions of data that have since been reindexed. For example:

```bash
curl -s -XDELETE https://user:password@bonsai-12345.bonsai.io/my_old_data
```

**2. Reduce replication**

For most use cases, a single replica is perfectly sufficient for redundancy and load capacity. If any of your indices have been created with more than one replica (perhaps a default in your ES client?), you can reduce it on the fly to free up shards. Recall that replicas are a multiplier against primary shards, so if you have a 5×2 configuration, reducing replication to 1 will free up five shards, not just one.

A `curl` command like this should do the trick:

```bash
curl -s -XPUT https://user:pass@slug-12345.bonsai.io/production/_settings -d '{
  "number_of_replicas": 1
}'
```
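Once the setting is applied, it's worth confirming that it took effect. One way, assuming the same cluster and index name as above, is to read the index settings back:

```bash
# Expect "number_of_replicas": "1" in the response
curl -s https://user:pass@slug-12345.bonsai.io/production/_settings
```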
> **Note:** on Bonsai, a minimum replication of 1 is recommended to keep your data available during regular system maintenance. There are also some conditions that lead to automated restarts of Elasticsearch, which can interrupt write processes and corrupt a primary shard. Without replication in place, any data on the affected shard is lost, and the index will need to be completely recreated.

**3. Consolidate your data**

It's unfortunately quite common for Elasticsearch clients to adopt an index-per-model paradigm as the default for data organization. Sharding this way can be a premature optimization unless your scaling trajectory is well understood (feel free to [contact us](mailto:support@bonsai.io) to chat more about this). In practice, small sets of data probably won't benefit from any kind of data partitioning. It's perfectly reasonable to treat an index as analogous to, say, a Postgres database.

Elasticsearch even supports this kind of internal partitioning with its index "[types](https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping.html)" feature. Types are really just a special metadata field (`_type`) that is routable on the HTTP request path. For example, you might have a cluster that looks like this:

* `/videos/video`
* `/images/image`
* `/notes/note`

Each of these indices is composed of a single type. Even if each index has a 1×1 sharding configuration (1 primary, 1 replica), that's 6 shards in use on the cluster, regardless of the number of documents involved. From a capacity planning standpoint, it's probably over-provisioned.

An alternative approach would be to change the mappings to consolidate those types within a single index:

* `/production/video`
* `/production/image`
* `/production/note`

The latter configuration would only require two shards (one primary, one replica) instead of six. Depending on the total size of the index, this may be just fine from a performance perspective, and in many cases can be superior.

![reduce-shards10.jpg](https://files.readme.io/i0UA2yiGS8eR7U4zMUVj_reduce-shards10.jpg)

## Optimizing your mappings

If you'd like to go with Option 3, "Consolidate your data," then there are a couple of ways to go about it. The right solution depends heavily on what you're using to connect to Elasticsearch.

This guide covers some common situations, but should by no means be considered exhaustive. If your situation isn't covered here, you'll want to refer to the documentation for your Elasticsearch client of choice. Hopefully the code below illustrates some guiding principles. (And [let us know](mailto:support@bonsai.io) if you have better examples to share in your language of choice!)

**1. Using the Elasticsearch API directly**

In this case, you control exactly how indices are created. You can simply [create an index](http://docs.bonsai.io/docs/creating-your-first-index), use Elasticsearch's [PUT mapping API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html) to create the types as needed, then reindex your data. Something like this should work:

```bash
# Create an index called "production" with some blank types:
curl -s -XPUT https://user:pass@slug-12345.bonsai.io/production -d '{
  "mappings": {
    "video": {
      "properties": {}
    },
    "image": {
      "properties": {}
    },
    "note": {
      "properties": {}
    }
  }
}'
```

You now have an index with 3 types and can index data appropriately. Documents will be sent to endpoints like `/production/video` instead of `/videos/video`. The nice thing about this is that Elasticsearch supports dynamic mappings, so you don't need to explicitly create any fields -- Elasticsearch will do that for you when you add documents.

**2. Faking it with aliases**

The official Elasticsearch guide has an interesting section on designing for scale called "[Faking Index per User with Aliases](https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html)," which is adaptable to this context. Here, you create a single index and several types manually, then use aliases to reroute requests accordingly. This works really well if your client is opinionated in favor of index per model, and inflexible about alternate configurations.

```bash
# Create an index called "production" with some blank types:
curl -s -XPUT https://user:pass@slug-12345.bonsai.io/production -d '{
  "mappings": {
    "video": {
      "properties": {}
    },
    "image": {
      "properties": {}
    },
    "note": {
      "properties": {}
    }
  }
}'

# Set up the aliases. We'll do the one for the "video" model:
curl -s -XPUT https://user:pass@slug-12345.bonsai.io/production/_alias/videos -d '{
  "routing": "videos",
  "filter": {
    "term": {
      "_type": "video"
    }
  }
}'
```

This method fakes out the client; the client will send documents to `/videos/video`, and Elasticsearch will quietly route those updates to `/production/video`.

The upside to this approach is that it works for any client that is opinionated in favor of index per model. The downside is that it requires manual intervention: if you later add a new model to be indexed, you'll need to remember to manually update the index mappings and create a new alias. Otherwise things might break.
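To sanity-check an alias like this, you can search through it and confirm that only documents of the aliased type come back. For example, assuming the `videos` alias created above:

```bash
# Searches sent to the alias are filtered to _type "video" and routed
# with the "videos" routing value behind the scenes.
curl -s https://user:pass@slug-12345.bonsai.io/videos/_search
```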
**3. Using the official Rails client**

By default, the official Elasticsearch Rails client uses an index-per-model paradigm. Fortunately it's still flexible enough to allow developers to combine multiple models into a single index. The technique is to collect the settings and mappings specified in your models, merge them together, and pass the results into a combined `indices.create` call.

To start, make sure your models to be indexed are defined with an index name:

```ruby
class Video
  include Elasticsearch::Model

  index_name 'production'

  # You can set up the shards however you want, but remember that
  # number_of_replicas must be >= 1
  settings index: { number_of_shards: 1, number_of_replicas: 1 } do
    mappings do
      # Example field:
      indexes :title, analyzer: 'english'
      # ... whatever other fields you want
    end
  end

end
```

Ensure that all models to be indexed follow this same format. When that's done, you can use a custom rake task to index all of the models into a single index. [This gist](https://gist.github.com/ghilead/6291718b0e97232a09f7) provides a great example of how to do it (based on [this conversation](https://github.com/elastic/elasticsearch-rails/issues/80)).
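After reindexing, a quick way to confirm the consolidation worked is to read back the mappings and check that each model's type now lives under the single `production` index (index name and cluster URL assumed from the earlier examples):

```bash
# Should show the video, image, and note mappings under one index
curl -s https://user:pass@slug-12345.bonsai.io/production/_mapping
```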