{"_id":"5a8fae0368264c001f20cc0e","category":{"_id":"5a8fae0368264c001f20cc03","version":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-12-23T20:11:49.377Z","from_sync":false,"order":5,"slug":"best-practices","title":"Elasticsearch Tips & Tricks"},"project":"5633ebff7e9e880d00af1a53","user":"5633ec9b35355017003ca3f2","parentDoc":null,"version":{"_id":"5a8fae0268264c001f20cc00","project":"5633ebff7e9e880d00af1a53","__v":2,"createdAt":"2018-02-23T06:00:34.961Z","releaseDate":"2018-02-23T06:00:34.961Z","categories":["5a8fae0268264c001f20cc01","5a8fae0268264c001f20cc02","5a8fae0368264c001f20cc03","5a8fae0368264c001f20cc04","5a8fae0368264c001f20cc05","5a8fae0368264c001f20cc06","5a8fae0368264c001f20cc07","5a8fae0368264c001f20cc08","5a8fae0368264c001f20cc09","5abaa7eb72d6dc0028a07bf3"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"2.0.0","version":"2.0"},"githubsync":"","__v":0,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-03-16T21:41:49.159Z","link_external":false,"link_url":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":2,"body":"For people new to Elasticsearch, shards can be a bit of a mystery. Why is it possible to add or remove replica shards on demand, but not primary shards? What's the difference? How many indices can fit on a shard?\n\nThis article explains what shards are, how they work, and how they can best be used.\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Where do shards come from?\"\n}\n[/block]\nFirst, a little bit of background: Elasticsearch is built on top of [Lucene](http://lucene.apache.org/), which is a data storage and retrieval engine. What are called \"shards\" in Elasticsearch parlance are technically Lucene instances. 
Elasticsearch manages these instances, spreading data across them and automatically balancing them across nodes.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/qNFuyoeARFmlzsQfu4Ox_reduce-shards09.jpg\",\n        \"reduce-shards09.jpg\",\n        \"2732\",\n        \"2048\",\n        \"#428fd1\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\nWhenever an Elasticsearch index is created, that index will be composed of one or more shards. This means that an Elasticsearch index can spread data across several Lucene instances. This architecture is useful for both redundancy and parallelization purposes. \n\nShards play one of two roles: primary or replica. Primary shards are a logical partitioning of the data in the index, and are fixed at the time that the index is created. Primary shards are useful for parallelization; when a large amount of data is split across several primary shards, a node can run a query on several Lucene instances in parallel, reducing the overall time of the job. \n[block:callout]\n{\n  \"type\": \"danger\",\n  \"body\": \"From one of our more popular [blog entries](https://bonsai.io/blog/ideal-elasticsearch-cluster/): \\\"Elasticsearch uses a naive hashing algorithm to route documents to a given primary shard. This design choice allows documents to be randomly distributed in a reproducible way. This avoids “hot spots” that affect performance and overallocation. However, it has one major downside, which is that **the number of primary shards can not be changed after an index has been created.** Replicas can be added and removed at will, but the number of primary shards is basically written in stone.\\\"\",\n  \"title\": \"Primary Shards Can Not Be Added/Removed Later\"\n}\n[/block]\nIn contrast, replica shards are simply extra copies of the data. 
They are useful for redundancy or to handle extra search traffic, and can be added and removed on demand.\n\nYou can specify how many primary shards and replicas are used when creating a new index.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"PUT /my_index\\n{\\n  \\\"settings\\\": {\\n    \\\"number_of_shards\\\": 1,\\n    \\\"number_of_replicas\\\": 2\\n  }\\n}\",\n      \"language\": \"json\"\n    }\n  ]\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"Replicas and High-Availability\",\n  \"body\": \"Replica shards are not only capable of serving search traffic, but they also provide a level of protection against data loss. If a node hosting a primary shard is taken offline for some reason, Elasticsearch will promote its replica to a primary role, if a replica exists. However, if the index's replica count is set to 0, then it is not in a High-Availability configuration: if the node hosting a primary shard is lost, there is no other copy to promote, and that data is lost with it.\"\n}\n[/block]\nReplicas are a multiplier on the primary shards: the total number of shards is calculated as `primaries * (1 + replicas)`. In other words, if you create an index with 3 primary shards and 2 replicas, you will have 9 total shards, not 5 or 6.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/f5yajUc0QmaFq8b1gnYr_reduce-shards07.png\",\n        \"reduce-shards07.png\",\n        \"2400\",\n        \"1324\",\n        \"#3c8ccc\",\n        \"\"\n      ]\n    }\n  ]\n}\n[/block]\n\n[block:api-header]\n{\n  \"type\": \"basic\",\n  \"title\": \"Measuring your cluster’s index and shard usage\"\n}\n[/block]\nElasticsearch offers some API endpoints to explore the state of your indices and shards. The `_cat` APIs return compact, human-readable output. You can view your index states by visiting `/_cat/indices`, which will show index names, primary shards and replicas. You can also inspect individual shard states and statistics by visiting `/_cat/shards`. 
See example output below:\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"\\n$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/indices?v\\nhealth status index  pri rep docs.count docs.deleted store.size pri.store.size \\ngreen  open   images   1   0          0            0       130b           130b \\ngreen  open   videos   1   0          0            0       130b           130b \\ngreen  open   notes    1   0          0            0       130b           130b \\n\\n$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/shards?v\\nindex  shard prirep  state   docs store ip              node      \\nimages 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \\nnotes  0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \\nvideos 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \",\n      \"language\": \"curl\"\n    }\n  ]\n}\n[/block]\nWe've also made this easy by creating a live interactive console for you. Just visit your cluster's dashboard console, choose `GET` from the dropdown, and run `/_cat/indices?v` or `/_cat/shards?v`. Check out the documentation on the Interactive Console below:\n\n* [Bonsai.io Users](doc:managing-your-cluster#interactive-console) \n* [Heroku Users](doc:bonsai-elasticsearch-dashboard#interactive-console) \n* [Manifold Users](doc:managing-your-cluster-1#interactive-console) \n[block:api-header]\n{\n  \"title\": \"Frequently Asked Questions About Shards\"\n}\n[/block]\n## Q. How Many Indices Can I Put On A Shard?\n\n**Short Answer:** An index can have many shards, but any given shard can only belong to **one** index.\n\n**Long Answer:** It is possible to \"fake\" multiple indices per shard by using \"type\" fields and [index aliases](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html). There is a [section](https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html) of the Elasticsearch Guide dedicated to this approach. 
Essentially, if you have several data models, you could put two or more into a single index instead of separate indices, and use aliases to perform automated filtering of queries.\n\nThe downside to this approach is that it requires a fair amount of work. Most frameworks and content management systems with integrated Elasticsearch clients will not take this approach by default. Developers will need to create and manage the index settings manually instead of relying on automated tools.\n\n## Q. How Many Shards Do I Need?\n\n**Answer:** It depends. Because replicas can be added/removed at will, the real question is how many _primary_ shards are needed. Increasing the number of primary shards for an index is one way to improve performance because it allows a query to be processed in parallel across shards. But there are a lot of reasons to *not* have a [kajillion primary shards](https://www.elastic.co/guide/en/elasticsearch/guide/current/kagillion-shards.html).\n\nGenerally, we recommend that if you don’t expect data to grow significantly, then:\n\n- One primary shard is fine if you have fewer than 100K documents\n- One primary shard per node is good if you have over 100K documents\n- One primary shard per CPU core is good if you have at least a couple million documents\n\nIf you anticipate lots of growth -- orders of magnitude within a short time -- then the problem is a little more complicated. Read through the [Ideal Elasticsearch Index](https://bonsai.io/blog/ideal-elasticsearch-cluster/) post, especially the benchmarking section, or send an email to [support@bonsai.io](mailto:support@bonsai.io).\n\n## Q. How Do I Reduce The Number of Shards I'm Using?\n\n**Answer:** There is an article dealing with this very subject: [Reducing Shard Usage](doc:reducing-shard-usage).","excerpt":"","slug":"what-are-shards-and-replicas","type":"basic","title":"Shard Primer"}
For people new to Elasticsearch, shards can be a bit of a mystery. Why is it possible to add or remove replica shards on demand, but not primary shards? What's the difference? How many indices can fit on a shard?

This article explains what shards are, how they work, and how they can best be used.
[block:api-header]
{
  "type": "basic",
  "title": "Where do shards come from?"
}
[/block]
First, a little bit of background: Elasticsearch is built on top of [Lucene](http://lucene.apache.org/), which is a data storage and retrieval engine. What are called "shards" in Elasticsearch parlance are technically Lucene instances. Elasticsearch manages these instances, spreading data across them and automatically balancing them across nodes.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/qNFuyoeARFmlzsQfu4Ox_reduce-shards09.jpg",
        "reduce-shards09.jpg",
        "2732",
        "2048",
        "#428fd1",
        ""
      ]
    }
  ]
}
[/block]
Whenever an Elasticsearch index is created, that index will be composed of one or more shards. This means that an Elasticsearch index can spread data across several Lucene instances. This architecture is useful for both redundancy and parallelization purposes.

Shards play one of two roles: primary or replica. Primary shards are a logical partitioning of the data in the index, and are fixed at the time that the index is created. Primary shards are useful for parallelization; when a large amount of data is split across several primary shards, a node can run a query on several Lucene instances in parallel, reducing the overall time of the job.
[block:callout]
{
  "type": "danger",
  "body": "From one of our more popular [blog entries](https://bonsai.io/blog/ideal-elasticsearch-cluster/): \"Elasticsearch uses a naive hashing algorithm to route documents to a given primary shard. This design choice allows documents to be randomly distributed in a reproducible way. This avoids “hot spots” that affect performance and overallocation. 
However, it has one major downside, which is that **the number of primary shards can not be changed after an index has been created.** Replicas can be added and removed at will, but the number of primary shards is basically written in stone.\"",
  "title": "Primary Shards Can Not Be Added/Removed Later"
}
[/block]
In contrast, replica shards are simply extra copies of the data. They are useful for redundancy or to handle extra search traffic, and can be added and removed on demand.

You can specify how many primary shards and replicas are used when creating a new index.
[block:code]
{
  "codes": [
    {
      "code": "PUT /my_index\n{\n  \"settings\": {\n    \"number_of_shards\": 1,\n    \"number_of_replicas\": 2\n  }\n}",
      "language": "json"
    }
  ]
}
[/block]

[block:callout]
{
  "type": "warning",
  "title": "Replicas and High-Availability",
  "body": "Replica shards are not only capable of serving search traffic, but they also provide a level of protection against data loss. If a node hosting a primary shard is taken offline for some reason, Elasticsearch will promote its replica to a primary role, if a replica exists. However, if the index's replica count is set to 0, then it is not in a High-Availability configuration: if the node hosting a primary shard is lost, there is no other copy to promote, and that data is lost with it."
}
[/block]
Replicas are a multiplier on the primary shards: the total number of shards is calculated as `primaries * (1 + replicas)`. In other words, if you create an index with 3 primary shards and 2 replicas, you will have 9 total shards, not 5 or 6.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/f5yajUc0QmaFq8b1gnYr_reduce-shards07.png",
        "reduce-shards07.png",
        "2400",
        "1324",
        "#3c8ccc",
        ""
      ]
    }
  ]
}
[/block]

[block:api-header]
{
  "type": "basic",
  "title": "Measuring your cluster’s index and shard usage"
}
[/block]
Elasticsearch offers some API endpoints to explore the state of your indices and shards. The `_cat` APIs return compact, human-readable output. 
You can view your index states by visiting `/_cat/indices`, which will show index names, primary shards and replicas. You can also inspect individual shard states and statistics by visiting `/_cat/shards`. See example output below:

[block:code]
{
  "codes": [
    {
      "code": "\n$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/indices?v\nhealth status index  pri rep docs.count docs.deleted store.size pri.store.size \ngreen  open   images   1   0          0            0       130b           130b \ngreen  open   videos   1   0          0            0       130b           130b \ngreen  open   notes    1   0          0            0       130b           130b \n\n$ curl -s https://user:password@bonsai-12345.bonsai.io/_cat/shards?v\nindex  shard prirep  state   docs store ip              node      \nimages 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \nnotes  0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man \nvideos 0     p       STARTED    0  130b XXX.XXX.XXX.XXX Sugar Man ",
      "language": "curl"
    }
  ]
}
[/block]
We've also made this easy by creating a live interactive console for you. Just visit your cluster's dashboard console, choose `GET` from the dropdown, and run `/_cat/indices?v` or `/_cat/shards?v`. Check out the documentation on the Interactive Console below:

* [Bonsai.io Users](doc:managing-your-cluster#interactive-console)
* [Heroku Users](doc:bonsai-elasticsearch-dashboard#interactive-console)
* [Manifold Users](doc:managing-your-cluster-1#interactive-console)
[block:api-header]
{
  "title": "Frequently Asked Questions About Shards"
}
[/block]
## Q. How Many Indices Can I Put On A Shard?

**Short Answer:** An index can have many shards, but any given shard can only belong to **one** index.

**Long Answer:** It is possible to "fake" multiple indices per shard by using "type" fields and [index aliases](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html). There is a [section](https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html) of the Elasticsearch Guide dedicated to this approach. 
Essentially, if you have several data models, you could put two or more into a single index instead of separate indices, and use aliases to perform automated filtering of queries.

The downside to this approach is that it requires a fair amount of work. Most frameworks and content management systems with integrated Elasticsearch clients will not take this approach by default. Developers will need to create and manage the index settings manually instead of relying on automated tools.

## Q. How Many Shards Do I Need?

**Answer:** It depends. Because replicas can be added/removed at will, the real question is how many _primary_ shards are needed. Increasing the number of primary shards for an index is one way to improve performance because it allows a query to be processed in parallel across shards. But there are a lot of reasons to *not* have a [kajillion primary shards](https://www.elastic.co/guide/en/elasticsearch/guide/current/kagillion-shards.html).

Generally, we recommend that if you don’t expect data to grow significantly, then:

- One primary shard is fine if you have fewer than 100K documents
- One primary shard per node is good if you have over 100K documents
- One primary shard per CPU core is good if you have at least a couple million documents

If you anticipate lots of growth -- orders of magnitude within a short time -- then the problem is a little more complicated. Read through the [Ideal Elasticsearch Index](https://bonsai.io/blog/ideal-elasticsearch-cluster/) post, especially the benchmarking section, or send an email to [support@bonsai.io](mailto:support@bonsai.io).

## Q. How Do I Reduce The Number of Shards I'm Using?

**Answer:** There is an article dealing with this very subject: [Reducing Shard Usage](doc:reducing-shard-usage).
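
The shard arithmetic and the sizing rules of thumb from this article can be sketched in a few lines of Python. This is only an illustration: the function names `total_shards` and `suggested_primaries` are hypothetical, the 2,000,000-document cutoff is an assumed reading of "a couple million", and `cpu_cores` is assumed to mean the total number of cores available to the cluster. None of this is part of Elasticsearch or any client library.

```python
"""Sketch of the shard arithmetic and sizing rules of thumb from this article.

All names and the 2,000,000-document threshold are illustrative assumptions,
not part of Elasticsearch or any client library.
"""


def total_shards(primaries: int, replicas: int) -> int:
    """Return the total shard count for one index: primaries * (1 + replicas)."""
    if primaries < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return primaries * (1 + replicas)


def suggested_primaries(doc_count: int, node_count: int, cpu_cores: int) -> int:
    """Apply the article's rules of thumb for data that is not growing quickly.

    Assumptions: exactly 100K documents is treated like "over 100K", and
    "a couple million" is read as 2,000,000.
    """
    if doc_count < 100_000:
        return 1                   # small index: one primary shard is fine
    if doc_count < 2_000_000:
        return max(1, node_count)  # one primary shard per node
    return max(1, cpu_cores)       # one primary shard per CPU core


if __name__ == "__main__":
    # 3 primaries with 2 replicas each: 3 * (1 + 2)
    print(total_shards(3, 2))  # 9
    print(suggested_primaries(500_000, node_count=3, cpu_cores=12))  # 3
```

Keeping the two calculations separate mirrors the point made above: the replica count can be changed at any time, so only the primary count needs to be decided up front.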