Reducing Shard Usage

Managing shards is a basic skill for any Elasticsearch user. Shards carry system overhead and potentially stale data. Keeping your cluster clean by pruning old shards can both improve performance and reduce your server costs.

This guide will make frequent references to the Elasticsearch API using the command line tool curl. Interacting with your own cluster can also be done via curl, or via a web browser. You can also use the Interactive Console in your cluster dashboard.

If you’re just starting out, check out our illustrated Shard Primer, and read over the Elasticsearch Core Concepts documentation.

Deleting Unneeded Indices

There are some cases where one or more indices are created on a cluster for testing purposes, and are not actually being used for anything. These will count towards the shard limits; if you’re getting overage notifications, then you should delete these indices.

There are also some clients that will use aliases to roll out changes to mappings. This is a really nice feature that allows for zero-downtime reindexing and immediate rollbacks if there’s a problem, but it can also result in lots of stale indices and data accumulating in the cluster.

To determine if you have extraneous indices, use the /_cat/indices endpoint to get a list of all the indices you have in your cluster:

GET /_cat/indices
green open test1           1 1 0 0  318b  159b
green open test2           1 1 0 0  318b  159b
green open test3           1 1 0 0  318b  159b
green open test4           1 1 0 0  318b  159b
green open test5           1 1 0 0  318b  159b
green open test6           1 1 0 0  318b  159b

If you see indices that you don’t need, you can simply delete them:

# Delete a single index:
DELETE /test1

# Delete a group of indices:
DELETE /test2,test3,test4,test5

Use a Different Sharding Scheme

It’s possible that for some reason one or more indices were created with far more shards than necessary. For example, a check of /_cat/indices shows something like this:

GET /_cat/indices
green open test1           5 2 0 0  318b  159b
green open test2           5 2 0 0  318b  159b

That 5x2 shard scheme results in 15 shards per index, or 30 total for this cluster. This is probably really overprovisioned for these indices. Choosing a more conservative shard scheme of 1x1 would reduce this cluster’s usage from 30 shards down to 4.

Unfortunately, the number of primary shards can not be changed once an index has been created. To fix this, you will need to manually create a new index with the desired shard scheme and reindex the data. If you have not read The Ideal Elasticsearch Index, it has some really nice information on capacity planning and sizing. Check out the sections on Intelligent Sharding and Benchmarking for some tips on what scheme would make more sense for your particular use case.

Reduce replication

For most use-cases, a single replica is perfectly sufficient for redundancy and load capacity. If any of your indices have been created with more than one replica, you can reduce it to free up shards. An index with more than one replica might look like this:

GET /_cat/indices?v
health status index    pri rep docs.count docs.deleted store.size pri.store.size
green  open   test1    5   2   0          0            318b       159b
green  open   test2    5   2   0          0            318b       159b

Note that the rep column has a 2? That means there are actually 3 copies of the data: one primary shard, and its two replicas. Replicas are a multiplier against primary shards, so if an index has a 5×2 configuration (5 primary shards with 2 replicas), reducing replication to 1 will free up five shards, not just one. See the Shard Primer for more details.

Fortunately, reducing the replica count for the index is a small JSON body to the _settings endpoint:

PUT /test1,test2/_settings -d '{"index":{"number_of_replicas":1}}'
{"acknowledged":true}

GET /_cat/indices
green open test1           5 1 0 0  318b  159b
green open test2           5 1 0 0  318b  159b

That simple request shaved 10 shards off of this cluster’s usage.

Replication == Availability and Redundancy

It might seem like a good money-saving idea to simply set all replicas to 0 so as to fit as many indices into your cluster as possible. However, this is not advisable. This means that your primary data has no live backup. If a node in your cluster goes offline, data loss is basically guaranteed.

Data can be restored from a snapshot, but this is messy and not a great failover plan. The snapshot could be an hour or more old, and any updates to your data since then either need to be reindexed or are lost for good. Additionally, the outage will last much longer.

Having replication of at least 1 mitigates against all these problems.

Data Collocation

Another solution for reducing shard usage involves using aliases and custom routing rules to collocate different data models onto the same group of shards.

What is data collocation? Many Elasticsearch clients use an index per model paradigm as the default for data organization. This is analogous to, say, a Postgres database with a table for each type of data being indexed.

Sharding this way makes sense most of the time, but in some rare cases users may benefit from putting all of the data into a single namespace. In the Postgres analogy, this would be like putting all of the data into a single table instead of a table for each model. An attribute (i.e., table column) is then used to filter searches by class.

For example, you might have a cluster that has three indices: videos, images and notes. If each of these has a conservative 1x1 sharding scheme, it would require 6 shards. But this data could potentially be compacted down into a single index, production, where the mapping has a type field of some kind to indicate whether the document belongs to the video, image or note class.

The latter configuration with the same 1x1 scheme would only require two shards (one primary, one replica) instead of six:

There are several major downsides to this approach. One is that field name collisions become an issue. For example, if there is a field called published in two models, but one is defined as a boolean and the other is a datetime, it will create a conflict in the mapping. One will need to be renamed.

Another downside is that it is a pretty large refactor for most users, and may be more trouble than simply upgrading the plan. Overriding the default behavior in the application’s Elasticsearch client may require forking the code and relying on other hacks/workarounds.

There are others. Data collocation is mentioned here as a possibility, and one that only works for certain users. It is by no means a recommendation.