Reducing Data Usage

Bonsai meters on the total amount of disk space a cluster can consume. This is for capacity planning purposes, and to ensure multitenant customers have their fair share of resources. Bonsai calculates a cluster’s disk usage by looking at the total data store size in bytes. This information can be found in the Index Stats API.
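As a rough illustration, the metered figure corresponds to the total store size reported by the Index Stats API (GET /_stats). The sketch below assumes the response has already been fetched and parsed into a dictionary. The sample_stats payload and the total_store_bytes helper are hypothetical; only the _all.total.store.size_in_bytes field path comes from the Elasticsearch stats response.

```python
def total_store_bytes(stats: dict) -> int:
    """Return the cluster-wide store size (primaries plus replicas) in bytes."""
    return stats["_all"]["total"]["store"]["size_in_bytes"]


# Hypothetical, heavily trimmed response from GET /_stats
sample_stats = {
    "_all": {
        "primaries": {"store": {"size_in_bytes": 16_857_746_636}},
        "total": {"store": {"size_in_bytes": 50_573_239_910}},
    }
}

used = total_store_bytes(sample_stats)
print(f"{used / 1024**3:.1f} GiB on disk")
```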

Disk overages can be resolved in a few different ways, each of which is covered in this documentation:

  1. Remove Stale Data / Indices
  2. Purge Deleted Documents
  3. Reindex with Smaller Mappings
  4. Upgrade the Subscription

Remove Stale Data / Indices

There are some cases where one or more indices are created on a cluster for testing purposes, and are not actually being used for anything. These will count towards the data limits; if you’re getting overage notifications, then you should delete these indices.

GET /_cat/indices
green open prod20180101    1 1 1015123 0  32M  64M
green open prod20180201    1 1 1016456 0  35M  70M
green open prod20180301    1 1 1017123 0  39M  78M
green open prod20180401    1 1 1018456 0  45M  90M
green open prod20180501    1 1 1019123 0  47M  94M
green open prod20180601    1 1 1020456 0  51M  102M

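Removing the old and unneeded indices in the example above (everything except the most recent index) would free up 396MB, which is the sum of the last column for those five indices. A single command could do it: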

# Delete a group of indices:
DELETE /prod20180101,prod20180201,prod20180301,prod20180401,prod20180501
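If index names carry a date suffix like the ones above, the list of stale indices can be built programmatically. This is a minimal sketch, assuming the names follow a prefix plus YYYYMMDD pattern and that the sizes are the last column of the listing in MB. The stale_indices helper and the cutoff date are illustrative, not part of any Bonsai tooling.

```python
from datetime import date, datetime

# Index names and the last size column (in MB) from the listing above.
INDEX_SIZES_MB = {
    "prod20180101": 64,
    "prod20180201": 70,
    "prod20180301": 78,
    "prod20180401": 90,
    "prod20180501": 94,
    "prod20180601": 102,
}


def stale_indices(sizes, prefix, cutoff):
    """Return index names whose YYYYMMDD suffix is older than `cutoff`."""
    stale = []
    for name in sorted(sizes):
        if not name.startswith(prefix):
            continue
        created = datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        if created < cutoff:
            stale.append(name)
    return stale


stale = stale_indices(INDEX_SIZES_MB, "prod", date(2018, 6, 1))
delete_request = "DELETE /" + ",".join(stale)
freed_mb = sum(INDEX_SIZES_MB[name] for name in stale)

print(delete_request)
print(f"frees roughly {freed_mb}MB")
```

Review the generated request before sending it, since deleting an index cannot be undone.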

Purge Deleted Documents

Data in Elasticsearch is spread across lots of files called segments. Segments each contain some number of documents. An index could have dozens, hundreds or even thousands of segment files, and Elasticsearch will periodically merge some segment files into others.

When a document is deleted in Elasticsearch, its segment file is simply updated to mark the document as deleted. The data is not actually removed until that segment file is merged with another. Elasticsearch normally handles segment merging automatically, but forcing a merge will reduce the overall disk footprint of the cluster by eliminating deleted documents.

This can be done through the Optimize / Forcemerge API, but the same effect can be accomplished more efficiently by simply reindexing. Reindexing will cause the data to be refreshed, and no deleted documents will be tracked by Elasticsearch. This will reduce disk usage.

To check whether this will work for you, look at the /_cat/indices data. There is a column called docs.deleted, which shows how many documents are sitting on the disk and are marked as deleted. This should give a sense of how much data could be freed up by reindexing. For example:

health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   my_index  3   2   15678948   6895795      47.1G      15.7G

In this case, docs.deleted accounts for around 30% of the documents in the primary store (6,895,795 of 22,574,743), or around 4.8GB of primary data. With replication (one primary plus two replicas of each shard), this works out to something like 14.4GB of total disk marked for deletion. Reindexing would reduce the cluster’s disk footprint by about this much. The result would look like this:

health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   my_index  3   2   15678948   0            32.7G      10.9G
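The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes deleted documents are, on average, the same size as live ones, so the result is only a ballpark figure. The reclaimable_gb helper is hypothetical.

```python
def reclaimable_gb(docs_count, docs_deleted, pri_store_gb, replicas):
    """Estimate (primary, total) disk held by deleted documents, in GB."""
    deleted_fraction = docs_deleted / (docs_count + docs_deleted)
    primary = pri_store_gb * deleted_fraction
    total = primary * (1 + replicas)  # every replica holds its own copy
    return primary, total


primary, total = reclaimable_gb(15_678_948, 6_895_795, 15.7, replicas=2)
print(f"primary: {primary:.1f}GB, total: {total:.1f}GB")
# primary: 4.8GB, total: 14.4GB
```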

Protip: Queue Writes and Reindex in the Background for Minimal Impact

Your app’s search could be down or degraded during a reindex. If reindexing will take a long time, that may make this option unfeasible. However, you could minimize the impact by using something like Kafka to queue writes while reindexing to a new index.

Search traffic can continue to be served from the old index until its replacement is ready. Flush the queued updates from Kafka into the new index, then destroy the old index and use an alias to promote the new index.
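One way to promote the new index is a single request to the Elasticsearch aliases endpoint (POST /_aliases), which applies its actions atomically. The sketch below only builds the request body. It assumes the application already reads through an alias, and the alias and index names are made up for illustration.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a POST /_aliases body that repoints `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: "products" is the alias the app queries.
body = alias_swap_actions("products", "products_v1", "products_v2")
print("POST /_aliases")
print(json.dumps(body, indent=2))
```

Once the alias points at the new index and the queued writes have been applied, the old index can be deleted to reclaim its disk space.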

The tradeoff of this solution is that you’ll minimize the impact to your traffic/users, but you’ll need to set up and manage the queue software. You’ll also have a lot of duplicated data for a short period of time, so your footprint could be way above the subscription limit for a short time.

To prevent the state machine from disabling your cluster, you might want to consider temporarily upgrading to perform the operation, then downgrading when you’re done. Billing is prorated, so this would not add much to your invoice. You can always email us to discuss options before settling on a decision.

Reindex with Smaller Mappings

Mappings define how data is stored and indexed in an Elasticsearch cluster. There are some settings which can cause the disk footprint to grow dramatically.

For example, synonym expansion can cause lots of extra tokens to be generated per input token (if you’re using WordNet, see our documentation article on it, specifically Why Wouldn’t Everyone Want WordNet?). If you’re using lots of index-time synonym expansion, then you’re essentially inflating the document sizes with lots of data, with the tradeoff (hopefully) being improved relevancy.

Another example would be Ngrams. Ngrams are tokens generated from the parts of other tokens. A token like “hello” could be broken into 2-grams like “he”, “el”, “ll”, and “lo”. In 3-grams, it would be “hel”, “ell” and “llo”. And so on. The Elasticsearch Guide has more examples.
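The splitting described above can be sketched in a few lines. This is a simplified illustration of the idea, not Elasticsearch’s own tokenizer, and the ngrams helper is hypothetical.

```python
def ngrams(token, size):
    """Return every contiguous substring of `token` with length `size`."""
    return [token[i:i + size] for i in range(len(token) - size + 1)]


print(ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
print(ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```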

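It’s possible to generate multiple gram sizes for a single token, starting with values as low as 1. Some developers use this to maximize substring matching. But the number of grams generated for a single token grows quadratically with the token’s length.

For a token of length n, a minimum gram size of 1 and a maximum gram size of n, this relationship is expressed mathematically as:

$$\text{grams}(n) = \sum_{k=1}^{n} (n - k + 1) = \frac{n(n+1)}{2}$$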

In other words, a token with a length of 5 and a minimum gram size of 1 would result in (1/2)*5*(5+1)=15 grams. A token with a length of 10 would result in 55 grams. The grams are generated per token, which leads to an explosion in terms for a document.
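These counts can be checked by simply enumerating the gram sizes. The sketch assumes the maximum gram size equals the token length. The gram_count helper is hypothetical, and its min_gram parameter shows how much a larger minimum size trims the total.

```python
def gram_count(length, min_gram=1):
    """Count all grams of sizes min_gram..length for a token of `length` chars."""
    return sum(length - size + 1 for size in range(min_gram, length + 1))


for length in (5, 10):
    print(length, gram_count(length), gram_count(length, min_gram=3))
# 5 15 6
# 10 55 36
```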

As a sample calculation: if a typical document in your corpus has a field with ~1,000 tokens whose lengths follow a Rayleigh distribution with a mean of ~5, you could plausibly see something like a 1,100-1,200% inflation in disk footprint using Ngrams of minimum size 1. In other words, if the non-grammed document would need 100KB on disk, the Ngrammed version would need over 1MB. Virtually none of this overhead would improve relevancy, and it would probably even hurt it.

Nested documents are another example of a feature that can increase your data footprint without necessarily improving relevancy.

The point is that there are plenty of features available that lead to higher disk usage than one might think at first glance. Check on your mappings carefully: look for large synonym expansions, make sure you’re using Ngrams with a minimum gram size of 3 or more (also look into EdgeNGrams if you’re attempting autocomplete), and see if you can get away with fewer nested objects. Reindex your data with the updated mappings, and you should see a definite improvement.
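As a starting point, index settings along these lines keep the minimum gram size at 3 and use edge grams for autocomplete. This is a sketch only: the filter and analyzer names are invented for illustration, and older Elasticsearch versions spell the filter types nGram and edgeNGram, so check the documentation for the version your cluster runs.

```python
import json

# Hypothetical index settings; all filter and analyzer names are illustrative.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Fixed-size trigrams for substring matching
                "trigrams": {"type": "ngram", "min_gram": 3, "max_gram": 3},
                # Prefix-only grams, a common choice for autocomplete
                "prefixes": {"type": "edge_ngram", "min_gram": 3, "max_gram": 10},
            },
            "analyzer": {
                "substring": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "trigrams"],
                },
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefixes"],
                },
            },
        }
    }
}

# Print the body that would be sent when creating the new index.
print(json.dumps(index_settings, indent=2))
```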

Upgrade the Subscription

If you find that you’re unable to remove data, reindex, or update your mappings – or that these changes don’t yield a stable resolution – then you will simply need to upgrade to the next subscription level.

Upgrading a Direct Bonsai cluster

Upgrading a Heroku Bonsai cluster

Upgrading a Manifold Bonsai cluster