Improving Throughput and Performance
Bonsai implements connection management as a metered plan resource. Our connection management proxy is designed to help model and encourage best practices for using Elasticsearch efficiently at different cluster sizes.
This document outlines several additional strategies that users can take to maximize the performance and throughput for their cluster.
Implement HTTP Keep-alive
Normally when an application makes a request to the cluster, it needs to perform a lot of steps. Very generally, it looks like this:
- Set up the connection locally
- Reach the target server
- Establish a secure communication protocol via SSL/TLS (this step is actually a dozen smaller steps, see The SSL/TLS Handshake for an infographic)
- Send the authentication credentials and the request
- Wait for a response
- Receive the response
- Tear down the connection
- Return the response to the application
As you can see, there is quite a bit of overhead to accomplish a simple search request. There is an optimization to be made here, called HTTP keep-alive. Essentially, the application opens a persistent connection to the cluster and may send and receive multiple search requests over it without the need to perform redundant tasks like negotiating a cryptographic protocol.
Persistent connections are “free” on Bonsai; the concurrency allowances only apply to active, in-flight connections. Otherwise idle HTTP keep-alive connections are not tracked toward this limit. Users should maintain a reasonably sized pool of keep-alive connections to reduce the setup cost of each connection and reduce the average latency.
This strategy saves a significant amount of overhead and reduces overall request latencies substantially. Some benchmarks of applications in production have seen a 20-25% improvement in throughput (requests per second) by implementing HTTP keep-alive.
In the ideal implementation, the application would have access to a local pool of persistent connections, from which it would draw to perform search requests. The local pool would also queue requests when all persistent connections are in use. This implementation would allow the developer to fine-tune the local pool size to minimize latency without exceeding the concurrency allowance for the cluster’s plan.
Consult the documentation for your Elasticsearch client for more details on implementing this feature.
Utilize a Durable Message Queue
For best performance and reliability, updates to data should be pushed into a durable message queue for asynchronous grouping and processing in batches. Some examples of message queues are: Apache Kafka, AWS Kinesis, IronMQ, Disque, Redis, Rescue, and RabbitMQ.
These services are resistant to network partitions and random node outages. There may be times, particularly on multi tenant class clusters, where a node is randomly restarted. Consider a case where an application pushes updates directly in to Elasticsearch. Such an interruption, even for a few seconds, could lead to error logs and emails about HTTP 503 responses.
But with a message queue acting as an intermediary, temporary connection issues are resolved automatically, and the application does not need to bubble up errors to the user or administrators. A message queue can also avoid the overhead of single-document updates by passing updates in batches. This strategy is one way to avoid an HTTP 429 response.
Prefer Bulk Updates Over Single-Document Updates
It’s worth noting that Bonsai distinguishes Bulk API updates from other writes. This is because the Bulk API is the more efficient means of data ingress into Elasticsearch, and is the preferred means of data ingest to the cluster. Single-document updates are helpful for maintaining synchronization over time, but should be avoided as a means for mass ingest.
Batch sizes should be determined experimentally based on your document size. A good heuristic is to aim for a batch size that takes approximately 1 second to complete. If requests are slower than one second, reduce the batch size. If requests are consistently faster than one-second, add more documents to the batch size. Repeat until bulk updates average about 1 second.