Author: Andrew Cholakian
Word count: 3453

Summary

Andrew Cholakian's article discusses strategies for integrating Elasticsearch into an existing application by managing the flow of data from a primary data store, such as an SQL database, into Elasticsearch. The bulk API is highlighted as essential for efficient replication because it indexes many documents in a single request. The article stresses designing a replication strategy around an acceptable replication delay, since perfect synchronization between the primary store and Elasticsearch is rarely necessary. Because Elasticsearch is built on Lucene, where modifying a document means re-indexing it, frequent small updates are expensive. To keep write churn down, Cholakian suggests queues with uniqueness constraints, so repeated updates to the same record collapse into a single write, and describes a queue-worker system that performs bulk imports. For datasets that only grow, such as logs, a simpler range-based batching approach is recommended. Finally, the article advises against flagging source records as needing update, which can cause performance problems, and recommends designing the replication pipeline to be idempotent so errors can be retried safely.
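The deduplicating-queue and bulk-import ideas summarized above can be sketched in a few lines of Python. This is a hypothetical illustration, not the article's own implementation: the names `UniqueUpdateQueue` and `build_bulk_body` are invented here, and the sketch only builds the newline-delimited `_bulk` request body rather than sending it to a live cluster.

```python
import json

class UniqueUpdateQueue:
    """Queue with a uniqueness constraint: enqueueing an ID that is
    already pending is a no-op, so repeated updates to the same source
    record collapse into a single Elasticsearch write."""

    def __init__(self):
        self._pending = set()

    def enqueue(self, record_id):
        self._pending.add(record_id)

    def drain(self):
        """Hand the current batch to a worker and start a fresh one."""
        batch, self._pending = self._pending, set()
        return sorted(batch)


def build_bulk_body(index, docs_by_id):
    """Serialize documents into the newline-delimited _bulk format.

    Using the source record's primary key as the Elasticsearch _id keeps
    replays idempotent: re-indexing the same record overwrites the same
    document instead of creating a duplicate, so a failed batch can
    simply be retried."""
    lines = []
    for doc_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


queue = UniqueUpdateQueue()
for record_id in [1, 2, 1, 3, 2]:  # duplicate updates arrive...
    queue.enqueue(record_id)

batch = queue.drain()              # ...but each ID is written once
docs = {i: {"title": f"record {i}"} for i in batch}
body = build_bulk_body("articles", docs)
```

In a real worker the `body` string would be POSTed to the cluster's `_bulk` endpoint; the point of the sketch is that the set-backed queue absorbs write churn before it ever reaches Elasticsearch.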