Company
Date Published
Author
Dimitrios Liappis • Lynn Asiimwe (Google Cloud) • Christopher Newcomer (Canonical)
Word count
1540
Language
-
Hacker News points
None

Summary

In February 2019, Elastic discovered a Linux kernel bug affecting SSD disks during a stability test for Elasticsearch, prompting a collaborative investigation with Google Cloud and Canonical. The investigation revealed that the issue, characterized by index corruption, was linked to the use of SSD drives with the SCSI interface and the multi-queue block layer (blk_mq) feature enabled on certain Linux kernels. Despite initial suspicions, the problem was not related to Elasticsearch's cross-cluster replication or soft deletes features. Google Cloud and Canonical contributed to identifying workarounds, such as increasing memory or disabling the multi-queue SCSI feature, and pinpointing the kernel version where the corruption fixes were first implemented. Canonical's extensive testing confirmed that the corruption did not occur on Ubuntu mainline kernels version 5.0 and above. The collaborative efforts led to the release of updated Ubuntu kernels for various platforms, addressing the issue and improving data reliability when using SSDs with the blk_mq feature enabled.