Minimizing LLM Distraction with Cross-Encoder Re-Ranking
Blog post from Vespa
Vespa's latest update adds support for declarative global re-ranking, making it possible to deploy multi-phase ranking pipelines at scale without custom code or complex infrastructure management. This is especially valuable when integrating Large Language Models (LLMs) with text retrieval systems: if retrieval and ranking surface irrelevant context, the LLM is easily distracted and more likely to generate inaccurate responses.

To improve ranking accuracy, Vespa leverages multi-vector and cross-encoder models, which perform well in zero-shot settings without in-domain fine-tuning. With phased ranking, each stage of the retrieval process filters out less relevant documents, culminating in a global re-ranking phase that runs inference with ONNX models. Inference benefits from GPU acceleration, which reduces cost and improves performance through Vespa Cloud's autoscaling capabilities.

The new feature is available from Vespa version 8.164 and improves the robustness and efficiency of search applications by employing state-of-the-art cross-encoders, as demonstrated on the BEIR benchmark.
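A phased pipeline of this kind could be sketched in a Vespa schema roughly as follows. This is a minimal illustration, not taken from the post: the schema name `passage`, the field `text`, the model name `cross_encoder`, and its file path are assumptions, and a real cross-encoder would also need input/output mappings wiring tokenized query and document text into the model.

```
schema passage {
    document passage {
        field text type string {
            indexing: index | summary
        }
    }

    # Hypothetical cross-encoder exported to ONNX; real deployments
    # must also declare the model's inputs and outputs here.
    onnx-model cross_encoder {
        file: models/cross_encoder.onnx
    }

    rank-profile rerank inherits default {
        # Cheap first phase: lexical scoring over all matched documents
        first-phase {
            expression: bm25(text)
        }
        # Declarative global re-ranking: after merging results from all
        # content nodes, re-score the top hits with the ONNX model
        global-phase {
            rerank-count: 100
            expression: onnx(cross_encoder)
        }
    }
}
```

Here `rerank-count` bounds how many top-ranked hits reach the expensive cross-encoder, which is what keeps the global phase affordable while the cheaper first phase filters the bulk of the candidates.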