
Minimizing LLM Distraction with Cross-Encoder Re-Ranking

Blog post from Vespa

Post Details
Company: Vespa
Author: Bjørn C Seime
Word Count: 991
Language: English
Summary

Vespa's latest update introduces declarative global re-ranking, enabling multi-phase ranking pipelines to be deployed at scale without custom code or complex infrastructure management. This matters when integrating Large Language Models (LLMs) with text retrieval: if retrieval surfaces irrelevant context, the LLM is more easily distracted into generating inaccurate responses, so accurate ranking is crucial.

To improve ranking accuracy, Vespa leverages multi-vector and cross-encoder models, which perform well in zero-shot settings without in-domain fine-tuning. The update enables phased ranking, where each stage filters out less relevant documents, culminating in a global re-ranking phase that runs inference over ONNX models. This inference benefits from GPU acceleration, with Vespa Cloud's autoscaling reducing cost while maintaining performance.

The feature is available from Vespa version 8.164 and improves the robustness and efficiency of search applications by employing state-of-the-art cross-encoders, as demonstrated on the BEIR benchmark.
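The phased pipeline described above can be sketched as a Vespa rank profile. This is a minimal illustrative configuration, not the post's own example: the schema name, field name, ONNX model file, and rerank count are all assumptions.

```
schema passage {
    document passage {
        # Hypothetical text field used for cheap first-phase scoring
        field text type string {
            indexing: index | summary
        }
    }

    # Illustrative cross-encoder exported to ONNX; the file path is an
    # assumption for this sketch
    onnx-model cross_encoder {
        file: models/cross-encoder.onnx
    }

    rank-profile rerank-global inherits default {
        # First phase: inexpensive lexical scoring over matched documents
        first-phase {
            expression: bm25(text)
        }
        # Global phase (available from Vespa 8.164): declarative
        # re-ranking of the top hits after merging results across
        # content nodes, using the ONNX cross-encoder
        global-phase {
            rerank-count: 100
            expression: onnx(cross_encoder)
        }
    }
}
```

The `global-phase` block is what the post refers to as declarative global re-ranking: only the merged top hits (here, 100) pass through the expensive cross-encoder, while earlier phases filter the rest.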