IR evaluation metrics with uncertainty estimates
Blog post from Vespa
The tutorial demonstrates building a text search application with Vespa and comparing two ranking functions, nativeRank and BM25, using metrics such as recall, reciprocal rank, and normalized discounted cumulative gain (NDCG). It walks through setting up a Vespa application package for passage ranking experiments, deploying it in a Docker container, and feeding it a sample of the passage ranking dataset. The application is then queried both through a QueryModel and directly with the Vespa Query Language (YQL). Each model's performance is first summarized with a point estimate per metric; bootstrap sampling over the evaluation queries then adds an uncertainty estimate around each point estimate. Plotting the resulting distributions makes the differences between the two ranking functions easy to see, and the tutorial closes by arguing that measuring uncertainty in evaluation metrics is essential for interpreting the effect of a change to a ranking function.
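
As a concrete illustration of the setup stage, here is a minimal sketch using the pyvespa library, assuming a recent release where `VespaDocker` is imported from `vespa.deployment`; the application name, field names, and rank profile names are illustrative rather than the tutorial's exact code.

```python
from vespa.package import ApplicationPackage, Field, FieldSet, RankProfile
from vespa.deployment import VespaDocker

# Application package with one schema holding the passage text.
app_package = ApplicationPackage(name="passageranking")
app_package.schema.add_fields(
    Field(name="doc_id", type="string", indexing=["attribute", "summary"]),
    Field(
        name="text",
        type="string",
        indexing=["index", "summary"],
        index="enable-bm25",
    ),
)
app_package.schema.add_field_set(FieldSet(name="default", fields=["text"]))

# The two rank profiles under comparison.
app_package.schema.add_rank_profile(
    RankProfile(name="native_rank", first_phase="nativeRank(text)")
)
app_package.schema.add_rank_profile(
    RankProfile(name="bm25", first_phase="bm25(text)")
)

# Deploy locally in a Docker container and feed one illustrative passage.
vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
app.feed_data_point(
    schema="passageranking",
    data_id="0",
    fields={"doc_id": "0", "text": "a sample passage"},
)
```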
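
The query-and-evaluate step can be sketched as follows, assuming the `app` handle from the deployment sketch above; the `doc_id` field and the relevance judgments are placeholders for whatever the dataset provides.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/position of the first relevant hit, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def evaluate_query(app, query_text, relevant_ids, ranking="bm25", hits=10):
    """Run one query under a given rank profile and score the result list."""
    response = app.query(
        body={
            "yql": "select * from sources * where userQuery()",
            "query": query_text,
            "type": "any",
            "ranking": ranking,  # "native_rank" or "bm25"
            "hits": hits,
        }
    )
    ranked_ids = [hit["fields"]["doc_id"] for hit in response.hits]
    return reciprocal_rank(ranked_ids, relevant_ids)
```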
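
The bootstrap itself amounts to resampling the evaluation queries with replacement: compute a per-query metric once, then repeatedly resample those per-query scores and recompute the mean. A generic numpy sketch (the function names are mine, not the tutorial's):

```python
import numpy as np

def bootstrap_means(per_query_scores, n_boot=10_000, seed=42):
    """Resample queries with replacement; return the mean metric per resample."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_query_scores, dtype=float)
    n = len(scores)
    indices = rng.integers(0, n, size=(n_boot, n))  # n_boot resamples of size n
    return scores[indices].mean(axis=1)

def point_and_interval(per_query_scores, alpha=0.05):
    """Point estimate plus a (1 - alpha) percentile interval for the mean."""
    boot = bootstrap_means(per_query_scores)
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(per_query_scores)), (float(low), float(high))
```

Reporting the interval alongside the point estimate shows whether an observed gap between nativeRank and BM25 is larger than the noise introduced by the particular sample of evaluation queries.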
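
Finally, the visualization step can be as simple as overlaying the two bootstrap distributions; a matplotlib sketch, where `native_rank_scores` and `bm25_scores` are hypothetical arrays of per-query reciprocal ranks produced by the evaluation sketch above:

```python
import matplotlib.pyplot as plt

# Overlay the bootstrap distributions of mean reciprocal rank per rank profile.
plt.hist(bootstrap_means(native_rank_scores), bins=50, alpha=0.6,
         label="nativeRank")
plt.hist(bootstrap_means(bm25_scores), bins=50, alpha=0.6, label="bm25")
plt.xlabel("Mean reciprocal rank over bootstrap resamples")
plt.ylabel("Number of resamples")
plt.legend()
plt.show()
```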