Releasing Canica: A Text Dataset Viewer

Post Details

Company

Lakera

Date Published

Nov. 14, 2025

Author

Lakera Team

Word Count

918

Language

-

Hacker News Points

-

Source URL

www.lakera.ai/blog/releasing-canica

Summary

Lakera has developed and released canica, a text dataset viewer for assessing the quality of datasets used to train machine learning models. With canica, users interactively explore a dataset as a 2D plot produced by an algorithm such as t-SNE or UMAP, which makes clusters and semantic relationships in the data visible. Because reducing embeddings to two dimensions can distort distances between points, canica links the 2D visualization back to the original embedding space, so users can explore a point's local neighborhood and focus on specific subsets of the data. Canica is released under the MIT license, is available on GitHub, and can be installed via pip. A tutorial notebook on GitHub supports further exploration, and contributions are encouraged.
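To make the idea of linking a 2D plot back to the embedding space concrete, here is a minimal sketch that uses only NumPy. It is not canica's API or implementation: the function names `project_2d` and `nearest_in_embedding_space` are hypothetical, the data is synthetic, and PCA stands in for t-SNE or UMAP to keep the example dependency-free. The point it illustrates is that neighbors are looked up in the original high-dimensional space, not in the 2D projection.

```python
import numpy as np


def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project embeddings to 2D with PCA (a simple stand-in for t-SNE or UMAP)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def nearest_in_embedding_space(
    embeddings: np.ndarray, index: int, k: int = 3
) -> np.ndarray:
    """Return the indices of the k nearest neighbors of `index` by cosine
    similarity, computed in the original space rather than in the 2D plot."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit[index]
    similarity[index] = -np.inf  # exclude the query point itself
    return np.argsort(-similarity)[:k]


rng = np.random.default_rng(0)
# Hypothetical data: two well-separated clusters of 16-dimensional "embeddings".
cluster_a = rng.normal(loc=1.0, scale=0.1, size=(20, 16))
cluster_b = rng.normal(loc=-1.0, scale=0.1, size=(20, 16))
embeddings = np.vstack([cluster_a, cluster_b])

points_2d = project_2d(embeddings)  # coordinates one would plot
neighbors = nearest_in_embedding_space(embeddings, index=0, k=3)
print(points_2d.shape)
print(neighbors)
```

In an interactive viewer, selecting a point in the plot would trigger a lookup like `nearest_in_embedding_space`, so the neighbors shown reflect similarity in the embedding space even where the 2D layout places points misleadingly close together or far apart.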