Fetching ML models under Bazel

Blog post from Aspect Build

Post Details
Company: Aspect Build
Author: Alex Eagle
Word Count: 684
Language: English
Hacker News Points: -
Summary

The post describes how to fetch NLTK data for machine learning tasks under Bazel. The installation method NLTK suggests is non-hermetic and non-reproducible: it requires network access and manual setup, and the result depends on machine state. Instead, the author uses the Bazel Downloader and the http_archive helper to fetch the required NLTK data, such as the Punkt tokenizer, directly from GitHub, so that the same data is available consistently across environments. The setup lays out a cache folder structure that mimics what NLTK expects, wraps the fetch in a macro to keep it documented and organized, and configures the Python target to use the downloaded data by setting the NLTK_DATA environment variable. The data is then provided hermetically to the test action at runtime, with no network access needed, which improves portability and reliability.
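
The role of the folder structure and the NLTK_DATA variable can be illustrated with a small sketch. It is not NLTK's code or the author's Bazel setup: `find_nltk_resource` and `make_fake_data_dir` are hypothetical helpers, and the temporary directory stands in for the data Bazel would have fetched. The sketch assumes only that NLTK_DATA holds one or more directories separated by the platform path separator and that a resource such as `tokenizers/punkt` is looked up as a subpath of each.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def find_nltk_resource(resource: str, env: Mapping[str, str]) -> Optional[Path]:
    """Return the first NLTK_DATA entry that contains `resource`, else None.

    Hypothetical helper: it approximates an NLTK_DATA-based lookup
    without importing nltk or touching the network.
    """
    for root in env.get("NLTK_DATA", "").split(os.pathsep):
        if not root:
            continue  # skip empty entries (unset variable or stray separator)
        candidate = Path(root) / resource
        if candidate.exists():
            return candidate
    return None


def make_fake_data_dir(base: Path) -> Path:
    """Create <base>/tokenizers/punkt as a stand-in for the fetched data."""
    punkt = base / "tokenizers" / "punkt"
    punkt.mkdir(parents=True)
    # Placeholder file only; this is not real tokenizer data.
    (punkt / "placeholder.txt").write_text("stand-in for Punkt model files\n")
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = make_fake_data_dir(Path(tmp) / "nltk_data")
        # In the Bazel setup this variable would be set on the Python target.
        env = {"NLTK_DATA": str(data_dir)}
        print(find_nltk_resource("tokenizers/punkt", env))
```

A check like this can run at the top of a test to fail early with a clear message when the data directory is missing, rather than letting the library fall back to a download attempt that a sandboxed action cannot complete.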