Large Language Models (LLMs) have attracted enormous attention, but training them poses practical challenges, particularly around data loading. For anyone training an LLM at a smaller scale, downloading and managing a large dataset such as the 1TB codeparrot/github-code dataset can be daunting. Lance, a columnar data format optimized for machine learning workflows, offers a solution: it provides efficient access to data on disk without loading the entire dataset into memory. By combining Lance with PyArrow and a tokenizer, you can preprocess a manageable subset of a larger dataset and save it in Lance format, enabling training while keeping memory usage low. This approach, demonstrated through a Python script, makes it practical to tokenize and prepare data for LLM training with limited resources.