Scaling Embedding Generation Pipelines From Pandas to Ray Data

Post Details

Company

Anyscale

Date Published

Sept. 4, 2024

Author

Marwan Sarieddine

Word Count

2,154

Company Posts That Month

4

Language

English

Hacker News Points

-

Source URL

www.anyscale.com/blog/scaling-embedding-generation-pipelines-from-pandas-to-ray-data

Summary

This blog post shows how to scale a text-embedding generation pipeline using Ray Data and Sentence Transformers. The author migrates a pandas-based pipeline to Ray Data with minimal code changes and reports significant performance gains. The improved Ray Data pipeline delivers a 10x performance improvement over the naive implementation and, unlike pandas running on a single machine, can distribute the workload across a cluster of machines with GPUs and CPUs.
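
As a rough illustration of the kind of migration the summary describes (this is not the post's actual code), the sketch below keeps one batch function and runs it either chunk by chunk in pandas or through Ray Data's `map_batches`. The names `fake_encode`, `embed_batch`, `run_with_pandas`, and `run_with_ray` are hypothetical, and `fake_encode` is a toy stand-in for a real Sentence Transformers model so the example runs without a GPU or a model download.

```python
import numpy as np
import pandas as pd


def fake_encode(texts, dim=8):
    """Hypothetical stand-in for a real embedding model.

    Sums UTF-8 byte values into `dim` buckets so the output is
    deterministic. A real pipeline would call a model here instead.
    """
    out = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        for j, byte in enumerate(text.encode("utf-8")):
            out[i, j % dim] += byte
    return out


def embed_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Add an 'embedding' column to one batch of rows."""
    batch = batch.copy()
    batch["embedding"] = list(fake_encode(batch["text"].tolist()))
    return batch


def run_with_pandas(df: pd.DataFrame, batch_size: int = 2) -> pd.DataFrame:
    """Single-machine baseline: process the frame chunk by chunk."""
    chunks = [
        embed_batch(df.iloc[start:start + batch_size])
        for start in range(0, len(df), batch_size)
    ]
    return pd.concat(chunks, ignore_index=True)


def run_with_ray(df: pd.DataFrame, batch_size: int = 2) -> pd.DataFrame:
    """Same batch function, scheduled by Ray Data (requires `ray` installed)."""
    import ray.data  # imported lazily so the pandas path works without Ray

    ds = ray.data.from_pandas(df)
    ds = ds.map_batches(embed_batch, batch_size=batch_size, batch_format="pandas")
    return ds.to_pandas()


if __name__ == "__main__":
    frame = pd.DataFrame({"text": ["a", "bb", "ccc"]})
    print(run_with_pandas(frame))
```

The point of the design is that the per-batch logic is written once; only the driver that feeds it batches changes. Cluster-specific settings such as GPU allocation are omitted here and would depend on the actual deployment.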

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Vector Search	36	3,675	269	79	+77%
Data Pipeline	4	1,400	332	68	+111%
RAG	2	1,936	254	78	-19%
LLM	1	3,889	441	129	+7%
Real-time	1	3,932	887	192	+47%
Serverless	1	647	170	80	+31%