Content Deep Dive
The Sphere Dataset in Weaviate
Blog post from Weaviate
Post Details
Company:
Date Published:
Author: Zain Hasan
Word Count: 1,129
Language: English
Hacker News Points: -
Source URL:
Summary
Meta has released Sphere, an open-source dataset of 134 million documents split into 906 million 100-word snippets. It is one of the largest knowledge bases available for knowledge-intensive natural language tasks such as question answering and fact-checking, and it is intended to act as a "universal, uncurated and unstructured source of knowledge." In its original open-source format, however, Sphere's sheer size makes it hard for the average developer to access and use. To make the resource more accessible, Weaviate now offers Sphere as JSON or Parquet files that can be easily imported with Python and Spark.
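As a rough illustration of why a line-oriented JSON export is easier to handle than one monolithic file, the sketch below streams a JSON Lines file in fixed-size batches so that memory use stays bounded regardless of file size. It assumes one JSON object per line; the function name `iter_batches` and the file name in the usage note are hypothetical, and the summary does not specify the actual file layout or field names.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_batches(path: str, batch_size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of parsed records from a JSON Lines file, batch_size at a time.

    Assumption: the file holds one JSON object per line. This is an
    illustrative layout, not one confirmed by the summary.
    """
    batch: List[Dict[str, Any]] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

A caller might loop over `iter_batches("sphere_sample.jsonl", batch_size=1000)` (a hypothetical file name) and hand each batch to whatever import routine is in use, so only one batch is held in memory at a time.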