Data Provenance Explorer Launches to Tackle Data Transparency Crisis
Blog post from Cohere
Systematic differences in commercially available datasets reveal significant challenges in data accessibility and safety, shaped by licensing and geographic representation. A comprehensive audit highlights a growing disparity between commercially open and closed data: an increasing number of publicly released datasets are restricted from commercial use, which affects small companies and fosters a quality gap in the data available for commercial applications. The audit also uncovers a Western-centric bias in datasets, with limited representation from Asian, African, and South American regions, potentially leading to biases in model performance for non-Western users. Legal ambiguities further complicate data usage: existing open-source licenses, designed primarily for software, are applied to datasets without modification, which makes legal compliance and responsible data stewardship harder. The Data Provenance Explorer and a global initiative, launched together, aim to enhance data transparency and responsible use by addressing the ethical, legal, and transparency issues the audit identified.