Data

At Boson AI, we recognize that high-quality data is the foundation of effective AI models. Our data curation process draws on diverse, carefully vetted sources so that our models can understand and generate content across multiple domains, languages, and tasks.

Key Features

  • Deep crawling and scraping
  • Multi-stage processing pipeline
  • Deduplication and filtering
  • Human annotation
  • Automatic quality assessment

* We gratefully acknowledge the use of the visualization "Treemap of Pile components by effective size" from the seminal work by Gao, Leo, et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," arXiv preprint arXiv:2101.00027 (2020).