Data
At Boson AI, we recognize that high-quality data is the foundation of effective AI models. Our data curation process ensures that our models are trained on diverse, high-quality sources that enable them to understand and generate content across multiple domains, languages, and tasks.
Key Features
- Deep crawling and scraping
- Multi-stage processing pipeline
- Deduplication and filtering
- Human annotation
- Automatic quality assessment
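To make the deduplication and filtering step concrete, here is a minimal sketch in Python. It is not Boson AI's pipeline: the function names, the whitespace and case normalization, the exact-match hashing, and the `min_words` threshold are all assumptions chosen for illustration. Production pipelines typically add near-duplicate detection and richer quality signals.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_quality_filter(text: str, min_words: int = 5) -> bool:
    """Hypothetical heuristic: keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def dedup_and_filter(docs, min_words=5):
    """Keep the first copy of each normalized document that passes the filter."""
    seen = set()  # SHA-256 digests of normalized documents already kept
    kept = []
    for doc in docs:
        if not passes_quality_filter(doc, min_words):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

Calling `dedup_and_filter` on a list of strings returns the surviving documents in their original order, with the first occurrence of each duplicate retained.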
* We gratefully acknowledge the use of the visualization "Treemap of Pile components by effective size" from the seminal work by Gao, Leo, et al. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv preprint arXiv:2101.00027 (2020).