RedPajama-Data

Code for preparing large datasets for training large language models

December 29th, 2024

About RedPajama-Data

RedPajama-Data is a repository of code for preparing large datasets for training large language models. It is designed to support the development of open datasets, including massive web datasets with billions or even trillions of tokens. The repository includes heuristics and ML classifiers built specifically for English data. It is aimed at researchers and developers working on natural language processing and language model training.

Key Features

  • Supports the preparation of large datasets for training large language models.
  • Includes heuristics and ML classifiers for English data.
  • Enables the creation of web datasets with billions or trillions of tokens.
  • Provides a framework for developing open datasets for natural language processing.
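
As a rough illustration of what heuristic filtering of English web text can look like, the sketch below computes a few simple per-document signals and applies threshold rules. It is not taken from the RedPajama-Data repository: the function names (`quality_signals`, `keep_document`), the chosen signals, and the thresholds are all hypothetical, picked only to show the general idea.

```python
import re


def quality_signals(text: str) -> dict:
    """Compute a few toy quality signals for one document (hypothetical set)."""
    words = re.findall(r"[A-Za-z']+", text)
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_words = len(words)
    return {
        # Number of alphabetic word tokens.
        "word_count": n_words,
        # Average token length; 0.0 for documents with no words.
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Share of characters that are letters.
        "alpha_ratio": sum(c.isalpha() for c in text) / len(text) if text else 0.0,
        # Share of non-empty lines that repeat an earlier line.
        "duplicate_line_fraction": 1 - len(set(lines)) / len(lines) if lines else 0.0,
    }


def keep_document(
    text: str,
    min_words: int = 5,                        # assumed threshold
    min_alpha_ratio: float = 0.6,              # assumed threshold
    max_duplicate_line_fraction: float = 0.3,  # assumed threshold
) -> bool:
    """Return True if the document passes every threshold rule."""
    s = quality_signals(text)
    return (
        s["word_count"] >= min_words
        and s["alpha_ratio"] >= min_alpha_ratio
        and s["duplicate_line_fraction"] <= max_duplicate_line_fraction
    )


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog near the river bank.",
        "$$$ 12345 ### 67890 !!!",
        "buy now\nbuy now\nbuy now\nbuy now",
    ]
    for doc in docs:
        print(keep_document(doc), repr(doc[:40]))
```

In a real pipeline, signals like these would typically be computed once and stored alongside each document, so that different thresholds can be tried later without reprocessing the raw text.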

Use Cases

  • Training large language models.
  • Text generation.
  • Natural language processing research.
  • Developing open datasets.