The RedPajama-Data repository contains code for preparing large datasets for training large language models. It is an open-source recipe for reproducing the LLaMA training dataset.