Instruction-Finetuning Dataset Collection (Alpaca-CoT)

This repository continuously collects instruction-tuning datasets. We standardize the different datasets into a single format that can be loaded directly by the Alpaca model's code.

We have also conducted an empirical study of various instruction-tuning datasets based on the Alpaca model, as shown in https://github.com/PhoebusSi/alpaca-CoT.

If you find this dataset collection helpful, please like this dataset and star our GitHub project!

You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that we have not yet collected. We will format them uniformly, train the Alpaca model on them, and open-source the model checkpoints.

Contribute

You are welcome to join us and become a contributor to this project! If you want to share a dataset, convert the data to the following format:

example.json
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
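Before opening a pull request, it can help to check that a file really matches the format above. The following is a minimal sketch, not part of the project's code: the function name `validate_records` is hypothetical, and the rule that `instruction` and `output` must be non-empty is an assumption inferred from the note that only `input` may be empty.

```python
REQUIRED_KEYS = ("instruction", "input", "output")

def validate_records(records):
    """Return a list of problems found; an empty list means the data conforms."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append(f"record {i}: missing key '{key}'")
            elif not isinstance(rec[key], str):
                problems.append(f"record {i}: '{key}' must be a string")
        # Assumption: only "input" may be empty, so flag blank instruction/output
        for key in ("instruction", "output"):
            if isinstance(rec.get(key), str) and not rec[key].strip():
                problems.append(f"record {i}: '{key}' is empty")
    return problems

sample = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]
print(validate_records(sample))  # an empty list means no problems were found
```

To check a real file, load it with `json.load` first and pass the result to the function.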

The folder should look like this:

Alpaca-CoT
|
|----example
|    |
|    |----example.json
|    |
|     ----example_context.json
...
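As a rough illustration of the layout above, the sketch below writes a list of records to `<root>/<name>/<name>.json`. The helper name `save_dataset` and its parameters are hypothetical; the project does not prescribe any particular script for this step.

```python
import json
import tempfile
from pathlib import Path

def save_dataset(records, name, root="Alpaca-CoT"):
    """Write records to <root>/<name>/<name>.json and return the file path."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}.json"
    with target.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable
        json.dump(records, f, ensure_ascii=False, indent=2)
    return target

records = [{"instruction": "Say hello.", "input": "", "output": "Hello!"}]

# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = save_dataset(records, "example", root=Path(tmp) / "Alpaca-CoT")
    relative = path.relative_to(tmp).as_posix()
    reloaded = json.loads(path.read_text(encoding="utf-8"))

print(relative)
```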

Create a new pull request in Community and publish your branch when you are ready. We will merge it as soon as we can.

Data Usage and Resources

Data Format

All data in this folder is formatted with the same template, where each sample is as follows:

[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
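To show how a sample in this format might be consumed, here is a minimal sketch that loads a file and turns one sample into a training prompt. The prompt wording is modeled on the template published with Stanford Alpaca; whether this project's training code uses exactly this wording is an assumption, and the names `load_samples` and `build_prompt` are hypothetical.

```python
import json

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def load_samples(path):
    """Load a dataset file: a JSON list of instruction/input/output objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def build_prompt(sample):
    """Pick the template depending on whether the sample has a non-empty input."""
    if sample.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])

sample = {"instruction": "Add the numbers.", "input": "1 2", "output": "3"}
print(build_prompt(sample))
```

During fine-tuning, the `output` field would typically be appended after `### Response:` as the target text.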

alpaca

alpaca_data.json

This dataset is published by Stanford Alpaca. It contains 52K English instruction-following samples obtained with the Self-Instruct technique.

alpaca_data_cleaned.json

This dataset is obtained here. It is a revised version of alpaca_data.json with various tokenization artifacts stripped out.

alpacaGPT4

alpaca_gpt4_data.json

This dataset is published by Instruction-Tuning-with-GPT-4. It contains 52K English instruction-following samples generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.

alpaca_gpt4_data_zh.json

This dataset is generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT.

Chain-of-Thought

CoT_data.json

This dataset is obtained by formatting the combination of 9 CoT datasets published by FLAN. It contains 9 CoT tasks involving 74771 samples.

CoT_CN_data.json

This dataset is obtained by translating CoT_data.json into Chinese using Google Translate (en2cn).

formatted_cot_data folder

This folder contains the formatted English data for each CoT dataset.

formatted_cot_data folder

This folder contains the formatted Chinese data for each CoT dataset.

CodeAlpaca

code_alpaca.json

This dataset is published by codealpaca. It contains a code generation task with 20022 samples.

finance

finance_en.json

This dataset is collected from here. It contains 68912 finance-related instructions in English.

firefly

firefly.json

This dataset is collected from here. It contains 1649398 Chinese instructions across 23 NLP tasks.

GPT4all

gpt4all.json

This dataset is collected from here. It contains 806199 English instructions covering code, story, and dialog tasks.

gpt4all_without_p3.json

This is gpt4all with the Bigscience/P3 data removed. It contains 437605 samples.

GPTeacher

GPTeacher.json

This dataset is collected from here. It contains 29013 English instructions generated by GPT-4, covering General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.

Guanaco

GuanacoDataset.json

This dataset is collected from here. It contains 534610 English instructions generated by text-davinci-003. It builds on the 175 tasks from the Alpaca model by providing rewrites of the seed tasks in different languages and adding new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.

Guanaco_additional_Dataset.json

An additional, larger dataset for different languages.

HC3

HC3_ChatGPT.json/HC3_Human.json

This dataset is collected from here. It contains 37175 English and Chinese instructions generated by ChatGPT and humans.

HC3_ChatGPT_deduplication.json/HC3_Human_deduplication.json

The HC3 dataset with duplicate instructions removed.
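The criterion used to produce the HC3 deduplication files is not stated here. As a hedged illustration only, the sketch below removes duplicates under the assumption that two records are duplicates when their `instruction` and `input` match after whitespace and case normalization; the function name `deduplicate` is hypothetical.

```python
def deduplicate(records):
    """Keep the first record for each distinct (instruction, input) pair."""
    seen = set()
    unique = []
    for rec in records:
        # Assumption: collapse whitespace and ignore case when comparing
        key = (
            " ".join(rec["instruction"].split()).lower(),
            " ".join(rec.get("input", "").split()).lower(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"instruction": "What is AI?", "input": "", "output": "Answer A"},
    {"instruction": "what is  AI?", "input": "", "output": "Answer B"},
    {"instruction": "Define entropy.", "input": "", "output": "Answer C"},
]
result = deduplicate(records)
print(len(result))
```

A stricter or looser key (for example, including `output`) would change which records survive, so the choice should follow the intended use.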

instinwild

instinwild_en.json & instinwild_cn.json

The two datasets are obtained here. They contain 52191 English and 51504 Chinese instructions, collected from Twitter, where users tend to share interesting prompts, mostly of the generation, open QA, and brainstorming types. (Colossal AI used these datasets to train the ColossalChat model.)

instruct

instruct.json

This dataset is obtained here. It contains 888969 English instructions, produced by augmentation using the advanced NLP tools provided by AllenAI.

Natural Instructions

natural-instructions-1700tasks.zip

This dataset is obtained here. It contains 5040134 instructions, collected from diverse NLP tasks.

prosocial dialog

natural-instructions-1700tasks.zip

This dataset is obtained here. It contains 165681 English instructions, produced from GPT-3 rewrites of questions together with human feedback.

xP3

natural-instructions-1700tasks.zip

This dataset is obtained here. It contains 78883588 instructions, collected from prompts and datasets across 46 languages and 16 NLP tasks.

Chinese-instruction-collection

All datasets in the Chinese instruction collection.

combination

alcapa_plus_belle_data.json

This dataset is the combination of English alpaca_data.json and Chinese belle_data_cn.json.

alcapa_plus_cot_data.json

This dataset is the combination of English alpaca_data.json and CoT CoT_data.json.

alcapa_plus_belle_cot_data.json

This dataset is the combination of English alpaca_data.json, Chinese belle_data_cn.json and CoT CoT_data.json.
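Because every dataset shares the same format, a combination can be built by concatenating the loaded lists. The sketch below is an assumption about how such a file could be produced, not the authors' actual procedure; in particular, whether the published combinations were shuffled is not stated, so shuffling is optional here. The function name `combine` is hypothetical.

```python
import random

def combine(datasets, seed=None):
    """Concatenate several record lists; shuffle only if a seed is given."""
    merged = [rec for data in datasets for rec in data]
    if seed is not None:
        # A fixed seed makes the shuffled order reproducible
        random.Random(seed).shuffle(merged)
    return merged

# Tiny in-memory stand-ins for alpaca_data.json and CoT_data.json
alpaca = [{"instruction": "Say hi.", "input": "", "output": "Hi!"}]
cot = [
    {"instruction": "Is 7 prime?", "input": "", "output": "7 has no divisors other than 1 and 7. Yes."},
    {"instruction": "What is 2 + 3?", "input": "", "output": "2 plus 3 equals 5. The answer is 5."},
]

merged = combine([alpaca, cot])
print(len(merged))
```

With real files, each list would come from `json.load`, and the merged list would be written back with `json.dump`.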

Citation

Please cite the repo if you use the data collection, code, or experimental findings in this repo.

@misc{alpaca-cot,
  author = {Qingyi Si and Zheng Lin},
  school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
  title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}

Please also cite the original Stanford Alpaca, BELLE, and FLAN papers.
