Instruction-Finetuning Dataset Collection (Alpaca-CoT)
This repository continuously collects instruction-tuning datasets. We standardize the different datasets into a single format that can be loaded directly by the Alpaca model code.
We have also conducted an empirical study of various instruction-tuning datasets based on the Alpaca model, as shown in https://github.com/PhoebusSi/alpaca-CoT.
If you find this dataset collection helpful, please like this dataset and star our GitHub project!
We warmly welcome contributions of any instruction-tuning datasets (or their sources) that we have not yet collected. We will format them uniformly, train the Alpaca model on them, and open-source the model checkpoints.
Contribute
You are welcome to join us as a contributor to this project! If you want to share a dataset, convert the data to the following format:
example.json
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
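Before opening a pull request, it can help to check that a file matches this format. The following is a minimal sketch, not part of the repository's tooling: the function name `validate_samples` and the sample records are hypothetical, and the check assumes that all three keys are required and that every value is a string.

```python
import json
import os
import tempfile

# Keys every sample is assumed to need, per the template above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_samples(samples):
    """Return a list of problems; an empty list means the data conforms."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list of samples"]
    problems = []
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {i}: not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in sample:
                problems.append(f"sample {i}: missing key '{key}'")
            elif not isinstance(sample[key], str):
                problems.append(f"sample {i}: '{key}' is not a string")
    return problems


# Hypothetical records; "input" may be an empty string.
samples = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name a prime number.", "input": "", "output": "7"},
]

# Write the file into a temporary folder and read it back to confirm the round trip.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(validate_samples(reloaded))
```

Passing `ensure_ascii=False` keeps Chinese text readable in the written file instead of escaping it.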
Folder should be like this:
Alpaca-CoT
|
|----example
| |
| |----example.json
| |
| ----example_context.json
...
Create a new pull request in Community and publish your branch when you are ready. We will merge it as soon as we can.
Data Usage and Resources
Data Format
All data in this folder follows the same template, where each sample looks like this:
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
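To illustrate how the three fields are typically consumed, here is a minimal sketch that turns one sample into a training prompt. The template wording is modeled on the prompt published with Stanford Alpaca and is an assumption here; check it against the training code you actually use. The names `build_prompt`, `PROMPT_WITH_INPUT`, and `PROMPT_NO_INPUT` are hypothetical.

```python
# Assumed templates, modeled on the Stanford Alpaca prompt format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(sample):
    """Build the prompt text for one sample; the 'output' field is the target."""
    # An empty or whitespace-only "input" selects the shorter template.
    if sample.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])


with_input = build_prompt(
    {"instruction": "Add the numbers.", "input": "1 2", "output": "3"}
)
without_input = build_prompt(
    {"instruction": "Name a prime number.", "input": "", "output": "7"}
)
print(with_input)
print(without_input)
```

Because every dataset in the collection shares the same three keys, one function of this kind can serve all of them.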
alpaca
alpaca_data.json
This dataset is published by Stanford Alpaca. It contains 52K English instruction-following samples obtained by Self-Instruction techniques.
alpaca_data_cleaned.json
This dataset is obtained here. It is a revised version of alpaca_data.json with various tokenization artifacts stripped.
alpacaGPT4
alpaca_gpt4_data.json
This dataset is published by Instruction-Tuning-with-GPT-4. It contains 52K English instruction-following samples generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.
alpaca_gpt4_data_zh.json
This dataset is generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT.
Chain-of-Thought
CoT_data.json
This dataset is obtained by formatting the combination of 9 CoT datasets published by FLAN. It contains 9 CoT tasks involving 74771 samples.
CoT_CN_data.json
This dataset is obtained by translating CoT_data.json into Chinese using Google Translate (en2cn).
formatted_cot_data folder
This folder contains the formatted English data for each CoT dataset.
formatted_cot_data folder
This folder contains the formatted Chinese data for each CoT dataset.
CodeAlpaca
code_alpaca.json
This dataset is published by codealpaca. It contains 20022 samples for the code generation task.
finance
finance_en.json
This dataset is collected from here. It contains 68912 finance-related instructions in English.
firefly
firefly.json
This dataset is collected from here. It contains 1649398 Chinese instructions across 23 NLP tasks.
GPT4all
gpt4all.json
This dataset is collected from here. It contains 806199 English instructions covering code, story, and dialog tasks.
gpt4all_without_p3.json
This is gpt4all with the Bigscience/P3 portion removed. It contains 437605 samples.
GPTeacher
GPTeacher.json
This dataset is collected from here. It contains 29013 English instructions generated by GPT-4, covering General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
Guanaco
GuanacoDataset.json
This dataset is collected from here. It contains 534610 English instructions generated by text-davinci-003 from the 175 tasks of the Alpaca model. The seed tasks were rewritten in different languages, and new tasks were added that are specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
Guanaco_additional_Dataset.json
An additional, larger dataset covering different languages.
HC3
HC3_ChatGPT.json/HC3_Human.json
This dataset is collected from here. It contains 37175 English and Chinese instructions generated by ChatGPT and by humans.
HC3_ChatGPT_deduplication.json/HC3_Human_deduplication.json
The HC3 dataset with duplicate instructions removed.
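The criterion used to identify duplicates in HC3 is not stated here. As a minimal sketch, assuming two samples count as duplicates when they share the same instruction and input text (ignoring surrounding whitespace), deduplication can be done in a single pass. The function name `deduplicate` and the sample records are hypothetical.

```python
def deduplicate(samples):
    """Keep the first sample for each (instruction, input) pair, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        # Assumed duplicate key: instruction and input text, whitespace-trimmed.
        key = (sample["instruction"].strip(), sample.get("input", "").strip())
        if key in seen:
            continue
        seen.add(key)
        unique.append(sample)
    return unique


# Hypothetical records: the second repeats the first question with a new answer.
records = [
    {"instruction": "What is a llama?", "input": "", "output": "A South American camelid."},
    {"instruction": "What is a llama? ", "input": "", "output": "An animal."},
    {"instruction": "What is an alpaca?", "input": "", "output": "A relative of the llama."},
]
deduped = deduplicate(records)
print(len(records), len(deduped))
```

Keying on the instruction alone, or on the full sample including the output, would be equally plausible choices and would give different counts.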
instinwild
instinwild_en.json & instinwild_cn.json
The two datasets are obtained here. They contain 52191 English and 51504 Chinese instructions collected from Twitter, where users tend to share their interesting prompts, mostly of the generation, open QA, and brainstorming types. (Colossal AI used these datasets to train the ColossalChat model.)
instruct
instruct.json
This dataset is obtained here. It contains 888969 English instructions, produced by augmentation using the advanced NLP tools provided by AllenAI.
Natural Instructions
natural-instructions-1700tasks.zip
This dataset is obtained here. It contains 5040134 instructions collected from diverse NLP tasks.
prosocial dialog
natural-instructions-1700tasks.zip
This dataset is obtained here. It contains 165681 English instructions, produced from GPT-3 rewrites of questions and human feedback.
xP3
natural-instructions-1700tasks.zip
This dataset is obtained here. It contains 78883588 instructions, built from prompts and datasets across 46 languages and 16 NLP tasks.
Chinese-instruction-collection
All datasets in the Chinese instruction collection.
combination
alcapa_plus_belle_data.json
This dataset is the combination of the English alpaca_data.json and the Chinese belle_data_cn.json.
alcapa_plus_cot_data.json
This dataset is the combination of the English alpaca_data.json and the CoT CoT_data.json.
alcapa_plus_belle_cot_data.json
This dataset is the combination of the English alpaca_data.json, the Chinese belle_data_cn.json, and the CoT CoT_data.json.
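Because every file shares the same sample format, a combined dataset of this kind can be built by concatenating the sample lists. The sketch below is an assumption about how such a combination could be produced, not the procedure the authors used. The function name `combine_datasets` is hypothetical, and the optional shuffle is an added convenience.

```python
import random


def combine_datasets(*datasets, shuffle_seed=None):
    """Concatenate several sample lists; optionally shuffle with a fixed seed."""
    combined = [sample for dataset in datasets for sample in dataset]
    if shuffle_seed is not None:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(shuffle_seed).shuffle(combined)
    return combined


# Hypothetical stand-ins for the contents of two JSON files.
english = [{"instruction": "Say hello.", "input": "", "output": "Hello!"}]
cot = [
    {"instruction": "Is 9 odd? Think step by step.", "input": "",
     "output": "9 = 2 * 4 + 1, so it is odd."},
    {"instruction": "Is 8 odd? Think step by step.", "input": "",
     "output": "8 = 2 * 4, so it is even."},
]

combined = combine_datasets(english, cot)
print(len(combined))
```

In practice each list would come from `json.load` on the corresponding file, and the result would be written back out with `json.dump`.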
Citation
Please cite the repo if you use the data collection, code, and experimental findings in this repo.
@misc{alpaca-cot,
author = {Qingyi Si and Zheng Lin},
school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}
Please also cite the original Stanford Alpaca, BELLE, and FLAN papers.