Today I am announcing Dolphin: an open-source, uncensored, commercially licensed dataset and series of instruct-tuned language models based on Microsoft's Orca paper.
The dataset is released here under the Apache-2.0 license, so it can be used for commercial or non-commercial purposes.
The models are currently in progress. More information will be released here as it becomes available.
As I read Orca: Progressive Learning from Complex Explanation Traces of GPT-4 by Mukherjee et al. of Microsoft, I had to consider the implications for Open Source AI.
This was pretty awesome stuff. But I realized that while Microsoft would probably release their LLaMA-13b-based model (as of this writing they still haven't), they might not release the dataset.
Therefore, I resolved to duplicate their efforts, download the data myself, and train the model myself, so that Dolphin can be released on other sizes of LLaMA as well as other foundational models such as Falcon, OpenLLaMA, RedPajama, MPT, and RWKV.
This was a nontrivial undertaking. With the help of an all-star team of open-source AI/ML engineers, we have completed the Dolphin dataset.
Our dataset consists of:
~1 million FLANv2 instructions augmented with GPT-4 completions
~3.5 million FLANv2 instructions augmented with GPT-3.5 completions
We followed the submix and system prompt distribution outlined in the Orca paper, with a few exceptions: we included all 75k of the CoT data in the FLAN-1m dataset rather than sampling it, and we found that many items were duplicated, so we removed the duplicates, resulting in 3.5m instructions in the ChatGPT dataset.
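The deduplication step can be sketched roughly as follows. This is a minimal illustration only: the record field names (`instruction`, `input`) and the exact dedup key are assumptions, not taken from the actual Dolphin pipeline.

```python
import hashlib

def dedupe(records):
    """Drop records whose instruction + input text has been seen before.

    Assumes each record is a dict with "instruction" and "input" keys;
    these names are hypothetical placeholders for this sketch.
    """
    seen = set()
    unique = []
    for rec in records:
        # Key on the normalized (lowercased, stripped) prompt text.
        text = (rec.get("instruction", "") + "\x00" + rec.get("input", "")).strip().lower()
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Hashing the normalized text keeps memory bounded even across millions of records, at the cost of missing near-duplicates that differ only in phrasing.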
Then we filtered out instances of alignment, refusal, avoidance, and bias, in order to produce an uncensored model on which your own personalized alignment LoRA can be layered.
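A refusal filter of this kind might look something like the sketch below. The marker phrases and the `completion` field name are purely illustrative assumptions; the actual criteria used to filter the Dolphin dataset are not specified here.

```python
# Hypothetical marker phrases commonly associated with refusals/avoidance;
# the real filter list used for Dolphin is not published in this post.
REFUSAL_MARKERS = [
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
    "it is not appropriate",
]

def is_refusal(completion: str) -> bool:
    """Return True if the completion contains any refusal marker."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def filter_refusals(records):
    # Keep only records whose completion shows no refusal/avoidance marker.
    return [r for r in records if not is_refusal(r["completion"])]
```

Simple substring matching like this trades precision for speed; a production filter would likely combine phrase lists with classifier-based scoring.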
We currently plan to release Dolphin on:
Xgen 7b 8k
LLaMA 13b (Non-commercial)
MPT 30b 8k
LLaMA 33b (Non-commercial)
LLaMA 65b (Non-commercial)
The Dolphin models that are released will be subject to the license of the foundational model on which they are trained. (LLaMA releases will be non-commercial.)
I would like to thank the motley crew of Open Source AI/ML engineers who have worked beside me in this endeavor, including:
Wing "Caseus" Lian and NanoBit of OpenAccess AI Collective
Tom "TheBloke" Jobbins for quantizing and amplifying
Special thanks to EdenCoder and chirper.ai for mentorship and financial sponsorship.
Special thanks to Kilkonie for his very valued mentorship.
All the other people in the Open Source AI community who have taught me and helped me along the way.