OpenOrca: open source dataset and instruct-tuned LLMs

Today we are announcing OpenOrca, an open-source dataset and series of instruct-tuned language models. The inspiration for this project came from the Orca paper by Mukherjee et al. of Microsoft, which showcased some impressive research. Because it was uncertain whether Microsoft would release their dataset, we decided to replicate their efforts and create OpenOrca. With the help of a dedicated team of open-source AI/ML engineers, the OpenOrca dataset is now complete: FLANv2 augmented with GPT-4 and GPT-3.5 completions.

We are currently fine-tuning OpenOrca on LLaMA-13B and plan to release it in mid-July 2023. We are also seeking GPU compute sponsors for training on various platforms; we have estimated the compute costs for different model sizes, and we are grateful for the support of our current sponsors. Finally, our thanks go to everyone in the Open Source AI community who has contributed to this endeavor.
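To make the dataset's shape concrete, here is a minimal sketch of what a single augmented record might look like: a FLANv2-style query paired with an Orca-style system prompt and a teacher-model completion. The field names and helper function below are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of an OpenOrca-style augmented record (field names are hypothetical).
def make_record(system_prompt, question, response, teacher="gpt-4"):
    return {
        "system_prompt": system_prompt,  # Orca-style reasoning instruction
        "question": question,            # original FLANv2 query
        "response": response,            # completion from the teacher model
        "teacher": teacher,              # which model produced the response
    }

record = make_record(
    "You are an AI assistant. Explain your reasoning step by step.",
    "What is 17 * 3?",
    "17 multiplied by 3 is 51.",
)
print(record["teacher"])  # -> gpt-4
```

In this framing, the augmentation step simply attaches a detailed system prompt and a teacher completion to each FLANv2 query, which is the core of the Orca recipe.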
