Magicoder is a series of fully open-source Large Language Models (LLMs) for code that aims to close the gap with top code models while using no more than 7B parameters. These models are trained on synthetic instruction data produced by a novel approach called OSS-Instruct, which seeds generation with open-source code snippets to yield high-quality instruction data for code. By grounding the LLM in a diverse range of open-source references, OSS-Instruct mitigates the inherent bias of purely model-generated synthetic data. Magicoder and its enhanced version, MagicoderS, outperform comparable code models on a range of coding benchmarks, including Python text-to-code generation and data-science program completion. Notably, MagicoderS-CL-7B even surpasses ChatGPT on HumanEval+ in terms of pass@1. OSS-Instruct thus opens up new possibilities for low-bias, high-quality instruction tuning using abundant open-source references.
https://arxiv.org/abs/2312.02120
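To make the OSS-Instruct idea concrete, here is a minimal sketch of its core step: sampling a seed snippet from a real source file and wrapping it in a prompt that asks an LLM to invent a coding problem and solution inspired by that snippet. The function names and exact prompt wording below are illustrative assumptions, not the authors' actual template or pipeline.

```python
# Sketch of the OSS-Instruct seeding step (names and prompt text are
# illustrative assumptions, not the paper's exact template).
import random
import textwrap


def extract_seed_snippet(document: str, num_lines: int = 5, rng=None) -> str:
    """Pick a random contiguous run of lines from a source file as the seed."""
    rng = rng or random.Random(0)
    lines = document.splitlines()
    if len(lines) <= num_lines:
        return document
    start = rng.randrange(len(lines) - num_lines + 1)
    return "\n".join(lines[start:start + num_lines])


def build_oss_instruct_prompt(seed_snippet: str) -> str:
    """Wrap the seed snippet in an instruction-generation prompt for an LLM."""
    return textwrap.dedent("""\
        Gain inspiration from the following random code snippet to create a
        high-quality programming problem, then write a complete,
        self-contained solution to it.

        Code snippet for inspiration:
        ```
        {snippet}
        ```

        Respond with two sections: [Problem Description] and [Solution].
        """).format(snippet=seed_snippet)


source_file = "def add(a, b):\n    return a + b\n\nprint(add(2, 3))\n"
seed = extract_seed_snippet(source_file, num_lines=3)
prompt = build_oss_instruct_prompt(seed)
print(prompt)
```

The resulting prompt would be sent to a strong teacher LLM; its problem/solution pairs then become the fine-tuning data. The key design point is that the seed snippet, not the model's priors, drives the diversity of the generated tasks.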