OpenCoder: Open Cookbook for Top-Tier Code Large Language Models

OpenCoder is an open and reproducible family of code LLMs whose performance rivals that of top-tier code LLMs. The family includes 1.5B and 8B base and chat models that support both English and Chinese, trained on 2.5 trillion tokens of raw code and code-related web data. To support researchers, OpenCoder provides model weights, inference code, training data, the data processing pipeline, experimental results, and training protocols. RefineCode offers a large-scale pretraining corpus, and ablation studies offer insights into design choices. Contributors from various institutions have released these resources, including model weights, data pipelines, and evaluation tools, to advance code AI.

https://opencoder-llm.github.io/