Llama 2 on ONNX runs locally

Llama 2 is an optimized version of the Llama 2 model, available from Meta under the Llama Community License Agreement. Microsoft permits users to use, modify, redistribute, and create derivatives of their contributions to the optimized version. Access to the ONNX files in this repository is controlled, and users must fill out a request form to gain permissions to the Llama 2 model. The Llama 2 model consists of decoder layers, each constructed from a self-attention layer and a feed-forward multi-layer perceptron. Llama 2 utilizes the Grouped Query Attention mechanism for improved efficiency. Examples of code usage for running Llama 2 with ONNX are provided in the repository. Additionally, there is a chat bot interface available for interaction with Llama 2. Users are advised to follow specific formatting guidelines when using the fine-tuned models for dialogue applications in order to achieve the desired features and performance. The first inference session may be slow due to the need to generate JIT binaries, but subsequent runs are faster. Users can optimize inference speed by putting inputs/outputs on the target device and following the provided guidelines. Microsoft and Meta emphasize the importance of responsible development and provide resources and tools for developers to ensure responsible AI usage.

https://github.com/microsoft/Llama-2-Onnx