GPU Embedding with GGML

Semantic search powers chatbots like bloop, which answers questions about code, but indexing can be slow on large codebases like LLVM. To improve indexing speed, the bloop team set out to run their embedding model on the MacBook's GPU. They first tried ONNX Runtime with the Core ML execution provider, but not all of the model's operations were supported. After exploring alternatives, they settled on ggml, a tensor library that supports 4-bit quantization and can offload operations to Apple GPUs. Along the way they hit NaN outputs, which a fix borrowed from llama.cpp resolved. Benchmarking showed the ggml model's embeddings were comparable in quality to the ONNX model's, and by batching inputs and reshaping tensors the team achieved a significant speedup on the GPU. The broader lesson: the ecosystem for running open-source AI models on-device is evolving rapidly, with improvements and features landing in libraries like ggml and llm all the time.
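The batching optimization boils down to replacing many matrix-vector products with a single matrix-matrix product, which keeps the GPU busy. Below is a minimal sketch of that idea using ggml's graph API. It is illustrative only: the API names follow a mid-2023 snapshot of ggml (the interface changes frequently between versions), the tensor sizes are made up rather than taken from bloop's model, and the graph runs on the CPU unless the separate Metal backend setup from the post is added.

```c
// Illustrative only: ggml API names as of a mid-2023 snapshot of the
// library; the interface changes frequently between versions.
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // Made-up sizes for illustration, not bloop's actual embedding model.
    const int n_embd  = 384;  // embedding dimension
    const int n_batch = 32;   // inputs embedded in one pass

    struct ggml_init_params params = {
        .mem_size   = 64 * 1024 * 1024,  // arena for tensors and the graph
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // A weight matrix and a batch of n_batch input vectors packed into
    // one 2D tensor, so a single mul_mat covers the whole batch.
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_embd);
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_batch);
    ggml_set_f32(W, 0.1f);  // dummy values so the compute is well-defined
    ggml_set_f32(x, 1.0f);

    // One matrix-matrix product instead of n_batch matrix-vector products.
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    // Build the compute graph and run it (CPU here; the Metal backend
    // needs extra setup not shown in this sketch).
    struct ggml_cgraph gf = ggml_build_forward(y);
    ggml_graph_compute(ctx, &gf);

    printf("output: %lld x %lld\n", (long long) y->ne[0], (long long) y->ne[1]);
    ggml_free(ctx);
    return 0;
}
```

The reshape the team describes serves the same goal: packing inputs so the backend sees one large, GPU-friendly operation rather than many small ones.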

https://bloop.ai/blog/gpu_with_ggml
