The author reports that their model's performance has improved and generation is now faster, although it appears to use roughly 10% more GPU resources. However, asking the model to generate 512 tokens triggers a CUDA error. The report includes detailed technical information about the system setup, including the operating system, compiler, and CUDA toolkit version. The main content is a generated story about llamas, in which a young llama named Lluvia joins a group of llamas on a journey to deliver supplies to a village in need, learning valuable lessons along the way. The thread concludes with the model successfully generating text once the token limit is reduced from 512 to 400.
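As a rough illustration (not taken from the PR itself; the binary name, model path, and prompt below are placeholder assumptions), a llama.cpp invocation of that era that caps generation at 400 tokens instead of 512 would look something like this, using the real -n / --n-predict flag:

    # Hypothetical sketch: lower the generation limit to 400 tokens
    # to work around the CUDA error seen at 512.
    # -n sets the number of tokens to predict; the model file and
    # prompt are placeholders, not values from the PR.
    ./main -m ./models/7B/ggml-model-q4_0.bin -n 400 -p "Write a story about llamas."

The only substantive change from a failing run would be the -n value; all other flags stay the same.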
https://github.com/ggerganov/llama.cpp/pull/1827