Testing the LLaMA Language Model on LattePanda Sigma: Unleashing AI Capabilities on an SBC
In the era of artificial intelligence (AI) and big data, demand for language processing and intelligent programming has steadily increased, and language models have advanced accordingly. More and more enterprises and research institutions are actively investing in the development and application of large language models, which offer superior performance and functionality compared to traditional language models and give developers more room for innovation and convenience across many scenarios. The progress of single-board computers (SBCs) has also played a significant role: many large language models can now run on these devices, opening up new possibilities and opportunities, as developers can deploy and use large language models efficiently on low-cost hardware.
LLaMA (Large Language Model Meta AI) is a state-of-the-art large-scale language model designed to help researchers advance their work in this subfield of artificial intelligence. The model is intended to be widely applicable, suiting many different use cases rather than being tailored to one specific fine-tuned application. LLaMA's code is open source, which makes it easier for other researchers to test new methods for limiting or eliminating problems in large language models. Meta also provides a set of evaluation benchmarks in the paper to assess the model's biases and toxicity, demonstrating its limitations and supporting further research in this critical field.
LLaMA 7B is the smallest model in the LLaMA series, with 7 billion parameters trained on 1 trillion tokens. It is comparable to GPT-series models of a similar parameter count while maintaining a high level of performance. Compared to LLaMA 13B, the 7B model has a smaller footprint and is suitable for resource-constrained environments, while still delivering good results. LLaMA 7B provides a stable and feasible option for running on a single board, enabling developers to achieve high-quality language processing at relatively low cost.
LLaMA 13B is one of the foundational large-scale language models in the LLaMA series released by Meta, with 13 billion parameters. As a medium-sized language model, it offers better performance and comprehension than the 7B model. In most benchmark tests, LLaMA 13B outperforms some existing large-scale language models, such as GPT-3. Notably, like the other models in the LLaMA series, LLaMA 13B is trained solely on publicly available data, making it compatible with open-source release, whereas most existing models rely on undisclosed or undocumented data. The main difference between LLaMA 13B and LLaMA 7B is parameter count: 13 billion versus 7 billion. Generally, the more parameters a model has, the larger its capacity and the more complex the patterns it can learn and express; however, larger models also require more computing resources for training and inference. The two models therefore offer different trade-offs between performance and resource demands. It is worth noting that, despite having far fewer parameters than GPT-3 (175 billion), LLaMA 13B outperforms it in most benchmark tests, further proving that models trained on public datasets can achieve excellent performance.
LLaMA 7B and 13B, as mentioned earlier, typically require large amounts of computing power to run properly, which can discourage users without powerful GPUs. However, developer Georgi Gerganov has created a pure C/C++ port of LLaMA inference, llama.cpp (with a simple Python example), for running the model. Compared to the original LLaMA code, this pure C/C++ version has no additional dependencies, can be compiled directly into an executable for almost any hardware, and supports mixed F16/F32 precision. It also supports 4-bit quantization and can run entirely on a CPU without a GPU. Additionally, it can handle various mainstream large language models, such as LLaMA, Alpaca, and Vicuna.
After the structure of a deep neural network is designed, the core objective of training is to determine the weight parameters of each neuron. These weights are usually stored as floating-point numbers at various precisions, such as 16, 32, or 64 bits, and are typically produced through GPU-accelerated training. Quantization is the process of reducing the precision of these weight parameters in order to lower hardware requirements.
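To make the idea concrete, here is a hedged Python sketch of block-wise 4-bit quantization. This is an illustration of the general technique, not llama.cpp's actual Q4_0 code: each block of weights is reduced to one floating-point scale plus one small signed integer per weight, trading a little accuracy for a large reduction in storage.

```python
# Illustrative sketch of block-wise 4-bit quantization (NOT llama.cpp's
# actual Q4_0 implementation): each block of weights is stored as one
# float scale plus one 4-bit signed integer (-8..7) per weight.

def quantize_q4(weights, block_size=32):
    """Quantize a list of floats into (scale, 4-bit ints) per block."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block) or 1.0
        scale = amax / 7.0  # map the largest magnitude onto the 4-bit range
        qs = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, qs))
    return blocks

def dequantize_q4(blocks):
    """Reconstruct approximate floats from the quantized blocks."""
    return [q * scale for scale, qs in blocks for q in qs]

weights = [0.12, -0.53, 0.98, -0.07, 0.44, -0.81, 0.05, 0.66]
restored = dequantize_q4(quantize_q4(weights, block_size=8))
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(max(errors))  # reconstruction error is bounded by about scale/2
```

The per-block scale is why real quantized files are slightly larger than a bare "4 bits per weight" count would suggest.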
For example, the LLaMA model uses 16-bit floating-point precision, and the 7B version has 7 billion parameters, giving a full size of about 13 GB. Users need at least that much memory and disk space just to hold the model. The 13B version is even more prohibitive at about 24 GB. However, through quantization, such as reducing the precision to 4 bits, the 7B and 13B versions can be compressed to roughly 4 GB and 8 GB respectively, making them far more accessible on consumer-grade hardware.
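These sizes can be sanity-checked with back-of-the-envelope arithmetic: parameters times bytes per parameter. The small shortfall in the 4-bit figures versus the quoted ~4 GB and ~8 GB is the per-block scale overhead that real quantization formats add.

```python
# Back-of-the-envelope model sizes: parameter count x bits per weight.
# Real quantized files are slightly larger due to per-block scales.

GIB = 1024 ** 3

def model_size_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / GIB

for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    f16 = model_size_gib(n_params, 16)  # 16-bit floats: 2 bytes/weight
    q4 = model_size_gib(n_params, 4)    # 4-bit ints: 0.5 bytes/weight
    print(f"{name}: F16 ~ {f16:.1f} GiB, Q4 ~ {q4:.1f} GiB")
```

This reproduces the article's figures: roughly 13 GiB and 24 GiB at F16, shrinking to about a quarter of that at 4 bits.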
The quantization implementation in llama.cpp is based on the author's other library, ggml, which implements machine-learning tensors in C/C++. Tensors are the core data structure of neural network models and are used throughout frameworks such as TensorFlow and PyTorch. By using C/C++ instead, llama.cpp can support a broader range of hardware and achieve higher efficiency; ggml is the foundation on which llama.cpp is built.
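To show what such a tensor operation boils down to, here is a minimal pure-Python sketch of a matrix-vector product, the workhorse of transformer inference; ggml implements this kind of operation in optimized C rather than Python, but the math is the same.

```python
# A dense layer is essentially a matrix-vector product over tensors.
# Libraries like ggml implement this in C for speed and portability.

def matvec(matrix, vector):
    """Multiply a 2-D tensor (list of rows) by a 1-D tensor."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

W = [[0.5, -1.0],
     [2.0, 0.25]]   # weights: 2x2 tensor
x = [4.0, 2.0]      # input: 1-D tensor
print(matvec(W, x)) # [0.0, 8.5]
```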
By running llama.cpp on single-board computers, developers can bring advanced natural language processing to a wide range of low-power, low-cost hardware systems. This will inject powerful momentum into the development of intelligent hardware and systems, while greatly enhancing people's quality of life.
This article uses the LattePanda Sigma (running Ubuntu 20.04) as an example to run LLaMA 7B and 13B (quantized to 4 bits) on a single-board computer. The LattePanda Sigma is a powerful and compact x86 single-board computer (SBC) equipped with a 13th-generation Intel® Core™ i5-1340P Raptor Lake processor (12 cores, 16 threads). With 16 GB of dual-channel LPDDR5-6400 RAM and lightning-fast speed, it can easily handle multiple tasks simultaneously. You can learn more about this board on the DFRobot website.
The LLaMA model released by Meta is not licensed for commercial use, and Meta has not officially open-sourced the model weights (although many third-party download links already exist online). Therefore, to obtain the model you must first fill out Meta's official application form. Once your request is approved, you will receive a link to download the tokenizer and model files. Edit the download.sh script with the signed URL provided in the email to download the model weights and tokenizer.
Installation and Running
Refer to the steps in the llama.cpp README on GitHub and follow the guide for your system to install the software and quantize the model.
The standards for the accuracy and logic testing are as follows: given that various large language models already have standardized test procedures, this evaluation focuses on their ability to run on the Sigma, serving as a reference for those choosing an SBC (single-board computer). The question settings and the judgment of accuracy and logic are entirely subjective and are for reference only.
LLaMA 7B (Q4)
Startup time is less than 2 seconds (the time required from starting to load the model to when the model is ready to generate output).
Response speed is about 5 tokens/s, the fastest among the LLMs evaluated (owing to its small size). However, the cost of this speed is poor accuracy and weak logical ability, and it often gives answers unrelated to the question. If an answer requires step-by-step examples, it stops after the first sentence with no further output. Additionally, the program crashes after about 10 rounds and needs to be restarted.
LLaMA 13B (Q4)
Startup time is 2 minutes and 10 seconds.
Response speed is slow, with tokens visibly appearing one at a time. However, answer accuracy is much higher, with a significant improvement in correct responses, although answers involving URLs are still dead links (the training data is too old). The testing process showed high stability: although the fan ran at full capacity, the program did not crash and could still answer questions normally after about 20 rounds.
In terms of startup speed, the 7B model takes only 2 seconds versus over 2 minutes for the 13B, and the 7B is also much faster at generating a fixed number of tokens. However, the 13B outperforms the 7B in logic and accuracy. If you value fast, efficient Q&A, the 7B is the better choice; if your project relies on a large data foundation requiring more accurate responses and stronger conversational logic, then the 13B model is recommended.