Comprehensive Guide to Running LLMs (LLaMA, Alpaca, and BERT) on Single-Board Computers

In this article, we present a selection of large language models (LLMs), such as LLaMA, Alpaca, and BERT, that can run on a central processing unit (CPU). These models are built on natural language processing (NLP) techniques and are useful in domains such as text generation, text comprehension, and conversational systems.

We also show how to configure and test these models on different single-board computers, such as the Raspberry Pi and LattePanda.

Why do certain AI models employ CPUs while others utilize GPUs?

CPUs excel at handling intricate and heterogeneous computational tasks. However, limits on cache capacity, memory bandwidth, and thread-switching overhead can hold them back on large-scale matrix operations.

GPUs, on the other hand, were originally designed for graphics rendering. They excel at executing a multitude of identical or similar instructions in parallel, such as matrix multiplication and vector addition, which makes them well suited to simple, regular computational tasks. However, their simpler control units, limited branch prediction, and synchronization overhead can slow them down on complex, irregular computational tasks.

If an AI model necessitates parallel execution of numerous matrix operations, employing a GPU would yield higher efficiency.

If an AI model encounters complex and heterogeneous computational tasks, utilizing a CPU would offer greater flexibility.
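To make the distinction concrete, here is a minimal pure-Python sketch of why matrix multiplication parallelizes so well. The function names are illustrative and not taken from any library. Every output element is an independent dot product, so hardware with many simple cores can compute many of them at once.

```python
def dot(row, col):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(row, col))


def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of lists."""
    cols = list(zip(*b))  # columns of b
    # Each output cell depends only on one row of a and one column of b,
    # so all m * n cells could be computed independently, in parallel.
    return [[dot(row, col) for col in cols] for row in a]


def multiply_add_count(m, k, n):
    """Number of multiply-add operations in an (m x k) @ (k x n) product."""
    return m * k * n


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))                        # [[19, 22], [43, 50]]
print(multiply_add_count(4096, 4096, 1))   # 16777216
```

The second call counts the work in a single 4096 x 4096 matrix-vector product, an illustrative size chosen here rather than a figure from the benchmarks in this article.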

Figure: CPU vs GPU for the deployment of deep learning models

(Source: https://blog.purestorage.com)

AI models that can run on CPU

LLaMA

Test 1:

LLM Introduction: LLaMA 7B, LLaMA 13B, llama.cpp

LLaMA is one of the world’s most advanced large language models, and its code is open source. LLaMA comes in several sizes, the smallest of which is LLaMA 7B, with 7 billion parameters. It is far smaller than GPT-3, which has 175 billion parameters, yet maintains high performance.

LLaMA 13B has 13 billion parameters. As a medium-sized language model, it outperforms LLaMA 7B, and its authors report that it surpasses GPT-3 on most benchmarks.

llama.cpp is a pure C/C++ implementation of LLaMA inference (with a simple Python code example). It supports 4-bit quantization and can run on CPUs without a GPU. It can also run various mainstream large language models such as LLaMA, Alpaca, and Vicuna.
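The idea behind 4-bit quantization can be sketched in a few lines of Python. This is a simplified illustration that assumes a symmetric scheme with one shared scale per block of weights. It is not llama.cpp's actual storage format, and the function names are made up for this example.

```python
def quantize_block(weights):
    """Map floats to signed integers in [-7, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the 4-bit signed range used here.
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate float weights from the integers and the scale."""
    return [scale * v for v in q]


block = [0.7, -0.3, 0.12, 0.0]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(q)                                # [7, -3, 1, 0]
print([round(x, 2) for x in restored])  # 0.12 comes back as roughly 0.1
```

Each weight is stored as a small integer instead of a 16-bit or 32-bit float, at the cost of a rounding error of at most half a quantization step per weight.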

Tested single-board computer: LattePanda Sigma Single Board Server

LattePanda Sigma is a powerful x86 Windows single-board computer equipped with a 13th generation Intel® Core™ i5-1340P Raptor Lake (12 cores, 16 threads) processor.

With up to 32GB of RAM (dual-channel LPDDR5-6400MHz), it can easily handle multiple tasks simultaneously.

CPU: Intel® Core™ i5-1340P (LattePanda Sigma)

Benchmark parameters:

Tutorial: Click Here

In this tutorial, we used the version with 16GB RAM.

If you want to run large language models, we recommend the version with 32GB RAM.
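A rough back-of-the-envelope calculation shows why RAM size matters. The sketch below estimates the size of the weights alone, using the parameter counts given above. It ignores quantization scales, the context cache, and operating system overhead, so real memory use will be higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("LLaMA 7B", 7e9), ("LLaMA 13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    q4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {q4:.1f} GB at 4-bit")
# LLaMA 7B: 14.0 GB at 16-bit, 3.5 GB at 4-bit
# LLaMA 13B: 26.0 GB at 16-bit, 6.5 GB at 4-bit
```

By this estimate, 4-bit versions of both models fit comfortably in 16GB, while the unquantized 16-bit weights of LLaMA 13B would need the 32GB version.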

Test 2:

LLaMA & Alpaca

LLM Introduction: LLaMA-7B-Q4, Alpaca-7B-Q4, LLaMA2-7B-chat-hf-Q4

LLaMA-7B-Q4 is a general pre-trained model, Alpaca-7B-Q4 is fine-tuned to follow instructions, and LLaMA2-7B-chat-hf-Q4 is optimized for dialogue scenarios. The Q4 suffix indicates 4-bit quantization.

CPU: Quad-core Broadcom BCM2711 processor, 1.5 GHz

Tested single-board computer: Raspberry Pi 4B (8GB)

The Raspberry Pi 4B single-board computer is inexpensive. We recommend using the 8GB Raspberry Pi 4B with a smaller quantized model to try out and test LLM performance on the Raspberry Pi.

However, the Raspberry Pi has limited RAM, so LLM inference may be significantly slower.
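To compare boards on equal terms, generation speed is usually reported in tokens per second. The helper below is a minimal sketch of how that could be measured. `fake_generate` is a hypothetical stand-in for a real model call and simply sleeps to mimic slow decoding. Replace it with whatever function your runtime exposes.

```python
import time


def tokens_per_second(generate, prompt, max_tokens):
    """Time one call to `generate` and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt, max_tokens):
    """Hypothetical stand-in for a model: emits one token every 10 ms."""
    out = []
    for i in range(max_tokens):
        time.sleep(0.01)
        out.append(f"tok{i}")
    return out


rate = tokens_per_second(fake_generate, "Hello", 20)
print(f"{rate:.1f} tokens/s")
```

Measuring over a fixed prompt and token count on each board keeps the numbers comparable.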

Benchmark parameters:

Tutorial: Click Here

GitHub and Hugging Face links:

4bit/Llama-2-7b-chat-hf · Hugging Face

tloen/alpaca-lora-7b · Hugging Face

GitHub - tatsu-lab/stanford_alpaca: Code and documentation to train Stanford's Alpaca models, and generate the data.

Test 3:

Running BERT with the support of OpenVINO

LLM Introduction: BERT

The BERT model is a pre-trained language model based on the Transformer, which can be used for various natural language processing tasks such as text classification, named entity recognition, and question answering.

OpenVINO is Intel's toolkit for optimizing and deploying AI inference. It can accelerate inference of the BERT model and improve its performance, for example when running on a CPU.
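As a rough sketch of what this looks like in code, the snippet below loads a model with the OpenVINO Python API and compiles it for the CPU. It assumes the `openvino` package (2023.1 or later) is installed and that BERT has already been converted to OpenVINO IR format. The file name `bert.xml` and the helper names are hypothetical and do not come from the tutorial.

```python
def choose_device(available, preferred=("CPU",)):
    """Return the first device from `preferred` that appears in `available`."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(f"none of {preferred} found in {list(available)}")


def compile_bert_for_cpu(model_xml="bert.xml"):
    """Load an OpenVINO IR model and compile it for CPU inference.

    `model_xml` is a hypothetical path; the article does not name a file.
    """
    import openvino as ov  # imported lazily so choose_device works without it

    core = ov.Core()
    device = choose_device(core.available_devices)
    model = core.read_model(model_xml)
    return core.compile_model(model, device)


print(choose_device(["GPU", "CPU"]))  # CPU
```

The compiled model returned by `compile_bert_for_cpu` can then be called with a dictionary of input tensors. The input names depend on how the model was exported, so check them before running inference.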

Integrated graphics: Intel® UHD Graphics (Frequency: 450 – 800 MHz)

Tested single-board computer: LattePanda 3 Delta

LattePanda 3 Delta is an x86 single-board computer based on the Intel Celeron N5105 processor, with 8GB LPDDR4X memory.

Test Results:

High accuracy, supports text or web page links as data sources.

Fast response time, 3 tokens/s.

Tutorial: Click Here