Deploy and run LLM on Lattepanda 3 Delta 864 (LLaMA, LLaMA2, Phi-2, ChatGLM2)
Introduction
This article guides you through deploying and running popular LLMs (Large Language Models), including LLaMA, LLaMA2, Phi-2, and ChatGLM2, on the Lattepanda 3 Delta 864. We compare these models' runtime speed, resource consumption, and output quality to help you select a model that meets your needs, and to provide a reference for AI experiments on limited hardware. We also walk through the key steps and considerations for testing LLM performance on the Lattepanda 3 Delta 864.
How to Choose an LLM
An LLM project usually lists its CPU/GPU prerequisites in its requirements. Since GPU inference is not currently available for LLMs on the Lattepanda 3 Delta 864, we need to prioritize models that support CPU inference. Because of the board's limited RAM, we should also prefer smaller models: as a rule of thumb, a model needs roughly twice its file size in RAM to run smoothly. Quantized models have much lower memory demands; for example, a 4-bit quantized 6B model is roughly 3.5GB on disk, so by this rule of thumb it needs about 7GB of RAM and just fits within the board's 8GB. We therefore recommend quantized models for experiencing LLMs on the Lattepanda 3 Delta 864.
The following list is a selection of smaller models from the open_llm_leaderboard on the Hugging Face website, plus some of the latest popular models.
P.S.
1. ARC (AI2 Reasoning Challenge)
2. HellaSwag (tests the model's common-sense reasoning abilities)
3. MMLU (Measuring Massive Multitask Language Understanding)
4. TruthfulQA (measures how models mimic human falsehoods)
How to Run an LLM
We used llama.cpp-style GGML tooling with the CPU of the Lattepanda 3 Delta 864 to run LLM inference. Here, we take ChatGLM-6B (via the related chatglm.cpp project) as an example and give detailed instructions for deploying and running an LLM on the Lattepanda 3 Delta 864, which has 8GB RAM and 64GB eMMC and runs Ubuntu 20.04.
Quantization
The following is the process of quantizing ChatGLM-6B to 4-bit via GGML on a Linux PC:
The first part of the process is to set up chatglm.cpp on a Linux PC, download the ChatGLM-6B-int4 model, convert it, and copy the result to a USB drive. We need the Linux PC's extra power for the conversion, as the 8GB of RAM on the Lattepanda 3 Delta 864 is insufficient.
Clone the chatglm.cpp repository to your local machine:
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
If you forgot the --recursive flag when cloning the repository, run the following command in the chatglm.cpp folder:
git submodule update --init --recursive
Install necessary packages:
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
Compile the project using CMake:
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release
The ChatGLM-6B conversion relies on an older transformers API, so pin version 4.33.0:
pip uninstall -y transformers
pip install transformers==4.33.0
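Optionally, verify that the pinned version is the one that is active:
python3 -c "import transformers; print(transformers.__version__)"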
Download the model and its companion files into chatglm.cpp/THUDM/chatglm-6b from https://huggingface.co/THUDM/chatglm-6b-int4
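One convenient way to fetch everything is a Git LFS clone (a sketch, assuming git-lfs is installed; downloading the files manually from the model page also works). Run this inside the chatglm.cpp folder:
git lfs install
git clone https://huggingface.co/THUDM/chatglm-6b-int4 THUDM/chatglm-6b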
Use convert.py to transform ChatGLM-6B into the quantized GGML format. For example, to convert the fp16 original model to a q4_0 (4-bit quantized) GGML model, run:
python3 chatglm_cpp/convert.py -i THUDM/chatglm-6b -t q4_0 -o chatglm-ggml.bin
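Finally, copy the converted model to the USB drive. The mount point below is only an assumption; substitute the path where your drive is actually mounted:
cp chatglm-ggml.bin /media/$USER/USB_DRIVE/  # USB_DRIVE is a placeholder mount point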
Model Deployment
Here is the process of deploying and running ChatGLM-6B-q4 on the Lattepanda 3 Delta 864 under Ubuntu 20.04 (the setup steps mirror those on the Linux PC):
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
git submodule update --init --recursive
python3 -m pip install -U pip
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
sudo apt-get install cmake
cmake -B build
cmake --build build -j --config Release
pip uninstall -y transformers
pip install transformers==4.33.0
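Then copy the quantized model from the USB drive into the chatglm.cpp folder. As before, the mount point is an assumption; adjust it to your system:
cp /media/$USER/USB_DRIVE/chatglm-ggml.bin .  # USB_DRIVE is a placeholder mount point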
To run the model in interactive mode, add the -i flag. For example:
cd chatglm.cpp
./build/bin/main -m chatglm-ggml.bin -i
In interactive mode, your chat history will serve as the context for the next round of conversation.
Run ./build/bin/main -h to explore more options!
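For a one-shot generation instead of an interactive session, pass a prompt with the -p flag (the prompt text here is ours; any prompt works):
./build/bin/main -m chatglm-ggml.bin -p "Introduce the LattePanda 3 Delta in one sentence."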
Summary
Test results: Lattepanda 3 Delta 864 (8GB) & LLM
Test results: Raspberry Pi 5 (8GB) & LLM
Related Articles:
Deploy and run LLM on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)
Deploy and run LLM on Raspberry Pi 5 vs Raspberry Pi 4B (LLaMA, LLaMA2, Phi-2, Mixtral-MOE, mamba-gp
Deploy and run LLM on Raspberry Pi 4B (LLaMA, Alpaca, LLaMA2, ChatGLM)