How to Use Optimum-Intel to Accelerate LLaMA 3.1 on LattePanda Sigma (x86 SBC) Integrated GPU?
Editor's Note:
LLaMA 3.1 is now widely running on various local devices, but mostly on CPUs. This article introduces a method using Intel Optimum-Intel to optimize its performance. The author utilized this method on a LattePanda Sigma, an x86 single board computer/server, leveraging the integrated GPU to accelerate LLaMA 3.1's inference speed, with impressive results. This method can be applied to devices with Intel integrated graphics.
Introduction:
Those who have read previous articles know that we can use the LattePanda Sigma to run large language models around 10B parameters. Today, let's explore how to use the Optimum-Intel tool from OpenVINO to perform inference with the integrated GPU on the LLaMA-3.1-8B-Instruct model.
Optimum Intel:
Optimum-Intel serves as an interface layer between the Transformers and Diffusers libraries and various optimization tools provided by Intel. It offers developers a simplified way to use these libraries with Intel's hardware-optimized technologies, such as OpenVINO™, IPEX, etc., accelerating inference performance of AI models based on Transformer or Diffusion architectures on Intel hardware.
Optimum-intel
LLaMA 3.1 Overview:
LLaMA 3.1 with 405B parameters supports a context length of 128K tokens and was trained on 150 trillion tokens using over 16,000 H100 GPUs. Researchers have found that LLaMA 3.1 405B is comparable to top industry models like GPT-4, Claude 3.5 Sonnet, and Gemini Ultra, based on evaluations across more than 150 benchmark datasets.
Download the pre-trained weights of the Meta-Llama-3.1-8B-Instruct model provided by the ModelScope community using the following command. If you already have it, you can skip this step.
git clone --depth=1 https://www.modelscope.cn/LLM-Research/Meta-Llama-3.1-8B-Instruct.git
LattePanda Sigma Overview:
The LattePanda Sigma is equipped with a 13th-gen Intel Core i5-1340P processor, featuring 12 cores and 16 threads, with a turbo boost of up to 4.6GHz, delivering extreme performance and multitasking capabilities. This single-board computer uses Intel Iris Xe integrated graphics, which has 80 execution units (EU) and a maximum dynamic frequency of 1.4GHz, providing excellent graphics performance and rendering quality. We can leverage its integrated GPU to perform LLM inference.
LattePanda Sigma CPU
LattePanda Sigma GPU
Setting Up the Development Environment:
Please download and install Anaconda, then create and activate a virtual environment named llama31
using the following commands, and install Optimum Intel and its dependencies, OpenVINO and NNCF:
conda create -n llama31 python=3.11 # Create virtual environment conda activate llama31 # Activate virtual environment git clone https://gitee.com/Pauntech/llama3.1-model.git # git clone the repo frem gitee python -m pip install --upgrade pip # Upgrade pip to the latest version pip install optimum-intel[openvino,nncf] # Install Optimum Intel and its dependencies OpenVINO and NNCF pip install -U transformers # Upgrade transformers library to the latest version
Install Result
Using Optimum-CLI to Quantize the LLaMA 3.1 Model to INT4:
Optimum-CLI is a cross-platform command-line tool that comes with Optimum-Intel, allowing you to perform model quantization without writing code. You can use it to quantize the LLaMA 3.1 model and convert it to the OpenVINO format:
optimum-cli export openvino --model C:\Meta-Llama-3.1-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym llama31_int4
The meanings of the parameters in the optimum-cli
command are as follows:
--model
specifies the path of the model to be quantized.
--task
specifies the task type.
--weight-format
specifies the precision of the model parameters.
--group-size
defines the group size during the quantization process.
--ratio
determines the proportion of weights retained during quantization.
--sym
indicates that symmetric quantization is used.
Quantization Result
Building a Chatbot Based on the LLaMA 3.1 Model:
First, install the required packages:
pip install gradio mdtex2html streamlit -i https://mirrors.aliyun.com/pypi/simple/
Then run python llama31_chatbot.py
, and the result will be as shown below:
Chatbot Interface and GPU Running
Conclusion:
The Optimum Intel toolkit based on OpenVINO™ is simple and easy to use. With just one command, you can quantize the LLaMA 3.1 model to INT4, and with some basic preprocessing, you can use LLaMA 3.1 on the LattePanda Sigma and achieve excellent results. If you need to run large language models locally, you might consider deploying them on the LattePanda Sigma. Next, we will introduce how to use the ipexllm tool to utilize LattePanda's integrated GPU for LLM inference.