
Review of Llama3.2 vs Llama3.1: Performance on LattePanda Mu x86 Compute Module


In the rapidly evolving world of Artificial Intelligence (AI), model updates are common, each promising improved performance, efficiency, or both. The recent update from Llama3.1 to Llama3.2 is no exception. This article delves into the benchmark results of these two Large Language Models (LLMs) on the LattePanda Mu x86 compute module, focusing on their speed and efficiency. For those interested in related performance comparisons, our previous article on running SLMs (phi3, gemma2, mathstral, llama3.1) on the LattePanda Mu provides additional insights.

 

Overview of Llama Models

Before diving into the benchmark results, let's take a brief look at the Llama models. Llama is a family of large language models developed for natural language processing (NLP) tasks. These models are known for their impressive performance in generating human-like text, language understanding, and other NLP tasks.

Llama3.1 and Llama3.2 are two versions of the model, with Llama3.2 being the newer iteration. The primary difference between the two lies in their architecture and optimization techniques. Llama3.2 includes lightweight models with 1B and 3B parameters, specifically optimized for edge computing and mobile devices.

 

Overview of LattePanda Mu

LattePanda Mu is an x86 compute module featuring an Intel N100 quad-core processor, 8GB LPDDR5 memory, and 64GB storage. It offers various interfaces like HDMI, USB, and PCIe lanes, making it suitable for AI and edge computing applications. The module's versatility and open-source carrier board files allow for easy customization for specific use cases.

 

Overview of Runtime Frameworks

Ollama

Ollama is a lightweight and scalable framework primarily designed for building and running LLMs on local machines. Its main features and operational architecture include:

  • Underlying Implementation: The foundation of Ollama is built around the LLaMA model and the llama.cpp framework. By utilizing lightweight implementations, efficient memory management, quantization techniques, and hardware acceleration support, Ollama enables large language models to run efficiently in resource-constrained environments.
  • Local Execution: Ollama is designed for easy deployment and operation of large language models on local machines. Users can install and use Ollama across different operating systems, allowing for seamless loading, running, and interaction with the models.
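
For example, once Ollama is running locally, other programs can talk to it over its built-in REST API. The sketch below (Python, assuming the requests package is installed and the llama3.2:1b model has already been pulled) sends a single prompt and prints the reply:

import requests  # third-party HTTP client: pip install requests

# Ollama serves a local HTTP API on port 11434 by default.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",   # any model already pulled with `ollama run` / `ollama pull`
        "prompt": "Explain what an x86 compute module is in one sentence.",
        "stream": False,          # return the whole answer as a single JSON object
    },
)
print(response.json()["response"])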

 

Installing Ollama on LattePanda Mu

Ollama, an open-source runtime framework, is designed to make it easy to run language models such as Llama 3.2 on a wide range of platforms. The commands below assume a Debian/Ubuntu-based Linux system on the LattePanda Mu. To install Ollama, follow these steps:

1. Update Your System: refresh the package lists and upgrade installed packages. This can be done using the following command:

sudo apt-get update && sudo apt-get upgrade

 

2. Install Ollama: run the official installation script using the following command:

curl -fsSL https://ollama.com/install.sh | sh

 

3. Run the llama3.2 models:

ollama run llama3.2:1b

ollama run llama3.2:3b
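
The token speeds quoted later in this article can be estimated in a similar way: Ollama's /api/generate response includes eval_count (the number of generated tokens) and eval_duration (generation time in nanoseconds). A minimal sketch, assuming the Ollama server is running locally and the model has already been pulled:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b",
          "prompt": "Write a short paragraph about edge AI.",
          "stream": False},
).json()

# eval_count = number of generated tokens, eval_duration = generation time in nanoseconds
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.2f} tokens/s")

The ollama run command can also print similar timing statistics directly in the terminal when started with the --verbose flag.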

 


OpenVINO

OpenVINO (Open Visual Inference and Neural Network Optimization) is an open-source toolkit developed by Intel, primarily used for optimizing and accelerating the inference of deep learning models. The OpenVINO Runtime automatically optimizes the deep learning pipeline by employing aggressive graph fusion, memory reuse, load balancing, and parallel inference across hardware such as CPUs, GPUs, and VPUs to reduce end-to-end latency and increase throughput. On the LattePanda Mu, the OpenVINO framework can be used to accelerate LLM inference on the integrated graphics.
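
Before running any model, it is worth checking which inference devices OpenVINO can see on the LattePanda Mu. A minimal sketch using the OpenVINO Python package (assuming it has been installed with pip install openvino and the Intel graphics driver is present):

import openvino as ov

core = ov.Core()
# On the N100 this typically lists 'CPU' and 'GPU' (the integrated graphics).
print(core.available_devices)
for device in core.available_devices:
    print(device, core.get_property(device, "FULL_DEVICE_NAME"))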

 

Installing OpenVINO on LattePanda Mu (Windows)

1. Install Anaconda

Visit the official Anaconda website and download the installer for your system version. Follow the step-by-step prompts to complete the installation.

2. Install Git

3. Install Microsoft Visual C++ Redistributable

4. In the Anaconda Prompt, create a Conda environment with the required Python version, activate it, clone the openvino_notebooks repository, install the dependencies, and launch the llm-chatbot notebook:

conda create -n yolov8 python=3.8

conda activate yolov8

git clone --depth=1 https://github.com/openvinotoolkit/openvino_notebooks.git

cd openvino_notebooks

python -m pip install --upgrade pip wheel setuptools

pip install -r requirements.txt

jupyter lab notebooks/llm-chatbot
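
The llm-chatbot notebook handles model download, INT4 conversion, and the chat UI interactively. Stripped down to its essentials, inference with an already-converted OpenVINO model looks roughly like the sketch below, which uses the optimum-intel integration; the model directory name is a hypothetical placeholder, and device="GPU" targets the N100's integrated graphics:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Hypothetical path to an INT4 OpenVINO export of llama3.2 1b produced by the notebook
model_dir = "llama-3.2-1b-instruct-int4-ov"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="GPU")  # use "CPU" to compare

inputs = tokenizer("What is the LattePanda Mu?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))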

 

Performance

Ollama

llama3.2:1b (q4): 18 tokens/s

Test of Ollama llama3.2 on LattePanda Mu x86 Compute Module
Test Report of llama3.2 on LattePanda Mu x86 Compute Module

 

llama3.2:3b (q4): 11.18 tokens/s

Speed Test of llama3.2 on LattePanda Mu x86 Compute Module

 

OpenVINO

llama3.2:1b (q4, AWQ disabled): 13.1 tokens/s

Speed Test of OpenVINO llama3.2 on LattePanda Mu x86 Compute Module
Test of OpenVINO llama3.2 on LattePanda Mu x86 Compute Module
Test Report of OpenVINO llama3.2 on LattePanda Mu x86 Compute Module

 

llama3.2:1b (q4, AWQ enabled): 13.59 tokens/s

Conclusion

Token speed on LattePanda Mu

Model            Token speed (tokens/s)    Size     Runtime framework
llama3.2:1b-q4   18                        1.3GB    Ollama (CPU)
llama3.2:1b-q4   13.1                      825MB    OpenVINO, AWQ disabled
llama3.2:1b-q4   13.59                     824MB    OpenVINO, AWQ enabled
llama3.2:1b-q4   4.28                      824MB    OpenVINO on CPU, AWQ enabled
llama3.2:3b-q4   11.18                     2GB      Ollama (CPU)
llama3.1:8b-q4   3.18                      4.7GB    Ollama (CPU)

 

llama3.2 1b: The original (unquantized) model is 2.3GB and can be quantized directly on the LattePanda Mu.

“Activation-aware Weight Quantization (AWQ) is an algorithm that adjusts model weights to achieve more accurate INT4 compression. It slightly improves the quality of the generated compressed LLM but requires significant additional time to adjust the weights on the calibration dataset.”
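
In the llm-chatbot notebook, AWQ is simply a switch applied at weight-compression time. A rough sketch of what that corresponds to with NNCF's weight compression API is shown below; the model path is a hypothetical placeholder, and the exact parameter names (awq, dataset) are assumptions that may vary between NNCF versions:

import nncf
import openvino as ov

core = ov.Core()
# Hypothetical path to the FP16 OpenVINO IR of llama3.2 1b
model = core.read_model("llama-3.2-1b/openvino_model.xml")

# Plain INT4 weight compression: fast, no calibration data required
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_ASYM)

# AWQ-assisted compression would additionally pass a calibration dataset, e.g.
# compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_ASYM,
#                                    awq=True, dataset=nncf.Dataset(calibration_samples))
# which improves accuracy slightly but takes significantly longer, as noted above.

ov.save_model(compressed, "llama-3.2-1b-int4/openvino_model.xml")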

 

In addition to token speed, model size and the choice of runtime framework are important considerations when deploying AI models in real-world applications. A smaller model is easier to deploy on devices with limited resources, while a faster runtime ensures a smoother user experience.

 

The benchmark results indicate that Llama3.2 has a smaller model size than Llama3.1, which can benefit deployment on edge devices or in environments with constrained resources. For example, the 1b-q4 version of Llama3.2 is 1.3GB, significantly smaller than the 8b-q4 version of Llama3.1 at 4.7GB. Moreover, for running llama3.2 inference on the LattePanda Mu, the Ollama framework delivered higher token speeds than OpenVINO in these tests.

 

Reference

OpenVINO code: llm-chatbot.ipynb (from the openvino_notebooks repository)

 

Related Articles

If you're interested in exploring how these models perform on different SBC platforms, check out our previous articles:

Deploy and run LLM on Raspberry Pi 4B (LLaMA, Alpaca, LLaMA2, ChatGLM)

Deploy and run LLM on LattePanda 3 Delta 864 (LLaMA, LLaMA2, Phi-2, ChatGLM2)

Deploy and run LLM on LattePanda Sigma (LLaMA, Alpaca, LLaMA2, ChatGLM)

Deploy and run LLM on Raspberry Pi 5 (LLaMA, LLaMA2, Phi-2, Mixtral-MoE, mamba-GPT)

Run SLMs (phi3, gemma2, mathstral, llama3.1) on SBC (LattePanda 3 Delta)

Run SLM (phi3, gemma2, mathstral, llama3.1) on SBC (LattePanda Sigma)
