
Making AI Talk Faster! How Advantech Unleashes Peak Large Language Model Performance on Qualcomm GPUs with vLLM

Author: Advantech ESS

Hey everyone! AI enthusiasts, our partner AEs and sales teams, and all friends curious about cutting-edge technology!

Imagine having a smooth conversation with an AI assistant that understands your needs and responds almost instantly, without annoying delays. This isn’t a distant future; it’s a goal Advantech is actively working towards! In the wave of AI, Large Language Models (LLMs) are undoubtedly shining stars, capable of writing, translating, coding, and even performing complex reasoning. However, making these massive models “speak” quickly and effectively on edge or cloud devices is no easy feat. It requires robust hardware support and efficient software technology.

Today, we want to share an exciting experimental result from the Advantech engineering team. We delved into how combining the high-performance Qualcomm Cloud AI 100 Ultra accelerator with vLLM, an open-source library designed specifically for LLM inference, can significantly boost the execution efficiency of AI models. This not only demonstrates Advantech’s commitment to continuous R&D in the AI field but also opens up more possibilities for our customers and partners!

Why is Fast and Efficient LLM Inference So Important?

While Large Language Models are powerful, they are also incredibly “large,” often containing billions or even trillions of parameters. In practical applications (the “inference” stage), every time a user inputs a question or command, the model needs to perform extensive calculations to generate a response. If this process is too slow, the user experience is severely impacted, and many real-time applications (such as smart customer service, voice assistants, and real-time translation) become difficult to implement.

Traditional inference methods are often inefficient and resource-intensive. This is where technologies like vLLM come into play. vLLM is an open-source library designed specifically for large language model inference. It employs advanced techniques (such as PagedAttention, which manages the model's KV cache in fixed-size blocks) to significantly increase throughput (think of it as how many tokens the AI can process or generate per second) and reduce latency, making AI responses faster and smoother.
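
To give a concrete feel for what vLLM looks like in code, here is a minimal sketch of its offline Python API, assuming a standard vLLM installation. The model name, prompts, and sampling settings are purely illustrative, and this is not the Qualcomm-specific deployment used in our experiment (that setup, using the qaic device and an OpenAI-compatible server, is described in the steps below).

    # Minimal vLLM offline-inference sketch (illustrative only; assumes a standard
    # vLLM installation, not the Qualcomm-specific build used later in this post).
    from vllm import LLM, SamplingParams

    prompts = [
        "Explain what PagedAttention does in one sentence.",
        "Translate 'good morning' into French.",
    ]
    sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

    # vLLM batches the prompts and manages the KV cache with PagedAttention,
    # which is what keeps throughput high under load.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)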

The Qualcomm Cloud AI 100 Ultra, meanwhile, is a dedicated accelerator card built for high-performance AI inference. It provides powerful computing capability, making it highly suitable for running these complex AI models.

Combining the software advantages of vLLM with the powerful performance of Qualcomm hardware is precisely the core objective of this experiment!

Experiment Revealed: How Did We Do It?

To verify the actual performance of vLLM on Qualcomm GPUs, our engineers set up the environment and ran a series of performance tests. The entire process can be summarized in the following key steps:

  1. Prepare Hardware and System Environment: First, we need a device equipped with the Qualcomm Cloud AI 100 Ultra accelerator and with the Ubuntu 22.04 operating system installed. This serves as the basic platform for the experiment.

  2. Install the Qualcomm SDK: Qualcomm provides dedicated Software Development Kits (SDKs) for its AI accelerators, including the Apps SDK and Platform SDK. These SDKs contain drivers, libraries, and development tools, which are crucial for software to fully utilize hardware performance. We downloaded the required SDK versions from the Qualcomm Package Manager.

    (Screenshots: selecting and downloading the Apps SDK and Platform SDK in the Qualcomm Package Manager.)
    After downloading, you will see the two SDK zip files (e.g., aic_apps.Core.1.19.6.0.Linux-AnyCPU.zip and aic_platform.Core.1.19.6.0.Linux-AnyCPU.zip).
    Next, we execute the decompression and installation commands:

    unzip aic_apps.Core.1.19.6.0.Linux-AnyCPU.zip
    unzip aic_platform.Core.1.19.6.0.Linux-AnyCPU.zip
    ./qaic-apps-1.19.6.0/x86_64/deb/install.sh
    ./qaic-platform-sdk-1.19.6.0/x86_64/deb/install.sh
    

    After the installation is complete, restart the system for the settings to take effect. We can use the qaic-util -t 1 command to check if the GPU is working correctly and view its status. If you see output similar to the image below, it means the hardware and basic software environment are ready!

    qaic-util -t 1
    

    (Screenshot: qaic-util -t 1 output showing the detected Qualcomm devices and their status.)

  3. Create the vLLM Docker Environment: For ease of deployment and management, we packaged vLLM and its related dependencies into a Docker container. This step uses tools provided by the Qualcomm SDK to build a Docker Image containing vLLM.

    cd qaic-apps-1.19.6.0/common/tools/docker-build/
    python3.10 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python build_image.py --user_specification_file ./sample_user_specs/user_image_spec_vllm.json --apps_sdk <path_to_apps_sdk_zip_file> --platform_sdk <path_to_platform_sdk_zip_file> --tag 1.19.6.0
    

    After successful creation, you can use the docker images command to see the generated vLLM Docker Image.

    docker images
    

    (Screenshot: docker images output listing the newly built vLLM image.)

  4. Configure Multi-GPU Usage (Disable ACS): If you need to utilize multiple Qualcomm GPUs simultaneously to accelerate inference (which is very important for running larger models or handling more concurrent requests), we need to perform additional configuration to disable ACS (Access Control Services). This ensures that multiple devices can be effectively used in coordination.

    python QAicChangeAcs.py all
    

    (Screenshot: output of QAicChangeAcs.py disabling ACS on all devices.)
    (Reference document: multi device)

  5. Launch the vLLM OpenAI-Compatible Server: Enter the newly built Docker container and launch the OpenAI-compatible API server provided by vLLM. This server lets us call the model for inference through a standard API interface, just as conveniently as using OpenAI's services (a minimal client example is shown after this step list). We specified the model to load (e.g., TinyLlama or DeepSeek-8B), the devices to use (one or more Qualcomm GPUs), and some optimization parameters. First, run the Docker container and enter its bash environment:

    docker run -it --rm --entrypoint bash --name tytest -p 8000:8000 --device=/dev/accel/accel0 [--device=/dev/accel/accel1] [--device=/dev/accel/accel2] [--device=/dev/accel/accel3] -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 [-e MODEL=deepseek-ai/DeepSeek-R1-Distill-Llama-8B] [-e HF_TOKEN=hf_xxxxxxxx] -v /home/adv/.cache/:/root/.cache ty-qaiqaic-x86_64-ubuntu20-py310-py38-release-qaic_platform-qaic_apps-pybase-pytools-vllm:1.19.6.0
    

    Then, start the vLLM Server inside the container:

    source /opt/vllm-env/bin/activate
    python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-model-len 4096 --max-num-seq 16 --max-seq_len-to-capture 128 --device qaic --block-size 32 --quantization mxfp6 --kv-cache-dtype mxint8 --device-group 0,1,2,3
    
  6. Perform Performance Benchmarking: After the Server is started and the model is successfully loaded, we can proceed with stress testing! We used the industry-standard ShareGPT dataset to simulate real-world usage scenarios, testing inference performance under different models and varying numbers of GPUs, with a particular focus on “throughput (Total Token throughput)”. First, download the test dataset:

    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
    

    Then, execute the benchmark script inside the Docker container:

    docker exec -it tytest bash
    cd /opt/qti-aic/integrations/vllm/
    python3 benchmarks/benchmark_serving.py --backend openai --base-url http://127.0.0.1:8000 --dataset-name=sharegpt --dataset-path=./ShareGPT_V3_unfiltered_cleaned_split.json --sharegpt-max-input-len 128 --sharegpt-max-model-len 256 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --seed 12345
    
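
Once the server from step 5 is up, any OpenAI-compatible client can talk to it. Below is a minimal client sketch using the openai Python package (an extra dependency, not installed in the steps above); it assumes the server is reachable on localhost port 8000 as configured in the docker run command, and uses a dummy API key since the server does not require authentication by default.

    # Minimal client sketch for the vLLM OpenAI-compatible server started in step 5.
    # Assumes `pip install openai` and that the server is listening on port 8000.
    from openai import OpenAI

    # The API key is a placeholder; the vLLM server does not check it by default.
    client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)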

Exciting Results: Significant Performance Improvement!

After rigorous testing, we compared the performance of running the TinyLlama (1.1B) and DeepSeek-8B models on the Qualcomm Cloud AI 100 Ultra accelerator using different versions of the Qualcomm SDK (1.18.2.0 vs 1.19.6.0).

Below are the comparison results for throughput (Total Token throughput, tok/s) that we compiled:

Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Device Used: 1 Qualcomm GPU

SDK Version    Total Token Throughput (tok/s)
1.18.2.0       321.19
1.19.6.0       614.35

Wow! Aren’t these numbers surprising? For the TinyLlama model, simply updating to the new SDK version nearly doubled the throughput (roughly a 91% increase)! This means that within the same amount of time, the AI model can process more requests or generate more content, significantly increasing efficiency.


Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Device Used: 1 Qualcomm GPU

SDK Version    Total Token Throughput (tok/s)
1.18.2.0       158.10
1.19.6.0       162.48

For the larger DeepSeek-8B model, the single-GPU improvement with the new SDK is more modest (roughly 3%), but it still shows steady progress.


Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Devices Used: 2 Qualcomm GPUs

SDK Version    Total Token Throughput (tok/s)
1.18.2.0       228.15
1.19.6.0       254.62

Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Devices Used: 4 Qualcomm GPUs

SDK Version    Total Token Throughput (tok/s)
1.18.2.0       225.16
1.19.6.0       339.29

When we used multiple Qualcomm GPUs to run the DeepSeek-8B model, the advantage of the new SDK became even more apparent, especially with 4 GPUs, where throughput increased by over 50%! This reflects the new SDK's optimizations for multi-device operation, which better unleash the hardware's potential.
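
As a quick sanity check on the percentages quoted above, the relative gains can be computed directly from the throughput numbers in the tables. The short sketch below does exactly that; the figures are copied from this post, and nothing new is measured.

    # Relative SDK 1.18.2.0 -> 1.19.6.0 gains, computed from the throughput
    # numbers (tok/s) reported in the tables above.
    results = {
        "TinyLlama-1.1B, 1 device": (321.19, 614.35),
        "DeepSeek-8B, 1 device":    (158.10, 162.48),
        "DeepSeek-8B, 2 devices":   (228.15, 254.62),
        "DeepSeek-8B, 4 devices":   (225.16, 339.29),
    }

    for config, (old_sdk, new_sdk) in results.items():
        gain = (new_sdk / old_sdk - 1) * 100
        print(f"{config}: {gain:+.1f}% with SDK 1.19.6.0")
    # Prints roughly +91%, +3%, +12%, and +51% respectively.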


What Do These Results Mean?

The success of this experiment not only validates the feasibility of vLLM on the Qualcomm Cloud AI 100 Ultra accelerator but, more importantly, clearly demonstrates that through the tight integration and continuous optimization of software and hardware, we can significantly enhance the inference performance of large language models.

For Advantech, this signifies:

  • More Powerful AI Solutions: We can provide customers with hardware platforms that run LLMs faster and more efficiently.
  • Broader Application Scenarios: High-performance LLM inference capabilities will help drive more advanced AI applications in areas such as smart manufacturing, smart healthcare, and smart retail, including real-time voice interaction, intelligent decision support, and automatic content generation.
  • Continuous Technological Leadership: This experiment proves that the Advantech engineering team possesses the ability to deeply research and integrate the latest AI technologies. We are constantly exploring how to provide customers with the best AI inference solutions.

Conclusion and Future Outlook

Through this experiment on LLM inference using vLLM on the Qualcomm Cloud AI 100 Ultra accelerator, we successfully demonstrated significant performance improvements, particularly by updating the SDK version and utilizing multiple GPUs. This further confirms the importance of software optimization in unleashing hardware potential.

Advantech will continue to cultivate this field, explore more optimization techniques, and integrate these high-performance AI inference capabilities into our products and solutions. We believe that through continuous R&D and innovation, Advantech will be able to bring a smarter and more efficient future to our customers!

If you are interested in our AI solutions or would like to learn more technical details, please feel free to contact our AE or sales team! We look forward to unlocking the infinite possibilities of AI with you!

