
Easily Deploy Large Language Models to the Cloud: DeviceOn x NVIDIA LLM-NIM Edge AI Deployment Revealed

Author: Advantech ESS

Preface: How Difficult Is AI Model Deployment, Really?

Imagine if powerful AI models could not only run on the cloud but also be deployed directly to factories, hospitals, transportation hubs, and various other “edge devices”—that’s the magic of “Edge AI”! In reality, however, large language models (LLMs) are massive, the environments are complex, and deployment and management can truly be a headache. This is where Advantech’s DeviceOn management platform, paired with NVIDIA’s LLM-NIM inference service, becomes the best savior for developers and IT professionals!

Background & Technical Overview: What Exactly Are LLM-NIM and DeviceOn?

  • NVIDIA LLM-NIM (Inference Microservice)
    Think of it as a “model-as-a-service” micro container that packages complex inference engines (such as TensorRT-LLM, Triton Server) all together, making deployment as easy as grabbing a drink from the fridge!

  • Advantech DeviceOn
    Advantech’s proprietary cloud device management platform, featuring robust OTA (over-the-air updates), container management, remote monitoring, and more. With just one platform, you can remotely distribute, update, start, and monitor AI models, easily managing hundreds of edge devices.

Market Demand?
Every industry wants to bring AI directly on-site to enable real-time judgment and automated decision-making. From smart manufacturing and intelligent healthcare to traffic monitoring, edge AI has become an unstoppable trend!

Implementation Process & Key Findings: “One-Stop” Model Deployment

Step 1: Basic Startup, Official Models in One Click

As long as you have an NVIDIA GPU (such as the RTX 4000 Ada, 20GB VRAM), Docker, NVIDIA Container Toolkit, and an NGC account, you can pull LLM-NIM’s standard models directly:

export PROJ_DIR=$(pwd)
export NIM_CACHE_DIR="$PROJ_DIR/nim-cache"
mkdir -p "$NIM_CACHE_DIR"
chmod -R a+w "$NIM_CACHE_DIR"
echo "NGC_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" > my.env

Pro Tip
Check the most suitable Profile ID for your hardware before starting to avoid “hardware incompatibility” issues:

docker run -it --rm \
  --gpus all \
  --shm-size=16GB \
  --env-file ./my.env \
  -v "$NIM_CACHE_DIR:/opt/nim/.cache" \
  -u $(id -u) \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:1.15.1 \
  list-model-profiles

With the Profile ID, you can select the optimal settings based on your GPU count.
If VRAM is insufficient, use NIM_MAX_MODEL_LEN to cap the maximum sequence length and avoid out-of-memory errors:

docker run -it --rm \
  --gpus='"device=0"' \
  --shm-size=16GB \
  --env-file ./my.env \
  -v "$NIM_CACHE_DIR:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8001:8000 \
  -e NIM_MAX_MODEL_LEN=7520 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:1.15.1

Step 2: Advanced Usage, Hugging Face Models Supported

Want to use custom or quantized models? No problem!
First, use a Python script to download the model to your local cache:

export CURRENT_DIR=$(pwd)
export NIM_CACHE_DIR="$CURRENT_DIR/model_cache"
export NIM_MODEL_NAME=hf://meta-llama/Llama-3.1-8B-Instruct
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

mkdir -p "$NIM_CACHE_DIR"
chmod -R a+w "$NIM_CACHE_DIR"

python download_hf_model.py \
  --download_base_path $NIM_CACHE_DIR \
  --model_name $NIM_MODEL_NAME \
  --hf_token $HF_TOKEN
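
Before moving on, a quick optional sanity check (the exact directory layout depends on the download script) confirms that the weights actually landed in the cache:

# The cache should now contain several GB of model weights
du -sh "$NIM_CACHE_DIR"
ls "$NIM_CACHE_DIR"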

Then verify whether the model and hardware are a perfect match:

docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME \
  -v "$NIM_CACHE_DIR:/opt/nim/.cache" \
  nvcr.io/nim/nvidia/llm-nim:<IMAGE_TAG> \
  list-model-profiles

From the Profile IDs listed (example output below), choose tp1 (single GPU), tp2 (two GPUs), or tp4 (four GPUs) to match your hardware:

MODEL PROFILES
- Compatible with system and runnable:
  - 19214bd8cb3da329701e81c3aba5a51f966e25178a955b726003a9e051489b5b (vllm-bf16-tp4-pp1-...)
  - 6a7decb8c10747c7d32fb1dc07fd74c83a683771432175d30516e955226e74a5 (vllm-bf16-tp2-pp1-...)
  - 574eb0765118b2087b5fd6c8684a79e682bd03062f80343cfd9e2140ffa962cd (vllm-bf16-tp1-pp1-...)

Start the service:

export NIM_MODEL_PROFILE=<YOUR_PROFILE_ID>

docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME \
  -e NIM_MODEL_PROFILE \
  -v "$NIM_CACHE_DIR:/opt/nim/.cache" \
  -p 8001:8000 \
  nvcr.io/nim/nvidia/llm-nim:<IMAGE_TAG>

Step 3: Service Validation, OpenAI API Compatibility for Ultimate Convenience

Once the service starts successfully, the log will show:

An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
...

Test with cURL:

curl -X 'POST' \
  'http://localhost:8001/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/llama-3.1-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      }
    ],
    "max_tokens": 50
  }'
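
Beyond the chat endpoint, a lightweight readiness probe is handy for scripts and monitoring. Assuming the standard NIM health route /v1/health/ready and the port mapping used above, something like this should respond once the model has finished loading:

# Poll readiness before sending inference traffic (8001 is the host port mapped above)
curl -s http://localhost:8001/v1/health/ready
# The OpenAI-compatible models list shows the exact name to use in the "model" field
curl -s http://localhost:8001/v1/models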

Results & Applications: DeviceOn Makes LLM Deployment Effortless

No More Hassle with Model Distribution: One-Click OTA Delivery

LLM models can range from several GBs to tens of GBs. Downloading them during container startup? Too slow! The best practice is to pre-distribute via DeviceOn OTA:

  1. Download the model and compress it into a ZIP file (see the packaging sketch below).
  2. Upload it to DeviceOn Portal’s OTA management.
  3. Select the target device group and deliver it to the specified path (e.g., /var/lib/deviceon/nim-cache).

Remember to set directory permissions to avoid “Permission denied” errors:

sudo mkdir -p /var/lib/deviceon/nim-cache
sudo chmod -R 777 /var/lib/deviceon/nim-cache
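
As a concrete sketch of step 1 above (the archive name and layout are only an example, not a DeviceOn requirement), packaging the pre-downloaded cache could look like this:

# Package the pre-downloaded model cache for upload to DeviceOn OTA
cd "$NIM_CACHE_DIR"
zip -r ../llama-3.1-8b-nim-cache.zip .
# After OTA delivery, the archive should unpack into the path the container will mount,
# e.g. /var/lib/deviceon/nim-cache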

Evolved Deployment Commands: DeviceOn GUI, Zero Learning Curve

Traditional Docker commands are cumbersome. DeviceOn offers graphical management:

  1. Enter Container Management and click “Create Container”.
  2. Fill in the container name and image (e.g., nvcr.io/nim/meta/llama-3.1-8b-instruct:1.15.1).
  3. Enter startup parameters (you can use env files to set API keys or other sensitive information).
  4. One-click conversion to JSON format, then deploy!

Service Verification & Testing: Deployment Success in Just Two Steps

  1. Check Container Logs
    In the DeviceOn interface, click Logs. If you see An example cURL request: or Uvicorn running on http://0.0.0.0:8000, it’s a success!

  2. Functional Test
    Send an inference request with cURL. If you receive a JSON response, you’re all set:

curl -X 'POST' \
  'http://<Edge_Device_IP>:8001/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/llama-3.1-8b-instruct",
    "messages": [{"role":"user", "content":"Hello, DeviceOn!"}],
    "max_tokens": 50
  }'

FAQs & Troubleshooting Guide

  • OOM/Out of Memory?
    Reduce NIM_MAX_MODEL_LEN, or set NIM_KVCACHE_PERCENT=0.5 to free up GPU memory (a sketch follows after this list).
  • Insufficient Permissions?
    Use the DeviceOn Shell Command chmod -R 777 /var/lib/deviceon/nim-cache.
  • Slow Download?
    Download first in a high-bandwidth environment, then use OTA to distribute to edge devices.
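
For the OOM case, here is a minimal sketch of a more memory-conservative launch, reusing the command and environment variables from Step 1 (the exact values are hardware-dependent):

# Shorter context window plus a smaller KV-cache share for tighter VRAM budgets
docker run -it --rm \
  --gpus='"device=0"' \
  --shm-size=16GB \
  --env-file ./my.env \
  -v "$NIM_CACHE_DIR:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8001:8000 \
  -e NIM_MAX_MODEL_LEN=4096 \
  -e NIM_KVCACHE_PERCENT=0.5 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:1.15.1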

Conclusion & Future Outlook: Advantech Continues to Lead in Edge AI Deployment!

This DeviceOn x NVIDIA LLM-NIM integration demo makes the deployment, distribution, updating, and maintenance of large AI models simple, fast, and secure.
Whether you’re an application engineer (AE), a sales representative, or an enterprise IT professional, you can now deploy AI on-site in the most intuitive way, opening a new chapter in smart industries.

Advantech continues to deepen its focus in edge AI computing and will offer more automated and intelligent tools in the future to help customers meet ever-changing market demands and achieve truly large-scale intelligent deployment!

