
Tencent Cloud EdgeOne

Quick Guide

Last updated: 2026-04-15 11:47:16
This document guides you through the complete process of deploying open-source or custom models and invoking APIs for edge inference on EdgeOne. You will learn how to create inference services, manage service credentials, initiate inference requests, and perform fundamental troubleshooting to help you integrate AI model capabilities for edge inference into your applications. Using the deployment of the Llama-3.2-3B-Instruct large language model as an example, we will demonstrate the end-to-end workflow from environment preparation to model deployment and invocation.
Note:
Edge inference currently supports only Enterprise Edition plans and requires an allowlist application before use. If you need it, please contact your sales representative or use Contact Us.
This guide uses Linux/macOS environments as an example. Windows users should operate in WSL or Docker Desktop environments.

Prerequisites

Before starting, please ensure you have prepared the following items:
| Preparation Item | Description |
|---|---|
| Tencent Cloud account | A real-name verified Tencent Cloud account. |
| Enterprise Edition plan | An active subscription to the EdgeOne Enterprise Edition plan. |
| Docker environment | Docker installed locally (Docker 20.10+ is recommended) for building and pushing images. |
| Tencent Cloud Container Registry (TCR) | TCR activated (Individual Edition is sufficient) for storing custom images. |
| GPU server (optional) | A machine with an NVIDIA GPU, needed only if you want to test the image locally; not required for building. |

Create an inference service and invoke it

The following steps demonstrate how to deploy a model as an online service and invoke it. This process applies both to traditional models (such as OCR and ASR) and to generative models such as large language models (LLMs) and text-to-image models.

Step 1: Log in to the Tencent Cloud console

1. Open your browser, access the Tencent Cloud console, and log in with your Tencent Cloud account.
2. Enter EdgeOne in the search bar at the top of the console, or find EdgeOne in the left navigation, and click it to enter.
3. In the left navigation of the EdgeOne console, click the Service Overview menu, then select Edge Inference to go to the Edge Inference management page.

Step 2: Prepare a custom image (taking Llama-3.2-3B as an example)

Edge inference deploys models through a containerized approach. You need to write a Dockerfile to package the model files, inference framework, and dependencies into a complete Docker image.

2.1 Create a project directory

Create a working directory locally:
mkdir llama3-edge-inference && cd llama3-edge-inference

2.2 Write a Dockerfile

Create a Dockerfile file in the project directory. The following example uses vLLM as the inference framework to download the Llama-3.2-3B-Instruct model from ModelScope:
# ============================================================
# Dockerfile for Llama3.2-3B Model with vLLM 0.6.3.post1
# ============================================================

# Use the official vLLM OpenAI-compatible image as the base image
FROM vllm/vllm-openai:v0.6.3.post1

# Set environment variables to skip HuggingFace Hub online verification (offline mode)
ENV HF_HUB_OFFLINE=1
ENV HF_HOME=/data/models
ENV TRANSFORMERS_CACHE=/data/models

# Install modelscope to download models (using a domestic mirror source to accelerate)
RUN pip3 install --no-cache-dir modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple

# Create a model directory and download the model files
RUN mkdir -p /data/models/LLM-Research/Llama-3.2-3B-Instruct && \
    python3 -c "from modelscope import snapshot_download; snapshot_download('LLM-Research/Llama-3.2-3B-Instruct', local_dir='/data/models/LLM-Research/Llama-3.2-3B-Instruct')"

# Expose the inference service port
EXPOSE 8000

# Startup Parameters Description:
# --host 0.0.0.0 Listen on all network interfaces
# --port 8000 Service listening port
# --model Model file path
# --trust-remote-code Trust remote code (required for some models)
# --dtype half Use half-precision inference to reduce GPU memory usage
# --max-model-len 8192 Maximum context length
CMD ["--host", "0.0.0.0", \
     "--port", "8000", \
     "--model", "/data/models/LLM-Research/Llama-3.2-3B-Instruct", \
     "--trust-remote-code", \
     "--dtype", "half", \
     "--max-model-len", "8192"]
Key Parameters Description:

| Parameter | Description |
|---|---|
| FROM vllm/vllm-openai:v0.6.3.post1 | The base image comes pre-installed with the vLLM inference engine and an OpenAI-compatible API server. |
| HF_HUB_OFFLINE=1 | Disables online model verification so the container can start without public network access. |
| --dtype half | Uses FP16 half-precision inference; the Llama-3.2-3B model requires approximately 6 GB of GPU memory. |
| --max-model-len 8192 | Sets the maximum context window to 8192 tokens. |
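The ~6 GB figure follows from FP16 storing 2 bytes per parameter. Here is a quick back-of-envelope check, assuming a parameter count of roughly 3.2 billion; note that the KV cache and activations consume additional GPU memory beyond the weights:

```python
# Back-of-envelope GPU memory estimate for FP16 (half-precision) weights.
# Assumes ~3.2 billion parameters for Llama-3.2-3B; the KV cache and
# activations need additional memory on top of this figure.
PARAMS = 3.2e9          # approximate parameter count
BYTES_PER_PARAM = 2     # FP16 = 2 bytes per parameter

weight_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"FP16 weights: ~{weight_gib:.1f} GiB")  # ~6.0 GiB
```

This is why the Resource Settings step below recommends a specification with at least 8 GB of GPU memory.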
Note:
Please adjust the Dockerfile based on your actual model and inference framework. If you use another framework (such as SGLang), see that framework's containerization documentation.

Step 3: Build the Docker image

3.1 Build the image

Run the following command in the directory containing the Dockerfile to build the image:
docker build -t llama3-3b-vllm:v1.0 .
The build process will perform the following operations in sequence:
1. Pull the vLLM base image.
2. Install ModelScope dependencies.
3. Download the Llama-3.2-3B-Instruct model files (approximately 6GB, please be patient).
Note:
The initial build may take 10-30 minutes, depending on network speed. It is recommended to execute this in a stable network environment. If the download fails due to network issues, you can rerun the build command, and Docker will resume from the last interrupted step. For more Docker tutorials, see the official Docker documentation.

3.2 Verify the image (Optional)

After the build is complete, you can view the Docker image information:
docker images | grep llama3-3b-vllm
Expected output:
llama3-3b-vllm v1.0 xxxxxxxxxxxx xx minutes ago approximately 15GB
If you have a local GPU environment, run the following command to test whether the image works properly:
docker run --gpus all -p 8000:8000 llama3-3b-vllm:v1.0
Once the service starts successfully (when the log shows Uvicorn running on http://0.0.0.0:8000), execute the test request in another terminal window:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

Step 4: Push the image to Tencent Cloud Container Registry (TCR)

The built image needs to be uploaded to Tencent Cloud Container Registry (TCR) for the edge inference platform to pull and deploy.

4.1 Activate TCR

1. If you have not activated TCR yet, see TCR Quick Start.
2. Create a namespace (such as edge-inference).
3. Create an image repository (such as llama3-3b-vllm) under the namespace.

4.2 Log in to TCR

Run the following command in your local terminal to log in to TCR (replace with your actual instance information):
# Personal Edition TCR Login
docker login ccr.ccs.tencentyun.com --username=<Tencent Cloud account ID>
You will be prompted to enter the password. Please input the image repository password you set in the TCR console.
Note:
If using the Enterprise Edition TCR, the login address format is <instance name>.tencentcloudcr.com, see the TCR console to obtain the specific address.

4.3 Tag and push the image

# Tag the Image (Replace with Your Actual Repository Address)
docker tag llama3-3b-vllm:v1.0 ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0

# Push Image to TCR
docker push ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0
Note:
Due to the large size of the model image (approximately 15GB), the push time is determined by your upstream bandwidth. It is recommended to perform this operation in an environment with sufficient bandwidth.

4.4 Verify successful upload

Log in to the TCR console and confirm that the image has been successfully uploaded in the corresponding image repository. You can see the image tagged v1.0.

Step 5: Create a project

Projects are the primary resources for edge inference, used to organize and manage multiple inference services. A project can contain multiple inference services, and you can bind Tags to projects for access control.
1. On the Edge Inference page of the EdgeOne console, click Create a project.

2. Fill in the project information:
Project Name: enter a project name (such as llm-inference-project).
Bind Plan: select the Enterprise Edition plan you have subscribed to.

3. After the project is created, click the project name to go to the project details page, where you can now create an inference service.

Step 6: Create an inference service

Services are the core units of edge inference. Creating a service means deploying the image you uploaded to edge nodes in container form. The platform will provide you with a publicly accessible service address.
In the created project, click Create Service.

6.1 Basic Settings

| Configuration Item | Description | Example Value |
|---|---|---|
| Service Name | The unique identifier of the service; it cannot be modified after creation. | llama3-3b-service |
| Description | The purpose of the service, up to 60 characters. | Llama-3.2-3B Inference Service |

6.2 Image Settings

| Configuration Item | Description | Example Value |
|---|---|---|
| Image | Select the image uploaded to TCR under your account. | ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0 |
| Startup Command | The command executed when the container starts. If not specified, the ENTRYPOINT/CMD in the image is used. | Leave blank (use the CMD defined in the Dockerfile) |
| Listening Port | The port on which your inference service's HTTP server listens. | 8000 |
| Environment Variable | Runtime environment variable configuration. | Variable name: HF_HOME, variable value: /data/models |
| Request Path | The API path for the client to call the inference service. | /v1/chat/completions |
About Startup Command:
In this example, the Dockerfile has already set the startup parameters via CMD, so this field can be left blank. If you need to override the default parameters, you can enter the complete startup command, for example:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /data/models/LLM-Research/Llama-3.2-3B-Instruct --trust-remote-code --dtype half --max-model-len 8192

6.3 Resource Settings

| Configuration Item | Description |
|---|---|
| Select Resources | Two GPU resource specifications are currently provided: Entry-level and Basic. The specification cannot be changed after selection. |
Note:
The selected instance specification (especially GPU memory) must meet the model's runtime requirements. Llama-3.2-3B requires approximately 6 GB of GPU memory for FP16 inference, so choose a specification with ≥8 GB of GPU memory; otherwise, out-of-memory (OOM) errors may cause the deployment to fail.

6.4 Advanced Settings

| Configuration Item | Description | Recommended Configuration |
|---|---|---|
| Auto Scaling | Auto: scales automatically based on request volume. Manual: a fixed number of instances running persistently, with continuous billing. | Auto is recommended to save costs. |
| Concurrency | Maximum number of concurrent requests per instance. | For LLM inference, 1-5 is recommended, depending on model size and GPU memory. |

6.5 Complete the creation

After completing the above configurations, click Create. The system will begin deploying the service. Please wait patiently until the service status changes from "Deploying" to "Running", indicating that the service is ready.
Note:
The initial deployment requires pulling the image to edge nodes. Due to the large image size, deployment may take 5-15 minutes. Deployment logs can be viewed on the service details page.

Step 7: Create an API Token and obtain service information

After the service is running successfully, you need to create an API Token for authentication and retrieve the service access address.

7.1 Go to Service Details

Click the service name to go to the service details page. In Basic Information, you can view:
Service Status: Should display as "Running".
Access Address: The public network access URL assigned by the platform (such as https://your-service-id.edgeone-infer.com).
Request Sample: a call example generated automatically by the platform.

7.2 Create API Token

All requests for edge inference require authentication via Bearer Token. Therefore, you need to create an API Token first.
1. Click Request Sample in the service details.

2. Click Create API Token. Enter a Token name (such as my-first-Token), and the system will automatically generate an API Token.

3. After creation is complete, the Token will be automatically filled into the request sample. You can also manage all Tokens centrally on the API Token page in the left menu.
Note:
The API Token is an important credential for accessing inference services; keep it secure. By default, the Token is displayed in masked form; click to copy the full content. Never disclose your Token publicly.
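One way to keep the Token out of source code and shell history is to load it from an environment variable at call time. A minimal sketch, where EDGEONE_API_TOKEN is a hypothetical variable name of your own choosing:

```python
import os

def auth_header(env_var="EDGEONE_API_TOKEN"):
    """Build the Authorization header from an environment variable,
    so the API Token never appears in source code or shell history.
    EDGEONE_API_TOKEN is a hypothetical name; use any name you like."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} before calling the service.")
    return {"Authorization": f"Bearer {token}"}
```

Set the variable once per session (for example `export EDGEONE_API_TOKEN=...`), then reuse `auth_header()` in every request.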

Step 8: Call the model service

After the service is successfully deployed and you obtain the API Token, you can call the model via HTTP API. Please replace YOUR_SERVICE_URL and YOUR_BEARER_TOKEN in the following example with your actual information:
curl https://YOUR_SERVICE_URL/v1/chat/completions \
  -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, who are you?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
Expected Response Example:
{
  "id": "cmpl-xxxxxxxx",
  "object": "chat.completion",
  "created": 1739260800,
  "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm Llama, a helpful AI assistant developed by Meta. I'm here to help you with questions, provide information, and assist with various tasks. How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
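The same call can be made from application code. Below is a minimal Python sketch using only the standard library; SERVICE_URL and API_TOKEN are placeholders for the access address and Token you obtained in Step 7:

```python
import json
import urllib.request

SERVICE_URL = "https://YOUR_SERVICE_URL"  # placeholder: your service's access address
API_TOKEN = "YOUR_BEARER_TOKEN"           # placeholder: the API Token created in Step 7

def build_chat_request(user_content, system_content="You are a helpful assistant."):
    """Assemble an OpenAI-style chat completion request for the edge service."""
    body = {
        "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{SERVICE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )

# To actually send the request and print the assistant's reply:
# with urllib.request.urlopen(build_chat_request("Hello, who are you?")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the service exposes an OpenAI-compatible API, any OpenAI-style client library should also work by pointing its base URL at the service address.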

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model path, consistent with the model file path inside the container. |
| messages | array | Yes | List of dialogue messages, each containing a role (system/user/assistant) and content. |
| max_tokens | integer | No | Maximum number of tokens to generate; the default is determined by the model. |
| temperature | float | No | Sampling temperature, ranging from 0 to 2; higher values produce more random output. Default: 1.0. |
| top_p | float | No | Nucleus sampling parameter, ranging from 0 to 1. Default: 1.0. |
| stream | boolean | No | Whether to enable streaming output. Default: false. |
| stop | string/array | No | Stop sequence(s) at which generation ends. |
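When stream is set to true, an OpenAI-compatible server such as vLLM returns Server-Sent Events: one `data:` line per chunk, terminated by `data: [DONE]`. A minimal sketch of decoding those lines, assuming the standard SSE framing:

```python
import json

def parse_stream_line(line: bytes):
    """Decode one Server-Sent Events line from a stream=true response.
    Returns the chunk as a dict, or None for blank keep-alive lines
    and the terminating "[DONE]" sentinel."""
    line = line.strip()
    if not line.startswith(b"data: "):
        return None
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        return None
    return json.loads(payload)

# Streamed chunks carry an incremental "delta" instead of a full message:
chunk = parse_stream_line(b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n')
print(chunk["choices"][0]["delta"]["content"])  # -> Hel
```

Concatenating the `delta.content` fragments across chunks reconstructs the full assistant reply.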

Troubleshooting Common Issues

| Issue | Possible Cause | Solution |
|---|---|---|
| The service status stays "Deploying". | The image is large, so pulling it takes longer. | Wait 15-30 minutes; check that the image was correctly uploaded to TCR. |
| The service status shows "Deployment failed". | Insufficient GPU memory or the image failed to start. | Check that the resource specification meets the model's requirements; view the deployment logs to troubleshoot. |
| The API request returns 401. | The Token is invalid or expired. | Check that the Authorization header is correct and that the Token format is Bearer <token>. |
| The API request returns 502/504. | The service is not ready or the request timed out. | Confirm the service status is "Running"; increase the client timeout as appropriate. |
| The request returns an OOM error. | Insufficient GPU memory. | Reduce the --max-model-len value; select a higher GPU resource specification. |
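Transient 502/504 responses can also be absorbed client-side with a short retry loop. A minimal sketch, assuming only that the gateway errors listed above are safe to retry:

```python
import time
import urllib.error
import urllib.request

TRANSIENT_STATUSES = (502, 504)  # gateway errors from the table above

def should_retry(status, attempt, max_attempts=3):
    """Retry only transient gateway errors, and only while attempts remain."""
    return status in TRANSIENT_STATUSES and attempt < max_attempts - 1

def call_with_retry(request, max_attempts=3, timeout=120):
    """Send an HTTP request, backing off and retrying on 502/504."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if not should_retry(err.code, attempt, max_attempts):
                raise
            time.sleep(2 ** attempt)  # 1 s, 2 s, 4 s ... between attempts
```

401 and OOM errors are deliberately not retried, since retrying cannot fix an invalid Token or an undersized resource specification.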

