| Preparation Item | Description |
| --- | --- |
| Tencent Cloud account | A real-name authenticated Tencent Cloud account. |
| Enterprise Edition plan | A subscription to the EdgeOne Enterprise Edition plan. |
| Docker environment | Docker installed locally (Docker 20.10+ is recommended) for building and pushing images. |
| Tencent Cloud Container Registry (TCR) | An activated Tencent Cloud Container Registry (TCR) instance (Individual Edition is sufficient) for storing custom images. |
| GPU server (optional) | A machine with an NVIDIA GPU, needed only if you want to test the image locally; it is not required for building. |
Create and enter a project directory:

```shell
mkdir llama3-edge-inference && cd llama3-edge-inference
```
Create a `Dockerfile` in the project directory. The following example uses vLLM as the inference framework and downloads the Llama-3.2-3B-Instruct model from ModelScope:

```dockerfile
# ============================================================
# Dockerfile for Llama3.2-3B Model with vLLM 0.6.3.post1
# ============================================================

# Use the official vLLM OpenAI-compatible image as the base image
FROM vllm/vllm-openai:v0.6.3.post1

# Set environment variables to skip HuggingFace Hub online verification (offline mode)
ENV HF_HUB_OFFLINE=1
ENV HF_HOME=/data/models
ENV TRANSFORMERS_CACHE=/data/models

# Install modelscope to download models (using a domestic mirror source to accelerate)
RUN pip3 install --no-cache-dir modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple

# Create a model directory and download the model files
RUN mkdir -p /data/models/LLM-Research/Llama-3.2-3B-Instruct && \
    python3 -c "from modelscope import snapshot_download; snapshot_download('LLM-Research/Llama-3.2-3B-Instruct', local_dir='/data/models/LLM-Research/Llama-3.2-3B-Instruct')"

# Expose the inference service port
EXPOSE 8000

# Startup parameters:
#   --host 0.0.0.0        Listen on all network interfaces
#   --port 8000           Service listening port
#   --model               Model file path
#   --trust-remote-code   Trust remote code (required for some models)
#   --dtype half          Use half-precision inference to reduce GPU memory usage
#   --max-model-len 8192  Maximum context length
CMD ["--host", "0.0.0.0", \
     "--port", "8000", \
     "--model", "/data/models/LLM-Research/Llama-3.2-3B-Instruct", \
     "--trust-remote-code", \
     "--dtype", "half", \
     "--max-model-len", "8192"]
```
| Parameter | Description |
| --- | --- |
| `FROM vllm/vllm-openai:v0.6.3.post1` | The base image comes pre-installed with the vLLM inference engine and an OpenAI-compatible API server. |
| `HF_HUB_OFFLINE=1` | Disables online model verification so the container can start without public network access. |
| `--dtype half` | Uses FP16 half-precision inference; the Llama-3.2-3B model then requires approximately 6 GB of GPU memory. |
| `--max-model-len 8192` | Sets the maximum context window to 8192 tokens. |
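The roughly 6 GB figure for `--dtype half` can be sanity-checked with a quick back-of-envelope calculation. This is a sketch that counts weight memory only; vLLM additionally reserves GPU memory for the KV cache and runtime overhead:

```python
# Rough FP16 weight-memory estimate for Llama-3.2-3B.
# Counts model weights only; KV cache and CUDA overhead come on top.
params = 3.2e9           # approximate parameter count of Llama-3.2-3B
bytes_per_param = 2      # FP16 ("--dtype half") stores 2 bytes per weight
weight_gib = params * bytes_per_param / 1024**3
print(f"Approximate weight memory: {weight_gib:.1f} GiB")  # → about 6.0 GiB
```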
Build the image:

```shell
docker build -t llama3-3b-vllm:v1.0 .
```
Verify that the image was built:

```shell
docker images | grep llama3-3b-vllm
```

Expected output (the image ID and build time will vary):

```
llama3-3b-vllm   v1.0   xxxxxxxxxxxx   xx minutes ago   approximately 15GB
```
Test the image locally (requires an NVIDIA GPU):

```shell
docker run --gpus all -p 8000:8000 llama3-3b-vllm:v1.0
```
After the service has started (the log shows `Uvicorn running on http://0.0.0.0:8000`), execute the test request in another terminal window:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
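The same smoke test can also be scripted. Below is a minimal Python sketch using only the standard library; the URL and model path match the Dockerfile above:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Build the request body for the OpenAI-compatible chat endpoint."""
    return {
        "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,
    }

def chat_completion(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (with the container running): print(chat_completion("Hello!"))
```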
Create a namespace (for example, `edge-inference`) and an image repository (for example, `llama3-3b-vllm`) under the namespace, then log in:

```shell
# Personal Edition TCR login
docker login ccr.ccs.tencentyun.com --username=<Tencent Cloud account ID>
```
For Enterprise Edition TCR, the login address is `<instance name>.tencentcloudcr.com`; see the TCR console to obtain the specific address. Then tag and push the image:

```shell
# Tag the image (replace with your actual repository address)
docker tag llama3-3b-vllm:v1.0 ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0

# Push the image to TCR
docker push ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0
```
After the push completes, confirm in the TCR console that the repository contains the image with tag `v1.0`.
Create a project (for example, `llm-inference-project`).

| Configuration Item | Description | Example Value |
| --- | --- | --- |
| Service Name | The unique identifier of the service; it cannot be modified after creation. | llama3-3b-service |
| Description | The purpose of the service, up to 60 characters. | Llama-3.2-3B Inference Service |
| Configuration Item | Description | Example Value |
| --- | --- | --- |
| Image | Select the image that has been uploaded to TCR under your account. | ccr.ccs.tencentyun.com/edge-inference/llama3-3b-vllm:v1.0 |
| Startup Command | The command executed when the container starts. If not specified, the ENTRYPOINT/CMD in the image is used. | Leave blank (use the CMD defined in the Dockerfile) |
| Listening Port | The port on which your inference service's HTTP server listens. | 8000 |
| Environment Variables | Runtime environment variable configuration. | Variable name: HF_HOME, variable value: /data/models |
| Request Path | The API path the client calls on the inference service. | /v1/chat/completions |
The image's Dockerfile already defines the startup command via `CMD`, so this field can be left blank. If you need to override the default parameters, enter the complete startup command, for example:

```shell
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /data/models/LLM-Research/Llama-3.2-3B-Instruct --trust-remote-code --dtype half --max-model-len 8192
```
| Configuration Item | Description |
| --- | --- |
| Select Resources | Two GPU resource specifications are currently provided: Entry-level and Basic. The specification cannot be changed after selection. |
| Configuration Item | Description | Recommended Configuration |
| --- | --- | --- |
| Auto Scaling | Auto: scales automatically based on request volume. Manual: a fixed number of instances runs persistently with continuous billing. | Select Auto to save costs. |
| Concurrency | Maximum number of concurrent requests per instance. | For LLM inference, set it to 1-5 depending on the model size and GPU memory. |
After the deployment succeeds, the system generates a service access address (for example, `https://your-service-id.edgeone-infer.com`).
Enter a Token name (for example, `my-first-Token`), and the system will automatically generate an API Token.
Replace `YOUR_SERVICE_URL` and `YOUR_BEARER_TOKEN` in the following example with your actual information:

```shell
curl https://YOUR_SERVICE_URL/v1/chat/completions \
  -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, who are you?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
A successful request returns a response similar to the following:

```json
{
  "id": "cmpl-xxxxxxxx",
  "object": "chat.completion",
  "created": 1739260800,
  "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm Llama, a helpful AI assistant developed by Meta. I'm here to help you with questions, provide information, and assist with various tasks. How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
```
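Client code typically only needs the reply text and token usage. A short Python sketch extracting them from a response of this shape (the sample content below is abbreviated):

```python
import json

# A response in the shape returned by the OpenAI-compatible API above.
raw = """{
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 42, "total_tokens": 67}
}"""

resp = json.loads(raw)
reply = resp["choices"][0]["message"]["content"]
used = resp["usage"]["total_tokens"]
print(reply)  # → Hello! How can I help you today?
print(used)   # → 67
```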
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model path, consistent with the model file path inside the container. |
| messages | array | Yes | List of dialogue messages, each containing a role (system/user/assistant) and content. |
| max_tokens | integer | No | Maximum number of tokens to generate; by default determined by the model. |
| temperature | float | No | Sampling temperature, ranging from 0 to 2. Higher values make the output more random. Default: 1.0. |
| top_p | float | No | Nucleus sampling parameter, ranging from 0 to 1. Default: 1.0. |
| stream | boolean | No | Whether to enable streaming output. Default: false. |
| stop | string/array | No | Stop sequence(s); generation halts when one is produced. |
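With `stream: true`, the service returns the reply as Server-Sent Events: one `data:` line per chunk, terminated by `data: [DONE]`. The following Python sketch (standard library only) consumes such a stream, assuming the OpenAI streaming chunk format that vLLM implements:

```python
import json
import urllib.request

def parse_sse_chunk(line: str):
    """Return the content delta carried by one SSE line, or None."""
    if not line.startswith("data: "):
        return None                      # comments and blank keep-alive lines
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None                      # end-of-stream marker
    delta = json.loads(data)["choices"][0]["delta"]
    return delta.get("content")          # absent in the role-only first chunk

def stream_chat(prompt: str, base_url: str = "http://localhost:8000"):
    """Yield content fragments from a streaming chat completion."""
    payload = {
        "model": "/data/models/LLM-Research/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            chunk = parse_sse_chunk(raw.decode("utf-8").rstrip("\n"))
            if chunk is not None:
                yield chunk

# Example: for piece in stream_chat("Hello!"): print(piece, end="", flush=True)
```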
| Issue | Possible Cause | Solution |
| --- | --- | --- |
| The service status keeps displaying "Deploying". | The image is large, so pulling it takes longer. | Wait 15-30 minutes; check that the image has been correctly uploaded to TCR. |
| The service status displays "Deployment failed". | Insufficient GPU memory or the image failed to start. | Check whether the resource specification meets the model's requirements; inspect the deployment logs to troubleshoot errors. |
| The API request returns 401. | The Token is invalid or expired. | Check that the Authorization header is correct; confirm the Token format is `Bearer <token>`. |
| The API request returns 502/504. | The service is not ready or the request timed out. | Confirm the service status is "Running"; increase the client timeout appropriately. |
| The request returns an OOM error. | Insufficient GPU memory. | Reduce the `--max-model-len` value; select a higher GPU resource specification. |
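For transient 502/504 errors while an auto-scaled service spins up from zero, a client-side retry with a generous timeout usually suffices. A sketch (the retry count and backoff delays here are illustrative, not prescribed by the platform):

```python
import time
import urllib.error
import urllib.request

def urlopen_with_retry(req, retries: int = 4, timeout: float = 120.0):
    """Open a request, retrying on 502/504 with exponential backoff."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(req, timeout=timeout)
        except urllib.error.HTTPError as err:
            if err.code not in (502, 504) or attempt == retries - 1:
                raise                    # other errors (401, 403, ...) surface immediately
            time.sleep(2 ** attempt)     # back off 1s, 2s, 4s, ...
```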