

Using `initContainers` to download model files automatically makes Pod startup slow, so it is recommended to mount the AI model from shared storage instead: first download the model to shared storage with a Job, then mount that volume into the Pod that runs the model. Subsequent Pod startups can then skip the download step. The model is still loaded over the network from shared storage, but with a high-performance storage type (such as Turbo), this remains fast.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-model
  labels:
    app: ai-model
spec:
  storageClassName: cfs-ai
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
The `storageClassName` field references the CFS StorageClass. Create a PVC with the same class for Open WebUI's data as well:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webui
  labels:
    app: webui
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: cfs-ai
  resources:
    requests:
      storage: 100Gi
```

Use the `LLM_MODEL` environment variable to switch which large language model is downloaded; the `USE_MODELSCOPE` environment variable controls whether the model is pulled from ModelScope. Download Job for vLLM:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-download-model
  labels:
    app: vllm-download-model
spec:
  template:
    metadata:
      name: vllm-download-model
      labels:
        app: vllm-download-model
      annotations:
        # On super nodes the default system disk is only 20Gi and the extracted vllm
        # image would fill it up; this annotation customizes the system disk size
        # (capacity above 20Gi is billed).
        eks.tke.cloud.tencent.com/root-cbs-size: '100'
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        - name: USE_MODELSCOPE
          value: "1"
        command:
        - bash
        - -c
        - |
          set -ex
          if [[ "$USE_MODELSCOPE" == "1" ]]; then
            exec modelscope download --local_dir=/data/$LLM_MODEL --model="$LLM_MODEL"
          else
            exec huggingface-cli download --local-dir=/data/$LLM_MODEL $LLM_MODEL
          fi
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
Download Job for SGLang:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sglang-download-model
  labels:
    app: sglang-download-model
spec:
  template:
    metadata:
      name: sglang-download-model
      labels:
        app: sglang-download-model
      annotations:
        # On super nodes the default system disk is only 20Gi and the extracted sglang
        # image would fill it up; this annotation customizes the system disk size
        # (capacity above 20Gi is billed).
        eks.tke.cloud.tencent.com/root-cbs-size: '100'
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: USE_MODELSCOPE
          value: "1"
        command:
        - bash
        - -c
        - |
          set -ex
          if [[ "$USE_MODELSCOPE" == "1" ]]; then
            exec modelscope download --local_dir=/data/$LLM_MODEL --model="$LLM_MODEL"
          else
            exec huggingface-cli download --local-dir=/data/$LLM_MODEL $LLM_MODEL
          fi
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
Download Job for Ollama:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-download-model
  labels:
    app: ollama-download-model
spec:
  template:
    metadata:
      name: ollama-download-model
      labels:
        app: ollama-download-model
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        env:
        - name: LLM_MODEL
          value: deepseek-r1:7b
        command:
        - bash
        - -c
        - |
          set -ex
          ollama serve &
          sleep 5 # wait a few seconds for ollama to start
          exec ollama pull $LLM_MODEL
        volumeMounts:
        - name: data
          # Ollama stores model data under /root/.ollama; mount the CFS PVC at this path.
          mountPath: /root/.ollama
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
Deploy vLLM on regular GPU nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  replicas: 1
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: expandable_segments:True
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        command:
        - bash
        - -c
        - |
          vllm serve /data/$LLM_MODEL \
            --served-model-name $LLM_MODEL \
            --host 0.0.0.0 \
            --port 8000 \
            --trust-remote-code \
            --enable-chunked-prefill \
            --max_num_batched_tokens 1024 \
            --max_model_len 1024 \
            --enforce-eager \
            --tensor-parallel-size 1
        securityContext:
          runAsNonRoot: false
        ports:
        - containerPort: 8000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-api
spec:
  selector:
    app: vllm
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 8000
    targetPort: 8000
```
On super nodes, add annotations to select the GPU model and enlarge the system disk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  replicas: 1
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100 # specify the GPU card model
        # On super nodes the default system disk is only 20Gi and the extracted vllm
        # image would fill it up; this annotation customizes the system disk size
        # (capacity above 20Gi is billed).
        eks.tke.cloud.tencent.com/root-cbs-size: '100'
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: expandable_segments:True
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        command:
        - bash
        - -c
        - |
          vllm serve /data/$LLM_MODEL \
            --served-model-name $LLM_MODEL \
            --host 0.0.0.0 \
            --port 8000 \
            --trust-remote-code \
            --enable-chunked-prefill \
            --max_num_batched_tokens 1024 \
            --max_model_len 1024 \
            --enforce-eager \
            --tensor-parallel-size 1
        securityContext:
          runAsNonRoot: false
        ports:
        - containerPort: 8000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-api
spec:
  selector:
    app: vllm
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 8000
    targetPort: 8000
```
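Once the Deployment and Service are up, other Pods in the cluster can call vLLM's OpenAI-compatible API at `http://vllm-api:8000/v1`. A minimal client sketch using only the Python standard library (the in-cluster URL and model name come from the manifests above; to run it from outside the cluster, port-forward the Service first):

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style /chat/completions request for the vLLM Service."""
    body = {
        "model": model,  # must match --served-model-name in the Deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://vllm-api:8000/v1",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "Hello!",
)
# Inside the cluster: resp = request.urlopen(req); print(json.load(resp))
```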
The `--served-model-name` parameter specifies the model name, which should match the name used in the download Job; the model itself is loaded from the `/data` directory. Deploy SGLang on regular GPU nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang
  labels:
    app: sglang
spec:
  selector:
    matchLabels:
      app: sglang
  replicas: 1
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        command:
        - bash
        - -c
        - |
          set -x
          exec python3 -m sglang.launch_server \
            --host 0.0.0.0 \
            --port 30000 \
            --model-path /data/$LLM_MODEL
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 30000
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 40Gi
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: sglang
spec:
  selector:
    app: sglang
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 30000
    targetPort: 30000
```
On super nodes, add annotations to select the GPU model and enlarge the system disk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang
  labels:
    app: sglang
spec:
  selector:
    matchLabels:
      app: sglang
  replicas: 1
  template:
    metadata:
      labels:
        app: sglang
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100 # specify the GPU card model
        # On super nodes the default system disk is only 20Gi and the extracted sglang
        # image would fill it up; this annotation customizes the system disk size
        # (capacity above 20Gi is billed).
        eks.tke.cloud.tencent.com/root-cbs-size: '100'
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        command:
        - bash
        - -c
        - |
          set -x
          exec python3 -m sglang.launch_server \
            --host 0.0.0.0 \
            --port 30000 \
            --model-path /data/$LLM_MODEL
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 30000
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 40Gi
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: sglang
spec:
  selector:
    app: sglang
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 30000
    targetPort: 30000
```
The `LLM_MODEL` environment variable specifies the model name, which should match the name used in the download Job; the model itself is loaded from the `/data` directory. Deploy Ollama on regular GPU nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  replicas: 1
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: IfNotPresent
        command: ["ollama", "serve"]
        env:
        - name: OLLAMA_HOST
          value: ":11434"
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: 4000m
            memory: 4Gi
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 11434
          name: ollama
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  type: ClusterIP
  ports:
  - name: server
    protocol: TCP
    port: 11434
    targetPort: 11434
```
On super nodes, add an annotation to select the GPU model:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  replicas: 1
  template:
    metadata:
      labels:
        app: ollama
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: IfNotPresent
        command: ["ollama", "serve"]
        env:
        - name: OLLAMA_HOST
          value: ":11434"
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: 4000m
            memory: 4Gi
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 11434
          name: ollama
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  type: ClusterIP
  ports:
  - name: server
    protocol: TCP
    port: 11434
    targetPort: 11434
```
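Unlike vLLM and SGLang, Ollama exposes its own native API on port 11434. A minimal stdlib sketch of a request against the in-cluster `ollama` Service (`/api/chat` with `stream: false` returns a single JSON response; the Service name and model tag come from the manifests above):

```python
import json
from urllib import request

def build_ollama_chat(base_url: str, model: str, prompt: str) -> request.Request:
    """Build a request against Ollama's native /api/chat endpoint."""
    body = {
        "model": model,  # must match the model pulled by the download Job
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # single JSON response instead of a stream
    }
    return request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ollama_chat("http://ollama:11434", "deepseek-r1:7b", "Hello!")
# Inside the cluster: resp = request.urlopen(req); print(json.load(resp))
```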
Notes on the Ollama manifests:

- Ollama stores its model data under the `/root/.ollama` directory, so the CFS PVC that already contains the downloaded model must be mounted at that path.
- By default Ollama listens only on 127.0.0.1; setting the `OLLAMA_HOST` environment variable forces it to expose port 11434 externally.
- Declare `nvidia.com/gpu` resources so the Pod is scheduled onto a GPU machine type and allocated a GPU card.
- On super nodes, the `eks.tke.cloud.tencent.com/gpu-type` annotation specifies the GPU type; options include V100, T4, A10\*PNV4 and A10\*GNV4. See the GPU specifications documentation for details.

To scale the inference service automatically based on GPU utilization, create an HPA targeting the Deployment (here vllm):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  minReplicas: 1
  maxReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  metrics: # more GPU metrics: https://www.tencentcloud.com/document/product/457/38929?from_cn_redirect=1#gpu
  - pods:
      metric:
        name: k8s_pod_rate_gpu_used_request # GPU utilization (relative to Request)
      target:
        averageValue: "80"
        type: AverageValue
    type: Pods
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 300
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      - periodSeconds: 15
        type: Pods
        value: 4
      selectPolicy: Max
      stabilizationWindowSeconds: 0
```
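The core of the `autoscaling/v2` algorithm is `desiredReplicas = ceil(currentReplicas * currentMetric / target)`, clamped to `[minReplicas, maxReplicas]`. A rough sketch of how the 80% GPU-utilization target above plays out (ignoring the `behavior` policies and stabilization windows, which further damp these transitions):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Standard HPA formula: ceil(current * metric / target), clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# With the manifest above (target averageValue 80, minReplicas 1, maxReplicas 2):
print(hpa_desired_replicas(1, 95, 80, 1, 2))  # GPU at 95% of request -> 2 replicas
print(hpa_desired_replicas(2, 30, 80, 1, 2))  # utilization drops -> back to 1
```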
To disable scale-down entirely, set `selectPolicy` to `Disabled`:

```yaml
behavior:
  scaleDown:
    selectPolicy: Disabled
```


Deploy Open WebUI with vLLM as the backend:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # mirror image on Docker Hub, kept in sync automatically; safe to use
        env:
        - name: OPENAI_API_BASE_URL
          value: http://vllm-api:8000/v1 # vllm address
        - name: ENABLE_OLLAMA_API # disable the Ollama API, keep only the OpenAI API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
With SGLang as the backend:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # mirror image on Docker Hub, kept in sync automatically; safe to use
        env:
        - name: OPENAI_API_BASE_URL
          value: http://sglang:30000/v1 # sglang address
        - name: ENABLE_OLLAMA_API # disable the Ollama API, keep only the OpenAI API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
With Ollama as the backend:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # mirror image on Docker Hub, kept in sync automatically; safe to use
        env:
        - name: OLLAMA_BASE_URL
          value: http://ollama:11434 # ollama address
        - name: ENABLE_OPENAI_API # disable the OpenAI API, keep only the Ollama API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
Open WebUI persists its data (accounts and passwords, chat history and so on) under the `/app/backend/data` directory, so this guide mounts the PVC at that path. To access the UI, you can expose the service with the `kubectl port-forward` command:

```bash
kubectl port-forward service/webui 8080:8080
```
Then open http://127.0.0.1:8080 in your browser. Alternatively, expose Open WebUI through an HTTPRoute:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    namespace: envoy-gateway-system
    name: ai-gateway
  hostnames:
  - "ai.your.domain"
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: webui
      port: 8080
```
- `parentRefs` references the previously defined Gateway (typically one Gateway corresponds to one CLB).
- Replace `hostnames` with your own domain, and make sure the domain resolves to the CLB address of the Gateway.
- `backendRefs` points to the OpenWebUI Service.

You can also use an Ingress instead:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webui
spec:
  rules:
  - host: "ai.your.domain"
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webui
            port:
              number: 8080
```
The `host` field should be your own custom domain; make sure it resolves to the CLB address of the Ingress. `backend.service` must point to the OpenWebUI Service.