
Introduction and Samples of Model Inference Files
Last updated: 2025-05-09 17:46:58

Introduction to TI-ONE Built-In Inference Framework tiinfer

tiinfer is an inference framework used by the Tencent Cloud TI-ONE model service. Based on the basic capabilities provided by mosec, tiinfer provides the following features:
- High-performance HTTP service.
- Flexible orchestration of the processing pipeline.

Installation

tiinfer supports GNU/Linux systems and requires Python 3.8 or later. Use the following command to install the PyPI package:
pip install -i https://mirrors.cloud.tencent.com/pypi/simple tiinfer

HelloWorld!

The following code shows the simplest HelloWorld example:
from typing import Dict
import mosec
import tiinfer

### Log Code ###
# Add the following code to print the request log recorded by the framework to the terminal.
import logging
import sys
logger_formatter = "%(asctime)s %(levelname)s    %(module)s:%(lineno)d    %(message)s"
logging.basicConfig(stream=sys.stdout, format=logger_formatter, level=logging.DEBUG)
### Log Code ###

# tiinfer supports the native mosec.Worker.
class HelloWorld(mosec.Worker):
    def forward(self, req: Dict) -> Dict:
        return {"hello": f"world. raw req is {req}"}

# Launch two processes to process requests simultaneously.
tiinfer.append_worker(HelloWorld, num=2)
Save the above code as the model_service.py file. Execute the following command to start the inference service:
TI_MODEL_DIR=$(pwd) python3 -m tiinfer --timeout 30000
The value of the configuration item timeout is the web service timeout in milliseconds. After the service starts, an HTTP server listens on port 8501, and the request path is /v1/models/m:predict. Access it with the following command:
> curl -X POST -d '{"key": "values"}' http://127.0.0.1:8501/v1/models/m:predict
{
"hello": "world. raw is {'key': 'values'}"
}

Architecture of tiinfer

To reduce the difficulty of model deployment, the tiinfer image provided by the platform already packages many popular inference engines. It reads the model_service.py file provided by the customer and automatically starts the HTTP service. The specific service startup process is as follows (a sample package layout is sketched after the list):
1. Use pip to install the Python dependencies listed in the requirements.txt file.
2. Read model_service.py and instantiate several processes.
3. Start the HTTP service.
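
For reference, a minimal model package matching this startup flow might look like the following (a sketch; the model/ subdirectory and the resNet50.pt file name are illustrative and taken from the classification example later in this document):
<model package root>
├── model_service.py   # inference script read by the tiinfer image
├── requirements.txt   # optional extra Python dependencies installed via pip
└── model/
    └── resNet50.pt    # model weights loaded inside model_service.py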

Framework Built-in Environment Variables

The framework operates in single-process mode by default. Users can adjust the framework's built-in environment variables as needed to enable multi-process mode and make full use of resources; an example launch command follows the notes below.
Environment Variable | Description | Default Value
TI_MODEL_DIR | Model path | /data/model/
TI_PREPROCESS_NUMS | Number of pre-processing processes | 0
TI_INFERENCE_NUMS | Number of inference processes | 1
TI_POSTPROCESS_NUMS | Number of post-processing processes | 0
TI_INFERENCE_MAX_BATCH_SIZE | Maximum inference batch size | 1
Notes:
1. When TI_PREPROCESS_NUMS==0 and TI_POSTPROCESS_NUMS==0, the preprocess, predict, and postprocess functions in the model_service.py file are executed in one process.
2. Otherwise, the preprocess, predict, and postprocess functions in the model_service.py file are executed in separate processes.
3. The load function and the predict function run in the same process.
4. Adjust the value of TI_INFERENCE_MAX_BATCH_SIZE carefully to avoid triggering a GPU out-of-memory (OOM) error.
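
For example, when launching the service manually, multi-process mode can be enabled by setting these variables on the command line (a sketch; the variable names are those in the table above, and the actual process counts should be tuned to your model and hardware):
TI_MODEL_DIR=$(pwd) TI_PREPROCESS_NUMS=2 TI_INFERENCE_NUMS=1 TI_POSTPROCESS_NUMS=1 python3 -m tiinfer --timeout 30000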

Customizing Inference Worker

Each inference worker needs to inherit mosec.Worker to define the processing logic:
1. Optional: Override the __init__ function to perform initialization work; only override it when necessary. Note that you must first call super().__init__() to complete the parent class initialization.
2. Required: Customize the forward function to provide processing capabilities. Generally, data processing work is completed within the forward function.
class Worker:
    def __init__(self): ...
    def forward(self, data): ...
After defining the inference worker, you need to call the tiinfer.append_worker() function for orchestration.
def append_worker(
    worker: Type[mosec.Worker],
    num: int = 1,
    start_method: str = "spawn",
    env: Union[None, List[Dict[str, str]]] = None,
) -> None:
    """
    worker: a processing worker class that inherits mosec.Worker and implements the forward method.
    num: number of processes for parallel computing (≥ 1).
    start_method: process start method ("spawn" or "fork").
    env: environment variables to set before each worker process starts.
    """
During a complete inference process, the input generally needs to be pre-processed before it is fed to the model, and the inference result needs to be post-processed before it is returned to the caller. Inference usually runs on the GPU, while pre-processing and post-processing often involve CPU computation or even some I/O. If pre-processing, post-processing, and inference are handled in the same process, the number of processes that can be started is mainly limited by the GPU's memory and computing power. Separating pre-processing and post-processing from the inference process therefore makes full use of the CPU's processing capability. See the following code snippet:
import tiinfer
from mosec import Worker
from typing import Dict, Any


class MyPreprocess(Worker):
    def forward(self, data: Dict) -> Any:
        # The input is a Dict converted from JSON; perform the necessary pre-processing here.
        ...


class MyPredict(Worker):
    def __init__(self):
        super().__init__()
        # Read and load the model.

    def forward(self, data: Any) -> Any:
        # The input is the pre-processing result; call the model to obtain the inference result.
        ...


class MyPostprocess(Worker):
    def forward(self, data: Any) -> Dict:
        # The input is the inference result; post-process it into a Dict, which is returned to the caller as JSON.
        ...


# Orchestrated processing pipeline: 1 pre-processing worker -> 2 inference workers -> 1 post-processing worker
tiinfer.append_worker(MyPreprocess, 1)
tiinfer.append_worker(MyPredict, 2)
tiinfer.append_worker(MyPostprocess, 1)
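
The env parameter of append_worker can additionally set per-process environment variables before the worker processes start. As a hedged sketch (GpuWorker is a hypothetical worker used only for this illustration, and the assumption that each dict in the list maps to one of the num processes follows mosec's behavior, so verify it against your tiinfer/mosec version), two inference processes could be pinned to different GPUs like this:
from typing import Any

import mosec
import tiinfer


class GpuWorker(mosec.Worker):
    # Hypothetical worker used only to illustrate the env parameter.
    def forward(self, data: Any) -> Any:
        return data


# Assumption: each dict in env is applied to one of the `num` worker processes.
tiinfer.append_worker(
    GpuWorker,
    num=2,
    env=[
        {"CUDA_VISIBLE_DEVICES": "0"},  # first process sees GPU 0
        {"CUDA_VISIBLE_DEVICES": "1"},  # second process sees GPU 1
    ],
)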

Overview of model_service.py

Requirements for model_service.py

Refer to the relevant sections in the introduction to tiinfer above.

PyTorch Implementation Example of a Classification Model Inference Script


import logging
import os
import time
from typing import Dict, List
from urllib.request import urlretrieve

import cv2  # type: ignore
import numpy as np  # type: ignore

import torch  # type: ignore

import tiinfer
import tiinfer.utils
import mosec


### Log Code ###
# Add the following code to print the request log recorded by the framework to the terminal.
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    "%(asctime)s - %(process)d - %(levelname)s - %(filename)s:%(lineno)s - %(message)s"
)
sh = logging.StreamHandler()
sh.setFormatter(formatter)
logger.addHandler(sh)
### Log Code ###

# The pre-processing procedure decodes the input Base64-encoded string and performs some scaling operations as required by the model.
class Pre(mosec.Worker):
    def forward(self, req: Dict) -> cv2.Mat:
        # The pre-processed input data is a Dict converted from JSON.
        img_base64_bytes = req["image"]
        img = tiinfer.utils.image_to_cv2_mat(img_base64_bytes)
        # bgr -> rgb
        img = img[:, :, ::-1]
        # Perform some pre-processing on the image.
        img = cv2.resize(img, (256, 256))
        crop_img = (
            img[16 : 16 + 224, 16 : 16 + 224].astype(np.float32) / 255
        )  # center crop
        crop_img -= [0.485, 0.456, 0.406]
        crop_img /= [0.229, 0.224, 0.225]
        crop_img = np.transpose(crop_img, (2, 0, 1))
        return crop_img

# Load the model, perform inference on the pre-processed results, convert the results into the final format, and then pass them to the caller.
class Infer(mosec.Worker):
    def __init__(self) -> None:
        super().__init__()
        # Retrieve the current directory and load the model as needed.
        self.root_path = tiinfer.TI_MODEL_DIR
        # Preferentially use a GPU.
        self.device = (
            torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        )
        ### Start Loading Non-accelerated Models ###
        # The model file is under the model directory.
        model_file = os.path.join(self.root_path, "model/resNet50.pt")
        ### End Loading Non-accelerated Models ###
        ### Start Loading Accelerated Models ###
        # # Additionally import tiacc_inference for accelerated models.
        # import tiacc_inference
        # model_file = os.path.join(self.root_path, "model/tiacc.pt")
        ### End Loading Accelerated Models ###
        # Load the model onto the selected device.
        self.model = torch.jit.load(model_file, map_location=self.device)
        self.model.eval()

        # Classification needs the category names corresponding to the predicted label IDs.
        self.categories = load_categories()

    def forward(self, img: cv2.Mat) -> Dict:
        with torch.no_grad():
            batch = torch.stack(
                [torch.tensor(arr, device=self.device) for arr in [img]]
            )
            pred_results = self.model(batch)
            prob = torch.nn.functional.softmax(pred_results, dim=1)
            top1_prob, top1_catid = torch.topk(prob, 1)
            return [
                {
                    "confidence": top1_prob[i].tolist()[0],
                    "pred": self.categories[top1_catid[i].tolist()[0]],
                }
                for i in range(top1_prob.size(0))
            ][0]


# Read the category information corresponding to the tag ID from the tag file.
def load_categories() -> List[str]:
    logging.info("loading categories file...")
    local_filename = "imagenet_classes.txt"
    if not os.path.exists(local_filename):
        # Download the label file to the expected local path if it is missing.
        urlretrieve(
            "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt",
            local_filename,
        )
    with open(local_filename, encoding="utf8") as file:
        return list(map(lambda x: x.strip(), file.readlines()))

# Orchestrated processing pipeline: 2 pre-processing workers -> 1 inference worker.
tiinfer.append_worker(Pre, 2)
tiinfer.append_worker(Infer, 1)
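
For reference, a minimal client sketch for calling this classification service (assumptions: the service runs locally on port 8501 with the default path /v1/models/m:predict, test.jpg is a placeholder image path, and the image field carries a Base64-encoded string as expected by the Pre worker above):
import base64
import json
from urllib.request import Request, urlopen

# Read a local test image and Base64-encode it (test.jpg is a placeholder path).
with open("test.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

req = Request(
    "http://127.0.0.1:8501/v1/models/m:predict",
    data=json.dumps({"image": img_b64}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    # Expected response shape: {"confidence": ..., "pred": ...}
    print(json.loads(resp.read()))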

Downloading Demos

The platform provides demos for various models. See the table below for the download addresses.
Format | Scenario | Download Address
TorchScript | Classification |
TorchScript | Detection |
TorchScript | NLP |
TorchScript | OCR |
Detectron2 | Detection |
MMDetection | Detection |
HuggingFace | NLP |
SavedModel | NLP |
SavedModel | Recommendation |
FrozenGraph | NLP |
ONNX | Detection |

Introduction to TI-ACC Inference Acceleration Function

tiacc_inference.load()

Models of Detectron2 or MMDetection whose format has been optimized in TI-ONE must be loaded with the tiacc_inference.load() function. Other formats do not need the load method from tiacc_inference; just use the native load method.
MMDetection: Original example:
from mmdet.apis import init_detector
model = init_detector(config, checkpoint, device=device)
Usage example after optimization:
import tiacc_inference
model = tiacc_inference.load('tiacc.pt') # tiacc.pt is the new model generated after model optimization.
Detectron2 (for PyTorch models exported by Detectron2): Original example:
import torch
model = torch.load(checkpoint) # .pth model file
Usage example after optimization:
import tiacc_inference
model = tiacc_inference.load('tiacc.pt') # tiacc.pt is the new model generated after model optimization.
Detectron2 (for models constructed by Detectron2.modeling.build_model): Original example:
from detectron2.config import get_cfg
from detectron2.modeling import build_model
cfg = get_cfg()
cfg.MODEL.DEVICE = device
cfg.MODEL.WEIGHTS = checkpoint
model = build_model(cfg)
Usage example after optimization:
import tiacc_inference
model = tiacc_inference.load('tiacc.pt') # tiacc.pt is the new model generated after model optimization.
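
As a sketch of how the optimized model could be wired into a tiinfer inference worker like the Infer class above (assumptions: tiacc.pt sits under a model/ subdirectory of the model path, and the object returned by tiacc_inference.load() is used the same way as the original model):
import os

import mosec
import tiacc_inference
import tiinfer


class AcceleratedInfer(mosec.Worker):
    def __init__(self) -> None:
        super().__init__()
        # tiacc.pt is the optimized model generated by TI-ACC; the path is an assumption.
        model_file = os.path.join(tiinfer.TI_MODEL_DIR, "model/tiacc.pt")
        self.model = tiacc_inference.load(model_file)

    def forward(self, data):
        # Call the optimized model exactly as you would call the original model.
        ...


tiinfer.append_worker(AcceleratedInfer, 1)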
