| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
RefVideoUrl | string | Yes | Template video URL. 1. Supported muxing formats: mp4, mov, avi. Note: the video itself must be encoded as H.264; if it is not, convert it with the method below (see encoding format conversion). 2. File size: up to 5 GB. 3. Video resolution: each dimension must be between 360 and 4096 pixels. 4. Video duration: 1-600 seconds supported; 10-120 seconds recommended (when ConcurrencyType is Exclusive, videos up to 20 minutes are supported). 5. Ensure adequate download speed; otherwise the download may fail or the real-time rate of video production may drop. Requirements for the input video material: 1. Face visuals: a real person is required (a cartoon character is acceptable only if its facial features have human-like proportions). While speaking, the face should face the camera, with horizontal rotation of no more than 45 degrees and pitch of no more than 15 degrees. Avoid facial occlusion and keep facial lighting stable. 2. Speaking audio: no restrictions. Description of the output video: 1. Output format: mp4. 2. Output resolution: up to 4096, i.e. 4K video. 2.1 When ConcurrencyType is Shared, the maximum output is 2560\*1440; when the input video resolution is ≤2560\*1440, the output resolution matches the input video. 2.2 When ConcurrencyType is Exclusive, the maximum output is 4096\*4096; when the input video resolution is ≤4096\*4096, the output resolution matches the input video. (Synthesizing 4K video takes twice the real-time rate of 2K video.) |
DriverType | string | Yes | Driver type. 1. Text: text-driven; the InputSsml field is required. 2. OriginalVoice: driven by the original voice audio; the InputAudioUrl field is required. |
IdentityWrittenUrl | string | No | Written authorization letter in PDF format; file size less than 10 MB. |
IdentityVideoUrl | string | No | Video authorization letter in mp4 format; file size less than 5 GB. |
InputAudioUrl | string | No | Audio URL for driving the Digital Human. This field is required when DriverType is OriginalVoice. 1. Supported formats: wav, mp3, wma, m4a, aac, ogg. 2. Duration: at least 1 second and at most 10 minutes (when ConcurrencyType is Exclusive, audio up to 20 minutes is supported). 3. Size: no more than 100 MB. 4. Ensure adequate download speed; otherwise the download may fail or the real-time rate of video production may drop. |
InputSsml | string | No | Broadcast text content; supports SSML tags. For supported tag types, refer to the Digital Human SSML Markup Language Specification; for tag syntax, see the example. The content must not include line breaks, and symbols must be escaped. Length: 4 to 2000 characters (counted as unicode characters). This field is required when DriverType is empty or Text. |
SpeechParam | object | No | Defines the audio parameters. This field is required when DriverType is Text. |
SpeechParam.Speed | float | No | Speech rate. 1.0 is normal speed; range [0.5, 1.5], where 0.5 is the slowest and 1.5 is the fastest. Speech rate control has no effect when DriverType is an audio-driven type. This field is required when DriverType is Text. |
SpeechParam.TimbreKey | string | No | Voice type key. This field is required when DriverType is Text. |
SpeechParam.Volume | int | No | Volume level, ranging from 0 to 10. The default is 0, which represents normal volume; higher values are louder. Note: volume adjustment is not supported for the TimbreKey values male_1 through male_20 and female_1 through female_23. |
SpeechParam.EmotionCategory | string | No | Controls the emotion of the synthesized audio, supported only for multi-emotion timbres. See the Personal Asset Management API Paginated Query Timbre List for available values. |
SpeechParam.EmotionIntensity | int | No | Controls the intensity of the synthesized audio emotion, with a range of [50,200]. This is only effective when EmotionCategory is not empty. |
SpeechParam.TimbreLanguage | string | No | Voice type language. See the Personal Asset Management API Paginated Query Timbre List for selectable languages. A corresponding language must be selected when synthesizing multilingual voice types. |
ConcurrencyType | string | No | Resource type used for video production tasks. 1. Exclusive: uses concurrent calls and does not deduct from the hourly package; requires purchased concurrency, otherwise task submission fails. 2. Shared: calls deduct from the hourly package; requires a purchased hourly package, otherwise task submission fails. 3. Left blank: if you have purchased concurrency (or both concurrency and the hourly package), it defaults to Exclusive; if you have purchased only the hourly package, it defaults to Shared; if neither is purchased, task submission fails. |
VideoLoop | int | No | When the audio is longer than the video, the generated video is aligned to the audio duration. There are two alignment modes: 0: reverse splicing; 1: forward splicing. Default 0. |
VideoParametersConsistent | int | No | Video output standard (bitrate and frame rate) alignment option. Default 0. 0: do not force alignment with the input video format. 1: align with the input video format. Parameter description: When 0 is selected: 1. Bitrate: the encoder defaults to CRF=17 and dynamically adjusts the output bitrate based on video content complexity to keep quality stable. 2. Frame rate: defaults to 25fps. When 1 is selected: 1. Bitrate rule: input bitrate < 1000kbps → output bitrate is 1000kbps; input bitrate > 9000kbps → output bitrate is 9000kbps; input bitrate within [1000kbps, 9000kbps] → output bitrate matches the input. 2. Frame rate rule: input frame rate < 15fps → output frame rate is 15fps; input frame rate > 60fps → output frame rate is 60fps; input frame rate within [15fps, 60fps] → output frame rate matches the input. |
CallbackUrl | string | No | If a callback URL is provided, video production results are sent in a fixed format via a POST request to that address. For the format, see Appendix II: Callback Request Body Format. Note: 1. CallbackUrl must be shorter than 1000 characters. 2. Only one request is sent; if it fails for any reason, it is not resent. |
VideoParam | object | No | Defines the detailed parameters for video synthesis. |
VideoParam.RefPhotoUrl | string | No | URL of a user-uploaded face reference image. When the input video contains several faces, the VideoMakeNoTrain video generation API allows only one face to be selected as the lip-sync target, so that its lip movement matches the audio; this parameter specifies which person is targeted. If no face reference image is provided, the person with the largest face proportion in the first frame containing a face is selected by default. Image file requirements: 1. File size: ≤10 MB. 2. Image size: each dimension between 192 and 4096 pixels. 3. Supported formats: jpg, jpeg, png, bmp, webp. 4. Must contain a clear, front-facing image of a person who appears in the video. |
VideoParam.DisableIdDetect | int | No | Face ID tracking switch. Default 0. 0: enable face ID tracking. When enabled, you can use the face reference parameter (VideoParam.RefPhotoUrl) to specify the face to drive; if not specified, the first valid face ID detected is used. 1: disable face ID tracking. When disabled, the face reference parameter (VideoParam.RefPhotoUrl) is ignored and all detected faces are driven (if several faces appear in the same frame, the largest face is used as the valid face ID). |
VideoParam.DisableOcclusionDetect | int | No | Occlusion detection switch. Default 0. 0: enable occlusion detection. When enabled, lip-sync is not driven while the mouth is occluded. 1: disable occlusion detection. When disabled, lip-sync is still driven while the mouth is occluded. |
VideoParam.EnableFakeTeeth | int | No | Dentures switch. Default 0. 0: disable dentures processing; the teeth in the final video follow the teeth features of the original video. 1: enable dentures processing; the teeth in the final video are regenerated without reference to the original video's teeth features. |
VideoParam.MakeType | string | No | Video customization type. Default: default configuration; the clip starts from the 0th second of the original video, and StartTime and EndTime are ignored. Custom: specify a video segment; fill in StartTime and EndTime to select a recording clip (which must be longer than 5s). By default the generated video loops back and forth over this segment. Circle: the starting and ending frames are aligned; StartTime may be filled in to specify the start time (EndTime has no effect), and the VideoLoop parameter is disabled. CustomOnlyStart: specify the start time with StartTime (EndTime has no effect); the VideoLoop parameter is disabled and reverse splicing is used by default. |
VideoParam.StartTime | float | No | Start time in seconds (3 decimal places). Valid only when MakeType is Custom, Circle, or CustomOnlyStart. If filled in, the generated video starts from this position; otherwise it starts from the default start time. |
VideoParam.EndTime | float | No | End time in seconds (3 decimal places). Valid only when MakeType is Custom. If filled in, the generated video ends at this position; otherwise it defaults to the end time of the selected video clip. |
VideoParam.DisableIntervals | Array of DisableInterval | No | Defines a list of time segments in the video during which lip-sync is not applied (currently at most 5 segments). The list must be in chronological order, otherwise task submission fails. For example, [[1,2],[3,4],[5,6]] means lip-sync is not applied to the original video during 1s–2s, 3s–4s, and 5s–6s. |
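Putting the table above together, a Text-driven request body can be assembled and sanity-checked client-side before submission. The helper below is an illustrative sketch only, not part of the API: the field names come from the parameter table, and the validation thresholds mirror the stated limits for InputSsml, SpeechParam.Speed, and SpeechParam.Volume.

```python
import json

def build_text_request(ref_video_url, ssml, timbre_key, speed=1.0, volume=0):
    """Assemble a Text-driven request body per the parameter table above.

    Hypothetical helper for illustration; the API accepts the raw JSON payload.
    """
    if not (4 <= len(ssml) <= 2000):     # InputSsml: 4-2000 unicode characters
        raise ValueError("InputSsml must be 4-2000 characters")
    if not (0.5 <= speed <= 1.5):        # SpeechParam.Speed range [0.5, 1.5]
        raise ValueError("Speed must be within [0.5, 1.5]")
    if not (0 <= volume <= 10):          # SpeechParam.Volume range [0, 10]
        raise ValueError("Volume must be within [0, 10]")
    return {
        "Header": {},
        "Payload": {
            "RefVideoUrl": ref_video_url,
            "DriverType": "Text",
            "InputSsml": ssml,
            "SpeechParam": {
                "TimbreKey": timbre_key,
                "Speed": speed,
                "Volume": volume,
            },
        },
    }

# Usage (hypothetical URL):
body = build_text_request(
    "https://example.com/ref_video.mp4",
    "Hello, I am the virtual anchor",
    "female_1",
)
request_json = json.dumps(body)
```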
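For the CallbackUrl parameter, the service POSTs the production result to your endpoint exactly once. A minimal receiver sketch is shown below; it is hypothetical, and the actual request body fields are defined in Appendix II: Callback Request Body Format, so the handler treats the payload as opaque JSON.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    """Accept the single callback POST and acknowledge with HTTP 200.

    Since the callback is never resent on failure, acknowledge quickly and
    do any heavy processing afterwards.
    """
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Field names belong to Appendix II; this sketch just stores the body.
        self.server.last_callback = payload
        self.send_response(200)
        self.end_headers()

# Usage: serve on a publicly reachable address matching your CallbackUrl.
# server = HTTPServer(("0.0.0.0", 8080), CallbackHandler)
# server.serve_forever()
```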
| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
StartTime | float | No | Start time of the non-driven segment, in seconds (3 decimal places) |
EndTime | float | No | End time of the non-driven segment, in seconds (3 decimal places) |
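The chronological-order and segment-count constraints on DisableIntervals can be checked before submitting the task. The validator below is an illustrative sketch, assuming segments use the StartTime/EndTime object form from the table above:

```python
def validate_disable_intervals(intervals):
    """Check a DisableIntervals list against the documented constraints:
    at most 5 segments, each StartTime < EndTime, strictly chronological."""
    if len(intervals) > 5:
        return False
    prev_end = float("-inf")
    for seg in intervals:
        start, end = seg["StartTime"], seg["EndTime"]
        if start >= end or start < prev_end:
            return False
        prev_end = end
    return True

# Chronological segments pass; out-of-order segments fail.
ok = validate_disable_intervals([
    {"StartTime": 1.0, "EndTime": 2.0},
    {"StartTime": 3.0, "EndTime": 4.0},
])
bad = validate_disable_intervals([
    {"StartTime": 3.0, "EndTime": 4.0},
    {"StartTime": 1.0, "EndTime": 2.0},  # earlier than the previous segment
])
```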
| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
TaskId | string | Yes | Video production task ID. Use the TaskId with the Audio and Video Production Progress Query API to obtain production progress and results. |
{"Header": {},"Payload": {"RefVideoUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/ref_video.mp4","DriverType": "Text","InputSsml": "Hello, I am the virtual <phoneme alphabet=\\"py\\" ph=\\"fu4\\">anchor</phoneme>","SpeechParam": {"TimbreKey": "female_1","Volume": 1,"Speed": 1.0}}}
{"Header": {},"Payload": {"RefVideoUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/ref_video.mp4","DriverType": "OriginalVoice","InputAudioUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/audio.mp3"}}
{"Header": {"Code": 0,"DialogID": "","Message": "","RequestID": "fde854eaa981c7f2f7285d1c7eca335b","SessionID": "gzb7dec22117297528294581119"},"Payload": {"TaskId": "81883d47c6154edf8e276531f09227b6"}}
```shell
ffmpeg -i before_conversion_video_path -c:v libx264 -preset medium -crf 23 -c:a aac -b:a 128k after_conversion_video_path
```

```shell
ffmpeg -i "/Users/xxxx/Downloads/picture/video/video_name.mov" -c:v libx264 -preset medium -crf 23 -c:a aac -b:a 128k "/Users/xxxx/Downloads/picture/video/converted_video_name.mp4"
```