| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
RefVideoUrl | string | Yes | Template video URL. 1. Supported muxing formats: mp4, mov, avi. Note: the video itself must be encoded as H.264; if it is not, convert it with the method below (see encoding format conversion). 2. File size: up to 5 GB. 3. Video resolution: each dimension must be between 360 and 4096 pixels. 4. Video duration: 1-600 seconds supported; 10-120 seconds recommended (when ConcurrencyType is Exclusive, videos up to 20 minutes are supported). 5. Ensure adequate download speed; otherwise the download may fail or the real-time rate of video production may drop. Requirements for the input video material: 1. Face visuals: a real person is required (a cartoon character is acceptable only if its facial features have human-like proportions). While speaking, the face should face the camera, with horizontal rotation of no more than 45 degrees and pitch of no more than 15 degrees. Avoid facial occlusion and keep facial lighting stable. 2. Speaking audio: no restrictions. Description of the output video: 1. Output format: mp4. 2. Output resolution: up to 4096, i.e. 4K video. 2.1 When ConcurrencyType is Shared, the maximum output is 2560\*1440; when the input video resolution is ≤2560\*1440, the output resolution matches the input video. 2.2 When ConcurrencyType is Exclusive, the maximum output is 4096\*4096; when the input video resolution is ≤4096\*4096, the output resolution matches the input video. (Synthesizing 4K video takes twice the real-time rate of 2K video.) |
DriverType | string | Yes | Driver type. 1. Text: text-driven; the InputSsml field is required. 2. OriginalVoice: driven by the original voice audio; the InputAudioUrl field is required. |
IdentityWrittenUrl | string | No | Written authorization letter in PDF format; file size less than 10 MB. |
IdentityVideoUrl | string | No | Video authorization letter in mp4 format; file size less than 5 GB. |
InputAudioUrl | string | No | Audio URL for driving the Digital Human. This field is required when DriverType is OriginalVoice. 1. Supported formats: wav, mp3, wma, m4a, aac, ogg. 2. Duration: at least 1 second and at most 10 minutes (when ConcurrencyType is Exclusive, audio up to 20 minutes is supported). 3. Size: no more than 100 MB. 4. Ensure adequate download speed; otherwise the download may fail or the real-time rate of video production may drop. |
InputSsml | string | No | Broadcast text content; supports SSML tags. For supported tag types, refer to the Digital Human SSML Markup Language Specification; for tag syntax, see the example. The content must not include line breaks, and symbols must be escaped. Length: 4 to 2000 characters (counted as unicode characters). This field is required when DriverType is empty or Text. |
SpeechParam | object | No | Defines the audio parameters. This field is required when DriverType is Text. |
SpeechParam.Speed | float | No | Speech rate. 1.0 is normal speed; range [0.5, 1.5], where 0.5 is the slowest and 1.5 is the fastest. Speech rate control has no effect when DriverType is an audio-driven type. This field is required when DriverType is Text. |
SpeechParam.TimbreKey | string | No | Voice type key. This field is required when DriverType is Text. |
SpeechParam.Volume | int | No | Volume level, ranging from 0 to 10. The default is 0, which represents normal volume; higher values are louder. Note: volume adjustment is not supported for the TimbreKey values male_1 through male_20 and female_1 through female_23. |
SpeechParam.EmotionCategory | string | No | Controls the emotion of the synthesized audio, supported only for multi-emotion timbres. See the Personal Asset Management API Paginated Query Timbre List for available values. |
SpeechParam.EmotionIntensity | int | No | Controls the intensity of the synthesized audio emotion, with a range of [50,200]. This is only effective when EmotionCategory is not empty. |
SpeechParam.TimbreLanguage | string | No | Voice type language. See the Personal Asset Management API Paginated Query Timbre List for selectable languages. A corresponding language must be selected when synthesizing multilingual voice types. |
ConcurrencyType | string | No | Resource type used for video production tasks. 1. Exclusive: uses concurrent calls and does not deduct from the hourly package; requires purchased concurrency, otherwise task submission fails. 2. Shared: calls deduct from the hourly package; requires a purchased hourly package, otherwise task submission fails. 3. Left blank: if you have purchased concurrency (or both concurrency and the hourly package), it defaults to Exclusive; if you have purchased only the hourly package, it defaults to Shared; if neither is purchased, task submission fails. |
VideoLoop | int | No | When the audio is longer than the video, the generated video is aligned to the audio duration. There are two alignment modes: 0: reverse splicing; 1: forward splicing. Default 0. |
VideoParametersConsistent | int | No | Video output standard (bitrate and frame rate) alignment option. Default 0. 0: do not force alignment with the input video format. 1: align with the input video format. Parameter description: When 0 is selected: 1. Bitrate: the encoder defaults to CRF=17 and dynamically adjusts the output bitrate based on video content complexity to keep quality stable. 2. Frame rate: defaults to 25fps. When 1 is selected: 1. Bitrate rule: input bitrate < 1000kbps → output bitrate is 1000kbps; input bitrate > 9000kbps → output bitrate is 9000kbps; input bitrate within [1000kbps, 9000kbps] → output bitrate matches the input. 2. Frame rate rule: input frame rate < 15fps → output frame rate is 15fps; input frame rate > 60fps → output frame rate is 60fps; input frame rate within [15fps, 60fps] → output frame rate matches the input. |
CallbackUrl | string | No | If a callback URL is provided, video production results are sent in a fixed format via a POST request to that address. For the format, see Appendix II: Callback Request Body Format. Note: 1. CallbackUrl must be shorter than 1000 characters. 2. Only one request is sent; if it fails for any reason, it is not resent. |
VideoParam | object | No | Defines the detailed parameters for video synthesis. |
VideoParam.RefPhotoUrl | string | No | URL of a user-uploaded face reference image. When the input video contains several faces, the VideoMakeNoTrain video generation API allows only one face to be selected as the lip-sync target, so that its lip movement matches the audio; this parameter specifies which person is targeted. If no face reference image is provided, the person with the largest face proportion in the first frame containing a face is selected by default. Image file requirements: 1. File size: ≤10 MB. 2. Image size: each dimension between 192 and 4096 pixels. 3. Supported formats: jpg, jpeg, png, bmp, webp. 4. Must contain a clear, front-facing image of a person who appears in the video. |
VideoParam.DisableIdDetect | int | No | Face ID tracking switch. Default 0. 0: enable face ID tracking. When enabled, you can use the face reference parameter (VideoParam.RefPhotoUrl) to specify the face to drive; if not specified, the first valid face ID detected is used. 1: disable face ID tracking. When disabled, the face reference parameter (VideoParam.RefPhotoUrl) is ignored and all detected faces are driven (if several faces appear in the same frame, the largest face is used as the valid face ID). |
VideoParam.DisableOcclusionDetect | int | No | Occlusion detection switch. Default 0. 0: enable occlusion detection. When enabled, lip-sync is not driven while the mouth is occluded. 1: disable occlusion detection. When disabled, lip-sync is still driven while the mouth is occluded. |
VideoParam.EnableFakeTeeth | int | No | Dentures switch. Default 0. 0: disable dentures processing; the teeth in the final video follow the teeth features of the original video. 1: enable dentures processing; the teeth in the final video are regenerated without reference to the original video's teeth features. |
VideoParam.MakeType | string | No | Video customization type. Default: default configuration; the clip starts from the 0th second of the original video, and StartTime and EndTime are ignored. Custom: specify a video segment; fill in StartTime and EndTime to select a recording clip (which must be longer than 5s). By default the generated video loops back and forth over this segment. Circle: the starting and ending frames are aligned; StartTime may be filled in to specify the start time (EndTime has no effect), and the VideoLoop parameter is disabled. CustomOnlyStart: specify the start time with StartTime (EndTime has no effect); the VideoLoop parameter is disabled and reverse splicing is used by default. |
VideoParam.StartTime | float | No | Start time in seconds (3 decimal places). Valid only when MakeType is Custom, Circle, or CustomOnlyStart. If filled in, the generated video starts from this position; otherwise it starts from the default start time. |
VideoParam.EndTime | float | No | End time in seconds (3 decimal places). Valid only when MakeType is Custom. If filled in, the generated video ends at this position; otherwise it defaults to the end time of the selected video clip. |
VideoParam.DisableIntervals | Array of DisableInterval | No | Defines a list of time segments in the video during which lip-sync is not applied (currently at most 5 segments). The list must be in chronological order, otherwise task submission fails. For example, [[1,2],[3,4],[5,6]] means lip-sync is not applied to the original video during 1s–2s, 3s–4s, and 5s–6s. |
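Putting the table above together, a Text-driven request body can be assembled and sanity-checked client-side before submission. The helper below is an illustrative sketch only, not part of the API: the field names come from the parameter table, and the validation thresholds mirror the stated limits for InputSsml, SpeechParam.Speed, and SpeechParam.Volume.

```python
import json

def build_text_request(ref_video_url, ssml, timbre_key, speed=1.0, volume=0):
    """Assemble a Text-driven request body per the parameter table above.

    Hypothetical helper for illustration; the API accepts the raw JSON payload.
    """
    if not (4 <= len(ssml) <= 2000):     # InputSsml: 4-2000 unicode characters
        raise ValueError("InputSsml must be 4-2000 characters")
    if not (0.5 <= speed <= 1.5):        # SpeechParam.Speed range [0.5, 1.5]
        raise ValueError("Speed must be within [0.5, 1.5]")
    if not (0 <= volume <= 10):          # SpeechParam.Volume range [0, 10]
        raise ValueError("Volume must be within [0, 10]")
    return {
        "Header": {},
        "Payload": {
            "RefVideoUrl": ref_video_url,
            "DriverType": "Text",
            "InputSsml": ssml,
            "SpeechParam": {
                "TimbreKey": timbre_key,
                "Speed": speed,
                "Volume": volume,
            },
        },
    }

# Usage (hypothetical URL):
body = build_text_request(
    "https://example.com/ref_video.mp4",
    "Hello, I am the virtual anchor",
    "female_1",
)
request_json = json.dumps(body)
```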
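For the CallbackUrl parameter, the service POSTs the production result to your endpoint exactly once. A minimal receiver sketch is shown below; it is hypothetical, and the actual request body fields are defined in Appendix II: Callback Request Body Format, so the handler treats the payload as opaque JSON.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    """Accept the single callback POST and acknowledge with HTTP 200.

    Since the callback is never resent on failure, acknowledge quickly and
    do any heavy processing afterwards.
    """
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Field names belong to Appendix II; this sketch just stores the body.
        self.server.last_callback = payload
        self.send_response(200)
        self.end_headers()

# Usage: serve on a publicly reachable address matching your CallbackUrl.
# server = HTTPServer(("0.0.0.0", 8080), CallbackHandler)
# server.serve_forever()
```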
| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
StartTime | float | No | Start time of the non-driven segment, in seconds (3 decimal places) |
EndTime | float | No | End time of the non-driven segment, in seconds (3 decimal places) |
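The chronological-order and segment-count constraints on DisableIntervals can be checked before submitting the task. The validator below is an illustrative sketch, assuming segments use the StartTime/EndTime object form from the table above:

```python
def validate_disable_intervals(intervals):
    """Check a DisableIntervals list against the documented constraints:
    at most 5 segments, each StartTime < EndTime, strictly chronological."""
    if len(intervals) > 5:
        return False
    prev_end = float("-inf")
    for seg in intervals:
        start, end = seg["StartTime"], seg["EndTime"]
        if start >= end or start < prev_end:
            return False
        prev_end = end
    return True

# Chronological segments pass; out-of-order segments fail.
ok = validate_disable_intervals([
    {"StartTime": 1.0, "EndTime": 2.0},
    {"StartTime": 3.0, "EndTime": 4.0},
])
bad = validate_disable_intervals([
    {"StartTime": 3.0, "EndTime": 4.0},
    {"StartTime": 1.0, "EndTime": 2.0},  # earlier than the previous segment
])
```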
| Parameters | Type | Mandatory | Description |
| --- | --- | --- | --- |
TaskId | string | Yes | Video production task ID. Use the TaskId with the Audio and Video Production Progress Query API to obtain production progress and results. |
{"Header": {},"Payload": {"RefVideoUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/ref_video.mp4","DriverType": "Text","InputSsml": "Hello, I am the virtual <phoneme alphabet=\\"py\\" ph=\\"fu4\\">anchor</phoneme>","SpeechParam": {"TimbreKey": "female_1","Volume": 1,"Speed": 1.0}}}
{"Header": {},"Payload": {"RefVideoUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/ref_video.mp4","DriverType": "OriginalVoice","InputAudioUrl": "http://virtualhuman-cos-test-1251316161.cos.ap-nanjing.myqcloud.com/audio.mp3"}}
{"Header": {"Code": 0,"DialogID": "","Message": "","RequestID": "fde854eaa981c7f2f7285d1c7eca335b","SessionID": "gzb7dec22117297528294581119"},"Payload": {"TaskId": "81883d47c6154edf8e276531f09227b6"}}
```shell
ffmpeg -i before_conversion_video_path -c:v libx264 -preset medium -crf 23 -c:a aac -b:a 128k after_conversion_video_path
```

```shell
ffmpeg -i "/Users/xxxx/Downloads/picture/video/video_name.mov" -c:v libx264 -preset medium -crf 23 -c:a aac -b:a 128k "/Users/xxxx/Downloads/picture/video/converted_video_name.mp4"
```