Content | Description |
Language type | Supports Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, and Arabic. The corresponding language type can be set through the API parameter engine_model_type. |
Supported industries | General, finance, gaming, education, healthcare |
Audio Properties | Sampling rate: 16000 Hz or 8000 Hz; sampling accuracy: 16 bits; sound channel: mono |
Audio Format | pcm, wav, opus, speex, silk, mp3, m4a, aac |
Request Protocol | wss protocol |
Request URL | wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters} |
API Authentication | Signature authentication mechanism. For details, see Signature Generation |
Response Format | Unified JSON format |
Data Transmission | It is recommended to send one packet of 40 ms of audio every 40 ms (1:1 real-time rate), which corresponds to 640 bytes of pcm at an 8 kHz sampling rate and 1280 bytes at 16 kHz. Sending audio faster than the 1:1 real-time rate, or letting the interval between audio packets exceed 6 seconds, may cause an engine error; the backend will return an error and actively disconnect. |
Concurrency Limit | The default single account concurrent connection limit is 20. If you need to increase the concurrent limit, submit a ticket for consultation. |
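The recommended packet sizing above follows directly from the audio properties (16-bit mono pcm): bytes per 40 ms = sample rate × 2 bytes × 0.04 s. A minimal sketch of a paced sending loop, where `send` is a placeholder for your own WebSocket send function:

```python
import time

def chunk_size(sample_rate_hz: int, frame_ms: int = 40) -> int:
    """Bytes of 16-bit mono pcm audio covering frame_ms milliseconds."""
    bytes_per_sample = 2  # 16-bit sampling accuracy
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000

def send_paced(pcm: bytes, sample_rate_hz: int, send, frame_ms: int = 40):
    """Send pcm audio at a 1:1 real-time rate: one frame_ms packet every frame_ms."""
    size = chunk_size(sample_rate_hz, frame_ms)
    for offset in range(0, len(pcm), size):
        send(pcm[offset:offset + size])
        time.sleep(frame_ms / 1000)  # never exceed the 1:1 real-time rate

# chunk_size(8000) == 640 and chunk_size(16000) == 1280, matching the table above.
```

Sending slower than real time is safe as long as no gap between packets exceeds 6 seconds; sending faster risks the engine error described above.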
Field Name | Type | Description |
code | Integer | Status code. 0 indicates success; non-zero values indicate an error. |
message | String | Error description. When an error occurs, this field describes the reason. As the business evolves and the experience is optimized, this text may be updated frequently. |
voice_id | String | Unique audio stream ID, generated by the client during the handshake phase and passed in the API call parameters |
message_id | String | Unique message ID |
result | Result | Latest speech recognition result |
final | Integer | When this field returns 1, it means the audio stream recognition is completed. |
Field Name | Type | Description |
slice_type | Integer | Recognition result type. 0: start of sentence recognition. 1: sentence recognition in progress; voice_text_str is an unstable result (the recognition result may still change). 2: end of sentence recognition; voice_text_str is a steady-state result (the recognition result no longer changes). While audio is being sent, the slice_type sequences that may be returned are: 0-1-2 (sentence start, recognition in progress (possibly multiple 1s), recognition completed) and 0-2 (sentence start, then the complete recognition result of the segment returned directly) |
index | Integer | Sequence number of the current sentence in the entire audio stream, starting from 0 and incrementing sentence by sentence |
start_time | Integer | Start time of the current sentence in the audio stream, in milliseconds |
end_time | Integer | End time of the current sentence in the audio stream, in milliseconds |
voice_text_str | String | Text result of the current segment, UTF-8 encoded |
word_size | Integer | Number of word results in the current paragraph |
word_list | Word Array | Word list of the current sentence. Word structure: word: String, content of the word; start_time: Integer, start time of the word in the entire audio stream; end_time: Integer, end time of the word in the entire audio stream; stable_flag: Integer, stability of the word: 0 means the word may change in subsequent recognition, 1 means it will not change |
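Putting the two tables above together, a receiving loop can dispatch on `code`, `final`, and `slice_type`. A minimal classification sketch (the category names are illustrative, not part of the API):

```python
import json

def classify_message(raw: str) -> str:
    """Classify a server message per the response fields above."""
    msg = json.loads(raw)
    if msg["code"] != 0:
        return "error"      # non-zero code: see the message field for the reason
    if msg.get("final") == 1:
        return "final"      # audio stream recognition is completed
    result = msg.get("result")
    if result is None:
        return "ack"        # the handshake acknowledgement carries no result
    if result["slice_type"] == 2:
        return "stable"     # steady-state result: the text no longer changes
    return "unstable"       # slice_type 0 or 1: the text may still change
```

A client would typically append "stable" texts to the transcript and overwrite the current line with "unstable" texts.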
wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
key1=value1&key2=value2... (URL-encode both key and value)
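The query string above can be assembled with standard URL encoding. The example request later in this document lists parameters in alphabetical order, so this sketch sorts the keys; the parameter values below are placeholders for illustration only:

```python
from urllib.parse import quote

def build_query(params: dict) -> str:
    """Build the query string with keys sorted and both key and value URL-encoded."""
    return "&".join(
        f"{quote(str(k), safe='')}={quote(str(v), safe='')}"
        for k, v in sorted(params.items())
    )

# Hypothetical values for illustration; the signature parameter is computed
# separately (see Signature Generation) and appended last.
params = {
    "engine_model_type": "16k_zh",
    "timestamp": 1592294092,
    "expired": 1592380492,
    "nonce": 1592294092123,
    "secretid": "YOUR_SECRET_ID",
    "voice_id": "RnKu9FODFHK5FPpsrN",
    "voice_format": 1,
}
url = "wss://asr.cloud.tencent.com/asr/v2/YOUR_APPID?" + build_query(params)
```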
Parameter Name | Required | Type | Description |
secretid | Y | String | SecretId of the account, obtained from API Key Management in the console |
timestamp | Y | Integer | Current UNIX timestamp, unit: seconds. If the difference from the current time is too large, it will cause a signature expiration error. |
expired | Y | Integer | UNIX timestamp at which the signature expires, in seconds. expired must be greater than timestamp, and expired - timestamp must be less than 90 days. |
nonce | Y | Integer | Random positive integer, generated by the user, up to 10 digits. |
engine_model_type | Y | String | Engine model type. Phone-call scenario: 8k_zh: Mandarin telephone, general; 8k_en: English telephone, general. Non-phone-call scenario: 16k_zh_large: large-model engine for Mandarin, Chinese dialects, and English [large model version]. This model supports recognition of Chinese, English, and multiple Chinese dialects; it has a large parameter count and an enhanced language model, which greatly improves recognition accuracy on low-quality audio such as loud noise, strong echo, low speaking volume, and distant voices. 16k_zh: Mandarin, general; 16k_yue: Cantonese; 16k_zh-TW: Chinese (Traditional); 16k_ar: Arabic; 16k_en: English; 16k_ko: Korean; 16k_ja: Japanese; 16k_th: Thai; 16k_id: Indonesian; 16k_ms: Malay |
voice_id | Y | String | A 16-character string that uniquely identifies each audio stream, generated by the user. |
voice_format | N | Integer | Audio encoding format, optional; default value is 4. 1: pcm; 4: speex(sp); 6: silk; 8: mp3; 10: opus (see the Opus format audio stream encapsulation notes); 12: wav; 14: m4a (each chunk must be a complete m4a audio); 16: aac |
needvad | N | Integer | 0: disable VAD; 1: enable VAD. If a speech segment exceeds 60 seconds, enable VAD (voice activity detection and segmentation). |
hotword_id | N | String | Hotword list id. If this parameter is not set, the default hotword list automatically takes effect; if it is set, the specified hotword list takes effect. |
reinforce_hotword | N | Integer | Enhanced hotword feature. Default is 0. 0: not enabled; 1: enabled. When enabled (only supported for 8k_zh and 16k_zh), homophone replacement is applied to the configured hotwords. For example, after the term "蜜制" is set and the enhancement feature is enabled, recognition results with the same pronunciation (mizhi), such as "秘制" and "蜜汁", will be forcibly replaced with "蜜制". It is therefore recommended to enable this feature only when it suits your actual scenario. |
customization_id | N | String | Self-learning model id. If this parameter is not set, the most recently launched self-learning model takes effect automatically; if it is set, the specified self-learning model takes effect. |
filter_dirty | N | Integer | Whether to filter profanity (currently supported by the Mandarin engine). Default value is 0. 0: do not filter profanity; 1: filter profanity; 2: replace profanity with "*". |
filter_modal | N | Integer | Whether to filter modal particles (currently supported by the Mandarin engine). Default value is 0. 0: do not filter modal particles; 1: filter some modal particles; 2: strictly filter modal particles. |
filter_punc | N | Integer | Whether to filter the period at the end of a sentence (currently supported by the Mandarin engine). Default value is 0. 0: do not filter sentence-final periods; 1: filter sentence-final periods. |
filter_empty_result | N | Integer | Whether to call back empty recognition results. Default is 1. 0: call back empty results; 1: do not call back empty results. Note: if paired slice_type=0 and slice_type=2 callbacks are needed, set filter_empty_result=0. Paired returns are generally needed in outbound-call scenarios, where slice_type=0 is used to determine whether speech has started. |
convert_num_mode | N | Integer | Whether to perform intelligent conversion of Arabic numerals (currently supported by the Mandarin engine). 0: do not convert, output Chinese numerals directly; 1: intelligently convert to Arabic numerals based on the scenario; 3: enable math-related number conversion. Default value is 1. |
word_info | N | Integer | Whether to display word-level timestamps. 0: do not display; 1: display, excluding punctuation timestamps; 2: display, including punctuation timestamps. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH. Default is 0. |
vad_silence_time | N | Integer | Voice segmentation detection threshold: silence lasting longer than this threshold is treated as a sentence break (commonly used in customer service scenarios; must be used together with needvad=1). Value range: 240-2000 ms. Do not adjust this parameter casually, as it may affect recognition performance. Currently only supported by the 8k_zh, 8k_zh_finance, and 16k_zh engine models. |
max_speak_time | N | Integer | Forced segmentation feature. Value range: 5000-90000 ms; default value 0 (disabled). When the speaker talks continuously without pauses, this parameter forces a segment break (the result then becomes steady-state, slice_type=2). For example, in game commentary scenarios, where the commentator speaks continuously and the audio cannot be segmented by pauses, set this parameter to 10000 to receive a slice_type=2 callback every 10 seconds. |
noise_threshold | N | Float | Noise threshold. Default value is 0. Value range: [-2, 2]. The larger the value, the more likely an audio segment is judged to be noise; the smaller the value, the more likely it is judged to be human voice. Use with caution: it may affect recognition accuracy. |
signature | Y | String | API signature parameters |
hotword_list | N | String | Temporary hotword list, used to improve recognition accuracy. Format of a single hotword: "hotword|weight"; each hotword may be at most 30 characters (at most 10 Chinese characters) and the weight ranges from 1 to 11, for example "Tencent Cloud|5" or "ASR|11". Restrictions on the list: multiple hotwords are separated by commas and up to 128 hotwords are supported, for example "Tencent Cloud|10,speech recognition|5,ASR|11". Difference between hotword_id (hotword list) and hotword_list (temporary hotword list): for hotword_id, you must first create a hotword list in the console or via the API, then pass the resulting hotword_id as an input parameter; for hotword_list, you pass the temporary hotword list directly with each request, and it is not retained in the cloud, which suits users with a massive number of hotwords. Note: if both hotword_id and hotword_list are provided, hotword_list takes precedence. When a hotword's weight is set to 11, it is upgraded to a super hotword; set only important, must-take-effect hotwords to 11, since too many weight-11 hotwords will reduce overall accuracy. |
input_sample_rate | N | Integer | Upsamples 8 kHz pcm audio to 16 kHz for recognition when it does not match the engine sampling rate, effectively improving recognition accuracy. Only the value 8000 is supported. For example, with input_sample_rate=8000 the pcm audio sampling rate is 8 kHz; if the 16k_zh engine is selected, the 8 kHz pcm audio can still be recognized by the 16k_zh engine. Note: this parameter applies only to pcm audio. If no value is provided, the default behavior is kept: the engine sampling rate must equal the pcm audio sampling rate. |
asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
HepdTRX6u155qIPKNKC+3U0j1N0=
wss://asr.cloud.tencent.com/asr/v2/125922***?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
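The signing steps above (HmacSha1 over the URL without the wss:// prefix, Base64-encode the digest, then URL-encode the result before appending it as the signature parameter) can be sketched as follows; the key and URL string below are placeholders:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign(string_to_sign: str, secret_key: str) -> str:
    """Base64Encode(HmacSha1(string_to_sign, secret_key)), as in the example above."""
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# Placeholder inputs; the real string is the full request URL minus "wss://".
signature = sign("asr.cloud.tencent.com/asr/v2/<appid>?engine_model_type=16k_zh&...",
                 "YOUR_SECRET_KEY")
encoded = quote(signature, safe="")  # '+' becomes %2B, '=' becomes %3D
```

Note that the signature must be URL-encoded before it is appended to the request URL, as the `%2B`/`%3D` in the example URL above shows.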
OpusHead (4 Byte) | Frame Data Length (2 Byte) | Opus Frame Compressed Data |
opus | Length len | Opus compressed frame data of length len |
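The frame layout above (a 4-byte "opus" header, a 2-byte length field, then the compressed frame data) can be sketched as below. The byte order of the length field is an assumption (big-endian here); confirm it against the official Opus encapsulation notes referenced by voice_format:

```python
import struct

def wrap_opus_frame(compressed: bytes) -> bytes:
    """Encapsulate one Opus frame: 4-byte 'opus' header, 2-byte length, frame data.

    The endianness of the length field is assumed (big-endian); verify it
    against the official Opus format encapsulation notes.
    """
    return b"opus" + struct.pack(">H", len(compressed)) + compressed
```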
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}
{"type": "end"}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real time","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time speech recognition","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
{"code":4008,"message":"Background recognition server audio fragment waiting timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}
Value | Description |
4001 | Invalid parameter, see message for details |
4002 | Authentication failed |
4003 | AppID service not activated, please activate the service in the console |
4004 | No available free quota |
4005 | Account overdue, service suspended. Please top up promptly |
4006 | Account concurrent API calls exceeded the limit |
4007 | Audio decoding failed, please check that the uploaded audio data format matches the request parameters |
4008 | Client data upload timed out |
4009 | Client disconnected |
4010 | Client uploaded unknown text message |
5000 | Backend error, please retry |
5001 | Backend recognition server failed, please retry |
5002 | Backend recognition server failed, please retry |
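The table suggests a simple client-side policy: the 5xxx backend errors are explicitly marked "please retry", while 4xxx errors require a parameter, quota, or account fix. Treating 4006 (concurrency limit exceeded) as retryable after a backoff is an assumption, not something the table states:

```python
def should_retry(code: int) -> bool:
    """Per the error table: backend errors 5000-5002 warrant a retry."""
    return code in (5000, 5001, 5002)

def needs_backoff(code: int) -> bool:
    """4006: concurrency limit exceeded; assumed retryable only after backing off."""
    return code == 4006
```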