Token usage per minute
Tokens Per Minute (TPM): the upper limit on the total number of tokens (input plus output) that a service can process within one minute. It is a key quota metric that caps service throughput.
RPM
Requests Per Minute (RPM): the upper limit on the number of independent requests (API calls) that a service can process within one minute. It is a key quota metric that caps the rate at which the service accepts requests.
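To make the interaction between the two quotas concrete, here is a minimal sketch of a client-side check that admits a request only if it fits under both the RPM and the TPM limit. It assumes a sliding 60-second window; an actual service may count with a fixed window, a token bucket, or some other scheme. The class name QuotaWindow and its parameters are hypothetical and do not come from any real API.

```python
from collections import deque


class QuotaWindow:
    """Illustrative sliding-window check for RPM and TPM limits (hypothetical helper)."""

    def __init__(self, rpm_limit, tpm_limit, window_seconds=60.0):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        # One (timestamp, tokens) entry per accepted request, oldest first.
        self.events = deque()

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            self.events.popleft()

    def try_accept(self, now, tokens):
        """Return True and record the request if it fits under both quotas."""
        self._evict(now)
        requests_used = len(self.events)
        tokens_used = sum(t for _, t in self.events)
        if requests_used + 1 > self.rpm_limit:
            return False  # would exceed RPM
        if tokens_used + tokens > self.tpm_limit:
            return False  # would exceed TPM (input + output tokens)
        self.events.append((now, tokens))
        return True
```

A request can be rejected by either limit independently: many small requests exhaust RPM first, while a few very long requests exhaust TPM first.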
Per-output Token latency
Time Per Output Token (TPOT): the average time the model takes to generate each output token after the first one. The first token is excluded because its latency is measured separately as Time To First Token (TTFT). This metric determines how smooth the "streaming output" described below feels.
First Token Latency
Time To First Token (TTFT): the time from when a user sends a complete request to when the model returns the first token. This metric directly affects how responsive the service feels to users.
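Both latency metrics can be derived from timestamps recorded on the client. The following sketch assumes you have the time the request was sent and the arrival time of each output token, all in seconds on the same clock; the function name latency_metrics is hypothetical. TPOT is taken as the time between the first and last token divided by the number of tokens after the first.

```python
def latency_metrics(request_sent_at, token_arrival_times):
    """Return (ttft, tpot) in seconds from client-side timestamps.

    tpot is None when only one token was produced, because there are
    no subsequent tokens to average over.
    """
    if not token_arrival_times:
        raise ValueError("at least one output token is required")

    # TTFT: request sent -> first token received.
    ttft = token_arrival_times[0] - request_sent_at

    # TPOT: average gap per token, excluding the first token.
    n = len(token_arrival_times)
    if n == 1:
        tpot = None
    else:
        tpot = (token_arrival_times[-1] - token_arrival_times[0]) / (n - 1)
    return ttft, tpot
```

For example, a request sent at t = 10.0 s whose five tokens arrive at 10.5, 10.75, 11.0, 11.25, and 11.5 s has a TTFT of 0.5 s and a TPOT of 0.25 s.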
Token
The basic unit in which large language models process text. In Chinese, a word, a character, or even a punctuation mark may be split into one or more tokens. Tokens are the core unit for measuring model processing volume and computational cost.
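As a rough illustration of how text maps to tokens, the sketch below splits a string by greedy longest match against a small vocabulary, falling back to single characters. This is a toy: production models use their own trained tokenizers (for example, byte-pair encoding), so real token counts will differ. The function toy_tokenize and the vocabulary are invented for this example.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match split against vocab; unknown characters become single tokens."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary, for illustration only.
vocab = {"语言", "模型", "token"}
print(toy_tokenize("大语言模型。", vocab))
print(len(toy_tokenize("tokens", vocab)))
```

With this vocabulary, the six-character string "大语言模型。" becomes four tokens: a single character, two two-character words, and a punctuation mark. The token count, rather than the character count, is what TPM quotas and cost are measured in.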