The data analysis feature of CDN helps users analyze traffic patterns by deeply examining vast amounts of log data. To optimize user experience, sampling-based statistical techniques are introduced in data analysis, ensuring both accuracy and timeliness of queries even when processing large datasets.
What is sampling-based statistics?
In data analysis, sampling refers to selecting a representative subset from the entire dataset for analysis, in order to extract valuable information. For example, when conducting a social survey, researchers cannot survey every single person; therefore, they select a portion of the population as a representative sample, using the responses from this sample to reflect the tendencies of the entire population.
Which metrics are sampled for statistics?
The CDN uses dynamic sampling to adapt to the different log volumes of different users while keeping data analysis accurate and efficient. For data analysis queries such as TOP URLs, TOP 100 client IPs, TOP 100 Referers, and TOP User Agents, sampling is applied when the domain's QPS falls in one of the following ranges:
QPS is in the range [10,000, 100,000), and the sampling rate is 10%
QPS is in the range [100,000, 1,000,000), and the sampling rate is 1%
QPS is in the range [1,000,000, +∞), and the sampling rate is 0.1%
The sampling strategy evaluates QPS over 5-minute intervals of log data. If the QPS of an interval meets one of the above conditions, sampling is applied to that interval; otherwise, no sampling occurs. For example:
If the domain's QPS (queries per second) reaches 10,000 in the 5-minute log data collected from 00:01 to 00:05, then 10% sampling is applied, meaning 10% of the log entries from that 5-minute interval are used for calculation.
If the domain's QPS reaches 100,000 in the 5-minute log data collected from 00:06 to 00:10, then 1% sampling is applied, meaning 1% of the log entries from that 5-minute interval are used for calculation.
If the domain's QPS is 5,000 in the 5-minute log data collected from 00:11 to 00:15, then no sampling is applied, and the calculation is based on all request logs.
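The thresholds above can be expressed as a small lookup. The following is only an illustrative sketch: the function name `sampling_rate` is hypothetical, and it takes the QPS of a 5-minute interval as given, since how that QPS value is computed is not specified here.

```python
def sampling_rate(qps: float) -> float:
    """Return the fraction of log entries used for one 5-minute interval.

    Thresholds follow the ranges listed above; 1.0 means no sampling.
    """
    if qps >= 1_000_000:
        return 0.001  # [1,000,000, +inf): 0.1% sampling
    if qps >= 100_000:
        return 0.01   # [100,000, 1,000,000): 1% sampling
    if qps >= 10_000:
        return 0.1    # [10,000, 100,000): 10% sampling
    return 1.0        # below 10,000: all request logs are used


# The three intervals from the example above.
for interval, qps in [("00:01-00:05", 10_000),
                      ("00:06-00:10", 100_000),
                      ("00:11-00:15", 5_000)]:
    print(interval, qps, sampling_rate(qps))
```

Because each range is closed on the left, a QPS of exactly 10,000 or 100,000 already falls into the higher sampling tier.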
Note:
The CDN continuously optimizes and adjusts its sampling strategy based on the scale of platform log data and users' actual needs. If you have any questions about the data analysis query results, please feel free to contact us.
How to use full data statistics?
If your business requires in-depth analysis of all log data, we recommend using the CDN's Real-time Logs feature. Real-time Logs can deliver detailed, complete log data to your designated log analysis system (such as Tencent Cloud CLS), allowing you to perform fine-grained processing on the complete dataset. In scenarios that require higher data precision, this gives you more accurate analysis results to support your business decisions.
Explanation of Data Representativeness
The CDN provides a unique identifier (Request ID) for each request log. The sampling system uses this unique identifier to perform sampling analysis on your data, ensuring the randomness of the sampling factor. Our tests show that when the features you need to analyze constitute a high percentage of the overall data, sampling analysis can provide you with fast and accurate results. However, we must also point out that when the features you need to analyze constitute a small percentage of the overall data, the results of the sampling analysis may be skewed due to the small sample size.
For example, suppose you have a dataset with 10,000 log entries containing three URL paths A, B, and C, with 7,000 (70%), 2,900 (29%), and 100 (1%) entries, respectively. In the ideal scenario, after 10% sampling, the sample sizes for URL paths A, B, and C would be 700, 290, and 10. However, because the sample for URL C is so small, estimates of the overall population based on it are much less accurate. In this case, the results of a drill-down analysis on URL C may not meet expectations.
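The effect described above can be reproduced with a short simulation. This is a sketch under stated assumptions, not the CDN's actual implementation: the request IDs are synthetic, the SHA-256 bucketing in `keep` is just one common way to sample deterministically by identifier, and `relative_std_error` assumes each entry is kept independently with probability equal to the sampling rate.

```python
import hashlib
import math


def keep(request_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all request IDs (assumed scheme)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def relative_std_error(true_count: int, rate: float) -> float:
    """Relative standard error of the scaled-up count under independent sampling."""
    return math.sqrt((1 - rate) / (true_count * rate))


RATE = 0.1
true_counts = {"A": 7000, "B": 2900, "C": 100}

# Build the 10,000 synthetic log entries as (request ID, URL path) pairs.
logs = [(f"req-{url}-{i}", url)
        for url, count in true_counts.items()
        for i in range(count)]

# Count the sampled entries per URL path.
sampled = {url: 0 for url in true_counts}
for request_id, url in logs:
    if keep(request_id, RATE):
        sampled[url] += 1

for url, count in true_counts.items():
    estimate = sampled[url] / RATE  # scale the sample back up to the population
    print(f"URL {url}: true={count}, sampled={sampled[url]}, "
          f"estimate={estimate:.0f}, "
          f"expected relative error={relative_std_error(count, RATE):.1%}")
```

Under these assumptions the expected relative error is about 3.6% for URL A and 5.6% for URL B, but 30% for URL C, which is why drill-down results for rarely occurring features can deviate noticeably from the full data.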