Background Description
In TDMQ for Apache Pulsar, the maximum message size is 5 MB, and a message with an oversized body fails to be sent. To send larger messages, for example 20 MB, the client needs to compress them before sending.
Large Message Handling in TDMQ for Apache Pulsar
In TDMQ for Apache Pulsar, the maximum message size is 5 MB by default. If a producer attempts to send a message larger than 5 MB, the send fails. When the client needs to send a message that exceeds this limit, either of the following two methods can be used:
Chunking messages: TDMQ for Apache Pulsar provides the chunking messages feature. When the chunking mechanism is enabled, the client can automatically split large messages and ensure message integrity, and the consumer can automatically reassemble the messages.
Message compression: Identical character sequences in message data are replaced to reduce the message size. TDMQ for Apache Pulsar supports four compression algorithms: LZ4, ZLIB, ZSTD, and SNAPPY.
It is recommended that large messages be compressed.
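To make the chunking idea concrete, the following minimal sketch splits a payload into chunks no larger than a fixed limit and reassembles them in order. It is a plain-Java illustration of the concept only, not the Pulsar client's implementation; the class name `ChunkSketch`, its methods, and the 5 MB chunk size are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating chunking; not part of any Pulsar API.
class ChunkSketch {
    // Assumed chunk limit, matching the 5 MB default maximum message size.
    static final int MAX_CHUNK_BYTES = 5 * 1024 * 1024;

    // Split the payload into consecutive chunks of at most maxChunkBytes each.
    static List<byte[]> split(byte[] payload, int maxChunkBytes) {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < payload.length; start += maxChunkBytes) {
            int end = Math.min(payload.length, start + maxChunkBytes);
            chunks.add(Arrays.copyOfRange(payload, start, end));
        }
        return chunks;
    }

    // Concatenate the chunks in order to rebuild the original payload.
    static byte[] reassemble(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[20 * 1024 * 1024]; // a 20 MB message body
        List<byte[]> chunks = split(payload, MAX_CHUNK_BYTES);
        System.out.println("chunks: " + chunks.size());
        System.out.println("intact: " + Arrays.equals(payload, reassemble(chunks)));
    }
}
```

In the Pulsar Java client itself, chunking is switched on at the producer with `ProducerBuilder.enableChunking(true)`, which requires batching to be disabled; the splitting and reassembly are then handled by the client rather than by application code.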
Compression Algorithm Analysis and Comparison
Algorithm Introduction
LZ4
LZ4 is a lossless data compression algorithm that delivers extremely fast compression and decompression speeds with minimal CPU consumption.
ZLIB
ZLIB is a widely used lossless data compression library that reduces the size of sent and received data, improving network transmission efficiency. It implements the DEFLATE format, which combines LZ77 (a Lempel-Ziv algorithm) with Huffman coding, and on compressible data it can often reduce the data to less than half its original size.
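As an illustration of what ZLIB compression and decompression look like in practice, here is a minimal round-trip sketch using the JDK's built-in `java.util.zip.Deflater` and `Inflater`, which implement the ZLIB format. The class name `ZlibRoundTrip` and the sample payload are assumptions for illustration; this is not the code path the Pulsar client uses internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper showing a ZLIB round trip with java.util.zip.
class ZlibRoundTrip {
    // Compress the input at the default level into a ZLIB stream.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a complete ZLIB stream back into the original bytes.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported ZLIB stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid ZLIB stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Build a repetitive sample body; repetitive text compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("hello pulsar ");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(original);
        System.out.println(original.length + " -> " + compressed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(original, decompress(compressed)));
    }
}
```

With the Pulsar Java client, application code does not normally call the codec directly: compression is selected on the producer with `ProducerBuilder.compressionType(CompressionType.ZLIB)`, and the consumer decompresses transparently.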
ZSTD
ZSTD (Zstandard) is a lossless compression algorithm that combines LZ77-style matching with entropy coding, including Huffman coding. It compresses different types of data efficiently and is fast enough for real-time use. Compared with the other algorithms listed here, ZSTD achieves a higher compression ratio while keeping a good compression speed.
SNAPPY
SNAPPY is a lossless compression algorithm based on the LZ77 principle: when a string repeats an earlier one in the data stream, the repeat is replaced with a shorter code that refers back to it, which reduces the size of the data stream.
Algorithm Comparison
Algorithm | Compression ratio | Compression speed | Decompression speed |
ZLIB 1.2.11-1 | 2.743 | 110 MB/s | 400 MB/s |
LZ4 1.8.1 | 2.101 | 750 MB/s | 3,700 MB/s |
ZSTD 1.3.4-1 | 2.877 | 470 MB/s | 1,380 MB/s |
SNAPPY 1.1.4 | 2.091 | 530 MB/s | 1,800 MB/s |
Throughput: LZ4 > SNAPPY > ZSTD > ZLIB
Compression ratio: ZSTD > ZLIB > LZ4 > SNAPPY
Network bandwidth consumption: The SNAPPY algorithm consumes the most network bandwidth, and the ZSTD algorithm consumes the least.
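As a rough reading of the comparison table, the sketch below estimates the compressed size (input size divided by compression ratio) and the compression time (input size divided by compression speed) for a 20 MB message. The class name `CompressionEstimate` is an assumption for illustration, and the figures are the benchmark values from the table, so real results depend on the message content and hardware.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical calculator; ratios and speeds are copied from the comparison table.
class CompressionEstimate {
    // Estimated compressed size in MB for a given input size and compression ratio.
    static double compressedSizeMb(double inputMb, double ratio) {
        return inputMb / ratio;
    }

    // Estimated compression time in milliseconds at a given throughput in MB/s.
    static double compressTimeMs(double inputMb, double speedMbPerSec) {
        return inputMb / speedMbPerSec * 1000.0;
    }

    public static void main(String[] args) {
        double inputMb = 20.0;
        // algorithm -> {compression ratio, compression speed in MB/s}
        Map<String, double[]> table = new LinkedHashMap<>();
        table.put("ZLIB", new double[] {2.743, 110});
        table.put("LZ4", new double[] {2.101, 750});
        table.put("ZSTD", new double[] {2.877, 470});
        table.put("SNAPPY", new double[] {2.091, 530});
        for (Map.Entry<String, double[]> row : table.entrySet()) {
            double ratio = row.getValue()[0];
            double speed = row.getValue()[1];
            System.out.printf("%-6s about %.2f MB after compression, about %.0f ms to compress%n",
                    row.getKey(), compressedSizeMb(inputMb, ratio), compressTimeMs(inputMb, speed));
        }
    }
}
```

At these benchmark ratios a 20 MB message would come out at roughly 7 to 10 MB with any of the four algorithms, still above the 5 MB limit. Whether compression is enough therefore depends on how repetitive the actual message body is.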
Test of Various Compression Algorithms
Test Results
Note:
The following test results are for reference only. The compression effect needs to be verified based on the specific message body content.
5 MB | Random message body | LZ4 (threshold: 5 MB) | 9.95 MB | 31 ms | 0.049 ms |
5 MB | Random message body | ZLIB | 7.26 MB | 31 ms | 0.038 ms |
5 MB | Random message body | ZSTD | 8.20 MB | 31 ms | 0.039 ms |
5 MB | Random message body | SNAPPY (threshold: 5 MB) | 9.70 MB | 33 ms | 0.046 ms |
6 MB | Random message body | ZLIB (threshold: 6 MB) | 8.71 MB | 35 ms | 0.044 ms |
6 MB | Random message body | ZSTD (threshold: 6 MB) | 9.84 MB | 35 ms | 0.046 ms |
20 MB | Same message body | LZ4 | 0.16 MB | 41 ms | 0.006 ms |
20 MB | Same message body | ZLIB | 0.20 MB | 42 ms | 0.006 ms |
20 MB | Same message body | ZSTD | 0.01 MB | 42 ms | 0.003 ms |
20 MB | Same message body | SNAPPY | 2.47 MB | 41 ms | 0.021 ms |
40 MB | Same message body | LZ4 | 0.32 MB | 123 ms | 0.008 ms |
40 MB | Same message body | ZLIB | 0.39 MB | 122 ms | 0.008 ms |
40 MB | Same message body | ZSTD | 0.01 MB | 124 ms | 0.004 ms |
40 MB | Same message body | SNAPPY | 4.95 MB | 123 ms | 0.036 ms |
80 MB | Same message body | LZ4 | 0.63 MB | 241 ms | 0.009 ms |
80 MB | Same message body | ZLIB | 0.39 MB | 244 ms | 0.01 ms |
80 MB | Same message body | ZSTD | 0.01 MB | 243 ms | 0.004 ms |
80 MB | Same message body | SNAPPY (threshold: 80 MB) | 9.9 MB | 243 ms | 0.056 ms |
160 MB | Same message body | LZ4 | 1.26 MB | 484 ms | 0.013 ms |
160 MB | Same message body | ZLIB | 1.56 MB | 479 ms | 0.016 ms |
160 MB | Same message body | ZSTD | 0.03 MB | 481 ms | 0.004 ms |
320 MB | Same message body | LZ4 | 2.5 MB | 1,035 ms | 0.03 ms |
320 MB | Same message body | ZLIB | 3.1 MB | 1,008 ms | 0.027 ms |
320 MB | Same message body | ZSTD | 0.03 MB | 949 ms | 0.004 ms |
585 MB | Same message body | LZ4 | 4.59 MB | 1,705 ms | 0.027 ms |
585 MB | Same message body | ZLIB | 5.67 MB | 1,733 ms | 0.03 ms |
585 MB | Same message body | ZSTD | 0.11 MB | 1,722 ms | 0.006 ms |
Summary:
In purely random data streams, none of the four algorithms compresses efficiently: once the message size exceeds 5 MB, none of them can bring the message below 5 MB.
In data streams with a lot of duplicate data, all four algorithms achieve high compression ratios. In these tests, LZ4, ZLIB, and ZSTD compressed messages of up to about 600 MB down to roughly 5 MB or less.
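The contrast between random and repetitive message bodies can be reproduced locally before choosing a strategy. The sketch below measures the ZLIB-compressed size of a payload with the JDK's `java.util.zip.Deflater` and checks it against the 5 MB limit. The class name `FitsAfterCompression`, the payload sizes, and the use of random bytes (rather than whatever random body the tests above used) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical check: does a message body fit the 5 MB limit after ZLIB compression?
class FitsAfterCompression {
    static final long LIMIT_BYTES = 5L * 1024 * 1024;

    // Return the ZLIB-compressed size of the input without keeping the output.
    static long compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // True if the compressed body is within the assumed 5 MB limit.
    static boolean fitsLimit(byte[] input) {
        return compressedSize(input) <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        byte[] random = new byte[6 * 1024 * 1024];      // 6 MB of random bytes
        new Random(42).nextBytes(random);
        byte[] repetitive = new byte[20 * 1024 * 1024]; // 20 MB of one repeated byte
        Arrays.fill(repetitive, (byte) 'a');

        System.out.println("random fits: " + fitsLimit(random));
        System.out.println("repetitive fits: " + fitsLimit(repetitive));
    }
}
```

Running a check like this against a sample of real message bodies gives a quick indication of whether compression alone is sufficient or whether chunking is also needed.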
Message Compression Demo and Usage Test
For details about the message compression demo, see tdmq-sdk-Demo.
Usage Test
Producer-side calling parameters:
java -jar tdmq-sdk-demo-1.0-SNAPSHOT-jar-with-dependencies.jar pulsar://xxxx:6650
eyJrZXlJZCI6ImRlZmF1bHRfa2V5SWQiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJzdXBlcnVzZXIifQ.dYcCfp4XrdWRKdKaWylobY-_xEExfRCi1pMvNyZXbqU
pulsar-78ra8ownxb7d/BigMSGSpace/BigMSGTopic subname 1 500 0 1 20480 1 0
Consumer-side calling parameters:
java -jar tdmq-sdk-demo-1.0-SNAPSHOT-jar-with-dependencies.jar pulsar://xxxx:6650
eyJrZXlJZCI6ImRlZmF1bHRfa2V5SWQiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJzdXBlcnVzZXIifQ.dYcCfp4XrdWRKdKaWylobY-_xEExfRCi1pMvNyZXbqU
pulsar-92d7w2mjwmv9/BigMessSpace/BigMessTopic subname 1 500 1