Regular Expression Extraction
Last updated: 2025-09-29 10:17:19
The data processing feature of the CKafka connector can extract message content with regular expressions. Extraction uses the open source regular expression package re2.
The standard Java regular expression package, java.util.regex, and other widely used regex engines such as PCRE, Perl, and Python (re) all use a backtracking implementation. Given a pattern with two alternatives, a|b, the engine first tries to match subpattern a. If that fails, it resets the input stream and tries subpattern b.
If such alternations are deeply nested, this strategy can require exponentially many passes over the input. With a long input string, the matching time can grow without practical bound.
In contrast, the RE2J algorithm uses a nondeterministic finite automaton to track all possible matches in a single pass over the input, so matching completes in time linear in the input length.
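To make the difference concrete, the following minimal sketch compares the two strategies on the toy pattern (a|aa)* matched against the whole input. It is an illustration only and not how RE2J is implemented; the function names backtrack_match and nfa_match are hypothetical.

```python
def backtrack_match(s):
    """Match ^(a|aa)*$ by trying each alternative and rewinding on failure."""
    calls = 0

    def match_from(i):
        nonlocal calls
        calls += 1
        if i == len(s):
            return True
        # Alternative 1: consume a single "a".
        if s[i] == "a" and match_from(i + 1):
            return True
        # Alternative 2: rewind to position i and consume "aa" instead.
        if s[i:i + 2] == "aa" and match_from(i + 2):
            return True
        return False

    return match_from(0), calls


def nfa_match(s):
    """Match the same pattern by tracking the set of live NFA states."""
    # State 0: between repetitions (accepting).
    # State 1: inside "aa" with one "a" already consumed.
    transitions = {(0, "a"): {0, 1}, (1, "a"): {0}}
    states = {0}
    steps = 0
    for ch in s:
        steps += 1
        next_states = set()
        for state in states:
            next_states |= transitions.get((state, ch), set())
        states = next_states
        if not states:
            break  # no live state left, so the match has failed
    return 0 in states, steps


text = "a" * 20 + "c"  # almost matches, which forces the worst case
bt_ok, bt_calls = backtrack_match(text)
nfa_ok, nfa_steps = nfa_match(text)
print("backtracking:", bt_ok, "recursive calls:", bt_calls)
print("state set:   ", nfa_ok, "input characters read:", nfa_steps)
```

Both functions reject the input, but the backtracking version revisits the same positions many times, while the state-set version reads each character at most once.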
Regular expression extraction in data processing is suited to extracting specific fields from long, array-type messages. The following sections describe how to use the regular expression auto generation feature provided by CKafka and present several common extraction patterns.

Regular Expression Auto Generation

Regular expression auto generation suits logs in which each line of text is one raw log entry and each entry can be split into multiple key-value pairs by a regular expression.
When configuring the single-line full regular expression mode, you first enter a log sample and then customize the regular expression. After configuration, the system extracts key-value pairs based on the capture groups in the regular expression.
The following sections describe in detail how to parse logs in single-line full regular expression mode.

Prerequisites

Assume that one raw log entry is:
2022-09-29 12:32:43.492 INFO  [RepositoryConfigurationDelegate:127][main]  - [TID: N/A] [TID: N/A] Bootstrapping Spring Data Elasticsearch repositories in DEFAULT mode.
The custom regular expression configured is:
(?<time>[0-9]{4}[-\/:\s\.][0-9]{2}[-\/:\s\.][0-9]{2}[-\/:T\s][0-9]{2}[-\/:\s\.][0-9]{2}[-\/:\s\.][0-9]{2}(?:[-\/:\s\.][0-9]+)?(?:[zZ]|(?:[\+-])(?:[01]\d|2[0-3]):?(?:[0-5]\d)?)?)\s(?<log>\w+\s+\[\w+:\w+\]\[\w+\]\s+-\s+\[\w+:\s+\w+/\w+\]\s+\[\w+:\s+\w+/\w+\]\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\.)
The system extracts one key-value pair per capture group, and you can customize the key name of each group. The result is as follows:
{"time":"2022-09-29 12:32:43.492",
"log":"INFO  [RepositoryConfigurationDelegate:127][main]  - [TID: N/A] [TID: N/A] Bootstrapping Spring Data Elasticsearch repositories in DEFAULT mode."}

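The extraction above can be reproduced locally with a short sketch. It assumes Python's re module as a stand-in for the engine CKafka uses, so the named groups are rewritten from (?<name>...) to Python's (?P<name>...) spelling. The pattern is written as a raw string with single backslashes, and the variable names are hypothetical.

```python
import re

# The pattern from the example, split across lines for readability.
JAVA_STYLE_PATTERN = (
    r"(?<time>[0-9]{4}[-\/:\s\.][0-9]{2}[-\/:\s\.][0-9]{2}[-\/:T\s]"
    r"[0-9]{2}[-\/:\s\.][0-9]{2}[-\/:\s\.][0-9]{2}(?:[-\/:\s\.][0-9]+)?"
    r"(?:[zZ]|(?:[\+-])(?:[01]\d|2[0-3]):?(?:[0-5]\d)?)?)"
    r"\s"
    r"(?<log>\w+\s+\[\w+:\w+\]\[\w+\]\s+-\s+\[\w+:\s+\w+/\w+\]\s+"
    r"\[\w+:\s+\w+/\w+\]\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\.)"
)

# Python spells named groups (?P<name>...). This simple replacement is safe
# here only because the pattern contains no lookbehind assertions.
PATTERN = re.compile(JAVA_STYLE_PATTERN.replace("(?<", "(?P<"))

SAMPLE = (
    "2022-09-29 12:32:43.492 INFO  [RepositoryConfigurationDelegate:127][main]"
    "  - [TID: N/A] [TID: N/A] Bootstrapping Spring Data Elasticsearch"
    " repositories in DEFAULT mode."
)

match = PATTERN.search(SAMPLE)
# groupdict() maps each named capture group to the text it matched.
fields = match.groupdict() if match else {}
for key, value in fields.items():
    print(f"{key}: {value}")
```

Each named capture group becomes one key, which mirrors how the key names are assigned per group in the configuration.
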
Operation Steps

1. On the data processing rule configuration page, enter a log sample in the original data, set the parsing mode to Regular Expression Extraction, and click Regular Expression Auto Generation under the parsing mode.
2. In the "Regular Expression Auto Generation" dialog that pops up, select the log content to extract as a key-value pair according to your retrieval and analysis requirements, enter the key name in the text box that appears, and click Confirm Extraction.
3. The system will automatically generate a regular expression for this content, and the extracted results will appear in the key-value table.
4. Repeat Step 2 until all key-value pairs are extracted.
5. Click Submit, and the system will automatically generate a complete regular expression based on the extracted key-value pairs.

Case 1: Extract Mobile Phone Field

Input message:
{"message":
[
{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
]
}
Target message:
{
"0": "\"phoneNumber\":\"13890000000\"",
"1": "\"phoneNumber\":\"15920000000\"",
"2": "\"phoneNumber\":\"18830000000\""
}
The regular expression used is:
"phoneNumber":"(13[0-9]|14[5|7]|15[0|1|2|3|5|6|7|8|9]|18[0|1|2|3|5|6|7|8|9])\d{8}"

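As a rough local check of Case 1, the sketch below applies the phone number pattern to the input message and numbers each full match from 0. It assumes Python's re module in place of the engine CKafka uses, and the index-keyed output is built by hand to mirror the target message; the variable names are hypothetical.

```python
import json
import re

# Pattern from Case 1, written as a Python raw string.
PHONE = re.compile(
    r'"phoneNumber":"(13[0-9]|14[5|7]|15[0|1|2|3|5|6|7|8|9]'
    r'|18[0|1|2|3|5|6|7|8|9])\d{8}"'
)

MESSAGE = (
    '{"message":['
    '{"email":"123456@qq.com","phoneNumber":"13890000000",'
    '"IDNumber":"130423199301067425"},'
    '{"email":"123456789@163.com","phoneNumber":"15920000000",'
    '"IDNumber":"610630199109235723"},'
    '{"email":"usr333@gmail.com","phoneNumber":"18830000000",'
    '"IDNumber":"42060219880213301X"}'
    ']}'
)

# group(0) is the whole match, including the "phoneNumber" key itself.
result = {
    str(index): match.group(0)
    for index, match in enumerate(PHONE.finditer(MESSAGE))
}
print(json.dumps(result, indent=2))
```

The same approach applies to Case 2 by swapping in the email pattern.
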
Case 2: Extract Email Field

Input message:
{"message":
[
{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
]
}
Target message:
{
"0": "\"email\":\"123456@qq.com\"",
"1": "\"email\":\"123456789@163.com\"",
"2": "\"email\":\"usr333@gmail.com\""
}
The regular expression used is:
"email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"

Case 3: Extract ID Card Field

Input message:
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message": "{\"email\":\"123456@qq.com\",\"phoneNumber\":\"13890000000\",\"IDNumber\":\"130423199301067425\"},{\"email\":\"123456789@163.com\",\"phoneNumber\":\"15920000000\",\"IDNumber\":\"610630199109235723\"},{\"email\":\"usr333@gmail.com\",\"phoneNumber\":\"18830000000\",\"IDNumber\":\"42060219880213301X\"}"
}
Target message, which preserves the outer fields and extracts each of the N IDNumber values from the message field into its own field:
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message.0": "130423199301067425",
"message.1": "610630199109235723",
"message.2": "42060219880213301X"
}
The regular expression used is:
[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]
This case runs the message through multiple processing links. The processing result of link 1 is:
[Figure: processing result of link 1]
At this point, the message field needs post-processing. The processing result of link 2 is as follows:
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message.0": "130423199301067425",
"message.1": "610630199109235723",
"message.2": "42060219880213301X"
}
Here, the required IDNumber values are extracted and the original message field is deleted. The outer fields that are still needed, such as operation, are retained together with the N values extracted from the message.
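The end-to-end effect of Case 3 can be sketched in a few lines. This is an approximation under stated assumptions, not the connector's implementation: it uses Python's re module, collapses the processing links into one hypothetical function named extract_id_numbers, and assumes the extracted values are keyed as message.N in match order.

```python
import json
import re

# Pattern from Case 3, written as a Python raw string.
ID_NUMBER = re.compile(
    r"[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))"
    r"(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]"
)

record = {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": (
        '{"email":"123456@qq.com","phoneNumber":"13890000000",'
        '"IDNumber":"130423199301067425"},'
        '{"email":"123456789@163.com","phoneNumber":"15920000000",'
        '"IDNumber":"610630199109235723"},'
        '{"email":"usr333@gmail.com","phoneNumber":"18830000000",'
        '"IDNumber":"42060219880213301X"}'
    ),
}


def extract_id_numbers(rec, field="message"):
    """Return a copy of rec with `field` replaced by one key per ID number."""
    # Keep every outer field except the one being parsed.
    result = {key: value for key, value in rec.items() if key != field}
    # finditer with group(0) yields whole matches; findall would instead
    # return tuples of the inner capture groups.
    for index, match in enumerate(ID_NUMBER.finditer(rec[field])):
        result[f"{field}.{index}"] = match.group(0)
    return result


processed = extract_id_numbers(record)
print(json.dumps(processed, indent=2))
```

Because the pattern contains capture groups for the century, month, and day, taking the whole match rather than the individual groups is what yields the complete 18-character value.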
