Data integration, processing, and dumping all run concurrently as multiple subtasks (workers) at the underlying layer. The default worker parallelism is 1. The system automatically checks for a message backlog and, if one exists, increases the number of running workers to raise throughput. Currently, the maximum number of workers equals the number of partitions in the Kafka topic. You cannot set the number of workers manually; the system adjusts it automatically.
CKafka Connector runs as tasks, and task performance depends on the service capabilities of the upstream and downstream components. For example, in a Kafka > CKafka Connector > Elasticsearch pipeline, CKafka Connector raises its processing capability by adjusting task parallelism as long as neither the upstream nor the downstream is a bottleneck; once a bottleneck is hit, data flow slows down. You can identify bottlenecks by viewing monitoring metrics and configuring backlog alarms.
Number of tasks
The maximum number of tasks per account is 200 by default. If you need more, submit a ticket to request a higher quota.
Number of connections
The maximum number of connections per account is 100 by default. If you need more, submit a ticket to request a higher quota.
Number of topics
The maximum number of topics per account is 200 by default. If you need more, submit a ticket to request a higher quota.
Number of schemas
The maximum number of schemas per account is 100 by default, and each schema can contain up to 100 fields. If you need more, submit a ticket to request a higher quota.
QPS for integration over HTTP
The maximum QPS per HTTP access point is 2,000 by default. If you need more, submit a ticket to request a higher quota.
Batch size of reported data
The maximum size of data reported in each batch over HTTP is 5 MB, and an error will be reported if this limit is exceeded.
Number of data records reported per batch
The maximum number of data records reported in each batch over HTTP is 500, and an error will be reported if this limit is exceeded.
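The two per-batch limits above (5 MB and 500 records) can be enforced on the client before data is reported. The following is a minimal sketch, not an official SDK: the function name `split_into_batches` is hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the batch size is approximated as the sum of the JSON-encoded record sizes, which may differ from how the service measures a request.

```python
import json
from typing import Iterable, Iterator, List

# Assumption: 1 MB = 1024 * 1024 bytes; the service may count differently.
MAX_BATCH_BYTES = 5 * 1024 * 1024
MAX_BATCH_RECORDS = 500


def split_into_batches(
    records: Iterable[dict],
    max_bytes: int = MAX_BATCH_BYTES,
    max_records: int = MAX_BATCH_RECORDS,
) -> Iterator[List[dict]]:
    """Yield lists of records that stay within both per-batch limits.

    Size is approximated as the sum of each record's UTF-8 encoded JSON
    length; the real request body may add overhead, so leave headroom.
    """
    batch: List[dict] = []
    batch_bytes = 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"single record of {size} bytes exceeds the batch limit")
        # Close the current batch if adding this record would break either limit.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch can then be sent as one HTTP report. Passing a `max_bytes` value below the documented limit is a simple way to leave room for request overhead.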
Task data loss or duplication in extreme cases
Data processing and data distribution tasks essentially create CKafka producers and consumers that produce and consume messages in the topic selected as the task's data source.
After a message is consumed successfully, that is, after the data has been delivered to the data target resource, the consumer group for the task (datahub-<task ID>) commits the offset of that message. However, if the task restarts due to an exception after the data is delivered but before the offset is committed, the data will be delivered to the target resource again. We therefore recommend that you make your business logic idempotent so that it tolerates such duplicates.
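One common way to make processing idempotent is to key each write on a unique identifier so that a redelivered record has no further effect. The sketch below assumes each record carries a unique `event_id` field, which is a hypothetical name and not something CKafka Connector adds for you. The in-memory dictionary stands in for a real target store with a unique key or upsert semantics.

```python
from typing import Dict


class IdempotentStore:
    """Applies each record at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        # In-memory stand-in for a durable store with a unique key constraint.
        self._rows: Dict[str, dict] = {}

    def apply(self, record: dict) -> bool:
        """Store the record; return False if it was already applied."""
        event_id = record["event_id"]  # hypothetical unique business key
        if event_id in self._rows:
            return False  # duplicate delivery: skip without side effects
        self._rows[event_id] = record
        return True

    def __len__(self) -> int:
        return len(self._rows)
```

With this pattern, a record that is delivered twice after a task restart is written once and ignored the second time.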
If a message has not been consumed successfully by the time the configured maximum message retention period elapses (under either the dynamic message retention policy or the manually configured topic-level message retention policy), the message is deleted. The task can no longer consume it, and its data will be missing from the data target.
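A rough way to judge this risk is to compare how long a newly produced message would wait behind the current backlog with the retention period. The sketch below is a simplified model that assumes a steady consumption rate. The function names and the `safety_factor` parameter are illustrative, and the input values would come from your own monitoring metrics.

```python
import math


def backlog_wait_seconds(backlog_messages: int, consume_rate_per_sec: float) -> float:
    """Estimate how long a new message waits before the task reaches it."""
    if backlog_messages <= 0:
        return 0.0
    if consume_rate_per_sec <= 0:
        return math.inf  # nothing is being consumed, so the wait is unbounded
    return backlog_messages / consume_rate_per_sec


def at_risk_of_expiry(
    backlog_messages: int,
    consume_rate_per_sec: float,
    retention_seconds: float,
    safety_factor: float = 0.8,
) -> bool:
    """Return True if the estimated wait approaches the retention period.

    safety_factor < 1 flags the risk before the retention limit is reached.
    """
    wait = backlog_wait_seconds(backlog_messages, consume_rate_per_sec)
    return wait >= retention_seconds * safety_factor
```

For example, a backlog of 1,000,000 messages consumed at 10 messages per second implies a wait of 100,000 seconds, which exceeds a 24-hour (86,400-second) retention period.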