The data quality monitoring node monitors the quality of related data source tables (for example, detecting dirty data) through configured quality monitoring rules, and supports custom scheduling to periodically execute validation tasks. This document describes how to use the data quality monitoring node for task monitoring.
Usage Scenario
The Data Quality feature aims to promptly detect changes to source data and dirty data generated during the ETL process, automatically block problematic tasks, and prevent dirty data from spreading downstream. This effectively avoids decision-making deviations caused by data quality issues while reducing the time and resource consumption of task reruns. For details, see Data Quality.
Use Limits
Supported table types for monitoring: EMR (Hive/Iceberg/StarRocks), DLC, Doris, and TCHouse-D/P/X.
Only tables in data sources bound to the workspace where the current node (the data quality monitoring node) resides can be monitored.
Each node can monitor only one table, but multiple monitoring rules can be configured on it. Note: To monitor multiple tables, create multiple nodes.
Supported only in projects whose project mode is simple mode and whose scheduling mode is task scheduling.
Quality monitoring rules created in DataStudio can be run, modified, published, and otherwise managed only in data development. The rules can also be viewed in the Data Quality module, but they cannot trigger scheduled execution there, and related management operations are not allowed.
If you modify the monitoring rules configured in a data quality monitoring node and then publish the node, the monitoring rules originally generated by that node are replaced.
Prerequisites
A data source has been created and bound to the current project, and the table to be monitored has been created in that data source. The monitored table must exist before the quality monitoring task is executed.
A resource group has been created.
Steps
Step 1: Create a Data Quality Monitoring Node
1. Log in to the WeData Console, switch to the target region, and enter the offline development page. Click Offline Development > Orchestration Space in the left sidebar, and select the corresponding project in the dropdown list.
2. Right-click the target workflow and choose Create Node > Data Quality > Data Quality Monitoring.
3. In the node creation dialog box, enter the node name and click Confirm to complete the creation. You can then develop and configure the corresponding task in the node.
Step 2: Configure the Monitoring Object
Monitoring Object
On the Create Monitoring page, select in sequence the data source type, data source, database, and table to be monitored.
Step 3: Configure Execution Policies
| Parameter | Description |
| --- | --- |
| Execution Engine | Select Hive or Spark, depending on the purchased EMR resource. In general, Hive tables can directly use the Hive engine. |
| Computing Resource | Select the resource group in the EMR cluster. In general, you can directly select default. |
| Execution Resource | The scheduling resource group already bound to the project. |
Step 4: Configure Quality Monitoring Rules
Click Create New Rule above the rule list to open the rule creation dialog, where you can select rules for quality validation. Multiple rules can be added at once, and newly added rules are directly associated with the monitor.
| Parameter | Description |
| --- | --- |
| Rule Type | Options: system template, custom template, or custom SQL (if you select a rule template from the left tree, the selected template's parameters are displayed here by default). System template: WeData has 76 built-in rule templates that can be used for free, 20 of which are applicable to reasoning tables; for details on each template, see System Template Description. Custom template: you can add rules suited to your own business in the rule template menu for easy reuse; for detailed operation guidance, see Custom Template Description. Custom SQL: directly enter a SQL statement as the detection rule. |
| Monitoring Object | Monitored objects are divided into table level and field level (if you select a rule template from the left tree, the selected template's parameters are displayed here by default). At the table level, the number of rows and the table size (supported only for Hive tables) can be monitored. At the field level, whether the field is null, whether values are duplicated, and the mean, maximum, and minimum values can be monitored. |
| Select Template | WeData has 76 built-in rule templates that can be used for free (if you select a rule template from the left tree, the selected template's parameters are displayed here by default). |
| Detection Range | Choose full table or conditional scan. Full table: the quality rule verifies all data in the table. Conditional scan: the quality rule verifies only the range entered here, for example pt_date='${yyyy-MM-dd-1d}'. Note: fill in the partition field here to avoid a full table scan on every quality task run, which wastes computing resources. In the SQL, ${yyyy-MM-dd-1d} is a date variable representing one day before the execution date; it is replaced with a specific date when the quality task runs. For example, when the quality task runs at 2024-05-02 00:00:00, ${yyyy-MM-dd-1d} is replaced with 2024-05-01. |
| Trigger Condition | Comparison operator: select less than. Compare value: enter 1. "Number of table rows is less than 1" means that, combined with the time variable in the detection range, an alarm is triggered when no new data was added yesterday. Note: fill in the abnormal value as the trigger condition, that is, the condition under which alarms are triggered. |
| Trigger Level | Select Medium. Trigger levels are High, Medium, and Low. High: when an alarm is triggered, downstream task execution is blocked immediately (valid only when associated with production tasks). Medium: an alarm is triggered only. Low: no alarm is triggered; only abnormal results are displayed. |
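As a sketch of how the date variable in the detection range is resolved, the following Python snippet substitutes ${yyyy-MM-dd-1d} with the date one day before the execution date. The helper function and its variable-matching pattern are illustrative assumptions based on the example above, not a WeData API.

```python
import re
from datetime import date, timedelta

# Hypothetical helper: resolve date variables of the form ${yyyy-MM-dd-Nd}
# (execution date minus N days) in a detection-range expression.
def resolve_date_variables(expression: str, execution_date: date) -> str:
    def repl(match: re.Match) -> str:
        days_back = int(match.group(1))
        return (execution_date - timedelta(days=days_back)).isoformat()
    return re.sub(r"\$\{yyyy-MM-dd-(\d+)d\}", repl, expression)

detection_range = "pt_date='${yyyy-MM-dd-1d}'"
# A quality task executed at 2024-05-02 00:00:00 resolves the
# variable to the previous day, as described above.
print(resolve_date_variables(detection_range, date(2024, 5, 2)))
# pt_date='2024-05-01'
```

Because the variable is resolved at run time, the same conditional-scan expression always targets the previous day's partition, whichever day the scheduled task runs.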
In the rule list, you can set rule subscription information individually or in batches.
In the rule list, you can edit or delete rules.
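To make the row-count rule above concrete, the following self-contained Python sketch counts rows in yesterday's partition and applies the trigger condition "less than 1". It uses an in-memory SQLite table with a hypothetical table name, column names, and data; the real check runs on the selected execution engine, not SQLite.

```python
import sqlite3

# Hypothetical monitored table with a pt_date partition column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, pt_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-04-30"), (2, "2024-04-30")])

# Detection range limits the scan to yesterday's partition; no rows
# exist for 2024-05-01, so the count is 0.
row_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE pt_date = '2024-05-01'"
).fetchone()[0]

# Trigger condition: comparison operator "less than", compare value 1.
alarm_triggered = row_count < 1
print(row_count, alarm_triggered)
# 0 True
```

With the trigger level set to Medium, an outcome like this would raise an alarm only; at High, downstream tasks associated with the production task would also be blocked.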
Step 5: Configure Task Scheduling
If you need to periodically execute the created node task, click Scheduling Configuration on the right side of the editing page and configure the node task's scheduling information based on business needs. For configuration details, see Scheduling Setting.
Step 6: Save, Submit, Approve, and Run the Task
You can perform the following debugging operations as needed to check whether the task meets expectations.
1. Save and submit task.
2. Run the task. After the run completes, you can view the execution results at the bottom of the node editing page. If execution fails, troubleshoot accordingly.
3. Advanced run (optional). For example, if you want to modify the scheduling time at runtime, select advanced run.
4. Task approval (optional).
If you want the launch of quality nodes to be approved by a designated person before going live, log in to the WeData Console, enable task approval in Project Management > Basic Information Configuration > Approval Configuration, and select an approver.
Once enabled, the quality node goes through the approval workflow upon submission, and submission succeeds only after the approver's consent.
Note:
After approval is enabled, if you are the project administrator or you are the approver yourself, the approval page does not appear when you submit a task.
Subsequent Steps
Task operations: After the task is submitted and published, it runs periodically based on the node configuration. You can click Ops in the upper right corner of the node editing page to enter the Ops Center and view the scheduling and running status of periodic tasks (for example, node running status and trigger rule details). For more information, see Task Operations.
Data quality: After the data quality monitoring node is published, you can also access the Data Quality module to view the table's detailed monitoring page, but management operations such as modification and deletion are not allowed there. For more information, see Data Quality Monitoring List.