Overview
Data Lake Compute (DLC) supports column-level data masking. You can use the data masking feature to associate masking rules for columns involving sensitive data, and you can configure a series of masking algorithms targeting different user groups to achieve refined masking applications based on roles. For example, for mobile number data, you may want to grant full access permissions to the users in the customer service group, grant only the permission to view the last 4 numbers to the users in the analysis group, and grant strict masking permissions that display as NULL to the users in the finance group.
Restrictions and Limitations
The current restrictions and limitations of the DLC data masking feature are as follows.
Supported Engine Types
Currently, only the following engine types with a kernel version later than October 1, 2024, are supported: Spark standard engine, SuperSQL Spark job engine, and SuperSQL SparkSQL engine. Among these, data desensitization for the SuperSQL SparkSQL engine is an allowlist feature. If you need to use it, submit a ticket for activation. For information on DLC engine categories, refer to Data Engine Introduction. Applicable Scope
1. The data masking feature will only take effect for DLC regular users. For DLC user types, see permission overview. 2. The configuration for a masking policy tag takes about 1 minute to take effect.
3. Only effective for the metadata under the default data catalog, DataLakeCatalog.
Special Restrictions
1. When the SuperSQL SparkSQL engine is used, views are not supported by masking policies.
2. Users or working groups should have the SELECT permission on databases and tables. Otherwise, queries by the user or working group on that table will return an error. For DLC permission types, refer to Sub-Account Permission Management. Supported Masking Methods
Currently, DLC supports the following data masking methods for column values:
Default Value
The default masking value for the column based on the column's data type is returned. Use this rule when you want to hide the value of the column but display the data type.
Supported data types: STRING, BINARY, INT, DECIMAL, BIGINT, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, DATE, ARRAY.
|
STRING | "" |
BINARY | [] |
INT | 0 |
DECIMAL | 0 |
BIGINT | 0 |
FLOAT | 0 |
DOUBLE | 0 |
BOOLEAN | false |
TIMESTAMP | 1970-01-01 08:00:00 |
DATE | 1970-01-01 |
ARRAY | [] |
Retaining the First 4 Characters
The first 4 characters of the column's value are returned, replacing the rest of the string with XXXXX. If the column's value is equal to or less than 4 characters in length, return the column's value after the value has been run through the SHA-256 hash function. You can only use this rule with columns that use the STRING data type.
Supported data type: STRING
|
abcd@example.com | abcdxxxxxxxxxx |
abc | ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad |
Retaining the Last 4 Characters
The last 4 characters of the column's value are returned, replacing the rest of the string with XXXXX. If the column's value is equal to or less than 4 characters in length, then return the column's value after the value has been run through the SHA-256 hash function. Use this rule with columns that use the STRING data type.
Supported data type: STRING
|
abcd@example.com | xxxxxxxxxx.com |
abc | ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad |
Hashing
The column's value is returned after the value has been run through the SHA-256 hash function. Use this rule when you want the end user to be able to use this column in a JOIN operation for a query. You can only use this rule with columns that use the STRING or BYTES data types.
Supported data types: STRING and BYTES
|
abcd@example.com | 3f7768839b5bcba43f589cc3af54efaea18bceb1df8b05a7dffaec3e7b43b269 |
Setting as NULL
NULL is returned regardless of the column value's data type. Use this rule when you want to hide both the value and the data type of the column.
Supported data types: Unlimited
Date Masking
Only the year part of a date string and defaults the month and date to 01/01 is displayed. Use this rule with columns that use the DATE and TIMESTAMP data types.
Supported data types: TIMESTAMP and DATE
|
2015-03-05T09:32:05.359 | 2015-01-01 00:00:00 |
No Masking
Plaintext of the column value is displayed, without being masked.
Supported data types: All
Masking Method Selection Recommendations
You can flexibly select the masking method for the column value based on the following recommendations:
|
Default Value | Column value is expected to be hidden but data type needs to be displayed to users. |
Retaining the first 4 characters | Plaintext of the column value is expected to be hidden but part of the characters for information confirmation needs to be displayed, for example, the first 4 characters of the customer's email address used by the customer service personnel for confirmation. |
Retaining the last 4 characters | Plaintext of the column value is expected to be hidden but part of the characters for information confirmation needs to be displayed, for example, the last 4 characters of the customer's mobile number used by the customer service personnel for confirmation. |
Hashing | This rule is used in scenarios where end users can use this column in a JOIN operation for a query or use this column for GROUP BY statistics. |
Setting as NULL | This rule is used in scenarios where the column value and its data type are expected to be hidden. |
Date masking | This rule is used for the scenario where only the year part is displayed and the rest of the date information is hidden, for example, the year of birth for confirmation. |
No masking | Recommended for users who need to use the plaintext. |
Data Masking Workflow
To configure DLC data desensitization, you can follow the steps below:
Overview of Masking Policy Tags
A masking policy tag refers to a tag customized by users and associated with columns containing sensitive data. In a masking policy tag, you can configure refined masking methods for multiple user groups. The following example shows the procedure to set a masking policy tag named "mobile number" and set the masking methods for 3 DLC workgroups:
Name of masking policy tag: Mobile number masking
Masking method configurations for workgroups:
|
Customer service personnel group | No masking |
Analysis personnel group | Displaying the last 4 characters |
Finance personnel group | Setting as NULL |
Effect:
After associating the above masking policy tag with two columns, such as Phone_number1 and Phone_number2, DLC users in the customer service personnel group get the plaintext when querying by Phone_number1 and Phone_number2, DLC users in the analysis personnel group get the result displaying as *********4320, and DLC users in the finance personnel group get the result displaying as NULL.
Masking Policy Execution Priority
Assuming a DLC user is added to multiple working groups, and different masking methods are associated with these different working groups, multiple masking policies may exist for a specific user. If conflicting masking methods exist for a user, the system will apply the method with the highest priority among the working groups of the user, where working groups higher in the list have a higher priority.
For example, in the masking policy tag above, if Zhang San belongs to both the analyst personnel group and the finance personnel group, the query for Zhang San will display data desensitized as, for example, *********4320.
Step 1: Creating a Masking Policy Tag
Creating a Masking Policy Tag
1. Go to the DLC console, choose Metadata Management > Data Desensitization, click Create Masking Policy Tag, and create a masking policy tag in the pop-up dialog box. 2. In Configure Masking Method, select the corresponding working groups to add them to the Selected dialog box, and configure a masking method for each working group.
3. After configuration, you can drag and drop working group modules to sort them from top to bottom, which determines the policy priority. Working groups positioned higher in the list have a higher priority. If no manual sorting is performed, policies will be applied based on the default top-down order.
4. Click OK to complete the creation.
Note:
1. The DLC data desensitization feature currently only takes effect for users within user groups that have been configured with a masking method. It does not take effect for user groups without a configured masking method, or for users not added to any user group.
2. To apply desensitization to all DLC users, it is recommended to:
2.1 Configure a masking method for all user groups.
2.2 Add all DLC users to a user group.
Step 2: Binding the Tag to a Data Column
2. Locate the field requiring desensitization, or search for the field name in the top-right search box. Then, in the Masking Policy Tag column for that field, click , and select the desired masking policy tag for binding from the dialog box. 3. If no matching masking policy tag is found, click Create Masking Policy Tag in the dialog box to quickly create one.
Note:
1. The various built-in DLC masking methods have field type restrictions. For example, date desensitization can only be associated with columns of TIMESTAMP or DATE data types. For details, refer to Masking Method Selection Recommendations.
2. A field can only be associated with one masking policy tag. However, a single masking policy tag can be reused across multiple data columns.
Effective Time:
After you configure a masking policy tag and bind it to a column, the configuration takes about 1 minute to take effect. During this period, users may still get plaintext in a query. Just wait a moment for the configuration to take effect.
Step 3: Performing a Query
After you bind a masking policy tag to a column requiring desensitization, the configuration is successful if the data type of the column supports all masking methods within that policy tag. Users see the masked results according to the masking method applied to their user groups when querying column data. The following uses a virtual case to further show the effects that the DLC data masking feature can achieve.
Case Description
Assume that company A has a customer list customer_list containing sensitive information, with detailed fields as follows:
|
123456789 | High | 45,600 | abc@example.com |
234567891 | Medium | 15,000 | bcd@example.com |
345678912 | Low | 2,000 | cde@example.com |
456789123 | Low | 1,000 | def@example.com |
There are 3 user groups in company A, including the customer service personnel group, the finance personnel group, and the analysis personnel group. Currently, company A hopes that the Mobile Number and Email Address fields containing users' sensitive PII information can only be used by the customer service personnel group, the Consumption Amount field and the Customer Level tag can only be used by the finance personnel group, but the analysis personnel can view the hashed value of the Customer Level to conduct statistical analysis of customer hierarchy.
Based on the above requirements, create the following 3 masking policy tags:
|
Customer service personnel group: No masking Finance personnel group: NULL Analysis personnel group: NULL | Customer service personnel group: NULL Finance personnel group: No masking Analysis personnel group: NULL | Customer service personnel group: NULL Finance personnel group: No masking Analysis personnel group: Hashing |
Assume that after the masking policy tags are bound to the corresponding columns as per the above table,
1. For a user existing only within a specific working group, running SELECT * FROM customer_list returns the following results:
Customer service personnel group: This group has been granted a rule of no desensitization for contact information. The following results will be returned:
|
123456789 | NULL | NULL | abc@example.com |
234567891 | NULL | NULL | bcd@example.com |
345678912 | NULL | NULL | cde@example.com |
456789123 | NULL | NULL | def@example.com |
Finance personnel group: This group has been granted a rule of no desensitization for consumption amount and customer level. The following results will be returned:
|
NULL | High | 45,600 | NULL |
NULL | Medium | 15,000 | NULL |
NULL | Low | 2,000 | NULL |
NULL | Low | 1,000 | NULL |
Analyst personnel group: This group has been assigned hash desensitization for customer level. The following results will be returned. While the actual customer level is obscured, statistical analysis can still be performed using the hash value:
|
NULL | 4fa3c0d004d0750fc7bf8631993bd7c668fd33f8d089e0103ad8ef3fc1d9f4bb | 45,600 | NULL |
NULL | 35d8f8d59e2630de970e35271547d087278074addd61ce31940da69d82d19929 | 15,000 | NULL |
NULL | 49542bc83b9d59935686144f352b6acb2264992720d0dbe780be50b56b87fef7 | 2,000 | NULL |
NULL | 49542bc83b9d59935686144f352b6acb2264992720d0dbe780be50b56b87fef7 | 1,000 | NULL |
2. If a user exists in both the customer service personnel group and the finance personnel group, according to the top-down priority of masking rules, the effective masking rules for this user are as follows:
|
Customer service personnel group: No desensitization | Customer service personnel group: NULL | Customer service personnel group: NULL |
Running SELECT * FROM customer_list returns the following results:
|
123456789 | NULL | NULL | abc@example.com |
234567891 | NULL | NULL | bcd@example.com |
345678912 | NULL | NULL | cde@example.com |
456789123 | NULL | NULL | def@example.com |