Migrating HDFS Data to Metadata Acceleration-Enabled Bucket

Last updated: 2024-03-25 16:04:01

    Overview

    COS offers the metadata acceleration feature to provide high-performance file system capabilities. Metadata acceleration leverages the powerful metadata management capabilities of Cloud HDFS (CHDFS) at the underlying layer, allowing you to access COS with file system semantics. The system is designed to deliver up to 100 GB/s of bandwidth, more than 100,000 queries per second (QPS), and millisecond-level latency. Buckets with metadata acceleration enabled can be widely used in scenarios such as big data, high-performance computing, machine learning, and AI. For more information on metadata acceleration, see Metadata Acceleration Overview.
    COS provides Hadoop semantics through the metadata acceleration service, so you can use COSDistCp to easily migrate data in both directions between COS and other Hadoop file systems. This document describes how to use COSDistCp to migrate files from a local HDFS to a metadata acceleration-enabled bucket in COS.

    Environment Preparations Before Migration

    Migration tools

    1. Download the JAR packages of the tools listed below and place them in a local directory on the node running the migration task in the cluster, such as /data01/jars.
    EMR environment
    Installation notes: install the following plugins on the node running the migration task.
    JAR Filename | Description | Download Address
    cos-distcp-1.12-3.1.0.jar | COSDistCp package, used to copy data to COSN. | For more information, see COSDistCp.
    chdfs_hadoop_plugin_network-2.8.jar | OFS plugin | -
    Self-built environment such as Hadoop or CDH
    Software dependency: Hadoop 2.6.0 or later and Hadoop-COS 8.1.5 or later are required. The cos_api-bundle plugin version must match the Hadoop-COS version as described in Releases.
    Installation notes: install the following plugins in the Hadoop environment.
    JAR Filename | Description | Download Address
    cos-distcp-1.12-3.1.0.jar | COSDistCp package, used to copy data to COSN. | For more information, see COSDistCp.
    chdfs_hadoop_plugin_network-2.8.jar | OFS plugin | -
    Hadoop-COS | 8.1.5 or later | For more information, see Hadoop.
    cos_api-bundle | The version must match the Hadoop-COS version. | -
    Note:
    Hadoop-COS supports access to metadata acceleration buckets in the format of cosn://bucketname-appid/ starting from v8.1.5.
    The metadata acceleration feature can only be enabled during bucket creation and cannot be disabled once enabled. Therefore, carefully consider whether to enable it based on your business conditions. You should also note that legacy Hadoop-COS packages cannot access metadata acceleration buckets.
    2. Create a metadata acceleration bucket and configure the HDFS protocol for it as instructed in "Creating Bucket and Configuring the HDFS Protocol" in Using HDFS to Access Metadata Acceleration-Enabled Bucket.
    3. Modify the migration cluster's core-site.xml and distribute the configuration to all nodes. If you only need to migrate data, you don't need to restart the big data components. The required keys are listed below, followed by a sample configuration fragment.
    All of the following keys are configured in core-site.xml:
    Key | Value | Description
    fs.cosn.trsf.fs.ofs.impl | com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter | COSN implementation class, which is required.
    fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl | com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter | COSN implementation class, which is required.
    fs.cosn.trsf.fs.ofs.tmp.cache.dir | In the format of `/data/emr/hdfs/tmp/` | Temporary directory, which is required. It will be created on all nodes, so make sure there is sufficient space and that the permissions are correct.
    fs.cosn.trsf.fs.ofs.user.appid | `appid` of your COS bucket | Required.
    fs.cosn.trsf.fs.ofs.ranger.enable.flag | false | This key is required. Check that the value is `false`.
    fs.cosn.trsf.fs.ofs.bucket.region | Bucket region | This key is required. Valid values: eu-frankfurt (Frankfurt), ap-chengdu (Chengdu), and ap-singapore (Singapore).
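    For reference, a minimal core-site.xml fragment with these keys might look as follows. This is a sketch: the temporary directory, appid, and region values are placeholders that you must replace with your own.
    <property>
        <name>fs.cosn.trsf.fs.ofs.impl</name>
        <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
    </property>
    <property>
        <name>fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl</name>
        <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
    </property>
    <property>
        <!-- Temporary directory created on every node; ensure sufficient space and permissions -->
        <name>fs.cosn.trsf.fs.ofs.tmp.cache.dir</name>
        <value>/data/emr/hdfs/tmp/</value>
    </property>
    <property>
        <!-- Placeholder appid: replace with the appid of your COS bucket -->
        <name>fs.cosn.trsf.fs.ofs.user.appid</name>
        <value>1250000000</value>
    </property>
    <property>
        <name>fs.cosn.trsf.fs.ofs.ranger.enable.flag</name>
        <value>false</value>
    </property>
    <property>
        <!-- Placeholder region: replace with your bucket region, e.g. ap-singapore -->
        <name>fs.cosn.trsf.fs.ofs.bucket.region</name>
        <value>ap-singapore</value>
    </property>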
    4. You can verify the migration environment by accessing the metadata acceleration bucket over the private network as instructed in "Configuring Computing Cluster to Access COS" in Using HDFS to Access Metadata Acceleration-Enabled Bucket. Use the account that will submit the migration tasks to verify that COS can be accessed successfully.
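    For example, the following quick check (a sketch; replace {bucketname-appid} with your bucket) lists the bucket root over the cosn:// scheme. If it succeeds, the plugin and configuration are working:
    # List the bucket root; success confirms connectivity and configuration
    hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -ls cosn://{bucketname-appid}/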

    Existing Data Migration

    1. Determine the directories to be migrated

    Generally, the data stored in HDFS is migrated first. Select the directory to be migrated in the source HDFS cluster; the target path in COS needs to be the same as the source path.
    Suppose you need to migrate the HDFS directory hdfs:///data/user/target to cosn://{bucketname-appid}/data/user/target.
    To ensure that the files in the source directory remain unchanged during migration, use the HDFS snapshot feature to create a snapshot of the source directory (named after the current date):
    # Snapshottable directories cannot be nested, so disallow snapshots on the parent directory first
    hdfs dfsadmin -disallowSnapshot hdfs:///data/user/
    # Allow snapshots on the directory to be migrated
    hdfs dfsadmin -allowSnapshot hdfs:///data/user/target
    # Delete any existing snapshot with the same name (this step fails harmlessly on the first run)
    hdfs dfs -deleteSnapshot hdfs:///data/user/target {current date}
    # Create the snapshot that the migration will read from
    hdfs dfs -createSnapshot hdfs:///data/user/target {current date}
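    You can confirm that the snapshot exists by listing the .snapshot directory of the source path:
    hdfs dfs -ls hdfs:///data/user/target/.snapshot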
    If you don't want to create a snapshot, you can migrate directly from the source directory by setting --src to hdfs:///data/user/target in the COSDistCp command below.

    2. Use COSDistCp for migration

    Start a COSDistCp task to copy files from the source HDFS to the target COS bucket.
    A COSDistCp task is essentially a MapReduce job, and the printed MapReduce log shows whether the task was executed successfully. If the task fails, you can check the YARN page and submit the log or exception information to the COS team for troubleshooting. You can use COSDistCp to execute a migration task in the following steps: (1) Create a temporary directory. (2) Run a COSDistCp task. (3) Migrate failed files again.

    (1) Create a temporary directory

    hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -mkdir cosn://{bucketname-appid}/distcp-tmp

    (2) Run a COSDistCp task

    nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucketname-appid}/data/user/target --temp=cosn://{bucketname-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &
    The parameters are as detailed below. You can adjust their values as needed.
    --taskNumber=VALUE: Number of copy processes. Example: --taskNumber=10.
    --workerNumber=VALUE: Number of copy threads. COSDistCp creates a thread pool of this size in each copy process. Example: --workerNumber=4.
    --bandWidth: Maximum bandwidth for reading each migrated file (in MB/s). Default value: -1, which indicates no limit on the read bandwidth. Example: --bandWidth=10.
    --cosChecksumType=CRC32C: CRC32C is used by default, which requires the HDFS cluster to support COMPOSITE_CRC checksums (Hadoop 3.1.1 or later); otherwise, change this parameter to --cosChecksumType=CRC64. You can probe for support as shown below.
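    A quick way to probe for composite CRC support (a sketch; the file path is a placeholder for any existing HDFS file):
    # On Hadoop 3.1.1 or later this prints a COMPOSITE-CRC32C checksum;
    # older versions ignore the setting and print the default MD5-of-MD5 checksum
    hadoop fs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum hdfs:///data/user/target/{some file}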
    Note:
    The total bandwidth limit of a COSDistCp migration is calculated as taskNumber * workerNumber * bandWidth. For example, --taskNumber=10, --workerNumber=1, and --bandWidth=10 cap the total migration bandwidth at 10 x 1 x 10 = 100 MB/s. You can set workerNumber to 1, use the taskNumber parameter to control the number of concurrent migrations, and use the bandWidth parameter to control the bandwidth of each concurrent migration.
    When the copy operation ends, the task log outputs the copy statistics shown below. FILES_FAILED indicates the number of files that failed to migrate; if the FILES_FAILED counter is absent, all files were migrated successfully.
    CosDistCp Counters
    BYTES_EXPECTED=10198247
    BYTES_SKIPPED=10196880
    FILES_COPIED=1
    FILES_EXPECTED=7
    FILES_FAILED=1
    FILES_SKIPPED=5
    
    The specific statistics items in the output are as detailed below:
    Statistics Item | Description
    BYTES_EXPECTED | Total size (in bytes) to copy according to the source directory
    FILES_EXPECTED | Number of files to copy according to the source directory, including the directory itself
    BYTES_SKIPPED | Total size (in bytes) of files that can be skipped (same length or checksum value)
    FILES_SKIPPED | Number of source files that can be skipped (same length or checksum value)
    FILES_COPIED | Number of source files that are successfully copied
    FILES_FAILED | Number of source files that failed to be copied
    FOLDERS_COPIED | Number of directories that are successfully copied
    FOLDERS_SKIPPED | Number of directories that are skipped
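    For example, after the task finishes you can pull the counter block out of the log written by the commands above (./distcp.log):
    # A non-zero FILES_FAILED counter means some files must be migrated again
    grep -A 10 "CosDistCp Counters" ./distcp.log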

    3. Migrate failed files again

    Because COSDistCp skips files that were already copied successfully (--skipMode), you can migrate failed files by simply rerunning the task. In addition, the --delete parameter lets you guarantee full consistency between the HDFS and COS data.
    When using the --delete parameter, you must also add the --deleteOutput=/xxx (custom) parameter and must not add the --diffMode parameter.
    nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucketname-appid}/data/user/target --temp=cosn://{bucketname-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 --delete --deleteOutput=/dele-xx >> ./distcp.log &
    After execution, data that differs between HDFS and COS is moved to the trash directory, and the list of moved files is generated in the failed directory under the --deleteOutput path (/dele-xx/failed in this example). You can run hadoop fs -rm URL or hadoop fs -rm -r URL to delete the data in the trash directory.
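    For example, with --deleteOutput=/dele-xx as above, you can review the list of moved files before cleaning up:
    # Review the list of files that were moved to the trash directory
    hadoop fs -ls /dele-xx/failed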

    Incremental Migration

    If any incremental data needs to be migrated afterwards, simply repeat the full migration steps (a new snapshot plus a COSDistCp run) until all data has been migrated; files that already exist in COS with the same length and checksum are skipped. A sketch of one incremental round follows.
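    The following is a minimal sketch of one incremental round, reusing the commands above ({new date} is a placeholder for the new snapshot name):
    # Take a fresh snapshot of the source directory
    hdfs dfs -createSnapshot hdfs:///data/user/target {new date}
    # Rerun COSDistCp against the new snapshot; unchanged files are skipped by --skipMode=length-checksum
    nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{new date} --dest=cosn://{bucketname-appid}/data/user/target --temp=cosn://{bucketname-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &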