tencent cloud



Last updated: 2022-05-04 12:35:56


    After migrating data from HDFS to COS by using the hadoop distcp command, you can use the Hadoop-cos-DistChecker tool to verify the integrity of the migrated directory. Based on the parallel processing capabilities of MapReduce, it can quickly check the source directory against the destination directory.

    Operating Environment

    • Hadoop-cos v5.8.2 or later (for details, see the hadoop-cos releases)
    • Hadoop MapReduce runtime environment

    • If you are using a self-built Hadoop cluster, the Hadoop-cos dependency must be GitHub release 5.8.2 or later so that the CRC64 checksum can be obtained.
    • If you are using Tencent Cloud EMR, only clusters created after May 8, 2020 include a Hadoop-cos version that meets this requirement. For clusters created earlier, please contact us.


    Hadoop-cos-DistChecker reads the CRC64 checksum of each file from Hadoop-cos (the CosN file system), so set fs.cosn.crc64.checksum.enabled to true before running the tool. After the tool finishes, set the parameter back to false to stop retrieving CRC64 checksums.


    The CRC64 checksum in Hadoop-cos is not compatible with the CRC32C checksum in HDFS. Therefore, after using this tool, be sure to set fs.cosn.crc64.checksum.enabled back to false. Otherwise, Hadoop-cos may fail in scenarios that call the file system's getFileChecksum API.
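
    Before and after a run, it can help to confirm which value the client configuration actually reports. The following is a minimal sketch, not part of the tool: the function name crc64_checksum_enabled is hypothetical, and it assumes that the hdfs CLI is on PATH and reads the same configuration files as the job.

```shell
# Succeeds only if the client configuration reports the CRC64 switch as "true".
# Assumption: `hdfs getconf` reads the same configuration that the job will use.
crc64_checksum_enabled() {
  value=$(hdfs getconf -confKey fs.cosn.crc64.checksum.enabled 2>/dev/null) || return 1
  [ "$value" = "true" ]
}

# Example guard before submitting the check (not executed here):
#   crc64_checksum_enabled || { echo "set fs.cosn.crc64.checksum.enabled to true first" >&2; exit 1; }
```

    The same function can be run again after the check to confirm that the parameter is back to false.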

    Parameter Description

    • Source file list: a list of subdirectories and files obtained by running the following command:

      hadoop fs -ls -R hdfs://host:port/{source_dir} | awk '{print $8}' > check_list.txt

      The resulting check_list.txt contains one source path per line.

    • Source directory: the directory where the source files are stored; it usually serves as the source path for data migration through the distcp command. For example, hdfs://host:port/source_dir is the source directory in the following sample:

      hadoop distcp hdfs://host:port/source_dir cosn://examplebucket-appid/dest_dir

      It is also the common parent directory of the paths in the source file list, for example /benchmarks.

    • Destination directory: the destination directory to check against.
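
    The list-generation command under "Source file list" keeps only the eighth column of the listing, so a path that contains spaces would be cut at the first space. The following is a minimal sketch of a more tolerant filter, assuming the usual eight-column layout of hadoop fs -ls -R (permissions, replication, owner, group, size, date, time, path). The function name extract_paths is hypothetical, and runs of several spaces inside a path are collapsed to one. Whether Hadoop-cos-DistChecker accepts such paths is not stated here.

```shell
# Reads `hadoop fs -ls -R` output on stdin and prints one full path per line.
# Fields 1-7 are permissions, replication, owner, group, size, date and time;
# everything from field 8 onward is treated as the path.
extract_paths() {
  awk 'NF >= 8 {
    path = $8
    for (i = 9; i <= NF; i++) path = path " " $i   # rejoin paths split on spaces
    print path
  }'
}

# Usage, with the same placeholders as the command above (not executed here):
#   hadoop fs -ls -R hdfs://host:port/{source_dir} | extract_paths > check_list.txt
```

    Lines with fewer than eight fields, such as blank lines, are skipped rather than written to the list as empty entries.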

    Command Line Syntax

    Hadoop-cos-DistChecker is a MapReduce program and is submitted like any other MapReduce job:

    hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App <Absolute path of the source file list> <Absolute path representation of the source directory> <Absolute path representation of the destination directory> [optional parameters]

    The optional parameters are Hadoop's own options.


    The example below describes how to use this tool by checking hdfs:// against cosn://hdfs-test-1250000000/benchmarks.

    First, run the following command:

    hadoop fs -ls -R hdfs:// | awk '{print $8}' > check_list.txt

    This exports all the source paths to be checked to check_list.txt, which serves as the source file list.

    Then, run the following command to upload check_list.txt to HDFS:

    hadoop fs -put check_list.txt hdfs://

    Run the Hadoop-cos-DistChecker to check hdfs:// against cosn://hdfs-test-1250000000/benchmarks, and output the result to the cosn://hdfs-test-1250000000/check_result path by using the following command:

    hadoop jar hadoop-cos-distchecker-2.8.5-1.0-SNAPSHOT.jar com.qcloud.cos.hadoop.distchecker.App hdfs:// hdfs:// cosn://hdfs-test-1250000000/benchmarks cosn://hdfs-test-1250000000/check_result

    Hadoop-cos-DistChecker reads the source file list and the source directory, runs a MapReduce job to perform a distributed check, and writes the final check result to cosn://hdfs-test-1250000000/check_result.

    The check report is as follows:

    hdfs://       hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://    hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://  hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_0,CRC64,1566310986176587838,1566310986176587838,SUCCESS,'The source file and the target file are the same.'
    hdfs://  hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_control/in_file_test_io_1,CRC64,-6584441696534676125,-6584441696534676125,SUCCESS,'The source file and the target file are the same.'
    hdfs://       hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_0,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_data/test_io_1,CRC64,3534425600523290380,3534425600523290380,SUCCESS,'The source file and the target file are the same.'
    hdfs://      hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write,None,None,None,SUCCESS,'The source file and the target file are the same.'
    hdfs://     hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/_SUCCESS,CRC64,0,0,SUCCESS,'The source file and the target file are the same.'
    hdfs://   hdfs://,cosn://hdfs-test-1250000000/benchmarks/TestDFSIO/io_write/part-00000,CRC64,-4804567387993776854,-4804567387993776854,SUCCESS,'The source file and the target file are the same.'

    Check Report Format

    The check report is in the following format:

    Source file path in `check_list.txt`, absolute path of the source file, absolute path of the destination file, checksum algorithm, checksum of the source file, checksum of the destination file, check result, result description

    There are 7 check results:

    • SUCCESS: The source and destination files exist and are the same.
    • MISMATCH: The source and destination files exist but are different.
    • UNCONFIRM: The system cannot determine whether the source and destination files are the same. This may be because the destination file already existed in COS before the CRC64 feature was launched, and thus its CRC64 checksum cannot be obtained.
    • UNCHECKED: The check is not performed. This is mainly because the source file cannot be read, or its checksum cannot be obtained.
    • SOURCE_FILE_MISSING: The source file does not exist.
    • TARGET_FILE_MISSING: The destination file does not exist.
    • TARGET_FILESYSTEM_ERROR: The destination file system is not CosN.
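
    For a large report, tallying the results and listing only the problem entries is usually quicker than reading every line. The following is a minimal sketch, assuming the comma-separated layout described under "Check Report Format" and paths that contain no commas. The function names summarize_report and failed_entries are hypothetical helpers, not part of the tool.

```shell
# Reads check report lines on stdin and prints "<result> <count>" per result.
# The result is taken as the sixth comma-separated field, counted from the left,
# so commas inside the trailing description do not shift it.
summarize_report() {
  awk -F',' 'NF >= 7 { count[$6]++ }
    END { for (r in count) print r, count[r] }' | sort
}

# Prints only the report lines whose result is not SUCCESS.
failed_entries() {
  awk -F',' 'NF >= 7 && $6 != "SUCCESS"'
}
```

    With a local copy of the report (here hypothetically named report.txt), summarize_report < report.txt prints one line per result value, and failed_entries < report.txt prints the entries that need attention.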


    Why is there a negative CRC64 checksum in the check report?

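    A CRC64 checksum is a 64-bit value that can have up to 20 decimal digits when read as an unsigned integer. This exceeds the range of the Java long type, which is signed and holds at most 19 digits. The underlying bytes are the same in both cases, so a checksum above the long range is printed as a negative number.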
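
    To compare a negative value from the report with an unsigned CRC64 shown elsewhere, the value can be reinterpreted as unsigned. The following is a minimal sketch, assuming a shell whose printf wraps negative input modulo 2^64 for %u, as bash's builtin does. The function name crc64_to_unsigned is hypothetical.

```shell
# Prints the unsigned 64-bit decimal form of a signed 64-bit value.
# Assumption: printf's %u conversion wraps negative input modulo 2^64
# (true for bash's builtin printf; other shells may differ).
crc64_to_unsigned() {
  printf '%u\n' "$1"
}

crc64_to_unsigned -6584441696534676125   # value taken from the sample report
```

    For the sample value this prints 11862302377174875491, which is 2^64 minus 6584441696534676125. Non-negative checksums are printed unchanged.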
