Migrating HDFS Data to Metadata Acceleration-Enabled Bucket

Last updated: 2024-03-25 16:04:01

Overview

COS offers the metadata acceleration feature to provide high-performance file system capabilities. Metadata acceleration leverages the powerful metadata management of Cloud HDFS (CHDFS) at the underlying layer, allowing you to access COS with file system semantics. The system is designed to deliver up to 100 GB/s of bandwidth, over 100,000 queries per second (QPS), and millisecond-level latency. Buckets with metadata acceleration enabled can be widely used in scenarios such as big data, high-performance computing, machine learning, and AI. For more information on metadata acceleration, see Metadata Acceleration Overview.
Because the metadata acceleration service gives COS Hadoop semantics, you can use COSDistCp to easily migrate data in both directions between COS and other Hadoop file systems. This document describes how to use COSDistCp to migrate files from a local HDFS cluster to a metadata acceleration-enabled bucket in COS.

Environment Preparations Before Migration

Migration tools

1. Download the JAR packages of the tools as listed below and place them in the local directory on the node running the migration task in the cluster, such as /data01/jars.
EMR environment

Installation notes:

JAR Filename | Description | Download Address
cos-distcp-1.12-3.1.0.jar | COSDistCp package, used to copy data to COSN. | For more information, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar | OFS plugin. | -

Self-built environment such as Hadoop or CDH

Software dependency: Hadoop 2.6.0 or later and Hadoop-COS 8.1.5 or later are required. The cos_api-bundle plugin version must match the Hadoop-COS version, as described in Releases.

Installation notes: Install the following plugins in the Hadoop environment:

JAR Filename | Description | Download Address
cos-distcp-1.12-3.1.0.jar | COSDistCp package, used to copy data to COSN. | For more information, see COSDistCp.
chdfs_hadoop_plugin_network-2.8.jar | OFS plugin. | -
Hadoop-COS | Version 8.1.5 or later. | For more information, see Hadoop.
cos_api-bundle | The version must match the Hadoop-COS version. | See Releases.

Note:
Starting from v8.1.5, Hadoop-COS supports access to metadata acceleration buckets in the format of cosn://bucketname-appid/.
The metadata acceleration feature can only be enabled during bucket creation and cannot be disabled afterwards, so consider carefully whether to enable it based on your business needs. Also note that legacy Hadoop-COS packages cannot access metadata acceleration buckets.
2. Create a metadata acceleration bucket and configure the HDFS protocol for it as instructed in "Creating Bucket and Configuring the HDFS Protocol" in Using HDFS to Access Metadata Acceleration-Enabled Bucket.
3. Modify the migration cluster's core-site.xml and distribute the configuration to all nodes. If you only need to migrate data, there is no need to restart any big data components.

Key | Value | Configuration File | Description
fs.cosn.trsf.fs.ofs.impl | com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter | core-site.xml | COSN implementation class. Required.
fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl | com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter | core-site.xml | COSN implementation class. Required.
fs.cosn.trsf.fs.ofs.tmp.cache.dir | A path such as `/data/emr/hdfs/tmp/` | core-site.xml | Temporary directory. Required. It will be created on all EMR nodes; ensure that it has sufficient space and the proper permissions.
fs.cosn.trsf.fs.ofs.user.appid | `appid` of your COS bucket | core-site.xml | Required.
fs.cosn.trsf.fs.ofs.ranger.enable.flag | false | core-site.xml | Required. Check that the value is `false`.
fs.cosn.trsf.fs.ofs.bucket.region | Bucket region | core-site.xml | Required. Valid values: eu-frankfurt (Frankfurt), ap-chengdu (Chengdu), and ap-singapore (Singapore).
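
Putting the table together, a minimal core-site.xml sketch might look as follows (the appid 1250000000 and region ap-singapore are placeholders; substitute your own appid, region, and temporary directory):

<property>
    <name>fs.cosn.trsf.fs.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<property>
    <name>fs.cosn.trsf.fs.AbstractFileSystem.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
    <name>fs.cosn.trsf.fs.ofs.tmp.cache.dir</name>
    <value>/data/emr/hdfs/tmp/</value>
</property>
<property>
    <name>fs.cosn.trsf.fs.ofs.user.appid</name>
    <!-- placeholder: replace with the appid of your COS bucket -->
    <value>1250000000</value>
</property>
<property>
    <name>fs.cosn.trsf.fs.ofs.ranger.enable.flag</name>
    <value>false</value>
</property>
<property>
    <name>fs.cosn.trsf.fs.ofs.bucket.region</name>
    <!-- placeholder: replace with your bucket region -->
    <value>ap-singapore</value>
</property>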
4. Verify the setup by accessing the metadata acceleration bucket over the private network, as instructed in "Configuring Computing Cluster to Access COS" in Using HDFS to Access Metadata Acceleration-Enabled Bucket. Run the check as the account that submits tasks in the migration cluster to verify that COS can be accessed successfully.
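For example, a quick connectivity check from a node in the migration cluster (examplebucket-1250000000 is a placeholder; use your own bucket name):

hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -ls cosn://examplebucket-1250000000/

If the command lists the bucket root without errors, the configuration works.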

Existing Data Migration

1. Determine the directories to be migrated

Generally, data stored in HDFS is migrated first. Select the directory to be migrated in the source HDFS cluster; the target path in COS must be the same as the source path.
Suppose you need to migrate the HDFS directory hdfs:///data/user/target to cosn://{bucketname-appid}/data/user/target.
To ensure that the files in the source directory remain unchanged during migration, use the HDFS snapshot feature to create a snapshot of the source directory (named after the current date):
hdfs dfsadmin -disallowSnapshot hdfs:///data/user/
hdfs dfsadmin -allowSnapshot hdfs:///data/user/target
hdfs dfs -deleteSnapshot hdfs:///data/user/target {current date}
hdfs dfs -createSnapshot hdfs:///data/user/target {current date}
A successful execution prints the path of the created snapshot (screenshot omitted).
If you don't want to create a snapshot, you can directly migrate the target files in the source directory.
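Before starting the copy, you can confirm that the snapshot is in place (a sketch; replace {current date} with the snapshot name you actually used, e.g. 20240325):

hdfs dfs -ls hdfs:///data/user/target/.snapshot/{current date}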

2. Use COSDistCp for migration

Start a COSDistCp task to copy files from the source HDFS to the target COS bucket.
A COSDistCp task is essentially a MapReduce task. The printed MapReduce task log shows whether the task executed successfully. If the task fails, you can check the YARN web UI and submit the logs or exception information to the COS team for troubleshooting. A migration with COSDistCp involves the following steps: (1) create a temporary directory; (2) run a COSDistCp task; (3) migrate failed files again.

(1) Create a temporary directory

hadoop fs -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar -mkdir cosn://{bucket-appid}/distcp-tmp

(2) Run a COSDistCp task

nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &
The parameters are as detailed below. You can adjust their values as needed.
--taskNumber=VALUE: Number of copy processes. Example: --taskNumber=10.
--workerNumber=VALUE: Number of copy threads per process. COSDistCp creates a thread pool of this size in each copy process. Example: --workerNumber=4.
--bandWidth=VALUE: Maximum bandwidth for reading each migrated file (in MB/s). Default value: -1, which indicates no limit on the read bandwidth. Example: --bandWidth=10.
--cosChecksumType=CRC32C: CRC32C is used by default, which requires the HDFS cluster to support the composite CRC checksum (Hadoop 3.1.1 or later); otherwise, change this parameter to --cosChecksumType=CRC64.
Note:
The formula for calculating the total bandwidth limit of COSDistCp migration is: taskNumber * workerNumber * bandWidth. You can set workerNumber to 1, use the taskNumber parameter to control the number of concurrent migrations, and use the bandWidth parameter to control the bandwidth of a single concurrent migration.
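For example, --taskNumber=10, --workerNumber=1, and --bandWidth=20 cap the overall migration speed at 10 × 1 × 20 = 200 MB/s.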
When the copy operation ends, the task log outputs copy statistics (counters) such as the following. FILES_FAILED indicates the number of failed files; if the FILES_FAILED counter is absent, all files have been migrated successfully.
CosDistCp Counters
BYTES_EXPECTED=10198247
BYTES_SKIPPED=10196880
FILES_COPIED=1
FILES_EXPECTED=7
FILES_FAILED=1
FILES_SKIPPED=5

The specific statistics items in the output result are as detailed below:

Statistics Item | Description
BYTES_EXPECTED | Total size (in bytes) to copy according to the source directory
FILES_EXPECTED | Number of files to copy according to the source directory, including the directory itself
BYTES_SKIPPED | Total size (in bytes) of files that can be skipped (same length or checksum value)
FILES_SKIPPED | Number of source files that can be skipped (same length or checksum value)
FILES_COPIED | Number of source files that are successfully copied
FILES_FAILED | Number of source files that failed to be copied
FOLDERS_COPIED | Number of directories that are successfully copied
FOLDERS_SKIPPED | Number of directories that are skipped

3. Migrate failed files again

To migrate files that failed, you can rerun the COSDistCp task: with --skipMode=length-checksum, files that were already copied successfully are skipped. In addition, COSDistCp allows you to use the --delete parameter to guarantee complete consistency between the HDFS and COS data.
When using the --delete parameter, you also need to add the --deleteOutput=/xxx parameter (a custom path) and must not use the --diffMode parameter.
nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/{current date} --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 --delete --deleteOutput=/dele-xx >> ./distcp.log &
After execution, data that differs between HDFS and COS is moved to a trash directory, and the list of moved files is generated in the /xxx/failed directory. You can run hadoop fs -rm URL or hadoop fs -rm -r URL to delete the data in the trash directory.
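For example, with --deleteOutput=/dele-xx as in the sample command above, you can review the list of moved files before purging anything (a sketch; the exact trash location depends on your configuration):

hadoop fs -ls /dele-xx/failed

Once you have confirmed that the differences are expected, remove the trash directory with hadoop fs -rm -r as described above.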

Incremental Migration

If any incremental data needs to be migrated afterwards, simply repeat the full-migration steps: because --skipMode compares file length and checksum, files already in COS are skipped and only new or changed data is copied. Repeat until all data has been migrated, as sketched below.
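A minimal re-run sketch, assuming the same paths as the commands above (the snapshot name 20240401 is a placeholder):

hdfs dfs -createSnapshot hdfs:///data/user/target 20240401
nohup hadoop jar /data01/jars/cos-distcp-1.12-3.1.0.jar -libjars /data01/jars/chdfs_hadoop_plugin_network-2.8.jar --src=hdfs:///data/user/target/.snapshot/20240401 --dest=cosn://{bucket-appid}/data/user/target --temp=cosn://{bucket-appid}/distcp-tmp/ --preserveStatus=ugpt --skipMode=length-checksum --checkMode=length-checksum --cosChecksumType=CRC32C --taskNumber 6 --workerNumber 32 --bandWidth 200 >> ./distcp.log &

Files already present in COS with matching length and checksum are counted as FILES_SKIPPED, so only the increment is copied.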
