Tencent Cloud

Elastic MapReduce


HDFS Data Migration Using DistCp

Last updated: 2025-01-03 15:05:10
If you need to migrate your raw HDFS data to EMR, you can do so in either of two ways: migrate the data through the Tencent Cloud Object Storage (COS) service as a transit point, or migrate it with DistCp, a built-in Hadoop tool for large inter-cluster and intra-cluster copying. This document describes the second method.
DistCp (distributed copy) is a file migration tool that comes with Hadoop. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list. To use DistCp, your cluster and the EMR cluster must be connected over a network. To migrate data with DistCp, perform the following steps:
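Before running DistCp, it helps to confirm that the source cluster can actually reach the destination NameNode. A minimal sketch of such a check, assuming the EMR NameNode host is `nn2` and its RPC port is 9820 (both placeholders for your own values):

```shell
# Test whether the NameNode RPC port accepts TCP connections (exit code 0 on success).
nc -zv -w 5 nn2 9820

# Confirm that the source cluster can read the destination namespace over HDFS.
hadoop fs -ls hdfs://nn2:9820/
```

If either command fails, finish the network configuration in Step 1 before attempting the copy.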

Step 1. Configure a Network

Migrating local self-built HDFS files to EMR

Migrating local self-built HDFS files to an EMR cluster requires a direct connection for network connectivity. You can contact the Tencent Cloud technical team for assistance.

Migrating self-built HDFS files in CVM to EMR

If the CVM instance and the EMR cluster reside in the same VPC, files can be transferred directly.
Otherwise, a peering connection is required for network connectivity.

Using a peering connection

In the following example, IP CIDR block 1 is Subnet A (192.168.1.0/24) in VPC1 in Guangzhou, and IP CIDR block 2 is Subnet B (10.0.1.0/24) in VPC2 in Beijing.
1. Log in to the VPC console, enter the Peering Connections page, select the region Guangzhou at the top of the page, select VPC1, and click + Create.


2. On the peering connection creation page, configure the following fields:
Name: Enter a peering connection name, such as PeerConn.
Local region: Enter a local region, such as Guangzhou.
Local network: Enter a local network, such as VPC1.
Destination account type: Select the account of the peer network. If the two networks in Guangzhou and Beijing are under the same account, select My account; otherwise, select Other accounts.
Note:
If both the local and peer networks are in the same region (such as Guangzhou), the communication is free of charge, and you do not need to set a bandwidth cap. Otherwise, fees will be incurred, and you can set a bandwidth cap.
Peer region: Enter a peer region, such as Beijing.
Peer network: Enter a peer network, such as VPC2.


3. A peering connection between VPCs under the same account takes effect immediately after creation. If the VPCs are under different accounts, the peering connection takes effect only after the peer account accepts it. For details, see Creating Intra-account Peering Connection and Creating Cross-account Peering Connection.
4. Configure the local and peer route tables for the peering connection.
Log in to the VPC console and select Subnet to enter the subnet management page. Click the ID of the route table associated with the local subnet of the peering connection (such as Subnet A in Guangzhou VPC1) to enter the route table details page.


Click Add route policy.


Enter the destination CIDR block (such as 10.0.1.0/24 for VPC2 in Beijing), select Peering connections for the next hop type, and select the created peering connection (PeerConn) for the next hop.


The steps above configure the route table from Guangzhou VPC1 to Beijing VPC2. Repeat them to configure the route table from Beijing VPC2 to Guangzhou VPC1.
After the route tables are configured, IP CIDR blocks in different VPCs can communicate with each other.
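With both route tables in place, you can verify cross-VPC connectivity from a node in Subnet A before moving on to Step 2. The IP address below is a hypothetical CVM instance in Subnet B; substitute one of your own:

```shell
# From a host in Subnet A (192.168.1.0/24), test basic reachability of Subnet B.
ping -c 3 10.0.1.5

# Test that the peer NameNode's RPC port is open across the peering connection.
nc -zv -w 5 10.0.1.5 9820
```

If ping succeeds but the port check fails, review the security group and firewall rules on the peer instance.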

Step 2. Execute the Copy

# Copy the specified folder from one cluster to another
hadoop distcp hdfs://nn1:9820/foo/bar hdfs://nn2:9820/bar/foo

# Copy specified files (here /foo/a and /foo/b) to a destination directory
hadoop distcp hdfs://nn1:9820/foo/a hdfs://nn1:9820/foo/b hdfs://nn2:9820/bar/foo

# If there are too many files to list on the command line, use the -f option to read the source paths from a file.
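As a sketch of the -f form, the source URIs go into a list file, one per line (hostnames and paths below are hypothetical):

```shell
# Write one source URI per line to a local list file.
cat > /tmp/distcp_srclist <<'EOF'
hdfs://nn1:9820/foo/a
hdfs://nn1:9820/foo/b
hdfs://nn1:9820/foo/c
EOF

# DistCp reads the list from a filesystem URI, so upload it to HDFS first:
#   hadoop fs -put -f /tmp/distcp_srclist /tmp/distcp_srclist
#   hadoop distcp -f hdfs://nn1:9820/tmp/distcp_srclist hdfs://nn2:9820/bar/foo
echo "source list contains $(wc -l < /tmp/distcp_srclist) paths"
```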
Note:
For the commands above, the source and destination clusters must run the same Hadoop version; to copy between different major versions, use the webhdfs:// protocol instead of hdfs://.
If another client is still writing to a source file, the copy of that file will likely fail; if a source file is moved or deleted before it is copied, the copy fails with a FileNotFoundException. Likewise, overwriting a file that another client is writing at the destination will also fail.
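Because individual file copies can fail mid-run, a common pattern (standard DistCp options, not EMR-specific) is to rerun the job with -update, which copies only files that are missing or differ at the destination, and -p, which preserves file attributes:

```shell
# Rerun an interrupted copy: -update skips files that already match at the
# destination; -p preserves attributes such as replication factor, block size,
# ownership, and permissions.
hadoop distcp -update -p hdfs://nn1:9820/foo/bar hdfs://nn2:9820/bar/foo

# Afterwards, compare directory totals (dir count, file count, bytes) on both sides.
hadoop fs -count hdfs://nn1:9820/foo/bar hdfs://nn2:9820/bar/foo
```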

