Accessing Hudi Data with Hive

Last updated: 2024-10-30 11:43:08

Development Preparation

Make sure you have activated Tencent Cloud and created an EMR cluster. For more details, see Creating a Cluster.
When creating the EMR cluster, select the Hive, Spark, and Hudi components on the software configuration page.

Reading and Writing Hudi with Spark

Log in to the master node, switch to the hadoop user, and start Spark SQL with the HoodieSparkSessionExtension extension enabled to read and write data:
spark-sql --master yarn \
--num-executors 2 \
--executor-memory 1g \
--executor-cores 2 \
--jars /usr/local/service/hudi/hudi-bundle/hudi-spark3.3-bundle_2.12-0.13.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Note:
In this command, --master specifies the master URL, --num-executors the number of executors, --executor-memory the memory per executor, and --executor-cores the number of cores per executor. Adjust these values to your workload. The Hudi bundle passed to --jars differs between EMR versions, so check the /usr/local/service/hudi/hudi-bundle directory and use the JAR that is actually present there.
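Because the bundle version differs between EMR releases, one option is to resolve the JAR path at run time instead of hard-coding it. The following is a minimal sketch and not part of the original procedure: find_hudi_jar is a hypothetical helper, HUDI_BUNDLE_DIR is an illustrative variable name, and the glob patterns are assumptions based on the file names shown in this document.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with EMR): print the last file, in
# lexicographic order, whose name matches PATTERN directly under DIR.
# Returns non-zero when nothing matches.
# Note: lexicographic order is not a true version comparison; if several
# bundles are present, check the result by hand.
find_hudi_jar() {
  dir=$1
  pattern=$2
  # -maxdepth is supported by GNU and BSD find, but is not strictly POSIX.
  jar=$(find "$dir" -maxdepth 1 -type f -name "$pattern" | sort | tail -n 1)
  [ -n "$jar" ] && printf '%s\n' "$jar"
}

# Default bundle directory taken from this document; override if yours differs.
HUDI_BUNDLE_DIR=${HUDI_BUNDLE_DIR:-/usr/local/service/hudi/hudi-bundle}
```

For example, SPARK_JAR=$(find_hudi_jar "$HUDI_BUNDLE_DIR" 'hudi-spark*-bundle_*.jar') stores the path in a variable that can then be passed to spark-sql as --jars "$SPARK_JAR".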
Create tables:
-- Create a non-partitioned COW table (no preCombineField)
spark-sql> create table hudi_cow_nonpcf_tbl (
uuid int,
name string,
price double
) using hudi
tblproperties (
primaryKey = 'uuid'
);


-- Create a partitioned COW table
spark-sql> create table hudi_cow_pt_tbl (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh);


-- Create a partitioned MOR table
spark-sql> create table hudi_mor_tbl (
id int,
name string,
price double,
ts bigint,
dt string
) using hudi
tblproperties (
type = 'mor',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt);
Write data:
-- Insert into the non-partitioned table
spark-sql> insert into hudi_cow_nonpcf_tbl select 1, 'a1', 20;


-- Insert with dynamic partitioning (partition values come from the select)
spark-sql> insert into hudi_cow_pt_tbl partition (dt, hh) select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt, '10' as hh;


-- Insert with static partitioning (partition values are fixed in the partition clause)
spark-sql> insert into hudi_cow_pt_tbl partition(dt = '2021-12-09', hh='11') select 2, 'a2', 1000;
spark-sql> insert into hudi_mor_tbl partition(dt = '2021-12-09') select 1, 'a1', 20, 1000;

Using Hive to Query Hudi Tables

Log in to the master node, switch to the hadoop user, and run the following command to start the Hive CLI:
hive
Add the Hudi dependency package (use the version present in the /usr/local/service/hudi/hudi-bundle directory of your cluster):
hive> add jar /usr/local/service/hudi/hudi-bundle/hudi-hadoop-mr-bundle-0.13.0.jar;
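View the tables. The hudi_mor_tbl_ro and hudi_mor_tbl_rt tables are the read-optimized and real-time views of the MOR table hudi_mor_tbl: the _ro table reads only the base Parquet files, while the _rt table merges base files with log files at query time.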
hive> show tables;
OK
hudi_cow_nonpcf_tbl
hudi_cow_pt_tbl
hudi_mor_tbl
hudi_mor_tbl_ro
hudi_mor_tbl_rt
Time taken: 0.023 seconds, Fetched: 5 row(s)
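Query data. In the select * results, each row begins with Hudi's metadata columns (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name), followed by the table's own columns. The partition path is empty for the non-partitioned table: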
hive> select * from hudi_cow_nonpcf_tbl;
OK
20230905170525412 20230905170525412_0_0 1 8d32a1cc-11f9-437f-9a7b-8ba9532223d3-0_0-17-15_20230905170525412.parquet 1 a1 20.0
Time taken: 1.447 seconds, Fetched: 1 row(s)

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select * from hudi_mor_tbl_ro;
OK
20230808174602565 20230808174602565_0_1 id:1 dt=2021-12-09 af40667d-1dca-4163-89ca-2c48250985b2-0_0-34-1617_20230808174602565.parquet 1 a1 20.0 1000 2021-12-09
Time taken: 0.159 seconds, Fetched: 1 row(s)
hive> set hive.vectorized.execution.enabled=false;
hive> select name, count(*) from hudi_mor_tbl_rt group by name;
a1 1
Time taken: 17.618 seconds, Fetched: 1 row(s)
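The interactive steps above (add the JAR, apply the two settings, run the query) can also be collected into one script for non-interactive use. This is a rough sketch under stated assumptions: hudi_hive_script is a hypothetical helper that only prints HiveQL, and applying both settings to every query is a simplification. In the session above, hive.input.format was set before querying the _ro table and vectorized execution was disabled before querying the _rt table.

```shell
#!/bin/sh
# Hypothetical helper: print the HiveQL statements for one query on a Hudi table.
#   $1 = path to the hudi-hadoop-mr bundle JAR
#   $2 = a single query, without the trailing semicolon
hudi_hive_script() {
  jar=$1
  query=$2
  printf 'add jar %s;\n' "$jar"
  # The two settings used in this document before querying the MOR views.
  printf 'set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;\n'
  printf 'set hive.vectorized.execution.enabled=false;\n'
  printf '%s;\n' "$query"
}
```

On a cluster node, the generated script could then be passed to the Hive CLI through its -e option, for example: hive -e "$(hudi_hive_script /usr/local/service/hudi/hudi-bundle/hudi-hadoop-mr-bundle-0.13.0.jar 'select name, count(*) from hudi_mor_tbl_rt group by name')". Substitute the JAR version that exists on your cluster.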

