```
+
 | 0. User create spark load job
+----v----+
|   FE    |---------------------------------+
+----+----+                                 |
     | 3. FE send push tasks                |
     | 5. FE publish version                |
+------------+------------+                 |
|            |            |                 |
+---v---+ +---v---+ +---v---+               |
|  BE   | |  BE   | |  BE   |               |1. FE submit Spark ETL job
+---^---+ +---^---+ +---^---+               |
    |4. BE push with broker   |             |
+---+---+ +---+---+ +---+---+               |
|Broker | |Broker | |Broker |               |
+---^---+ +---^---+ +---^---+               |
    |         |         |                   |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|             HDFS                +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+
```
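Reading the diagram from top to bottom: the user creates a Spark Load job on the FE (step 0); the FE submits a Spark ETL job to the Spark cluster (step 1), which reads the source data from HDFS, preprocesses it, and writes the result back to HDFS (step 2); the FE then sends push tasks to the BEs (step 3), and each BE pulls its portion of the ETL output from HDFS through a Broker (step 4); once the push completes, the FE publishes the version and the data becomes visible (step 5).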
Give the doris user ownership of the Hadoop client configuration directory:

```bash
chown -R doris:doris /usr/local/service/hadoop/etc/hadoop
```

Run the `hdfs dfs -ls /` command to verify that the Hadoop client is installed correctly. Then package the Spark client's dependency jars into an archive (referenced later by `spark_resource_path` in the FE configuration):

```bash
cd /usr/local/service/spark/jars/
zip spark_jars.zip *.jar
```
Edit the Spark client's log4j configuration:

```bash
vim /usr/local/service/spark/conf/log4j.properties
```
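The original text does not show what to change in this file. As a sketch based on the upstream Apache Doris Spark Load documentation (an assumption; verify against your Doris version): the FE obtains the YARN application id by parsing the stdout of `spark-submit`, so the root logger should keep writing to the console at INFO level:

```properties
# Assumption: the Doris FE parses spark-submit stdout for the YARN
# application id, so console logging at INFO level must stay enabled.
log4j.rootCategory=INFO, console
```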
Submit the SparkPi example in cluster mode to verify that Spark on YARN works end to end:

```bash
spark-submit --queue default --master yarn --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/service/spark/examples/jars/spark-examples_*.jar 10
```
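In cluster deploy mode the driver's output lands in the YARN container logs rather than the local console. One way to confirm the run, assuming the YARN CLI is available on this node and `<application_id>` is the id printed by spark-submit, is:

```bash
# Fetch the aggregated YARN logs and look for SparkPi's result line
yarn logs -applicationId <application_id> | grep "Pi is roughly"
```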
On the FE node, create the directory for Spark job submission logs and make it writable:

```bash
mkdir -p /usr/local/service/doris/log/spark_launcher_log
chmod 777 /usr/local/service/doris/log/spark_launcher_log
```
Add the Spark and YARN client paths to the FE configuration (`fe.conf`):

```properties
spark_home_default_dir=/usr/local/service/spark
spark_resource_path=/usr/local/service/spark/jars/spark_jars.zip
yarn_client_path=/usr/local/service/hadoop/bin/yarn
yarn_config_dir=/usr/local/service/hadoop/etc/hadoop
spark_dpp_version=1.2-SNAPSHOT
```
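These parameters only take effect after the FE process restarts. A minimal sketch, assuming the FE is installed under `/usr/local/service/doris/fe` (adjust the path to your deployment):

```bash
# Restart the FE so the new Spark client settings are picked up
cd /usr/local/service/doris/fe
sh bin/stop_fe.sh
sh bin/start_fe.sh --daemon
```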
Create the Spark resource in Doris, replacing the `<...>` placeholders with your cluster's values:

```sql
CREATE EXTERNAL RESOURCE spark_resource_xxx
PROPERTIES
(
    "type" = "spark",
    "spark.master" = "yarn",
    "spark.submit.deployMode" = "cluster",
    "spark.yarn.queue" = "<xxx_queue>",
    "spark.hadoop.yarn.resourcemanager.ha.enabled" = "true",
    "spark.hadoop.yarn.resourcemanager.ha.rm-ids" = "rm1,rm2",
    "spark.hadoop.yarn.resourcemanager.address.rm1" = "<rm1_host>:<rm1_port>",
    "spark.hadoop.yarn.resourcemanager.address.rm2" = "<rm2_host>:<rm2_port>",
    "spark.hadoop.fs.defaultFS" = "hdfs://<hdfs_defaultFS>",
    "spark.hadoop.dfs.nameservices" = "<hdfs_defaultFS>",
    "spark.hadoop.dfs.ha.namenodes.<hdfs_defaultFS>" = "nn1,nn2",
    "spark.hadoop.dfs.namenode.rpc-address.<hdfs_defaultFS>.nn1" = "<nn1_host>:<nn1_port>",
    "spark.hadoop.dfs.namenode.rpc-address.<hdfs_defaultFS>.nn2" = "<nn2_host>:<nn2_port>",
    "spark.hadoop.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "working_dir" = "hdfs://<hdfs_defaultFS>/doris/spark_load",
    "broker" = "<doris_broker_name>",
    "broker.username" = "hadoop",
    "broker.password" = "",
    "broker.dfs.nameservices" = "<hdfs_defaultFS>",
    "broker.dfs.ha.namenodes.<hdfs_defaultFS>" = "nn1,nn2",
    "broker.dfs.namenode.rpc-address.<hdfs_defaultFS>.nn1" = "<nn1_host>:<nn1_port>",
    "broker.dfs.namenode.rpc-address.<hdfs_defaultFS>.nn2" = "<nn2_host>:<nn2_port>",
    "broker.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```
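To confirm the resource was created, you can list it with `SHOW RESOURCES` (standard Doris syntax):

```sql
SHOW RESOURCES WHERE NAME = "spark_resource_xxx";
```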
Grant ordinary accounts the USAGE_PRIV privilege on the resource before they submit Spark Load jobs:

```sql
-- Grant usage of the spark_resource_xxx resource to user user0
GRANT USAGE_PRIV ON RESOURCE "spark_resource_xxx" TO "user0"@"%";
-- Grant usage of the spark_resource_xxx resource to role role0
GRANT USAGE_PRIV ON RESOURCE "spark_resource_xxx" TO ROLE "role0";
-- Grant usage of all resources to user user0
GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
-- Grant usage of all resources to role role0
GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
-- Revoke the spark_resource_xxx usage privilege from user user0
REVOKE USAGE_PRIV ON RESOURCE "spark_resource_xxx" FROM "user0"@"%";
```
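To double-check what a given account ended up with, list its privileges with `SHOW GRANTS` (standard Doris syntax; `user0` is the example account from above):

```sql
SHOW GRANTS FOR 'user0'@'%';
```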
Submit the Spark Load job, referencing the resource created above:

```sql
LOAD LABEL test_label_01                                     -- label
(
    DATA INFILE ("hdfs://HDFS4001234/warehouse/ods.db/user_events/ds=2023-04-15/*")  -- file path
    INTO TABLE user_events                                   -- doris table
    FORMAT AS "parquet"                                      -- data format
    ( event_time, user_id, op_code )                         -- columns in file
    COLUMNS FROM PATH AS ( `ds` )                            -- partition column
    SET                                                      -- column mapping
    (
        ds = ds,
        event_time = event_time,
        user_id = user_id,
        op_code = op_code
    )
)
WITH RESOURCE 'spark_resource_xxx'
(                                                            -- spark job params
    "spark.executor.memory" = "4g",
    "spark.default.parallelism" = "400",
    "spark.executor.cores" = "5",
    "spark.executor.instances" = "10"
)
PROPERTIES                                                   -- doris load task params
(
    "timeout" = "259200"
);
```
After the job is submitted, switch to the target database and check its status with `SHOW LOAD`:

```sql
USE xxx_db;
SHOW LOAD WHERE LABEL = 'test_label_01';
```

- State: the stage the load job is currently in. A job is PENDING right after submission, becomes ETL once the Spark ETL job has been submitted, changes to LOADING when the ETL finishes and the FE schedules the BEs to execute push tasks, and becomes FINISHED after the push completes and the version takes effect. A load job has two terminal states, CANCELLED and FINISHED; the load is over once it reaches either one. CANCELLED means the load failed, FINISHED means it succeeded.
- Progress: a description of the job's progress. There are two progress values, ETL and LOAD, corresponding to the ETL and LOADING stages of the flow. LOAD ranges from 0 to 100%. `LOAD progress = number of tablets that have finished loading on all replicas / total number of tablets in this job * 100%`. **Once every table in the job has finished loading, LOAD shows 99%**; the load then enters the final version-publish phase, and LOAD only changes to 100% after the whole job completes. Load progress is not linear, so an unchanged value over some period of time does not mean the load has stalled.
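As a worked example of the formula (numbers are hypothetical): if a job covers 400 tablets and 300 of them have finished loading on all replicas, LOAD shows 300 / 400 * 100% = 75%; once all 400 finish it shows 99%, and it only moves to 100% after the version-publish phase completes.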
A job that has not reached a terminal state can be cancelled by label:

```sql
CANCEL LOAD WHERE LABEL = 'test_label_01';
```

For troubleshooting, use the `SHOW LOAD` command to look up a load task's information by label. The Spark submission logs on the FE node can be fuzzy-searched by label as well:

```bash
cd /data/cdw/doris/fe/log/
ls *<load_task_label>*
```