Tencent Cloud WeData

PySpark

Last updated: 2026-01-07 17:48:31
Note:
1. The Hive and Spark component services must be started in the EMR cluster.
2. The current user must have permissions in the EMR cluster.
3. The corresponding databases and tables must have been created in Hive, such as wedata_demo_db in the example.
4. PySpark tasks are automatically submitted in cluster mode.
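
If the demo database and table have not been created yet, they can be set up from a Hive-enabled Spark session. The sketch below is illustrative only: the table name user_demo and its columns are taken from the examples on this page; adjust them to your own schema.

from pyspark.sql import SparkSession

# Hive support is required so the DDL statements reach the Hive metastore.
spark = SparkSession.builder.appName("WeDataSetup").enableHiveSupport().getOrCreate()

# Create the demo database and table used by the examples on this page.
spark.sql("CREATE DATABASE IF NOT EXISTS wedata_demo_db")
spark.sql("""
CREATE TABLE IF NOT EXISTS wedata_demo_db.user_demo (
    user_id   INT,
    user_name STRING,
    age       INT
)
""")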

Code Example

# Example 1: Build a DataFrame from in-memory data with an explicit schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("WeDataApp").getOrCreate()

schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("user_name", StringType(), True),
    StructField("age", IntegerType(), True)
])

data = [(1, "Alice", 25), (2, "Bob", 30)]
df = spark.createDataFrame(data, schema=schema)

df.show()

# Example 2: Query a Hive table. enableHiveSupport() is required so that
# Spark can access the Hive metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WeDataApp").enableHiveSupport().getOrCreate()

df = spark.sql("SELECT * FROM wedata_demo_db.user_demo")

count = df.count()

print("The number of rows in the dataframe is:", count)

Parameter Description

Parameter        Description
Python version   Supports both Python 2 and Python 3.
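
To confirm which interpreter a task actually runs under, a quick check can be added at the top of any PySpark task (a minimal sketch using only the standard library):

import sys

# Print the interpreter version so it shows up in the task's run logs.
print("Python version:", sys.version)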

Use the Python Environment of the Scheduling Resource Group in a PySpark Task

Install Python Libraries in the Scheduling Resource Group

1. Go to the Project Management > Execution Resource Group > Standard Scheduling Resource Group page and click Resource Detail to enter the resource operation and maintenance page.

2. On the resource operation and maintenance page, click Python Package Installation to install built-in Python libraries. Installing the Python 3 versions is recommended.

3. The platform currently supports installing built-in libraries only. In this example, install the sklearn and pandas libraries. After installation completes, you can use the Python Package View feature to see the installed Python libraries.


Edit PySpark Task

1. Create a task and select a scheduling resource group that has Python packages installed.
2. Write PySpark code that uses the installed Python libraries, in this case pandas and sklearn, as shown below.



from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pandas as pd
import sklearn

spark = SparkSession.builder.appName("WeDataApp-1").getOrCreate()

schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("user_name", StringType(), True),
    StructField("age", IntegerType(), True)
])

data = [(1, "Alice", 25), (2, "Bob", 30)]
df = spark.createDataFrame(data, schema=schema)

# Convert the Spark DataFrame to a pandas DataFrame on the driver.
pandas_df = df.toPandas()

df.show()
print(pandas_df.head(10))
print(sklearn.__version__)
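
Note that df.toPandas() collects the entire distributed DataFrame into the driver's memory, so it is only appropriate for small datasets; the conversion is what allows driver-side libraries such as pandas and scikit-learn to operate on the data.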

Debug PySpark Task

1. Click debug run to view the logs and results of the debug run.
Example: In the logs, you can see that the Python environment of the scheduling resource group is used as the task execution environment.
spark.yarn.dist.archives,file:///usr/local/python3/python3.zip#python3

2. The log output shows that the installed pandas library is used and that the version of the installed scikit-learn library is printed correctly.
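
The spark.yarn.dist.archives property shown in the log is Spark's standard mechanism for shipping a packaged Python environment to the YARN containers; the #python3 suffix is the alias under which the archive is unpacked. To confirm from inside a task which environment was attached, the effective configuration can be inspected at runtime (a minimal sketch; these properties are set by the platform and may be absent when running elsewhere):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect how the scheduling resource group's Python environment was attached.
# "not set" is printed if a property is absent.
conf = spark.sparkContext.getConf()
print(conf.get("spark.yarn.dist.archives", "not set"))
print(conf.get("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "not set"))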

Periodic Scheduling of PySpark Tasks

After the task is scheduled to run periodically, view the logs and results of the scheduled runs. In the logs, you can see that the Python environment of the scheduling resource group is used as the task execution environment.






