SparkSQL Tutorial

Last updated: 2020-11-20 12:59:00

    Spark SQL is Apache Spark's module for structured data processing. It provides a DataFrame abstraction in a variety of languages to simplify working with structured datasets and lets you query the data with distributed SQL.
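
    The following is a minimal Java sketch (not taken from this tutorial) of what that looks like in practice: a JSON file is loaded as a DataFrame, registered as a temporary view, and queried with SQL. The class name, the file path, and the column names (`age`, `name`) are assumptions based on Spark's bundled people.json sample.

    ```
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class QuickLook {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("Spark SQL quick look")
                    .getOrCreate();

            // Load semi-structured JSON data as a DataFrame (Dataset<Row>).
            Dataset<Row> people = spark.read().json("/user/hadoop/people.json");

            // Register the DataFrame as a temporary view and query it with distributed SQL.
            people.createOrReplaceTempView("people");
            spark.sql("SELECT name FROM people WHERE age >= 21").show();

            spark.stop();
        }
    }
    ```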

    ## 1. Preparations for Development

    Confirm that you have activated Tencent Cloud and created an EMR cluster. When creating the EMR cluster, select the Spark component on the software configuration page.

    ## 2. Using the Interactive SparkSQL Console

    First, log in to a master node of the EMR cluster before using SparkSQL. For more information on how to log in to EMR, please see Logging in to a Linux Instance. Here, you can use WebShell to log in: click Login on the right of the desired CVM instance to enter the login page. The default username is root, and the password is the one you set when creating the EMR cluster. Once your credentials have been validated, you can access the EMR command-line interface.

    Run the following commands in the EMR command-line interface to switch to the Hadoop user and go to the directory /usr/local/service/spark:

    [root@172 ~]# su hadoop
    [hadoop@172 root]$ cd /usr/local/service/spark

    You can access the interactive SparkSQL Console by running the following command:

    [hadoop@10 spark]$ bin/spark-sql --master yarn --num-executors 64 --executor-memory 2g

    Here, --master indicates your master URL, --num-executors the number of executors, and --executor-memory the amount of memory allocated to each executor. You can modify these parameters based on your actual conditions. You can also start and stop a Spark SQL Thrift server with sbin/start-thriftserver.sh and sbin/stop-thriftserver.sh.
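
    If you start the Thrift server, applications can submit SQL to it over JDBC instead of using the interactive console. Below is a hedged Java sketch of such a client; it is not part of this guide's steps, and the host, port (10000 is the Thrift server's usual default), user, and the hive-jdbc driver dependency are assumptions you would need to adapt to your cluster.

    ```
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ThriftClient {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath (assumption, not covered in this tutorial).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "hadoop", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    // Print each database name returned by the Thrift server.
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
    ```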

    Below are some basic operations in SparkSQL:

    - Create and view a database:

    spark-sql> create database sparksql;
    Time taken: 0.907 seconds

    spark-sql> show databases;
    default
    sparksql
    test
    Time taken: 0.131 seconds, Fetched 5 row(s)

    - Create a new table in the database you just created and view the table:

    spark-sql> use sparksql;
    Time taken: 0.076 seconds

    spark-sql> create table sparksql_test(a int,b string);
    Time taken: 0.374 seconds

    spark-sql> show tables;
    sparksql_test false
    Time taken: 0.12 seconds, Fetched 1 row(s)

    - Insert two rows of data into the table and view them:

    spark-sql> insert into sparksql_test values (42,'hello'),(48,'world');
    Time taken: 2.641 seconds

    spark-sql> select * from sparksql_test;
    42 hello
    48 world
    Time taken: 0.503 seconds, Fetched 2 row(s)
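
    The same statements can also be issued programmatically through SparkSession.sql(). The Java sketch below is an illustration rather than part of the console walkthrough; it assumes Hive support is enabled so that the database and table are persisted in the Hive metastore, as they are when using the spark-sql console.

    ```
    import org.apache.spark.sql.SparkSession;

    public class SqlOps {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSQL basic operations")
                    .enableHiveSupport()
                    .getOrCreate();

            // The same DDL/DML that was run in the interactive console.
            spark.sql("CREATE DATABASE IF NOT EXISTS sparksql");
            spark.sql("USE sparksql");
            spark.sql("CREATE TABLE IF NOT EXISTS sparksql_test (a INT, b STRING)");
            spark.sql("INSERT INTO sparksql_test VALUES (42, 'hello'), (48, 'world')");
            spark.sql("SELECT * FROM sparksql_test").show();

            spark.stop();
        }
    }
    ```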

    For more information on Spark command line parameters, please see the [community documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html).
    
    ## 3. Creating a Project with Maven
    Download and install Maven first, and then configure its environment variables. If you are using an IDE, please set the Maven-related configuration items in the IDE.
    
    
    ### Creating a Maven project
    Go to the directory where you want to create the Maven project, such as `D://mavenWorkplace`, and create the project by running the following command:

    mvn archetype:generate -DgroupId=$yourgroupID -DartifactId=$yourartifactID -DarchetypeArtifactId=maven-archetype-quickstart

    Here, $yourgroupID is your package name; $yourartifactID is your project name; and maven-archetype-quickstart indicates that a Maven Java project is to be created. Some files need to be downloaded while the project is being created, so please stay connected to the Internet.
    After successfully creating the project, you will see a folder named $yourartifactID in the `D://mavenWorkplace` directory. The files included in the folder have the following structure:

    simple
       ---pom.xml       Core configuration, under the project root directory
       ---src
          ---main
             ---java        Java source code directory
             ---resources   Java configuration file directory
          ---test
             ---java        Test source code directory
             ---resources   Test configuration directory

    Among the files above, pay extra attention to the pom.xml file and the Java folder under the main directory. The pom.xml file is primarily used to declare dependencies and packaging configurations; the Java folder is used to store your source code.
    
    ### Adding Hadoop dependencies and sample code
    First, add the Spark dependencies (spark-core and spark-sql) as well as the compiling and packaging plugins to the pom.xml file. Below is an example of a complete pom.xml file:

    ```
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>$yourgroupID </groupId>
    <artifactId>$yourartifactID </artifactId>
    <version>1.0-SNAPSHOT</version>
    
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <!--spark sql-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
    </dependencies>
    
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    </project>
    ```

    >!Replace $yourgroupID and $yourartifactID with your real information.

    Then, create a Java class named Demo.java in the src/main/java folder and add the following code to it:

    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    /**
     * Created by tencent on 2018/6/28.
     */
    public class Demo {
        public static void main(String[] args){
            // Build a SparkSession with Hive support enabled.
            SparkSession spark = SparkSession
                    .builder()
                    .appName("Java Spark Hive Example")
                    .enableHiveSupport()
                    .getOrCreate();

            // Read the JSON file passed as the first argument into a DataFrame.
            Dataset<Row> df = spark.read().json(args[0]);

            // Convert the DataFrame to an RDD of rows.
            RDD<Row> test = df.rdd();

            // Save the rows as text files in the output path passed as the second argument.
            test.saveAsTextFile(args[1]);
        }
    }

    ### Compiling code and packaging it for upload

    Use the local command prompt to enter the project directory and run the following command to compile and package the project:

    mvn package

    "Build success" indicates that package is successfully created. You can see the generated .jar package in the target folder under the project directory.
    Use the scp or sftp tool to upload the compressed .jar package to the EMR cluster. Run the following command in your local shell:

    scp $localfile root@$public_IP:$remotefolder

    Here, $localfile is the path and name of your local file; root is the CVM instance username; and $public_IP is the public IP address of the node, which you can look up in the node information in the EMR console or the CVM console. $remotefolder is the path where you want to store the file in the CVM instance. After the upload is completed, you can check whether the file is in the corresponding folder in the EMR command-line interface.

    ## 4. Preparing the Data and Running the Demo

    You can use SparkSQL to process data stored in HDFS. First, upload the data to HDFS. The built-in file people.json stored in /usr/local/service/spark/examples/src/main/resources/ is used as an example here. Run the following command to upload it to HDFS:

    [hadoop@10 hadoop]$ hadoop fs -put /usr/local/service/spark/examples/src/main/resources/people.json /user/hadoop

    You can choose a different test file. Here, /user/hadoop/ is a folder under HDFS, which you can create if it does not exist.

    Run the demo. First, log in to a master node of the EMR cluster and switch to the Hadoop user, as described in Using the Interactive SparkSQL Console. Then run the following command:

    [hadoop@10 spark]$ bin/spark-submit --class Demo --master yarn-client $yourjarpackage \
    /user/hadoop/people.json /user/hadoop/$output

    Here, --class specifies the entry class to be executed; in this example it is Demo, which is the name of the Java class you created in Adding Hadoop dependencies and sample code. --master is the master URL of the cluster, $yourjarpackage is the name of your .jar package, and $output is the output folder (if the $output folder already exists before the command is executed, the program will fail).
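
    Since the program fails when the output folder already exists, you may want to remove it before re-running the demo. Below is a hedged Java sketch (not part of the original demo) that deletes an HDFS path with the Hadoop FileSystem API; the class name and the idea of passing the path as an argument are illustrative only, and the recursive delete is destructive, so only point it at a throwaway directory.

    ```
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CleanOutput {
        public static void main(String[] args) throws Exception {
            // The HDFS path to remove, e.g. /user/hadoop/$output (hypothetical argument).
            Path output = new Path(args[0]);
            // Picks up the cluster configuration from the classpath/environment.
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs.exists(output)) {
                fs.delete(output, true); // true = delete recursively
            }
        }
    }
    ```

    Alternatively, you can simply run hadoop fs -rm -r on the output path before resubmitting.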

    After the program is successfully executed, you can see the result in /user/hadoop/$output; each line is the string form of one row read from people.json:

    [hadoop@172 spark]$ hadoop fs -cat /user/hadoop/$output/part-00000
    [null,Michael]
    [30,Andy]
    [19,Justin]

    For more spark-submit parameters, run the following command, or see the official documentation.

    [hadoop@10 spark]$ spark-submit -h