
Use GooseFS in Kubernetes to Speed Up Spark Data
Last updated: 2025-07-17 17:48:17

Overview

Spark running on Kubernetes can use GooseFS as its data access layer. This document explains how to use GooseFS in a Kubernetes environment to accelerate Spark data access.

Practical Deployment

Environment and Dependency Versions

CentOS 7.4+
Kubernetes version 1.18.0+
Docker 20.10.0
Spark version 2.4.8+
GooseFS 1.2.0+
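
You can verify the local toolchain against these requirements before starting; a minimal check, assuming kubectl, docker, and the Spark distribution are already on the PATH:
# Check client-side tool versions
$ kubectl version --short
$ docker version --format '{{.Server.Version}}'
$ spark-submit --version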

Kubernetes Deployment

For detailed Kubernetes deployment, see Kubernetes official documentation.
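
Once the cluster is deployed, a quick sanity check confirms that the nodes are registered (assuming kubectl is configured for the target cluster):
# All nodes should report Ready and Kubernetes v1.18.0 or later
$ kubectl get nodes -o wide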

Accelerating Spark Data Access with GooseFS

Currently, there are two main ways to use GooseFS in Kubernetes to accelerate Spark data access:
1. Deploy GooseFS Runtime Pods alongside the Spark Runtime on the Fluid distributed data orchestration and acceleration engine (the Fluid Operator architecture).
2. Run Spark on GooseFS directly in Kubernetes (the Kubernetes Native deployment architecture).
This document describes the second approach.

Running Spark on GooseFS in Kubernetes

Prerequisites

1. Spark on Kubernetes uses the Kubernetes Native deployment architecture recommended by the Spark official website. For detailed deployment instructions, see the Spark official documentation.
2. A GooseFS cluster has been deployed. For GooseFS cluster deployment, see Console Quick Start.
Note:
When deploying a GooseFS Worker, configure goosefs.worker.hostname=$(hostname -i); otherwise, the client in the Spark pod cannot resolve the GooseFS Worker host address.
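For example, the property can be written into the Worker's configuration before it starts. A minimal sketch, assuming GooseFS is installed under /opt/goosefs on the Worker node:
# Resolve the local IP and persist it as the Worker hostname
$ echo "goosefs.worker.hostname=$(hostname -i)" >> /opt/goosefs/conf/goosefs-site.properties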

Basic Steps

1. First, download and extract spark-2.4.8-bin-hadoop2.7.tgz.
2. Extract the GooseFS client from the GooseFS Docker image, copy it into the Spark distribution, and build it into the Spark image, as follows:
# Extract the GooseFS client JAR from the GooseFS Docker image
$ id=$(docker create goosefs/goosefs:v1.2.0)
$ docker cp $id:/opt/goosefs/client/goosefs-1.2.0-client.jar ./goosefs-1.2.0-client.jar
$ docker rm -v $id 1>/dev/null
# Copy the client JAR into the Spark jars directory
$ cp goosefs-1.2.0-client.jar /path/to/spark-2.4.8-bin-hadoop2.7/jars
# Rebuild the Spark Docker image (run from the root of the Spark distribution)
$ docker build -t spark-goosefs:2.4.8 -f kubernetes/dockerfiles/spark/Dockerfile .
# View the built Docker image
$ docker image ls
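
You can also spot-check that the client JAR made it into the image; a quick check, assuming the jars live under /opt/spark/jars as in the stock Spark Dockerfile:
# List the Spark jars inside the new image and look for the GooseFS client
$ docker run --rm --entrypoint ls spark-goosefs:2.4.8 /opt/spark/jars | grep goosefs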

Test Procedure

First, ensure that the GooseFS cluster is running and that the Spark containers can reach the GooseFS Master and Worker IPs and ports, then follow the steps below to verify.
1. Create a namespace for testing in GooseFS, such as /spark-cosntest, and add a test data file.
Note:
We recommend that you avoid using permanent keys in the configuration; sub-account keys or temporary keys improve business security. When authorizing a sub-account, grant only the permissions for the operations and resources that the sub-account needs, which helps avoid unexpected data leakage.
If you must use a permanent key, limit its permission scope by restricting the executable operations, resource scope, and conditions (such as access IPs).
# Use sub-account keys or temporary keys to complete the configuration and enhance security. When authorizing sub-accounts, grant executable operations and resources on demand.
$ goosefs ns create spark-cosntest cosn://goosefs-test-125000000/ --secret fs.cosn.userinfo.secretId=********************************** --secret fs.cosn.userinfo.secretKey=********************************** --attribute fs.cosn.bucket.region=ap-xxxx
# Add a test data file
$ goosefs fs copyFromLocal LICENSE /spark-cosntest
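Before submitting the job, you can confirm that the test file is visible through the new namespace:
# The LICENSE file should appear under the namespace root
$ goosefs fs ls /spark-cosntest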
2. (Optional) Create a service account for running Spark jobs.
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit \
--serviceaccount=default:spark --namespace=default
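You can confirm that the role binding took effect before submitting the job; for example:
# Verify that the service account is allowed to create pods
$ kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default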
3. Submit a Spark job. (The master URL below assumes kubectl proxy is exposing the API server on 127.0.0.1:8001.)
$ ./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-goosefs \
--class org.apache.spark.examples.JavaWordCount \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=spark-goosefs:2.4.8 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.hadoop.fs.gfs.impl=com.qcloud.cos.goosefs.hadoop.GooseFileSystem \
--conf spark.driver.extraClassPath=local:///opt/spark/jars/goosefs-1.2.0-client.jar \
--conf spark.executor.extraClassPath=local:///opt/spark/jars/goosefs-1.2.0-client.jar \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.8.jar \
gfs://172.16.64.32:9200/spark-cosntest/LICENSE
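While the job runs, you can watch the driver and executor pods with kubectl; for example:
# Watch the Spark driver and executor pods start, run, and complete
$ kubectl get pods -w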
4. Wait for the job to complete, then run kubectl logs spark-goosefs-1646905692480-driver (substitute your own driver pod name) to view the job execution result.
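If the job succeeded, JavaWordCount writes one word: count line per distinct word in LICENSE near the end of the driver log; for example:
# Print the tail of the driver log, where the word counts appear
$ kubectl logs spark-goosefs-1646905692480-driver | tail -n 20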


