Overview
Transparent acceleration enhances the performance of accessing COS through CosN. CosN is a standard Hadoop file system implementation based on Tencent Cloud Object Storage (COS) that integrates COS into big data computing frameworks such as Hadoop, Spark, and Tez. Through the CosN plug-in, which implements the Hadoop file system interface, users can read and write data stored in COS. For users who already access COS through CosN, GooseFS provides a client-side path mapping method that lets them keep using the cosn:// scheme to access GooseFS without modifying their existing Hive table definitions. This makes it possible to compare and test the functionality and performance of GooseFS without changing any table definitions. Cloud HDFS (CHDFS) users can likewise access GooseFS through the ofs:// scheme by modifying their configuration.
The mapping between the CosN scheme and the GooseFS scheme is as follows:
Assuming the UFS path corresponding to the Namespace warehouse is cosn://examplebucket-1250000000/data/warehouse/, the mapping from CosN paths to GooseFS paths is as follows:
cosn://examplebucket-1250000000/data/warehouse -> /warehouse/
cosn://examplebucket-1250000000/data/warehouse/folder/test.txt -> /warehouse/folder/test.txt
The path mapping relationship from GooseFS to CosN is as follows:
/warehouse -> cosn://examplebucket-1250000000/data/warehouse/
/warehouse/ -> cosn://examplebucket-1250000000/data/warehouse/
/warehouse/folder/test.txt -> cosn://examplebucket-1250000000/data/warehouse/folder/test.txt
The CosN scheme accesses GooseFS by maintaining, on the client, the mapping between GooseFS paths and the paths of the underlying CosN file system, and converting requests for CosN paths into requests to GooseFS. The mapping is refreshed periodically; you can adjust the refresh interval through the configuration item goosefs.user.client.namespace.refresh.interval in the GooseFS configuration file goosefs-site.properties (default: 60s).
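For example, to lengthen the refresh interval to 120 seconds, you could add the following line to goosefs-site.properties (the 120s value here is illustrative; pick an interval that matches how often your Namespace mounts change):

```properties
goosefs.user.client.namespace.refresh.interval=120s
```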
Note:
If the accessed CosN path cannot be converted to a GooseFS path, the corresponding Hadoop API call throws an exception.
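The path translation described above can be sketched as follows. This is an illustrative Python model of the client-side mapping, not GooseFS source code; the mount table and function names are hypothetical:

```python
# Illustrative model of the client-side path mapping (not GooseFS source code).
# Each entry maps a UFS (CosN) prefix to the GooseFS Namespace it is mounted at.
MOUNT_TABLE = {
    "cosn://examplebucket-1250000000/data/warehouse": "/warehouse",
}

def cosn_to_goosefs(cosn_path: str) -> str:
    """Convert a CosN path to a GooseFS path using the mount table."""
    for ufs_prefix, ns in MOUNT_TABLE.items():
        if cosn_path.rstrip("/") == ufs_prefix:
            return ns + "/"
        if cosn_path.startswith(ufs_prefix + "/"):
            return ns + cosn_path[len(ufs_prefix):]
    # Mirrors the documented behavior: an unmappable path raises an error.
    raise IOError(f"Failed to convert ufs path {cosn_path} to GooseFS path, "
                  "check if namespace mounted")

def goosefs_to_cosn(gfs_path: str) -> str:
    """Convert a GooseFS path back to the underlying CosN path."""
    for ufs_prefix, ns in MOUNT_TABLE.items():
        if gfs_path.rstrip("/") == ns:
            return ufs_prefix + "/"
        if gfs_path.startswith(ns + "/"):
            return ufs_prefix + gfs_path[len(ns):]
    raise IOError(f"Failed to convert GooseFS path {gfs_path} to ufs path")
```

Running the mapping rules listed above through this sketch reproduces the same translations in both directions, and an unmounted path raises an error instead of returning a result.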
Operation Example
This example demonstrates how to use the three schemes gfs://, cosn://, and ofs:// to access GooseFS from the Hadoop command line and Hive. The procedure is as follows:
1. Prepare the Data and Computing Cluster
Refer to the Create a bucket document to create a bucket for testing. Refer to the Create a folder document to create a folder named ml-100k in the root path of the bucket. Download the ml-100k dataset from GroupLens and upload the file u.user to <bucket root path>/ml-100k. Refer to the EMR guidance documentation to purchase an EMR cluster and configure the Hive component.
2. Configure the Environment
i. Place the GooseFS client JAR package (goosefs-1.0.0-client.jar) under the share/hadoop/common/lib/ directory:
cp goosefs-1.0.0-client.jar hadoop/share/hadoop/common/lib/
Note:
Configuration changes and adding jar packages need to be synchronized to all nodes in the cluster.
ii. Modify the Hadoop configuration file etc/hadoop/core-site.xml to specify the implementation class of GooseFS:
<property>
<name>fs.AbstractFileSystem.gfs.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.GooseFileSystem</value>
</property>
<property>
<name>fs.gfs.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.FileSystem</value>
</property>
iii. Execute the following Hadoop command to check whether GooseFS can be accessed via the gfs:// Scheme, where <MASTER_IP> is the IP of the MASTER node:
hadoop fs -ls gfs://<MASTER_IP>:9200/
iv. Place the GooseFS client JAR package in the Hive auxlib directory so that Hive can load it:
cp goosefs-1.0.0-client.jar hive/auxlib/
v. Execute the following commands to create a Namespace with the UFS scheme set to CosN, and list the Namespaces. Replace examplebucket-1250000000 in the commands with your COS bucket, and replace SecretId and SecretKey with your key information:
goosefs ns create ml-100k cosn://examplebucket-1250000000/ml-100k --secret fs.cosn.userinfo.secretId=SecretId --secret fs.cosn.userinfo.secretKey=SecretKey --attribute fs.cosn.bucket.region=ap-guangzhou --attribute fs.cosn.credentials.provider=org.apache.hadoop.fs.auth.SimpleCredentialProvider
goosefs ns ls
vi. Execute the following commands to create a Namespace with the UFS scheme set to OFS, and list the Namespaces. Replace instance-id in the commands with your CHDFS instance ID, and replace 1250000000 with your APPID:
goosefs ns create ofs-test ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/ofs-test --attribute fs.ofs.userinfo.appid=1250000000
goosefs ns ls
3. Create a GooseFS Scheme Table and Query Data
Execute the following statements in Hive:
create database goosefs_test;
use goosefs_test;
CREATE TABLE u_user_gfs (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'gfs://<MASTER_IP>:<MASTER_PORT>/ml-100k';
select sum(age) from u_user_gfs;
4. Create a CosN Scheme Table and Query Data
Execute the following statements:
CREATE TABLE u_user_cosn (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'cosn://examplebucket-1250000000/ml-100k';
select sum(age) from u_user_cosn;
5. Modify the Implementation of CosN to a Compatible Implementation of GooseFS
Modify hadoop/etc/hadoop/core-site.xml:
<property>
<name>fs.AbstractFileSystem.cosn.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.CosN</value>
</property>
<property>
<name>fs.cosn.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.CosNFileSystem</value>
</property>
Execute the Hadoop commands. If a path cannot be converted into a GooseFS path, the command output will contain an error message:
hadoop fs -ls cosn://examplebucket-1250000000/ml-100k/
Found 1 items
-rw-rw-rw- 0 hadoop hadoop 22628 2021-07-02 15:27 cosn://examplebucket-1250000000/ml-100k/u.user
hadoop fs -ls cosn://examplebucket-1250000000/unknown-path
ls: Failed to convert ufs path cosn://examplebucket-1250000000/unknown-path to GooseFs path, check if namespace mounted
Re-execute the Hive query statement:
select sum(age) from u_user_cosn;
6. Create an OFS Scheme Table and Query Data
Execute the following statements:
CREATE TABLE u_user_ofs (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/ofs-test/';
select sum(age) from u_user_ofs;
7. Modify the Implementation of OFS to a Compatible Implementation of GooseFS
Modify hadoop/etc/hadoop/core-site.xml:
<property>
<name>fs.AbstractFileSystem.ofs.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.CHDFSDelegateFS</value>
</property>
<property>
<name>fs.ofs.impl</name>
<value>com.qcloud.cos.goosefs.hadoop.CHDFSHadoopFileSystem</value>
</property>
Execute the Hadoop commands. If a path cannot be converted into a GooseFS path, the output will contain an error message:
hadoop fs -ls ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/ofs-test/
Found 1 items
-rw-r--r-- 0 hadoop hadoop 22628 2021-07-15 15:56 ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/ofs-test/u.user
hadoop fs -ls ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/unknown-path
ls: Failed to convert ufs path ofs://instance-id.chdfs.ap-guangzhou.myqcloud.com/unknown-path to GooseFs path, check if namespace mounted
Re-execute the Hive query statement:
select sum(age) from u_user_ofs;