
GooseFS Distributedload Tuning Practice
Last updated: 2025-07-17 17:33:24

Overview

GooseFS provides users with a cache file system close to the compute side. When files reside in a remote storage system (such as object storage), users can run the distributedLoad command to load the required file data and metadata into the GooseFS cluster, reducing access latency and accelerating compute jobs.

The GooseFS distributedLoad command is as follows (a sample invocation follows the option descriptions):
distributedLoad [-A] [--replication <num>] [--active-jobs <num>] [--expire-time <time>] <path>
-A: Whether to enable the atomic distributed load capability.
--active-jobs <num>: The maximum number of data loading jobs that can run simultaneously. The default upper limit is 3000. Beyond this value, new jobs must wait for running jobs to complete before executing.
--expire-time <time>: The expiration time for clearing the temporary directory used during data loading. The default is 24 hours. The unit defaults to ms; s, min, and hour are also supported (e.g., 100s).
--replication <num>: The number of replicas of each Block loaded per task. Defaults to 1.
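For example, a minimal invocation might look like the following (a sketch only; goosefs fs is assumed to be the shell entry point, and /data/warehouse is a hypothetical path already mounted from COS):

goosefs fs distributedLoad --replication 1 --active-jobs 2000 /data/warehouse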

Practical Steps

Process Description

The complete execution process of the distributedLoad command involves the GooseFS client, JobMaster, JobWorker, and Worker modules. The details are as follows:
1. The GooseFS Client initiates the task.
2. The GooseFS JobMaster converts each file into a Job and splits it into different Tasks based on its collection of Blocks.
3. The GooseFS JobMaster distributes the Tasks among JobWorkers, and each JobWorker initiates Load operations.
4. The GooseFS JobWorker executes Tasks concurrently. Each Task sends an execution request to the Worker, which performs a Read operation and initiates a write operation; each Read operation sends a read request to Cosn.

Based on the above process, the potential bottlenecks are as follows:
1. GooseFS Client initiates tasks concurrently at the file granularity.
2. The speed at which the GooseFS JobMaster organizes and assigns tasks.
3. Concurrency of GooseFS JobWorker task execution.
4. Concurrency of GooseFS Worker threads for executing read and write operations.
5. The speed at which Cosn processes read requests.

Of these, step 2 is an in-memory, control-flow operation, and tasks are distributed on the Worker heartbeat interval (1s), so it rarely becomes a bottleneck. The main tuning directions therefore focus on the JobWorker, Worker, and Cosn modules, and involve the following key parameters (a consolidated configuration sketch follows the list):
1. JobWorker module:
goosefs.user.block.worker.client.pool.max: The maximum size of the client pool used when establishing read/write streams. If this value is too small, callers may block while acquiring a client.
goosefs.job.worker.max.active.task.num: The maximum number of tasks allowed to execute simultaneously.
goosefs.job.worker.threadpool.size: The number of threads used to process tasks.
2. Worker module:
goosefs.worker.network.reader.buffer.size: Affects Worker memory usage.
goosefs.worker.network.block.reader.threads.max: The maximum number of threads the Worker uses for reads.
3. Cosn module:
fs.cosn.block.size: The size of each loaded block.
fs.cosn.upload_thread_pool: The size of the read thread pool, shared within each worker.
fs.cosn.read.ahead.block.size: The granularity at which COSN requests COS.
fs.cosn.read.ahead.queue.size: The size of the read-ahead queue. COSN's read-ahead feature mainly targets large-file sequential reads; a load is a sequential read, but not necessarily of a large file.
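As a reference for where these keys live: the goosefs.* keys above are typically set in conf/goosefs-site.properties, while the fs.cosn.* keys belong to the Hadoop COSN client configuration (core-site.xml). A minimal sketch follows; the values shown are placeholder assumptions, not recommendations (tuning is covered in the next section):

# conf/goosefs-site.properties (placeholder values)
goosefs.user.block.worker.client.pool.max=1024
goosefs.job.worker.max.active.task.num=10
goosefs.job.worker.threadpool.size=10
goosefs.worker.network.reader.buffer.size=4MB
goosefs.worker.network.block.reader.threads.max=2048
# fs.cosn.* keys are set in core-site.xml; see the Cosn example below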

Configuration Tuning

Configure Cosn

The throughput the GooseFS cache cluster can provide depends on the throughput between the cluster and COS, which in turn depends on Cosn performance. Under normal circumstances, the Cosn configuration can be adjusted to business needs.
In big data scenarios, the average file size is relatively large. The recommended configuration is as follows:
1. fs.cosn.block.size: Recommended setting is 128MB.
2. fs.cosn.upload_thread_pool: Recommended setting is 2 - 3 times the number of CPUs. The thread pool size should be appropriately increased or decreased based on CPU usage.
3. fs.cosn.read.ahead.block.size: Needs to be adjusted according to the file size:
If the average file size is in the tens of MB, it can be set to 4MB.
If files are at the MB or KB level, set it to roughly the average or median file size, so that a file can be read back in a single RPC where possible. The data payload makes up a relatively small part of the HTTP request body, so the overhead of an RPC may exceed the cost of reading the data itself; therefore, minimize the number of RPCs.
4. fs.cosn.read.ahead.queue.size: The recommended setting is 8 - 32, sized according to available memory. Under normal circumstances, the memory used by one file input stream equals read.ahead.block.size x queue.size. The number of read-ahead blocks per block equals fs.cosn.block.size / fs.cosn.read.ahead.block.size, which is 32 with the recommended settings above (128MB / 4MB); setting the queue size above 32 therefore only wastes resources.
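Putting these recommendations together, a core-site.xml sketch might look like the following (assuming a hypothetical 16-core node and files averaging tens of MB; the size values are assumed to be in bytes, as in standard hadoop-cos releases):

<property>
  <name>fs.cosn.block.size</name>
  <value>134217728</value> <!-- 128MB -->
</property>
<property>
  <name>fs.cosn.upload_thread_pool</name>
  <value>32</value> <!-- 2x the assumed 16 cores -->
</property>
<property>
  <name>fs.cosn.read.ahead.block.size</name>
  <value>4194304</value> <!-- 4MB -->
</property>
<property>
  <name>fs.cosn.read.ahead.queue.size</name>
  <value>16</value> <!-- no benefit above 128MB / 4MB = 32 -->
</property>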

Worker Node Configuration

After confirming the Cosn configuration, you can further clarify the worker node configuration:
Worker node configuration: goosefs.worker.network.reader.buffer.size. This value also needs to be estimated against available memory. The total memory occupied by read operations equals the Worker read concurrency limit x (buffer.size + memory usage of one input stream + length of a single readRequest, 1MB by default).
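As a back-of-the-envelope example under hypothetical values (reader.buffer.size = 4MB; input-stream memory = 4MB x 16 = 64MB per the Cosn sketch above; read concurrency limit = 128):

128 x (4MB + 64MB + 1MB) = 128 x 69MB ≈ 8.6GB of Worker memory for reads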

Configuring JobWorker

Next, clarify the JobWorker configuration (an illustrative combination follows the list):
1. goosefs.job.worker.max.active.task.num: This value can be slightly larger than fs.cosn.upload_thread_pool to fully leverage COSN's capability.
2. goosefs.user.block.worker.client.pool.max: The default value is 1024. In principle, this value should equal twice goosefs.job.worker.max.active.task.num.
3. goosefs.job.worker.threadpool.size: The default value is 10. This value must be less than or equal to goosefs.job.worker.max.active.task.num.
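For example, continuing with fs.cosn.upload_thread_pool = 32 from the sketch above, one consistent (hypothetical) combination would be:

# conf/goosefs-site.properties (illustrative values)
# slightly above fs.cosn.upload_thread_pool (32)
goosefs.job.worker.max.active.task.num=40
# no greater than max.active.task.num
goosefs.job.worker.threadpool.size=40
# twice max.active.task.num, per the rule above
goosefs.user.block.worker.client.pool.max=80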

When initiating operations with the GooseFS Client, adjust the client's number of active jobs against goosefs.job.worker.max.active.task.num. In principle, the value of --active-jobs should be greater than goosefs.job.worker.max.active.task.num.