
AI Scenario Production Environment Configuration Practice
Last updated: 2025-07-17 17:42:55

Overview

Data Accelerator Goose FileSystem (GooseFS) supports multiple deployment methods, including control plane deployment, TKE cluster deployment, and EMR cluster deployment. In AI scenarios, control plane deployment and TKE cluster deployment are most commonly used, together with a high-availability architecture to meet business continuity requirements.

A high-availability architecture is a primary-standby architecture with multiple Master nodes. Only one of these nodes serves as the primary (Leader) node and provides services externally, while the remaining Standby nodes keep their file system state consistent with the primary by replaying the shared journal. If the primary node fails or goes down, a new primary is automatically elected from the current Standby nodes to take over and continue providing services, which eliminates the single point of failure and achieves overall high availability. GooseFS currently supports two ways of keeping the primary and standby Master state strongly consistent: Raft logs (embedded journal) and ZooKeeper. In container scenarios, we recommend deploying the high-availability architecture based on the Raft log mode. This document focuses on Raft-based high-availability deployment configurations and provides separate recommendations for sequential read and random read scenarios.
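
When the Raft-based embedded journal is used, clients and Workers also need the full list of Master addresses so that they can locate the current Leader. The snippet below is a minimal sketch of that client/worker-side setting; it assumes the goosefs.master.rpc.addresses property and the default Master RPC port 9200, as well as a GOOSEFS_HOME environment variable pointing at the GooseFS installation directory. Adjust the property name, port, and path to match your actual deployment.

# Minimal sketch (assumed property name, port, and GOOSEFS_HOME path):
# let clients and Workers discover the current Leader among all Masters
cat >> ${GOOSEFS_HOME}/conf/goosefs-site.properties <<'EOF'
goosefs.master.rpc.addresses=<master1>:9200,<master2>:9200,<master3>:9200
EOF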

High-Availability Architecture Deployment Configuration Based on Raft (Sequential Read Scenario)

For scenarios that require sequential reads, use the following recommended configuration. Copy these configuration items into the goosefs-site.properties file to complete the high-availability architecture configuration:
goosefs.master.embedded.journal.addresses=<master1>:9202,<master2>:9202,<master3>:9202

goosefs.master.metastore=ROCKS
# Keep the block metastore on HEAP if RocksDB stability is uncertain
goosefs.master.metastore.block=HEAP

# Set according to the available memory size
goosefs.master.metastore.inode.cache.max.size=10000000

# RocksDB data storage path
goosefs.master.metastore.dir=/meta-1/metastore

# The root directory mount path must be placed in a safe directory to prevent accidental deletion
goosefs.master.mount.table.root.ufs=/meta-1/underFSStorage

# Raft journal storage path
goosefs.master.journal.folder=/meta-1/journal

# Election timeout that triggers a master switchover; do not set it too low (JVM GC pauses can cause switchover oscillation) or too high (it prolongs the recovery time after a failure)
goosefs.master.embedded.journal.election.timeout=20s

# For large data volumes, disabling this check is strongly recommended
goosefs.master.startup.block.integrity.check.enabled=false

# Checkpoint trigger interval; do not set it too small (a master cannot participate in leader election while taking a checkpoint) or too large (a large checkpoint prolongs service restarts). Estimate it from the checkpoint loading duration.
goosefs.master.journal.checkpoint.period.entries=20000000

# ACL permission check switch; set it according to your scenario
goosefs.security.authorization.permission.enabled=false

# Recommended to enable; otherwise the hostname is used, and hostnames may be duplicated.
goosefs.network.ip.address.used=true

# Worker properties
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.tieredstore.level0.dirs.quota=7TB,7TB
goosefs.worker.tieredstore.level0.dirs.path=/data-1,/data-2

# Worker registration timeout on restart; increase it for large data volumes.
goosefs.worker.registry.get.timeout.ms=3600s

# Read data response timeout (default: 1h)
goosefs.user.streaming.data.timeout=60s

# Write location policy; LocalFirstPolicy is the default and may cause data imbalance
goosefs.user.block.write.location.policy.class=com.qcloud.cos.goosefs.client.block.policy.RoundRobinPolicy

# Affects distributedLoad speed. If the impact on online reads is not a concern, set it to CPU count * 2.
goosefs.job.worker.threadpool.size=50
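
After the configuration above has been distributed to every Master and Worker node, start the cluster and confirm which Master holds the Leader role. The commands below are an operational sketch only, assuming a standard tarball deployment whose bin directory provides the goosefs and goosefs-start.sh scripts; verify the exact commands against your GooseFS version before running them.

# Sketch, assuming ${GOOSEFS_HOME}/bin contains the standard GooseFS launcher scripts
# Format the journal once before the very first start (this erases any existing journal data)
${GOOSEFS_HOME}/bin/goosefs format
# Run on each Master node and each Worker node respectively
${GOOSEFS_HOME}/bin/goosefs-start.sh master
${GOOSEFS_HOME}/bin/goosefs-start.sh worker
# Check which Master is currently the Leader and whether all Workers have registered
${GOOSEFS_HOME}/bin/goosefs fsadmin report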

High-Availability Architecture Deployment Configuration Based on Raft (Random Read Scenario)

For scenarios that require random reads, use the following recommended configuration. Copy these configuration items into the goosefs-site.properties file to complete the high-availability architecture configuration:
goosefs.master.embedded.journal.addresses=<master1>:9202,<master2>:9202,<master3>:9202

goosefs.master.metastore=ROCKS
# Keep the block metastore on HEAP if RocksDB stability is uncertain
goosefs.master.metastore.block=HEAP

# Set according to the available memory size
goosefs.master.metastore.inode.cache.max.size=10000000

# RocksDB data storage path
goosefs.master.metastore.dir=/meta-1/metastore

# The root directory mount path must be placed in a safe directory to prevent accidental deletion
goosefs.master.mount.table.root.ufs=/meta-1/underFSStorage

# Raft journal storage path
goosefs.master.journal.folder=/meta-1/journal

# Election timeout that triggers a master switchover; do not set it too low (JVM GC pauses can cause switchover oscillation) or too high (it prolongs the recovery time after a failure)
goosefs.master.embedded.journal.election.timeout=20s

# For large data volumes, disabling this check is strongly recommended
goosefs.master.startup.block.integrity.check.enabled=false

# Checkpoint trigger interval; do not set it too small (a master cannot participate in leader election while taking a checkpoint) or too large (a large checkpoint prolongs service restarts). Estimate it from the checkpoint loading duration.
goosefs.master.journal.checkpoint.period.entries=20000000

# ACL permission check switch; set it according to your scenario
goosefs.security.authorization.permission.enabled=false

# Recommended to enable; otherwise the hostname is used, and hostnames may be duplicated.
goosefs.network.ip.address.used=true

# Worker properties
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.tieredstore.level0.dirs.quota=7TB,7TB
goosefs.worker.tieredstore.level0.dirs.path=/data-1,/data-2

# Worker registration timeout on restart; increase it for large data volumes.
goosefs.worker.registry.get.timeout.ms=3600s

# Read data response timeout (default: 1h)
goosefs.user.streaming.data.timeout=60s

# Write location policy; LocalFirstPolicy is the default and may cause data imbalance
goosefs.user.block.write.location.policy.class=com.qcloud.cos.goosefs.client.block.policy.RoundRobinPolicy

# For random reads, reduce this value (default: 1MB) to prevent read amplification
goosefs.user.streaming.reader.chunk.size.bytes=256KB
goosefs.user.local.reader.chunk.size.bytes=256KB

# Time to wait for the worker read stream to close. For reads of many small files or random reads, reduce this value (default: 5s) to avoid long-tail performance degradation.
goosefs.user.streaming.reader.close.timeout=100ms
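
For AI training workloads that repeatedly perform random reads over a fixed dataset, it is usually worth warming the Worker cache with distributedLoad (mentioned in the sequential read configuration above) before training starts, so that random reads hit the local tiered store instead of the remote UFS. The command below is a minimal sketch; /training-data is a hypothetical GooseFS path, and the command should be verified against your GooseFS version.

# Minimal warm-up sketch; replace the hypothetical /training-data path with your dataset directory
${GOOSEFS_HOME}/bin/goosefs fs distributedLoad /training-data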

