Cache Capability Overview
GooseFS leverages data affinity scheduling capabilities to route hotspot data from massive distributed storage systems to compute nodes near the data source, shortening the data IO path and enhancing data throughput. Before understanding GooseFS's caching capabilities, it is necessary to learn about several basic concepts associated with caching capabilities:
Namespace (NameSpace): GooseFS organizes and manages cached data through namespaces. A namespace has a many-to-one relationship with external storage; one or more namespaces can be associated with the same external storage service. GooseFS supports multiple types of external storage services such as COS, CHDFS, TStor OneCOS, and local HDFS, leveraging their persistent storage capabilities to store massive data and ensure data reliability.
Local cache storage: GooseFS provides distributed caching on the physical media of Worker nodes. Data generated by computations can be cached in the local memory, on the local disks, or on the AZ-level cloud disks of compute nodes. The cache capacity of each Worker node is user-configurable.
Users can set a cache policy to determine whether data I/O goes to local cache storage or to the external storage associated with the namespace.
GooseFS Cache Configuration
Users can open the goosefs-site.properties file to view the GooseFS cache configuration. GooseFS cache settings fall into two areas: cache levels and cache eviction policies. There are two cache levels, single-level storage and multi-level storage, and two cache eviction policies, LRU and LRFU. Each is described in detail below.
1. Cache Level
GooseFS provides two cache levels: single-level storage and multi-level storage. Single-level storage uses only one type of storage medium, while multi-level storage uses several and selects among them based on business workload conditions to provide matching I/O performance. GooseFS defaults to single-level storage. In multi-level storage scenarios, cache eviction incurs considerable performance overhead, so single-level storage is recommended. Depending on I/O performance requirements, MEM (memory), SSD (solid-state drive), and HDD (hard disk drive) can generally be selected as storage media.
Modifying the levels parameter in the goosefs-site.properties configuration file can change the cache level. An input value of 1 represents using only single-level storage, while an input value of 3 represents using three-level storage.
goosefs.worker.tieredstore.levels=1
Modifying the alias parameter in the goosefs-site.properties configuration file can change the storage medium corresponding to the cache layer.
goosefs.worker.tieredstore.level{x}.alias=MEM
Note:
{x} is the index of the storage level, starting from 0. For example, level0 is the top level, and it is the only level in single-level storage.
1.1 Single-level Storage
In single-level storage mode, the commonly used system configuration items are as follows:
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.ramdisk.size=16GB
goosefs.worker.memory.size=100GB
goosefs.worker.tieredstore.level0.dirs.path=/data/GooseFSWorker,/data1/GooseFSWorker,/data2/GooseFSWorker,/data3/GooseFSWorker,/data4/GooseFSWorker
goosefs.worker.tieredstore.level0.dirs.mediumtype=MEM,MEM,MEM,SSD,SSD
goosefs.worker.tieredstore.level0.dirs.quota=16GB,16GB,16GB,100GB,100GB
The configuration items are described below.
ramdisk.size: Specifies the amount of memory each worker node uses as a ramdisk. The ramdisk capacity must be smaller than the memory capacity. Once set, GooseFS allocates this amount of memory on each worker node, and the allocation counts against the total system memory.
memory.size: Specifies the total system memory of GooseFS. Once set, GooseFS automatically occupies the specified amount of the storage medium. The actual physical storage space must be larger than this value.
dirs.path: This configuration item is used to specify a directory. GooseFS allocates storage media for the specified directory and must be used in combination with dirs.mediumtype. As shown in the above example, it mounts /data/GooseFSWorker to the MEM storage medium and /data4/GooseFSWorker to the SSD storage medium. Note that the order of each directory must match the order of the storage media one by one.
dirs.mediumtype: This configuration item is used to specify the storage medium. GooseFS allocates storage media for the specified directory and must be used in combination with dirs.path. The media types available by default are MEM and SSD. If other storage media, such as HDD, are mounted, the configuration can be modified as needed.
dirs.quota: This configuration item is used to specify the preset space for each directory. After configuration, GooseFS will allocate the specified space for the corresponding directory, and the order must match the directory order one by one. As shown in the above example, it assigns 16GB MEM space to directories such as /data/GooseFSWorker, /data1/GooseFSWorker, and /data2/GooseFSWorker, and assigns 100GB SSD space to directories /data3/GooseFSWorker and /data4/GooseFSWorker.
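Because dirs.path, dirs.mediumtype, and dirs.quota are matched purely by position, a mismatch in list length is an easy mistake to make. The following is a minimal sketch of a pre-flight check, assuming the properties are available as plain text; `load_properties` and `level_dirs` are hypothetical helper names and are not part of GooseFS.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def level_dirs(props, level=0):
    """Pair each directory with its medium type and quota by position."""
    prefix = f"goosefs.worker.tieredstore.level{level}.dirs."
    paths = props[prefix + "path"].split(",")
    media = props[prefix + "mediumtype"].split(",")
    quotas = props[prefix + "quota"].split(",")
    if not (len(paths) == len(media) == len(quotas)):
        raise ValueError(
            f"level{level}: {len(paths)} paths, {len(media)} medium types, "
            f"{len(quotas)} quotas; the three lists must have equal length"
        )
    return list(zip(paths, media, quotas))


SAMPLE = """
goosefs.worker.tieredstore.levels=1
goosefs.worker.tieredstore.level0.alias=HDD
goosefs.worker.tieredstore.level0.dirs.path=/data/GooseFSWorker,/data1/GooseFSWorker,/data2/GooseFSWorker,/data3/GooseFSWorker,/data4/GooseFSWorker
goosefs.worker.tieredstore.level0.dirs.mediumtype=MEM,MEM,MEM,SSD,SSD
goosefs.worker.tieredstore.level0.dirs.quota=16GB,16GB,16GB,100GB,100GB
"""

dirs = level_dirs(load_properties(SAMPLE))
for path, medium, quota in dirs:
    print(f"{path}: {medium}, {quota}")
```

Running a check like this before restarting workers catches a missing or extra entry in any of the three lists.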
1.2 Tiered Storage
In tiered storage mode, the read-write mode of data blocks differs from that in single-level storage. In single-level storage mode, data is directly read and written from one storage medium. In multi-level storage mode, data is preferentially written to the highest storage level by default, and when reading, data needs to be moved to the highest storage level first. Specifically:
Data writing: New data blocks are written to the top-level storage medium by default. If the top-level storage medium is full, GooseFS stores the data in the next-level storage medium, and so on down the levels. If all storage media are full, GooseFS evicts data according to the user-specified cache replacement policy. If no data can be evicted and all available storage space is full, the write fails.
Data reads: In multi-level storage mode, cold data will be transparently moved to lower-level storage media. When data is read, it will be heated and placed in the top-level storage.
Note:
When evicting under the specified policy, GooseFS cleans up a certain amount of data to reach a configured amount of free space. This amount is specified via goosefs.worker.tieredstore.free.ahead.bytes, with a default value of 0.
In tiered storage mode, reads may frequently move data from lower-level storage media to higher-level storage media, which can trigger frequent cache eviction and degrade performance. Generally, single-level storage is recommended.
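The write and read behavior described above can be sketched with a toy model. This is an illustration only, not GooseFS code: capacities are counted in blocks, the class name `TieredStoreSketch` is hypothetical, and two details the text leaves open are filled with labeled assumptions (eviction takes the least recently used block of the bottom tier, and a promoted block swaps places with the coldest block of the top tier).

```python
from collections import OrderedDict


class TieredStoreSketch:
    """Toy worker with ordered tiers; each tier holds at least one block."""

    def __init__(self, tiers):
        # tiers: list of (alias, capacity), highest tier first.
        self.aliases = [alias for alias, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        # One OrderedDict per tier; its order doubles as recency order.
        self.blocks = [OrderedDict() for _ in tiers]

    def locate(self, block_id):
        """Return the tier index holding the block, or None."""
        for level, tier in enumerate(self.blocks):
            if block_id in tier:
                return level
        return None

    def write(self, block_id):
        # Try tiers from the top down; use the first one with free space.
        for level, tier in enumerate(self.blocks):
            if len(tier) < self.capacity[level]:
                tier[block_id] = True
                return self.aliases[level]
        # All tiers are full. Assumption: evict the least recently used
        # block of the bottom tier and reuse its slot.
        bottom = self.blocks[-1]
        bottom.popitem(last=False)
        bottom[block_id] = True
        return self.aliases[-1]

    def read(self, block_id):
        level = self.locate(block_id)
        if level is None:
            raise KeyError(block_id)
        top = self.blocks[0]
        if level == 0:
            top.move_to_end(block_id)  # refresh recency only
            return self.aliases[0]
        # Promote the block to the top tier.
        del self.blocks[level][block_id]
        if len(top) >= self.capacity[0]:
            # Assumption: the coldest top-tier block moves down to make room.
            cold_id, _ = top.popitem(last=False)
            self.blocks[level][cold_id] = True
        top[block_id] = True
        return self.aliases[0]


store = TieredStoreSketch([("MEM", 2), ("SSD", 2)])
placed = [store.write(b) for b in ("b1", "b2", "b3", "b4")]
print(placed)              # ['MEM', 'MEM', 'SSD', 'SSD']
print(store.write("b5"))   # SSD (b3 was evicted to make room)
print(store.read("b4"))    # MEM (b1 moved down to SSD)
```

Even in this small model, a single read of a lower-tier block forces a second block to move, which is the extra work the note above warns about.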
In tiered storage mode, the commonly used system configuration items are as follows:
goosefs.worker.tieredstore.levels
goosefs.worker.tieredstore.level{x}.alias
goosefs.worker.tieredstore.level{x}.dirs.path
goosefs.worker.tieredstore.level{x}.dirs.mediumtype
goosefs.worker.tieredstore.level{x}.dirs.quota
Among them, x represents the cache level. Generally, 3 storage tiers can be configured, corresponding to storage media such as MEM, SSD, and HDD. Taking a 2-tier storage with storage media being MEM and SSD, and each tier allocated 100GB storage as an example, the corresponding configuration information is as follows:
goosefs.worker.tieredstore.levels=2
goosefs.worker.tieredstore.level0.alias=MEM
goosefs.worker.tieredstore.level0.dirs.path=/data/GooseFSWorker
goosefs.worker.tieredstore.level0.dirs.mediumtype=MEM
goosefs.worker.tieredstore.level0.dirs.quota=100GB
goosefs.worker.tieredstore.level1.alias=SSD
goosefs.worker.tieredstore.level1.dirs.path=/data1/GooseFSWorker
goosefs.worker.tieredstore.level1.dirs.mediumtype=SSD
goosefs.worker.tieredstore.level1.dirs.quota=100GB
GooseFS places no limit on the number of levels in multi-level storage, but the alias of each level must be unique. Generally, 3 cache levels corresponding to MEM, SSD, and HDD give better tiering results.
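The per-level keys follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes each tier is described by a small dictionary; `tier_properties` is a hypothetical helper, not a GooseFS tool. It enforces the unique-alias rule and reproduces the 2-tier example above.

```python
def tier_properties(tiers):
    """Render goosefs.worker.tieredstore.* lines for tiers, top tier first."""
    aliases = [tier["alias"] for tier in tiers]
    if len(set(aliases)) != len(aliases):
        raise ValueError("each tier alias must be unique")
    prefix = "goosefs.worker.tieredstore"
    lines = [f"{prefix}.levels={len(tiers)}"]
    for level, tier in enumerate(tiers):
        if not (len(tier["paths"]) == len(tier["media"]) == len(tier["quotas"])):
            raise ValueError(f"level{level}: paths, media, quotas differ in length")
        base = f"{prefix}.level{level}"
        lines.append(f"{base}.alias={tier['alias']}")
        lines.append(f"{base}.dirs.path={','.join(tier['paths'])}")
        lines.append(f"{base}.dirs.mediumtype={','.join(tier['media'])}")
        lines.append(f"{base}.dirs.quota={','.join(tier['quotas'])}")
    return lines


two_tier = tier_properties([
    {"alias": "MEM", "paths": ["/data/GooseFSWorker"],
     "media": ["MEM"], "quotas": ["100GB"]},
    {"alias": "SSD", "paths": ["/data1/GooseFSWorker"],
     "media": ["SSD"], "quotas": ["100GB"]},
])
print("\n".join(two_tier))
```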
2. Cache Replacement Policy
GooseFS provides two cache replacement policies:
LRUAnnotator: Evicts cache in least-recently-used (LRU) order. This is the default policy of GooseFS.
LRFUAnnotator: Evicts cache based on a combination of least-recently-used (LRU) and least-frequently-used (LFU) order. The weight of each strategy during eviction is adjusted through the configuration items goosefs.worker.block.annotator.lrfu.step.factor and goosefs.worker.block.annotator.lrfu.attenuation.factor.
If the weight is fully biased toward least-recently-used, this policy behaves the same as LRUAnnotator.
You can configure the cache replacement policy in the goosefs.worker.block.annotator.class configuration item, and the correct policy name must be specified during configuration:
goosefs.worker.block.annotator.LRUAnnotator
goosefs.worker.block.annotator.LRFUAnnotator
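How the two factors shift LRFU between frequency and recency can be illustrated with a small scoring sketch. It assumes the textbook LRFU formulation: each block carries a score that gains 1 on every access and decays by (1 / attenuation_factor) ^ (step_factor × elapsed accesses). The class `LRFUSketch` and its methods are hypothetical, and the actual GooseFS implementation may differ.

```python
class LRFUSketch:
    """Textbook LRFU scoring; the block with the lowest score is evicted first."""

    def __init__(self, step_factor, attenuation_factor):
        self.step_factor = step_factor
        self.attenuation_factor = attenuation_factor
        self.clock = 0      # logical time: one tick per access
        self.scores = {}    # block_id -> (score, clock at last update)

    def _weight(self, elapsed):
        # Decay applied to a score that is `elapsed` ticks old.
        return (1.0 / self.attenuation_factor) ** (elapsed * self.step_factor)

    def access(self, block_id):
        self.clock += 1
        score, last = self.scores.get(block_id, (0.0, self.clock))
        decayed = score * self._weight(self.clock - last)
        self.scores[block_id] = (decayed + 1.0, self.clock)

    def eviction_order(self):
        # Decay every score to the current tick, then sort ascending.
        def current(block_id):
            score, last = self.scores[block_id]
            return score * self._weight(self.clock - last)
        return sorted(self.scores, key=current)


def replay(step_factor, attenuation_factor, trace):
    sketch = LRFUSketch(step_factor, attenuation_factor)
    for block_id in trace:
        sketch.access(block_id)
    return sketch.eviction_order()


trace = ["A", "A", "A", "B"]   # A is frequent, B is most recent
lfu_like = replay(0.0, 2.0, trace)    # no decay: frequency decides
lru_like = replay(10.0, 2.0, trace)   # strong decay: recency decides
print(lfu_like)  # ['B', 'A']
print(lru_like)  # ['A', 'B']
```

With no decay, the frequently read block A survives and B is evicted first. With strong decay, old accesses count for almost nothing, so the ordering matches LRU.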
Data Lifecycle Management
Here, the lifecycle refers to the data lifecycle within GooseFS, rather than that of the remote storage system UFS. Lifecycle management mainly includes the following four operations:
free: This operation is used to delete the data of the corresponding directory or file from GooseFS (the corresponding data in UFS will not be deleted). Such operations are mainly used to release the cache space of GooseFS for other hotter data.
load: This operation is used to load data from the corresponding directory or file in UFS into GooseFS. Such operations are primarily used to convert cold data to hot data, thereby improving I/O performance.
persist: writes GooseFS data to the UFS to store data persistently and avoid data loss.
TTL (Time to Live): Sets the lifetime of directories or files in GooseFS. Once their lifetime is exceeded, they are deleted from GooseFS and UFS. The behavior can also be configured to delete only the data in GooseFS or only to free up disk space.
1. Releasing Data From GooseFS
The `free` command can release data from GooseFS. The following example shows that after releasing data, the data state in GooseFS changes to 0%.
$ goosefs fs free /data/test.txt
/data/test.txt was successfully freed from GooseFS space.
$ goosefs fs ls /data/test.txt
-rw-rw-rw- hadoop hadoop 14 PERSISTED 03-11-2021 11:46:15:000 0% /data/test.txt
In the example, `/data/test.txt` can be any valid GooseFS file path. If the data is stored in the remote storage UFS, it can be reloaded using the `load` command.
Note:
Generally, it is recommended to rely on the configured cache policy to clean up historical data in GooseFS automatically.
2. Loading Data Into GooseFS
The `load` command can load data into GooseFS. The following example shows that after loading data, the data state in GooseFS changes to 100%.
$ goosefs fs load /data/test.txt
/data/test.txt loaded
$ goosefs fs ls /data/test.txt
-rw-rw-rw- hadoop hadoop 14 PERSISTED 03-11-2021 11:46:15:000 100% /data/test.txt
In the example, `/data/test.txt` can be any valid GooseFS file path. If the data is loaded from the local file system, the `copyFromLocal` command must be used. It should be noted that loading data from the local file system does not persist the data to the remote storage UFS. By default, GooseFS can automatically load data into its cache on the first access to a file, so this command is generally not required. However, if business needs require preloading data into the cache, this command can be used.
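When scripting around `free` and `load`, it can be handy to read the cached percentage from the `ls` output. The sketch below assumes the column layout seen in the sample output in this section (permissions, owner, group, size, persistence state, date, time, percentage, path); `parse_ls_line` is a hypothetical helper, and the layout may differ between versions.

```python
def parse_ls_line(line):
    """Extract size, persistence state, cached percentage, and path."""
    # Split into at most 9 fields so a path containing spaces stays whole.
    fields = line.split(None, 8)
    if len(fields) != 9 or not fields[7].endswith("%"):
        raise ValueError(f"unexpected ls output: {line!r}")
    return {
        "size": int(fields[3]),
        "state": fields[4],
        "cached_percent": int(fields[7].rstrip("%")),
        "path": fields[8].strip(),
    }


freed = parse_ls_line(
    "-rw-rw-rw- hadoop hadoop 14 PERSISTED 03-11-2021 11:46:15:000 0% /data/test.txt"
)
loaded = parse_ls_line(
    "-rw-rw-rw- hadoop hadoop 14 PERSISTED 03-11-2021 11:46:15:000 100% /data/test.txt"
)
print(freed["cached_percent"], loaded["cached_percent"])  # 0 100
```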
3. Persisting Data in GooseFS
The `persist` command can persist data to the remote storage system UFS.
$ goosefs fs persist /data/test.txt
$ goosefs fs ls /data/test.txt
-rw-rw-rw- hadoop hadoop 14 PERSISTED 03-11-2021 11:46:15:000 100% /data/test.txt
In the example, `/data/test.txt` can be any valid GooseFS file path. It is recommended to configure the cache policy to automatically perform the persistence operation.
4. Setting the Data TTL Property
GooseFS supports setting TTL attributes for any file or directory in a namespace. With a TTL set, business data is deleted once the specified period elapses, freeing up space for new files and enabling effective use of local disk space. TTL attributes are stored as part of the file metadata, so they remain in effect after a cluster restart. A background thread periodically checks these TTL attributes and automatically cleans up expired files.
The inspection cycle and inspection type of the periodic inspection program can be set in goosefs-site.properties. The following example sets the inspection cycle to 10 minutes and the inspection type to FREE.
goosefs.master.ttl.checker.interval=10m
goosefs.user.file.create.ttl.action=FREE
After the settings are completed, GooseFS background threads will inspect every 10 minutes. Expired files discovered during the current inspection cycle will have their cached resources released after the next inspection cycle completes. For example, if the first inspection at 00:00:00 finds a batch of expired files, these files' cache will be cleared at 00:10:00.
Note:
The default unit of goosefs.master.ttl.checker.interval is milliseconds (ms). You can append a unit when setting the value, such as seconds (s), minutes (m), or hours (h). For more details, refer to the configuration description.
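The timing in the example above can be worked through with a short sketch. It covers only the units named in the note (ms, s, m, h) and assumes, as the example states, that an expired file found in one inspection is released one interval later. Both function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Milliseconds per unit; a bare number is treated as milliseconds.
UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_interval_ms(value):
    """Convert values such as '10m', '30s', or '600000' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h)?\s*", value)
    if match is None:
        raise ValueError(f"unsupported interval: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MS[unit or "ms"]


def release_time(detected_at, interval):
    """Time at which a file found expired at detected_at is released."""
    return detected_at + timedelta(milliseconds=parse_interval_ms(interval))


found = datetime(2021, 4, 12, 0, 0, 0)
released = release_time(found, "10m")
print(released.time())  # 00:10:00
```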
Data Replication Management
Data replication is mainly divided into two forms: passive replication and active replication.
1. Passive Replication
Each file in GooseFS contains one or more data blocks distributed across the cluster. By default, GooseFS can automatically adjust the replication level of data blocks based on load and capacity conditions. Passive replication occurs in the following scenarios:
Multiple clients reading the same block simultaneously can create replicas of that block on different workers at the same time.
When local-read priority is enabled and the data is not available locally, a remote read is initiated and the data is saved on the local worker.
When the number of replicas produced by passive replication exceeds the configured number of replicas, GooseFS asynchronously deletes the extra replicas. This process is completely transparent to users.
2. Active Replication
Active replication is achieved by setting the number of replicas for a file. Blocks with fewer replicas than the configured number are replenished asynchronously, and blocks with more replicas than the configured number have the extra replicas deleted. The level of active replication can be adjusted with the following command:
$ goosefs fs setReplication [-R] [--max <num> | --min <num>] <path>
The relevant parameters are described as follows:
max: The maximum number of replicas for a specified file. The default value is -1, which means no upper limit is set; if set to 0, it means the cold data of the specified file is not stored in GooseFS. It is generally set as a positive integer. After configuration, GooseFS will check the number of file replicas and delete the extra ones.
min: The minimum number of replicas for a specified file. The default value is 0, meaning GooseFS will delete the data and retain no replicas after the data becomes cold. It is generally set as a positive integer and must be less than the max value. After configuration, GooseFS will check the number of file replicas and automatically add replicas if the number is smaller than the minimum value.
path: The specified file name, which can be a directory or a specific file path.
-R: If path is a directory, specifying -R applies the specified minimum and maximum number of replicas recursively to all files and subdirectories under that directory.
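The effect of min and max on a single block can be summarized as a reconciliation rule. The sketch below only illustrates the behavior described above; `replica_adjustment` is a hypothetical name, and real adjustments happen asynchronously inside GooseFS.

```python
def replica_adjustment(current, min_replicas=0, max_replicas=-1):
    """Return how many replicas to add (positive) or delete (negative).

    max_replicas == -1 means no upper limit, matching the default.
    """
    if max_replicas != -1 and min_replicas > max_replicas:
        raise ValueError("min must not exceed max")
    if current < min_replicas:
        return min_replicas - current      # replenish missing replicas
    if max_replicas != -1 and current > max_replicas:
        return max_replicas - current      # delete extra replicas
    return 0                               # already within bounds


print(replica_adjustment(1, min_replicas=3))                  # 2
print(replica_adjustment(5, max_replicas=2))                  # -3
print(replica_adjustment(5))                                  # 0
print(replica_adjustment(2, min_replicas=1, max_replicas=3))  # 0
```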
To check the replication status of a specified file, you can use the stat command. The following example shows that for the file /data/test.txt the maximum number of replicas is -1 (no upper limit) and the minimum is 0, meaning no replicas are retained in GooseFS after the data becomes cold.
$ goosefs fs stat /data/test.txt
/data/test.txt is a file path.
FileInfo{fileId=50331647, fileIdentifier=null, name=test.txt, path=/data/test.txt, ufsPath=hdfs://172.16.16.16:4007/data/test.txt, length=0, blockSizeBytes=134217728, creationTimeMs=1618193473555, completed=true, folder=false, pinned=false, pinnedlocation=[], cacheable=true, persisted=true, blockIds=[], inMemoryPercentage=100, lastModificationTimesMs=1616763603692, ttl=-1, lastAccessTimesMs=1616763603692, ttlAction=DELETE, owner=hadoop, group=supergroup, mode=420, persistenceState=PERSISTED, mountPoint=false, replicationMax=-1, replicationMin=0, fileBlockInfos=[], mountId=1, inGooseFSPercentage=100, ufsFingerprint=TYPE|FILE UFS|hdfs OWNER|hadoop GROUP|supergroup MODE|420 CONTENT_HASH|(len:0,_modtime:1616763603692) , acl=user::rw-,group::r--,other::r--, defaultAcl=}
This file does not contain any blocks.
View Cache Usage
GooseFS records the capacity and usage of local cache. You can view GooseFS's operating status through the following commands to manage and maintain the local cache accordingly.
View GooseFS cache in use, in bytes
$ goosefs fs getUsedBytes
Used Bytes: 0
View GooseFS total cache capacity, in bytes
$ goosefs fs getCapacityBytes
Capacity Bytes: 1610612736000
View GooseFS cache usage report
$ goosefs fsadmin report
GooseFS cluster summary:
Master Address: 172.16.16.16:19998
Web Port: 19999
Rpc Port: 19998
Started: 04-12-2021 10:52:05:255
Uptime: 0 day(s), 1 hour(s), 28 minute(s), and 57 second(s)
Version: 2.5.0-SNAPSHOT
Safe Mode: false
Zookeeper Enabled: false
Live Workers: 3
Lost Workers: 0
Total Capacity: 1500.00GB
Tier: HDD Size: 1500.00GB
Used Capacity: 0B
Tier: HDD Size: 0B
Free Capacity: 1500.00GB
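The byte counts from getUsedBytes and getCapacityBytes line up with the report when 1 GB is taken as 1024^3 bytes, as the figures above imply (1610612736000 bytes is 1500.00GB). A small sketch of the conversion, with hypothetical helper names:

```python
GB = 1024 ** 3  # the report's figures imply binary gigabytes


def bytes_to_gb(num_bytes):
    """Convert a byte count to gigabytes (1 GB = 1024^3 bytes)."""
    return num_bytes / GB


def usage_percent(used_bytes, capacity_bytes):
    """Share of cache capacity in use; 0.0 when capacity is zero."""
    if capacity_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / capacity_bytes


capacity = 1610612736000   # from getCapacityBytes above
used = 0                   # from getUsedBytes above
capacity_label = f"{bytes_to_gb(capacity):.2f}GB"
print(capacity_label)                  # 1500.00GB
print(usage_percent(used, capacity))   # 0.0
```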