GooseFSx (Data Accelerator Goose FileSystem extreme) is a high-performance, POSIX-compliant data accelerator launched by Tencent Cloud. It accelerates Cloud Object Storage (COS), delivers ultra-high performance and ultra-low latency for high-performance computing workloads, and lets users flexibly manage hot and cold data. It is suitable for business scenarios such as high-performance computing, autonomous driving, and machine learning. GooseFSx is a fully managed service that is easy to use, billed hourly, can be released when no longer needed, and persists data through COS, as shown in the figure below:
GooseFSx can be automatically mounted as a local directory on a host (this directory is called the mount directory). Accessing the GooseFSx mount directory from the host is the same as accessing a local file system.
GooseFSx accelerates COS: it loads data directly from COS so that hosts can access the cached data in GooseFSx at high speed, and it can settle the computation results generated on GooseFSx back to COS for persistent, low-cost storage.
Super Directory
A super directory (Fileset) is a directory with features such as quotas, delete file records, and QoS. It is a sub-file system within the GooseFSx file system, with independent storage space and performance management capabilities. Compared with ordinary directories, super directories are more powerful, as shown in the table below.
| Comparison item | Super directory | Ordinary directory |
| --- | --- | --- |
| Directory quota | Supported | Not supported |
| Delete file records | Supported | Not supported |
| Parent directory | Root directory or ordinary directory | Root directory, ordinary directory, or super directory |
| Directory content | Ordinary directory or file | Ordinary directory, file, or super directory |
| Quantity | | A single instance supports 10-billion-level ordinary directories |
Super directories enable refined resource management and policy control within a single GooseFSx file system. Typical usage: assign different super directories to different departments, users, or applications so that their data is not mixed while they share the same GooseFSx instance; assign each super directory its own capacity quota or file quota so that no single department, user, or application can exhaust global resources; and record access logs at super directory granularity to meet compliance requirements.
Quota
A quota sets the storage capacity limit and file number limit for a super directory. Quotas help GooseFSx allocate and manage storage resources across multiple departments, users, or application scenarios, enabling refined resource management and preventing a single department, user, or application from exhausting global resources.
Capacity quota, the maximum writable capacity for a super directory. Once the quota upper limit is reached, no new data can be written.
File number quota, the maximum number of files and directories that can be written to a super directory. Once the quota upper limit is reached, no new data can be written.
When a quota is set, each write I/O adds the capacity and number of files written by that operation to the quota's used capacity and used file count, and the totals are checked against the quota. If the quota would be exceeded, the write I/O fails with the error Disk quota exceeded; otherwise, the operation is allowed. This keeps usage within the quota at all times. It is recommended to set super directory quotas reasonably and to configure alarm policies so that you are alerted promptly when capacity or file count approaches the quota.
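As a hedged illustration of this behavior, the minimal Python sketch below writes a file under a super directory and handles the Disk quota exceeded error (errno EDQUOT) that a write receives once the quota is reached; the mount path and file name are hypothetical.

```python
import errno
import os

# Hypothetical GooseFSx mount path and super directory; adjust to your environment.
SUPER_DIR = "/goosefsx/team-a"

def write_with_quota_check(path: str, data: bytes) -> bool:
    """Write data under a quota-limited super directory.

    Returns True on success, False if the capacity or file-number quota is exhausted.
    """
    try:
        with open(path, "wb") as f:
            f.write(data)
        return True
    except OSError as err:
        if err.errno == errno.EDQUOT:
            # Write rejected with "Disk quota exceeded": usage would exceed the quota.
            print(f"Quota exceeded under {os.path.dirname(path)}; clean up data or raise the quota.")
            return False
        raise

if __name__ == "__main__":
    write_with_quota_check(os.path.join(SUPER_DIR, "result.dat"), b"x" * 1024)
```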
Quotas are used to implement refined resource management within a single GooseFSx file system. A typical usage is to assign different quotas to different departments, users, or applications: each department, user, or application gets its own super directory, and each super directory gets its own capacity quota or file number quota.
Delete File Records
Delete file records are operation records of file or directory deletions, including information such as path, file name, and time. They are a feature provided by GooseFSx for auditing deletion operations.
Delete file records can be turned on or off for super directories; ordinary directories do not support this feature. Once delete file records are enabled for a super directory, all data deletion operations under that super directory are logged.
Delete file records are used to quickly identify and fix issues, for example in scenarios of accidental data deletion.
POSIX Client
A GooseFSx POSIX client is a host on which the GooseFSx POSIX client software is deployed and GooseFSx is mounted as a local directory. GooseFSx can automatically turn a host into a POSIX client: it automatically deploys the GooseFSx POSIX client software on the host and mounts GooseFSx as a local directory of the host. Accessing the GooseFSx mount directory from a POSIX client is the same as accessing a local file system. As shown below:
When a GooseFSx instance is created, three Tencent Cloud Virtual Machine (CVM) instances are automatically created under your Tencent Cloud account. Their specification is no lower than 4 cores and 8 GB of memory (4C8G), and they are deployed as POSIX client management nodes. The POSIX client management nodes are created and terminated together with the GooseFSx instance; you do not need to manage them.
Note:
Do not destroy or modify the POSIX client management nodes; otherwise, the POSIX clients will not work properly.
GooseFSx configures three POSIX client management nodes to ensure high availability. In extreme cases, a POSIX client management node failure may prevent POSIX clients from accessing GooseFSx normally, but it will not affect data already written to GooseFSx. Once the failed management node recovers, the POSIX clients can access GooseFSx normally again.
POSIX client management nodes manage the deployment and deletion of POSIX clients. They do not participate in the data path, do not access data, and do not store data.
POSIX client management nodes need management-plane connectivity with the POSIX clients. To protect the information security of your POSIX clients, the management nodes are deployed in your VPC, ensuring that management traffic never leaves your VPC.
GooseFSx supports automatically adding POSIX clients: specified hosts are added as GooseFSx POSIX clients automatically, so you do not need to run commands step by step on each host. POSIX clients can be added in batches, with multiple clients added automatically at once, and they can likewise be deleted automatically, individually or in batches.
GooseFSx manages POSIX clients centrally: you can query in real time which POSIX clients are using GooseFSx, promptly delete clients that no longer need GooseFSx, and immediately add clients that do.
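Since a POSIX client exposes GooseFSx as an ordinary local directory, standard file APIs work unchanged. The minimal Python sketch below writes and reads a file under a hypothetical mount directory (/goosefsx is an assumed path, not a fixed GooseFSx convention).

```python
from pathlib import Path

# Hypothetical GooseFSx mount directory on this POSIX client.
mount_dir = Path("/goosefsx")

# Standard POSIX file operations work exactly as on a local file system.
sample = mount_dir / "demo" / "hello.txt"
sample.parent.mkdir(parents=True, exist_ok=True)
sample.write_text("hello from a GooseFSx POSIX client\n")
print(sample.read_text())
```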
Data Flow
Data flow refers to the on-demand movement of data between the GooseFSx data accelerator and COS object storage. GooseFSx pre-warms the required data from COS so that hosts can access it in GooseFSx at high speed, thereby accelerating COS; it settles the generated computation results to COS for persistent, low-cost storage, or shares them through COS's Internet distribution capability.
GooseFSx can flow data with multiple intra-region or cross-region COS buckets at the same time. For data flow with cross-region COS buckets, using COS global acceleration domain names is recommended: it is fast, low-cost, and simple to use. Alternatively, you can use a network connection that reaches the cross-region COS buckets, which requires manual configuration and management.
| Comparison item | COS global acceleration domain name | Cross-region interconnection network |
| --- | --- | --- |
| How it works | The load balancing system leverages Tencent's global traffic scheduling and globally distributed cloud data centers to select the best network access link, so that requests are served from the nearest point. | Users manually set up cloud networking or other interconnection technologies to establish network connectivity between cross-region COS buckets and the associated VPC. |
| Speed | High data flow performance. For details, contact a Storage Architect. | Uncertain; depends on the cross-region interconnection network. |
| Cost | Low cost. | It is advisable to reuse an existing cross-region interconnection network; building a dedicated one is expensive. |
| Usability | Fully automated and easy to use: you only need to enable global acceleration on the COS bucket, and GooseFSx automatically obtains and uses the bucket's global acceleration domain name. | Manually managed: for example, create or configure the cross-region interconnection network, map the cross-region COS bucket to an IP in the associated VPC (the user VPC associated with GooseFSx), and manage the connectivity and speed of the network. |
Data flow maintains a one-to-one correspondence between the data in GooseFSx directories and the data in COS buckets. During pre-warming, COS bucket objects are converted into files and pre-warmed into the corresponding GooseFSx directories; during settlement, files in GooseFSx directories are converted into objects and settled into the corresponding paths of the COS buckets. As shown in the following figure:
Data Flow Rules
Data flow rules are the regulations for data flow between GooseFSx directories and COS storage buckets.
Creating a data flow rule establishes the data flow relationship between a GooseFSx directory and a COS bucket, including settings such as the data flow mode and data flow bandwidth. GooseFSx can create multiple data flow rules and flow data with multiple COS buckets simultaneously. For details, see Usage Limits of Data Flow.
The data flow modes are described in the table below.
| Data flow mode | Description | Typical usage |
| --- | --- | --- |
| Manual mode | Data pre-warming or data settlement is triggered manually and executed automatically by the system. | Hosts access GooseFSx at high performance, avoiding the slowdown caused by synchronous access to COS. Note: data must be pre-warmed into GooseFSx in advance. |
Before creating data flow rules, please ensure the data flow bandwidth is not 0; conversely, if you need to reduce the data flow bandwidth to 0, please delete all data flow rules first.
Deleting a data flow rule removes the data flow relationship between a GooseFSx directory and a COS bucket, together with the data flow tasks (data pre-warming and data settlement) that correspond to the deleted rule. After the rule is deleted, data flow (data pre-warming and data settlement) can no longer be executed on that GooseFSx directory.
Data Flow Bandwidth
Data flow bandwidth is the bandwidth for data flow between the GooseFSx data accelerator and COS. The higher the data flow bandwidth, the faster GooseFSx flows data; the lower it is, the slower. To match business requirements, you can expand or reduce the data flow bandwidth at any time.
Example
The step size of data flow bandwidth is 600 MB/s. After a user purchases N steps, the data flow bandwidth increases by N × 600 MB/s; after N steps are released, it decreases by N × 600 MB/s.
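For example, applying the step rule above: purchasing 2 steps raises the data flow bandwidth by 2 × 600 MB/s = 1,200 MB/s, and releasing those 2 steps lowers it by the same 1,200 MB/s.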
The maximum data flow bandwidth is approximately equal to the bandwidth of the GooseFSx instance. Because data flow transfers data between GooseFSx and COS buckets, the data flow bandwidth cannot exceed the maximum bandwidth of the GooseFSx instance or the maximum bandwidth of the COS bucket; and since a single GooseFSx instance can flow data with multiple COS buckets at the same time, the maximum data flow bandwidth is approximately equal to the bandwidth of the GooseFSx instance.
Status of data flow bandwidth: to be scaled out, scaling out, scaling in, running.
| Status | Description | Impact |
| --- | --- | --- |
| To be scaled out | The data flow bandwidth is 0. Scale it out first; otherwise, data cannot flow. | Data flow rules cannot be created, and data cannot flow. |
| Scaling out / Scaling in | The data flow bandwidth is being adjusted. Another scale-out or scale-in cannot be initiated until the adjustment completes. | Data flow rules can be created, and data can flow. For details, contact a Storage Architect. |
| Running | Working normally. The data flow bandwidth can be scaled out or scaled in. | Data flow rules can be created, and data can flow. |
Note:
Data flow bandwidth is billed, and you pay for what you use. For details, see the Purchase Guide. Fees may also be incurred while data flows. For details, see Data Flow Costs.
Data Flow Costs
Data flow incurs GooseFSx data flow bandwidth fees. If global acceleration is enabled to access cross-region COS buckets, global acceleration traffic fees are incurred and charged by the COS product. Reading from COS generates COS request fees, which are also charged by the COS product.
| Scenario | Fee item | Description |
| --- | --- | --- |
| Flow data with COS buckets in the same region | GooseFSx data flow bandwidth fee | Charged by the GooseFSx product. |
| | COS request fee | Charged by the COS product. |
| Flow data with COS buckets across regions | GooseFSx data flow bandwidth fee | Charged by the GooseFSx product. |
| | COS global acceleration traffic fee | Charged by the COS product. |
| | COS request fee | Charged by the COS product. |
Data Pre-Warming
Data pre-warming is the process of pre-warming data from a COS bucket into a GooseFSx directory. It automatically, fully, and incrementally pre-warms the specified data (an entire directory, a subdirectory, or a file list) into GooseFSx.
Prerequisite for data pre-warming: the corresponding data flow rule has been created. For details, see Data Flow Rules. Data pre-warming maintains a one-to-one correspondence between GooseFSx data and COS bucket data and keeps directory and file permissions unchanged. For example, the COS bucket object H1/big/test.dat is pre-warmed as the file H1/big/test.dat in the GooseFSx directory.
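To make the one-to-one path mapping concrete, the following Python sketch shows how an object key and a GooseFSx file path correspond; the mount directory /goosefsx is a hypothetical example, and the helpers are illustrative, not GooseFSx APIs.

```python
from pathlib import PurePosixPath

# Hypothetical GooseFSx directory bound to a COS bucket by a data flow rule.
GOOSEFSX_DIR = PurePosixPath("/goosefsx")

def to_goosefsx_path(object_key: str) -> PurePosixPath:
    """Pre-warming direction: a COS object becomes a file with the same relative path."""
    return GOOSEFSX_DIR / object_key

def to_cos_key(file_path: str) -> str:
    """Settlement direction: a GooseFSx file becomes an object with the same relative path."""
    return str(PurePosixPath(file_path).relative_to(GOOSEFSX_DIR))

print(to_goosefsx_path("H1/big/test.dat"))      # /goosefsx/H1/big/test.dat
print(to_cos_key("/goosefsx/H1/big/test.dat"))  # H1/big/test.dat
```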
Data pre-warming tasks include one-time tasks and periodic tasks.
One-time task, initiated by the user and executed immediately by the system.
Periodic task, configured by the user with a periodic policy and automatically executed periodically by the system. For details, see Data Flow Periodic Policy.
Data Settlement
Data settlement is the process of settling data from a GooseFSx directory into a COS bucket. It automatically, fully, and incrementally settles the specified data (an entire directory, a subdirectory, or a file list) into COS.
Prerequisite for data settlement: the corresponding data flow rule has been created. For details, see Data Flow Rules. Data settlement maintains a one-to-one correspondence between GooseFSx data and COS bucket data and keeps directory and file permissions unchanged. For example, the file H1/big/test.dat in the GooseFSx directory is settled as the COS bucket object H1/big/test.dat, and directories are settled as special objects.
Data settlement is divided into: data settlement, and data settlement with deletion of local data.
| Settlement type | Description | Typical usage |
| --- | --- | --- |
| Data settlement | Settles GooseFSx data to COS. Both GooseFSx and COS keep a copy of the data. | When creating the data flow rule, select manual mode as the data flow mode. Settle computation results to COS for long-term storage or Internet distribution. |
| Data settlement with deletion of local data | Settles GooseFSx data to COS, then deletes it from GooseFSx. GooseFSx no longer holds the data; COS keeps a copy. | When creating the data flow rule, select manual mode as the data flow mode. Settle computation results to COS for long-term storage or Internet distribution, then delete the GooseFSx data to release GooseFSx space. |
Data settlement tasks include one-time tasks and periodic tasks.
One-time task, initiated by the user and executed immediately by the system.
Periodic task, configured by the user with a periodic policy and automatically executed periodically by the system. For details, see Data Flow Periodic Policy.
Data Flow Periodic Policy
A data flow periodic policy (periodic policy for short) is used to periodically trigger data flow tasks. For example, an hourly settlement policy triggers a settlement task every hour, settling the specified data from GooseFSx into the corresponding COS bucket. Periodic policies suit scenarios such as regularly synchronizing data to COS buckets and promptly pre-warming data from COS buckets.
GooseFSx supports configuring several types of periodic policies, listed below (a scheduling sketch follows the list):
Per-hour periodic policy: Triggers a data flow task at the specified minute every hour.
Daily periodic policy: Triggers a data flow task at the specified hour every day.
Weekly periodic policy: Triggers a data flow task on the specified day every week.
Monthly periodic policy: Triggers a data flow task on the specified day every month.
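To illustrate the trigger semantics of these policy types, here is a small, hypothetical Python scheduling sketch (not a GooseFSx API) that computes the next trigger time for hourly, daily, and weekly policies; a monthly policy follows the same pattern with a day-of-month field.

```python
from datetime import datetime, timedelta

def next_hourly(now: datetime, minute: int) -> datetime:
    """Per-hour policy: trigger at the specified minute of every hour."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    return candidate if candidate > now else candidate + timedelta(hours=1)

def next_daily(now: datetime, hour: int) -> datetime:
    """Daily policy: trigger at the specified hour of every day."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    return candidate if candidate > now else candidate + timedelta(days=1)

def next_weekly(now: datetime, weekday: int) -> datetime:
    """Weekly policy: trigger on the specified day of every week (0 = Monday)."""
    days_ahead = (weekday - now.weekday()) % 7 or 7
    return (now + timedelta(days=days_ahead)).replace(hour=0, minute=0, second=0, microsecond=0)

if __name__ == "__main__":
    now = datetime.now()
    print("next hourly run :", next_hourly(now, minute=30))
    print("next daily run  :", next_daily(now, hour=2))
    print("next weekly run :", next_weekly(now, weekday=0))
```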
Cloud Disk Multi-Mounting
GooseFSx supports cloud disk multi-mounting, which mounts a single cloud disk on multiple GooseFSx storage nodes at the same time. Multi-mounting tolerates simultaneous failures of multiple nodes without interrupting the business or losing data, greatly improving product availability (from 99.9% to 99.9999999%).
The working mechanism of multi-mounting cloud disks is as follows:
Under normal circumstances, the first node reads and writes to this cloud disk, while other nodes do not. This is the same as the single-mount working mechanism of the cloud disk.
In case of an exception, when the first node fails, reads and writes to the cloud disk automatically switch to the second node. The switching process is as follows (see the sketch after this list):
Automatic switching: the host is unaware of the switch, and host requests issued during the switch are delivered to the second node through a retry mechanism.
Instant switching: because multiple nodes share the cloud disk, no data synchronization is required during the switch, and the switch completes within seconds.
If the second node also fails, reads and writes automatically switch to the third node. Multiple simultaneous node faults are tolerated, achieving extremely high availability.
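The retry-and-failover behavior can be pictured with the following hypothetical Python sketch; the node list and read function are placeholders, since the actual switching happens inside GooseFSx and is invisible to hosts.

```python
import random

# Hypothetical storage nodes that all have the same cloud disk multi-mounted.
NODES = ["node-1", "node-2", "node-3"]

def read_block(node: str, block_id: int) -> bytes:
    """Placeholder for a read served by one storage node; may fail if the node is down."""
    if random.random() < 0.3:  # simulate a node fault
        raise ConnectionError(f"{node} is unavailable")
    return f"block {block_id} served by {node}".encode()

def read_with_failover(block_id: int) -> bytes:
    """Try nodes in order; because the disk is shared, no data resync is needed on switch."""
    last_error = None
    for node in NODES:
        try:
            return read_block(node, block_id)
        except ConnectionError as err:
            last_error = err  # retry transparently on the next node
    raise RuntimeError("all multi-mount nodes failed") from last_error

if __name__ == "__main__":
    print(read_with_failover(42).decode())
```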