The storage requirements for multi-task parallel training of large models are substantial and multifaceted, involving several key components: model parameters, optimizer states, gradients, intermediate activations, and training data. Here's a breakdown of each requirement and an example to illustrate the scale:
1. Model Parameters
- Large models, such as GPT or BERT variants, can have billions of parameters. For instance, a 175-billion-parameter model (like GPT-3) requires approximately 700GB of storage just for the parameters (assuming 4 bytes per float32 parameter).
- Storage Impact: Model weights must be stored in memory (RAM/GPU VRAM) or on high-speed storage (e.g., NVMe SSDs) for quick access during training.
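As a quick sanity check of the 700GB figure, parameter storage is just the parameter count times the bytes per element for the chosen precision. The helper below is a minimal illustrative sketch (the function name and the 175B count are examples, not part of any library):

```python
# Back-of-envelope parameter storage for common precisions (illustrative helper).
BYTES_PER_DTYPE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def param_storage_gb(num_params: int, dtype: str = "fp32") -> float:
    """Raw storage needed for the model weights, in gigabytes."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9

if __name__ == "__main__":
    n_params = 175_000_000_000  # GPT-3-scale parameter count (illustrative)
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: {param_storage_gb(n_params, dtype):,.0f} GB")
    # fp32: ~700 GB, fp16: ~350 GB
```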
2. Optimizer States
- Optimizers like Adam or LAMB maintain additional states (e.g., momentum, variance) for each parameter. Adam keeps two such states per parameter, so it adds roughly twice the parameter storage (another ~1.4TB for the same 175B model in fp32).
- Storage Impact: Optimizer states can double or even triple the overall footprint, so they must be managed efficiently.
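To see this overhead concretely, the short PyTorch sketch below (a toy model, not the 175B case) inspects the per-parameter state Adam keeps after one step; each parameter picks up exp_avg and exp_avg_sq tensors of the same shape, i.e. roughly two extra copies:

```python
import torch
import torch.nn as nn

# Toy model standing in for a large transformer; sizes are illustrative.
model = nn.Linear(1024, 1024)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
opt.step()  # Adam allocates its per-parameter state lazily on the first step

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(
    t.numel() * t.element_size()
    for s in opt.state.values()
    for t in s.values()
    if torch.is_tensor(t)
)
print(f"parameters: {param_bytes/1e6:.1f} MB, Adam state: {state_bytes/1e6:.1f} MB")
# exp_avg and exp_avg_sq together come out to roughly 2x the parameter storage.
```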
3. Gradients
- Gradients are computed for each parameter during backpropagation and typically match the size of the model parameters (e.g., 700GB for a 175B model).
- Storage Impact: Gradients are temporary but require high-bandwidth storage to avoid bottlenecks during backpropagation.
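Putting the three pieces together, a common back-of-envelope rule for plain fp32 training with Adam is roughly 16 bytes per parameter: 4 for the weight, 4 for its gradient, and 8 for the two optimizer states. The sketch below encodes that rule (the function and defaults are illustrative; mixed-precision setups shift the per-parameter byte count somewhat):

```python
# Rough bytes-per-parameter accounting for fp32 training with Adam (illustrative).
def training_state_tb(num_params: int,
                      param_bytes: int = 4,          # fp32 weights
                      grad_bytes: int = 4,           # fp32 gradients
                      optim_bytes: int = 8) -> float:  # Adam: exp_avg + exp_avg_sq
    """Combined weight + gradient + optimizer storage, in terabytes."""
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 1e12

print(f"175B model: {training_state_tb(175_000_000_000):.1f} TB")   # ~2.8 TB
print(f"1T model:   {training_state_tb(1_000_000_000_000):.1f} TB") # ~16 TB
```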
4. Intermediate Activations
- Activations from each layer during forward passes must be stored for backward computation. For large batch sizes or deep models, this can consume terabytes of memory.
- Storage Impact: Techniques like activation checkpointing (trading compute for memory) are often used to reduce this overhead.
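The snippet below is a minimal sketch of activation checkpointing using PyTorch's torch.utils.checkpoint: the wrapped blocks discard their intermediate activations during the forward pass and recompute them during backward, trading compute for memory (the block structure is illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Stack of blocks whose activations are recomputed in backward instead of stored."""
    def __init__(self, width: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended mode in recent PyTorch versions
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(4, 1024, requires_grad=True))
out.sum().backward()  # activations inside each block are recomputed here
```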
5. Training Data
- Multi-task training involves diverse datasets (e.g., text, images, code), and storing them efficiently is critical; combined corpora in the terabyte range spanning several modalities are common.
- Storage Impact: High-throughput storage (e.g., distributed file systems like HDFS or object storage) is needed to feed data to multiple tasks in parallel.
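As a sketch of how several task streams might be fed in parallel, the generator below round-robins over per-task iterators; the task names and loaders are hypothetical placeholders for whatever DataLoader or streaming source backs each task:

```python
from itertools import cycle
from typing import Any, Dict, Iterable, Iterator, Tuple

def interleave_tasks(task_loaders: Dict[str, Iterable]) -> Iterator[Tuple[str, Any]]:
    """Yield (task_name, batch) pairs, cycling through tasks in round-robin order.

    Each value in task_loaders can be any iterable of batches, e.g. a
    torch.utils.data.DataLoader reading from local NVMe or a remote store.
    """
    iterators = {name: iter(loader) for name, loader in task_loaders.items()}
    for name in cycle(list(iterators)):
        try:
            yield name, next(iterators[name])
        except StopIteration:
            return  # stop once the shortest task stream is exhausted

# Hypothetical usage with toy in-memory "datasets":
loaders = {"nlp": range(3), "vision": range(3), "code": range(3)}
for task, batch in interleave_tasks(loaders):
    print(task, batch)
```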
6. Redundancy and Checkpoints
- Regular checkpoints (saving model state) are essential for recovery. Storing multiple checkpoints (e.g., every few hours) can add terabytes of storage.
- Example: checkpointing a 700GB model every 2 hours over a week-long run produces dozens of snapshots; even retaining only the 10 most recent requires ~7TB for checkpoints alone.
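A minimal sketch of the "keep only the N most recent snapshots" policy from the example above; torch.save and the retention count are the only assumptions, and the paths and names are illustrative:

```python
from pathlib import Path
import torch

def save_rotating_checkpoint(state: dict, ckpt_dir: str, step: int, keep_last: int = 10) -> None:
    """Write a new checkpoint and delete all but the keep_last most recent ones."""
    out_dir = Path(ckpt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    torch.save(state, out_dir / f"step_{step:09d}.pt")

    snapshots = sorted(out_dir.glob("step_*.pt"))
    for old in snapshots[:-keep_last]:
        old.unlink()  # drop the oldest snapshots beyond the retention window

# Illustrative usage inside a training loop (names are placeholders):
# save_rotating_checkpoint(
#     {"model": model.state_dict(), "optimizer": opt.state_dict(), "step": step},
#     ckpt_dir="/mnt/cfs/checkpoints", step=step,
# )
```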
Solutions and Recommendations
To meet these demands:
- High-Performance Storage: Use NVMe SSDs or distributed storage systems (e.g., Tencent Cloud’s Cloud Block Storage (CBS) or Cloud File Storage (CFS)) for low-latency access.
- Distributed File Systems: Leverage systems like Tencent Cloud’s CHDFS (Cloud Hadoop Distributed File System) for scalable, parallel data access.
- Data Compression: Techniques like quantization, lower-precision checkpoints, or sparse storage can reduce the model and data footprint (see the sketch after this list).
- Checkpoint Optimization: Store only the checkpoints you need and use incremental backups; Tencent Cloud's Cloud Object Storage (COS) is well suited for cost-effective, durable storage.
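As one simple example of the compression point above, casting checkpoint tensors to half precision before writing roughly halves their size; the helper below is an illustrative sketch, and the resulting files could then be archived in COS or another object store (lossier schemes such as int8 quantization shrink the footprint further):

```python
import torch

def compress_state_dict_fp16(state_dict: dict) -> dict:
    """Return a copy of a state_dict with float32 tensors cast to float16.

    Halves checkpoint size at the cost of precision; useful for archival copies,
    but keep full-precision checkpoints for exact training resumption.
    """
    return {
        k: v.half() if torch.is_tensor(v) and v.dtype == torch.float32 else v
        for k, v in state_dict.items()
    }

# Illustrative usage before uploading to object storage such as COS:
# torch.save(compress_state_dict_fp16(model.state_dict()), "model_fp16.pt")
```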
Example Scenario: Training a 1-trillion-parameter model across 100 GPUs with multi-task objectives (e.g., NLP + vision) might require:
- Model + Optimizer + Gradients: roughly 16-20TB of combined state (about 16-20 bytes per parameter depending on precision), and considerably more if that state is replicated across data-parallel ranks.
- Training Data: 10TB+ of diverse datasets.
- Checkpoints: 50TB+ over a month-long training cycle.
Tencent Cloud’s Elastic High-Performance Computing (EHPC) and GPU instances paired with CFS/COS can provide the necessary scalability and storage bandwidth for such workloads.