The storage requirements for multi-task parallel training of large models are substantial and multifaceted, involving several key components: model parameters, optimizer states, gradients, intermediate activations, and training data. Here's a breakdown of each requirement and an example to illustrate the scale:
1. Model Parameters
- Large models, such as GPT or BERT variants, can have billions of parameters. For instance, a 175-billion-parameter model (like GPT-3) requires approximately 700GB of storage just for the parameters (assuming 4 bytes per float32 parameter).
- Storage Impact: Model weights must be stored in memory (RAM/GPU VRAM) or on high-speed storage (e.g., NVMe SSDs) for quick access during training.
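As a quick sanity check of the 700GB figure, parameter storage is just the parameter count times the bytes per element for the chosen precision. The helper below is a minimal illustrative sketch (the function name and the 175B count are examples, not part of any library):

```python
# Back-of-envelope parameter storage for common precisions (illustrative helper).
BYTES_PER_DTYPE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def param_storage_gb(num_params: int, dtype: str = "fp32") -> float:
    """Raw storage needed for the model weights, in gigabytes."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9

if __name__ == "__main__":
    n_params = 175_000_000_000  # GPT-3-scale parameter count (illustrative)
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: {param_storage_gb(n_params, dtype):,.0f} GB")
    # fp32: ~700 GB, fp16: ~350 GB
```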
2. Optimizer States
- Optimizers like Adam or LAMB maintain additional states (e.g., momentum, variance) for each parameter. Adam keeps two such states per parameter, so it adds roughly twice the parameter storage (another ~1.4TB for the same 175B model in fp32).
- Storage Impact: Optimizer states can double or even triple the overall footprint, so they must be managed efficiently.
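To see this overhead concretely, the short PyTorch sketch below (a toy model, not the 175B case) inspects the per-parameter state Adam keeps after one step; each parameter picks up exp_avg and exp_avg_sq tensors of the same shape, i.e. roughly two extra copies:

```python
import torch
import torch.nn as nn

# Toy model standing in for a large transformer; sizes are illustrative.
model = nn.Linear(1024, 1024)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
opt.step()  # Adam allocates its per-parameter state lazily on the first step

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(
    t.numel() * t.element_size()
    for s in opt.state.values()
    for t in s.values()
    if torch.is_tensor(t)
)
print(f"parameters: {param_bytes/1e6:.1f} MB, Adam state: {state_bytes/1e6:.1f} MB")
# exp_avg and exp_avg_sq together come out to roughly 2x the parameter storage.
```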
3. Gradients
- Gradients are computed for each parameter during backpropagation and typically match the size of the model parameters (e.g., 700GB for a 175B model).
- Storage Impact: Gradients are temporary but require high-bandwidth storage to avoid bottlenecks during backpropagation.
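Putting the three pieces together, a common back-of-envelope rule for plain fp32 training with Adam is roughly 16 bytes per parameter: 4 for the weight, 4 for its gradient, and 8 for the two optimizer states. The sketch below encodes that rule (the function and defaults are illustrative; mixed-precision setups shift the per-parameter byte count somewhat):

```python
# Rough bytes-per-parameter accounting for fp32 training with Adam (illustrative).
def training_state_tb(num_params: int,
                      param_bytes: int = 4,          # fp32 weights
                      grad_bytes: int = 4,           # fp32 gradients
                      optim_bytes: int = 8) -> float:  # Adam: exp_avg + exp_avg_sq
    """Combined weight + gradient + optimizer storage, in terabytes."""
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 1e12

print(f"175B model: {training_state_tb(175_000_000_000):.1f} TB")   # ~2.8 TB
print(f"1T model:   {training_state_tb(1_000_000_000_000):.1f} TB") # ~16 TB
```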
4. Intermediate Activations
- Activations from each layer during forward passes must be stored for backward computation. For large batch sizes or deep models, this can consume terabytes of memory.
- Storage Impact: Techniques like activation checkpointing (trading compute for memory) are often used to reduce this overhead.
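The snippet below is a minimal sketch of activation checkpointing using PyTorch's torch.utils.checkpoint: the wrapped blocks discard their intermediate activations during the forward pass and recompute them during backward, trading compute for memory (the block structure is illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Stack of blocks whose activations are recomputed in backward instead of stored."""
    def __init__(self, width: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended mode in recent PyTorch versions
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(4, 1024, requires_grad=True))
out.sum().backward()  # activations inside each block are recomputed here
```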
5. Training Data
- Multi-task training involves diverse datasets (e.g., text, images, code), and storing them efficiently is critical; combined corpora in the terabyte range spanning several modalities are common.
- Storage Impact: High-throughput storage (e.g., distributed file systems like HDFS or object storage) is needed to feed data to multiple tasks in parallel.
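As a sketch of how several task streams might be fed in parallel, the generator below round-robins over per-task iterators; the task names and loaders are hypothetical placeholders for whatever DataLoader or streaming source backs each task:

```python
from itertools import cycle
from typing import Any, Dict, Iterable, Iterator, Tuple

def interleave_tasks(task_loaders: Dict[str, Iterable]) -> Iterator[Tuple[str, Any]]:
    """Yield (task_name, batch) pairs, cycling through tasks in round-robin order.

    Each value in task_loaders can be any iterable of batches, e.g. a
    torch.utils.data.DataLoader reading from local NVMe or a remote store.
    """
    iterators = {name: iter(loader) for name, loader in task_loaders.items()}
    for name in cycle(list(iterators)):
        try:
            yield name, next(iterators[name])
        except StopIteration:
            return  # stop once the shortest task stream is exhausted

# Hypothetical usage with toy in-memory "datasets":
loaders = {"nlp": range(3), "vision": range(3), "code": range(3)}
for task, batch in interleave_tasks(loaders):
    print(task, batch)
```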
6. Redundancy and Checkpoints
- Regular checkpoints (saving model state) are essential for recovery. Storing multiple checkpoints (e.g., every few hours) can add terabytes of storage.
- Example: checkpointing a 700GB model every 2 hours over a week-long run produces dozens of snapshots; even retaining only the 10 most recent requires ~7TB for checkpoints alone.
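A minimal sketch of the "keep only the N most recent snapshots" policy from the example above; torch.save and the retention count are the only assumptions, and the paths and names are illustrative:

```python
from pathlib import Path
import torch

def save_rotating_checkpoint(state: dict, ckpt_dir: str, step: int, keep_last: int = 10) -> None:
    """Write a new checkpoint and delete all but the keep_last most recent ones."""
    out_dir = Path(ckpt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    torch.save(state, out_dir / f"step_{step:09d}.pt")

    snapshots = sorted(out_dir.glob("step_*.pt"))
    for old in snapshots[:-keep_last]:
        old.unlink()  # drop the oldest snapshots beyond the retention window

# Illustrative usage inside a training loop (names are placeholders):
# save_rotating_checkpoint(
#     {"model": model.state_dict(), "optimizer": opt.state_dict(), "step": step},
#     ckpt_dir="/mnt/cfs/checkpoints", step=step,
# )
```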
Solutions and Recommendations
To meet these demands:
- High-Performance Storage: Use NVMe SSDs or distributed storage systems (e.g., Tencent Cloud’s Cloud Block Storage (CBS) or Cloud File Storage (CFS)) for low-latency access.
- Distributed File Systems: Leverage systems like Tencent Cloud’s CHDFS (Cloud Hadoop Distributed File System) for scalable, parallel data access.
- Data Compression: Techniques like quantization, lower-precision checkpoints, or sparse storage can reduce the model and data footprint (see the sketch after this list).
- Checkpoint Optimization: Store only the checkpoints you need and use incremental backups; Tencent Cloud's Cloud Object Storage (COS) is well suited for cost-effective, durable storage.
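As one simple example of the compression point above, casting checkpoint tensors to half precision before writing roughly halves their size; the helper below is an illustrative sketch, and the resulting files could then be archived in COS or another object store (lossier schemes such as int8 quantization shrink the footprint further):

```python
import torch

def compress_state_dict_fp16(state_dict: dict) -> dict:
    """Return a copy of a state_dict with float32 tensors cast to float16.

    Halves checkpoint size at the cost of precision; useful for archival copies,
    but keep full-precision checkpoints for exact training resumption.
    """
    return {
        k: v.half() if torch.is_tensor(v) and v.dtype == torch.float32 else v
        for k, v in state_dict.items()
    }

# Illustrative usage before uploading to object storage such as COS:
# torch.save(compress_state_dict_fp16(model.state_dict()), "model_fp16.pt")
```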
Example Scenario: Training a 1-trillion-parameter model across 100 GPUs with multi-task objectives (e.g., NLP + vision) might require:
- Model + Optimizer + Gradients: roughly 16-20TB of combined state (about 16-20 bytes per parameter depending on precision), and considerably more if that state is replicated across data-parallel ranks.
- Training Data: 10TB+ of diverse datasets.
- Checkpoints: 50TB+ over a month-long training cycle.
Tencent Cloud’s Elastic High-Performance Computing (EHPC) and GPU instances paired with CFS/COS can provide the necessary scalability and storage bandwidth for such workloads.