Optimizing large-model storage with an integrated data-lakehouse architecture means combining the strengths of data lakes (cost-effective, scalable storage for raw data) and data warehouses (structured, high-performance analytics). This hybrid approach enables efficient storage, retrieval, and processing of large models (e.g., machine learning and deep learning models) alongside their associated datasets.
Key Strategies for Optimization:
1. Unified Storage Layer (Data Lake):
- Store raw, unstructured, or semi-structured data (e.g., model training datasets, checkpoints, embeddings) in a cost-efficient object storage system (e.g., Tencent Cloud COS).
- Use open formats like Parquet, ORC, or Avro for efficient compression and schema evolution.
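A common convention in the lake layer is Hive-style partitioned object keys, which let query engines prune files by partition. The sketch below builds such a key; the bucket layout, dataset name, and `dt=` partition scheme are illustrative assumptions, not a COS requirement.

```python
from datetime import date

def lake_object_key(dataset: str, run_date: date, part: int,
                    fmt: str = "parquet") -> str:
    """Build a Hive-style partitioned object key for a data-lake bucket.

    Partitioning by date lets engines skip irrelevant files, and the
    open-format suffix (parquet/orc/avro) keeps the data engine-agnostic.
    """
    return (
        f"datasets/{dataset}/"
        f"dt={run_date.isoformat()}/"
        f"part-{part:05d}.{fmt}"
    )

key = lake_object_key("imagenet-train", date(2024, 6, 1), 3)
# → "datasets/imagenet-train/dt=2024-06-01/part-00003.parquet"
```

The same key scheme works unchanged on any object store, which is part of what keeps the lake layer cost-effective and portable.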
2. Structured Analytics Layer (Data Warehouse):
- Extract and transform relevant model metadata (e.g., hyperparameters, performance metrics, versioning) into a structured format and store it in a high-performance data warehouse (e.g., Tencent Cloud TCHouse-D).
- Enable fast querying for model lineage, experiment tracking, and optimization insights.
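To make the warehouse side concrete, here is a minimal sketch of the kind of structured metadata query involved, using in-memory SQLite as a stand-in for a real warehouse such as TCHouse-D; the `model_runs` schema and column names are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_runs (
        model TEXT, version INTEGER, accuracy REAL, lr REAL
    )
""")
conn.executemany(
    "INSERT INTO model_runs VALUES (?, ?, ?, ?)",
    [("resnet50", 1, 0.74, 1e-3),
     ("resnet50", 2, 0.78, 5e-4),
     ("resnet50", 3, 0.77, 1e-4)],
)

# Fast structured query: the best-performing version of a model.
best_version, best_acc = conn.execute(
    "SELECT version, accuracy FROM model_runs "
    "WHERE model = ? ORDER BY accuracy DESC LIMIT 1",
    ("resnet50",),
).fetchone()
```

In a production warehouse the same SQL shape answers experiment-tracking questions (best run, regression between versions) at scale.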
3. Metadata Management:
- Implement a centralized metadata catalog (e.g., using Tencent Cloud EMR with Hive Metastore) to track model versions, dependencies, and data lineage.
- This ensures reproducibility and efficient retrieval of model artifacts.
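The catalog's job can be sketched with a toy in-process registry; the class and field names below are hypothetical, and a real deployment would use a metastore such as Hive Metastore rather than this stand-in.

```python
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    name: str
    version: int
    artifact_uri: str                                     # e.g. a COS object key
    parent_datasets: list = field(default_factory=list)   # data lineage

class Catalog:
    """Toy in-process stand-in for a metadata catalog / metastore."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: ModelEntry) -> None:
        self._entries[(entry.name, entry.version)] = entry

    def latest(self, name: str) -> ModelEntry:
        # Reproducible retrieval: resolve a model name to a pinned version.
        version = max(v for (n, v) in self._entries if n == name)
        return self._entries[(name, version)]

catalog = Catalog()
catalog.register(ModelEntry("bert-base", 1, "cos://models/bert/v1",
                            ["wiki-2023"]))
catalog.register(ModelEntry("bert-base", 2, "cos://models/bert/v2",
                            ["wiki-2023", "books-2024"]))
entry = catalog.latest("bert-base")
```

The lineage list is what makes experiments reproducible: given any model version, the catalog tells you exactly which datasets produced it.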
4. Lakehouse Integration (e.g., Delta Lake, Apache Hudi, Iceberg):
- Use open table formats to bridge the gap between data lakes and warehouses, enabling ACID transactions, schema enforcement, and time-travel queries.
- Example: Store model checkpoints in a Delta Lake table on COS, allowing seamless updates and rollbacks.
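The update-and-rollback semantics can be illustrated with a toy versioned table; this is a conceptual sketch of the time-travel model that Delta Lake, Iceberg, and Hudi provide on object storage, not their actual APIs.

```python
class VersionedTable:
    """Toy illustration of snapshot-based time travel; not a real table format."""

    def __init__(self):
        self._snapshots = [{}]        # version 0: empty table

    def commit(self, updates: dict) -> int:
        # Each commit produces an immutable new snapshot (ACID-style):
        # readers of older versions are never affected by writers.
        new = {**self._snapshots[-1], **updates}
        self._snapshots.append(new)
        return len(self._snapshots) - 1

    def read(self, version: int = -1) -> dict:
        # Time travel: read any historical version; default is latest.
        return dict(self._snapshots[version])

table = VersionedTable()
v1 = table.commit({"checkpoint": "step-1000"})
v2 = table.commit({"checkpoint": "step-2000"})
rollback_view = table.read(version=v1)    # roll back to the v1 checkpoint
```

Real table formats achieve the same effect with transaction logs or manifest files over immutable Parquet data files, so a "rollback" is just a pointer to an older snapshot.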
5. Compute-Optimized Access:
- Offload heavy model training/inference workloads to distributed compute engines (e.g., Tencent Cloud TI-ONE) while keeping storage in the lakehouse.
- Use columnar formats (Parquet) for analytical queries and row-based formats for transactional model metadata.
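The columnar-vs-row distinction is easy to see with plain data structures; the sketch below mimics (in pure Python) why a Parquet-style layout makes analytical scans cheap.

```python
# Row layout: one record per object — good for point lookups and
# transactional updates to a single model's metadata.
rows = [
    {"model": "resnet50", "version": 1, "accuracy": 0.74},
    {"model": "resnet50", "version": 2, "accuracy": 0.78},
]

# Columnar layout (how Parquet stores data on disk): one array per
# column, so an analytical query touches only the columns it needs.
columns = {
    "model":    [r["model"] for r in rows],
    "version":  [r["version"] for r in rows],
    "accuracy": [r["accuracy"] for r in rows],
}

# Aggregate over a single column without deserializing whole rows.
mean_accuracy = sum(columns["accuracy"]) / len(columns["accuracy"])
```

At scale the difference is I/O: a columnar scan of one metric column reads a small fraction of the bytes a row-oriented scan would.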
Example Workflow:
1. Training Phase:
- Raw datasets (images, text) are stored in Tencent Cloud COS (Data Lake).
- Model checkpoints and logs are saved in a Delta Lake table (structured layer).
- Metadata (hyperparameters, accuracy) is logged in TCHouse-D (Data Warehouse) for analysis.
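The fan-out across the three layers can be sketched as a single function; the bucket path, file suffix, and field names are illustrative assumptions, and the returned dict stands in for three real writes (object store, lakehouse table, warehouse row).

```python
def log_training_step(step: int, accuracy: float, lr: float) -> dict:
    """Fan one training step's outputs out to the three storage layers.

    In practice: the checkpoint goes to object storage, the table row
    to the lakehouse layer, and the metrics row to the warehouse.
    """
    return {
        "lake_checkpoint": f"cos://bucket/checkpoints/step-{step:07d}.pt",
        "lakehouse_row":   {"step": step, "checkpoint_step": step},
        "warehouse_row":   {"step": step, "accuracy": accuracy, "lr": lr},
    }

record = log_training_step(1000, 0.74, 1e-3)
```

Keeping the three writes in one logging path is what guarantees that every checkpoint in the lake has matching metadata in the warehouse.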
2. Inference & Serving:
- Optimized model artifacts are retrieved from COS, while the data warehouse serves real-time metadata queries.
- Metadata catalog ensures the right model version is deployed.
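Version resolution at serving time amounts to an alias lookup against the catalog; the environment names, version numbers, and URIs below are hypothetical.

```python
# Illustrative alias table: deployment environments point at pinned
# model versions recorded in the metadata catalog.
aliases = {"staging": 3, "production": 2}
artifacts = {
    2: "cos://models/ranker/v2/model.onnx",
    3: "cos://models/ranker/v3/model.onnx",
}

def resolve(env: str) -> str:
    """Return the artifact URI the serving layer should load for env."""
    return artifacts[aliases[env]]

uri = resolve("production")
```

Because serving resolves an alias rather than hard-coding a path, promoting a new version is a one-row catalog update with no redeploy of the serving layer.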
3. Governance & Optimization:
- Automated data tiering moves infrequently accessed model versions to cheaper storage (e.g., COS Infrequent Access).
- Query acceleration (e.g., Tencent Cloud TCHouse-D’s vector indexing) speeds up similarity searches for embeddings.
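A tiering policy reduces to a rule over access recency; the sketch below uses COS-style storage-class names, but the 30-day threshold and the policy itself are illustrative, not a COS API.

```python
from datetime import datetime, timedelta

def storage_class(last_access: datetime, now: datetime,
                  cold_after: timedelta = timedelta(days=30)) -> str:
    """Pick an object-storage tier from access recency (illustrative rule)."""
    return "STANDARD_IA" if now - last_access >= cold_after else "STANDARD"

now = datetime(2024, 6, 1)
tier_hot = storage_class(datetime(2024, 5, 25), now)   # recently used model
tier_cold = storage_class(datetime(2024, 3, 1), now)   # stale model version
```

In production this rule would run as a scheduled job (or be replaced by the object store's built-in lifecycle policies) over the catalog's last-access timestamps.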
By combining a data lake (scalable storage), a data warehouse (structured analytics), and open table formats (the lakehouse layer), large-model storage becomes more efficient, cost-effective, and query-friendly. Tencent Cloud services such as COS, TCHouse-D, and EMR provide a robust foundation for this architecture.