Ceph Storage: The Storage Powerhouse in the Era of AI/ML Workloads
Abstract
AI/ML training, inference, and related processes place unprecedented demands on storage performance.
This article, based on the SNIA presentation “Ceph Storage in a World of AI/ML Workloads”, analyzes the challenges of AI storage, the advantages of Ceph, and key methods to improve efficiency in real deployments.
AI/ML Workload Lifecycle
A typical AI/ML lifecycle includes:
Raw Data → Training Data → Model → Results → Retraining

During training, network bandwidth, data preprocessing capability, and model size all affect overall performance.
In practice, the recommended storage throughput is 5 GB/s, with high-performance reference systems reaching 20 GB/s.
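A rough way to sanity-check these figures is to work backwards from per-GPU consumption. A minimal sketch, assuming a hypothetical GPU count and per-GPU streaming rate (neither number is from the presentation):

```python
# Back-of-the-envelope sizing: the aggregate read bandwidth a training
# cluster demands from storage. Both inputs are hypothetical examples.

def required_storage_throughput(num_gpus: int, gb_per_sec_per_gpu: float) -> float:
    """Aggregate GB/s the storage tier must sustain to keep GPUs fed."""
    return num_gpus * gb_per_sec_per_gpu

# e.g. 16 GPUs each streaming ~0.3 GB/s of preprocessed training data
demand = required_storage_throughput(num_gpus=16, gb_per_sec_per_gpu=0.3)
print(f"required: {demand:.1f} GB/s")  # 4.8 GB/s, near the 5 GB/s guideline
```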
Checkpointing Challenges
Checkpoint saving is a critical step in AI model training, and checkpoint size grows rapidly with model scale (per-checkpoint size and typical save time below):
- Granite 13b: 170 GiB, ~5 seconds
- Llama3 70b: 913 GiB, ~28 seconds
- GPT-3 175b: 2.28 TiB, ~70 seconds
- Llama3 405b: 5.28 TiB, ~162 seconds
Storage system performance directly determines checkpoint save speed, impacting overall training efficiency and cost.
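Dividing each checkpoint size by its save time shows the sustained write throughput these figures imply. A quick sketch using only the numbers from the list above:

```python
# Effective write throughput implied by the checkpoint figures above
# (size in GiB, save time in seconds, taken from the list).
checkpoints = {
    "Granite 13b": (170, 5),
    "Llama3 70b": (913, 28),
    "GPT-3 175b": (2.28 * 1024, 70),
    "Llama3 405b": (5.28 * 1024, 162),
}

for model, (size_gib, seconds) in checkpoints.items():
    print(f"{model}: {size_gib / seconds:.1f} GiB/s")
# Every entry works out to roughly 33 GiB/s of sustained write bandwidth,
# which is the bar the storage system has to clear at these save times.
```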
Storage Requirements for Inference
In recommendation systems or event-driven inference scenarios (e.g., Facebook data centers):
- Deep recommendation models consume 80%+ of inference compute
- They also consume 50%+ of training compute

These scenarios require storage systems with high concurrent I/O and low-latency responses.
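That access pattern, many small concurrent reads where tail latency matters, can be probed with a short script. A minimal sketch using boto3 against Ceph RGW's S3 API; the endpoint, credentials, bucket, and key names are hypothetical placeholders:

```python
# Sketch: measuring concurrent small-object GET latency against a
# Ceph RGW S3 endpoint. All connection details below are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",  # hypothetical RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def timed_get(key: str) -> float:
    """Fetch one object and return the wall-clock latency in seconds."""
    start = time.perf_counter()
    s3.get_object(Bucket="features", Key=key)["Body"].read()
    return time.perf_counter() - start

keys = [f"embedding/{i}" for i in range(256)]  # hypothetical feature objects
with ThreadPoolExecutor(max_workers=64) as pool:
    latencies = sorted(pool.map(timed_get, keys))

print(f"p99 GET latency: {latencies[int(0.99 * len(latencies))] * 1000:.1f} ms")
```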
Why Choose Ceph?
Ceph offers significant advantages in AI/ML storage scenarios:
- Multi-protocol support: Block, File, and Object (e.g., S3, NFS, SMB); a minimal access sketch follows this list
- Hardware agnostic: No vendor lock-in, flexible CPU, memory, network, and media choices
- High scalability: From small deployments to hundreds of nodes, read throughput can grow from 20 GB/s to 160 GB/s
- Mature open-source ecosystem: Proven in production, active community, broad vendor support
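As a small illustration of the multi-protocol point above, the sketch below touches a cluster through its native RADOS layer using the python-rados binding. The pool and object names are hypothetical, and a local /etc/ceph/ceph.conf with valid keyring credentials is assumed:

```python
# Sketch: native RADOS object access, the layer that Block (RBD),
# File (CephFS), and Object (RGW) are all built on.
# Pool and object names are hypothetical placeholders.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("ml-datasets")  # hypothetical pool
ioctx.write_full("sample-0001", b"raw training record")
print(ioctx.read("sample-0001"))

ioctx.close()
cluster.shutdown()
```

The same RADOS data plane is what RGW exposes over S3 (as in the earlier boto3 sketch), CephFS exposes as a POSIX filesystem, and RBD exposes as block devices.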
Strategies to Improve Ceph Storage Efficiency
- Compression & hardware acceleration
Ceph RGW and Bluestore support data compression.
For example, S3 object compression can boost write throughput by 250%+ and read throughput by 180%+. - Thoughtful architecture design
Planning compression strategies and hardware acceleration early can significantly reduce TCO (Total Cost of Ownership).
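Whether compression pays off depends on the data: in Ceph it is enabled server-side, for example via BlueStore's bluestore_compression_mode option or per-placement compression in RGW. A hedged way to estimate the benefit before enabling it is to measure the data's compressibility offline; the sample file path below is a placeholder:

```python
# Sketch: estimating how compressible a workload's data is before
# enabling server-side compression in BlueStore or RGW.
import zlib

def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Uncompressed size over compressed size; higher means more savings."""
    return len(payload) / len(zlib.compress(payload, level))

with open("sample-training-shard.json", "rb") as f:  # hypothetical sample file
    data = f.read()

print(f"estimated ratio: {compression_ratio(data):.2f}x")
# A ratio near 1.0 means compression mostly burns CPU for nothing;
# 2x or better suggests real capacity and bandwidth savings.
```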
Real-World Deployment Example
SNIA reference deployment:
- 4-node Ceph cluster
- Each node: 2×32-core CPUs, 512 GB RAM, 2×100 GbE network
- Storage: 24× TLC NVMe SSDs
- Acceleration: 4× GPUs
- Performance: 30 GB/s read, 4.66 GB/s write

This shows that, given high-performance hardware, Ceph can fully meet AI/ML training and inference storage demands.
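Dividing the aggregate figures down gives a feel for how the load spreads across the cluster. A quick sketch, assuming an even distribution over the 4 nodes and 96 drives:

```python
# Rough per-node / per-drive breakdown of the reference cluster's
# aggregate throughput, assuming an even spread.
nodes, drives_per_node = 4, 24
read_gb_s, write_gb_s = 30.0, 4.66  # GB/s, from the deployment above

print(f"per node:  {read_gb_s / nodes:.2f} GB/s read, "
      f"{write_gb_s / nodes:.2f} GB/s write")
print(f"per drive: {read_gb_s / (nodes * drives_per_node) * 1000:.0f} MB/s read")
# ~7.5 GB/s of reads per node fits within 2x100 GbE (~25 GB/s raw),
# and ~313 MB/s per drive is well below what TLC NVMe can deliver.
```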
Community & Events
- Ceph Days: 2025 in Bangalore, San Jose, London, and more
- Cephalocon: 2024 at CERN, 2025 in planning
- SNIA Education Library: Rich resources on Ceph and AI/ML technologies
Key Takeaways
- Maximize GPU utilization with storage systems that match required bandwidth and throughput
- Network planning is the foundation for scaling to meet AI/ML throughput needs
- Open-source flexibility enables Ceph to quickly integrate acceleration and optimization features
References
- SNIA: Ceph Storage in a World of AI/ML Workloads
  https://snia.org/sites/default/files/CSI/Ceph%20Storage%20in%20a%20World%20of%20AI_ML%20Workloads.pdf
- Ceph Official Documentation: https://docs.ceph.com
- SNIA Education Library: https://snia.org/education
- Facebook AI Research: Deep Learning Recommendation Models (DLRM)
- Cephalocon: https://cephalocon.org
Author: ceph-deep-dive