Analysis of IO Commit Latency Spike in Ceph Cluster
Contents
Symptom
- Environment: After abnormal node reboot in Ceph cluster
- Affected Metric: Prometheus rate value of
ceph_osd_op_w_latency
- Behavior:
- Pre-reboot: Values showed normal increment (peak ~1M)
- Post-reboot:
- Started recording from 0
- Spiked to 4.2B after 3 minutes (close to 2³²)
Investigation Process
Phase 1: Initial Hypotheses
Hypothesis | Verification Method | Conclusion |
---|---|---|
Prometheus calculation | Reviewed rate() function | Confirmed proper counter reset |
Ceph stat initialization | Inspected OSD.cc init code | Verified proper atomic init |
Phase 2: Deep Analysis
-
Key Findings:
- 3-minute zero period before spike
- 4.2B spike caused axis display anomaly
-
Mechanism Analysis:
- Uses system realtime clock for time delta
- No negative value protection
Phase 3: Verification
-
Environmental Evidence:
- NTP logs showed 4ms time reversal
-
Reproduction Method:
- Applied IO load via fio
- Manually rewound system time by 1ms
- Successfully reproduced the spike
Root Cause
-
Direct Cause:
- System time reversal caused negative time delta
- Unsigned integer conversion
-
Underlying Cause:
- Incomplete adoption of monotonic clock
- Related PR #39273 https://github.com/ceph/ceph/pull/39273 didn’t replace all realtime usages
Solution
- Long-term Improvement:
- Promote complete migration to monotonic clock
- Or add time delta validation logic
- Reported issue on official Slack https://tracker.ceph.com/issues/71580, currently being addressed by maintainers https://github.com/ceph/ceph/pull/64003
Author ceph-deep-dive
LastMod 2025-06-10