Analysis of IO Commit Latency Spike in Ceph Cluster

2025-06-10

ceph

183 words 1 min read

times read

Contents

Symptom

Environment: After abnormal node reboot in Ceph cluster
Affected Metric: Prometheus rate value of ceph_osd_op_w_latency
Behavior:
- Pre-reboot: Values showed normal increment (peak ~1M)
- Post-reboot:
  - Started recording from 0
  - Spiked to 4.2B after 3 minutes (close to 2³²)

Investigation Process

Phase 1: Initial Hypotheses

Hypothesis	Verification Method	Conclusion
Prometheus calculation	Reviewed rate() function	Confirmed proper counter reset
Ceph stat initialization	Inspected OSD.cc init code	Verified proper atomic init

Phase 2: Deep Analysis

Key Findings:
- 3-minute zero period before spike
- 4.2B spike caused axis display anomaly
Mechanism Analysis:
- Uses system realtime clock for time delta
- No negative value protection

Phase 3: Verification

Environmental Evidence:
- NTP logs showed 4ms time reversal
Reproduction Method:
- Applied IO load via fio
- Manually rewound system time by 1ms
- Successfully reproduced the spike

Root Cause

Direct Cause:
- System time reversal caused negative time delta
- Unsigned integer conversion
Underlying Cause:
- Incomplete adoption of monotonic clock
- Related PR #39273 https://github.com/ceph/ceph/pull/39273 didn’t replace all realtime usages

Solution

Long-term Improvement:
- Promote complete migration to monotonic clock
- Or add time delta validation logic
Reported issue on official Slack https://tracker.ceph.com/issues/71580, currently being addressed by maintainers https://github.com/ceph/ceph/pull/64003

Author ceph-deep-dive

LastMod 2025-06-10