Analysis of IO Commit Latency Spike in Ceph Cluster

Symptom

Environment: After abnormal node reboot in Ceph cluster
Affected Metric: Prometheus rate value of ceph_osd_op_w_latency
Behavior:
- Pre-reboot: Values showed normal increment (peak ~1M)
- Post-reboot: Started recording from 0, then spiked to 4.2B after 3 minutes (close to 2³²)

Investigation Process

Phase 1: Initial Hypotheses

| Hypothesis               | Verification Method        | Conclusion                     |
|--------------------------|----------------------------|--------------------------------|
| Prometheus calculation   | Reviewed rate() function   | Confirmed proper counter reset |
| Ceph stat initialization | Inspected OSD.cc init code | Verified proper atomic init    |

Phase 2: Deep Analysis

Key Findings: