Ceph Cluster Inspection Script

🛠️ Script features: a comprehensive set of checks ✅ Cluster connectivity - verifies that the Ceph cluster is reachable ✅ Health status analysis - detailed breakdown of HEALTH_OK/WARN/ERR ✅ Monit
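As a rough sketch of the first two checks (assuming the `ceph` CLI and an admin keyring are available on the host; the helper function and output format below are illustrative, not the actual script):

```python
#!/usr/bin/env python3
"""Minimal sketch of the connectivity and health checks described above."""
import json
import subprocess
import sys

def ceph(*args):
    """Run a ceph CLI command and return its stdout (raises on failure)."""
    return subprocess.run(
        ("ceph", *args), check=True, capture_output=True, text=True, timeout=30
    ).stdout

def main():
    # 1. Cluster connectivity: `ceph -s` only succeeds if the monitors are reachable.
    try:
        status = json.loads(ceph("-s", "--format", "json"))
    except (subprocess.SubprocessError, OSError) as e:
        print(f"FAIL: cluster unreachable: {e}")
        sys.exit(1)
    print("OK: cluster reachable, fsid =", status.get("fsid"))

    # 2. Health status: distinguish HEALTH_OK / HEALTH_WARN / HEALTH_ERR.
    health = status.get("health", {}).get("status", "UNKNOWN")
    print("health:", health)
    if health != "HEALTH_OK":
        # Print the detailed checks behind the warning/error.
        print(ceph("health", "detail"))

if __name__ == "__main__":
    main()
```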

DAOS File System: Phoenix Rising After Optane's End

Preface: Dominating the IO500 Rankings

Having seen the DAOS project dominating the IO500 rankings, I’ve been keeping an eye on this project (though not diving deep into it).

Background

In previous articles, I briefly introduced the DAOS distributed storage project. However, with Intel terminating the Optane business in 2022, many people began to wonder: Can DAOS continue after losing its “core hardware support”? Where is its future?

Short-term Impact, but Not the End

The discontinuation of Optane did have a significant impact on DAOS, especially in metadata acceleration and persistence.

Execution Flow Analysis of the Ceph mgr-balancer Module

As OSDs are replaced and the cluster scales in and out, the distribution of PGs across OSDs becomes increasingly unbalanced. This leads to discrepancies in the actual usage of individual OSDs and reduces the overall utilization of the cluster. The Ceph balancer module addresses this by adjusting weights or by specifying explicit PG mappings via upmap to redistribute PGs evenly. This article analyzes the execution flow of the balancer's upmap mode, based on the Ceph Pacific release.
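As a concrete illustration of the imbalance the balancer works against, the rough Python sketch below (not part of the mgr module itself) counts how many PGs each OSD holds using `ceph pg dump`; the JSON field names follow the Pacific-era output and may differ on other releases.

```python
#!/usr/bin/env python3
"""Rough estimate of PG balance per OSD, similar in spirit to `ceph balancer eval`."""
import json
import statistics
import subprocess
from collections import Counter

# NOTE: the exact JSON layout of `ceph pg dump` varies slightly across releases;
# this assumes the Pacific-era wrapper with a top-level "pg_map" object.
dump = json.loads(
    subprocess.run(
        ["ceph", "pg", "dump", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
)
pg_stats = dump.get("pg_map", dump).get("pg_stats", [])

# Count how many PG replicas/shards each OSD in the "up" set currently holds.
pgs_per_osd = Counter(osd for pg in pg_stats for osd in pg["up"] if osd >= 0)

mean = statistics.mean(pgs_per_osd.values())
stdev = statistics.pstdev(pgs_per_osd.values())
print(f"OSDs: {len(pgs_per_osd)}, mean PGs/OSD: {mean:.1f}, stddev: {stdev:.1f}")

# The upmap balancer shrinks this spread by injecting explicit PG->OSD mapping
# exceptions (pg_upmap_items) instead of changing CRUSH weights.
for osd, n in pgs_per_osd.most_common(3):
    print(f"osd.{osd}: {n} PGs ({n - mean:+.1f} vs mean)")
```

In upmap mode the balancer narrows that spread with pg_upmap_items entries written into the OSDMap rather than by changing CRUSH weights.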

OSD

Classical OSD Overall Architecture Diagram

```mermaid
graph TB
    subgraph "OSD Process Architecture"
        OSD[OSD Main Class]
        OSDService[OSDService<br/>Core Service]
        ShardedOpWQ[ShardedOpWQ<br/>Sharded Operation Queue]
        Messenger[Messenger<br/>Message System]
    end

    subgraph "PG Management Subsystem"
        PGMap[pg_map<br/>PG Mapping Table]
        PG[PG Class<br/>Placement Group]
        PGBackend[PGBackend<br/>Backend Implementation]
        ReplicatedBackend[ReplicatedBackend]
        ECBackend[ECBackend]
    end

    subgraph "Object Storage Subsystem"
        ObjectStore[ObjectStore<br/>Storage Abstraction Layer]
        FileStore[FileStore<br/>Filesystem Storage]
        BlueStore[BlueStore<br/>Raw Device Storage]
        ObjectContext[ObjectContext<br/>Object Context]
    end

    subgraph "Recovery Subsystem"
        RecoveryState[RecoveryState<br/>Recovery State Machine]
        PeeringState[PeeringState<br/>Peering State]
        BackfillState[BackfillState<br/>Backfill State]
        RecoveryWQ[RecoveryWQ<br/>Recovery Work Queue]
    end

    subgraph "Monitoring & Statistics"
        PGStats[PGStats<br/>PG Statistics]
        OSDStats[OSDStats<br/>OSD Statistics]
        PerfCounters[PerfCounters<br/>Performance Counters]
        Logger[Logger<br/>Logging System]
    end

    OSD --> OSDService
    OSD --> ShardedOpWQ
    OSD --> Messenger
    OSD --> PGMap
    PGMap --> PG
    PG --> PGBackend
    PGBackend --> ReplicatedBackend
    PGBackend --> ECBackend
    PG --> ObjectStore
    ObjectStore --> FileStore
    ObjectStore --> BlueStore
    PG --> ObjectContext
    PG --> RecoveryState
    RecoveryState --> PeeringState
    RecoveryState --> BackfillState
    OSD --> RecoveryWQ
    PG --> PGStats
    OSD --> OSDStats
    OSD --> PerfCounters
    OSD --> Logger
```

OSD Core Class Structure Details

```mermaid
classDiagram
    class OSD {
        -int whoami
        -Messenger* cluster_messenger
        -Messenger* client_messenger
        -MonClient* monc
        -MgrClient* mgrc
        -ObjectStore* store
        -OSDService service
        -map~spg_t,PG*~ pg_map
        -RWLock pg_map_lock
        -OSDMapRef osdmap
        -epoch_t up_epoch
        -ThreadPool op_tp
        -ShardedOpWQ op_sharded_wq
        -RecoveryWQ recovery_wq
        -SnapTrimWQ snap_trim_wq
        -ScrubWQ scrub_wq
        +handle_osd_op(MOSDOp* op)
        +handle_replica_op(MOSDSubOp* op)
        +handle_pg_create(MOSDPGCreate* m)
        +handle_osd_map(MOSDMap* m)
        +process_peering_events()
        +start_boot()
        +shutdown()
    }

    class OSDService {
        -OSD* osd
        -CephContext* cct
        -ObjectStore* store
        -LogClient* log_client
        -PGRecoveryStats recovery_stats
        -Throttle recovery_ops_throttle
        -Throttle recovery_bytes_throttle
        -ClassHandler* class_handler
        -map~hobject_t,ObjectContext*~ object_contexts
        -LRUExpireMap object_context_lru
        +get_object_context(hobject_t oid)
        +release_object_context(ObjectContext* obc)
        +queue_for_recovery(PG* pg)
        +queue_for_scrub(PG* pg)
    }

    class ShardedOpWQ {
        -vector~OpWQ*~ shards
        -atomic~uint32_t~ next_shard
        +queue(OpRequestRef op)
        +dequeue(OpWQ* shard)
        +process_batch()
    }

    class OpWQ {
        -ThreadPool::TPHandle* handle
        -list~OpRequestRef~ ops
        -Mutex ops_lock
        +enqueue_front(OpRequestRef op)
        +enqueue_back(OpRequestRef op)
        +dequeue()
        +process()
    }

    OSD --> OSDService
    OSD --> ShardedOpWQ
    ShardedOpWQ --> OpWQ
```

PG Class Detailed Structure

```mermaid
classDiagram
    class PG {
        -spg_t pg_id
        -OSDService* osd
        -CephContext* cct
        -PGBackend* pgbackend
        -ObjectStore::CollectionHandle ch
        -RecoveryState recovery_state
        -PGLog pg_log
        -IndexedLog projected_log
        -eversion_t last_update
        -epoch_t last_epoch_started
        -set~pg_shard_t~ up
        -set~pg_shard_t~ acting
        -map~hobject_t,ObjectContext*~ object_contexts
        -Mutex pg_lock
        -Cond pg_cond
        -list~OpRequestRef~ waiting_for_peered
        -list~OpRequestRef~ waiting_for_active
        -map~eversion_t,list~OpRequestRef~~ waiting_for_ondisk
        +do_request(OpRequestRef op)
        +do_op(OpRequestRef op)
        +do_sub_op(OpRequestRef op)
        +execute_ctx(OpContext* ctx)
        +issue_repop(RepGather* repop)
        +eval_repop(RepGather* repop)
        +start_recovery_ops()
        +recover_object()
        +on_change(ObjectStore::Transaction* t)
        +activate()
        +clean_up_local()
    }

    class RecoveryState {
        -PG* pg
        -RecoveryMachine machine
        -boost::statechart::state_machine base
        +handle_event(const boost::statechart::event_base& evt)
        +process_peering_events()
        +advance_map()
        +need_up_thru()
    }

    class PGLog {
        -IndexedLog log
        -eversion_t tail
        -eversion_t head
        -list~pg_log_entry_t~ pending_log
        -set~eversion_t~ pending_dups
        +add(pg_log_entry_t& entry)
        +trim(eversion_t trim_to)
        +merge_log(ObjectStore::Transaction* t)
        +write_log_and_missing()
    }

    class PGBackend {
        -PG* parent
        -ObjectStore* store
        -CephContext* cct
        +submit_transaction()
        +objects_list_partial()
        +objects_list_range()
        +objects_get_attr()
        +objects_read_sync()
        +be_deep_scrub()
    }

    PG --> RecoveryState
    PG --> PGLog
    PG --> PGBackend
```

Read/Write IO Processing Detailed Flow

Write Operation Complete Flow

```mermaid
sequenceDiagram
    participant Client
    participant OSD
    participant PG
    participant OpWQ
    participant ObjectStore
    participant Journal
    participant Replica

    Client->>OSD: MOSDOp(write)
    OSD->>OSD: handle_osd_op()
    Note right of OSD: 1.
```
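One detail from the diagrams above that is easy to gloss over is the ShardedOpWQ: incoming ops are hashed by PG onto a fixed set of shards, so ordering is preserved per PG while different PGs proceed in parallel. The toy Python model below (not Ceph code, just a conceptual sketch) shows that dispatch rule:

```python
"""Toy model (not Ceph code) of the ShardedOpWQ idea from the diagrams above:
ops for the same PG always land on the same shard, so per-PG ordering is kept
while different PGs can be serviced by different worker threads."""
import queue
import threading

NUM_SHARDS = 5  # illustrative; Ceph sizes this via the osd_op_num_shards options

shards = [queue.Queue() for _ in range(NUM_SHARDS)]

def enqueue_op(pg_id: int, op: str) -> None:
    # Deterministic PG -> shard mapping (Ceph hashes spg_t; modulo is enough here).
    shards[pg_id % NUM_SHARDS].put((pg_id, op))

def shard_worker(shard):
    # Each shard is drained by its own thread; ops within a shard stay in order.
    while True:
        pg_id, op = shard.get()
        print(f"{threading.current_thread().name}: pg {pg_id} -> {op}")
        shard.task_done()

for i, s in enumerate(shards):
    threading.Thread(target=shard_worker, args=(s,), name=f"shard-{i}", daemon=True).start()

enqueue_op(1, "write obj_a")
enqueue_op(1, "write obj_b")   # same PG, same shard -> ordered after obj_a
enqueue_op(7, "read obj_c")    # different PG, may be handled concurrently

for s in shards:
    s.join()  # wait until all queued ops have been processed
```

In the real OSD the shard count is controlled by the osd_op_num_shards options, and each shard is served by its own worker threads in the op thread pool.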