Classical OSD Overall Architecture Diagram

```mermaid
graph TB
    subgraph "OSD Process Architecture"
        OSD[OSD Main Class]
        OSDService[OSDService<br/>Core Service]
        ShardedOpWQ[ShardedOpWQ<br/>Sharded Operation Queue]
        Messenger[Messenger<br/>Message System]
    end

    subgraph "PG Management Subsystem"
        PGMap[pg_map<br/>PG Mapping Table]
        PG[PG Class<br/>Placement Group]
        PGBackend[PGBackend<br/>Backend Implementation]
        ReplicatedBackend[ReplicatedBackend]
        ECBackend[ECBackend]
    end

    subgraph "Object Storage Subsystem"
        ObjectStore[ObjectStore<br/>Storage Abstraction Layer]
        FileStore[FileStore<br/>Filesystem Storage]
        BlueStore[BlueStore<br/>Raw Device Storage]
        ObjectContext[ObjectContext<br/>Object Context]
    end

    subgraph "Recovery Subsystem"
        RecoveryState[RecoveryState<br/>Recovery State Machine]
        PeeringState[PeeringState<br/>Peering State]
        BackfillState[BackfillState<br/>Backfill State]
        RecoveryWQ[RecoveryWQ<br/>Recovery Work Queue]
    end

    subgraph "Monitoring & Statistics"
        PGStats[PGStats<br/>PG Statistics]
        OSDStats[OSDStats<br/>OSD Statistics]
        PerfCounters[PerfCounters<br/>Performance Counters]
        Logger[Logger<br/>Logging System]
    end

    OSD --> OSDService
    OSD --> ShardedOpWQ
    OSD --> Messenger
    OSD --> PGMap
    PGMap --> PG
    PG --> PGBackend
    PGBackend --> ReplicatedBackend
    PGBackend --> ECBackend
    PG --> ObjectStore
    ObjectStore --> FileStore
    ObjectStore --> BlueStore
    PG --> ObjectContext
    PG --> RecoveryState
    RecoveryState --> PeeringState
    RecoveryState --> BackfillState
    OSD --> RecoveryWQ
    PG --> PGStats
    OSD --> OSDStats
    OSD --> PerfCounters
    OSD --> Logger
```

OSD Core Class Structure Details

```mermaid
classDiagram
    class OSD {
        -int whoami
        -Messenger* cluster_messenger
        -Messenger* client_messenger
        -MonClient* monc
        -MgrClient* mgrc
        -ObjectStore* store
        -OSDService service
        -map~spg_t,PG*~ pg_map
        -RWLock pg_map_lock
        -OSDMapRef osdmap
        -epoch_t up_epoch
        -ThreadPool op_tp
        -ShardedOpWQ op_sharded_wq
        -RecoveryWQ recovery_wq
        -SnapTrimWQ snap_trim_wq
        -ScrubWQ scrub_wq
        +handle_osd_op(MOSDOp* op)
        +handle_replica_op(MOSDSubOp* op)
        +handle_pg_create(MOSDPGCreate* m)
        +handle_osd_map(MOSDMap* m)
        +process_peering_events()
        +start_boot()
        +shutdown()
    }

    class OSDService {
        -OSD* osd
        -CephContext* cct
        -ObjectStore* store
        -LogClient* log_client
        -PGRecoveryStats recovery_stats
        -Throttle recovery_ops_throttle
        -Throttle recovery_bytes_throttle
        -ClassHandler* class_handler
        -map~hobject_t,ObjectContext*~ object_contexts
        -LRUExpireMap object_context_lru
        +get_object_context(hobject_t oid)
        +release_object_context(ObjectContext* obc)
        +queue_for_recovery(PG* pg)
        +queue_for_scrub(PG* pg)
    }

    class ShardedOpWQ {
        -vector~OpWQ*~ shards
        -atomic~uint32_t~ next_shard
        +queue(OpRequestRef op)
        +dequeue(OpWQ* shard)
        +process_batch()
    }

    class OpWQ {
        -ThreadPool::TPHandle* handle
        -list~OpRequestRef~ ops
        -Mutex ops_lock
        +enqueue_front(OpRequestRef op)
        +enqueue_back(OpRequestRef op)
        +dequeue()
        +process()
    }

    OSD --> OSDService
    OSD --> ShardedOpWQ
    ShardedOpWQ --> OpWQ
```
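The ShardedOpWQ in the diagrams above is what lets one OSD serve many PGs concurrently without giving up per-PG ordering: each incoming op is hashed by its placement group onto a fixed shard, and each shard has its own lock, queue, and worker threads. Below is a minimal, self-contained sketch of that hash-by-PG queueing pattern; `OpRequest` and `ShardedQueue` here are simplified stand-ins for illustration, not the actual Ceph types.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

// Stand-in for an OpRequestRef: just the target PG id plus a label.
struct OpRequest {
  uint64_t pg_id;
  std::string payload;
};

// Hash-by-PG sharding: ops for the same PG always land on the same shard,
// so per-PG ordering is preserved while shards are drained independently.
class ShardedQueue {
 public:
  explicit ShardedQueue(std::size_t num_shards) : shards_(num_shards) {}

  void enqueue(OpRequest op) {
    Shard& s = shards_[op.pg_id % shards_.size()];
    std::lock_guard<std::mutex> lock(s.lock);
    s.ops.push_back(std::move(op));
  }

  // A worker thread bound to shard `idx` pulls the next op, if any.
  std::optional<OpRequest> dequeue(std::size_t idx) {
    Shard& s = shards_[idx];
    std::lock_guard<std::mutex> lock(s.lock);
    if (s.ops.empty()) return std::nullopt;
    OpRequest op = std::move(s.ops.front());
    s.ops.pop_front();
    return op;
  }

 private:
  struct Shard {
    std::mutex lock;
    std::deque<OpRequest> ops;
  };
  std::vector<Shard> shards_;  // fixed number of shards, each with its own lock
};

int main() {
  ShardedQueue q(4);
  q.enqueue({1, "write A"});  // pg 1 -> shard 1
  q.enqueue({5, "write B"});  // pg 5 -> shard 1 (5 % 4), queued behind A
  q.enqueue({2, "write C"});  // pg 2 -> shard 2
  while (auto op = q.dequeue(1)) {
    std::cout << "shard 1 -> pg " << op->pg_id << ": " << op->payload << "\n";
  }
  return 0;
}
```

In the real OSD the shard count and the number of threads per shard are configurable, but the ordering guarantee comes from the same idea: all ops for a given PG map to the same shard and are dequeued in FIFO order.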
PG Class Detailed Structure

```mermaid
classDiagram
    class PG {
        -spg_t pg_id
        -OSDService* osd
        -CephContext* cct
        -PGBackend* pgbackend
        -ObjectStore::CollectionHandle ch
        -RecoveryState recovery_state
        -PGLog pg_log
        -IndexedLog projected_log
        -eversion_t last_update
        -epoch_t last_epoch_started
        -set~pg_shard_t~ up
        -set~pg_shard_t~ acting
        -map~hobject_t,ObjectContext*~ object_contexts
        -Mutex pg_lock
        -Cond pg_cond
        -list~OpRequestRef~ waiting_for_peered
        -list~OpRequestRef~ waiting_for_active
        -map~eversion_t,list~OpRequestRef~~ waiting_for_ondisk
        +do_request(OpRequestRef op)
        +do_op(OpRequestRef op)
        +do_sub_op(OpRequestRef op)
        +execute_ctx(OpContext* ctx)
        +issue_repop(RepGather* repop)
        +eval_repop(RepGather* repop)
        +start_recovery_ops()
        +recover_object()
        +on_change(ObjectStore::Transaction* t)
        +activate()
        +clean_up_local()
    }

    class RecoveryState {
        -PG* pg
        -RecoveryMachine machine
        -boost::statechart::state_machine base
        +handle_event(const boost::statechart::event_base& evt)
        +process_peering_events()
        +advance_map()
        +need_up_thru()
    }

    class PGLog {
        -IndexedLog log
        -eversion_t tail
        -eversion_t head
        -list~pg_log_entry_t~ pending_log
        -set~eversion_t~ pending_dups
        +add(pg_log_entry_t& entry)
        +trim(eversion_t trim_to)
        +merge_log(ObjectStore::Transaction* t)
        +write_log_and_missing()
    }

    class PGBackend {
        -PG* parent
        -ObjectStore* store
        -CephContext* cct
        +submit_transaction()
        +objects_list_partial()
        +objects_list_range()
        +objects_get_attr()
        +objects_read_sync()
        +be_deep_scrub()
    }

    PG --> RecoveryState
    PG --> PGLog
    PG --> PGBackend
```

Read/Write IO Processing Detailed Flow

Write Operation Complete Flow

```mermaid
sequenceDiagram
    participant Client
    participant OSD
    participant PG
    participant OpWQ
    participant ObjectStore
    participant Journal
    participant Replica

    Client->>OSD: MOSDOp(write)
    OSD->>OSD: handle_osd_op()
    Note right of OSD: 1.
```
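Several PG members above exist only to gate requests: waiting_for_peered and waiting_for_active hold ops that arrive while the PG is still peering or not yet active, and they are requeued through do_request() once the recovery state machine advances. The following is a minimal sketch of that gating pattern under simplified assumptions; `MiniPG` and `Op` are illustrative stand-ins, not the real PG code.

```cpp
#include <iostream>
#include <list>
#include <string>
#include <utility>

// Stand-in for an OpRequestRef.
struct Op {
  std::string name;
};

class MiniPG {
 public:
  // do_request(): ops arriving before the PG is peered/active are parked
  // on a wait list instead of being executed.
  void do_request(Op op) {
    if (!peered_) {
      waiting_for_peered_.push_back(std::move(op));
      return;
    }
    if (!active_) {
      waiting_for_active_.push_back(std::move(op));
      return;
    }
    execute(op);
  }

  // State transitions requeue the parked ops in arrival order.
  void on_peered() { peered_ = true; drain(waiting_for_peered_); }
  void on_active() { active_ = true; drain(waiting_for_active_); }

 private:
  void execute(const Op& op) { std::cout << "executing " << op.name << "\n"; }

  void drain(std::list<Op>& wait_list) {
    std::list<Op> requeued;
    requeued.swap(wait_list);            // take the list before re-driving ops
    for (Op& op : requeued) do_request(std::move(op));
  }

  bool peered_ = false;
  bool active_ = false;
  std::list<Op> waiting_for_peered_;
  std::list<Op> waiting_for_active_;
};

int main() {
  MiniPG pg;
  pg.do_request({"write-1"});  // parked: the PG has not finished peering
  pg.on_peered();              // write-1 now waits for activation
  pg.do_request({"write-2"});  // parked behind write-1
  pg.on_active();              // both execute, in order
  return 0;
}
```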

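Further along the write path, issue_repop() and eval_repop() from the PG class capture the primary's replication round trip: the update is submitted locally and to every shard in the acting set, and the client is answered only after every commit has been acknowledged. The sketch below models that bookkeeping; `RepGatherSketch` and `MiniPrimary` are hypothetical stand-ins rather than Ceph's actual RepGather handling.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <utility>

using Shard = int;  // stands in for pg_shard_t (an OSD id in the acting set)

// Stand-in for Ceph's RepGather: which shards still owe a commit ack,
// and the callback that answers the client.
struct RepGatherSketch {
  std::uint64_t tid;
  std::set<Shard> waiting_for_commit;
  std::function<void()> on_all_committed;
};

class MiniPrimary {
 public:
  explicit MiniPrimary(std::set<Shard> acting) : acting_(std::move(acting)) {}

  // issue_repop(): the write goes to every shard in the acting set
  // (including the primary itself), and we start waiting for their acks.
  RepGatherSketch issue_repop(std::uint64_t tid, std::function<void()> reply) {
    std::cout << "tid " << tid << ": submitted to " << acting_.size()
              << " shards\n";
    return RepGatherSketch{tid, acting_, std::move(reply)};
  }

  // eval_repop(): record one commit ack; reply to the client exactly once,
  // when nothing is left outstanding.
  void eval_repop(RepGatherSketch& repop, Shard from) {
    repop.waiting_for_commit.erase(from);
    if (repop.waiting_for_commit.empty() && repop.on_all_committed) {
      repop.on_all_committed();
      repop.on_all_committed = nullptr;
    }
  }

 private:
  std::set<Shard> acting_;
};

int main() {
  MiniPrimary pg({0, 1, 2});  // primary osd.0 plus replicas osd.1 and osd.2
  auto repop = pg.issue_repop(42, [] { std::cout << "reply to client\n"; });
  pg.eval_repop(repop, 0);    // local commit on the primary
  pg.eval_repop(repop, 2);    // replica ack
  pg.eval_repop(repop, 1);    // last ack arrives -> client reply fires
  return 0;
}
```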

As OSDs are replaced and the cluster scales in and out, the distribution of PGs across OSDs becomes increasingly unbalanced. This leads to discrepancies in the actual usage of individual OSDs and reduces the overall utilization of the cluster. The Ceph balancer module addresses this by adjusting weights or by specifying explicit PG mappings via upmap to redistribute PGs evenly. This article analyzes the execution flow of the balancer in upmap mode, based on the Ceph Pacific release.