MDS 系统架构概览
Ceph MDS是CephFS (Ceph File System) 的核心组件,负责处理所有文件系统元数据操作。MDS的设计采用分布式、可扩展的架构,支持多活MDS和动态负载均衡。
MDS在Ceph生态中的定位
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
|
graph TB
Client[CephFS Client] --> MDS[MDS Cluster]
MDS --> RADOS[RADOS Storage Layer]
MDS --> Mon[Monitor Cluster]
subgraph "MDS In Ceph"
subgraph "Client Layer"
Client
Fuse[FUSE Client]
Kernel[Kernel Client]
end
subgraph "Metadata Layer"
MDS
MDSStandby[Standby MDS]
MDSActive[Active MDS]
end
subgraph "Storage Layer"
RADOS
OSD[OSD Cluster]
Pool[Metadata Pool]
end
subgraph "Management Layer"
Mon
Mgr[Manager]
end
end
Client -.-> Fuse
Client -.-> Kernel
MDS --> MDSStandby
MDS --> MDSActive
RADOS --> OSD
RADOS --> Pool
Mon --> Mgr
|
MDS 模块架构

MDS 核心子模块架构分析
MDSMap 管理模块
MDSMap是MDS集群状态管理的核心模块,负责维护MDS集群的拓扑信息和状态。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
graph LR
subgraph "MDSMap Module"
MDSMap[MDSMap Manager]
ActiveMDS[Active MDS List]
StandbyMDS[Standby MDS List]
FSMap[Filesystem Map]
MDSMap --> ActiveMDS
MDSMap --> StandbyMDS
MDSMap --> FSMap
end
Monitor[Monitor Cluster] --> MDSMap
MDSMap --> ClientView[Client View]
MDSMap --> MDSInstances[MDS Instances]
|
核心功能:
- MDS集群成员管理
- Rank分配和故障转移
- 文件系统到MDS的映射
- 状态同步和版本控制
关键配置参数:
1
2
3
4
5
6
|
# 设置最大活动MDS数量
ceph fs set <fsname> max_mds <count>
# 查看MDS状态
ceph mds stat
ceph fs status <fsname>
|
Session管理模块
Session管理模块负责处理客户端连接和会话状态维护。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
graph TB
subgraph "Session Management"
SessionMgr[Session Manager]
ClientSessions[Client Sessions]
SessionCache[Session Cache]
Capabilities[Capability Grants]
SessionMgr --> ClientSessions
SessionMgr --> SessionCache
SessionMgr --> Capabilities
subgraph "Session State"
Opening[Opening]
Open[Open]
Stale[Stale]
Killing[Killing]
end
ClientSessions --> Opening
ClientSessions --> Open
ClientSessions --> Stale
ClientSessions --> Killing
end
Clients[CephFS Clients] --> SessionMgr
Monitor[Monitor] --> SessionMgr
MDCache[MDS Cache] --> Capabilities
|
核心功能:
- 客户端连接认证
- 会话生命周期管理
- Capability分发和回收
- 会话超时处理
MDCache模块
MDCache是MDS的核心缓存模块,负责元数据的缓存、一致性维护和分布式协调。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
|
graph TB
subgraph "MDCache Architecture"
MDCache[MD Cache Manager]
subgraph "Cache Structures"
Inodes[Inode Cache]
Dentries[Dentry Cache]
Dirfrags[Dirfrag Cache]
end
subgraph "Cache Operations"
Fetch[Fetch Operations]
Discover[Discovery]
Migration[Cache Migration]
Trim[Cache Trimming]
end
subgraph "Coherency"
Locks[Distributed Locks]
Caps[Capabilities]
Leases[Client Leases]
end
MDCache --> Inodes
MDCache --> Dentries
MDCache --> Dirfrags
MDCache --> Fetch
MDCache --> Discover
MDCache --> Migration
MDCache --> Trim
MDCache --> Locks
MDCache --> Caps
MDCache --> Leases
end
RADOS[RADOS Storage] --> Fetch
Clients[CephFS Clients] --> Caps
OtherMDS[Other MDS] --> Migration
|
核心功能:
- 分布式元数据缓存
- 缓存一致性协议
- 元数据预取和预测性缓存
- 内存管理和LRU淘汰
MDS Balancer模块
负载均衡模块确保元数据负载在多个活动MDS之间均匀分布。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
|
graph LR
subgraph "MDS Balancer"
Balancer[Load Balancer]
subgraph "Metrics Collection"
CPUMetrics[CPU Usage]
MemMetrics[Memory Usage]
IOMetrics[I/O Metrics]
ClientLoad[Client Load]
end
subgraph "Balancing Strategies"
HotSpot[Hot Spot Detection]
Migration[Directory Migration]
Fragmentation[Dir Fragmentation]
end
Balancer --> CPUMetrics
Balancer --> MemMetrics
Balancer --> IOMetrics
Balancer --> ClientLoad
Balancer --> HotSpot
Balancer --> Migration
Balancer --> Fragmentation
end
MDSRanks[MDS Ranks] --> Balancer
MDCache --> Migration
Monitor --> Balancer
|
核心配置:
1
2
3
4
5
6
|
# 启用MDS负载均衡
ceph config set mds mds_bal_mode 2
# 设置负载均衡阈值
ceph config set mds mds_bal_need_min 0.2
ceph config set mds mds_bal_need_max 1.25
|
日志和恢复模块
MDS Journal模块负责元数据操作的持久化日志记录和故障恢复。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
|
graph TB
subgraph "Journal & Recovery"
Journal[MDS Journal]
subgraph "Journal Operations"
LogEvents[Log Events]
Segments[Journal Segments]
Trimming[Log Trimming]
end
subgraph "Recovery Process"
Replay[Journal Replay]
Resolution[Conflict Resolution]
Cleanup[Recovery Cleanup]
end
subgraph "Storage Backend"
MDSPool[Metadata Pool]
Objects[Journal Objects]
RADOS_Journal[RADOS Backend]
end
Journal --> LogEvents
Journal --> Segments
Journal --> Trimming
Journal --> Replay
Journal --> Resolution
Journal --> Cleanup
Journal --> MDSPool
Journal --> Objects
Journal --> RADOS_Journal
end
MDSOperations[MDS Operations] --> LogEvents
StandbyMDS[Standby MDS] --> Replay
Monitor --> Resolution
|
MDS状态机和生命周期
MDS状态转换图
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
stateDiagram-v2
[*] --> boot
boot --> standby
standby --> standby_replay
standby_replay --> resolve
resolve --> reconnect
reconnect --> rejoin
rejoin --> active
active --> stopping
stopping --> [*]
active --> resolve : failover
standby --> resolve : takeover
active --> standby : rank_stop
note right of active
正常工作状态
处理客户端请求
end note
note right of standby
待机状态
等待分配rank
end note
note right of resolve
解决元数据冲突
处理分布式状态
end note
|
MDS启动和初始化流程
1
2
3
4
5
6
7
8
9
10
11
12
13
|
sequenceDiagram
participant Monitor as Monitor
participant MDS as MDS Daemon
participant RADOS as RADOS
participant Standby as Standby MDS
MDS->>Monitor: 注册为standby
Monitor->>MDS: 分配rank
MDS->>RADOS: 读取journal
MDS->>MDS: 回放journal
MDS->>Monitor: 报告active状态
Monitor->>Standby: 通知状态变化
MDS->>MDS: 开始处理客户端请求
|
MDS与上下游组件关系
MDS与Monitor的交互
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
graph LR
subgraph "MDS-Monitor Interaction"
MDS[MDS Daemon]
Monitor[Monitor Cluster]
subgraph "Information Exchange"
MDSMap_Update[MDSMap Updates]
Health_Report[Health Reports]
Beacon[MDS Beacon]
Commands[Admin Commands]
end
MDS --> MDSMap_Update
MDS --> Health_Report
MDS --> Beacon
Monitor --> Commands
MDSMap_Update --> Monitor
Health_Report --> Monitor
Beacon --> Monitor
Commands --> MDS
end
|
关键监控命令:
1
2
3
4
5
6
7
8
9
10
11
|
# 查看MDS健康状态
ceph health detail
# 检查慢速元数据IO
ceph mds perf dump
# 查看MDS告警
ceph health mute MDS_SLOW_REQUEST
# MDS性能统计
ceph daemon mds.<name> perf dump
|
MDS与RADOS的交互
MDS通过RADOS存储所有持久化元数据:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
graph TB
subgraph "MDS-RADOS Interaction"
MDS[MDS Daemon]
subgraph "RADOS Operations"
MetadataIO[Metadata I/O]
JournalIO[Journal I/O]
BackingStore[Backing Store Ops]
end
subgraph "RADOS Pools"
MetadataPool[Metadata Pool]
DataPool[Data Pool]
end
MDS --> MetadataIO
MDS --> JournalIO
MDS --> BackingStore
MetadataIO --> MetadataPool
JournalIO --> MetadataPool
BackingStore --> DataPool
end
LIBRADOS[librados] --> MetadataPool
LIBRADOS --> DataPool
|
MDS与客户端的交互
1
2
3
4
5
6
7
8
9
10
11
12
13
|
sequenceDiagram
participant Client as CephFS Client
participant MDS as MDS Active
participant RADOS as RADOS
Client->>MDS: 建立session
MDS->>Client: 分发capabilities
Client->>MDS: 元数据请求(open/mkdir/stat)
MDS->>RADOS: 读取/更新元数据
RADOS->>MDS: 返回结果
MDS->>Client: 返回元数据
Client->>MDS: 释放capabilities
MDS->>Client: 确认释放
|
MDS常见故障诊断和恢复
1
2
3
4
5
6
7
8
9
10
11
|
# 检查MDS状态
ceph fs status
ceph mds stat
# MDS性能分析
ceph daemon mds.<name> perf dump
ceph daemon mds.<name> dump cache
# 检查客户端连接
ceph daemon mds.<name> session ls
|
MDS监控和告警
关键性能指标
1
2
3
4
5
6
7
8
|
# MDS性能监控
ceph daemonperf mds
# 关键指标:
# - mds.inodes: 缓存的inode数量
# - mds.reply_latency: 响应延迟
# - mds.request_rate: 请求速率
# - mds.sessions: 活动会话数
|
告警配置
重要告警项目:
- MDS_SLOW_REQUEST: 慢请求告警
- MDS_SLOW_METADATA_IO: 慢元数据IO
- MDS_INSUFFICIENT_STANDBY: 待机MDS不足
- MDS_HEALTH_READ_ONLY: MDS只读状态