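In-Depth Technical Analysis of the CephFS Caps Mechanism
🏗️ Core Architecture Overview
The CephFS capability (caps) mechanism is a distributed consistency protocol that governs how clients access file system objects. It combines distributed locking, cache coherence, and access control: the MDS grants each client a set of capability bits per inode, and those bits determine which operations the client may perform or cache locally without contacting the MDS.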
Architecture component diagram
```mermaid
graph TB
    subgraph "Client Layer"
        C1[ceph-fuse Client]
        C2[kclient Client]
        C3[libcephfs Client]
        CC[Client Cache]
        CI[Client Inode]
    end
    subgraph "MDS Cluster"
        MDS1[Active MDS<br/>Primary]
        MDS2[Standby MDS]
        MDS3[Standby-Replay MDS]
    end
    subgraph "Capability Management Layer"
        CM[Cap Manager<br/>Capability class]
        CL[Lock Manager<br/>SimpleLock]
        CR[Cap Revocation<br/>Locker class]
        CE[Cap Export/Import<br/>Migrator]
    end
    subgraph "Distributed Lock Types"
        AL[AUTH Lock<br/>file attribute lock]
        LL[LINK Lock<br/>directory link lock]
        XL[XATTR Lock<br/>extended attribute lock]
        FL[FILE Lock<br/>file data lock]
        IL[INODE Lock<br/>inode lock]
        DL[DENTRY Lock<br/>dentry lock]
    end
    subgraph "Storage Backend"
        POOL[Metadata Pool<br/>metadata storage]
        DPOOL[Data Pool<br/>data storage]
        JPOOL[MDS Journal<br/>log storage in the metadata pool]
        MON[Monitor Cluster<br/>cluster state]
    end
    C1 <--> |Cap Request/Grant| MDS1
    C2 <--> |Cap Request/Grant| MDS1
    C3 <--> |Cap Request/Grant| MDS1
    CC --> CI
    CI --> C1
    MDS1 --> CM
    CM --> CL
    CM --> CR
    CM --> CE
    CL --> AL
    CL --> LL
    CL --> XL
    CL --> FL
    CL --> IL
    CL --> DL
    MDS1 <--> POOL
    MDS1 <--> JPOOL
    MDS1 <--> MON
    C1 <--> DPOOL
    C2 <--> DPOOL
    C3 <--> DPOOL
```
🔐 Capability Bit Types in Detail
Permission bitmask definitions
```cpp
// src/include/ceph_fs.h - capability bit definitions
/* generic cap bits, reused by every lock type */
#define CEPH_CAP_GSHARED     1  /* shared read */
#define CEPH_CAP_GEXCL       2  /* exclusive read and update */
#define CEPH_CAP_GCACHE      4  /* (file) may cache reads */
#define CEPH_CAP_GRD         8  /* (file) may read data */
#define CEPH_CAP_GWR        16  /* (file) may write data */
#define CEPH_CAP_GBUFFER    32  /* (file) may buffer writes */
#define CEPH_CAP_GWREXTEND  64  /* (file) may extend the file past EOF */
#define CEPH_CAP_GLAZYIO   128  /* (file) may perform lazy IO */

/* per-lock shift: each lock type owns its own bit field, so that
 * AUTH, LINK, XATTR, and FILE caps never collide in one mask */
#define CEPH_CAP_SAUTH      2
#define CEPH_CAP_SLINK      4
#define CEPH_CAP_SXATTR     6
#define CEPH_CAP_SFILE      8

/* composed values */
#define CEPH_CAP_AUTH_SHARED   (CEPH_CAP_GSHARED   << CEPH_CAP_SAUTH)
#define CEPH_CAP_AUTH_EXCL     (CEPH_CAP_GEXCL     << CEPH_CAP_SAUTH)
#define CEPH_CAP_LINK_SHARED   (CEPH_CAP_GSHARED   << CEPH_CAP_SLINK)
#define CEPH_CAP_LINK_EXCL     (CEPH_CAP_GEXCL     << CEPH_CAP_SLINK)
#define CEPH_CAP_XATTR_SHARED  (CEPH_CAP_GSHARED   << CEPH_CAP_SXATTR)
#define CEPH_CAP_XATTR_EXCL    (CEPH_CAP_GEXCL     << CEPH_CAP_SXATTR)
#define CEPH_CAP_FILE_RD       (CEPH_CAP_GRD       << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_WR       (CEPH_CAP_GWR       << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_CACHE    (CEPH_CAP_GCACHE    << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_BUFFER   (CEPH_CAP_GBUFFER   << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_EXCL     (CEPH_CAP_GEXCL     << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_WREXTEND (CEPH_CAP_GWREXTEND << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_LAZYIO   (CEPH_CAP_GLAZYIO   << CEPH_CAP_SFILE)
```
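To make the bit layout concrete, the following is a minimal sketch of a decoder that renders a cap mask as a compact string such as `pAsFcr`. It assumes a per-lock shift layout (2-bit AUTH, LINK, and XATTR fields at shifts 2, 4, and 6, and an 8-bit FILE field at shift 8) plus a pin bit with value 1. The letter scheme is modeled on the cap strings that show up in MDS debug output; the names `PIN`, `generic_bits`, and `cap_string` are illustrative and are not taken from the Ceph tree.

```cpp
#include <string>

// Assumed layout: generic bits are shifted into one field per lock type.
const unsigned PIN = 1;
const unsigned GSHARED = 1, GEXCL = 2, GCACHE = 4, GRD = 8,
               GWR = 16, GBUFFER = 32, GWREXTEND = 64, GLAZYIO = 128;
const unsigned SAUTH = 2, SLINK = 4, SXATTR = 6, SFILE = 8;

// Render the generic bits of a single lock field as letters.
inline std::string generic_bits(unsigned g) {
    std::string s;
    if (g & GSHARED)   s += 's';
    if (g & GEXCL)     s += 'x';
    if (g & GCACHE)    s += 'c';
    if (g & GRD)       s += 'r';
    if (g & GWR)       s += 'w';
    if (g & GBUFFER)   s += 'b';
    if (g & GWREXTEND) s += 'a';
    if (g & GLAZYIO)   s += 'l';
    return s;
}

// Render a full cap mask: one tag letter per non-empty lock field.
inline std::string cap_string(unsigned caps) {
    struct Field { char tag; unsigned shift; unsigned mask; };
    const Field fields[] = {
        {'A', SAUTH, 0x3}, {'L', SLINK, 0x3}, {'X', SXATTR, 0x3}, {'F', SFILE, 0xff}};
    std::string s;
    if (caps & PIN) s += 'p';
    for (const Field& f : fields) {
        unsigned g = (caps >> f.shift) & f.mask;
        if (g) {
            s += f.tag;
            s += generic_bits(g);
        }
    }
    return s;
}
```

For example, a mask holding the pin bit, shared AUTH, and FILE read plus cache decodes to `pAsFcr`, which is a convenient form for log lines.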
Capability type map
```mermaid
graph LR
    subgraph "AUTH Cap - file attributes"
        A1[AUTH_SHARED<br/>read attributes]
        A2[AUTH_EXCL<br/>write attributes]
    end
    subgraph "LINK Cap - directory links"
        L1[LINK_SHARED<br/>read links]
        L2[LINK_EXCL<br/>write links]
    end
    subgraph "XATTR Cap - extended attributes"
        X1[XATTR_SHARED<br/>read extended attributes]
        X2[XATTR_EXCL<br/>write extended attributes]
    end
    subgraph "FILE Cap - file data"
        F1[FILE_RD<br/>read data]
        F2[FILE_WR<br/>write data]
        F3[FILE_CACHE<br/>cache data]
        F4[FILE_BUFFER<br/>buffer writes]
        F5[FILE_EXCL<br/>exclusive access]
        F6[FILE_LAZYIO<br/>lazy IO]
    end
    A1 --> |stat,getattr| FileOps
    A2 --> |chmod,chown| FileOps
    L1 --> |readdir| DirOps
    L2 --> |mkdir,rmdir| DirOps
    X1 --> |getxattr| XattrOps
    X2 --> |setxattr| XattrOps
    F1 --> |read| IOOps
    F2 --> |write| IOOps
    F3 --> |page cache| CacheOps
    F4 --> |write buffer| CacheOps
```
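The mapping in the diagram can be read as a lookup from an operation to the cap bits it needs. The sketch below encodes a few of those edges and checks whether a client could serve an operation from the caps it already holds. The `Op` enum, `required_caps`, and `can_serve_locally` are hypothetical names, and the numeric masks assume that each lock type owns a separate bit field; a real client checks more than one cap for most operations.

```cpp
// Illustrative cap masks (assumed layout: generic bits shifted per lock type).
const unsigned AUTH_SHARED  = 1u << 2, AUTH_EXCL  = 2u << 2;
const unsigned XATTR_SHARED = 1u << 6, XATTR_EXCL = 2u << 6;
const unsigned FILE_RD      = 8u << 8, FILE_WR    = 16u << 8;

enum class Op { Stat, Chmod, GetXattr, SetXattr, Read, Write };

// Cap bits each operation needs, following the edges of the diagram.
inline unsigned required_caps(Op op) {
    switch (op) {
    case Op::Stat:     return AUTH_SHARED;
    case Op::Chmod:    return AUTH_EXCL;
    case Op::GetXattr: return XATTR_SHARED;
    case Op::SetXattr: return XATTR_EXCL;
    case Op::Read:     return FILE_RD;
    case Op::Write:    return FILE_WR;
    }
    return 0;  // not reached for valid Op values
}

// True when every required bit is present in the issued mask.
inline bool can_serve_locally(unsigned issued, Op op) {
    unsigned need = required_caps(op);
    return (issued & need) == need;
}
```

A client holding `AUTH_SHARED | FILE_RD` passes the check for `Op::Read` but not for `Op::Write`; in the second case it would have to ask the MDS for more caps first.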
🔄 Lock State Machine and Transitions
SimpleLock state machine
```mermaid
stateDiagram-v2
    [*] --> LOCK_SYNC
    LOCK_SYNC --> LOCK_LOCK: need_write
    LOCK_SYNC --> LOCK_MIX: need_mixed
    LOCK_SYNC --> LOCK_TSYN: need_tsyn
    LOCK_LOCK --> LOCK_SYNC: no_writers
    LOCK_LOCK --> LOCK_XLOCK: need_xlock
    LOCK_LOCK --> LOCK_XLOCKDONE: xlock_finish
    LOCK_MIX --> LOCK_SYNC: no_readers_writers
    LOCK_MIX --> LOCK_LOCK: need_write_only
    LOCK_XLOCK --> LOCK_XLOCKDONE: xlock_finish
    LOCK_XLOCKDONE --> LOCK_LOCK: xlock_done
    LOCK_XLOCKDONE --> LOCK_SYNC: release_all
    LOCK_TSYN --> LOCK_SYNC: tsyn_finish
    state LOCK_SYNC {
        [*] --> Stable
        Stable --> MultipleReaders
        MultipleReaders --> Stable
    }
    state LOCK_LOCK {
        [*] --> SingleWriter
        SingleWriter --> ExclusiveWrite
    }
    state LOCK_MIX {
        [*] --> MixedAccess
        MixedAccess --> ReadersAndWriters
    }
```
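The top-level transitions in the diagram can be written down as a lookup table, which makes it easy to check which events are legal in a given state. This is a sketch of the diagram only: the event names are the edge labels used above, and `LockState`, `transitions`, and `next_state` are hypothetical names rather than Ceph identifiers.

```cpp
#include <map>
#include <string>
#include <utility>

enum class LockState { Sync, Lock, Mix, Tsyn, Xlock, XlockDone };

typedef std::map<std::pair<LockState, std::string>, LockState> TransitionTable;

// (state, event) -> next state, copied edge by edge from the diagram.
inline const TransitionTable& transitions() {
    static const TransitionTable table = {
        {{LockState::Sync, "need_write"}, LockState::Lock},
        {{LockState::Sync, "need_mixed"}, LockState::Mix},
        {{LockState::Sync, "need_tsyn"}, LockState::Tsyn},
        {{LockState::Lock, "no_writers"}, LockState::Sync},
        {{LockState::Lock, "need_xlock"}, LockState::Xlock},
        {{LockState::Lock, "xlock_finish"}, LockState::XlockDone},
        {{LockState::Mix, "no_readers_writers"}, LockState::Sync},
        {{LockState::Mix, "need_write_only"}, LockState::Lock},
        {{LockState::Xlock, "xlock_finish"}, LockState::XlockDone},
        {{LockState::XlockDone, "xlock_done"}, LockState::Lock},
        {{LockState::XlockDone, "release_all"}, LockState::Sync},
        {{LockState::Tsyn, "tsyn_finish"}, LockState::Sync},
    };
    return table;
}

// Writes the successor state to `to` and returns true if the event is legal
// in state `from`; returns false and leaves `to` untouched otherwise.
inline bool next_state(LockState from, const std::string& event, LockState& to) {
    const TransitionTable& table = transitions();
    TransitionTable::const_iterator it = table.find(std::make_pair(from, event));
    if (it == table.end()) return false;
    to = it->second;
    return true;
}
```

Keeping the transitions in data rather than in nested `switch` statements means an illegal event is simply a failed lookup, which is easy to log or assert on.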
🔄 Cap Workflow in Detail
Complete capability life cycle
```mermaid
sequenceDiagram
    participant C as Client
    participant MDS as Active MDS
    participant L as Locker
    participant CM as CapManager
    participant SL as SimpleLock
    participant J as Journal
    Note over C,J: Phase 1: cap request
    C->>MDS: MClientRequest(open)
    MDS->>L: acquire_locks(inode)
    L->>SL: lock(FILE_LOCK, LOCK_LOCK)
    alt Lock Available
        SL-->>L: lock_granted
        L->>CM: issue_caps(client, inode)
        CM->>CM: create_capability()
        Note over CM: Create the Capability object
        CM->>J: journal_cap_grant()
        J-->>CM: journal_committed
        CM->>C: MClientCaps(grant)
    else Lock Contention
        SL->>SL: add_waiter(client)
        Note over SL: Queue until the lock is released
        SL-->>L: lock_available_callback
        L->>CM: issue_caps()
        CM->>C: MClientCaps(grant)
    end
    Note over C,J: Phase 2: cap use
    C->>C: install_caps()
    C->>C: perform_file_io()
    Note over C,J: Phase 3: cap revocation
    MDS->>CM: revoke_caps(mask)
    CM->>C: MClientCaps(revoke)
    C->>C: flush_dirty_caps()
    C->>C: invalidate_cache()
    C->>MDS: MClientCaps(release)
    MDS->>CM: process_cap_release()
    CM->>SL: unlock()
    SL->>SL: notify_waiters()
```
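Phase 3 is the part where ordering matters most: the client has to write back and drop whatever the revoked bits were protecting before it reports the release. The sketch below models that order for a single inode. It is a toy model under stated assumptions: `ToyClientInode` and `handle_revoke` are hypothetical, the flush and invalidate steps are reduced to counters, and the masks assume an 8-bit FILE field at shift 8.

```cpp
#include <string>
#include <vector>

// Illustrative FILE cap masks (assumed layout).
const unsigned FILE_CACHE = 4u << 8, FILE_RD = 8u << 8,
               FILE_WR = 16u << 8, FILE_BUFFER = 32u << 8;

struct ToyClientInode {
    unsigned issued = 0;           // caps currently held
    int dirty_buffers = 0;         // buffered writes not yet written back
    int cached_pages = 0;          // cached read data
    std::vector<std::string> log;  // steps taken, in order
};

// Apply a revoke message: flush, then invalidate, then give up the bits.
// Returns the caps the client still holds, as it would report in the release.
inline unsigned handle_revoke(ToyClientInode& in, unsigned revoke_mask) {
    unsigned revoking = in.issued & revoke_mask;
    if ((revoking & FILE_BUFFER) && in.dirty_buffers > 0) {
        in.log.push_back("flush_dirty");       // write back before losing BUFFER
        in.dirty_buffers = 0;
    }
    if ((revoking & FILE_CACHE) && in.cached_pages > 0) {
        in.log.push_back("invalidate_cache");  // cached reads may go stale
        in.cached_pages = 0;
    }
    in.issued &= ~revoking;
    in.log.push_back("release");
    return in.issued;
}
```

Revoking `FILE_CACHE | FILE_BUFFER` from a client that holds all four bits leaves it with `FILE_RD | FILE_WR`: it can still read and write, but every operation now goes straight to the OSDs.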
🏛️ Core Data Structures
Client-side capability structure
```cpp
// src/client/Inode.h (simplified sketch)
class Cap {
public:
    MetaSession *session;   // MDS session this cap belongs to
    uint64_t cap_id;        // capability ID
    unsigned issued;        // cap bits issued by the MDS
    unsigned implemented;   // cap bits the client is still acting on
    unsigned wanted;        // cap bits the client wants
    unsigned pending;       // cap bits pending
    utime_t last_used;      // time of last use
    int64_t gen;            // generation number
    int64_t cap_gen;        // cap generation number
    int64_t seq;            // sequence number
    int64_t issue_seq;      // issue sequence number
    int64_t mseq;           // MDS migration sequence number

    // Cap validity checks
    bool is_valid() const { return session != nullptr; }
    bool issued_caps_need_check() const;
    void touch() { last_used = ceph_clock_now(); }
};

// Client-side Inode additions
class Inode {
    // ... other members
    std::map<mds_rank_t, Cap> caps;   // caps held per MDS rank
    unsigned caps_issued() const;     // union of all issued caps
    unsigned caps_wanted() const;     // union of all wanted caps
    void get_caps_issued(unsigned *issued, unsigned *implemented);
};
```
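Because an inode can hold caps from more than one MDS rank, the effective permission set is the union over all entries in the map. The following is a minimal sketch of that aggregation, assuming that caps whose session is gone should be skipped. `ToyCap` and `ToyInode` are stand-ins for the classes above with only the fields the example needs.

```cpp
#include <map>

struct ToyCap {
    bool session_valid = true;  // stands in for `session != nullptr`
    unsigned issued = 0;
    unsigned implemented = 0;
};

struct ToyInode {
    std::map<int, ToyCap> caps;  // keyed by MDS rank

    // OR together the issued bits of every cap with a live session.
    // If `implemented` is non-null, the union of implemented bits is stored there.
    unsigned caps_issued(unsigned* implemented = nullptr) const {
        unsigned issued = 0, impl = 0;
        for (const auto& entry : caps) {
            const ToyCap& cap = entry.second;
            if (!cap.session_valid) continue;  // ignore caps from dead sessions
            issued |= cap.issued;
            impl |= cap.implemented;
        }
        if (implemented) *implemented = impl;
        return issued;
    }
};
```

With caps `0x4` from rank 0 and `0x800` from rank 1, plus an entry for rank 2 whose session is invalid, `caps_issued()` returns `0x804`.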
MDS-side capability structure
```cpp
// src/mds/Capability.h (simplified sketch)
class Capability {
    client_t client;             // client identifier
    CInode *inode;               // inode this cap refers to
    uint64_t cap_id;             // cap ID
    unsigned issued_;            // issued cap bits
    unsigned pending_;           // pending cap bits
    unsigned wanted_;            // cap bits the client wants
    utime_t last_sent;           // time the last message was sent
    utime_t last_revoke_stamp;   // time of the last revocation
    int64_t trans_seq;           // transaction sequence number
    int64_t client_follows;      // sequence number the client follows
    int suppress = 0;            // suppression counter used by the methods below
public:
    // Cap management methods
    client_t get_client() const { return client; }
    void set_issued(unsigned i) { issued_ = i; }
    void set_pending(unsigned p) { pending_ = p; }
    void set_wanted(unsigned w) { wanted_ = w; }
    void inc_suppress() { suppress++; }
    void dec_suppress() { suppress--; }
    bool is_suppress() const { return suppress > 0; }
    bool is_stale() const;
    bool is_valid() const { return client > 0; }

    // Cap queries
    unsigned issued() const { return issued_; }
    unsigned pending() const { return pending_; }
    unsigned wanted() const { return wanted_; }
};
```
🔧 Key Function Implementations
Core cap issuing function
```cpp
// src/mds/Locker.cc (simplified sketch)
void Locker::issue_caps(CInode *in, Capability *cap) {
    dout(7) << "issue_caps for " << *in << " to client." << cap->get_client() << dendl;
    unsigned was_issued = cap->issued();
    unsigned wanted = cap->wanted();
    unsigned issued = 0;

    // Inspect the state of each lock to decide which caps can be issued.
    // AUTH cap: file attribute permissions
    if (in->authlock.can_read(cap->get_client())) {
        issued |= CEPH_CAP_AUTH_SHARED;
    }
    if (in->authlock.can_write(cap->get_client())) {
        issued |= CEPH_CAP_AUTH_EXCL;
    }
    // LINK cap: directory link permissions
    if (in->linklock.can_read(cap->get_client())) {
        issued |= CEPH_CAP_LINK_SHARED;
    }
    if (in->linklock.can_write(cap->get_client())) {
        issued |= CEPH_CAP_LINK_EXCL;
    }
    // XATTR cap: extended attribute permissions
    if (in->xattrlock.can_read(cap->get_client())) {
        issued |= CEPH_CAP_XATTR_SHARED;
    }
    if (in->xattrlock.can_write(cap->get_client())) {
        issued |= CEPH_CAP_XATTR_EXCL;
    }
    // FILE cap: file data permissions (the most involved case)
    if (in->filelock.can_read(cap->get_client())) {
        issued |= CEPH_CAP_FILE_RD;
        if (in->filelock.can_read_projected(cap->get_client())) {
            issued |= CEPH_CAP_FILE_CACHE;
        }
    }
    if (in->filelock.can_write(cap->get_client())) {
        issued |= CEPH_CAP_FILE_WR;
        if (in->filelock.can_write_projected(cap->get_client())) {
            issued |= CEPH_CAP_FILE_BUFFER;
            if (in->filelock.get_state() == LOCK_EXCL) {
                issued |= CEPH_CAP_FILE_EXCL;
            }
        }
    }

    // Never issue more than the client actually wants.
    issued &= wanted;

    // If the issued set changed, send a grant message.
    if (issued != was_issued) {
        cap->set_issued(issued);
        send_cap_grant(cap, issued);
        // Record the change in the journal
        if (mds->mdlog->get_write_pos() > 0) {
            mds->mdlog->submit_entry(new EMetaBlob(mds->mdlog));
        }
    }
}

// Core cap revocation function
void Locker::revoke_caps(CInode *in, int revoke_mask, client_t client) {
    dout(7) << "revoke_caps " << ccap_string(revoke_mask)
            << " on " << *in << dendl;
    auto it = in->get_client_caps().find(client);
    if (it == in->get_client_caps().end()) {
        return;  // this client holds no caps on the inode
    }
    Capability *cap = it->second;
    // Only bits the client actually holds can be revoked.
    unsigned revoking = cap->issued() & revoke_mask;
    if (revoking) {
        dout(7) << " revoking " << ccap_string(revoking)
                << " from client." << client << dendl;
        // Track the bits being revoked as pending and drop them from the issued set.
        cap->set_pending(cap->pending() | revoking);
        cap->set_issued(cap->issued() & ~revoking);
        // Send the revoke message
        send_cap_revoke(cap, revoking);
        // Arm the revoke timeout
        if (!cap->is_suppress()) {
            set_cap_revoke_timeout(cap);
        }
    }
}
```
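The bit arithmetic at the heart of both functions is small enough to isolate. The sketch below computes the new issued set from what the locks allow and what the client wants, then splits the change into newly granted and newly revoked bits. `CapDelta` and `compute_issue` are hypothetical helpers for illustration and are not part of the Locker.

```cpp
struct CapDelta {
    unsigned issued;   // new issued set: allowed AND wanted
    unsigned granted;  // bits in the new set that were not issued before
    unsigned revoked;  // bits issued before that are no longer in the set
    bool changed() const { return granted != 0 || revoked != 0; }
};

// allowed:    bits the current lock states permit for this client
// wanted:     bits the client asked for
// was_issued: bits the client held before this call
inline CapDelta compute_issue(unsigned allowed, unsigned wanted, unsigned was_issued) {
    CapDelta d;
    d.issued = allowed & wanted;
    d.granted = d.issued & ~was_issued;
    d.revoked = was_issued & ~d.issued;
    return d;
}
```

With `allowed = 0x7`, `wanted = 0x6`, and `was_issued = 0xA`, the new issued set is `0x6`, the granted bits are `0x4`, and the revoked bits are `0x8`. A message only needs to be sent when `changed()` is true.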
Lock state check functions
```cpp
// src/mds/locks.cc (simplified sketch)
bool SimpleLock::can_read(client_t client) {
    switch (state) {
    case LOCK_SYNC:
        return true;  // sync state: every client may read
    case LOCK_MIX:
        return true;  // mixed state: reads are allowed
    case LOCK_LOCK:
        // locked state: only lock holders may read
        return is_rdlocked_by(client) || is_wrlocked_by(client);
    case LOCK_XLOCK:
        // exclusive-lock state: only the xlock holder may read
        return is_xlocked_by(client);
    default:
        return false;
    }
}

bool SimpleLock::can_write(client_t client) {
    switch (state) {
    case LOCK_LOCK:
        return is_wrlocked_by(client);
    case LOCK_XLOCK:
        return is_xlocked_by(client);
    default:
        return false;
    }
}

// Lock state transition
void SimpleLock::go_lock() {
    dout(7) << "go_lock on " << *get_parent() << dendl;
    state = LOCK_LOCK;
    // Revoke read caps from every client except the one that took the write lock.
    for (auto& p : parent->get_client_caps()) {
        client_t client = p.first;
        if (client != lock_client) {
            revoke_client_caps(client, CEPH_CAP_FILE_RD);
        }
    }
}
```
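To experiment with these rules outside the MDS, the same policy can be reproduced in a self-contained toy. The sketch below keeps the lock holders in plain sets and follows the `switch` logic above state by state. `ToyLock` and its fields are assumptions made for the example; it does not model waiters, unstable states, or replicas.

```cpp
#include <set>

enum ToyState { SYNC, MIX, LOCK, XLOCK };

struct ToyLock {
    ToyState state = SYNC;
    std::set<int> rdlockers;  // client ids holding a read lock
    std::set<int> wrlockers;  // client ids holding a write lock
    int xlocker = -1;         // client id holding the exclusive lock, or -1

    bool can_read(int client) const {
        switch (state) {
        case SYNC:
        case MIX:
            return true;  // shared states: anyone may read
        case LOCK:
            return rdlockers.count(client) > 0 || wrlockers.count(client) > 0;
        case XLOCK:
            return xlocker == client;
        }
        return false;
    }

    bool can_write(int client) const {
        switch (state) {
        case LOCK:
            return wrlockers.count(client) > 0;
        case XLOCK:
            return xlocker == client;
        default:
            return false;  // SYNC and MIX never allow writes in this model
        }
    }
};
```

Moving a `ToyLock` from `SYNC` to `LOCK` with client 7 as the only write-lock holder flips `can_read(8)` from true to false, which is exactly the moment at which the MDS would have to revoke read caps from client 8.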
🔄 Distributed Consistency Guarantees
Layered consistency model
```mermaid
graph TD
    subgraph "Consistency Levels"
        L1[Client Cache<br/>eventual consistency<br/>Client-side caching]
        L2[MDS Memory<br/>sequential consistency<br/>In-memory metadata]
        L3[Journal WAL<br/>strong consistency<br/>Write-ahead logging]
        L4[RADOS Storage<br/>linearizability<br/>Distributed storage]
    end
    subgraph "Consistency Mechanisms"
        M1[Cap Revocation<br/>cache invalidation protocol]
        M2[Lock Ordering<br/>deadlock avoidance]
        M3[Journal Commit<br/>transactional guarantee]
        M4[RADOS ACID<br/>atomic operations]
    end
    subgraph "Failure Handling"
        F1[Client Failure<br/>cap reclaim on timeout]
        F2[MDS Failure<br/>cap rebuild]
        F3[Network Partition<br/>split-brain handling]
        F4[Storage Failure<br/>replica recovery]
    end
    L1 --> |Cache Coherence| M1
    L2 --> |Mutual Exclusion| M2
    L3 --> |Durability| M3
    L4 --> |Atomicity| M4
    M1 --> F1
    M2 --> F2
    M3 --> F3
    M4 --> F4
    F1 --> L2
    F2 --> L3
    F3 --> L4
    F4 --> L1
```
Cap migration flow (MDS failover)
```mermaid
sequenceDiagram
    participant C as Client
    participant MDS1 as Failed MDS
    participant MDS2 as Takeover MDS
    participant MON as Monitor
    participant RADOS as RADOS Store
    Note over C,RADOS: MDS failure detection
    Note over MDS1: failure or network partition
    MON->>MON: detect MDS1 failure
    MON->>MDS2: assign_rank(failed_rank)
    Note over C,RADOS: Cap state rebuild
    MDS2->>RADOS: read journal and metadata
    RADOS-->>MDS2: return persisted state
    MDS2->>MDS2: replay_journal()
    MDS2->>MDS2: rebuild_cap_state()
    Note over C,RADOS: Client reconnect
    C->>MDS2: reconnect request
    MDS2->>C: MClientSession(renewal)
    C->>MDS2: report current caps state
    Note over C,RADOS: Cap state synchronization
    MDS2->>MDS2: validate_client_caps()
    alt Caps Valid
        MDS2->>C: MClientCaps(confirm)
    else Caps Invalid
        MDS2->>C: MClientCaps(revoke_all)
        C->>C: flush_and_invalidate()
        C->>MDS2: re-request the caps it needs
        MDS2->>C: MClientCaps(grant)
    end
```
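The `alt` branch in the diagram comes down to one decision per client: either the caps it reports are still acceptable, or they are revoked wholesale. The sketch below expresses that decision under an assumed rule, namely that reported caps are valid when they are a subset of what the rebuilt lock state allows. The diagram does not spell out the actual criterion, so `validate_client_caps` and `ReconnectAction` here are hypothetical.

```cpp
enum class ReconnectAction { Confirm, RevokeAll };

// reported: caps the client claims to hold after reconnecting
// allowed:  caps the takeover MDS is willing to honor after journal replay
// Assumed rule: confirm only if the client claims nothing beyond `allowed`.
inline ReconnectAction validate_client_caps(unsigned reported, unsigned allowed) {
    unsigned excess = reported & ~allowed;
    return excess == 0 ? ReconnectAction::Confirm : ReconnectAction::RevokeAll;
}
```

A client reporting `0x6` against an allowed set of `0x7` is confirmed, while one reporting `0xE` carries a bit the MDS cannot honor and takes the revoke-all path, after which it flushes, invalidates, and asks again.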
📊 Performance Optimization Strategies
Cap cache optimization
```cpp
// src/client/Client.cc - client-side cap cache optimization (illustrative sketch)
class Client {
    // Cap cache management
    LRUObjects cap_lru;        // LRU of inodes with cached caps
    uint64_t max_caps_cache;   // maximum number of caps to keep cached

    void trim_caps() {
        // Expire the least recently used inodes until the cache fits the limit.
        while (cap_lru.lru_get_size() > max_caps_cache) {
            Inode *in = static_cast<Inode*>(cap_lru.lru_expire());
            if (in) {
                release_caps(in, CEPH_CAP_FILE_CACHE);
            }
        }
    }

    // Heuristic cap prediction
    void predict_caps_needed(Inode *in, unsigned &wanted) {
        // Predict the caps that will be needed from the observed access pattern.
        if (in->access_pattern & ACCESS_PATTERN_SEQUENTIAL) {
            wanted |= CEPH_CAP_FILE_CACHE;
        }
        if (in->access_pattern & ACCESS_PATTERN_RANDOM) {
            wanted |= CEPH_CAP_FILE_RD;
        }
        if (in->dirty_pages > 0) {
            wanted |= CEPH_CAP_FILE_BUFFER;
        }
    }
};
```
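The trimming loop above relies on an LRU container that is not shown. As a rough sketch of how such a structure can work, the class below keeps inode numbers in recency order with a `std::list` and uses a hash map for O(1) lookup. `CapLru` is a hypothetical stand-in and not the container used by the Ceph client; `trim()` returns the evicted inode numbers so that the caller can release their caps.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

class CapLru {
    std::list<int> order_;  // inode numbers, most recently used first
    std::unordered_map<int, std::list<int>::iterator> pos_;
    std::size_t max_;       // maximum number of entries to keep

public:
    explicit CapLru(std::size_t max_caps) : max_(max_caps) {}

    // Mark an inode as just used, inserting it if it is new.
    void touch(int ino) {
        auto it = pos_.find(ino);
        if (it != pos_.end()) order_.erase(it->second);
        order_.push_front(ino);
        pos_[ino] = order_.begin();
    }

    // Evict least recently used inodes until the size limit is met.
    std::vector<int> trim() {
        std::vector<int> released;
        while (order_.size() > max_) {
            int victim = order_.back();
            order_.pop_back();
            pos_.erase(victim);
            released.push_back(victim);
        }
        return released;
    }

    std::size_t size() const { return order_.size(); }
};
```

With a limit of 2, touching inodes 1, 2, 3 and then 1 again leaves 2 as the least recently used entry, so `trim()` evicts only inode 2.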