Kudu 启动慢是长期以来饱受诟病的一个问题,部分数据量大的集群,单节点重启极端耗时需要 30 分钟以上,同时其部分参数并不能进行运行时调优,需要重启才能生效,这困扰着运维人员,影响升级和变更流程。
社区以及一些组织实现了一些优化方案,具体可以见 此文 tserver启动加速 一节,一些具体的 patch 包括:
然而即使在一些高版本上,当前依然存在启动慢耗时长的问题,原因通常是:数据量大、资源限制导致的 metadata 的 compact 不及时等。
社区目前最佳的手段可能是使用 RocksDB 去存储底层的 metadata 数据,这是较新的方式,然而对于大量的存量环境,要将现有的 metadata 存储迁移到 RocksDB 上,成本较高;并且较新的特性不确定是否还有其他的问题,底层文件存储的问题可能导致数据问题,风险也较大。
近期,在探索 kudu 启动问题上,笔者注意到一个点是:在抓取 kudu 启动时间的过程中,发现 kudu tserver 启动中 sys time 较高。
结合之前探索过的 kudu 文件系统实现:
日期: 2024-10-23
标签:
#big data
#kudu Block 抽象接口 操作系统把磁盘抽象成文件,Kudu 则在文件之上再加了一层抽象——Block。在 Kudu 中,一列数据、一个 BloomFilter、一份主键索引,最终都变成一个或多个 Block 写入磁盘。Block 是 Kudu 存储引擎与本地文件系统之间的分界线:上层组件只需面对 Block 接口的 Append / Read,不必关心底层是一个独立文件(FileBlockManager)还是日志容器中的一段字节区间(LogBlockManager)。
这种隔离方式与 Unix 的设备抽象思路一致——Unix 内核用 struct file 加上 read/write 函数指针屏蔽底层设备差异;Kudu 用 Block 基类加上虚函数屏蔽底层文件系统的组织差异。
BlockId:Block 的身份标识 每个 Block 都有一个全局唯一的身份标识 BlockId(src/kudu/fs/block_id.h),本质上是一个 64 位无符号整数:
1 2 3 4 class BlockId { private: uint64_t id_; }; BlockId 可以序列化到 protobuf(CopyToPB / FromPB),也可以打印成 16 位十六进制字符串供调试。它是不透明的——上层代码不应假设 ID 的分配规则或数值含义,只需把它当作一把钥匙,用来在 BlockManager 中取回对应的数据。
Block 基类 Block 是所有 Block 的基类,接口极为简洁——只有一个方法:
1 2 3 4 5 class Block { public: virtual ~Block() = default; virtual const BlockId& id() const = 0; }; id() 返回该 Block 的 BlockId。这足以让上层代码通过 ID 引用任何 Block,而不必关心它是可读的还是可写的。Block 的两个子类——WritableBlock 和 ReadableBlock——分别定义了写路径和读路径的完整接口。
......
进一步详细分析发现,sys time 高的原因是在启动的过程中存在大量的 preadv syscall。
在包含大量存储块的环境上,启动过程中会产生海量的细碎读操作,这些读操作均需要 syscall,大量的上下文切换和读操作,会拖慢加载速度。
下文是对此问题的详细分析,以及针对该问题提出的一种优化方案。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
| I20260525 17:57:30.571516 492752 log_block_manager.cc:3914] Opened 16200 log block containers
I20260525 17:57:38.019493 492753 log_block_manager.cc:3914] Opened 16400 log block containers
I20260525 17:57:45.528719 492758 log_block_manager.cc:3914] Opened 16600 log block containers
I20260525 17:57:51.387989 514375 log_block_manager.cc:4415] Read-only block manager, skipping repair
I20260525 17:57:51.389531 492740 fs_manager.cc:736] Time spent opening block manager: real 661.574s user 0.002s sys 0.000s
I20260525 17:57:51.389570 492740 fs_manager.cc:653] Opened local filesystem: /data/rnddata00/kudu/tserver_data,/data/rnddata00/kudu/tserver_wal
uuid: "8b4ce73ea5e6404494bd82becda008bf"
format_stamp: "Formatted at 2026-05-13 10:12:37 on hybrid01"
FS layout report
----------------
wal directory: /data/rnddata00/kudu/tserver_wal
metadata directory: /data/rnddata00/kudu/tserver_wal
1 data directories: /data/rnddata00/kudu/tserver_data/data
Total live blocks: 77987
Total live bytes: 18064438689
Total live bytes (after alignment): 18234310656
Total number of LBM containers: 16768 (0 full)
Total missing blocks: 0
Total orphaned blocks: 8405 (0 repaired)
Total orphaned block bytes: 21998789 (0 repaired)
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)
real 11m2.509s
user 23m53.968s
sys 53m11.945s
|
问题分析#
在注意到上面问题之后,对启动过程中的 “reading file system” 过程使用 perf 抓取了启动中的 syscall 情况,并对调用栈进行了分析。
perf 结果显示启动中存在海量的 preadv 调用。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
| LOST 7871 events!
<..... 9000 more lines of LOST events ...>
LOST 32929 events!
LOST 31516 events!
Summary of events:
kudu-tserver (30770), 62 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 26 53706.680 0.000 2065.642 5005.867 22.71%
write 5 0.078 0.013 0.016 0.019 7.57%
kudu-tserver (30771), 48 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 24 55044.566 0.000 2293.524 5010.907 22.67%
kudu-tserver (30772), 48 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 24 55024.037 0.000 2292.668 5008.982 22.67%
kernel-watcher- (30773), 6328 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 592 59870.175 0.000 101.132 222.008 4.13%
open 296 24.958 0.008 0.084 21.784 87.24%
read 898 8.968 0.002 0.010 0.035 3.38%
munmap 296 2.339 0.005 0.008 0.021 1.12%
mmap 296 2.218 0.006 0.007 0.018 0.96%
fstat 296 1.041 0.003 0.004 0.013 0.96%
close 296 0.914 0.003 0.003 0.012 1.32%
rt_sigprocmask 174 0.500 0.003 0.003 0.003 0.44%
mincore 10 0.038 0.003 0.004 0.004 4.27%
write 10 0.033 0.003 0.003 0.004 1.53%
file cache-evic (30797), 4 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 2 0.003 0.000 0.002 0.003 100.00%
sq_acceptor (30798), 611 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
poll 297 59911.950 0.000 201.724 222.064 0.46%
clone 1 0.027 0.027 0.027 0.027 0.00%
setsockopt 3 0.010 0.003 0.003 0.004 13.34%
mprotect 1 0.007 0.007 0.007 0.007 0.00%
accept 1 0.007 0.007 0.007 0.007 0.00%
mmap 1 0.006 0.006 0.006 0.006 0.00%
fcntl 1 0.003 0.003 0.003 0.003 0.00%
getsockname 1 0.003 0.003 0.003 0.003 0.00%
dir /data (30799), 2991490 events, 4.8%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 1343517 49711.550 0.000 0.037 92.131 1.46%
stat 61203 272.244 0.002 0.004 0.036 0.09%
open 30622 247.117 0.003 0.008 36.105 23.95%
futex 646 161.280 0.002 0.250 34.882 32.61%
rt_sigprocmask 18478 127.298 0.002 0.007 43.518 36.99%
fstat 30628 98.476 0.002 0.003 0.029 0.12%
close 7202 23.597 0.002 0.003 0.072 0.35%
mincore 1161 4.410 0.003 0.004 0.012 0.70%
read 1161 3.935 0.003 0.003 0.017 0.76%
write 1161 3.840 0.002 0.003 0.018 0.71%
clock_gettime 30 0.135 0.003 0.005 0.012 7.42%
brk 3 0.012 0.003 0.004 0.006 22.57%
compact_schedul (30812), 24 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 12 46007.877 0.000 3833.990 16000.743 46.51%
dir /data (30813), 8484357 events, 13.6%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4218088 35480.531 0.002 0.008 86.394 2.07%
futex 1639 355.904 0.002 0.217 36.006 23.95%
rt_sigprocmask 18090 146.308 0.002 0.008 25.533 31.46%
read 1432 21.247 0.002 0.015 15.150 71.45%
mincore 1431 5.532 0.002 0.004 0.127 2.30%
write 1432 4.747 0.002 0.003 0.012 0.52%
open 47 0.329 0.004 0.007 0.010 3.63%
fstat 85 0.283 0.003 0.003 0.006 1.35%
clock_gettime 64 0.269 0.003 0.004 0.008 3.88%
stat 29 0.169 0.004 0.006 0.008 3.62%
close 36 0.115 0.003 0.003 0.004 1.18%
dir /data (30814), 8418485 events, 13.5%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4186255 34155.452 0.002 0.008 89.695 2.08%
futex 1814 618.258 0.002 0.341 97.937 22.42%
rt_sigprocmask 17030 111.125 0.002 0.007 25.736 31.97%
mincore 1344 4.990 0.003 0.004 0.013 0.49%
read 1344 4.678 0.002 0.003 0.138 2.93%
write 1343 4.371 0.003 0.003 0.013 0.43%
open 51 0.356 0.004 0.007 0.011 3.52%
fstat 90 0.300 0.003 0.003 0.012 3.03%
clock_gettime 68 0.273 0.003 0.004 0.007 3.49%
stat 35 0.201 0.004 0.006 0.008 3.11%
close 39 0.124 0.003 0.003 0.004 1.40%
brk 3 0.010 0.003 0.003 0.005 21.07%
dir /data (30815), 8444630 events, 13.6%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4198290 33605.348 0.002 0.008 121.906 2.08%
futex 1833 357.323 0.002 0.195 41.395 22.91%
rt_sigprocmask 17866 180.114 0.002 0.010 105.896 60.01%
write 1412 5.613 0.002 0.004 1.017 18.07%
mincore 1413 5.356 0.003 0.004 0.014 0.69%
read 1412 4.770 0.002 0.003 0.015 0.57%
open 46 0.327 0.004 0.007 0.009 3.45%
clock_gettime 64 0.290 0.003 0.005 0.020 7.34%
fstat 83 0.278 0.003 0.003 0.004 0.93%
stat 31 0.192 0.005 0.006 0.013 4.91%
close 34 0.110 0.003 0.003 0.004 1.24%
dir /data (30816), 8535085 events, 13.7%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4242171 35026.262 0.002 0.008 114.182 2.04%
futex 1768 348.946 0.002 0.197 118.673 37.82%
rt_sigprocmask 19022 81.015 0.002 0.004 22.809 28.21%
mincore 1478 42.816 0.002 0.029 25.157 65.11%
read 1478 16.972 0.002 0.011 12.014 70.77%
write 1477 4.828 0.002 0.003 0.015 0.57%
open 62 0.419 0.004 0.007 0.011 3.31%
fstat 106 0.347 0.003 0.003 0.005 0.94%
clock_gettime 80 0.328 0.003 0.004 0.007 3.33%
stat 39 0.216 0.004 0.006 0.008 2.82%
close 47 0.150 0.003 0.003 0.004 1.12%
brk 9 0.033 0.003 0.004 0.006 11.34%
dir /data (30817), 8491494 events, 13.6%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4221264 34668.736 0.002 0.008 131.688 2.10%
futex 1826 319.663 0.002 0.175 33.265 25.59%
rt_sigprocmask 18300 55.593 0.002 0.003 2.237 4.04%
mincore 1436 5.364 0.002 0.004 0.015 0.53%
read 1436 4.845 0.002 0.003 0.015 0.62%
write 1436 4.667 0.002 0.003 0.014 0.40%
stat 29 1.192 0.004 0.041 1.023 85.34%
fstat 83 0.476 0.003 0.006 0.203 41.96%
open 45 0.321 0.005 0.007 0.010 3.36%
clock_gettime 64 0.254 0.003 0.004 0.007 3.66%
close 32 0.104 0.003 0.003 0.004 1.17%
gettid 1 0.003 0.003 0.003 0.003 0.00%
dir /data (30818), 8411713 events, 13.5%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4181435 34919.129 0.002 0.008 92.185 2.00%
futex 1835 354.040 0.002 0.193 31.058 22.58%
rt_sigprocmask 18180 85.027 0.002 0.005 22.419 27.37%
write 1414 54.329 0.003 0.038 33.294 68.31%
open 70 25.668 0.004 0.367 25.191 98.12%
mincore 1412 5.464 0.003 0.004 0.123 2.28%
read 1412 4.794 0.002 0.003 0.018 0.67%
fstat 121 0.393 0.003 0.003 0.004 0.68%
clock_gettime 92 0.380 0.003 0.004 0.007 3.00%
stat 39 0.216 0.004 0.006 0.009 3.02%
close 51 0.165 0.003 0.003 0.004 1.17%
brk 3 0.019 0.003 0.006 0.011 38.37%
dir /data (30819), 8509579 events, 13.7%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 4228714 36142.178 0.001 0.009 135.743 2.11%
futex 1982 312.357 0.002 0.158 39.418 24.65%
rt_sigprocmask 19496 104.292 0.002 0.005 23.827 31.56%
read 1502 56.911 0.003 0.038 29.035 64.85%
mincore 1500 18.674 0.003 0.012 13.015 69.67%
write 1502 4.865 0.002 0.003 0.022 0.64%
open 53 0.379 0.004 0.007 0.011 3.23%
clock_gettime 76 0.340 0.003 0.004 0.019 5.64%
fstat 99 0.328 0.003 0.003 0.004 0.80%
stat 31 0.195 0.004 0.006 0.013 5.97%
close 38 0.126 0.003 0.003 0.004 1.55%
brk 9 0.035 0.003 0.004 0.007 13.39%
gettid 1 0.004 0.004 0.004 0.004 0.00%
|
结合调用栈和启动中的 i/o 路径发现,这些 preadv 存在于启动中的 log conatainer metadata 记录解析上。具体如下:
启动 kudu fs 的 i/o 调用路径:
1
2
3
4
5
6
7
| LogBlockManager::Open
└─ for each data_dir: dd->ExecClosure(OpenDataDir) // 每个目录一个线程池任务
├─ env_->GetChildren(dir) // 1 次 getdents 扫整个目录
└─ for each container in containers_seen:
├─ OpenContainer (同步) // 打开 .data,可能查 size、查 fd cache
└─ dd->ExecClosure(LoadContainer) // 异步丢回同一线程池
└─ ProcessRecords(metadata_file) // 解析 .metadata
|
其中在 ProcessRecords 解析 metadata 的过程中实现如下:
1
2
3
4
5
6
7
| ReadablePBContainerFile pb_reader(std::move(metadata_reader));
RETURN_NOT_OK_HANDLE_ERROR(pb_reader.Open());
while (true) {
BlockRecordPB record;
read_status = pb_reader.ReadNextPB(&record);
...
}
|
而 ReadNextPB -> ReadFullPB -> ReadPBStartingAt 里,每条记录会调用 两次 ValidateAndReadData,底层走 RandomAccessFile::Read,顺着调用链路,在 Linux 上对应 preadv。
1
2
3
4
5
| #if defined(__APPLE__)
RETRY_ON_EINTR(r, preadvsim(fd, iov + completed_iov, iov_count, cur_offset));
#else
RETRY_ON_EINTR(r, preadv(fd, iov + completed_iov, iov_count, cur_offset));
#endif
|
也就是说:
每条 BlockRecordPB 至少 2 次 preadv;外加打开容器时的 open + fstat 的额外读。
一个典型的 tablet server,单容器若有 10 万条 record,单容器即 ~20 万次 pread;成千上万个容器并发起来很容易把启动阶段卡在系统调用上下文切换与微读上。
优化方案#
由于 Kudu log block container 具有如下特点:
- 单容器最大 block 数受 –log_container_max_blocks 限制;
- 启动阶段还会触发 log_container_live_metadata_before_compact_ratio 的 compact,进一步压缩;
- 对于 100 万条 record,每条 BlockRecordPB 序列化后通常几十字节,整文件常在几十 MB 量级
上述特点导致导致此文件天生不大,通常在几十 MB 区间上。
因此,可以考虑进行如下优化:
启动时,一次性把 metadata 文件读进内存,再解析。
预期收益:preadv syscall 数从 O(sum {containes 记录数}) 降到 O(sum {containers 数});计算量不变;内存峰值约等于一个 .metadata 文件大小 x 线程池大小(实际可控)。
具体实现上:引入一个参数 –log_container_metadata_inmem_replay_threshold_bytes, 参数值设置为 64MB, 当 metadata 文件大小小于此参数时,采用优化方案;参数设置为 0 时,回退到之前的路径上,其他任何解析错误也回退,避免影响正常启动流程。
由于存储格式问题,当前对于加密路径还没有方案完全支持,需要进一步探索。
Benchmark 测试#
注:本节与下一节的测试,为了排除 page cache 等的影响,均测量的多组,并求取其平均值。
使用 kudu 单元测试中的 benchmark 测试,对优化后的方案和优化前的方案进行了如下多组对比:
preadv 对比#
优化前:
1
2
3
4
5
| TEST_TMPDIR=/data/rnddata00/test
rm -rf $TEST_TMPDIR
mkdir -p $TEST_TMPDIR
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo perf stat -e 'syscalls:sys_enter_preadv,syscalls:sys_enter_preadv2,syscalls:sys_enter_pread64' -- env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter='EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0' --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=6710886400 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags
|
结果:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
| wal directory:
metadata directory:
1 data directories: /data/rnddata00/test/log_block_manager-test.EncryptionEnabled_LogBlockManagerTest.StartupBenchmark_0.1779945625777115-2852759-0/0/data
Total live blocks: 5000000
Total live bytes: 5000000
Total live bytes (after alignment): 20480000000
Total number of LBM containers: 2 (1 full)
Did not check for missing blocks
Did not check for orphaned blocks
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)
I20260528 13:21:33.849928 2852759 log_block_manager-test.cc:1207] Time spent reopening block manager: real 7.881s user 0.001s sys 0.000s
I20260528 13:21:33.849943 2852759 log_block_manager-test.cc:1210] Test on --block_manager=log
[ OK ] EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 (71772 ms)
[----------] 1 test from EncryptionEnabled/LogBlockManagerTest (71772 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (71773 ms total)
[ PASSED ] 1 test.
Performance counter stats for
'env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter=EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=67108864 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags'
10,000,042 syscalls:sys_enter_preadv
0 syscalls:sys_enter_preadv2
9 syscalls:sys_enter_pread64
|
优化后:
1
2
3
4
5
| TEST_TMPDIR=/data/rnddata00/test
rm -rf $TEST_TMPDIR
mkdir -p $TEST_TMPDIR
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo perf stat -e 'syscalls:sys_enter_preadv,syscalls:sys_enter_preadv2,syscalls:sys_enter_pread64’ -- env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter='EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0' --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=0 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags
|
结果:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
| wal directory:
metadata directory:
1 data directories: /data/rnddata00/test/log_block_manager-test.EncryptionEnabled_LogBlockManagerTest.StartupBenchmark_0.1779946381731206-2879106-0/0/data
Total live blocks: 5000000
Total live bytes: 5000000
Total live bytes (after alignment): 20480000000
Total number of LBM containers: 2 (1 full)
Did not check for missing blocks
Did not check for orphaned blocks
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)
I20260528 13:34:04.301520 2879106 log_block_manager-test.cc:1207] Time spent reopening block manager: real 2.218s user 0.001s sys 0.000s
I20260528 13:34:04.301535 2879106 log_block_manager-test.cc:1210] Test on --block_manager=log
[ OK ] EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 (66021 ms)
[----------] 1 test from EncryptionEnabled/LogBlockManagerTest (66021 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (66021 ms total)
[ PASSED ] 1 test.
Performance counter stats for
'env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter=EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=6710886400 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags'
38 syscalls:sys_enter_preadv
0 syscalls:sys_enter_preadv2
9 syscalls:sys_enter_pread64
|
Group 1: 数据集大小影响#
| 测试用例 | Live Blocks数据量 | Streaming耗时 | In-memory耗时 | 加速比 |
|---|
| g1_small | 10,199 | ~0.15s | ~0.06s | 2.5x |
| g1_medium | 50,499 | ~0.82s | ~0.39s | 2.1x |
| g1_large | 100,999 | ~1.64s | ~0.80s | 2.0x |
- 数据量增大时加速比略微下降,但内存模式始终有约2倍优势
Group 2: 删除比例影响#
| 测试用例 | Live Blocks数据量 | 删除比例 | Streaming耗时 | In-memory耗时 | 加速比 |
|---|
| g2_del0 | 5,499 | 0% | ~0.83s | ~0.39s | 2.1x |
| g2_del50 | 50,499 | 50% | ~0.82s | ~0.39s | 2.1x |
| g2_del90 | 50,499 | 90% | ~0.81s | ~0.39s | 2.1x |
| g2_del99 | 5,499 | 99% | ~0.83s | ~0.39s | 2.1x |
- 删除比例对性能影响很小,关键因素是存活 block 数量而非总删除量。
Group 3: –log_container_max_blocks 参数影响#
| 测试用例 | Live Blocks数据量 | container 数量 | Streaming耗时 | In-memory耗时 | 加速比 |
|---|
| g3_maxb-1 | 无限制 | 3 | ~0.79s | ~0.39s | 2.0x |
| g3_maxb10000 | 10,000 | 51 | ~0.19s | ~0.02s | 9.3x |
| g3_maxb1000 | 1,000 | 362 | ~0.19s | ~0.02s | 8.4x |
真实负载测试#
实际验证,使用背景中同样的数据负载,同样环境下,启动时间缩小约一半。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
| I20260525 18:21:23.380833 563115 log_block_manager.cc:4094] Opened 14800 log block containers
I20260525 18:21:26.773694 563117 log_block_manager.cc:4094] Opened 15000 log block containers
I20260525 18:21:30.263221 563109 log_block_manager.cc:4094] Opened 15200 log block containers
I20260525 18:21:33.728258 563117 log_block_manager.cc:4094] Opened 15400 log block containers
I20260525 18:21:37.007612 563109 log_block_manager.cc:4094] Opened 15600 log block containers
I20260525 18:21:40.462819 563113 log_block_manager.cc:4094] Opened 15800 log block containers
I20260525 18:21:44.021945 563117 log_block_manager.cc:4094] Opened 16000 log block containers
I20260525 18:21:47.487699 563115 log_block_manager.cc:4094] Opened 16200 log block containers
I20260525 18:21:50.876221 563109 log_block_manager.cc:4094] Opened 16400 log block containers
I20260525 18:21:54.176977 563117 log_block_manager.cc:4094] Opened 16600 log block containers
I20260525 18:21:56.882167 573124 log_block_manager.cc:4595] Read-only block manager, skipping repair
I20260525 18:21:56.883350 563099 fs_manager.cc:736] Time spent opening block manager: real 280.781s user 0.002s sys 0.000s
I20260525 18:21:56.883383 563099 fs_manager.cc:653] Opened local filesystem: /data/rnddata00/kudu/tserver_data,/data/rnddata00/kudu/tserver_wal
uuid: "8b4ce73ea5e6404494bd82becda008bf"
format_stamp: "Formatted at 2026-05-13 10:12:37 on hybrid01"
FS layout report
----------------
wal directory: /data/rnddata00/kudu/tserver_wal
metadata directory: /data/rnddata00/kudu/tserver_wal
1 data directories: /data/rnddata00/kudu/tserver_data/data
Total live blocks: 77987
Total live bytes: 18064438689
Total live bytes (after alignment): 18234310656
Total number of LBM containers: 16768 (0 full)
Total missing blocks: 0
Total orphaned blocks: 8405 (0 repaired)
Total orphaned block bytes: 21998789 (0 repaired)
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)
real 4m41.436s
user 9m14.918s
sys 0m40.768s
|
perf 分析显示,优化后,preadv 降低 99% 以上:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
| Summary of events:
file cache-evic (26887), 4 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 2 0.003 0.000 0.002 0.003 100.00%
sq_acceptor (26888), 617 events, 0.1%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
poll 300 59859.127 0.000 199.530 201.965 0.33%
clone 1 0.025 0.025 0.025 0.025 0.00%
setsockopt 3 0.007 0.002 0.002 0.003 14.99%
accept 1 0.006 0.006 0.006 0.006 0.00%
mmap 1 0.005 0.005 0.005 0.005 0.00%
mprotect 1 0.005 0.005 0.005 0.005 0.00%
fcntl 1 0.003 0.003 0.003 0.003 0.00%
getsockname 1 0.002 0.002 0.002 0.002 0.00%
dir /data (26889), 67171 events, 9.4%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 7679 59613.202 0.000 7.763 158.199 1.93%
stat 10222 53.554 0.002 0.005 0.070 0.65%
open 5109 48.864 0.003 0.010 11.327 28.83%
fstat 5122 28.419 0.002 0.006 0.051 1.05%
futex 2546 18.938 0.002 0.007 0.044 0.66%
rt_sigprocmask 2331 5.672 0.002 0.002 0.024 1.08%
clone 40 1.284 0.022 0.032 0.061 3.78%
getrusage 160 0.500 0.002 0.003 0.014 3.09%
mincore 124 0.436 0.002 0.004 0.014 4.37%
read 126 0.365 0.002 0.003 0.016 4.39%
write 126 0.348 0.002 0.003 0.015 4.18%
compact_schedul (26903), 24 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 12 46000.342 0.000 3833.362 16000.068 46.51%
dir /data (26904), 31348 events, 4.4%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 17 2147.507 0.000 126.324 174.976 7.48%
futex 387 502.123 0.002 1.297 500.066 99.59%
rt_sigprocmask 14197 32.275 0.002 0.002 0.029 0.39%
mincore 318 0.849 0.002 0.003 0.006 1.43%
read 319 0.792 0.002 0.002 0.014 1.88%
write 319 0.791 0.002 0.002 0.022 3.20%
fstat 33 0.158 0.002 0.005 0.016 10.94%
clock_gettime 33 0.155 0.002 0.005 0.016 11.64%
open 16 0.108 0.006 0.007 0.008 2.16%
close 17 0.104 0.003 0.006 0.009 5.12%
stat 13 0.087 0.005 0.007 0.016 12.78%
madvise 1 0.016 0.016 0.016 0.016 0.00%
dir /data (26905), 29565 events, 4.1%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 17 2091.876 0.000 123.052 160.809 7.99%
futex 334 501.858 0.002 1.503 500.065 99.64%
rt_sigprocmask 13472 29.471 0.002 0.002 0.037 0.38%
mincore 276 0.759 0.002 0.003 0.006 1.60%
read 276 0.693 0.002 0.003 0.004 1.31%
write 276 0.660 0.002 0.002 0.004 1.16%
fstat 33 0.175 0.002 0.005 0.013 10.55%
clock_gettime 33 0.133 0.002 0.004 0.008 9.06%
open 16 0.111 0.006 0.007 0.009 2.70%
close 17 0.106 0.003 0.006 0.009 6.02%
stat 13 0.075 0.005 0.006 0.007 3.59%
madvise 1 0.010 0.010 0.010 0.010 0.00%
dir /data (26906), 73291 events, 10.2%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 41 3371.359 0.000 82.228 162.952 8.78%
futex 939 1812.729 0.002 1.930 500.074 40.83%
rt_sigprocmask 33144 80.148 0.002 0.002 0.038 0.26%
mincore 743 2.317 0.002 0.003 0.013 1.16%
read 742 2.184 0.002 0.003 0.020 1.61%
write 740 2.045 0.002 0.003 0.013 1.24%
fstat 81 0.454 0.002 0.006 0.032 9.10%
clock_gettime 81 0.386 0.002 0.005 0.013 5.28%
open 40 0.294 0.006 0.007 0.009 1.82%
close 41 0.264 0.003 0.006 0.016 5.30%
stat 29 0.184 0.005 0.006 0.014 4.67%
madvise 1 0.016 0.016 0.016 0.016 0.00%
dir /data (26907), 30399 events, 4.2%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 17 2120.379 0.000 124.728 172.488 7.43%
futex 381 502.051 0.002 1.318 500.061 99.60%
rt_sigprocmask 13830 29.709 0.002 0.002 0.027 0.35%
mincore 280 0.739 0.002 0.003 0.011 1.97%
read 281 0.676 0.002 0.002 0.007 1.42%
write 281 0.645 0.002 0.002 0.008 1.42%
fstat 33 0.153 0.002 0.005 0.009 9.30%
clock_gettime 33 0.141 0.002 0.004 0.009 9.76%
open 16 0.118 0.006 0.007 0.011 4.44%
close 17 0.105 0.003 0.006 0.008 5.28%
stat 10 0.061 0.005 0.006 0.008 4.44%
madvise 1 0.008 0.008 0.008 0.008 0.00%
dir /data (26908), 126376 events, 17.6%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 75 4898.300 0.000 65.311 169.181 7.31%
futex 1954 2550.487 0.002 1.305 500.063 28.31%
rt_sigprocmask 56851 129.303 0.002 0.002 0.079 0.21%
mincore 1237 3.521 0.002 0.003 0.014 1.10%
read 1236 3.218 0.002 0.003 0.011 1.06%
write 1235 3.060 0.002 0.002 0.027 1.31%
fstat 149 0.731 0.002 0.005 0.019 4.87%
clock_gettime 149 0.706 0.002 0.005 0.015 4.10%
open 74 0.524 0.005 0.007 0.011 1.74%
close 75 0.475 0.003 0.006 0.013 2.70%
stat 52 0.298 0.005 0.006 0.008 1.89%
madvise 2 0.160 0.006 0.080 0.154 92.38%
dir /data (26909), 199424 events, 27.8%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 109 6527.724 0.000 59.887 173.392 5.65%
futex 2491 2672.380 0.002 1.073 500.067 25.38%
rt_sigprocmask 90684 208.333 0.002 0.002 0.069 0.16%
mincore 1851 5.366 0.002 0.003 0.018 0.91%
read 1849 4.887 0.002 0.003 0.022 0.89%
write 1846 4.633 0.002 0.003 0.013 0.75%
fstat 217 1.118 0.002 0.005 0.019 3.92%
clock_gettime 216 1.053 0.002 0.005 0.016 3.39%
open 108 0.790 0.005 0.007 0.018 2.61%
close 109 0.725 0.003 0.007 0.016 2.11%
stat 84 0.501 0.005 0.006 0.009 1.70%
madvise 1 0.014 0.014 0.014 0.014 0.00%
gettid 2 0.008 0.003 0.004 0.006 33.10%
dir /data (26910), 151496 events, 21.2%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
preadv 82 5338.005 0.000 65.098 160.753 6.38%
futex 2074 2391.814 0.002 1.153 500.062 29.44%
rt_sigprocmask 68540 155.601 0.002 0.002 0.037 0.18%
mincore 1479 4.297 0.002 0.003 0.032 1.16%
read 1478 4.017 0.002 0.003 0.023 1.27%
write 1480 3.799 0.002 0.003 0.023 1.07%
fstat 163 0.840 0.002 0.005 0.024 4.94%
clock_gettime 162 0.749 0.002 0.005 0.011 3.78%
open 81 0.578 0.005 0.007 0.019 2.54%
close 82 0.524 0.003 0.006 0.013 3.02%
stat 61 0.363 0.005 0.006 0.020 4.17%
madvise 1 0.010 0.010 0.010 0.010 0.00%
kudu-tserver (26830), 68 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 28 54260.608 0.000 1937.879 5000.069 22.12%
write 6 0.116 0.014 0.019 0.026 9.66%
kudu-tserver (26831), 48 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 24 55000.770 0.000 2291.699 5000.076 22.67%
kudu-tserver (26832), 48 events, 0.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 24 55000.779 0.000 2291.699 5000.074 22.67%
kernel-watcher- (26833), 6225 events, 0.9%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 601 59820.422 0.000 99.535 200.145 4.10%
read 905 7.515 0.002 0.008 0.038 3.49%
open 299 2.988 0.006 0.010 0.040 1.66%
munmap 300 2.490 0.004 0.008 0.035 2.41%
mmap 300 2.275 0.004 0.008 0.023 1.86%
fstat 300 1.017 0.002 0.003 0.016 2.86%
close 299 0.926 0.002 0.003 0.018 3.87%
rt_sigprocmask 96 0.209 0.002 0.002 0.007 4.23%
mincore 6 0.018 0.002 0.003 0.004 12.36%
write 6 0.014 0.002 0.002 0.003 2.71%
|
其他资源对比:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
| Container .metadata size distribution:
Bucket Count %
<= 4 KiB 4 0.02
<= 16 KiB 5 0.03
<= 64 KiB 18 0.11
<= 256 KiB 153 0.92
<= 1 MiB 724 4.33
<= 4 MiB 3,928 23.49
<= 16 MiB 11,889 71.10
<= 64 MiB 0 0
> 64 MiB 0 0
total 16,721 100.00
- Same data dir, same flags between BEFORE/AFTER except
--log_container_metadata_inmem_replay_threshold_bytes
(67108864 = default 64 MiB).
Results
-------
1) Wall time of the LBM open phase :
BEFORE: 694.171 s
AFTER: 235.797 s
2) preadv syscalls during the open phase (perf trace summary):
BEFORE: ~30 M preadv
AFTER: ~8 k preadv
3) Disk IO on the data disk:
Metric BEFORE AFTER
r/s 195 451
rkB/s 52,400 195,000
await (ms) 1.19 108.8
%util 19.2 % 97.4 %
Representative raw samples:
BEFORE: r/s=180 rkB/s=49960 await=1.22 %util=17.20
BEFORE: r/s=213 rkB/s=49576 await=1.09 %util=18.90
BEFORE: r/s=221 rkB/s=63032 await=1.20 %util=21.10
AFTER: r/s=427 rkB/s=194848 await=118.60 %util=97.30
AFTER: r/s=426 rkB/s=194080 await=121.66 %util=98.40
AFTER: r/s=428 rkB/s=194644 await=112.38 %util=97.60
Reading these: BEFORE the disk is idle (~19 % util, sub-ms
await) -- the bottleneck is CPU syscall, not IO. AFTER the
disk is bandwidth-bound at its ~195 MB/s cap (97 % util).
The bottleneck has cleanly moved from "syscalls" to
"disk bandwidth".
4) CPU of the kudu-tserver process:
Metric BEFORE AFTER delta
avg %user 169.7 369.9 2.18x (more useful work)
avg %system 624.2 19.7 31.7x lower
avg total %CPU 793.9 389.6 2.04x lower
max total %CPU - 550
Representative raw samples:
BEFORE: %usr=170 %sys=623 %CPU=793
BEFORE: %usr=168 %sys=628 %CPU=796
AFTER: %usr=372 %sys= 20 %CPU=392
AFTER: %usr=518 %sys= 21 %CPU=539
|
结论与未来工作#
结论:采用上述方案,将满足条件的 metadata file 读入内存处理,预期节省 20% 到 50% 的启动时间。
未来工作与其他优化:
- 支持加密路径
- 延迟启动中 log block 的创建