背景

Kudu 启动慢是长期以来饱受诟病的一个问题,部分数据量大的集群,单节点重启极端耗时需要 30 分钟以上,同时其部分参数并不能进行运行时调优,需要重启才能生效,这困扰着运维人员,影响升级和变更流程。

社区以及一些组织实现了一些优化方案,具体可以见 此文 tserver启动加速 一节,一些具体的 patch 包括:

  • KUDU-2977 Sharding block map of LogBlockManager

  • KUDU-3001 Multi-thread to load containers in a data directory

然而即使在一些高版本上,当前依然存在启动慢耗时长的问题,原因通常是:数据量大、资源限制导致的 metadata 的 compact 不及时等。

社区目前最佳的手段可能是使用 RocksDB 去存储底层的 metadata 数据,这是较新的方式,然而对于大量的存量环境,要将现有的 metadata 存储迁移到 RocksDB 上,成本较高;并且较新的特性不确定是否还有其他的问题,底层文件存储的问题可能导致数据问题,风险也较大。

近期,在探索 kudu 启动问题上,笔者注意到一个点是:在抓取 kudu 启动时间的过程中,发现 kudu tserver 启动中 sys time 较高。

结合之前探索过的 kudu 文件系统实现:

[笔记] Kudu 存储引擎 | Kudu 文件系统
日期: 2024-10-23   标签: #big data  #kudu 
Block 抽象接口 操作系统把磁盘抽象成文件,Kudu 则在文件之上再加了一层抽象——Block。在 Kudu 中,一列数据、一个 BloomFilter、一份主键索引,最终都变成一个或多个 Block 写入磁盘。Block 是 Kudu 存储引擎与本地文件系统之间的分界线:上层组件只需面对 Block 接口的 Append / Read,不必关心底层是一个独立文件(FileBlockManager)还是日志容器中的一段字节区间(LogBlockManager)。 这种隔离方式与 Unix 的设备抽象思路一致——Unix 内核用 struct file 加上 read/write 函数指针屏蔽底层设备差异;Kudu 用 Block 基类加上虚函数屏蔽底层文件系统的组织差异。 BlockId:Block 的身份标识 每个 Block 都有一个全局唯一的身份标识 BlockId(src/kudu/fs/block_id.h),本质上是一个 64 位无符号整数: 1 2 3 4 class BlockId { private: uint64_t id_; }; BlockId 可以序列化到 protobuf(CopyToPB / FromPB),也可以打印成 16 位十六进制字符串供调试。它是不透明的——上层代码不应假设 ID 的分配规则或数值含义,只需把它当作一把钥匙,用来在 BlockManager 中取回对应的数据。 Block 基类 Block 是所有 Block 的基类,接口极为简洁——只有一个方法: 1 2 3 4 5 class Block { public: virtual ~Block() = default; virtual const BlockId& id() const = 0; }; id() 返回该 Block 的 BlockId。这足以让上层代码通过 ID 引用任何 Block,而不必关心它是可读的还是可写的。Block 的两个子类——WritableBlock 和 ReadableBlock——分别定义了写路径和读路径的完整接口。 ......

进一步详细分析发现,sys time 高的原因是在启动的过程中存在大量的 preadv syscall。

在包含大量存储块的环境上,启动过程中会产生海量的细碎读操作,这些读操作均需要 syscall,大量的上下文切换和读操作,会拖慢加载速度。

下文是对此问题的详细分析,以及针对该问题提出的一种优化方案。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
I20260525 17:57:30.571516 492752 log_block_manager.cc:3914] Opened 16200 log block containers
I20260525 17:57:38.019493 492753 log_block_manager.cc:3914] Opened 16400 log block containers
I20260525 17:57:45.528719 492758 log_block_manager.cc:3914] Opened 16600 log block containers
I20260525 17:57:51.387989 514375 log_block_manager.cc:4415] Read-only block manager, skipping repair
I20260525 17:57:51.389531 492740 fs_manager.cc:736] Time spent opening block manager: real 661.574s     user 0.002s     sys 0.000s
I20260525 17:57:51.389570 492740 fs_manager.cc:653] Opened local filesystem: /data/rnddata00/kudu/tserver_data,/data/rnddata00/kudu/tserver_wal

uuid: "8b4ce73ea5e6404494bd82becda008bf"
format_stamp: "Formatted at 2026-05-13 10:12:37 on hybrid01"

FS layout report
----------------

wal directory: /data/rnddata00/kudu/tserver_wal
metadata directory: /data/rnddata00/kudu/tserver_wal
1 data directories: /data/rnddata00/kudu/tserver_data/data

Total live blocks: 77987
Total live bytes: 18064438689
Total live bytes (after alignment): 18234310656
Total number of LBM containers: 16768 (0 full)

Total missing blocks: 0
Total orphaned blocks: 8405 (0 repaired)
Total orphaned block bytes: 21998789 (0 repaired)

Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)

Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)

real    11m2.509s
user    23m53.968s
sys     53m11.945s

问题分析

在注意到上面问题之后,对启动过程中的 “reading file system” 过程使用 perf 抓取了启动中的 syscall 情况,并对调用栈进行了分析。

perf 结果显示启动中存在海量的 preadv 调用。

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
LOST 7871 events!
<..... 9000 more lines of LOST events ...>
LOST 32929 events!
LOST 31516 events!

 Summary of events:
 kudu-tserver (30770), 62 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 26 53706.680     0.000  2065.642  5005.867     22.71%
   write                  5     0.078     0.013     0.016     0.019      7.57%

 kudu-tserver (30771), 48 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 24 55044.566     0.000  2293.524  5010.907     22.67%

 kudu-tserver (30772), 48 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 24 55024.037     0.000  2292.668  5008.982     22.67%

 kernel-watcher- (30773), 6328 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                592 59870.175     0.000   101.132   222.008      4.13%
   open                 296    24.958     0.008     0.084    21.784     87.24%
   read                 898     8.968     0.002     0.010     0.035      3.38%
   munmap               296     2.339     0.005     0.008     0.021      1.12%
   mmap                 296     2.218     0.006     0.007     0.018      0.96%
   fstat                296     1.041     0.003     0.004     0.013      0.96%
   close                296     0.914     0.003     0.003     0.012      1.32%
   rt_sigprocmask       174     0.500     0.003     0.003     0.003      0.44%
   mincore               10     0.038     0.003     0.004     0.004      4.27%
   write                 10     0.033     0.003     0.003     0.004      1.53%

 file cache-evic (30797), 4 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                  2     0.003     0.000     0.002     0.003    100.00%

 sq_acceptor (30798), 611 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   poll                 297 59911.950     0.000   201.724   222.064      0.46%
   clone                  1     0.027     0.027     0.027     0.027      0.00%
   setsockopt             3     0.010     0.003     0.003     0.004     13.34%
   mprotect               1     0.007     0.007     0.007     0.007      0.00%
   accept                 1     0.007     0.007     0.007     0.007      0.00%
   mmap                   1     0.006     0.006     0.006     0.006      0.00%
   fcntl                  1     0.003     0.003     0.003     0.003      0.00%
   getsockname            1     0.003     0.003     0.003     0.003      0.00%

 dir /data (30799), 2991490 events, 4.8%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           1343517 49711.550     0.000     0.037    92.131      1.46%
   stat               61203   272.244     0.002     0.004     0.036      0.09%
   open               30622   247.117     0.003     0.008    36.105     23.95%
   futex                646   161.280     0.002     0.250    34.882     32.61%
   rt_sigprocmask     18478   127.298     0.002     0.007    43.518     36.99%
   fstat              30628    98.476     0.002     0.003     0.029      0.12%
   close               7202    23.597     0.002     0.003     0.072      0.35%
   mincore             1161     4.410     0.003     0.004     0.012      0.70%
   read                1161     3.935     0.003     0.003     0.017      0.76%
   write               1161     3.840     0.002     0.003     0.018      0.71%
   clock_gettime         30     0.135     0.003     0.005     0.012      7.42%
   brk                    3     0.012     0.003     0.004     0.006     22.57%

 compact_schedul (30812), 24 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 12 46007.877     0.000  3833.990 16000.743     46.51%

 dir /data (30813), 8484357 events, 13.6%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4218088 35480.531     0.002     0.008    86.394      2.07%
   futex               1639   355.904     0.002     0.217    36.006     23.95%
   rt_sigprocmask     18090   146.308     0.002     0.008    25.533     31.46%
   read                1432    21.247     0.002     0.015    15.150     71.45%
   mincore             1431     5.532     0.002     0.004     0.127      2.30%
   write               1432     4.747     0.002     0.003     0.012      0.52%
   open                  47     0.329     0.004     0.007     0.010      3.63%
   fstat                 85     0.283     0.003     0.003     0.006      1.35%
   clock_gettime         64     0.269     0.003     0.004     0.008      3.88%
   stat                  29     0.169     0.004     0.006     0.008      3.62%
   close                 36     0.115     0.003     0.003     0.004      1.18%

 dir /data (30814), 8418485 events, 13.5%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4186255 34155.452     0.002     0.008    89.695      2.08%
   futex               1814   618.258     0.002     0.341    97.937     22.42%
   rt_sigprocmask     17030   111.125     0.002     0.007    25.736     31.97%
   mincore             1344     4.990     0.003     0.004     0.013      0.49%
   read                1344     4.678     0.002     0.003     0.138      2.93%
   write               1343     4.371     0.003     0.003     0.013      0.43%
   open                  51     0.356     0.004     0.007     0.011      3.52%
   fstat                 90     0.300     0.003     0.003     0.012      3.03%
   clock_gettime         68     0.273     0.003     0.004     0.007      3.49%
   stat                  35     0.201     0.004     0.006     0.008      3.11%
   close                 39     0.124     0.003     0.003     0.004      1.40%
   brk                    3     0.010     0.003     0.003     0.005     21.07%

 dir /data (30815), 8444630 events, 13.6%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4198290 33605.348     0.002     0.008   121.906      2.08%
   futex               1833   357.323     0.002     0.195    41.395     22.91%
   rt_sigprocmask     17866   180.114     0.002     0.010   105.896     60.01%
   write               1412     5.613     0.002     0.004     1.017     18.07%
   mincore             1413     5.356     0.003     0.004     0.014      0.69%
   read                1412     4.770     0.002     0.003     0.015      0.57%
   open                  46     0.327     0.004     0.007     0.009      3.45%
   clock_gettime         64     0.290     0.003     0.005     0.020      7.34%
   fstat                 83     0.278     0.003     0.003     0.004      0.93%
   stat                  31     0.192     0.005     0.006     0.013      4.91%
   close                 34     0.110     0.003     0.003     0.004      1.24%

 dir /data (30816), 8535085 events, 13.7%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4242171 35026.262     0.002     0.008   114.182      2.04%
   futex               1768   348.946     0.002     0.197   118.673     37.82%
   rt_sigprocmask     19022    81.015     0.002     0.004    22.809     28.21%
   mincore             1478    42.816     0.002     0.029    25.157     65.11%
   read                1478    16.972     0.002     0.011    12.014     70.77%
   write               1477     4.828     0.002     0.003     0.015      0.57%
   open                  62     0.419     0.004     0.007     0.011      3.31%
   fstat                106     0.347     0.003     0.003     0.005      0.94%
   clock_gettime         80     0.328     0.003     0.004     0.007      3.33%
   stat                  39     0.216     0.004     0.006     0.008      2.82%
   close                 47     0.150     0.003     0.003     0.004      1.12%
   brk                    9     0.033     0.003     0.004     0.006     11.34%

 dir /data (30817), 8491494 events, 13.6%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4221264 34668.736     0.002     0.008   131.688      2.10%
   futex               1826   319.663     0.002     0.175    33.265     25.59%
   rt_sigprocmask     18300    55.593     0.002     0.003     2.237      4.04%
   mincore             1436     5.364     0.002     0.004     0.015      0.53%
   read                1436     4.845     0.002     0.003     0.015      0.62%
   write               1436     4.667     0.002     0.003     0.014      0.40%
   stat                  29     1.192     0.004     0.041     1.023     85.34%
   fstat                 83     0.476     0.003     0.006     0.203     41.96%
   open                  45     0.321     0.005     0.007     0.010      3.36%
   clock_gettime         64     0.254     0.003     0.004     0.007      3.66%
   close                 32     0.104     0.003     0.003     0.004      1.17%
   gettid                 1     0.003     0.003     0.003     0.003      0.00%

 dir /data (30818), 8411713 events, 13.5%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4181435 34919.129     0.002     0.008    92.185      2.00%
   futex               1835   354.040     0.002     0.193    31.058     22.58%
   rt_sigprocmask     18180    85.027     0.002     0.005    22.419     27.37%
   write               1414    54.329     0.003     0.038    33.294     68.31%
   open                  70    25.668     0.004     0.367    25.191     98.12%
   mincore             1412     5.464     0.003     0.004     0.123      2.28%
   read                1412     4.794     0.002     0.003     0.018      0.67%
   fstat                121     0.393     0.003     0.003     0.004      0.68%
   clock_gettime         92     0.380     0.003     0.004     0.007      3.00%
   stat                  39     0.216     0.004     0.006     0.009      3.02%
   close                 51     0.165     0.003     0.003     0.004      1.17%
   brk                    3     0.019     0.003     0.006     0.011     38.37%

 dir /data (30819), 8509579 events, 13.7%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv           4228714 36142.178     0.001     0.009   135.743      2.11%
   futex               1982   312.357     0.002     0.158    39.418     24.65%
   rt_sigprocmask     19496   104.292     0.002     0.005    23.827     31.56%
   read                1502    56.911     0.003     0.038    29.035     64.85%
   mincore             1500    18.674     0.003     0.012    13.015     69.67%
   write               1502     4.865     0.002     0.003     0.022      0.64%
   open                  53     0.379     0.004     0.007     0.011      3.23%
   clock_gettime         76     0.340     0.003     0.004     0.019      5.64%
   fstat                 99     0.328     0.003     0.003     0.004      0.80%
   stat                  31     0.195     0.004     0.006     0.013      5.97%
   close                 38     0.126     0.003     0.003     0.004      1.55%
   brk                    9     0.035     0.003     0.004     0.007     13.39%
   gettid                 1     0.004     0.004     0.004     0.004      0.00%

结合调用栈和启动中的 i/o 路径发现,这些 preadv 存在于启动中的 log conatainer metadata 记录解析上。具体如下:

启动 kudu fs 的 i/o 调用路径:

1
2
3
4
5
6
7
LogBlockManager::Open
  └─ for each data_dir: dd->ExecClosure(OpenDataDir)        // 每个目录一个线程池任务
       ├─ env_->GetChildren(dir)                            // 1 次 getdents 扫整个目录
       └─ for each container in containers_seen:
            ├─ OpenContainer  (同步)                         // 打开 .data,可能查 size、查 fd cache
            └─ dd->ExecClosure(LoadContainer)                // 异步丢回同一线程池
                  └─ ProcessRecords(metadata_file)           // 解析 .metadata

其中在 ProcessRecords 解析 metadata 的过程中实现如下:

1
2
3
4
5
6
7
ReadablePBContainerFile pb_reader(std::move(metadata_reader));
RETURN_NOT_OK_HANDLE_ERROR(pb_reader.Open());
while (true) {
  BlockRecordPB record;
  read_status = pb_reader.ReadNextPB(&record);
  ...
}

而 ReadNextPB -> ReadFullPB -> ReadPBStartingAt 里,每条记录会调用 两次 ValidateAndReadData,底层走 RandomAccessFile::Read,顺着调用链路,在 Linux 上对应 preadv。

1
2
3
4
5
#if defined(__APPLE__)
    RETRY_ON_EINTR(r, preadvsim(fd, iov + completed_iov, iov_count, cur_offset));
#else
    RETRY_ON_EINTR(r, preadv(fd, iov + completed_iov, iov_count, cur_offset));
#endif

也就是说:

每条 BlockRecordPB 至少 2 次 preadv;外加打开容器时的 open + fstat 的额外读。

一个典型的 tablet server,单容器若有 10 万条 record,单容器即 ~20 万次 pread;成千上万个容器并发起来很容易把启动阶段卡在系统调用上下文切换与微读上。

优化方案

由于 Kudu log block container 具有如下特点:

  • 单容器最大 block 数受 –log_container_max_blocks 限制;
  • 启动阶段还会触发 log_container_live_metadata_before_compact_ratio 的 compact,进一步压缩;
  • 对于 100 万条 record,每条 BlockRecordPB 序列化后通常几十字节,整文件常在几十 MB 量级

上述特点导致导致此文件天生不大,通常在几十 MB 区间上。

因此,可以考虑进行如下优化:

启动时,一次性把 metadata 文件读进内存,再解析。

预期收益:preadv syscall 数从 O(sum {containes 记录数}) 降到 O(sum {containers 数});计算量不变;内存峰值约等于一个 .metadata 文件大小 x 线程池大小(实际可控)。

具体实现上:引入一个参数 –log_container_metadata_inmem_replay_threshold_bytes, 参数值设置为 64MB, 当 metadata 文件大小小于此参数时,采用优化方案;参数设置为 0 时,回退到之前的路径上,其他任何解析错误也回退,避免影响正常启动流程。

由于存储格式问题,当前对于加密路径还没有方案完全支持,需要进一步探索

Benchmark 测试

注:本节与下一节的测试,为了排除 page cache 等的影响,均测量的多组,并求取其平均值。

使用 kudu 单元测试中的 benchmark 测试,对优化后的方案和优化前的方案进行了如下多组对比:

preadv 对比

优化前:

1
2
3
4
5
TEST_TMPDIR=/data/rnddata00/test
rm -rf $TEST_TMPDIR
mkdir -p $TEST_TMPDIR
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo perf stat -e 'syscalls:sys_enter_preadv,syscalls:sys_enter_preadv2,syscalls:sys_enter_pread64' -- env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter='EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0' --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=6710886400 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags

结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
wal directory:
metadata directory:
1 data directories: /data/rnddata00/test/log_block_manager-test.EncryptionEnabled_LogBlockManagerTest.StartupBenchmark_0.1779945625777115-2852759-0/0/data

Total live blocks: 5000000
Total live bytes: 5000000
Total live bytes (after alignment): 20480000000

Total number of LBM containers: 2 (1 full)

Did not check for missing blocks
Did not check for orphaned blocks

Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)

Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)

I20260528 13:21:33.849928 2852759 log_block_manager-test.cc:1207] Time spent reopening block manager: real 7.881s     user 0.001s     sys 0.000s
I20260528 13:21:33.849943 2852759 log_block_manager-test.cc:1210] Test on --block_manager=log

[       OK ] EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 (71772 ms)
[----------] 1 test from EncryptionEnabled/LogBlockManagerTest (71772 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (71773 ms total)
[  PASSED  ] 1 test.

Performance counter stats for
'env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter=EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=67108864 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags'

10,000,042      syscalls:sys_enter_preadv
0               syscalls:sys_enter_preadv2
9               syscalls:sys_enter_pread64

优化后:

1
2
3
4
5
TEST_TMPDIR=/data/rnddata00/test
rm -rf $TEST_TMPDIR
mkdir -p $TEST_TMPDIR
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo perf stat -e 'syscalls:sys_enter_preadv,syscalls:sys_enter_preadv2,syscalls:sys_enter_pread64’ -- env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter='EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0' --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_deleted_block_percentage=0 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=0 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags

结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
wal directory:
metadata directory:
1 data directories: /data/rnddata00/test/log_block_manager-test.EncryptionEnabled_LogBlockManagerTest.StartupBenchmark_0.1779946381731206-2879106-0/0/data

Total live blocks: 5000000
Total live bytes: 5000000
Total live bytes (after alignment): 20480000000

Total number of LBM containers: 2 (1 full)

Did not check for missing blocks
Did not check for orphaned blocks

Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)

Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)

I20260528 13:34:04.301520 2879106 log_block_manager-test.cc:1207] Time spent reopening block manager: real 2.218s     user 0.001s     sys 0.000s
I20260528 13:34:04.301535 2879106 log_block_manager-test.cc:1210] Test on --block_manager=log

[       OK ] EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 (66021 ms)
[----------] 1 test from EncryptionEnabled/LogBlockManagerTest (66021 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (66021 ms total)
[  PASSED  ] 1 test.

Performance counter stats for
'env KUDU_ALLOW_SLOW_TESTS=1 TEST_TMPDIR=/data/rnddata00/test ./log_block_manager-test --gtest_filter=EncryptionEnabled/LogBlockManagerTest.StartupBenchmark/0 --startup_benchmark_batch_count_for_testing=5000 --startup_benchmark_block_count_per_batch_for_testing=1000 --startup_benchmark_data_dir_count_for_testing=1 --startup_benchmark_reopen_times=1 --log_container_metadata_inmem_replay_threshold_bytes=6710886400 --fs_max_thread_count_per_data_dir 8 --unlock_experimental_flags'

38              syscalls:sys_enter_preadv
0               syscalls:sys_enter_preadv2
9               syscalls:sys_enter_pread64
  • 优化前:preadv syscall 数量大,约为 记录数 2倍;

  • 优化后:preadv syscall 基本消除。

Group 1: 数据集大小影响

测试用例Live Blocks数据量Streaming耗时In-memory耗时加速比
g1_small10,199~0.15s~0.06s2.5x
g1_medium50,499~0.82s~0.39s2.1x
g1_large100,999~1.64s~0.80s2.0x
  • 数据量增大时加速比略微下降,但内存模式始终有约2倍优势

Group 2: 删除比例影响

测试用例Live Blocks数据量删除比例Streaming耗时In-memory耗时加速比
g2_del05,4990%~0.83s~0.39s2.1x
g2_del5050,49950%~0.82s~0.39s2.1x
g2_del9050,49990%~0.81s~0.39s2.1x
g2_del995,49999%~0.83s~0.39s2.1x
  • 删除比例对性能影响很小,关键因素是存活 block 数量而非总删除量。

Group 3: –log_container_max_blocks 参数影响

测试用例Live Blocks数据量container 数量Streaming耗时In-memory耗时加速比
g3_maxb-1无限制3~0.79s~0.39s2.0x
g3_maxb1000010,00051~0.19s~0.02s9.3x
g3_maxb10001,000362~0.19s~0.02s8.4x
  • 优化前 Streaming: 每个数据目录可并行处理容器 ;

  • 优化后 In-memory: 一次性将所有元数据加载到内存,加速明显。

真实负载测试

实际验证,使用背景中同样的数据负载,同样环境下,启动时间缩小约一半。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
I20260525 18:21:23.380833 563115 log_block_manager.cc:4094] Opened 14800 log block containers
I20260525 18:21:26.773694 563117 log_block_manager.cc:4094] Opened 15000 log block containers
I20260525 18:21:30.263221 563109 log_block_manager.cc:4094] Opened 15200 log block containers
I20260525 18:21:33.728258 563117 log_block_manager.cc:4094] Opened 15400 log block containers
I20260525 18:21:37.007612 563109 log_block_manager.cc:4094] Opened 15600 log block containers
I20260525 18:21:40.462819 563113 log_block_manager.cc:4094] Opened 15800 log block containers
I20260525 18:21:44.021945 563117 log_block_manager.cc:4094] Opened 16000 log block containers
I20260525 18:21:47.487699 563115 log_block_manager.cc:4094] Opened 16200 log block containers
I20260525 18:21:50.876221 563109 log_block_manager.cc:4094] Opened 16400 log block containers
I20260525 18:21:54.176977 563117 log_block_manager.cc:4094] Opened 16600 log block containers

I20260525 18:21:56.882167 573124 log_block_manager.cc:4595] Read-only block manager, skipping repair
I20260525 18:21:56.883350 563099 fs_manager.cc:736] Time spent opening block manager: real 280.781s     user 0.002s     sys 0.000s
I20260525 18:21:56.883383 563099 fs_manager.cc:653] Opened local filesystem: /data/rnddata00/kudu/tserver_data,/data/rnddata00/kudu/tserver_wal

uuid: "8b4ce73ea5e6404494bd82becda008bf"
format_stamp: "Formatted at 2026-05-13 10:12:37 on hybrid01"

FS layout report
----------------

wal directory: /data/rnddata00/kudu/tserver_wal
metadata directory: /data/rnddata00/kudu/tserver_wal
1 data directories: /data/rnddata00/kudu/tserver_data/data

Total live blocks: 77987
Total live bytes: 18064438689
Total live bytes (after alignment): 18234310656
Total number of LBM containers: 16768 (0 full)

Total missing blocks: 0
Total orphaned blocks: 8405 (0 repaired)
Total orphaned block bytes: 21998789 (0 repaired)

Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)

Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
Total corrupted LBM metadata records in RocksDB: 0 (0 repaired)

real    4m41.436s
user    9m14.918s
sys     0m40.768s

perf 分析显示,优化后,preadv 降低 99% 以上:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
Summary of events:
 file cache-evic (26887), 4 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                  2     0.003     0.000     0.002     0.003    100.00%

 sq_acceptor (26888), 617 events, 0.1%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   poll                 300 59859.127     0.000   199.530   201.965      0.33%
   clone                  1     0.025     0.025     0.025     0.025      0.00%
   setsockopt             3     0.007     0.002     0.002     0.003     14.99%
   accept                 1     0.006     0.006     0.006     0.006      0.00%
   mmap                   1     0.005     0.005     0.005     0.005      0.00%
   mprotect               1     0.005     0.005     0.005     0.005      0.00%
   fcntl                  1     0.003     0.003     0.003     0.003      0.00%
   getsockname            1     0.002     0.002     0.002     0.002      0.00%

 dir /data (26889), 67171 events, 9.4%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv              7679 59613.202     0.000     7.763   158.199      1.93%
   stat               10222    53.554     0.002     0.005     0.070      0.65%
   open                5109    48.864     0.003     0.010    11.327     28.83%
   fstat               5122    28.419     0.002     0.006     0.051      1.05%
   futex               2546    18.938     0.002     0.007     0.044      0.66%
   rt_sigprocmask      2331     5.672     0.002     0.002     0.024      1.08%
   clone                 40     1.284     0.022     0.032     0.061      3.78%
   getrusage            160     0.500     0.002     0.003     0.014      3.09%
   mincore              124     0.436     0.002     0.004     0.014      4.37%
   read                 126     0.365     0.002     0.003     0.016      4.39%
   write                126     0.348     0.002     0.003     0.015      4.18%

 compact_schedul (26903), 24 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 12 46000.342     0.000  3833.362 16000.068     46.51%

 dir /data (26904), 31348 events, 4.4%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                17  2147.507     0.000   126.324   174.976      7.48%
   futex                387   502.123     0.002     1.297   500.066     99.59%
   rt_sigprocmask     14197    32.275     0.002     0.002     0.029      0.39%
   mincore              318     0.849     0.002     0.003     0.006      1.43%
   read                 319     0.792     0.002     0.002     0.014      1.88%
   write                319     0.791     0.002     0.002     0.022      3.20%
   fstat                 33     0.158     0.002     0.005     0.016     10.94%
   clock_gettime         33     0.155     0.002     0.005     0.016     11.64%
   open                  16     0.108     0.006     0.007     0.008      2.16%
   close                 17     0.104     0.003     0.006     0.009      5.12%
   stat                  13     0.087     0.005     0.007     0.016     12.78%
   madvise                1     0.016     0.016     0.016     0.016      0.00%

 dir /data (26905), 29565 events, 4.1%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                17  2091.876     0.000   123.052   160.809      7.99%
   futex                334   501.858     0.002     1.503   500.065     99.64%
   rt_sigprocmask     13472    29.471     0.002     0.002     0.037      0.38%
   mincore              276     0.759     0.002     0.003     0.006      1.60%
   read                 276     0.693     0.002     0.003     0.004      1.31%
   write                276     0.660     0.002     0.002     0.004      1.16%
   fstat                 33     0.175     0.002     0.005     0.013     10.55%
   clock_gettime         33     0.133     0.002     0.004     0.008      9.06%
   open                  16     0.111     0.006     0.007     0.009      2.70%
   close                 17     0.106     0.003     0.006     0.009      6.02%
   stat                  13     0.075     0.005     0.006     0.007      3.59%
   madvise                1     0.010     0.010     0.010     0.010      0.00%

 dir /data (26906), 73291 events, 10.2%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                41  3371.359     0.000    82.228   162.952      8.78%
   futex                939  1812.729     0.002     1.930   500.074     40.83%
   rt_sigprocmask     33144    80.148     0.002     0.002     0.038      0.26%
   mincore              743     2.317     0.002     0.003     0.013      1.16%
   read                 742     2.184     0.002     0.003     0.020      1.61%
   write                740     2.045     0.002     0.003     0.013      1.24%
   fstat                 81     0.454     0.002     0.006     0.032      9.10%
   clock_gettime         81     0.386     0.002     0.005     0.013      5.28%
   open                  40     0.294     0.006     0.007     0.009      1.82%
   close                 41     0.264     0.003     0.006     0.016      5.30%
   stat                  29     0.184     0.005     0.006     0.014      4.67%
   madvise                1     0.016     0.016     0.016     0.016      0.00%

 dir /data (26907), 30399 events, 4.2%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                17  2120.379     0.000   124.728   172.488      7.43%
   futex                381   502.051     0.002     1.318   500.061     99.60%
   rt_sigprocmask     13830    29.709     0.002     0.002     0.027      0.35%
   mincore              280     0.739     0.002     0.003     0.011      1.97%
   read                 281     0.676     0.002     0.002     0.007      1.42%
   write                281     0.645     0.002     0.002     0.008      1.42%
   fstat                 33     0.153     0.002     0.005     0.009      9.30%
   clock_gettime         33     0.141     0.002     0.004     0.009      9.76%
   open                  16     0.118     0.006     0.007     0.011      4.44%
   close                 17     0.105     0.003     0.006     0.008      5.28%
   stat                  10     0.061     0.005     0.006     0.008      4.44%
   madvise                1     0.008     0.008     0.008     0.008      0.00%

 dir /data (26908), 126376 events, 17.6%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                75  4898.300     0.000    65.311   169.181      7.31%
   futex               1954  2550.487     0.002     1.305   500.063     28.31%
   rt_sigprocmask     56851   129.303     0.002     0.002     0.079      0.21%
   mincore             1237     3.521     0.002     0.003     0.014      1.10%
   read                1236     3.218     0.002     0.003     0.011      1.06%
   write               1235     3.060     0.002     0.002     0.027      1.31%
   fstat                149     0.731     0.002     0.005     0.019      4.87%
   clock_gettime        149     0.706     0.002     0.005     0.015      4.10%
   open                  74     0.524     0.005     0.007     0.011      1.74%
   close                 75     0.475     0.003     0.006     0.013      2.70%
   stat                  52     0.298     0.005     0.006     0.008      1.89%
   madvise                2     0.160     0.006     0.080     0.154     92.38%

 dir /data (26909), 199424 events, 27.8%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv               109  6527.724     0.000    59.887   173.392      5.65%
   futex               2491  2672.380     0.002     1.073   500.067     25.38%
   rt_sigprocmask     90684   208.333     0.002     0.002     0.069      0.16%
   mincore             1851     5.366     0.002     0.003     0.018      0.91%
   read                1849     4.887     0.002     0.003     0.022      0.89%
   write               1846     4.633     0.002     0.003     0.013      0.75%
   fstat                217     1.118     0.002     0.005     0.019      3.92%
   clock_gettime        216     1.053     0.002     0.005     0.016      3.39%
   open                 108     0.790     0.005     0.007     0.018      2.61%
   close                109     0.725     0.003     0.007     0.016      2.11%
   stat                  84     0.501     0.005     0.006     0.009      1.70%
   madvise                1     0.014     0.014     0.014     0.014      0.00%
   gettid                 2     0.008     0.003     0.004     0.006     33.10%

 dir /data (26910), 151496 events, 21.2%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   preadv                82  5338.005     0.000    65.098   160.753      6.38%
   futex               2074  2391.814     0.002     1.153   500.062     29.44%
   rt_sigprocmask     68540   155.601     0.002     0.002     0.037      0.18%
   mincore             1479     4.297     0.002     0.003     0.032      1.16%
   read                1478     4.017     0.002     0.003     0.023      1.27%
   write               1480     3.799     0.002     0.003     0.023      1.07%
   fstat                163     0.840     0.002     0.005     0.024      4.94%
   clock_gettime        162     0.749     0.002     0.005     0.011      3.78%
   open                  81     0.578     0.005     0.007     0.019      2.54%
   close                 82     0.524     0.003     0.006     0.013      3.02%
   stat                  61     0.363     0.005     0.006     0.020      4.17%
   madvise                1     0.010     0.010     0.010     0.010      0.00%

 kudu-tserver (26830), 68 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 28 54260.608     0.000  1937.879  5000.069     22.12%
   write                  6     0.116     0.014     0.019     0.026      9.66%

 kudu-tserver (26831), 48 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 24 55000.770     0.000  2291.699  5000.076     22.67%

 kudu-tserver (26832), 48 events, 0.0%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                 24 55000.779     0.000  2291.699  5000.074     22.67%

 kernel-watcher- (26833), 6225 events, 0.9%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   futex                601 59820.422     0.000    99.535   200.145      4.10%
   read                 905     7.515     0.002     0.008     0.038      3.49%
   open                 299     2.988     0.006     0.010     0.040      1.66%
   munmap               300     2.490     0.004     0.008     0.035      2.41%
   mmap                 300     2.275     0.004     0.008     0.023      1.86%
   fstat                300     1.017     0.002     0.003     0.016      2.86%
   close                299     0.926     0.002     0.003     0.018      3.87%
   rt_sigprocmask        96     0.209     0.002     0.002     0.007      4.23%
   mincore                6     0.018     0.002     0.003     0.004     12.36%
   write                  6     0.014     0.002     0.002     0.003      2.71%

其他资源对比:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Container .metadata size distribution:
      Bucket       Count       %
      <=   4 KiB       4    0.02
      <=  16 KiB       5    0.03
      <=  64 KiB      18    0.11
      <= 256 KiB     153    0.92
      <=   1 MiB     724    4.33
      <=   4 MiB   3,928   23.49
      <=  16 MiB  11,889   71.10
      <=  64 MiB       0    0
      >   64 MiB       0    0
      total       16,721  100.00

- Same data dir, same flags between BEFORE/AFTER except
  --log_container_metadata_inmem_replay_threshold_bytes
  (67108864 = default 64 MiB).
Results
-------
1) Wall time of the LBM open phase :
       BEFORE: 694.171 s
       AFTER:  235.797 s
2) preadv syscalls during the open phase (perf trace summary):
       BEFORE: ~30 M preadv
       AFTER:  ~8 k preadv
3) Disk IO on the data disk:
       Metric          BEFORE    AFTER
       r/s             195       451
       rkB/s         52,400  195,000
       await (ms)      1.19    108.8
       %util           19.2 %   97.4 %
   Representative raw samples:
       BEFORE:  r/s=180   rkB/s=49960   await=1.22   %util=17.20
       BEFORE:  r/s=213   rkB/s=49576   await=1.09   %util=18.90
       BEFORE:  r/s=221   rkB/s=63032   await=1.20   %util=21.10
       AFTER:   r/s=427   rkB/s=194848  await=118.60 %util=97.30
       AFTER:   r/s=426   rkB/s=194080  await=121.66 %util=98.40
       AFTER:   r/s=428   rkB/s=194644  await=112.38 %util=97.60
   Reading these: BEFORE the disk is idle (~19 % util, sub-ms
   await) -- the bottleneck is CPU syscall, not IO. AFTER the
   disk is bandwidth-bound at its ~195 MB/s cap (97 % util).
   The bottleneck has cleanly moved from "syscalls" to
   "disk bandwidth".

4) CPU of the kudu-tserver process:
       Metric             BEFORE   AFTER    delta
       avg %user          169.7    369.9    2.18x  (more useful work)
       avg %system        624.2     19.7    31.7x lower
       avg total %CPU     793.9    389.6    2.04x lower
       max total %CPU     -        550
   Representative raw samples:
       BEFORE:  %usr=170  %sys=623  %CPU=793
       BEFORE:  %usr=168  %sys=628  %CPU=796
       AFTER:   %usr=372  %sys= 20  %CPU=392
       AFTER:   %usr=518  %sys= 21  %CPU=539

结论与未来工作

结论:采用上述方案,将满足条件的 metadata file 读入内存处理,预期节省 20% 到 50% 的启动时间。

未来工作与其他优化:

  • 支持加密路径
  • 延迟启动中 log block 的创建