This post only discusses the following two ways of storing a graph:
Storing the graph as singly linked lists:

struct Edge {
    // blah blah blah...
    int next;
};
Edge e[1000000];
int head[100000];

Traversal:

for (int i = head[x]; i; i = e[i].next)
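For concreteness, a minimal sketch of how an edge is typically inserted in this style (the to field is an assumed stand-in for whatever the elided part of Edge holds):

    int cnt = 0;
    void add_edge(int u, int v) {
        e[++cnt].to = v;        // assumed payload field
        e[cnt].next = head[u];  // new edge points at u's old first edge
        head[u] = cnt;          // ...and becomes the new first edge
    }

Edge indices start from 1 here so that next == 0 terminates the traversal loop above.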
Storing the graph with vector:

struct Edge {
    // blah blah blah...
};
Edge e[1000000];
vector<int> es[100000];

Traversal:

for (int i : es[x])
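And the corresponding sketch for the vector layout (again with an assumed cnt counter and payload field):

    int cnt = 0;
    void add_edge(int u, int v) {
        e[cnt].to = v;          // assumed payload field
        es[u].push_back(cnt++); // u's adjacency list stores indices into e[]
    }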
First, take a look at a program, main.cpp:
// usage: ./a.out [n] [T]
// e.g. ./a.out 100000 1000
// compile with "-std=c++17" and "-O3"
// define exactly one of VECTOR / LIST via -D:
// #define VECTOR
// #define LIST
#include <vector>
#include <random>
#include <chrono>
#include <cstdlib>    // atoi
#include <numeric>    // std::iota
#include <iostream>
#include <algorithm>
#include <functional>
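// full memory fence plus compiler barrier: memory operations cannot be
// reordered across the timing points below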
#define fence() asm volatile("mfence" ::: "memory")
using u64 = unsigned long long;
using f128 = long double;
struct Edge {
    u64 value;  // payload: one 64-bit integer per edge
    Edge *next; // next node of the singly linked list
};
int main(int argc, char *argv[]) {
    using clock = std::chrono::high_resolution_clock;
    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " [n] [T]" << std::endl;
        return -1;
    }
    int n = atoi(argv[1]);
    int T = atoi(argv[2]);
    constexpr u64 SEED = 0xdeadbeef19260817;
    std::mt19937_64 gen(SEED); // fixed seed: both binaries get the same layout
    Edge *mem = new Edge[n];
    std::vector<int> idx;
    idx.resize(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), gen);
    Edge *list = nullptr;
    std::vector<Edge*> vec;
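    // chain the nodes in shuffled order; each new node becomes the new head of list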
    for (int j = 0; j < n; j++) {
        int i = idx[j];
        mem[i] = Edge{gen(), list};
        list = mem + i;
        vec.push_back(list);
    }
    std::reverse(vec.begin(), vec.end());
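    // after the reverse, vec[k] is the k-th node in list-traversal order,
    // so both variants visit the same nodes in the same order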
#ifdef VECTOR
    const auto name = "vector";
    auto run = [&] {
        u64 ret = 0;
        for (auto e : vec) // independent loads: the index keeps advancing
            ret += e->value;
        return ret;
    };
#endif
#ifdef LIST
    const auto name = "list";
    auto run = [&] {
        u64 ret = 0;
        for (auto e = list; e; e = e->next) // dependent loads: pointer chase
            ret += e->value;
        return ret;
    };
#endif
    u64 ans = 0;
    auto t_start = clock::now();
    fence();
    for (int t = 0; t < T; t++) {
        std::cout << "#" << t + 1 << std::endl; // progress output, identical in both builds
        ans += run();
    }
    fence();
    auto t_end = clock::now();
    auto span = std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count();
    auto aver = f128(span) / T;
    std::cout << name << ": " << ans << " " << aver << "μs" << std::endl;
    delete[] mem;
    return 0;
}
This program does the following:

- initialize a memory pool mem of Edge nodes;
- chain the Edges in mem together in a random order, producing the singly linked list list;
- put pointers to the chained Edges into vec, in the same order they appear in the list;
- repeatedly traverse list or vec, and time it.
Then compile it both ways:
$ g++ main.cpp -DVECTOR -std=c++17 -O3 -o vector.out
$ g++ main.cpp -DLIST -std=c++17 -O3 -o list.out
Run:
$ ./vector.out 100000 1000
...
vector: 9056018651256176736 126.578μs
$ ./list.out 100000 1000
...
list: 9056018651256176736 1224.01μs
The result: vector traversal is nearly 10x faster than singly-linked-list traversal :) You may well be surprised — how can vector possibly be this fast? Well, facts are facts; this editor was surprised too. How is it done? Read on and this editor will walk you through it!
First guess: vector has a better cache hit rate? Let's check with perf:
$ taskset -c 0 perf stat -ddd ./list.out 100000 1000
...
list: 9056018651256176736 1251.79μs
Performance counter stats for './list.out 100000 1000':
1,256.60 msec task-clock # 1.000 CPUs utilized
3 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
1,074 page-faults # 0.855 K/sec
4,928,477,544 cycles # 3.922 GHz (38.37%)
444,538,195 instructions # 0.09 insn per cycle (46.25%)
106,509,024 branches # 84.760 M/sec (46.49%)
260,219 branch-misses # 0.24% of all branches (46.52%)
210,985,531 L1-dcache-loads # 167.902 M/sec (46.52%)
106,515,848 L1-dcache-load-misses # 50.48% of all L1-dcache accesses (46.52%)
84,942,108 LLC-loads # 67.597 M/sec (30.59%)
390,735 LLC-load-misses # 0.46% of all LL-cache accesses (30.56%)
<not supported> L1-icache-loads
1,166,528 L1-icache-load-misses (30.56%)
203,014,349 dTLB-loads # 161.559 M/sec (30.56%)
4,116 dTLB-load-misses # 0.00% of all dTLB cache accesses (30.56%)
19,835 iTLB-loads # 0.016 M/sec (30.56%)
1,367 iTLB-load-misses # 6.89% of all iTLB cache accesses (30.56%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
1.256954906 seconds time elapsed
1.249223000 seconds user
0.003324000 seconds sys
$ taskset -c 0 perf stat -ddd ./vector.out 100000 1000
...
vector: 9056018651256176736 132.619μs
Performance counter stats for './vector.out 100000 1000':
137.31 msec task-clock # 0.997 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
1,073 page-faults # 0.008 M/sec
525,222,528 cycles # 3.825 GHz (34.28%)
527,824,716 instructions # 1.00 insn per cycle (43.02%)
103,747,402 branches # 755.561 M/sec (45.20%)
231,195 branch-misses # 0.22% of all branches (47.39%)
206,014,450 L1-dcache-loads # 1500.340 M/sec (49.57%)
110,273,989 L1-dcache-load-misses # 53.53% of all L1-dcache accesses (51.75%)
96,467,786 LLC-loads # 702.545 M/sec (34.96%)
57,315 LLC-load-misses # 0.06% of all LL-cache accesses (32.95%)
<not supported> L1-icache-loads
483,211 L1-icache-load-misses (30.77%)
206,050,910 dTLB-loads # 1500.606 M/sec (28.58%)
2,534 dTLB-load-misses # 0.00% of all dTLB cache accesses (26.39%)
6,877 iTLB-loads # 0.050 M/sec (26.22%)
45 iTLB-load-misses # 0.65% of all iTLB cache accesses (26.23%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
0.137661562 seconds time elapsed
0.130449000 seconds user
0.006701000 seconds sys
From the output:

- list misses: 50.48% of all L1-dcache accesses, 0.46% of all LL-cache accesses;
- vector misses: 53.53% of all L1-dcache accesses, 0.06% of all LL-cache accesses.

The cache hit rates are neck and neck — list's L1 hit rate is even slightly higher.
So what is the real reason? Look at the figure next to L1-dcache-loads: list manages 167.902 M/sec, while vector reaches a whopping 1500.340 M/sec. That number is the L1 load throughput. In other words, the hit rates are about equal, but vector pushes roughly nine times as many loads through per second as list — which lines up with the ~10x gap in running time.
Let's look at the assembly for the two loops: https://gcc.godbolt.org/z/3oPznofe9

list compiles to:
add rax, QWORD PTR [rdi]   # load1: accumulate e->value
mov rdi, QWORD PTR [rdi+8] # load2: e = e->next
test rdi, rdi
jne .L9
vector compiles to:
mov rdx, QWORD PTR [rax] # load1: fetch the Edge* from vec
add rax, 8               # advance the index
add r8, QWORD PTR [rdx]  # load2: accumulate e->value
cmp rax, rcx
jne .L3
Nothing wrong with either: both do two loads per iteration. But for list, the dependencies among those loads look like this:
load2 -+-> load2 -+-> load2 -+-> ... -+-> load2 -> load1
       |          |          |        |
       +-> load1  +-> load1  +-> ...  +-> load1
The next edge can only be obtained from the previous edge's next pointer, so until the previous Edge has finished loading (load2), the CPU has essentially no way of knowing where the next Edge lives. For vector, the picture is:
add rax -+-> add rax -+-> add rax -+-> ... -+-> add rax -> load1 -> load2
         |            |            |        |
         |            |            |        +-> load1 -> load2
         |            |            +-> load1 -> load2
         |            +-> load1 -> load2
         +-> load1 -> load2
As long as the traversal index keeps incrementing (add rax), the CPU can keep issuing fresh loads. Modern CPUs mostly execute instructions out of order, several at a time, and Intel CPUs have L1d Fill Buffers that allow many cache misses to be outstanding simultaneously, so the memory bandwidth actually gets used. perf can confirm this:
$ taskset -c 0 perf stat -M ILP,MLP ./vector.out 100000 1000
...
vector: 9056018651256176736 135.844μs
Performance counter stats for './vector.out 100000 1000':
297,166,586 uops_executed.core_cycles_ge_1 # 3.62 ILP
537,316,216 uops_executed.thread
510,676,250 l1d_pend_miss.pending_cycles # 8.40 MLP
4,291,380,208 l1d_pend_miss.pending
...
$ taskset -c 0 perf stat -M ILP,MLP ./list.out 100000 1000
...
list: 9056018651256176736 1261.45μs
Performance counter stats for './list.out 100000 1000':
234,580,521 uops_executed.core_cycles_ge_1 # 3.76 ILP
441,502,995 uops_executed.thread
3,284,071,492 l1d_pend_miss.pending_cycles # 1.00 MLP
3,292,907,879 l1d_pend_miss.pending
...
ILP is instruction-level parallelism; MLP is memory-level parallelism. As you can see, vector's MLP of 8.40 leaves list's 1.00 in the dust.
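These metrics are just ratios of the raw counters: MLP = l1d_pend_miss.pending / l1d_pend_miss.pending_cycles, the average number of L1d misses in flight over cycles with at least one miss pending. For vector that is 4,291,380,208 / 510,676,250 ≈ 8.40; for list it is 3,292,907,879 / 3,284,071,492 ≈ 1.00 — the pointer chase never has more than one miss in flight.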
You can also inspect this with Intel VTune Profiler; as I recall, vector shows FB Full at 100% while list shows 0% — that is, vector packs the Fill Buffers to bursting. I'll attach the screenshots when I get around to it.
So what is this good for?

Mainly network-flow algorithms:
struct Edge {
int u, v, c, w, rev;
};
Edge e[1000000];
vector<int> es[100000];
Network-flow algorithms traverse the same graph over and over. If the graph is very large, or very dense, or needs many augmenting passes, traversal can end up dominating the running time — and that is when storing the graph in vector beats the singly linked list.
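For concreteness, a minimal sketch of how edges might be added under this layout — assuming c is capacity, w is cost, and rev is the index of the paired reverse edge (those field meanings are my reading; they are not spelled out above):

    int cnt = 0;
    void add_edge(int u, int v, int cap, int cost) {
        e[cnt] = Edge{u, v, cap, cost, cnt + 1}; // forward edge
        es[u].push_back(cnt++);
        e[cnt] = Edge{v, u, 0, -cost, cnt - 1};  // residual edge, zero capacity
        es[v].push_back(cnt++);
    }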
An example: on one test case of 【ULR #1】光伏元件, 1490ms -> 961ms. Of course, the gain will never be as dramatic as in the benchmark above; it depends on how long each vertex's adjacency list is.
Wow, so should we just use vector everywhere?

Not quite. As everyone knows, vector's push_back is on the slow side. On a few problems that were already fast to begin with, switching to vector actually turned out to be a pessimization :)
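If push_back is the bottleneck, one common workaround (my sketch, not from the original post) is a two-pass, CSR-style build: count degrees first, then lay every adjacency list out in one flat array, so traversal stays contiguous with no push_back at all:

    // hypothetical input: m edges given as endpoint arrays eu[], ev[]
    int eu[1000000], ev[1000000];
    int deg[100000], start[100001], pos[100000], adj[1000000];
    void build_csr(int n, int m) {
        for (int i = 0; i < m; i++) deg[eu[i]]++;                     // pass 1: out-degrees
        for (int u = 0; u < n; u++) start[u + 1] = start[u] + deg[u]; // prefix sums
        for (int u = 0; u < n; u++) pos[u] = start[u];
        for (int i = 0; i < m; i++) adj[pos[eu[i]]++] = i;            // pass 2: place edge ids
    }
    // traverse u: for (int k = start[u]; k < start[u + 1]; k++) visit(adj[k]);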
Miscellaneous

- https://gcc.godbolt.org/z/3oPznofe9: add #pragma GCC optimize("unroll-loops") as the first line, and GCC will unroll the vector loop for you.
- As everyone knows, #pragma GCC optimize("O3") only takes effect in the current translation unit. If the judge does not compile with O2, you are still better off avoiding vector. (A sketch of the pragma setup follows below.)
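For reference, a minimal sketch of how those pragmas would sit at the top of a submission (whether a given judge allows them is another matter):

    #pragma GCC optimize("O3")
    #pragma GCC optimize("unroll-loops")
    #include <vector>
    // functions defined below in this translation unit are compiled
    // with these options; other translation units are unaffected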
And that is the whole story of storing graphs in vector. What do you think? Tell this editor in the comments and let's discuss!