opendal: binding/python: rust std fs is slower than python fs
Write-up: https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/
Reproduce: https://github.com/Xuanwo/when-i-find-rust-is-slow
:) hyperfine "python test_fs.py" "./opendal-test/target/release/opendal-test"
Benchmark 1: python test_fs.py
Time (mean ± σ): 28.8 ms ± 1.9 ms [User: 12.8 ms, System: 15.8 ms]
Range (min … max): 26.2 ms … 36.3 ms 101 runs
Benchmark 2: ./opendal-test/target/release/opendal-test
Time (mean ± σ): 28.0 ms ± 1.3 ms [User: 0.3 ms, System: 27.7 ms]
Range (min … max): 26.1 ms … 37.1 ms 95 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
./opendal-test/target/release/opendal-test ran
1.03 ± 0.08 times faster than python test_fs.py
hyperfine "python test_fs.py" "./opendal-test/target/release/opendal-test" 1.60s user 4.31s system 99% cpu 5.931 total
Python Code
import pathlib
import timeit

root = pathlib.Path(__file__).parent
filename = "file"

def read_file_with_normal() -> bytes:
    with open(root / filename, "rb") as fp:
        result = fp.read()
    return result

if __name__ == "__main__":
    read_file_with_normal()
Rust Code:
use std::io::Read;
use std::fs::OpenOptions;

fn main() {
    // let bs = op.read("file").unwrap();
    let mut bs = vec![0; 64 * 1024 * 1024];
    let mut f = OpenOptions::new().read(true).open("/tmp/test/file").unwrap();
    f.read_exact(&mut bs).unwrap();
    // let bs = std::fs::read("/tmp/test/file").unwrap();
    // assert_eq!(bs.len(), 64 * 1024 * 1024);
}
Why does Rust spend more time in syscalls?
Discussion:
https://discord.com/channels/1081052318650339399/1174840499576770560
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 20 (12 by maintainers)
A friend shared a link with me: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
Also in glibc upstream: https://sourceware.org/bugzilla/show_bug.cgi?id=30994
It looks like some AMD CPUs don't handle rep movsb (a.k.a. FSRM) correctly. I'm able to reproduce this easily using the C reproducer above; a quick test using perf stat:
With the 0x20 offset:
Without the offset:
You can see that the interesting change is in L1-dcache-prefetches and L1-dcache-loads. My CPU is an AMD Ryzen 9 5900HX.
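The C reproducer itself isn't quoted in this thread, so here is a rough sketch of the same experiment under my own assumptions (a 64 MiB file at /tmp/test/file, an OFFSET macro toggling the 0x20 shift); it is not the original code:

/* Hypothetical sketch, not the original reproducer: read a 64 MiB file
 * into a page-aligned buffer, optionally shifted by 0x20 bytes.
 * Build: cc -O2 repro.c                  (page-aligned destination)
 *        cc -O2 -DOFFSET=0x20 repro.c    (shifted destination)
 * Run:   perf stat -e L1-dcache-loads,L1-dcache-prefetches ./a.out */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef OFFSET
#define OFFSET 0
#endif

#define FILE_SIZE (64 * 1024 * 1024)

int main(void) {
    /* One extra page so the buffer can also hold the shifted read. */
    char *buf = aligned_alloc(4096, FILE_SIZE + 4096);
    int fd = open("/tmp/test/file", O_RDONLY);
    if (!buf || fd < 0) return 1;

    ssize_t n = read(fd, buf + OFFSET, FILE_SIZE);
    printf("read %zd bytes at page offset %#x\n", n, OFFSET);

    close(fd);
    free(buf);
    return 0;
}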
Hey all, I’ve previously found this issue at work and just saw the HN thread. It occurs because the microcode doesn’t know the physical memory addresses. To then still handle overlapping virtual memory correctly (I mean when 2 virtual pages point to the same physical page, for the lack of better terms) it only compares bits 11:0 to see whether the addresses overlap. So, if the source and destination are 1 page or more apart but have a small misalignment, this then causes it to guess incorrectly and use a slower copy loop, similar to when you memcpy() something only by 1 byte around.
I’ve tested it like this (Windows specific code): https://gist.github.com/lhecker/d11deb1974c6a814576ebcff5d62e935 And these are my latest results on an 7800X3D: https://gist.github.com/lhecker/e46d895013925a5b61a4dc7ccc38dd38 (I’ve had to increase the
max_sizedue to the large L3 cache.)It seems that
rep movsbperformance poorly when DATA IS PAGE ALIGNED, and perform better when DATA IS NOT PAGE ALIGNED, this is very funny…In my case, I tried adjusting the offset value, when offset&0xFFF is between 0x10 - 0x990, it performance well, otherwise, when (dst - src)&0xFFF is small (zero included!), the performance is bad.
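The linked gists are Windows-specific; a rough portable sketch of the same kind of offset sweep (my own code with arbitrary sizes and iteration counts, not taken from the gists) could look like this:

/* Hypothetical sweep: time memcpy() between two page-aligned 1 MiB
 * regions while varying (dst - src) & 0xFFF, to see which page offsets
 * fall into the slow path. Whether memcpy() actually uses rep movsb at
 * this size depends on the glibc version and its tuning. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (1u * 1024 * 1024)

static double copy_seconds(char *dst, const char *src, int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    char *src = aligned_alloc(4096, SIZE);
    char *dst = aligned_alloc(4096, SIZE + 4096); /* room to shift dst */
    if (!src || !dst) return 1;
    memset(src, 1, SIZE);

    for (unsigned off = 0; off < 0x1000; off += 0x10) {
        double s = copy_seconds(dst + off, src, 100);
        printf("offset %#05x: %6.2f GiB/s\n", off, 100.0 * SIZE / s / (1 << 30));
    }
    return 0;
}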
And it seems it is not related to prefetching: I disabled the L1 prefetcher using wrmsr and got the following results.
With the offset
Without the offset
L1-dcache-prefetches is zero now, but L1-dcache-loads still goes much higher when there is no offset, and performance drops by a lot.
After a 24h test, I have some conclusions below.
First, this issue could not be reproduced on my machine. I have tested with RAM-backed files from 64M to 40G. Some of the results:
I used eBPF + kprobe to measure the precise read syscall time for each case. Here's the result:
Here’s the script
Unfortunately, we cannot run this script with zen-kernel 6.6.2 from upstream (both @Xuanwo and I confirmed this).
As a next step, I will try to measure the precise time using ptrace or userspace eBPF, which does not depend on kernel symbols.
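Until then, a simpler (if coarser) userspace check is to wrap the read() itself in clock_gettime(). This is just a sketch that mirrors the Rust example's file path and buffer size, not the eBPF script mentioned above:

/* Userspace fallback: time a single 64 MiB read() with clock_gettime(),
 * no kernel symbols or eBPF required. Also prints the buffer's in-page
 * offset, since that is what the rep movsb behaviour depends on. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const size_t size = 64 * 1024 * 1024;
    char *buf = malloc(size);
    int fd = open("/tmp/test/file", O_RDONLY);
    if (!buf || fd < 0) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = read(fd, buf, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("read %zd bytes in %.3f ms, buffer page offset %#lx\n",
           n, ms, (unsigned long)buf & 0xFFFUL);

    close(fd);
    free(buf);
    return 0;
}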