Description
Hi!
Recently I have been evaluating optimizations such as PGO and post-link optimization (PLO, mostly with LLVM BOLT) on multiple projects. The results are available here. According to those tests, these optimizations can deliver better performance in many cases, including databases, so I think trying them on this project is an interesting way to squeeze out more performance.
I already did some (very basic!) benchmarks and want to share my results here.
Test environment
- Fedora 39
- Linux kernel 6.8.7
- AMD Ryzen 9 5900x
- 48 GiB RAM
- SSD Samsung 980 Pro 2 TiB
- Compiler - Rustc 1.77.2
- bbolt-rs version: the latest from the main branch, at commit 980b96b81768c4c8c78034a99aa7004dc4672674
- Disabled Turbo boost
Benchmark
For benchmark purposes, I used these benchmarks. The PGO training workload was a run of the bench command. The release and PGO-optimized results below were generated with bench -c 10000000.
All PGO and PLO optimizations are done with cargo-pgo. All tests are done on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course), and the results are consistent across runs. taskset -c 0 is used to reduce interference from the OS scheduler.
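For readers unfamiliar with cargo-pgo, the workflow roughly looks like the sketch below. The command names come from cargo-pgo itself; the binary paths and the bench invocation are illustrative (the exact target triple and binary name depend on your toolchain and the project's bench binary):

```shell
# Install the helper once.
cargo install cargo-pgo

# 1. Build with PGO instrumentation, then run the training workload
#    (here: the same bench command used in the tests above).
cargo pgo build
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/bench -c 10000000

# 2. Rebuild using the collected profiles.
cargo pgo optimize

# 3. Optionally layer BOLT on top of PGO: instrument, train, optimize.
cargo pgo bolt build --with-pgo
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/bench-bolt-instrumented -c 10000000
cargo pgo bolt optimize --with-pgo
```

The instrumented builds are only for profile collection; the numbers reported below for them are just for reference, not something you would ship.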
Results
Let's start with the results.
Release:
taskset -c 0 bench -c 10000000
# Write 8.914817311s (891ns/op) (1122334 op/sec)
# Read 1.11635498s (16ns/op) (62499999 op/sec)
Release + PGO optimization:
taskset -c 0 bench -c 10000000
# Write 6.502718524s (650ns/op) (1538461 op/sec)
# Read 1.06592023s (13ns/op) (76923076 op/sec)
Release + PGO optimization + BOLT optimization:
taskset -c 0 bench -c 10000000
# Write 6.50363167s (650ns/op) (1538461 op/sec)
# Read 1.0854954s (14ns/op) (71428571 op/sec)
(just for reference) Release + PGO instrumentation:
taskset -c 0 bench -c 10000000
# Write 14.626009666s (1.463µs/op) (683526 op/sec)
# Read 1.131095569s (28ns/op) (35714285 op/sec)
(just for reference again) Release + PGO optimized + BOLT instrumented:
taskset -c 0 bench -c 10000000
# Write 8.641237686s (864ns/op) (1157407 op/sec)
# Read 1.102478063s (12ns/op) (83333333 op/sec)
According to the tests above, I see measurable performance improvements from enabling PGO. However, enabling PLO with LLVM BOLT on top of PGO didn't show measurable improvements, at least in this simple test.
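To make the PGO gains concrete, the relative throughput change can be computed directly from the op/sec figures reported above (the helper function below is just illustrative arithmetic, not part of the benchmark suite):

```python
def improvement(base_ops: int, new_ops: int) -> float:
    """Percent change in throughput (op/sec) from base to new."""
    return (new_ops - base_ops) / base_ops * 100

# Release -> Release + PGO, using the numbers from the runs above.
write_gain = improvement(1_122_334, 1_538_461)   # write path
read_gain = improvement(62_499_999, 76_923_076)  # read path
print(f"write: +{write_gain:.1f}%, read: +{read_gain:.1f}%")
# write: +37.1%, read: +23.1%
```

So PGO alone improved write throughput by roughly 37% and read throughput by roughly 23% in this test.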
For anyone interested in binary sizes, I collected some statistics too (without stripping debug symbols):
- Release: 1.3 MiB
- Release + PGO instrumentation: 3.4 MiB
- Release + PGO optimized: 1.3 MiB
- Release + PGO optimized + BOLT instrumented: 14 MiB
- Release + PGO optimized + BOLT optimized: 4.3 MiB
The only interesting case here is the last one, "Release + PGO optimized + BOLT optimized". I don't know why the binary size increased so much; I suspect some BOLT option could fix the situation, but that's just a guess for now.
Further steps
I can suggest the following action points:
- Perform more PGO and PLO benchmarks on the database in more scenarios. If they show improvements, add a note to the documentation about the possible performance gains from building the project with PGO.
- Provide an easier way (e.g. a build option) to build bbolt-rs with PGO. This would be helpful for end users and maintainers, since they would be able to optimize bbolt-rs for their own workloads.
Here are some examples of how PGO optimization is integrated into other projects:
- Rustc: a CI script for the multi-stage build
- GCC:
- Clang: Docs
- Python:
- Go: Bash script
- V8: Bazel flag
- ChakraCore: Scripts
- Chromium: Script
- Firefox: Docs
- Thunderbird has PGO support too
- PHP - Makefile command and old Centminmod scripts
- MySQL: CMake script
- YugabyteDB: GitHub commit
- FoundationDB: Script
- Zstd: Makefile
- Foot: Scripts
- Windows Terminal: GitHub PR
- Pydantic-core: GitHub PR
- file.d: GitHub PR
- OceanBase: CMake flag
I would be happy to answer any questions about the optimizations above. Please do not treat this issue as a bug report; it's just an idea for how the project's performance could be improved.