dianqk's blog


Finding LLVM bugs in rust using good-bad comparisons

Posted on

Abstract

Finding problems in large projects is always complicated, and LLVM bugs that surface through rust are one such case. In this article, I describe how I solved the rust unit test failures under stage2. I walk through the issues Failing tests when rustc is compiled with 1 CGU and Implementing niche checks to document my process, which I hope will help solve this type of issue in the future.

Before We Start

I have prepared a corresponding project for this article, see:

The PR to fix these two issues is:

How to reproduce the first problem

First, we need to switch to a revision where the problem reproduces:

git clone https://github.com/rust-lang/rust.git
cd rust
git checkout 085acd02d4abaf2ccaf629134caa83cfe23283c8

Then we need to change config.toml:

profile = "dist"

[build]
profiler = true

[rust]
codegen-units = 1
optimize = 2

We also need to know how to use rustc-perf. We can then build a PGO-optimized rustc with the following script:

./x build --rust-profile-generate=/tmp/profiles --stage 2 library
cargo run --bin collector bench_local --include serde,syn <path to stage2/bin/rustc>
./build/ci-llvm/bin/llvm-profdata merge -o profiles.profdata /tmp/profiles
./x build --rust-profile-use=profiles.profdata --stage 2 library

The problem can be reproduced by compiling the following code with this version of rustc:

#![feature(inline_const)]

fn main() {
    const {
        assert!(-9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64);
    }
}

The reproduced error log is as follows:

error[E0080]: evaluation of `main::{constant#0}` failed
 --> ./test.rs:5:9
  |
5 |         assert!(-9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64);
  |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: -9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64', ./test.rs:5:9
  |
  = note: this error originates in the macro `assert` (in Nightly builds, run with -Z macro-backtrace for more info)

note: erroneous constant encountered
 --> ./test.rs:4:5
  |
4 | /     const {
5 | |         assert!(-9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64);
6 | |     }
  | |_____^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0080`.

Do some preliminary analysis

Use the error stack to determine a problematic crate

We can use -Z treat-err-as-bug to get a backtrace at the point of the error. The incorrectly compiled function is likely to be somewhere on that stack.

 ./build/host/stage2/bin/rustc ./test.rs -Z treat-err-as-bug
error[E0080]: evaluation of `main::{constant#0}` failed
 --> ./test.rs:5:9
  |
5 |         assert!(-9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64);
  |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: -9.223372036854776e18f64 as i64 == 0x8000000000000000u64 as i64', ./test.rs:5:9
  |
  = note: this error originates in the macro `assert` (in Nightly builds, run with -Z macro-backtrace for more info)

thread 'rustc' panicked at compiler/rustc_errors/src/lib.rs:1724:30:
aborting due to `-Z treat-err-as-bug=1`
stack backtrace:
  ...
  23:     0x7f32189f3bd7 - rustc_const_eval[7551efff2730a760]::const_eval::eval_queries::eval_to_const_value_raw_provider
  ...

Judging from the stack, I think rustc_const_eval is the crate to focus on. A simple experiment can confirm this guess. Change Cargo.toml to the following:

[profile.release.package.rustc_const_eval]
codegen-units = 16

With this change, the problem no longer reproduces. I think the stage1 rustc mis-compiles rustc_const_eval.

Simplify test.rs

After some tweaking of the reproduction case, I found that the following code also reproduces the problem:

#![crate_type = "lib"]

const _: u32 = -1.1f32 as i32 as u32 - 1 as u32;

We can also see that the bug behaves as if casting any negative floating-point number to a signed integer yields 0.
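For reference, the values a correctly compiled rustc should produce for these casts can be checked at runtime. The sketch below relies only on Rust's documented `as` semantics (float-to-int casts truncate toward zero and saturate). The names `reduced_case` and `buggy_cast` are hypothetical helpers for illustration; `buggy_cast` models only the observed symptom, not the actual faulty code in rustc.

```rust
/// The expression from the reduced test case, evaluated at runtime.
/// `checked_sub` makes an overflow visible as `None` instead of panicking.
fn reduced_case(f: f32) -> Option<u32> {
    (f as i32 as u32).checked_sub(1)
}

/// Hypothetical model of the symptom: every negative float casts to 0.
fn buggy_cast(f: f32) -> i32 {
    if f < 0.0 { 0 } else { f as i32 }
}

fn main() {
    // Float-to-int `as` casts truncate toward zero.
    assert_eq!(-1.1f32 as i32, -1);
    // -1i32 reinterpreted as u32 is u32::MAX, so subtracting 1 does not overflow.
    assert_eq!(reduced_case(-1.1), Some(u32::MAX - 1));
    // With the symptom, the cast yields 0 and `0 - 1` overflows,
    // which is an error inside a constant expression.
    assert_eq!((buggy_cast(-1.1) as u32).checked_sub(1), None);
    // The original test case: both sides are i64::MIN.
    assert_eq!(-9.223372036854776e18f64 as i64, i64::MIN);
    assert_eq!(0x8000000000000000u64 as i64, i64::MIN);
    println!("all cast checks passed");
}
```

This is why the reduced constant is a convenient detector: a wrong cast result turns into an overflow error at compile time.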

Roughly identify the problem function

By reading the rustc_const_eval code and analyzing the call stack, I guessed that the issue is in float_to_float_or_int or the nearby cast_from_float.

To verify this, we can use #[inline(never)] to prevent some of the optimizations. I found that the issue persists when #[inline(never)] is added to float_to_float_or_int, while adding it to cast_from_float makes the test code compile correctly. So I guess the issue is in float_to_float_or_int after cast_from_float has been inlined into it.

Is it really about PGO?

In fact, we can reproduce the issue directly using the stage2 version of rustc used to generate PGO:

./x build --rust-profile-generate=/tmp/profiles --stage 2 library

Use git bisect to find which commit caused the problem

Although the commit that bisect finds may not be the faulty one, it gives us a concrete switch that turns the mis-compilation on and off. We can further pinpoint the problem by tweaking this commit.

Pick a good commit

To run git bisect, we first need a good commit. If I can't find a good commit between major releases, I give up on git bisect, because a commit that is too far back in time might not be meaningful, and the more commits we skip over, the more likely we are to hit other, unrelated issues. Here's a simple approach: LLVM creates a tag named llvmorg-{version}-init, which bumps the major version, at the same time as it creates a new release branch. These tags lie on a linear history and are very bisect-friendly. I would treat llvmorg-18-init and llvmorg-17.0.1 as consistent code.

Here we have chosen the comparison version:

Preparing the LLVM build configuration

Since each step of the bisect takes a long time, I first recommend using a higher-performance computer to shorten this time.

Then change config.toml to reduce the time spent on repeated builds. My changes are as follows:

[build]
submodules = false

[llvm]
download-ci-llvm = false
assertions = false
ccache = "sccache"
targets = "X86"
experimental-targets = ""

[target.x86_64-unknown-linux-gnu]
# The Rust-specific patch commits are no longer present while bisecting upstream LLVM.
llvm-has-rust-patches = false

[rust]
codegen-units = 256

Also, change Cargo.toml:

[profile.release.package.rustc_const_eval]
codegen-units = 1

Reducing functions processed by PGO

This reduces build time and further narrows down the problem.

We change the filtering rules in PGOInstrumentation.cpp#L1761, for example:

static bool skipPGO(const Function &F) {
  // ...
  if (!F.getName().contains("rustc_const_eval"))
    return true;
  if (!F.getName().contains("float_to_float_or_int"))
    return true;
  if (!F.getName().contains("cast_from_float"))
    return true;
  // ...
}

Execute git bisect

The bisect process is a bit different from that of a standard LLVM project. Instead of using git bisect skip, we need to manually adapt to API changes whenever PassWrapper.cpp fails to compile. Since PassWrapper.cpp has to be modified so often, we can't automate the process with git bisect run; we have to do it manually and check the results each time. With any luck, it takes fewer than 12 runs to get the result.
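As a rough sanity check on that number, a binary search over n candidate commits needs about ceil(log2(n)) test runs. The sketch below computes that bound; `bisect_steps` is a hypothetical helper, and git bisect's real step count can differ slightly because history is not always linear.

```rust
/// Upper bound on the number of test runs a binary search needs to isolate
/// one commit among `candidates` commits: ceil(log2(candidates)).
fn bisect_steps(candidates: u64) -> u32 {
    if candidates <= 1 {
        0
    } else {
        // For n >= 2, ceil(log2(n)) equals the bit length of n - 1.
        u64::BITS - (candidates - 1).leading_zeros()
    }
}

fn main() {
    for n in [100u64, 1000, 4096, 4097] {
        println!("{n} commits -> at most {} runs", bisect_steps(n));
    }
}
```

So 12 runs are enough for a range of up to 4096 commits, which is why each manual step is tolerable even when a single build is slow.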

After running it for a while, I got a bisect result of 361464c0. Because LLVM is so complex, I generally can't tell what the problem is directly from such a commit, and this commit is not necessarily the faulty one.

I generally categorize bisect results into several categories:

At this point, however, we can't categorize the result of this one. But we can use this result to continue debugging the problem.

Finding the transformation of interest by changing the LLVM source code

At this point, we don't know where the mis-compilation is, and we can't debug it by looking at the IR directly. We still have to compile and run rustc to locate the problem until we reach a clear conclusion.

Based on the bisect result, we need to find which function ends up mis-compiled after processImmutArgument.

Code like the following helps us pinpoint, step by step, which function is affected.

bool MemCpyOptPass::processImmutArgument(CallBase &CB, unsigned ArgNo) {
  // ...
  auto FnName = CB.getFunction()->getName();
  if (FnName.contains("rustc_const_eval") &&
      FnName.contains("CompileTimeInterpreter") &&
      FnName.contains("float_to_float_or_int")) {
    // Log and skip the transformation for the suspected function only.
    errs() << "LLVMLOG: Skip " << FnName << "\n";
    return false;
  }
  // ...
}
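The substring filter works because mangled Rust symbols embed the path components as plain text. The sketch below mirrors the same three-way check on ordinary strings. `is_key_function` is a hypothetical helper; the two symbols are copied from the -opt-bisect-limit logs shown later in this article.

```rust
/// Same filter as the C++ snippet above, expressed on a plain string:
/// the symbol must contain every one of the three fragments.
fn is_key_function(symbol: &str) -> bool {
    ["rustc_const_eval", "CompileTimeInterpreter", "float_to_float_or_int"]
        .iter()
        .all(|needle| symbol.contains(needle))
}

// Mangled symbol of the suspected function (from the bisect log).
const SUSPECT: &str = "_RNvMNtNtCsiODAygBxQYA_16rustc_const_eval9interpret4castINtNtB4_12eval_context8InterpCxNtNtNtB6_10const_eval7machine22CompileTimeInterpreterE21float_to_float_or_intB6_";
// Another symbol from the same crate that should not match.
const OTHER: &str = "_RNvXsg_NtNtCsjGTw0T6X7N4_16rustc_const_eval9interpret7operandNtB5_9ImmediateNtNtCs6sCMhXnFQZh_4core3fmt5Debug3fmtB9_";

fn main() {
    println!("suspect matches: {}", is_key_function(SUSPECT));
    println!("other matches:   {}", is_key_function(OTHER));
}
```

Requiring all fragments keeps the log short; relaxing the filter to fewer fragments widens the set of functions whose transformations are skipped.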

Looking through the log, I found that the function affected by MemCpyOptPass is:

<rustc_const_eval::interpret::eval_context::InterpCx<rustc_const_eval::const_eval::machine::CompileTimeInterpreter>>::float_to_float_or_int

From the log I found that this function completes multiple memcpy transformations.

So we can continue filtering to find which transformation is causing the problem.

auto FnName = CB.getFunction()->getName();
bool IsKeyFunction = FnName.contains("rustc_const_eval") &&
                     FnName.contains("CompileTimeInterpreter") &&
                     FnName.contains("float_to_float_or_int");
if (!IsKeyFunction)
  return false;
static int Count = 0;
Count += 1;
if (Count != 3) {
  errs() << "LLVMLOG: Skip " << Count << "\n";
  return false;
}
errs() << "LLVMLOG: Use " << Count << "\n";
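The counting trick above generalizes: let exactly one candidate transformation through and skip the rest, then vary which one. A minimal sketch of that gate follows. `NthOnly` is a hypothetical name, and the real code uses a function-local static counter inside LLVM instead of a struct.

```rust
/// Lets only the `target`-th candidate through (1-based) and skips all others.
struct NthOnly {
    seen: u32,
    target: u32,
}

impl NthOnly {
    fn new(target: u32) -> Self {
        NthOnly { seen: 0, target }
    }

    /// Call once per candidate transformation, in the order they occur.
    fn should_apply(&mut self) -> bool {
        self.seen += 1;
        self.seen == self.target
    }
}

fn main() {
    let mut gate = NthOnly::new(3);
    let decisions: Vec<bool> = (0..5).map(|_| gate.should_apply()).collect();
    // Only the third candidate is applied.
    println!("{decisions:?}");
}
```

If the bug reproduces with only candidate 3 applied and disappears with any other single candidate, that candidate is the transformation of interest.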

When I got to this point, I began to suspect that the mis-compilation had nothing to do with PGO. I tried disabling PGO, and the problem still reproduced. My guess is that PGO was just a coincidence that exposed the mis-compilation at runtime. But we don't yet know what role MemCpyOptPass plays: it could be another coincidence, the actual mis-compilation, or a change that misleads a subsequent pass.

Using -opt-bisect-limit

We can use -opt-bisect-limit to locate which pass changes instructions that cause problems at runtime.

There are two types of passes found using -opt-bisect-limit:

Execute -opt-bisect-limit for a specific crate.

Trivia: The actual debugging was done by changing OptBisect.cpp. However, while writing this article, I found a simpler and more efficient solution.

If we set -Cllvm-args=-opt-bisect-limit=-1 directly via RUSTFLAGS_NOT_BOOTSTRAP, we get a lot of irrelevant logs. We want to apply it only to rustc_const_eval.

The nightly version of cargo supports this. We need to switch to nightly cargo first by making the following change to config.toml:

[build]
cargo = "<path to home>/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/cargo"

Then update Cargo.toml.

cargo-features = ["profile-rustflags"]
# ...
[profile.release.package.rustc_const_eval]
rustflags = [
  "-C", "llvm-args=-opt-bisect-limit=-1",
]

Next, we can build using --keep-stage:

./x build --stage 2 library --keep-stage 0 --keep-stage-std 1 2> build.log
./build/host/stage2/bin/rustc ./test.rs

My final bisect result was:

BISECT: running pass (560953) MemCpyOptPass on _RNvMNtNtCsiODAygBxQYA_16rustc_const_eval9interpret4castINtNtB4_12eval_context8InterpCxNtNtNtB6_10const_eval7machine22CompileTimeInterpreterE21float_to_float_or_intB6_
BISECT: running pass (560954) DSEPass on _RNvMNtNtCsiODAygBxQYA_16rustc_const_eval9interpret4castINtNtB4_12eval_context8InterpCxNtNtNtB6_10const_eval7machine22CompileTimeInterpreterE21float_to_float_or_intB6_
BISECT: NOT running pass (560955) MoveAutoInitPass on _RNvMNtNtCsiODAygBxQYA_16rustc_const_eval9interpret4castINtNtB4_12eval_context8InterpCxNtNtNtB6_10const_eval7machine22CompileTimeInterpreterE21float_to_float_or_intB6_

This looks weird to me. If I set the limit to 560953, the last pass that runs is not MemCpyOptPass, so the result is a bit untrustworthy.

BISECT: running pass (560953) BDCEPass on _RNvXsg_NtNtCsjGTw0T6X7N4_16rustc_const_eval9interpret7operandNtB5_9ImmediateNtNtCs6sCMhXnFQZh_4core3fmt5Debug3fmtB9_
BISECT: NOT running pass (560954) InstCombinePass on _RNvXsg_NtNtCsjGTw0T6X7N4_16rustc_const_eval9interpret7operandNtB5_9ImmediateNtNtCs6sCMhXnFQZh_4core3fmt5Debug3fmtB9_

However, we can verify that the problem is related to DSEPass in a similar way, by changing the code to skip DSEPass. Next, as with debugging MemCpyOptPass, we can find out exactly which instruction DSEPass transformation is causing the final mis-compilation.

Update: For more stable results with -opt-bisect-limit, we can try -Z no-parallel-llvm. Also, rustc tries to perform ThinLTO by default, which can be turned off with -C lto=false.
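Since build.log can contain hundreds of thousands of BISECT lines, it helps to extract only the ones for a given symbol. The sketch below parses the line format seen in the logs above. `BisectLine`, `parse_bisect_line` and `last_pass_run` are hypothetical helpers, the symbol names in `main` are made up, and the format is assumed to match this LLVM version only.

```rust
#[derive(Debug, PartialEq)]
struct BisectLine<'a> {
    ran: bool,
    index: u64,
    pass: &'a str,
    target: &'a str,
}

/// Parses "BISECT: [NOT ]running pass (N) PassName on Target".
fn parse_bisect_line(line: &str) -> Option<BisectLine<'_>> {
    let rest = line.strip_prefix("BISECT: ")?;
    let (ran, rest) = match rest.strip_prefix("NOT running pass (") {
        Some(r) => (false, r),
        None => (true, rest.strip_prefix("running pass (")?),
    };
    let (index, rest) = rest.split_once(") ")?;
    let (pass, target) = rest.split_once(" on ")?;
    Some(BisectLine { ran, index: index.parse().ok()?, pass, target })
}

/// Returns the last pass that actually ran on a target containing `symbol`.
fn last_pass_run<'a>(log: &'a str, symbol: &str) -> Option<BisectLine<'a>> {
    log.lines()
        .filter_map(parse_bisect_line)
        .filter(|l| l.ran && l.target.contains(symbol))
        .last()
}

fn main() {
    // Shortened, made-up symbols standing in for the real mangled names.
    let log = "BISECT: running pass (1) MemCpyOptPass on sym_a\n\
               BISECT: running pass (2) DSEPass on sym_a\n\
               BISECT: NOT running pass (3) MoveAutoInitPass on sym_a\n";
    println!("{:?}", last_pass_run(log, "sym_a"));
}
```

Comparing the last pass that ran under the smallest bad limit with the one under the largest good limit is a quick way to notice unstable pass numbering between builds.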

Getting IR to prepare for LLVM debugging

This part is related to LLVM debugging, and I don't have any good experience to share yet.

But the two important points are:

With these two transformations, we can use opt to debug IR. We don't need to use rustc to compile as often!

The recommended way to get the IR is by changing Cargo.toml:

[profile.release.package.rustc_const_eval]
codegen-units = 1
rustflags = [
  "-C", "save-temps",
]

Then find *.no-opt.bc to debug.

Of course, during this debugging process, we need to understand the logic of the two transformations in MemCpyOptPass and DSEPass in detail. Doing so shows that this is a transformation driven by alias analysis, and ultimately pinpoints the cause: MemCpyOptPass doesn't update the corresponding alias information when it replaces the value used by the instruction.

Next question - mir niche checks

Although this problem is easier to reproduce than the previous one, it will be more troublesome to solve. First, let's switch to cf8d85e4.

This can be reproduced by running a test with --stage 2:

./x test tests/ui --stage 2

config.toml reference:

profile = "codegen"

[llvm]
assertions = false
enable-warnings = false
download-ci-llvm = false
ccache = "sccache"
targets = "X86"
experimental-targets = ""
link-shared = true
use-linker = "lld"
optimize = true
release-debuginfo = true

[rust]
debug = false
incremental = false
optimize = 3
debug-logging = false
deny-warnings = false
codegen-backends = ["llvm"]
use-lld = true
lto = "off"
debug-assertions = true
debug-assertions-std = false

The error log is as follows:

thread 'rustc' panicked at compiler/rustc_mir_build/src/thir/pattern/deconstruct_pat.rs:560:22:
occupied niche: found 0x7f7700000000 but must be in 0x0..=0x2 in type std::option::Option<thir::pattern::deconstruct_pat::SliceKind> at offset 0 with type Int(I64, false)
stack backtrace:
  ...
  13:     0x7f77e92ade54 - <rustc_mir_build[48ebee3fe6c2e2b3]::thir::pattern::deconstruct_pat::Constructor>::split::<core[8d828210e7f791ba]::iter::adapters::map::Map<core[8d828210e7f791ba]::iter::adapters::map::Map<core[8d828210e7f791ba]::slice::iter::Iter<rustc_mir_build[48ebee3fe6c2e2b3]::thir::pattern::usefulness::PatStack>, <rustc_mir_build[48ebee3fe6c2e2b3]::thir::pattern::usefulness::Matrix>::heads::{closure#0}>, <rustc_mir_build[48ebee3fe6c2e2b3]::thir::pattern::deconstruct_pat::DeconstructedPat>::ctor>>
  ...

I found that if I set codegen-units=1, the panic disappears. So we can use codegen-units to determine which crate is affected. I guessed rustc_mir_build based on the stack and toggled codegen-units=1 for that crate to verify. Since this issue appeared at roughly the same point in time as the previous one, I used the same LLVM commit as the starting point of the bisect. Unfortunately, with optimize=3 set, we can't find a good commit in LLVM 16. With optimize=2, however, LLVM 16 is a good commit.

Tricky git bisect

During the bisect process, I get an unexpected result:

---- [ui] tests/ui/issue-11881.rs stdout ----

error: test compilation failed although it shouldn't!
status: exit status: 1
command: RUSTC_ICE="0" "/home/dianqk/rust-workspace/rust/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/home/dianqk/rust-workspace/rust/tests/ui/issue-11881.rs" "-Zthreads=1" "-Zsimulate-remapped-rust-src-base=/rustc/FAKE_PREFIX" "-Ztranslate-remapped-path-to-local-path=no" "-Z" "ignore-directory-in-diagnostics-source-blocks=/home/dianqk/.cargo" "--sysroot" "/home/dianqk/rust-workspace/rust/build/x86_64-unknown-linux-gnu/stage2" "--target=x86_64-unknown-linux-gnu" "-O" "--error-format" "json" "--json" "future-incompat" "-Ccodegen-units=1" "-Zui-testing" "-Zdeduplicate-diagnostics=no" "-Zwrite-long-types-to-disk=no" "-Cstrip=debuginfo" "-C" "prefer-dynamic" "-o" "/home/dianqk/rust-workspace/rust/build/x86_64-unknown-linux-gnu/test/ui/issue-11881/a" "-A" "internal_features" "-Crpath" "-Cdebuginfo=0" "-Lnative=/home/dianqk/rust-workspace/rust/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "-Clink-arg=-fuse-ld=lld" "-Clink-arg=-Wl,--threads=1" "-L" "/home/dianqk/rust-workspace/rust/build/x86_64-unknown-linux-gnu/test/ui/issue-11881/auxiliary"
stdout: none
--- stderr -------------------------------
error: unexpected token: `&`
  --> /home/dianqk/rust-workspace/rust/tests/ui/issue-11881.rs:18:25
   |
LL |     fn encode(&self, s: &mut S) -> Result<(), S::Error>;
   |                         ^ unexpected token after this

error: unexpected token: `&`
  --> /home/dianqk/rust-workspace/rust/tests/ui/issue-11881.rs:33:23
   |
LL |     fn fmt(&self, _f: &mut fmt::Formatter<'_>) -> fmt::Result {
   |                       ^ unexpected token after this

We can't mark this result as either good or bad, because doing so would lead bisect to the wrong result.

Unfortunately, even with a skip, we can't get a bisect result. Whenever bisect lands inside the range f7deb69f2...7c78cb4b, this unexpected error can appear. Commits before the range are good, and commits after it are bad. This is because the change in the semantics of nonnull and related metadata introduced new issues, and by skipping commits during bisect we passed over some of the fixes, which exposed a new issue.

The history of this commit is as follows:

bad
Revert "[SimplifyCFG][LICM] Preserve nonnull, range and align metadat…  7c78cb4
skip
[SimplifyCFG][LICM] Preserve nonnull, range and align metadata when s…  78b1fbc
good

Since we skipped some commits that introduced a new problem, bisect had no result. Luckily, this case was exceptional, and we were able to drop 78b1fbc and 7c78cb4 using rebase.

The final bisect result is [AggressiveInstCombine] Enable also for -O2. I also found the critical transformation in AggressiveInstCombine by changing the code, but I can't see anything wrong with it from the code or the IR. It could be that I'm missing something, or it could be that this change merely happens to trigger a bug that lives elsewhere. We need to keep this suspicion in mind and continue troubleshooting.

codegen-units=256 & -opt-bisect-limit=n

This time we have no way to use -opt-bisect-limit because codegen-units is not equal to 1. It doesn't make sense to bisect multiple IRs at the same time. We need to modify the rust code to support selecting a specific CGU to bisect under multiple CGUs.

First use -C save-temps to find the corresponding IR. See bfd759b7 for the details.

Trivia: Here I also tried to find the one transformation in AggressiveInstCombine that had an effect. It was a headache that adding -C save-temps changed the symbol names, which made me look again for the associated symbols.

I wrote a simple script to find this IR:

for bitcode in build/x86_64-unknown-linux-gnu/stage1-rustc/x86_64-unknown-linux-gnu/release/deps/rustc_mir_build-*-cgu.*.rcgu.no-opt.bc; do
    # -U restricts the listing to symbols defined in this bitcode file.
    if llvm-nm -U "$bitcode" | grep -q "example"; then
        echo "$bitcode"
    fi
done

I read through the implementation of OptBisect. Although the -opt-bisect-limit option is global, we can swap the pass gate of an LLVM context for a separate one that runs every pass, so that only the CGU we care about is bisected. The full modification is at a9f62a4a.

Some key changes are referenced below:

struct RunAllOptPassGate : public OptPassGate {
  bool shouldRunPass(const StringRef PassName, StringRef IRDescription) override {
    return true;
  }
  bool isEnabled() const override { return true; }
};

static RunAllOptPassGate &getRunAllOptPassGate() {
  static RunAllOptPassGate RunAllOptPassGate;
  return RunAllOptPassGate;
}

extern "C" void LLVMRustContextSetSetRunAllOptPassGate(LLVMContextRef C) {
  unwrap(C)->setOptPassGate(getRunAllOptPassGate());
}
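The idea behind this change can be modeled without LLVM: every CGU gets a gate that decides whether a pass runs, and only the selected CGU gets the limiting gate. The sketch below is a simplified model under that assumption. `PassGate`, `RunAll`, `BisectLimit` and `gate_for_cgu` are hypothetical names and do not correspond to rustc or LLVM APIs; the pass names are just labels.

```rust
/// Decides whether the next pass should run.
trait PassGate {
    fn should_run(&mut self, pass: &str) -> bool;
}

/// Gate for CGUs we are not bisecting: every pass runs.
struct RunAll;
impl PassGate for RunAll {
    fn should_run(&mut self, _pass: &str) -> bool {
        true
    }
}

/// Gate for the selected CGU: only the first `limit` passes run.
struct BisectLimit {
    limit: u64,
    seen: u64,
}
impl PassGate for BisectLimit {
    fn should_run(&mut self, _pass: &str) -> bool {
        self.seen += 1;
        self.seen <= self.limit
    }
}

/// Picks the gate for a CGU by comparing its name with the selected one.
fn gate_for_cgu(cgu: &str, selected: &str, limit: u64) -> Box<dyn PassGate> {
    if cgu == selected {
        Box::new(BisectLimit { limit, seen: 0 })
    } else {
        Box::new(RunAll)
    }
}

/// Counts how many of `passes` the gate lets through.
fn count_passes_run(gate: &mut dyn PassGate, passes: &[&str]) -> usize {
    let mut count = 0;
    for pass in passes {
        if gate.should_run(pass) {
            count += 1;
        }
    }
    count
}

fn main() {
    let selected = "rustc_mir_build.63d28fcded2a05ed-cgu.007";
    let passes = ["InlinerPass", "InstCombinePass", "SimplifyCFGPass", "DSEPass"];
    for cgu in [selected, "some_other_cgu"] {
        let mut gate = gate_for_cgu(cgu, selected, 2);
        println!("{cgu}: {} passes run", count_passes_run(gate.as_mut(), &passes));
    }
}
```

Because each CGU counts its own passes, the limit numbers stay meaningful even when many CGUs are compiled in the same build.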

I also added a command line argument -Z llvm-opt-bisect-limit-cgu so that I could bisect using the following script:

export RUSTFLAGS_NOT_BOOTSTRAP="-C llvm-args=-opt-bisect-limit=-1 -Z llvm-opt-bisect-limit-cgu=rustc_mir_build.63d28fcded2a05ed-cgu.007"
./x build --stage 2 library --keep-stage 0 --keep-stage-std 1 2> build.log
./build/host/stage2/bin/rustc ./tests/ui/consts/const_prop_slice_pat_ice.rs

I also wrote a simple automated bisect script:

function iterate() {
    local good=`sed -n '1p' bisect_result`
    local bad=`sed -n '2p' bisect_result`
    local result=$((bad - good))
    echo "good: $good, bad: $bad"
    if [ $result -eq 1 ]; then
        echo "done"
        exit 0
    else
        local next=$((good + (result / 2)))
        echo $next
        bash bisect.sh $next
        exit_code=$?
        case $exit_code in
            0)
                good=$next
                ;;
            1)
                bad=$next
                ;;
            *)
                echo "failed: $exit_code"
                exit 1
                ;;
        esac
    fi
    echo $good > bisect_result
    echo $bad >> bisect_result
}

while true; do
    iterate
    sleep 1
done
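The loop in this script is an ordinary binary search over the pass limit. The sketch below shows the same logic with the build-and-test step replaced by a closure. `bisect` is a hypothetical helper, and the threshold in `main` is an invented number standing in for `bash bisect.sh $next`.

```rust
/// Narrows `(good, bad)` until the two limits are adjacent.
/// `is_good(limit)` must return true when the build with that limit passes.
fn bisect(mut good: u64, mut bad: u64, mut is_good: impl FnMut(u64) -> bool) -> (u64, u64) {
    assert!(good < bad, "good limit must be below bad limit");
    while bad - good > 1 {
        let next = good + (bad - good) / 2;
        if is_good(next) {
            good = next;
        } else {
            bad = next;
        }
    }
    (good, bad)
}

fn main() {
    // Invented threshold: limits below it are treated as good.
    let first_bad = 13_445;
    let mut runs = 0;
    let (good, bad) = bisect(0, 20_000, |limit| {
        runs += 1;
        limit < first_bad
    });
    println!("good: {good}, bad: {bad}, runs: {runs}");
}
```

The shell version persists `good` and `bad` in bisect_result between iterations, which lets an interrupted search resume where it stopped; the sketch keeps them in memory for brevity.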

But I got a weird result:

BISECT: running pass (13444) InlinerPass on (symbol)
BISECT: NOT running pass (13445) PostOrderFunctionAttrsPass on (symbol)

I don't think inlining is directly related to this mis-compilation. I simply changed OptBisect.cpp so that InlinerPass always runs and is not counted:

bool OptBisect::shouldRunPass(const StringRef PassName,
                              StringRef IRDescription) {
  if (PassName == "InlinerPass") {
      printPassMessage(PassName, -1, IRDescription, true);
      return true;
  }
  // ...
}

Eventually, I got:

BISECT: running pass (10040) CorrelatedValuePropagationPass on symbol
BISECT: NOT running pass (10041) SimplifyCFGPass on symbol

To verify that CorrelatedValuePropagationPass is involved, I again removed the unrelated code, see a08f2c14. I also added a line of logging for simple validation:

LLVMLOG: Delete   %102 = and i64 %101, 4294967295   -> and i64 %101, 0xffffffff
...
occupied niche: found 0x7fba00000001 but must be in 0x0..=0x2 in type std::option::Option<thir::pattern::deconstruct_pat::SliceKind> at offset 0 with type Int(I64, false)
0x7fba00000001 & 0xffffffff = 0x1

I'm pretty sure this is what we're looking for! This correlates very well with rustc's panic log. Eventually, I realized that %101 could get undef results under certain control flows. In this case, we should not remove %102.

I was curious why this is related to AggressiveInstCombine. If we remove that pass, the deleted instruction becomes %123 = and i64 %122, 72057594037927935 (0xffffffffffffff). Deleting that instruction is still wrong; it's just that with such a wide mask the program is unlikely to run into the problem at runtime.
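The arithmetic in the log can be replayed directly. The sketch below checks that the deleted `and` is exactly what brings the observed value back into the valid niche range 0x0..=0x2, and that the two decimal masks equal the hexadecimal ones. `niche_ok` is a hypothetical helper, and `high_garbage` is an invented value used only to illustrate the wider mask.

```rust
use std::ops::RangeInclusive;

/// Valid range reported by the niche check in the panic message.
const VALID_NICHE: RangeInclusive<u64> = 0..=2;

/// Returns true when `raw` would pass the niche check.
fn niche_ok(raw: u64) -> bool {
    VALID_NICHE.contains(&raw)
}

fn main() {
    // The masks from the two deleted instructions, decimal versus hex.
    assert_eq!(4294967295u64, 0xffff_ffff);
    assert_eq!(72057594037927935u64, (1u64 << 56) - 1);

    // Value reported by the panic: garbage in the upper 32 bits.
    let raw: u64 = 0x7fba00000001;
    assert!(!niche_ok(raw)); // without the `and`, the check fails
    assert!(niche_ok(raw & 0xffff_ffff)); // with the `and`, the value is 1

    // Invented value: with the 56-bit mask only the top 8 bits are cleared.
    let high_garbage: u64 = 0xab00_0000_0000_0001;
    assert!(niche_ok(high_garbage & ((1u64 << 56) - 1)));
    println!("mask checks passed");
}
```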

Summary

What have we compared?

These approaches brought me closer and closer to the truth. Luck also matters, by the way; I didn't record my missteps :].

In these methods, I have also introduced some specific methods:

The generalized process for solving this type of problem is:

  1. First, find a way to shorten the time of a single reproduction; long debugging cycles are annoying.
  2. The key objective is to find the transformations that cause the mis-compilation, using the comparisons above.
  3. Extract the IR and locate the problem based on those crucial transformations.
  4. Fix the problem.

I don't have any experience with step 3 yet, but I would use llvm-extract and llvm-reduce to reduce the amount of IR, which helps with debugging. I would also use -opt-bisect-limit to extract the IR at intermediate stages and manually remove some functions or instructions to debug the problem.

I don't have a clear idea of how to submit a proper fix yet, and currently I rarely get an LGTM on a PR without revisions. I need to learn and practice more ;).
