Skip to content

Add support for caching Rust compilations. Fixes #42 #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 24, 2017

Conversation

luser
Copy link
Contributor

@luser luser commented Mar 7, 2017

This is a pretty large set of changes, but it breaks down into a few parts:

  • First, I spent some time refactoring the existing logic from CompilerInfo::get_cached_or_compile, since it was all tangled with the logic of running the C preprocessor. I split that out into a separate method and moved a few other things around while I was there. This is the "Pre-Rust support refactoring, Part 1" change.
  • Next, I added some traits to hide implementation details of the actual compilation process: Compiler, CompilerHasher, and Compilation. I changed get_cached_or_compile and everywhere else that was interacting with compilers to only work with boxed trait objects of these types, and made it so that Compiler::parse_arguments returns a boxed CompilerHasher, and then CompilerHasher::generate_hash_key returns a boxed Compilation. This way the C compiler implementation and the Rust compiler implementation can each persist some private state through the process. I refactored the C compilation bits further into an implementation of these traits that's generic over a CCompilerImpl trait, implementations of which simply call into the existing gcc/clang/msvc functions. This is the "Pre-Rust support refactoring, Part 2" and "Part 3" changes.
  • Finally, I added a Rust implementation of the traits. A few refactoring-type changes snuck in to this changeset but I was too lazy to split them out at this point.

There are also a few tiny refactoring changesets mixed in there, I tried to split small, separable changes out when possible to make the whole thing easier to follow.

@luser luser self-assigned this Mar 7, 2017
@luser luser requested a review from alexcrichton March 7, 2017 19:10
@luser
Copy link
Contributor Author

luser commented Mar 8, 2017

The Travis builds are broken on 1.13 for some reason, I will take a look at that.

Box::new(detect.and_then(move |compiler| {
match compiler {
Some(compiler) => {
Box::new(sha1_digest(executable.clone(), &pool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually somewhat interesting in the rustc case. I'd imagine that quite commonly this is actually the rustup version of rustc.exe rather than the rustc version of rustc.exe. I wonder if we could automatically detect rustup and go fetch the actual compiler instead?

I believe that would work best by looking for rustup.exe in the directory next to rustc.exe (if found). If that's there we can run rustup which rustc which should print out the full path to the actual compiler.

Also as a side note here, I wonder if we should handle the dynamic libraries that rustc itself links to? I can't really imagine a use case though where the dylibs are changed and the rustc executable wouldn't change, though, so it's perhaps not an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really good point and I can't believe I didn't realize it! I don't have a non-rustup install handy, but would it be enough to run rustc --print sysroot and hash $sysroot/bin/rustc? I'd have to sanity check how that works with the tooltool-downloaded rustc we use in Firefox CI.

We discussed the dylib issue in Hawaii and we kind of hand-waved it away. I definitely don't want to be parsing the output of ldd or reading the binary headers to figure this out, but maybe it's enough to hash rustc + everything in $sysroot/lib/? Actually given that rustc is just a stub, maybe we skip hashing rustc entirely, hash $sysroot/lib and that's good enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but would it be enough to run rustc --print sysroot and hash $sysroot/bin/rustc

Oh right yeah that'd be perfect. (and also solve the rustup case)

This may also throw a wrench into sccache's own caching of the compiler as well unfortunately b/c rustup can have a different rustc depending on which directory it's invoked in (and that can change over time). I think sccache assumes the main compiler doesn't change much, right?

Also yeah I think we can handwave away b/c every release has different hashes in the dylib filenames which I believe implies that the contents of rustc will be different (e.g. the file somehow references those dylibs). That being said hashing $sysroot/lib would put a nail in the coffin and work perfectly as well.

The only catch there is that dylibs are in $sysroot/bin on Windows and $sysroot/lib everywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may also throw a wrench into sccache's own caching of the compiler as well unfortunately b/c rustup can have a different rustc depending on which directory it's invoked in (and that can change over time). I think sccache assumes the main compiler doesn't change much, right?

Also a good point, we'll have to do something smarter here. sccache caches compiler info by (path, mtime), so rustup will totally break this expectation. I'll have to think about that for a bit, I don't want to have an extra rustc invocation for every single compile if we can avoid it.

Err(e) => return Box::new(future::err(e)),
}
} else {
Box::new(future::err("Failed run rustc --dep-info".into()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this contain the rustc errors logs (e.g. output above)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. We return the preprocessor stderr for C compilation when it fails, we should do something smarter here. The error propagation in this code isn't great in general, I should make sure we're providing enough information that users can attempt to diagnose failures.

drop(temp_dir);
hash_all(files, &pool)
}
Err(e) => return Box::new(future::err(e)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this could use chain_err to say "failed to parse dep info at {}"

let outstr = String::from_utf8(output.stdout).chain_err(|| "Error parsing rustc output")?;
Ok(outstr.lines().map(|l| l.to_owned()).collect())
} else {
bail!("Failed to run `rustc --print file-names`");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this may actually make a good helper as part of run_input_output, having a method to run a command and auto-check output.status.success() and bundle up all the relevant error info in the case of failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to look at the various places where I call this--I do think I wrote this pattern more than once.

// so we can *find* them.
"-l" => return CompilerArguments::CannotCache,
v if v.starts_with("--emit") => {
//XXX: do we need to handle --emit specified more than once?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I think this does happen from time to time (I think I type it in at least every once in a while!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly concerned with caching all possible commandlines (clearly), but it'd be good to ensure that we're at least not producing incorrect behavior in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it's fine to just return "not cacheable" for multiple --emit calls for now.


/// Calculate the SHA-1 digest of each file in `files` on background threads
/// in `pool`.
fn hash_all(files: Vec<String>, pool: &CpuPool) -> SFuture<Vec<String>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these files be sorted to ensure that they're always hashed in the same order? I think the dep-info is relatively deterministic (but that's likely not guaranteed) and I wouldn't be surprised if the order of --extern from Cargo was also slightly nondeterministic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort the list of source files from the dep-info, but not the list of externs, they're input to the hash in the order they're listed on the commandline.

If cargo produces nondeterministic commandlines we've got a bigger problem, since we're feeding the entire commandline as hash input! Can we verify that assumption?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah just confirmed, Cargo's order of --extern is nondeterministic.

If you'd prefer to change in Cargo I can also do that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing that in cargo would certainly make my life easier, and we're likely to need to wait for your dep-info speedup for this to all be useful anyway, so we can probably wait for a fix for that.

Is anything else about the commandline nondeterministic? If so, we'll have to fix sccache to not use the commandline as a direct hash input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I just checked, everything should be deterministic except the order of --extern which is nondeterministic.

m.update(h.as_bytes());
}
// 6. TODO: Environment variables:
// RUSTFLAGS, maybe CARGO_PKG_*?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUSTFLAGS is safe to ignore as it's just used by Cargo, not rustc.

Other env vars though yeah like CARGO_* should probably be hashed. That being said I don't think the env from the child ships over to the server, right? (this is a case where that may actually be quite important to implement!)

In any case I think it's safe to assume that all of the child's CARGO_* env vars should be hashed here (although I guess arguably all env vars should be hashed due to env!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning on fixing #48 immediately to remedy this.

Caching all of the environment variables seems like it's going to result in us having very few cache hits, unless cargo cleans the environment before passing it to rustc. Just looking at the output of env on my Linux machine shows a bunch of vars with PIDs and other junk in them.

I wonder if it'd be possible to teach rustc to tell us about the usage of env!, similar to emitting dep-info? If we could get a list of env! usage then we'd only have to hash those variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be possible to teach rustc to tell us about the usage of env!, similar to emitting dep-info

I was thinking the same thing yeah. I filed rust-lang/rust#40364 to track this.

I also definitely agree that there's tons of junk in env vars typically. We could try to maintain a blacklist of "don't worry about caching this" but otherwise I think blanket hashing CARGO_* should work fine for now.

@alexcrichton
Copy link
Contributor

Overall looks great to me, nice!

I wonder if we should perhaps document somewhere the limitations of rustc caching? Off the top of my head what I can think of is:

  • Only some flag combinations work (e.g. --emit dep-info,link)
  • Values from env! may not be tracked in caching
  • Procedural macros that read files from the filesystem may not be cached correctly (yet)
  • The system linker isn't cached (e.g. changing gcc/ld won't invalidate)
  • Target specs aren't hashed (e.g. custom target specs I believe)

@luser
Copy link
Contributor Author

luser commented Mar 8, 2017

We should definitely document those! I wrote down a bunch of things in the parse_arguments implementation as comments, but it'd be good to have them somewhere else.

The system linker isn't cached (e.g. changing gcc/ld won't invalidate)

This is a good point. Is there a way to ask rustc what linker it would use? If this isn't simple to fix I wonder if we shouldn't avoid caching dylib/bin crates for now.

Target specs aren't hashed (e.g. custom target specs I believe)

I'm not clear on what this means--do you mean that we don't hash the target when it's not passed on the commandline?

@alexcrichton
Copy link
Contributor

This is a good point. Is there a way to ask rustc what linker it would use? If this isn't simple to fix I wonder if we shouldn't avoid caching dylib/bin crates for now.

Not currently, unfortunately. Let's stick w/ just caching rlibs for now which is conservative and most of the benefit I think anyway.

I'm not clear on what this means--do you mean that we don't hash the target when it's not passed on the commandline?

Oh sorry! So for more esoteric use cases you can actually pass --target foo where foo.json is in the cwd. The foo.json file then says all the various options for your "custom target" (aka these are called "custom target specs").

Note that this is a pretty niche use case though. We can probably protect against this by running rustc --print target-list and then checking if the --target value is not in there. If it's not then it's a custom target and we probably shouldn't cache for now. (again though, seems like a prime candidate for a TODO somewhere as opposed to blocking this on fixing)

@luser
Copy link
Contributor Author

luser commented Mar 23, 2017

OK, I think I addressed all the review comments.

luser added 10 commits March 24, 2017 06:29
Pull the "run the preprocessor and generate the hash key" chunk of
`get_cached_or_compile` out into a separate method, in expectation of
making things more generic in a followup commit to support the Rust
compiler, which won't be running a preprocessor.

There are a few major changes here:
* put the refactored bits from `get_cached_or_compile` into a
  `hash_key_from_c_preprocessor_output`
* added a `generate_hash_key` method to `CompilerKind`, which just calls
  the previously mentioned function
* removed `Compiler::compile`, inlined it into `get_cached_or_compile`
  since that was the only call site anyway, and it made some lifetime issues
  easier
* changed `get_cached_or_compile` to take ownership of `self`, and changed
  a few other functions to just take the compiler path as a `&str` instead
  of taking a `&Compiler`.
The goal here was to make the state that persists between running the
preprocessor and running the compiler private, since the Rust compiler
does not have a preprocessor, but it will likely have other state it
would like to persist.

* Add a `Compiler` trait to make the interface to compilation generic.
* Add a `Compilation` trait that can be returned from a method on `Compiler`
  to hold the preprocessor output that is reused for compilation while still
  allowing the calling code to box `Compiler` and `Compilation` as trait
  objects.
* Add a `CCompiler` struct that impls `Compiler`, but is generic over a
  second `CCompilerImpl` trait for specific C compilers, since most of
  the logic of running the preprocessor to generate the hash key is
  shared. Move all of `hash_key_from_c_preprocessor_output` into the
  `Compiler` impl on `CCompiler`.
* Add {GCC,Clang,MSVC} structs that impl `CCompilerImpl` so they can be
  used with `CCompiler`.
* Rework `CompilerKind` to be a simple utility enum and make `CompilerInfo`
  actually hold a `Box<Compiler>` and call methods on it directly.
The goal here was to make more of the state that's currently persisted
from `parse_arguments` -> `generate_hash_key` -> `compile` private so
the C compilers and the Rust compiler can store different kinds of state.

* Split the `Compiler` trait further into a `CompilerHasher` trait,
  which now gets returned in a Box from `Compiler::parse_arguments`
  as a field of `CompilerArguments`.
* Move the existing `ParsedArguments` struct into compiler/c to make it
  specific to C compilers.
… $sysroot/lib instead of compiler executable.

This changeset removes `CompilerInfo` entirely, moving `get_cached_or_compile`
into the `Compiler` trait. The server now deals exclusively with objects of
`Box<Compiler>`.

Also fixes a few other review comments.
@luser
Copy link
Contributor Author

luser commented Mar 24, 2017

I rebased on top of master and built it locally on Linux/Mac/Windows, so everything should be OK, but I figured I'd push the rebased changes here for sanity's sake. If the CI is green I'll merge this and then we can get on with rebasing and merging all your other PRs.

{
let mut f = File::open(file)?;
let mut deps = String::new();
f.read_to_string(&mut deps)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it looks like this happens on the main event loop thread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

@alexcrichton
Copy link
Contributor

I've pushed a commit that parses dep-info files on a cpu pool, but otherwise looks great! r=me w/ green CI

@luser luser merged commit 907ed3b into mozilla:master Mar 24, 2017
crystalin pushed a commit to PureStake/sccache that referenced this pull request Jul 2, 2021
Set up a smoke test pass that check if Substrate/Polkadot build using dist sccache
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants