Skip to content

excessive compile time building linux/lib/zstd/compress.c #1314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
arndb opened this issue Feb 28, 2021 · 10 comments
Open

excessive compile time building linux/lib/zstd/compress.c #1314

arndb opened this issue Feb 28, 2021 · 10 comments
Labels
[BUG] llvm A bug that should be fixed in upstream LLVM compile time Issues affecting compile times Reported upstream This bug was filed on LLVM’s issue tracker, Phabricator, or the kernel mailing list.

Comments

@arndb
Copy link

arndb commented Feb 28, 2021

I did a scripted search for files with long compile time. lib/zstd/compress.c is one of the files that came out on top.

$ perf stat clang-11 -c lib/zstd/compress.i -O3 -Wall -Wno-unused  -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE   --target=aarch64-linux -Werror=unknown-warning-option -mgeneral-regs-only -Wno-psabi -fno-asynchronous-unwind-tables -fno-unwind-tables -mbranch-protection=pac-ret+leaf  -fno-delete-null-pointer-checks -Wframe-larger-than=2048 -fno-stack-protector -mno-global-merge -fno-omit-frame-pointer -fno-optimize-sibling-calls -mno-omit-leaf-frame-pointer -pg -fno-strict-overflow -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types   -fsanitize=kernel-hwaddress -mllvm -hwasan-instrument-stack=0 -mllvm -hwasan-use-short-granules=0 -mllvm -hwasan-mapping-offset=0xefffc00000000000  -fsanitize=array-bounds -fsanitize=unreachable -fsanitize=object-size -fsanitize=bool -fsanitize=enum
          26925.09 msec task-clock                #    0.998 CPUs utilized          
              2738      context-switches          #    0.102 K/sec                  
                22      cpu-migrations            #    0.001 K/sec                  
             50399      page-faults               #    0.002 M/sec                  
       87646286353      cycles                    #    3.255 GHz                      (83.32%)
       10257322275      stalled-cycles-frontend   #   11.70% frontend cycles idle     (83.31%)
        5991572775      stalled-cycles-backend    #    6.84% backend cycles idle      (83.34%)
       59088749914      instructions              #    0.67  insn per cycle         
                                                  #    0.17  stalled cycles per insn  (83.35%)
       11508007401      branches                  #  427.408 M/sec                    (83.35%)
         491383656      branch-misses             #    4.27% of all branches          (83.33%)

      26.972371393 seconds time elapsed

      26.757734000 seconds user
       0.159723000 seconds sys

Attaching the preprocessed version:
compress.i.gz

The profile shows these functions on top:

   3.20%  clang-13  libLLVM-13.so.1      [.] llvm::BasicBlock::getTerminator
   2.95%  clang-13  libLLVM-13.so.1      [.] llvm::LiveRangeCalc::findReachingDefs
   2.46%  clang-13  libLLVM-13.so.1      [.] llvm::LoopBase<llvm::MachineBasicBlock, llvm::MachineLoop>::getExitingBlocks
   2.00%  clang-13  libLLVM-13.so.1      [.] llvm::SmallPtrSetImplBase::FindBucketFor
   1.89%  clang-13  libc-2.33.so         [.] _int_malloc
   1.31%  clang-13  libLLVM-13.so.1      [.] llvm::SmallPtrSetImplBase::insert_imp_big
   1.25%  clang-13  libLLVM-13.so.1      [.] llvm::LiveRange::extendInBlock
   1.18%  clang-13  libLLVM-13.so.1      [.] llvm::SelectionDAG::Combine
   1.18%  clang-13  libLLVM-13.so.1      [.] llvm::ScalarEvolution::forgetLoop
   0.83%  clang-13  libLLVM-13.so.1      [.] llvm::SpillPlacement::activate
   0.73%  clang-13  libc-2.33.so         [.] _int_free
   0.70%  clang-13  libLLVM-13.so.1      [.] llvm::LiveRangeUpdater::add
   0.69%  clang-13  libc-2.33.so         [.] msort_with_tmp.part.0
   0.68%  clang-13  libLLVM-13.so.1      [.] llvm::DominatorTreeBase<llvm::BasicBlock, false>::dominates
   0.66%  clang-13  libLLVM-13.so.1      [.] llvm::LiveIntervals::extendSegmentsToUses
   0.65%  clang-13  libc-2.33.so         [.] malloc
   0.64%  clang-13  libLLVM-13.so.1      [.] llvm::LiveVariables::MarkVirtRegAliveInBlock
   0.62%  clang-13  libLLVM-13.so.1      [.] llvm::ValueHandleBase::AddToUseList
   0.62%  clang-13  libc-2.33.so         [.] unlink_chunk.constprop.0
   0.60%  clang-13  libLLVM-13.so.1      [.] llvm::SpillPlacement::addLinks
   0.60%  clang-13  libc-2.33.so         [.] cfree@GLIBC_2.2.5
   0.58%  clang-13  libc-2.33.so         [.] malloc_consolidate
   0.56%  clang-13  libLLVM-13.so.1      [.] llvm::Value::getMetadata
   0.54%  clang-13  libLLVM-13.so.1      [.] llvm::SmallPtrSetImplBase::Grow
   0.52%  clang-13  libLLVM-13.so.1      [.] llvm::Instruction::getSuccessor
   0.51%  clang-13  libLLVM-13.so.1      [.] 0x0000000001029ad0
   0.50%  clang-13  libLLVM-13.so.1      [.] 0x0000000000e5eb1b
@arndb
Copy link
Author

arndb commented Feb 28, 2021

I played around with this a little and found that -fsanitize=kernel-hwaddress is the main reason for the slow compilation, without this, compile time goes down to 12 seconds instead of 27.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Mar 1, 2021

IIRC, doesn't this file have __always_inline on every function? I remember there was one zstd related file that had a relatively large time spent in the optimizer than the frontend, unlike most other source files.

#1086 (comment) is when I looked at it, though it seems I failed to make such notes public...

@nickdesaulniers
Copy link
Member

cc @eugenis for feedback on -fsanitize=kernel-hwaddress.

@nickdesaulniers nickdesaulniers added the compile time Issues affecting compile times label Mar 1, 2021
@arndb
Copy link
Author

arndb commented Mar 1, 2021

Indeed, removing every __always_inline in the preprocessed file makes compilation three to four times faster, both with and without -fsanitize=kernel-address.

With both kasan and always_inline removed, the file now compiles in three seconds, which is still rather slow, but is ten times faster than the maximum I found.

@arndb
Copy link
Author

arndb commented Mar 1, 2021

Comparing clang-13 against gcc-10:

clang-13, hwasan, always_inline  25.6s
clang-13, hwasan                  6.3s
clang-13, kasan, always_inline   23.4s
clang-13, kasan                   4.9s
clang-13, always_inline          11.2s
clang-13, plain                   3.4s
gcc-10.1, kasan, always_inline   13.4s
gcc-10.1, kasan                   7.1s
gcc-10.1, always_inline          11.7s
gcc-10.1, plain                   3.6s

@arndb
Copy link
Author

arndb commented Mar 1, 2021

This one seems related but even worse, at over 100 CPU seconds to compile the preprocessed file with KASAN:

$ perf stat clang-13 -Wno-unused --target=arm-linux-gnueabi  -O2  -fsanitize=kernel-address   -c qla_dbg-clang.i   -mllvm -asan-instrumentation-with-call-threshold=0  

 Performance counter stats for 'clang-13 -Wno-unused --target=arm-linux-gnueabi -O2 -fsanitize=kernel-address -c qla_dbg-clang.i -mllvm -asan-instrumentation-with-call-threshold=0':

         104607.14 msec task-clock                #    0.998 CPUs utilized          
             10588      context-switches          #    0.101 K/sec                  
                59      cpu-migrations            #    0.001 K/sec                  
             54679      page-faults               #    0.523 K/sec                  
      340985929529      cycles                    #    3.260 GHz                      (83.33%)
       67694323496      stalled-cycles-frontend   #   19.85% frontend cycles idle     (83.33%)
       18623444901      stalled-cycles-backend    #    5.46% backend cycles idle      (83.33%)
      151628665799      instructions              #    0.44  insn per cycle         
                                                  #    0.45  stalled cycles per insn  (83.34%)
       27060513254      branches                  #  258.687 M/sec                    (83.33%)
        1384301145      branch-misses             #    5.12% of all branches          (83.34%)

     104.785569716 seconds time elapsed

     104.325451000 seconds user
       0.251598000 seconds sys

Without KASAN, it takes 7.6s to compile.

With gcc, it's 4.6.s with kasan or 4.0s without kasan.

qla_dbg-clang.i.gz

@arndb
Copy link
Author

arndb commented Mar 1, 2021

I found that this one-line change reduces the build time from 104 seconds to 2.7 seconds:

--- a/drivers/scsi/qla2xxx/qla_dbg.c
+++ b/drivers/scsi/qla2xxx/qla_dbg.c
@@ -280,7 +280,7 @@ qla24xx_dump_memory(struct qla_hw_data *ha, __be32 *code_ram,
        return rval;
 }
 
-static __be32 *
+static noinline __be32 *
 qla24xx_read_window(struct device_reg_24xx __iomem *reg, uint32_t iobase,
                    uint32_t count, __be32 *buf)
 {

@arndb
Copy link
Author

arndb commented Mar 2, 2021

I ran the qla_dbg test case (without the noinline) through a "clang-13 -ftime-report" (full report attached), the interesting bits are

                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 96.9025 seconds (97.1371 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  95.4756 ( 98.6%)   0.0677 ( 57.6%)  95.5432 ( 98.6%)  95.7762 ( 98.6%)  DAG Combining 2
   0.4311 (  0.4%)   0.0109 (  9.2%)   0.4420 (  0.5%)   0.4426 (  0.5%)  Instruction Selection

                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 104.6836 seconds (104.9413 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  97.8214 ( 94.0%)   0.1408 ( 22.5%)  97.9622 ( 93.6%)  98.2003 ( 93.6%)  ARM Instruction Selection
   3.1954 (  3.1%)   0.0506 (  8.1%)   3.2459 (  3.1%)   3.2578 (  3.1%)  Greedy Register Allocator
   1.6001 (  1.5%)   0.3769 ( 60.3%)   1.9770 (  1.9%)   1.9829 (  1.9%)  ARM Assembly Printer

The results are the same for earlier versions of clang (tested 10 and up), so this is not a regression.
In fact, clang-10 and clang-11 are a few percent slower overall.

execution-timing.txt

@arndb
Copy link
Author

arndb commented Mar 7, 2021

Here is a reduced test case for qla_dbg.c, which comes down to the same function getting inline many times when probably should not be inlined.

qla_dbg-clang.c.gz

@arndb
Copy link
Author

arndb commented Mar 7, 2021

@kees kees added Reported upstream This bug was filed on LLVM’s issue tracker, Phabricator, or the kernel mailing list. [BUG] llvm A bug that should be fixed in upstream LLVM labels Oct 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[BUG] llvm A bug that should be fixed in upstream LLVM compile time Issues affecting compile times Reported upstream This bug was filed on LLVM’s issue tracker, Phabricator, or the kernel mailing list.
Projects
None yet
Development

No branches or pull requests

3 participants