
[analyzer] Enforce not making overly complicated symbols #144327


Open · wants to merge 1 commit into main

Conversation

balazs-benics-sonarsource
Contributor

Out of the worst 500 entry points, 45 were improved by at least 10%. Out of these 45, 5 were improved by more than 50%. Out of these 45, 2 were improved by more than 80%.

For example, for the
DelugeFirmware/src/OSLikeStuff/fault_handler/fault_handler.c TU:

  • printPointers entry point was improved from 31.1 seconds to 1.1 seconds (28x).
  • handle_cpu_fault entry point was improved from 15.5 seconds to 3 seconds (5x).

We had 3'037'085 entry points in total in the test pool. Of these, 390'156 were measured to run for over a second.

(Attached plot: comparison of entry-point runtimes.)

CPP-6182

@llvmbot added the clang (Clang issues not falling into any other category) label on Jun 16, 2025
@llvmbot
Member

llvmbot commented Jun 16, 2025

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-static-analyzer-1

Author: Balázs Benics (balazs-benics-sonarsource)

Changes



Patch is 40.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/144327.diff

12 Files Affected:

  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h (+12-12)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h (+13-12)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h (+108-58)
  • (modified) clang/lib/StaticAnalyzer/Checkers/Taint.cpp (+1-1)
  • (modified) clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp (+1-1)
  • (modified) clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp (+1-1)
  • (modified) clang/lib/StaticAnalyzer/Core/RangeConstraintManager.cpp (+26-12)
  • (modified) clang/lib/StaticAnalyzer/Core/RangedConstraintManager.cpp (+11-4)
  • (modified) clang/lib/StaticAnalyzer/Core/SValBuilder.cpp (+48-41)
  • (modified) clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp (+18-11)
  • (modified) clang/lib/StaticAnalyzer/Core/SymbolManager.cpp (+6)
  • (added) clang/test/Analysis/ensure-capped-symbol-complexity.cpp (+53)
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
index 2911554de9d97..0458a6125db9a 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
@@ -57,6 +57,8 @@ class SValBuilder {
 protected:
   ASTContext &Context;
 
+  const AnalyzerOptions &AnOpts;
+
   /// Manager of APSInt values.
   BasicValueFactory BasicVals;
 
@@ -68,8 +70,6 @@ class SValBuilder {
 
   ProgramStateManager &StateMgr;
 
-  const AnalyzerOptions &AnOpts;
-
   /// The scalar type to use for array indices.
   const QualType ArrayIndexTy;
 
@@ -317,21 +317,21 @@ class SValBuilder {
     return nonloc::LocAsInteger(BasicVals.getPersistentSValWithData(loc, bits));
   }
 
-  nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
-                               APSIntPtr rhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+                                  APSIntPtr rhs, QualType type);
 
-  nonloc::SymbolVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
-                               const SymExpr *lhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
+                                  const SymExpr *lhs, QualType type);
 
-  nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
-                               const SymExpr *rhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+                                  const SymExpr *rhs, QualType type);
 
-  NonLoc makeNonLoc(const SymExpr *operand, UnaryOperator::Opcode op,
-                    QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand,
+                                  UnaryOperator::Opcode op, QualType type);
 
   /// Create a NonLoc value for cast.
-  nonloc::SymbolVal makeNonLoc(const SymExpr *operand, QualType fromTy,
-                               QualType toTy);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand, QualType fromTy,
+                                  QualType toTy);
 
   nonloc::ConcreteInt makeTruthVal(bool b, QualType type) {
     return nonloc::ConcreteInt(BasicVals.getTruthValue(b, type));
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
index aca14cf813c4b..11d0a22a31c46 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
@@ -51,9 +51,11 @@ class SymExpr : public llvm::FoldingSetNode {
   /// Note, however, that it can't be used in Profile because SymbolManager
   /// needs to compute Profile before allocating SymExpr.
   const SymbolID Sym;
+  const unsigned Complexity;
 
 protected:
-  SymExpr(Kind k, SymbolID Sym) : K(k), Sym(Sym) {}
+  SymExpr(Kind k, SymbolID Sym, unsigned Complexity)
+      : K(k), Sym(Sym), Complexity(Complexity) {}
 
   static bool isValidTypeForSymbol(QualType T) {
     // FIXME: Depending on whether we choose to deprecate structural symbols,
@@ -61,8 +63,6 @@ class SymExpr : public llvm::FoldingSetNode {
     return !T.isNull() && !T->isVoidType();
   }
 
-  mutable unsigned Complexity = 0;
-
 public:
   virtual ~SymExpr() = default;
 
@@ -108,7 +108,7 @@ class SymExpr : public llvm::FoldingSetNode {
     return llvm::make_range(symbol_iterator(this), symbol_iterator());
   }
 
-  virtual unsigned computeComplexity() const = 0;
+  unsigned complexity() const { return Complexity; }
 
   /// Find the region from which this symbol originates.
   ///
@@ -136,10 +136,15 @@ using SymbolRefSmallVectorTy = SmallVector<SymbolRef, 2>;
 /// A symbol representing data which can be stored in a memory location
 /// (region).
 class SymbolData : public SymExpr {
+  friend class SymbolManager;
   void anchor() override;
 
 protected:
-  SymbolData(Kind k, SymbolID sym) : SymExpr(k, sym) { assert(classof(this)); }
+  SymbolData(Kind k, SymbolID sym) : SymExpr(k, sym, computeComplexity()) {
+    assert(classof(this));
+  }
+
+  static unsigned computeComplexity(...) { return 1; }
 
 public:
   ~SymbolData() override = default;
@@ -147,14 +152,10 @@ class SymbolData : public SymExpr {
   /// Get a string representation of the kind of the region.
   virtual StringRef getKindStr() const = 0;
 
-  unsigned computeComplexity() const override {
-    return 1;
-  };
-
   // Implement isa<T> support.
-  static inline bool classof(const SymExpr *SE) {
-    Kind k = SE->getKind();
-    return k >= BEGIN_SYMBOLS && k <= END_SYMBOLS;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) {
+    return K >= BEGIN_SYMBOLS && K <= END_SYMBOLS;
   }
 };
 
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
index 86774ad5043dd..5239663788fb4 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
@@ -18,6 +18,7 @@
 #include "clang/AST/Type.h"
 #include "clang/Analysis/AnalysisDeclContext.h"
 #include "clang/Basic/LLVM.h"
+#include "clang/StaticAnalyzer/Core/AnalyzerOptions.h"
 #include "clang/StaticAnalyzer/Core/PathSensitive/APSIntPtr.h"
 #include "clang/StaticAnalyzer/Core/PathSensitive/MemRegion.h"
 #include "clang/StaticAnalyzer/Core/PathSensitive/StoreRef.h"
@@ -72,9 +73,9 @@ class SymbolRegionValue : public SymbolData {
   QualType getType() const override;
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolRegionValueKind;
-  }
+  static constexpr Kind ClassKind = SymbolRegionValueKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// A symbol representing the result of an expression in the case when we do
@@ -128,9 +129,9 @@ class SymbolConjured : public SymbolData {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolConjuredKind;
-  }
+  static constexpr Kind ClassKind = SymbolConjuredKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// A symbol representing the value of a MemRegion whose parent region has
@@ -172,9 +173,11 @@ class SymbolDerived : public SymbolData {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolDerivedKind;
+  static constexpr Kind ClassKind = SymbolDerivedKind;
+  static constexpr bool classof(const SymExpr *SE) {
+    return classof(SE->getKind());
   }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// SymbolExtent - Represents the extent (size in bytes) of a bounded region.
@@ -209,9 +212,9 @@ class SymbolExtent : public SymbolData {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolExtentKind;
-  }
+  static constexpr Kind ClassKind = SymbolExtentKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == SymbolExtentKind; }
 };
 
 /// SymbolMetadata - Represents path-dependent metadata about a specific region.
@@ -278,13 +281,14 @@ class SymbolMetadata : public SymbolData {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolMetadataKind;
-  }
+  static constexpr Kind ClassKind = SymbolMetadataKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// Represents a cast expression.
 class SymbolCast : public SymExpr {
+  friend class SymbolManager;
   const SymExpr *Operand;
 
   /// Type of the operand.
@@ -295,20 +299,19 @@ class SymbolCast : public SymExpr {
 
   friend class SymExprAllocator;
   SymbolCast(SymbolID Sym, const SymExpr *In, QualType From, QualType To)
-      : SymExpr(SymbolCastKind, Sym), Operand(In), FromTy(From), ToTy(To) {
+      : SymExpr(SymbolCastKind, Sym, computeComplexity(In, From, To)),
+        Operand(In), FromTy(From), ToTy(To) {
     assert(In);
     assert(isValidTypeForSymbol(From));
     // FIXME: GenericTaintChecker creates symbols of void type.
     // Otherwise, 'To' should also be a valid type.
   }
 
-public:
-  unsigned computeComplexity() const override {
-    if (Complexity == 0)
-      Complexity = 1 + Operand->computeComplexity();
-    return Complexity;
+  static unsigned computeComplexity(const SymExpr *In, QualType, QualType) {
+    return In->complexity() + 1;
   }
 
+public:
   QualType getType() const override { return ToTy; }
 
   LLVM_ATTRIBUTE_RETURNS_NONNULL
@@ -329,13 +332,14 @@ class SymbolCast : public SymExpr {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == SymbolCastKind;
-  }
+  static constexpr Kind ClassKind = SymbolCastKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// Represents a symbolic expression involving a unary operator.
 class UnarySymExpr : public SymExpr {
+  friend class SymbolManager;
   const SymExpr *Operand;
   UnaryOperator::Opcode Op;
   QualType T;
@@ -343,7 +347,8 @@ class UnarySymExpr : public SymExpr {
   friend class SymExprAllocator;
   UnarySymExpr(SymbolID Sym, const SymExpr *In, UnaryOperator::Opcode Op,
                QualType T)
-      : SymExpr(UnarySymExprKind, Sym), Operand(In), Op(Op), T(T) {
+      : SymExpr(UnarySymExprKind, Sym, computeComplexity(In, Op, T)),
+        Operand(In), Op(Op), T(T) {
     // Note, some unary operators are modeled as a binary operator. E.g. ++x is
     // modeled as x + 1.
     assert((Op == UO_Minus || Op == UO_Not) && "non-supported unary expression");
@@ -354,13 +359,12 @@ class UnarySymExpr : public SymExpr {
     assert(!Loc::isLocType(T) && "unary symbol should be nonloc");
   }
 
-public:
-  unsigned computeComplexity() const override {
-    if (Complexity == 0)
-      Complexity = 1 + Operand->computeComplexity();
-    return Complexity;
+  static unsigned computeComplexity(const SymExpr *In, UnaryOperator::Opcode,
+                                    QualType) {
+    return In->complexity() + 1;
   }
 
+public:
   const SymExpr *getOperand() const { return Operand; }
   UnaryOperator::Opcode getOpcode() const { return Op; }
   QualType getType() const override { return T; }
@@ -380,9 +384,9 @@ class UnarySymExpr : public SymExpr {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    return SE->getKind() == UnarySymExprKind;
-  }
+  static constexpr Kind ClassKind = UnarySymExprKind;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// Represents a symbolic expression involving a binary operator
@@ -391,8 +395,9 @@ class BinarySymExpr : public SymExpr {
   QualType T;
 
 protected:
-  BinarySymExpr(SymbolID Sym, Kind k, BinaryOperator::Opcode op, QualType t)
-      : SymExpr(k, Sym), Op(op), T(t) {
+  BinarySymExpr(SymbolID Sym, Kind k, BinaryOperator::Opcode op, QualType t,
+                unsigned Complexity)
+      : SymExpr(k, Sym, Complexity), Op(op), T(t) {
     assert(classof(this));
     // Binary expressions are results of arithmetic. Pointer arithmetic is not
     // handled by binary expressions, but it is instead handled by applying
@@ -408,14 +413,14 @@ class BinarySymExpr : public SymExpr {
   BinaryOperator::Opcode getOpcode() const { return Op; }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) {
-    Kind k = SE->getKind();
-    return k >= BEGIN_BINARYSYMEXPRS && k <= END_BINARYSYMEXPRS;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) {
+    return K >= BEGIN_BINARYSYMEXPRS && K <= END_BINARYSYMEXPRS;
   }
 
 protected:
   static unsigned computeOperandComplexity(const SymExpr *Value) {
-    return Value->computeComplexity();
+    return Value->complexity();
   }
   static unsigned computeOperandComplexity(const llvm::APSInt &Value) {
     return 1;
@@ -430,19 +435,28 @@ class BinarySymExpr : public SymExpr {
 };
 
 /// Template implementation for all binary symbolic expressions
-template <class LHSTYPE, class RHSTYPE, SymExpr::Kind ClassKind>
+template <class LHSTYPE, class RHSTYPE, SymExpr::Kind ClassK>
 class BinarySymExprImpl : public BinarySymExpr {
+  friend class SymbolManager;
   LHSTYPE LHS;
   RHSTYPE RHS;
 
   friend class SymExprAllocator;
   BinarySymExprImpl(SymbolID Sym, LHSTYPE lhs, BinaryOperator::Opcode op,
                     RHSTYPE rhs, QualType t)
-      : BinarySymExpr(Sym, ClassKind, op, t), LHS(lhs), RHS(rhs) {
+      : BinarySymExpr(Sym, ClassKind, op, t,
+                      computeComplexity(lhs, op, rhs, t)),
+        LHS(lhs), RHS(rhs) {
     assert(getPointer(lhs));
     assert(getPointer(rhs));
   }
 
+  static unsigned computeComplexity(LHSTYPE lhs, BinaryOperator::Opcode,
+                                    RHSTYPE rhs, QualType) {
+    // FIXME: Should we add 1 to complexity?
+    return computeOperandComplexity(lhs) + computeOperandComplexity(rhs);
+  }
+
 public:
   void dumpToStream(raw_ostream &os) const override {
     dumpToStreamImpl(os, LHS);
@@ -453,13 +467,6 @@ class BinarySymExprImpl : public BinarySymExpr {
   LHSTYPE getLHS() const { return LHS; }
   RHSTYPE getRHS() const { return RHS; }
 
-  unsigned computeComplexity() const override {
-    if (Complexity == 0)
-      Complexity =
-          computeOperandComplexity(RHS) + computeOperandComplexity(LHS);
-    return Complexity;
-  }
-
   static void Profile(llvm::FoldingSetNodeID &ID, LHSTYPE lhs,
                       BinaryOperator::Opcode op, RHSTYPE rhs, QualType t) {
     ID.AddInteger((unsigned)ClassKind);
@@ -474,7 +481,9 @@ class BinarySymExprImpl : public BinarySymExpr {
   }
 
   // Implement isa<T> support.
-  static bool classof(const SymExpr *SE) { return SE->getKind() == ClassKind; }
+  static constexpr Kind ClassKind = ClassK;
+  static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+  static constexpr bool classof(Kind K) { return K == ClassKind; }
 };
 
 /// Represents a symbolic expression like 'x' + 3.
@@ -489,6 +498,33 @@ using IntSymExpr = BinarySymExprImpl<APSIntPtr, const SymExpr *,
 using SymSymExpr = BinarySymExprImpl<const SymExpr *, const SymExpr *,
                                      SymExpr::Kind::SymSymExprKind>;
 
+struct MaybeSymExpr {
+  MaybeSymExpr() = default;
+  explicit MaybeSymExpr(SymbolRef Sym) : Sym(Sym) {}
+  bool isValid() const { return Sym; }
+  bool isInvalid() const { return !isValid(); }
+  SymbolRef operator->() const { return Sym; }
+
+  SymbolRef getOrNull() const { return Sym; }
+  template <typename SymT> const SymT *getOrNull() const {
+    return llvm::dyn_cast_if_present<SymT>(Sym);
+  }
+
+  DefinedOrUnknownSVal getOrUnknown() const {
+    if (isInvalid())
+      return UnknownVal();
+    return nonloc::SymbolVal(Sym);
+  }
+
+  nonloc::SymbolVal getOrAssert() const {
+    assert(Sym);
+    return nonloc::SymbolVal(Sym);
+  }
+
+private:
+  SymbolRef Sym = nullptr;
+};
+
 class SymExprAllocator {
   SymbolID NextSymbolID = 0;
   llvm::BumpPtrAllocator &Alloc;
@@ -518,27 +554,27 @@ class SymbolManager {
   SymExprAllocator Alloc;
   BasicValueFactory &BV;
   ASTContext &Ctx;
+  const unsigned MaxCompComplexity;
 
 public:
   SymbolManager(ASTContext &ctx, BasicValueFactory &bv,
-                llvm::BumpPtrAllocator &bpalloc)
-      : SymbolDependencies(16), Alloc(bpalloc), BV(bv), Ctx(ctx) {}
+                llvm::BumpPtrAllocator &bpalloc, const AnalyzerOptions &Opts)
+      : SymbolDependencies(16), Alloc(bpalloc), BV(bv), Ctx(ctx),
+        MaxCompComplexity(Opts.MaxSymbolComplexity) {
+    assert(MaxCompComplexity > 0 && "Zero max complexity doesn't make sense");
+  }
 
   static bool canSymbolicate(QualType T);
 
   /// Create or retrieve a SymExpr of type \p SymExprT for the given arguments.
   /// Use the arguments to check for an existing SymExpr and return it,
   /// otherwise, create a new one and keep a pointer to it to avoid duplicates.
-  template <typename SymExprT, typename... Args>
-  const SymExprT *acquire(Args &&...args);
+  template <typename SymExprT, typename... Args> auto acquire(Args &&...args);
 
   const SymbolConjured *conjureSymbol(ConstCFGElementRef Elem,
                                       const LocationContext *LCtx, QualType T,
                                       unsigned VisitCount,
-                                      const void *SymbolTag = nullptr) {
-
-    return acquire<SymbolConjured>(Elem, LCtx, T, VisitCount, SymbolTag);
-  }
+                                      const void *SymbolTag = nullptr);
 
   QualType getType(const SymExpr *SE) const {
     return SE->getType();
@@ -672,7 +708,16 @@ class SymbolVisitor {
 };
 
 template <typename T, typename... Args>
-const T *SymbolManager::acquire(Args &&...args) {
+auto SymbolManager::acquire(Args &&...args) {
+  constexpr bool IsSymbolData = SymbolData::classof(T::ClassKind);
+  if constexpr (IsSymbolData) {
+    assert(T::computeComplexity(args...) == 1);
+  } else {
+    if (T::computeComplexity(args...) > MaxCompComplexity) {
+      return MaybeSymExpr();
+    }
+  }
+
   llvm::FoldingSetNodeID profile;
   T::Profile(profile, args...);
   void *InsertPos;
@@ -681,7 +726,12 @@ const T *SymbolManager::acquire(Args &&...args) {
     SD = Alloc.make<T>(std::forward<Args>(args)...);
     DataSet.InsertNode(SD, InsertPos);
   }
-  return cast<T>(SD);
+
+  if constexpr (IsSymbolData) {
+    return cast<T>(SD);
+  } else {
+    return MaybeSymExpr(SD);
+  }
 }
 
 } // namespace ento
diff --git a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
index e55d064253b84..f0dc889f15e7a 100644
--- a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
@@ -267,7 +267,7 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
 
   // HACK:https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570
   if (const auto &Opts = State->getAnalysisManager().getAnalyzerOptions();
-      Sym->computeComplexity() > Opts.MaxTaintedSymbolComplexity) {
+      Sym->complexity() > Opts.MaxTaintedSymbolComplexity) {
     return {};
   }
 
diff --git a/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp b/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
index e2f8bd541c967..ab0e3d8f56d86 100644
--- a/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
@@ -66,7 +66,7 @@ class TrustNonnullChecker : public Checker<check::PostCall,
                              SVal Cond,
                              bool Assumption) const {
     const SymbolRef CondS = Cond.getAsSymbol();
-    if (!CondS || CondS->computeComplexity() > ComplexityThreshold)
+    if (!CondS || CondS->complexity() > ComplexityThreshold)
       return State;
 
     for (SymbolRef Antecedent : CondS->symbols()) {
diff --git a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
index fa8e669b6bb2f..3486485dcd686 100644
--- a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
+++ b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
@@ -67,7 +67,7 @@ void ExprEngine::VisitBinaryOperator(const BinaryOperator* B,
       if (RightV.isUnknown()) {
         unsigned Count = currBldrCtx->blockCount();
         RightV = svalBuilder.conjureSymbolVal(nullptr, getCFGElementRef(), LCtx,
-                                              Count);
+                                              RHS->getType(), Count);
       }
       // Simulate the effects of a "store":  bind the value of the RHS
       // to the L-Valu...
[truncated]

        RightV = svalBuilder.conjureSymbolVal(nullptr, getCFGElementRef(), LCtx,
-                                              Count);
+                                              RHS->getType(), Count);
Contributor Author


Interestingly, in #137355 @fangyi-zhou changed the behavior of this line, so it needed a tiny bit of adjustment to make the new test pass while I was uplifting this downstream patch to current llvm main.
I didn't investigate beyond noting that this was the line that conjured a symbol of the wrong type after #137355. Probably in the past we passed a QualType here directly, but after that change we rely on deducing the type from getCFGElementRef(), which is apparently wrong. To see the behavior, revert this hunk and see the broken test. There could be more places where this type mismatch on conjure could cause issues, but I didn't audit the code further.

Contributor

@NagyDonat left a comment


Unfortunately I'm not convinced that this is the right direction for improving the analyzer runtime.

On the "risks" side I think that adding the corner case that "this may also return UnknownVal in rare situations" into many functions complicates the logic, burdens the code with early return branches and I fear that it will act as a footgun.

On the "benefits" side I fear that your statistics don't prove enough:

  1. You found that "Out of the worst 500 entry points, 45 were improved by at least 10%. Out of these 45, 5 were improved by more than 50%. Out of these 45, 2 were improved by more than 80%." but this only covers 9% of the worst 500 entry points. Eyeballing the graph suggests that there are some cases where the runtime actually got worse -- so please check that the overall effect of the change is also positive (e.g. the total runtime is reduced meaningfully).
  2. Moreover, if "worst 500 entry points" means "worst 500 in the first run", then it is a biased sample: if you pick the worst outliers (i.e. the entry points where the sum of expected runtime + luck factor is largest), then you are expected to get entry points with worse than average luck (because among two similar entry points, the one with bad luck ends up in the worst 500 while the one with good luck avoids it), so if you redo the measurement, then regression toward the mean will produce better results -- even if you do both measurements with the same setup! As a sanity check, please redo the statistics on the entry points that produced the worst 500 runtimes in the second run -- I fear that on that sample (which is biased in the opposite direction) you will see that the new revision is worse than the baseline.
  3. I'm also interested in comparing the statistical results with a second independent measurement -- is the set of "worst 500 entry points" stable between runs, or are these random unlucky functions that are hit with environmental issues?

If you can share the raw data, I can help with statistical calculations.

Comment on lines +320 to +334
-  nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
-                               APSIntPtr rhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+                                  APSIntPtr rhs, QualType type);

-  nonloc::SymbolVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
-                               const SymExpr *lhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
+                                  const SymExpr *lhs, QualType type);

-  nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
-                               const SymExpr *rhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+                                  const SymExpr *rhs, QualType type);

-  NonLoc makeNonLoc(const SymExpr *operand, UnaryOperator::Opcode op,
-                    QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand,
+                                  UnaryOperator::Opcode op, QualType type);

   /// Create a NonLoc value for cast.
-  nonloc::SymbolVal makeNonLoc(const SymExpr *operand, QualType fromTy,
-                               QualType toTy);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand, QualType fromTy,
+                                  QualType toTy);
Contributor


I think this function must be renamed, because its current name strongly promises that it always returns a NonLoc.
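For context, a caller-side sketch of the relaxed contract (hypothetical helper, not from the patch; it assumes a context where the SValBuilder helpers are accessible, and LHS, RHSInt and Ty are placeholders):

// Hypothetical helper illustrating why the current name over-promises:
// with the new signature the result may be UnknownVal, not always a NonLoc.
DefinedOrUnknownSVal addToSymbol(SValBuilder &SVB, const SymExpr *LHS,
                                 APSIntPtr RHSInt, QualType Ty) {
  DefinedOrUnknownSVal V = SVB.makeNonLoc(LHS, BO_Add, RHSInt, Ty);
  if (V.isUnknown())
    return UnknownVal();                 // the complexity cap was hit
  return V.castAs<nonloc::SymbolVal>();  // otherwise it is still a SymbolVal
}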

@balazs-benics-sonarsource
Contributor Author


Could you paraphrase your concerns?

The way I read it, you have mainly two concerns:

  1. The use of this strong type makes the existing APIs tedious to use, because one needs to unwrap the value and frequently add an early return to explicitly handle the case when symbol creation fails?
  2. The measurements weren't conclusive. There was no evidence provided that in the usual cases (all entry points) the RT would not regress. It was also not fair to look only at the longest 500 entry points to evaluate the effectiveness of limiting the max symbol complexity (in other words, of honoring the max symbol complexity limit).

[...] regression toward the mean [...]

I formulated the test case in the tests by inspecting a long-running test case. It is consistently low-performing. You can also check on godbolt that it would not finish, because the symbol simplifier would need to do so much work due to walking the overly complicated symbols. I also inspected about 10 other radically improved cases (a couple of months ago), and all showed a large number of binary op manipulations like hashing, or the test case I supply in this patch. This doesn't seem to be a coincidence to me.
From my experience, our pool is large enough to be consistent and roughly reproducible for long-ish entry points. According to our data, usually an entry point should finish in about 1 second if not less. Above that suggests something to look at.
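For illustration (this is not the test file added by the patch, just a sketch of the hashing pattern described above), hash-style mixing of a symbolic value is the kind of code that keeps nesting binary SymExprs:

// Each iteration builds new binary symbolic expressions on top of the
// previous value of 'seed'; without a cap the symbol grows every round and
// the simplifier later has to walk all of it.
unsigned mix(const unsigned *data, unsigned n, unsigned seed) {
  for (unsigned i = 0; i < n; ++i) {
    seed ^= data[i];                      // SymSymExpr over the previous value
    seed = (seed << 6) + (seed >> 2) + i; // several more nested binary SymExprs
  }
  return seed;
}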

To me, encountering symbols with complexity over the dedicated max symbol complexity threshold is a bug.
Can you think of other ways to ensure we never create overly complicated symbols?

@NagyDonat
Contributor

NagyDonat commented Jun 16, 2025

The way I read it, you have mainly two concerns:

  1. The use of this strong type makes the existing APIs tedious to use, because one needs to unwrap the value and frequently add an early return to explicitly handle the case when symbol creation fails?

Yes, this is roughly what I mean. In addition to the tediousness of writing the unwrap + early return boilerplate I also fear that these early return branches would be difficult to keep in mind and difficult to cover with tests, so they will act as a persistent source of bugs.

  2. The measurements weren't conclusive. There was no evidence provided that in the usual cases (all entry points) the RT would not regress. It was also not fair to look only at the longest 500 entry points to evaluate the effectiveness of limiting the max symbol complexity (in other words, of honoring the max symbol complexity limit).

I don't claim that there was no evidence at all, but I feel that it wasn't significant enough.

[...] regression toward the mean [...]

I formulated the test case in the tests by inspecting a long-running test case. It is consistently low-performing. You can also check on godbolt that it would not finish, because the symbol simplifier would need to do so much work due to walking the overly complicated symbols. I also inspected about 10 other radically improved cases (a couple of months ago), and all showed a large number of binary op manipulations like hashing, or the test case I supply in this patch. This doesn't seem to be a coincidence to me. From my experience, our pool is large enough to be consistent and roughly reproducible for long-ish entry points.

This additional information reduces my fears that the measured runtime difference is just environmental noise. However, extreme outliers (and I'd say that the top 500 out of 3'000'000, or even just 390'000, are extreme outliers) are still very treacherous ground for statistical conclusions (they can amplify small noises to a surprising degree), so I would like to see a more comprehensive statistical analysis. If you can share the raw data (the 3 million {old runtime, new runtime} pairs), I can do these statistical calculations myself.

According to our data, usually an entry point should finish in about 1 second if not less. Above that suggests something to look at.

Fair, although I'd emphasize that they are not "must be eliminated" bugs, but just "something to look at" -- which may or may not lead to an improvement. I don't think that symbol complexity is a "low-hanging fruit" for slow entry points -- instead of this I'd suggest investigating the heuristics related to inlining and inlined function size. However this is far off-topic -- I'll try to eventually start a discourse thread about it once I clarify my suspicions.

To me, encountering symbols with complexity over the dedicated max symbol complexity threshold is a bug.

I don't agree -- very complex symbols naturally appear as we define the symbols with the straightforward recursive definition that we use, and they correspond to well-formed expressions. You can say that symbols over a certain complexity threshold are so rare that the analyzer can arbitrarily discard them, and this may (or may not) be a useful heuristic -- but still the existence of the complex symbols is the "natural" bug-free state and suppressing them is the inaccurate behavior which complicates the situation.

Can you think of other ways to ensure we never create overly complicated symbols?

I'm not convinced that we need to ensure that we never create overly complicated symbols. I feel that this is a "cure is worse than the disease" situation -- this patch introduces a significant amount of code complexity for modest gains in performance.

However, if the statistical analysis confirms that this is a useful direction, I'd suggest eliminating the complex symbols (more precisely, the states that contain them) in a lazy fashion: put a boolean tag on the state when a complex symbol is stored in it and prevent further exploration from exploded nodes that contain a tagged state. This way this performance optimization hack could be limited to the engine, and the checkers wouldn't need to know about it at all.
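A rough sketch of that lazy variant (names such as HasOverlyComplexSymbol and markIfTooComplex are hypothetical and not part of the patch; it assumes the complexity() accessor this PR introduces):

// Tag the state once an overly complex symbol gets stored into it; the
// engine could then stop exploring (e.g. generate a sink) from exploded
// nodes whose state carries the flag, instead of threading the cap through
// every symbol-construction API.
REGISTER_TRAIT_WITH_PROGRAMSTATE(HasOverlyComplexSymbol, bool)

static ProgramStateRef markIfTooComplex(ProgramStateRef State, SVal V,
                                        unsigned MaxComplexity) {
  if (SymbolRef Sym = V.getAsSymbol())
    if (Sym->complexity() > MaxComplexity)
      return State->set<HasOverlyComplexSymbol>(true);
  return State;
}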

@balazs-benics-sonarsource
Contributor Author

The way I read it, you have mainly two concerns:

  1. The use of this strong type makes the existing APIs tedious to use, because one needs to unwrap the value and frequently add an early return to explicitly handle the case when symbol creation fails?

Yes, this is roughly what I mean.

The purpose of the strong type is to prevent the code from compiling when the value is not unwrapped, so I don't have correctness concerns; rather the opposite, I feel safe about correctness. Speaking of the sprinkle of early returns, well, in C++ we don't really have monad support, so this is the best that I can think of. I'd consider other alternatives if you have any in mind. Otherwise, I proposed the best I have, so I don't know how to proceed on this point.

I'm very likely biased, but I find this code simple. Any operation that combines or otherwise "complicates" a symbol, can fail if it would hit the max complexity. So, in essence, in this refactor we would just use some optional type for representing the failure.
Some pieces of code want to handle such complexity failures differently, so the differently named getters will document the intentional choices - this is why a dedicated strong-type is better than just having a std::optional.
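As a sketch of those intentional choices (SymMgr, LHS, RHS and T are placeholders; the getters themselves are the ones added to SymbolManager.h in this patch):

// Illustrative only: the three getters spell out how a call site chooses to
// handle a symbol creation that failed because of the complexity cap.
void sketch(SymbolManager &SymMgr, const SymExpr *LHS, const SymExpr *RHS,
            QualType T) {
  MaybeSymExpr Sym = SymMgr.acquire<SymSymExpr>(LHS, BO_Mul, RHS, T);

  // Fall back to UnknownVal, as the makeNonLoc overloads do:
  DefinedOrUnknownSVal V = Sym.getOrUnknown();

  // Or propagate a null pointer and let the caller early-return:
  if (const SymSymExpr *SS = Sym.getOrNull<SymSymExpr>()) {
    // ... use SS ...
  }

  // Or assert, where the caller knows the cap cannot be exceeded:
  nonloc::SymbolVal SV = Sym.getOrAssert();
}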

In addition to the tediousness of writing the unwrap + early return boilerplate I also fear that these early return branches would be difficult to keep in mind and difficult to cover with tests, so they will act as a persistent source of bugs.

I find it unnecessary to cover every possible combination of where we can reach the max symbol complexity. As you commented, there are many early-return scenarios, but they also share a common structure, allowing us to reason about them structurally.

  2. The measurements weren't conclusive. There was no evidence provided that in the usual cases (all entry points) the RT would not regress. It was also not fair to look only at the longest 500 entry points to evaluate the effectiveness of limiting the max symbol complexity (in other words, of honoring the max symbol complexity limit).

I don't claim that there was no evidence at all, but I feel that it wasn't significant enough.

If I recall, there was no measurable difference at all. This was expected, as this patch would not change the common case, when the symbols are not getting complicated. Remember, sometimes the max complexity threshold was already obeyed (hence we had that threshold), but in other edge cases it was not, such as the case in the test. Judging the patch, I also don't see any particular reason why I should be more alert about RT regression if we only avoid work by honoring this threshold - other than the couple of if statements for the early return cases.

This additional information reduces my fears that the measured runtime difference is just environmental noise. However, extreme outliers (and I'd say that the top 500 out of 3'000'000, or even just 390'000, are extreme outliers) are still very treacherous ground for statistical conclusions (they can amplify small noises to a surprising degree), so I would like to see a more comprehensive statistical analysis. If you can share the raw data (the 3 million {old runtime, new runtime} pairs), I can do these statistical calculations myself.

I should have explained why I picked the top 500, indeed. I had a look at the distribution, and there was a significant increase in the trend. -- I managed to find the script generating the plot along with the data used, but the data contains sensitive projects intermixed.
From what I can see there I picked the top 500 because if we exclude those, we end up with 14 seconds of analysis time max per entry point - according to my notes.

According to our data, usually an entry point should finish in about 1 second if not less. Above that suggests something to look at.

Fair, although I'd emphasize that they are not "must be eliminated" bugs, but just "something to look at" -- which may or may not lead to an improvement. I don't think that symbol complexity is a "low-hanging fruit" for slow entry points -- instead of this I'd suggest investigating the heuristics related to inlining and inlined function size. However this is far off-topic -- I'll try to eventually start a discourse thread about it once I clarify my suspicions.

A predictable upper bound is really important to scale the analysis and to bound the total analysis time of a TU.
This is just a pathological case when the analyzer behaves exceptionally poorly. From the investigated cases, I can share that the slowest entry points are caused by doing lookups in the environment and by spending significant time in removing dead bindings. Refer to the unsuccessful attempts at optimizing that in the relevant blog post.
However, the second reason why we had slow entry points was this one, although it impacts entry points a lot less often. You can see that only a few of the slowest entry points improved by a lot; quoting:

Out of the worst 500 entry points, 45 were improved by at least 10%. Out of these 45, 5 were improved by more than 50%. Out of these 45, 2 were improved by more than 80%.

So, maybe a better way to look at this patch is by emphasizing less the RT improvements of the pathological few cases, but rather just enforcing an invariant about max symbol complexity. This patch is not a "low-hanging fruit" for improving the averages or the 99% of the cases. This is about the 45 cases out of 3'000'000, and the unlucky users who happen to have similar code to those 45 cases.

I agree that more intrusive changes in the heuristics offer much more room for potential improvement, but also at an increased perturbation in the results. I'm genuinely surprised that a patch like this that did not change any reports (beyond the usual noise levels), nor the average or the 99th percentile RT, brings an enforcement of an invariant and on top of all this also fixes a tiny fraction of pathological cases.

Maybe my tactic should have been to emphasize the invariant aspects more than the RT improvements, in hindsight.

To me, encountering symbols with complexity over the dedicated max symbol complexity threshold is a bug.

I don't agree -- very complex symbols naturally appear as we define the symbols with the straightforward recursive definition that we use, and they correspond to well-formed expressions. You can say that symbols over a certain complexity threshold are so rare that the analyzer can arbitrarily discard them, and this may (or may not) be a useful heuristic -- but still the existence of the complex symbols is the "natural" bug-free state and suppressing them is the inaccurate behavior which complicates the situation.

Let me share what happens when the threshold is hit. I think this may have caused the misunderstanding. Having UnknownVal poisons a computation, and this patch may suggest at first glance that we issue UnknownVal a lot more often.

However, if you look carefully at the test provided, you can see that there is no Unknown. It uses a fresh conjured symbol as a substitute for the Unknown that the SValBuilder would create. If you look closer, you can see that the rest of the computation goes on and on, and each time the max complexity is reached, it substitutes that with a new conjured symbol - keeping the effective complexity small, without compromising the expressiveness of the history of the symbol. One way to look at this is by judging how likely it is that we could reuse some information from the complicated SymExpr to conclude some range facts (when simplifying). The argument is that it's very unlikely, and even harder to explain later how we reached that conclusion for a symbol range. The engine does this Unknown-to-conjured substitution in different places (I can't recall where exactly), but one instance is that when trying to bind Unknown to a variable, it will just bind a fresh conjured symbol instead - to recover future precision, without poisoning subsequent operations. So in short, if we return Unknown here, that actually always materializes a conjured symbol. As I'm writing this, I strongly believe that this was the source of confusion and pushback against this patch.
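To illustrate that substitution in analyzer-test terms (a sketch, not the test shipped with the patch; clang_analyzer_dump comes from the debug.ExprInspection checker and the exact dump spellings vary):

// Small values stay symbolic expressions; once the complexity cap would be
// exceeded, the value bound to the variable becomes a fresh conjured symbol
// instead of a huge nested expression, and analysis continues with it.
void clang_analyzer_dump(int);

void substitution(int x) {
  int y = x + 1;
  clang_analyzer_dump(y); // a small symbolic expression, e.g. (reg_$0<int x>) + 1
  // ...after enough combining operations the dump would show something like
  // conj_$N{int, ...} rather than an ever-growing SymExpr tree.
}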

Can you think of other ways to ensure we never create overly complicated symbols?

I'm not convinced that we need to ensure that we never create overly complicated symbols. I feel that this is a "cure is worse than the disease" situation -- this patch introduces a significant amount of code complexity for modest gains in performance.

By reading this, I have the impression you argue against having such a threshold. While I can't say why that threshold was introduced, I argue that they had a motivation. Right now, we have this threshold, but sometimes we still have overly complicated symbols. This is objectively not a good place to be in. If we didn't have this threshold, indeed, this code would make little sense.
However, assuming that we need this threshold, this code is necessary to obey this threshold in a systematic way.
If you check out the code and run the hugelyOverComplicatedSymbol() test, it ran for 14 seconds on my system and produced a symbol with a complexity of 800. The problem there is not that it's large, but rather that it's unbounded, for no particularly good reason or prospect of future use.

However, if the statistical analysis confirms that this is a useful direction, I'd suggest eliminating the complex symbols (more precisely, the states that contain them) in a lazy fashion: put a boolean tag on the state when a complex symbol is stored in it and prevent further exploration from exploded nodes that contain a tagged state.

I'll possibly come back to this point, but not as part of this response.

This way this performance optimization hack could be limited to the engine, and the checkers wouldn't need to know about it at all.

Checkers wouldn't need to know about this at all. They should know that sometimes an evalBinOp returns Unknown. This was previously the contract there, and that would remain the case. The contract of the makeNonLoc APIs was relaxed as part of this patch, but that API is not used by checkers, nor should it be.
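In checker terms, the unchanged contract looks like this (a sketch; State, LHS, RHS and Ty are placeholders):

// evalBinOp could already return UnknownVal before this patch, so a checker
// keeps the same bail-out it always needed.
void checkerSideSketch(SValBuilder &SVB, ProgramStateRef State, SVal LHS,
                       SVal RHS, QualType Ty) {
  SVal Result = SVB.evalBinOp(State, BO_Add, LHS, RHS, Ty);
  if (Result.isUnknown())
    return; // nothing new to handle here
  // ... otherwise reason about Result as usual ...
}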

@NagyDonat
Contributor

NagyDonat commented Jun 17, 2025

SUMMARY:

My primary concern is the readability of the refactored code -- for details see the inline answers. (My secondary concern is that I don't entirely trust the statistics, but even if I take your conclusions at face value, I don't see why they justify the added complexity.)

I didn't even think about the issue that returning Unknown "poisons the computation", so it wasn't a factor in my pushback.

You had good points in several areas (questions about correctness and testing, no overall RT regression, no impact on checkers).


FULL REPLY:

The way I read it, you have mainly two concerns:

  1. The use of this strong type makes the existing APIs tedious to use, because one needs to unwrap the value and frequently add an early return to explicitly handle the case when symbol creation fails?

Yes, this is roughly what I mean.

The purpose of the strong type is to prevent the code from compiling when the value is not unwrapped, so I don't have correctness concerns; rather the opposite, I feel safe about correctness.

I did a more thorough re-read of the patch and it reduced my worries about correctness issues.

Speaking of the sprinkle of early returns, well, in C++ we don't really have monad support, so this is the best that I can think of. I'd consider other alternatives if you have any in mind. Otherwise, I proposed the best I have, so I don't know how to proceed on this point.

Unfortunately, my re-read solidified my impression that the additional verbosity of the wrap-unwrap methods and early return branches severely hinders the readability of the refactored methods – which are already very complicated without these additional nonstandard types. (The main issue is complexity, not the raw number of tokens -- so the situation wouldn't be much better in a language like Haskell that properly supports monads.)

Unfortunately I cannot suggest a more readable implementation, but I still feel that doing nothing would be better than merging this commit, because the limited benefits (improvements in 0.0015% of the entry points, no observable change of overall runtime) do not justify the added code complexity. (If this commit was the only possible fix for a commonly occurring crash, I'd accept it.)

I'm very likely biased, but I find this code simple.

You are biased :) -- which is natural as the author of the source code.

Any operation that combines or otherwise "complicates" a symbol, can fail if it would hit the max complexity. So, in essence, in this refactor we would just use some optional type for representing the failure.

Your implementation is a relatively simple natural solution for the goal that symbol complexity should be limited and it is reasonably readable when presented as a diff. However, you are injecting this new goal (a single "thread") into complex methods that already fulfill many other goals ("big Gordian knot of several other threads") and the difficulty of reading code increases exponentially with the number of intermixed goals (even if each of them is simple in isolation).

As I am reading the source code, I need to resolve each name that appears (jump to definition – "oh, this is just a wrapper for that") and mentally separate all the different goals ("these early returns handle nullptrs", "here we check whether swapping the operands help", "this handles an obscure corner case for ObjC", "this limits symbol complexity"), and adding yet another goal significantly complicates this, even if that goal is simple in isolation / when it is highlighted for review as a diff. As return statements interrupt the control flow, they are especially taxing for the reader – to comprehend the behavior of a function, all return statements must be simultaneously juggled in the mind.

Some pieces of code want to handle such complexity failures differently, so the differently named getters will document the intentional choices - this is why a dedicated strong-type is better than just having a std::optional.

I agree that using this strong type is (at least slightly) better than std::optional – even if std::optional has the big advantage that most readers have its definition "cached in their mind", while your custom strong type will need a few "jump to definition" calls for any code that uses it.

In addition to the tediousness of writing the unwrap + early return boilerplate I also fear that these early return branches would be difficult to keep in mind and difficult to cover with tests, so they will act as a persistent source of bugs.

I find it unnecessary to cover every possible combination of where we can reach the max symbol complexity. As you commented, there are many early-return scenarios, but they also share a common structure, allowing us to reason about them structurally.

Ok, perhaps you're right. I don't see why the test coverage is sufficient, but I believe you if you say so.

  2. The measurements weren't conclusive. There was no evidence provided that in the usual cases (all entry points) the RT would not regress. It was also not fair to look only at the longest 500 entry points to evaluate the effectiveness of limiting the max symbol complexity (in other words, of honoring the max symbol complexity limit).

I don't claim that there was no evidence at all, but I feel that it wasn't significant enough.

If I recall, there was no measurable difference at all. This was expected, as this patch would not change the common case, when the symbols are not getting complicated. Remember, sometimes the max complexity threshold was already obeyed (hence we had that threshold), but in other edge cases it was not, such as the case in the test. Judging the patch, I also don't see any particular reason why I should be more alert about RT regression if we only avoid work by honoring this threshold - other than the couple of if statements for the early return cases.

I didn't feel that overall RT regression was too likely -- although the runtime of the analyzer is very chaotic, so I feel there is a low baseline chance of RT regression for almost every commit. (Since the regressions caused by my "don't assume third iteration" commit I'm a bit paranoid...)

This additional information reduces my fears that the measured runtime difference is just environmental noise. However, extreme outliers (and I'd say that the top 500 out of 3'000'000, or even just 390'000, are extreme outliers) are still very treacherous ground for statistical conclusions (they can amplify small noises to a surprising degree), so I would like to see a more comprehensive statistical analysis. If you can share the raw data (the 3 million {old runtime, new runtime} pairs), I can do these statistical calculations myself.

I should have explained why I picked the top 500, indeed. I had a look at the distribution, and there was a significant increase in the trend.

Ok, I see – but this doesn't change that statistical calculations based on outliers are very shaky. It is possible to rigorously validate an effect like "this change has negligible effect overall, but reduces runtime for a subclass of outliers", but it would require significantly more complex statistical tools than what you used.

-- I managed to find the script generating the plot along with the data used, but the data contains sensitive projects intermixed. From what I can see there I picked the top 500 because if we exclude those, we end up with 14 seconds of analysis time max per entry point - according to my notes.

Do you think that the runtime values are sensitive even if you strip the names of the entry points and just share a big set of pairs of numbers? I would say that it's mathematically impossible to divine meaningful properties from the set of analysis runtime values, especially if you mix together many software projects.

According to our data, usually an entry point should finish in about 1 second if not less. Above that suggests something to look at.

Fair, although I'd emphasize that they are not "must be eliminated" bugs, but just "something to look at" -- which may or may not lead to an improvement. [...]

A predictable upper bound is really important to scale the analysis and to bound the total analysis time of a TU.

I don't agree with this -- I think it's perfectly possible to bound the total analysis runtime for all real-world code without spending extraordinary efforts and code complexity on policing outliers whose total runtime is negligible.

[...] So, maybe a better way to look at this patch is by emphasizing less the RT improvements of the pathological few cases, but rather just enforcing an invariant about max symbol complexity.

Why would you want to enforce this invariant if not for the RT improvements that it provides?

This patch is not a "low-hanging fruit" for improving the averages or the 99% of the cases. This is about the 45 cases out of 3'000'000, and the unlucky users who happen to have similar code to those 45 cases.

Are those unlucky users really "unlucky" in a meaningful way? Even if analyzing that entry point takes half a minute instead of a second, it's still a negligible time compared to the analysis of the full project. (Are you perhaps thinking about integrating the analyzer into a code editor, where quick re-analysis after edits makes the runtime more relevant?)

I agree that more intrusive changes in the heuristics offer much more room for potential improvement, but also at an increased perturbation in the results. I'm genuinely surprised that a patch like this that did not change any reports (beyond the usual noise levels), nor the average or the 99th percentile RT, brings an enforcement of an invariant and on top of all this also fixes a tiny fraction of pathological cases.

Maybe my tactic should have been to emphasize the invariant aspects more than the RT improvements, in hindsight.

To me, encountering symbols with complexity over the dedicated max symbol complexity threshold is a bug.

I don't agree -- very complex symbols naturally appear as we define the symbols with the straightforward recursive definition that we use, and they correspond to well-formed expressions. You can say that symbols over a certain complexity threshold are so rare that the analyzer can arbitrarily discard them, and this may (or may not) be a useful heuristic -- but still the existence of the complex symbols is the "natural" bug-free state and suppressing them is the inaccurate behavior which complicates the situation.

Let me share what happens when the threshold is hit. I think this may have caused the misunderstanding. Having UnknownVal poisons a computation, and this patch may suggest at first glance that we issue UnknownVal a lot more often.

However, if you look carefully at the test provided, you can see that there is no Unknown. It uses a fresh conjured symbol as a substitute for the Unknown that the SValBuilder would create. If you look closer, you can see that the rest of the computation goes on and on, and each time the max complexity is reached, it substitutes that with a new conjured symbol - keeping the effective complexity small, without compromising the expressiveness of the history of the symbol. One way to look at this is by judging how likely it is that we could reuse some information from the complicated SymExpr to conclude some range facts (when simplifying). The argument is that it's very unlikely, and even harder to explain later how we reached that conclusion for a symbol range. The engine does this Unknown-to-conjured substitution in different places (I can't recall where exactly), but one instance is that when trying to bind Unknown to a variable, it will just bind a fresh conjured symbol instead - to recover future precision, without poisoning subsequent operations. So in short, if we return Unknown here, that actually always materializes a conjured symbol. As I'm writing this, I strongly believe that this was the source of confusion and pushback against this patch.

No, this is not the source of the pushback; I didn't even think about this question. My primary concern is the loss of source code readability (which will slow down any further developer who needs to understand or edit those areas) and the secondary concern is that I don't trust the statistical conclusions. (However, I'd suggest not merging this PR even if it did produce all the runtime improvements that you claim based on your statistics.)

Can you think of other ways to ensure we never create overly complicated symbols?

I'm not convinced that we need to ensure that we never create overly complicated symbols. I feel that this is a "cure is worse than the disease" situation -- this patch introduces a significant amount of code complexity for modest gains in performance.

By reading this, I have the impression you argue against having such a threshold. While I can't say why that threshold was introduced, I argue that they had a motivation.

If the threshold and its current partial enforcement were introduced with good cause (i.e. to avoid a crash or a slowdown that's visible on full projects), then I support its existence.

Right now, we have this threshold, but sometimes we still have overly complicated symbols. This is objectively not a good place to be in. If we didn't have this threshold, indeed, this code would make little sense. However, assuming that we need this threshold, this code is necessary to obey this threshold in a systematic way.

I don't support systematically enforcing the threshold if it complicates our code base and doesn't provide meaningful benefits for the user. This threshold is just a heuristic limitation of the engine, so I don't see a significant difference in elegance between the status quo (partial enforcement IIUC?) and full enforcement.

If you check out the code and run the hugelyOverComplicatedSymbol() test, it ran for 14 seconds on my system and produced a symbol with a complexity of 800. The problem there is not that it's large, but rather that it's unbounded, for no particularly good reason or prospect of future use.

The static analyzer doesn't guarantee acceptable runtimes on artificially crafted code -- I'd guess that it's not too difficult to craft many analogous examples that target various parts of the analyzer code. Unless it surfaces in real code with meaningful frequency, I don't think that it justifies complex code changes.

However, if the statistical analysis confirms that this is a useful direction, I'd suggest eliminating the complex symbols (more precisely, the states that contain them) in a lazy fashion: put a boolean tag on the state when a complex symbol is stored in it and prevent further exploration from exploded nodes that contain a tagged state.

I'll possibly come back to this point, but not as part of this response.

Rereading my suggestion, I'm less confident about it – don't bother with revisiting it unless you find it especially intriguing.

This way this performance optimization hack could be limited to the engine, and the checkers wouldn't need to know about it at all.

Checkers wouldn't need to know about this at all. They should know that sometimes an evalBinOp returns Unknown. This was previously the contract there, and that would remain the case. The contract of the makeNonLoc APIs was relaxed as part of this patch, but that API is not used by checkers, nor should it be.

Good point, I misunderstood this. Sorry!

Labels: clang:static analyzer, clang (Clang issues not falling into any other category)
Projects: None yet
3 participants