[analyzer] Enforce not making overly complicated symbols #144327
base: main
Conversation
Out of the worst 500 entry points, 45 were improved by at least 10%. Out of these 45, 5 were improved by more than 50%, and 2 were improved by more than 80%.

For example, for the `DelugeFirmware/src/OSLikeStuff/fault_handler/fault_handler.c` TU:
- The `printPointers` entry point improved from 31.1 seconds to 1.1 seconds (28x).
- The `handle_cpu_fault` entry point improved from 15.5 seconds to 3 seconds (5x).

We had 3'037'085 entry points in total in the test pool. Of these, 390'156 were measured to run for over a second.

TODO: Add the plot here about RT.

CPP-6182
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-static-analyzer-1

Author: Balázs Benics (balazs-benics-sonarsource)

Patch is 40.01 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/144327.diff

12 Files Affected:
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
index 2911554de9d97..0458a6125db9a 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h
@@ -57,6 +57,8 @@ class SValBuilder {
protected:
ASTContext &Context;
+ const AnalyzerOptions &AnOpts;
+
/// Manager of APSInt values.
BasicValueFactory BasicVals;
@@ -68,8 +70,6 @@ class SValBuilder {
ProgramStateManager &StateMgr;
- const AnalyzerOptions &AnOpts;
-
/// The scalar type to use for array indices.
const QualType ArrayIndexTy;
@@ -317,21 +317,21 @@ class SValBuilder {
return nonloc::LocAsInteger(BasicVals.getPersistentSValWithData(loc, bits));
}
- nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
- APSIntPtr rhs, QualType type);
+ DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+ APSIntPtr rhs, QualType type);
- nonloc::SymbolVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
- const SymExpr *lhs, QualType type);
+ DefinedOrUnknownSVal makeNonLoc(APSIntPtr rhs, BinaryOperator::Opcode op,
+ const SymExpr *lhs, QualType type);
- nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
- const SymExpr *rhs, QualType type);
+ DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+ const SymExpr *rhs, QualType type);
- NonLoc makeNonLoc(const SymExpr *operand, UnaryOperator::Opcode op,
- QualType type);
+ DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand,
+ UnaryOperator::Opcode op, QualType type);
/// Create a NonLoc value for cast.
- nonloc::SymbolVal makeNonLoc(const SymExpr *operand, QualType fromTy,
- QualType toTy);
+ DefinedOrUnknownSVal makeNonLoc(const SymExpr *operand, QualType fromTy,
+ QualType toTy);
nonloc::ConcreteInt makeTruthVal(bool b, QualType type) {
return nonloc::ConcreteInt(BasicVals.getTruthValue(b, type));
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
index aca14cf813c4b..11d0a22a31c46 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
@@ -51,9 +51,11 @@ class SymExpr : public llvm::FoldingSetNode {
/// Note, however, that it can't be used in Profile because SymbolManager
/// needs to compute Profile before allocating SymExpr.
const SymbolID Sym;
+ const unsigned Complexity;
protected:
- SymExpr(Kind k, SymbolID Sym) : K(k), Sym(Sym) {}
+ SymExpr(Kind k, SymbolID Sym, unsigned Complexity)
+ : K(k), Sym(Sym), Complexity(Complexity) {}
static bool isValidTypeForSymbol(QualType T) {
// FIXME: Depending on whether we choose to deprecate structural symbols,
@@ -61,8 +63,6 @@ class SymExpr : public llvm::FoldingSetNode {
return !T.isNull() && !T->isVoidType();
}
- mutable unsigned Complexity = 0;
-
public:
virtual ~SymExpr() = default;
@@ -108,7 +108,7 @@ class SymExpr : public llvm::FoldingSetNode {
return llvm::make_range(symbol_iterator(this), symbol_iterator());
}
- virtual unsigned computeComplexity() const = 0;
+ unsigned complexity() const { return Complexity; }
/// Find the region from which this symbol originates.
///
@@ -136,10 +136,15 @@ using SymbolRefSmallVectorTy = SmallVector<SymbolRef, 2>;
/// A symbol representing data which can be stored in a memory location
/// (region).
class SymbolData : public SymExpr {
+ friend class SymbolManager;
void anchor() override;
protected:
- SymbolData(Kind k, SymbolID sym) : SymExpr(k, sym) { assert(classof(this)); }
+ SymbolData(Kind k, SymbolID sym) : SymExpr(k, sym, computeComplexity()) {
+ assert(classof(this));
+ }
+
+ static unsigned computeComplexity(...) { return 1; }
public:
~SymbolData() override = default;
@@ -147,14 +152,10 @@ class SymbolData : public SymExpr {
/// Get a string representation of the kind of the region.
virtual StringRef getKindStr() const = 0;
- unsigned computeComplexity() const override {
- return 1;
- };
-
// Implement isa<T> support.
- static inline bool classof(const SymExpr *SE) {
- Kind k = SE->getKind();
- return k >= BEGIN_SYMBOLS && k <= END_SYMBOLS;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) {
+ return K >= BEGIN_SYMBOLS && K <= END_SYMBOLS;
}
};
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
index 86774ad5043dd..5239663788fb4 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
@@ -18,6 +18,7 @@
#include "clang/AST/Type.h"
#include "clang/Analysis/AnalysisDeclContext.h"
#include "clang/Basic/LLVM.h"
+#include "clang/StaticAnalyzer/Core/AnalyzerOptions.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/APSIntPtr.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/MemRegion.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/StoreRef.h"
@@ -72,9 +73,9 @@ class SymbolRegionValue : public SymbolData {
QualType getType() const override;
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolRegionValueKind;
- }
+ static constexpr Kind ClassKind = SymbolRegionValueKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// A symbol representing the result of an expression in the case when we do
@@ -128,9 +129,9 @@ class SymbolConjured : public SymbolData {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolConjuredKind;
- }
+ static constexpr Kind ClassKind = SymbolConjuredKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// A symbol representing the value of a MemRegion whose parent region has
@@ -172,9 +173,11 @@ class SymbolDerived : public SymbolData {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolDerivedKind;
+ static constexpr Kind ClassKind = SymbolDerivedKind;
+ static constexpr bool classof(const SymExpr *SE) {
+ return classof(SE->getKind());
}
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// SymbolExtent - Represents the extent (size in bytes) of a bounded region.
@@ -209,9 +212,9 @@ class SymbolExtent : public SymbolData {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolExtentKind;
- }
+ static constexpr Kind ClassKind = SymbolExtentKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == SymbolExtentKind; }
};
/// SymbolMetadata - Represents path-dependent metadata about a specific region.
@@ -278,13 +281,14 @@ class SymbolMetadata : public SymbolData {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolMetadataKind;
- }
+ static constexpr Kind ClassKind = SymbolMetadataKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// Represents a cast expression.
class SymbolCast : public SymExpr {
+ friend class SymbolManager;
const SymExpr *Operand;
/// Type of the operand.
@@ -295,20 +299,19 @@ class SymbolCast : public SymExpr {
friend class SymExprAllocator;
SymbolCast(SymbolID Sym, const SymExpr *In, QualType From, QualType To)
- : SymExpr(SymbolCastKind, Sym), Operand(In), FromTy(From), ToTy(To) {
+ : SymExpr(SymbolCastKind, Sym, computeComplexity(In, From, To)),
+ Operand(In), FromTy(From), ToTy(To) {
assert(In);
assert(isValidTypeForSymbol(From));
// FIXME: GenericTaintChecker creates symbols of void type.
// Otherwise, 'To' should also be a valid type.
}
-public:
- unsigned computeComplexity() const override {
- if (Complexity == 0)
- Complexity = 1 + Operand->computeComplexity();
- return Complexity;
+ static unsigned computeComplexity(const SymExpr *In, QualType, QualType) {
+ return In->complexity() + 1;
}
+public:
QualType getType() const override { return ToTy; }
LLVM_ATTRIBUTE_RETURNS_NONNULL
@@ -329,13 +332,14 @@ class SymbolCast : public SymExpr {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == SymbolCastKind;
- }
+ static constexpr Kind ClassKind = SymbolCastKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// Represents a symbolic expression involving a unary operator.
class UnarySymExpr : public SymExpr {
+ friend class SymbolManager;
const SymExpr *Operand;
UnaryOperator::Opcode Op;
QualType T;
@@ -343,7 +347,8 @@ class UnarySymExpr : public SymExpr {
friend class SymExprAllocator;
UnarySymExpr(SymbolID Sym, const SymExpr *In, UnaryOperator::Opcode Op,
QualType T)
- : SymExpr(UnarySymExprKind, Sym), Operand(In), Op(Op), T(T) {
+ : SymExpr(UnarySymExprKind, Sym, computeComplexity(In, Op, T)),
+ Operand(In), Op(Op), T(T) {
// Note, some unary operators are modeled as a binary operator. E.g. ++x is
// modeled as x + 1.
assert((Op == UO_Minus || Op == UO_Not) && "non-supported unary expression");
@@ -354,13 +359,12 @@ class UnarySymExpr : public SymExpr {
assert(!Loc::isLocType(T) && "unary symbol should be nonloc");
}
-public:
- unsigned computeComplexity() const override {
- if (Complexity == 0)
- Complexity = 1 + Operand->computeComplexity();
- return Complexity;
+ static unsigned computeComplexity(const SymExpr *In, UnaryOperator::Opcode,
+ QualType) {
+ return In->complexity() + 1;
}
+public:
const SymExpr *getOperand() const { return Operand; }
UnaryOperator::Opcode getOpcode() const { return Op; }
QualType getType() const override { return T; }
@@ -380,9 +384,9 @@ class UnarySymExpr : public SymExpr {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- return SE->getKind() == UnarySymExprKind;
- }
+ static constexpr Kind ClassKind = UnarySymExprKind;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// Represents a symbolic expression involving a binary operator
@@ -391,8 +395,9 @@ class BinarySymExpr : public SymExpr {
QualType T;
protected:
- BinarySymExpr(SymbolID Sym, Kind k, BinaryOperator::Opcode op, QualType t)
- : SymExpr(k, Sym), Op(op), T(t) {
+ BinarySymExpr(SymbolID Sym, Kind k, BinaryOperator::Opcode op, QualType t,
+ unsigned Complexity)
+ : SymExpr(k, Sym, Complexity), Op(op), T(t) {
assert(classof(this));
// Binary expressions are results of arithmetic. Pointer arithmetic is not
// handled by binary expressions, but it is instead handled by applying
@@ -408,14 +413,14 @@ class BinarySymExpr : public SymExpr {
BinaryOperator::Opcode getOpcode() const { return Op; }
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) {
- Kind k = SE->getKind();
- return k >= BEGIN_BINARYSYMEXPRS && k <= END_BINARYSYMEXPRS;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) {
+ return K >= BEGIN_BINARYSYMEXPRS && K <= END_BINARYSYMEXPRS;
}
protected:
static unsigned computeOperandComplexity(const SymExpr *Value) {
- return Value->computeComplexity();
+ return Value->complexity();
}
static unsigned computeOperandComplexity(const llvm::APSInt &Value) {
return 1;
@@ -430,19 +435,28 @@ class BinarySymExpr : public SymExpr {
};
/// Template implementation for all binary symbolic expressions
-template <class LHSTYPE, class RHSTYPE, SymExpr::Kind ClassKind>
+template <class LHSTYPE, class RHSTYPE, SymExpr::Kind ClassK>
class BinarySymExprImpl : public BinarySymExpr {
+ friend class SymbolManager;
LHSTYPE LHS;
RHSTYPE RHS;
friend class SymExprAllocator;
BinarySymExprImpl(SymbolID Sym, LHSTYPE lhs, BinaryOperator::Opcode op,
RHSTYPE rhs, QualType t)
- : BinarySymExpr(Sym, ClassKind, op, t), LHS(lhs), RHS(rhs) {
+ : BinarySymExpr(Sym, ClassKind, op, t,
+ computeComplexity(lhs, op, rhs, t)),
+ LHS(lhs), RHS(rhs) {
assert(getPointer(lhs));
assert(getPointer(rhs));
}
+ static unsigned computeComplexity(LHSTYPE lhs, BinaryOperator::Opcode,
+ RHSTYPE rhs, QualType) {
+ // FIXME: Should we add 1 to complexity?
+ return computeOperandComplexity(lhs) + computeOperandComplexity(rhs);
+ }
+
public:
void dumpToStream(raw_ostream &os) const override {
dumpToStreamImpl(os, LHS);
@@ -453,13 +467,6 @@ class BinarySymExprImpl : public BinarySymExpr {
LHSTYPE getLHS() const { return LHS; }
RHSTYPE getRHS() const { return RHS; }
- unsigned computeComplexity() const override {
- if (Complexity == 0)
- Complexity =
- computeOperandComplexity(RHS) + computeOperandComplexity(LHS);
- return Complexity;
- }
-
static void Profile(llvm::FoldingSetNodeID &ID, LHSTYPE lhs,
BinaryOperator::Opcode op, RHSTYPE rhs, QualType t) {
ID.AddInteger((unsigned)ClassKind);
@@ -474,7 +481,9 @@ class BinarySymExprImpl : public BinarySymExpr {
}
// Implement isa<T> support.
- static bool classof(const SymExpr *SE) { return SE->getKind() == ClassKind; }
+ static constexpr Kind ClassKind = ClassK;
+ static bool classof(const SymExpr *SE) { return classof(SE->getKind()); }
+ static constexpr bool classof(Kind K) { return K == ClassKind; }
};
/// Represents a symbolic expression like 'x' + 3.
@@ -489,6 +498,33 @@ using IntSymExpr = BinarySymExprImpl<APSIntPtr, const SymExpr *,
using SymSymExpr = BinarySymExprImpl<const SymExpr *, const SymExpr *,
SymExpr::Kind::SymSymExprKind>;
+struct MaybeSymExpr {
+ MaybeSymExpr() = default;
+ explicit MaybeSymExpr(SymbolRef Sym) : Sym(Sym) {}
+ bool isValid() const { return Sym; }
+ bool isInvalid() const { return !isValid(); }
+ SymbolRef operator->() const { return Sym; }
+
+ SymbolRef getOrNull() const { return Sym; }
+ template <typename SymT> const SymT *getOrNull() const {
+ return llvm::dyn_cast_if_present<SymT>(Sym);
+ }
+
+ DefinedOrUnknownSVal getOrUnknown() const {
+ if (isInvalid())
+ return UnknownVal();
+ return nonloc::SymbolVal(Sym);
+ }
+
+ nonloc::SymbolVal getOrAssert() const {
+ assert(Sym);
+ return nonloc::SymbolVal(Sym);
+ }
+
+private:
+ SymbolRef Sym = nullptr;
+};
+
class SymExprAllocator {
SymbolID NextSymbolID = 0;
llvm::BumpPtrAllocator &Alloc;
@@ -518,27 +554,27 @@ class SymbolManager {
SymExprAllocator Alloc;
BasicValueFactory &BV;
ASTContext &Ctx;
+ const unsigned MaxCompComplexity;
public:
SymbolManager(ASTContext &ctx, BasicValueFactory &bv,
- llvm::BumpPtrAllocator &bpalloc)
- : SymbolDependencies(16), Alloc(bpalloc), BV(bv), Ctx(ctx) {}
+ llvm::BumpPtrAllocator &bpalloc, const AnalyzerOptions &Opts)
+ : SymbolDependencies(16), Alloc(bpalloc), BV(bv), Ctx(ctx),
+ MaxCompComplexity(Opts.MaxSymbolComplexity) {
+ assert(MaxCompComplexity > 0 && "Zero max complexity doesn't make sense");
+ }
static bool canSymbolicate(QualType T);
/// Create or retrieve a SymExpr of type \p SymExprT for the given arguments.
/// Use the arguments to check for an existing SymExpr and return it,
/// otherwise, create a new one and keep a pointer to it to avoid duplicates.
- template <typename SymExprT, typename... Args>
- const SymExprT *acquire(Args &&...args);
+ template <typename SymExprT, typename... Args> auto acquire(Args &&...args);
const SymbolConjured *conjureSymbol(ConstCFGElementRef Elem,
const LocationContext *LCtx, QualType T,
unsigned VisitCount,
- const void *SymbolTag = nullptr) {
-
- return acquire<SymbolConjured>(Elem, LCtx, T, VisitCount, SymbolTag);
- }
+ const void *SymbolTag = nullptr);
QualType getType(const SymExpr *SE) const {
return SE->getType();
@@ -672,7 +708,16 @@ class SymbolVisitor {
};
template <typename T, typename... Args>
-const T *SymbolManager::acquire(Args &&...args) {
+auto SymbolManager::acquire(Args &&...args) {
+ constexpr bool IsSymbolData = SymbolData::classof(T::ClassKind);
+ if constexpr (IsSymbolData) {
+ assert(T::computeComplexity(args...) == 1);
+ } else {
+ if (T::computeComplexity(args...) > MaxCompComplexity) {
+ return MaybeSymExpr();
+ }
+ }
+
llvm::FoldingSetNodeID profile;
T::Profile(profile, args...);
void *InsertPos;
@@ -681,7 +726,12 @@ const T *SymbolManager::acquire(Args &&...args) {
SD = Alloc.make<T>(std::forward<Args>(args)...);
DataSet.InsertNode(SD, InsertPos);
}
- return cast<T>(SD);
+
+ if constexpr (IsSymbolData) {
+ return cast<T>(SD);
+ } else {
+ return MaybeSymExpr(SD);
+ }
}
} // namespace ento
diff --git a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
index e55d064253b84..f0dc889f15e7a 100644
--- a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
@@ -267,7 +267,7 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
// HACK:https://discourse.llvm.org/t/rfc-make-istainted-and-complex-symbols-friends/79570
if (const auto &Opts = State->getAnalysisManager().getAnalyzerOptions();
- Sym->computeComplexity() > Opts.MaxTaintedSymbolComplexity) {
+ Sym->complexity() > Opts.MaxTaintedSymbolComplexity) {
return {};
}
diff --git a/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp b/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
index e2f8bd541c967..ab0e3d8f56d86 100644
--- a/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/TrustNonnullChecker.cpp
@@ -66,7 +66,7 @@ class TrustNonnullChecker : public Checker<check::PostCall,
SVal Cond,
bool Assumption) const {
const SymbolRef CondS = Cond.getAsSymbol();
- if (!CondS || CondS->computeComplexity() > ComplexityThreshold)
+ if (!CondS || CondS->complexity() > ComplexityThreshold)
return State;
for (SymbolRef Antecedent : CondS->symbols()) {
diff --git a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
index fa8e669b6bb2f..3486485dcd686 100644
--- a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
+++ b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
@@ -67,7 +67,7 @@ void ExprEngine::VisitBinaryOperator(const BinaryOperator* B,
if (RightV.isUnknown()) {
unsigned Count = currBldrCtx->blockCount();
RightV = svalBuilder.conjureSymbolVal(nullptr, getCFGElementRef(), LCtx,
- Count);
+ RHS->getType(), Count);
}
// Simulate the effects of a "store": bind the value of the RHS
// to the L-Valu...
[truncated]
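To make the new contract concrete, here is a hypothetical caller (not part of the patch) that builds a binary symbol through the new acquire() API; MaybeSymExpr, acquire<>() and getOrUnknown() are as declared in the hunks above:

```cpp
// Sketch only: building a SymSymExpr under the complexity limit.
DefinedOrUnknownSVal makeSymSymVal(SymbolManager &SymMgr, const SymExpr *LHS,
                                   BinaryOperator::Opcode Op,
                                   const SymExpr *RHS, QualType Ty) {
  // acquire() returns an empty MaybeSymExpr when the combined complexity
  // would exceed AnalyzerOptions::MaxSymbolComplexity.
  MaybeSymExpr Sym = SymMgr.acquire<SymSymExpr>(LHS, Op, RHS, Ty);
  // getOrUnknown() collapses the "too complex" case into UnknownVal, which
  // every consumer of SVals must already be prepared to handle.
  return Sym.getOrUnknown();
}
```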
Review thread on clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp:
-                                            Count);
+                                            RHS->getType(), Count);
Interestingly, in #137355 @fangyi-zhou changed the behavior of this line, so it needed a tiny adjustment to make the new test pass while I was uplifting this downstream patch to current llvm main.
I didn't investigate the case beyond noting that this was the line that conjured a symbol of the wrong type after #137355, probably because in the past we directly passed a QualType here, but after that change we rely on deducing the type from getCFGElementRef(), which is apparently wrong. To see the behavior, revert this hunk and observe the broken test. There could be more places where this type mismatch on conjure could cause issues, but I didn't audit the code further.
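For reference, the hunk in question (quoted from the diff above) changes the conjure call to pass the type explicitly:

```cpp
// Before: the type is deduced from getCFGElementRef(), which picked the
// wrong type here after #137355.
RightV = svalBuilder.conjureSymbolVal(nullptr, getCFGElementRef(), LCtx,
                                      Count);
// After: the RHS type is passed explicitly.
RightV = svalBuilder.conjureSymbolVal(nullptr, getCFGElementRef(), LCtx,
                                      RHS->getType(), Count);
```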
Unfortunately I'm not convinced that this is the right direction for improving the analyzer runtime.
On the "risks" side I think that adding the corner case that "this may also return UnknownVal
in rare situations" into many functions complicates the logic, burdens the code with early return branches and I fear that it will act as a footgun.
On the "benefits" side I fear that your statistics don't prove enough:
- You found that "Out of the worst 500 entry points, 45 were improved by at least 10%. Out of these 45, 5 were improved by more than 50%. Out of these 45, 2 were improved by more than 80%." but this only covers 9% of the worst 500 entry points. Eyeballing the graph suggests that there are some cases where the runtime actually got worse -- so please check that the overall effect of the change is also positive (e.g. the total runtime is reduced meaningfully).
- Moreover, if "worst 500 entry points" means "worst 500 in the first run", then it is a biased sample: if you pick the worst outliers (i.e. the entry points where the sum of expected runtime + luck factor is largest), then you are expected to get entry points with worse than average luck (because among two similar entry points, the one with bad luck ends up in the worst 500 while the one with good luck avoids it), so if you redo the measurement, then regression toward the mean will produce better results -- even if you do both measurements with the same setup! As a sanity check, please redo the statistics on the entry points that produced the worst 500 runtimes in the second run -- I fear that on that sample (which is biased in the opposite direction) you will see that the new revision is worse than the baseline.
- I'm also interested in comparing the statistical results with a second independent measurement -- is the set of "worst 500 entry points" stable between runs, or are these random unlucky functions that are hit with environmental issues?
If you can share the raw data, I can help with statistical calculations.
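The selection-bias effect described above is easy to reproduce with a small simulation; the standalone sketch below uses invented distributions and constants purely to illustrate the argument, and has nothing to do with the actual measurements:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Two independent measurement runs of the *same* analyzer over N entry
// points: per-entry-point true cost times multiplicative run-to-run noise.
int main() {
  std::mt19937 Rng(42);
  std::lognormal_distribution<double> TrueCost(0.0, 1.0);
  std::lognormal_distribution<double> Noise(0.0, 0.3);

  const int N = 300000;
  std::vector<double> Run1(N), Run2(N);
  for (int I = 0; I < N; ++I) {
    double Cost = TrueCost(Rng);
    Run1[I] = Cost * Noise(Rng);
    Run2[I] = Cost * Noise(Rng);
  }

  // Select the worst 500 entry points *of run 1* and total them in both runs.
  std::vector<int> Idx(N);
  std::iota(Idx.begin(), Idx.end(), 0);
  std::partial_sort(Idx.begin(), Idx.begin() + 500, Idx.end(),
                    [&](int A, int B) { return Run1[A] > Run1[B]; });

  double Sum1 = 0, Sum2 = 0;
  for (int I = 0; I < 500; ++I) {
    Sum1 += Run1[Idx[I]];
    Sum2 += Run2[Idx[I]];
  }
  // Sum2 comes out smaller than Sum1 even though nothing changed: the
  // worst-500 sample is biased toward unlucky measurements, so re-measuring
  // "improves" them spuriously (regression toward the mean).
  std::cout << "worst-500 total: run1=" << Sum1 << " run2=" << Sum2 << '\n';
}
```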
Review thread on the makeNonLoc declarations in clang/include/clang/StaticAnalyzer/Core/PathSensitive/SValBuilder.h (the same hunk quoted in the diff above), e.g.:
-  nonloc::SymbolVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
-                               APSIntPtr rhs, QualType type);
+  DefinedOrUnknownSVal makeNonLoc(const SymExpr *lhs, BinaryOperator::Opcode op,
+                                  APSIntPtr rhs, QualType type);
I think this function must be renamed, because its current name strongly promises that it always returns a NonLoc.
Let me try to paraphrase your concerns. As I read it, you have mainly two concerns:
I formulated the test case in the tests by inspecting a long-running case; it is consistently low-performing. You can also check on godbolt that it would not finish, because the symbol simplifier would need to do so much work walking the overly complicated symbols. I also inspected about 10 other radically improved cases (a couple of months ago), and all showed a large number of binary-op manipulations like hashing, or the pattern in the test case I supply in this patch (see the sketch below). This doesn't seem to be a coincidence to me. To me, encountering symbols with complexity over the dedicated max symbol complexity threshold is a bug.
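For intuition, code of roughly this shape, a hash-mixing loop, produces symbols whose complexity grows with every iteration; this is an invented example, not the test case shipped with the patch:

```cpp
// Sketch: with symbolic 'p' and 'n', every iteration layers several more
// binary operators onto the symbol bound to 'h', so the symbolic expression
// grows without bound unless its complexity is capped.
unsigned hashBytes(const unsigned char *p, unsigned n) {
  unsigned h = 5381u;
  for (unsigned i = 0; i < n; ++i)
    h = (h << 5) + h + p[i]; // djb2-style mixing: h * 33 + p[i]
  return h;
}
```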
Yes, this is roughly what I mean. In addition to the tediousness of writing the unwrap + early-return boilerplate, I also fear that these early-return branches would be difficult to keep in mind and difficult to cover with tests, so they would act as a persistent source of bugs.
I don't claim that there was no evidence at all, but I feel that it wasn't significant enough.
This additional information reduces my fears that the measured runtime difference is just environmental noise. However, extreme outliers (and I'd say that the top 500 out of 3'000'000, or even just 390'000, is an extreme outlier) are still very treacherous ground for statistical conclusions (they can amplify small noise to a surprising degree), so I would like to see a more comprehensive statistical analysis. If you can share the raw data (the 3 million per-entry-point runtimes), I can help with that.
Fair, although I'd emphasize that they are not "must be eliminated" bugs, but just "something to look at" -- which may or may not lead to an improvement. I don't think that symbol complexity is a "low-hanging fruit" for slow entry points -- instead of this I'd suggest investigating the heuristics related to inlining and inlined function size. However this is far off-topic -- I'll try to eventually start a discourse thread about it once I clarify my suspicions.
I don't agree -- very complex symbols naturally appear as we define the symbols with the straightforward recursive definition that we use, and they correspond to well-formed expressions. You can say that symbols over a certain complexity threshold are so rare that the analyzer can arbitrarily discard them, and this may (or may not) be a useful heuristic -- but still the existence of the complex symbols is the "natural" bug-free state and suppressing them is the inaccurate behavior which complicates the situation.
I'm not convinced that we need to ensure that we never create overly complicated symbols. I feel that this is a "cure is worse than the disease" situation -- this patch introduces a significant amount of code complexity for modest gains in performance. However, if the statistical analysis confirms that this is a useful direction, I'd suggest eliminating the complex symbols (more precisely, the states that contain them) in a lazy fashion: put a boolean tag on the state when a complex symbol is stored in it and prevent further exploration from exploded nodes that contain a tagged state. This way this performance optimization hack could be limited to the engine, and the checkers wouldn't need to know about it at all.
The purpose of the strong type is to make code that doesn't unwrap the value fail to compile, so I don't have correctness concerns; rather the opposite, I feel safe about correctness. Speaking of the sprinkle of early returns: well, in C++ we don't really have monad support, so this is the best that I can think of. I'd consider other alternatives if you have any in mind; otherwise, I proposed the best I have, so I don't know how to proceed on this point. I'm very likely biased, but I find this code simple. Any operation that combines or otherwise "complicates" a symbol can fail if it would hit the max complexity. So, in essence, in this refactor we just use an optional type for representing the failure, as sketched below.
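A minimal illustration of that pattern (a hypothetical simplified consumer, not the patch's exact code):

```cpp
// Sketch: the hand-written "monadic bind" that the strong type enforces.
SVal evalBinOpSketch(SValBuilder &SVB, const SymExpr *LHS,
                     BinaryOperator::Opcode Op, const SymExpr *RHS,
                     QualType Ty) {
  DefinedOrUnknownSVal V = SVB.makeNonLoc(LHS, Op, RHS, Ty);
  if (V.isUnknown())
    return UnknownVal(); // complexity limit hit: give up on this value early
  // ... continue with a well-formed, complexity-bounded symbol ...
  return V;
}
```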
I find it unnecessary to cover every possible combination of circumstances under which we can reach the max symbol complexity. As you commented, there are many, many early-return scenarios, but they also share a common structure, allowing us to reason about them structurally.
If I recall correctly, there was no measurable difference at all. This was expected, as this patch does not change the common case, where symbols do not become complicated. Remember, sometimes the max complexity threshold was already obeyed (hence we had that threshold), but in other edge cases it was not, as in the case in the test. Judging the patch, I also don't see any particular reason why I should be more alert for an RT regression if we only avoid work by honoring this threshold -- other than the couple of extra early-return branches.
Indeed, I should have explained why I picked the top 500. I had a look at the distribution, and there was a significant increase in the trend there. I managed to find the script generating the plot along with the data used, but the data contains sensitive projects intermixed.
A predictable upper bound is really important to scale the analysis and to bound the total analysis time of a TU.
So maybe a better way to look at this patch is to emphasize the RT improvements of the few pathological cases less, and the enforcement of an invariant about max symbol complexity more. This patch is not a "low-hanging fruit" for improving the averages or the 99% of cases. It is about the 45 cases out of 3'000'000, and the unlucky users who happen to have code similar to those 45 cases. I agree that more intrusive changes to the heuristics offer much more room for potential improvement, but also at the cost of increased perturbation of the results. I'm genuinely surprised that a patch like this, which did not change any reports (beyond the usual noise levels) nor the average or the 99th-percentile RT, brings the enforcement of an invariant and, on top of all this, also fixes a tiny fraction of pathological cases. In hindsight, maybe my tactic should have been to emphasize the invariant aspect more than the RT improvements.
Let me share what happens when the threshold is hit, as I think this may have caused the misunderstanding. When the complexity limit is reached, the caller receives UnknownVal instead of an overly complicated symbol; having an UnknownVal appear is nothing new for the consumers. However, if you look carefully at the test provided, you can see that there is no Unknown surfacing in the expected results.
Reading this, I have the impression that you argue against having such a threshold. While I can't say why the threshold was introduced, I assume there was motivation behind it. Right now we have this threshold, yet sometimes we still create overly complicated symbols. This is objectively not a good place to be in. If we didn't have this threshold, indeed, this code would make little sense.
I'll possibly come back to this point, but not as part of this response.
Checkers wouldn't need to know about this at all. They should already know that sometimes an SVal is Unknown, and handle that case as they do today; see the sketch below.
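For instance, a typical checker callback already guards against unknown values along these lines (a generic sketch with a hypothetical MyChecker, not tied to this patch):

```cpp
// Sketch: checkers routinely bail out on unknown values today, so a value
// collapsed to UnknownVal by the complexity limit takes an existing path.
void MyChecker::checkBind(SVal Loc, SVal Val, const Stmt *S,
                          CheckerContext &C) const {
  if (Val.isUnknown())
    return; // nothing to model
  // ... normal handling of the bound value ...
}
```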
SUMMARY: My primary concern is the readability of the refactored code -- for details see the inline answers. (My secondary concern is that I don't entirely trust the statistics, but even if I take your conclusions at face value, I don't see why they justify the added complexity.) I didn't even think about the issue that returning Unknown "poisons the computation", so it wasn't a factor in my pushback. You had good points in several areas (questions about correctness and testing, no overall RT regression, no impact on checkers).

FULL REPLY:
I did a more thorough re-read of the patch and it reduced my worries about correctness issues.
Unfortunately, my re-read solidified my impression that the additional verbosity of the wrap-unwrap methods and early return branches severely hinders the readability of the refactored methods – which are already very complicated without these additional nonstandard types. (The main issue is complexity, not the raw number of tokens -- so the situation wouldn't be much better in a language like Haskell that properly supports monads.) Unfortunately I cannot suggest a more readable implementation, but I still feel that doing nothing would be better than merging this commit, because the limited benefits (improvements in 0.0015% of the entry points, no observable change of overall runtime) do not justify the added code complexity. (If this commit was the only possible fix for a commonly occurring crash, I'd accept it.)
You are biased :) -- which is natural as the author of the source code.
Your implementation is a relatively simple natural solution for the goal that symbol complexity should be limited and it is reasonably readable when presented as a diff. However, you are injecting this new goal (a single "thread") into complex methods that already fulfill many other goals ("big Gordian knot of several other threads") and the difficulty of reading code increases exponentially with the number of intermixed goals (even if each of them is simple in isolation). As I am reading the source code, I need to resolve each name that appears (jump to definition – "oh, this is just a wrapper for that") and mentally separate all the different goals ("these early returns handle nullptrs", "here we check whether swapping the operands help", "this handles an obscure corner case for ObjC", "this limits symbol complexity"), and adding yet another goal significantly complicates this, even if that goal is simple in isolation / when it is highlighted for review as a diff. As return statements interrupt the control flow, they are especially taxing for the reader – to comprehend the behavior of a function, all return statements must be simultaneously juggled in the mind.
I agree that using this strong type is (at least slightly) better than a plain nullable SymbolRef.
Ok, perhaps you're right. I don't see why the test coverage is sufficient, but I believe you if you say so.
I didn't feel that overall RT regression was too likely -- although the runtime of the analyzer is very chaotic, so I feel there is a low baseline chance of RT regression for almost every commit. (Since the regressions caused by my "don't assume third iteration" commit I'm a bit paranoid...)
Ok, I see – but this doesn't change that statistical calculations based on outliers are very shaky. It is possible to rigorously validate an effect like "this change has negligible effect overall, but reduces runtime for a subclass of outliers", but it would require significantly more complex statistical tools than what you used.
Do you think that the runtime values are sensitive even if you strip the names of the entry points and just share a big set of pairs of numbers? I would say that it's mathematically impossible to divine meaningful properties from the set of analysis runtime values, especially if you mix together many software projects.
I don't agree with this -- I think it's perfectly possible to bound the total analysis runtime for all real-world code without spending extraordinary efforts and code complexity on policing outliers whose total runtime is negligible.
Why would you want to enforce this invariant if not for the RT improvements that it provides?
Are those unlucky users really "unlucky" in a meaningful way? Even if analyzing that entry point takes half a minute instead of a second, it's still a negligible time compared to the analysis of the full project. (Are you perhaps thinking about integrating the analyzer into a code editor, where quick re-analysis after edits makes the runtime more relevant?)
No, this is not the source of the push back, I didn't even think about this question. My primary concern is the loss of source code readability (which will slow down any further developer who needs to understand or edit those areas) and the secondary concern is that I don't trust the statistical conclusions. (However, I'd suggest not merging this PR even if it did produce all the runtime improvements that you claim based on your statistics.)
If the threshold and its current partial enforcement was introduced with a good cause (i.e. to avoid a crash or to avoid a slowdown that's visible on full projects), then I'm supporting its existence.
I don't support systematically enforcing the threshold if it complicates our code base and doesn't provide meaningful benefits for the user. This threshold is just heuristic limitation of the engine, so I don't see significant difference in elegance between the status quo (partial enforcement IIUC?) and full enforcement.
The static analyzer doesn't guarantee acceptable runtimes on artificially crafted code -- I'd guess that it's not too difficult to craft many analogous examples that target various parts of the analyzer code. Unless it surfaces in real code with meaningful frequency, I don't think that it justifies complex code changes.
Rereading my suggestion, I'm less confident about it – don't bother with revisiting it unless you find it especially intriguing.
Good point, I misunderstood this. Sorry!