Performance Issue: Review SQL Query for Duplicate Findings #12067


Open
dn281090pdi opened this issue Mar 21, 2025 · 4 comments

@dn281090pdi

dn281090pdi commented Mar 21, 2025

Description:
We are observing slow performance with the SQL query used to retrieve duplicate findings.

Observed SQL Query:

SELECT "dojo_finding"."id", "dojo_finding"."title", "dojo_finding"."date", "dojo_finding"."sla_start_date", "dojo_finding"."sla_expiration_date", "dojo_finding"."cwe", "dojo_finding"."cve", "dojo_finding"."epss_score", "dojo_finding"."epss_percentile", "dojo_finding"."cvssv3", "dojo_finding"."cvssv3_score", "dojo_finding"."url", "dojo_finding"."severity", "dojo_finding"."description", "dojo_finding"."mitigation", "dojo_finding"."impact", "dojo_finding"."steps_to_reproduce", "dojo_finding"."severity_justification", "dojo_finding"."refs", "dojo_finding"."test_id", "dojo_finding"."active", "dojo_finding"."verified", "dojo_finding"."false_p", "dojo_finding"."duplicate", "dojo_finding"."duplicate_finding_id", "dojo_finding"."out_of_scope", "dojo_finding"."risk_accepted", "dojo_finding"."under_review", "dojo_finding"."last_status_update", "dojo_finding"."review_requested_by_id", "dojo_finding"."under_defect_review", "dojo_finding"."defect_review_requested_by_id", "dojo_finding"."is_mitigated", "dojo_finding"."thread_id", "dojo_finding"."mitigated", "dojo_finding"."mitigated_by_id", "dojo_finding"."reporter_id", "dojo_finding"."numerical_severity", "dojo_finding"."last_reviewed", "dojo_finding"."last_reviewed_by_id", "dojo_finding"."param", "dojo_finding"."payload", "dojo_finding"."hash_code", "dojo_finding"."line", "dojo_finding"."file_path", "dojo_finding"."component_name", "dojo_finding"."component_version", "dojo_finding"."static_finding", "dojo_finding"."dynamic_finding", "dojo_finding"."created", "dojo_finding"."scanner_confidence", "dojo_finding"."sonarqube_issue_id", "dojo_finding"."unique_id_from_tool", "dojo_finding"."vuln_id_from_tool", "dojo_finding"."sast_source_object", "dojo_finding"."sast_sink_object", "dojo_finding"."sast_source_line", "dojo_finding"."sast_source_file_path", "dojo_finding"."nb_occurences", "dojo_finding"."publish_date", "dojo_finding"."service", "dojo_finding"."planned_remediation_date", "dojo_finding"."planned_remediation_version", 
"dojo_finding"."effort_for_fixing" 
FROM "dojo_finding" 
INNER JOIN "dojo_finding" t3 ON ("dojo_finding"."id" = t3."duplicate_finding_id") 
WHERE ("dojo_finding"."duplicate_finding_id" IS NOT NULL AND t3."id" IS NOT NULL) 
ORDER BY "dojo_finding"."id" DESC;

[Screenshots: pgBadger query statistics for the slow query]

Is your feature request related to a problem? Please describe
Yes, this feature request is related to a performance problem that directly impacts user experience and database resource utilization. The current SQL query used to identify duplicate findings exhibits slow performance, especially as the dojo_finding table grows.

Describe the solution you'd like
We request a review of the query to identify potential areas for optimization. Specifically, we suggest investigating the INNER JOIN clause; it might be possible to reformulate the query for improved performance, potentially by moving the join logic into the WHERE clause or using a different approach.

Describe alternatives you've considered
We have considered increasing database resources (CPU, memory), but believe that query optimization is a more sustainable solution.
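Another direction worth evaluating alongside a query rewrite (a sketch only, not verified against the actual DefectDojo schema or planner output; the index name is made up) is a Postgres partial index restricted to the rows that participate in the duplicate lookup:

```sql
-- Hypothetical partial index (name is illustrative). It covers only rows
-- with a non-NULL duplicate_finding_id, which are the only rows the
-- self-join and the WHERE clause ever need to read.
CREATE INDEX CONCURRENTLY IF NOT EXISTS dojo_finding_duplicate_partial_idx
    ON dojo_finding (duplicate_finding_id, id)
    WHERE duplicate_finding_id IS NOT NULL;
```

Whether Postgres actually picks such an index would need to be checked with EXPLAIN (ANALYZE, BUFFERS) on the query above.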

@manuel-sommer
Contributor

I am curious which tool you used to analyse this.

@dn281090pdi
Author

> I am curious which tool you used to analyse this.

pgBadger

@valentijnscholten
Member

valentijnscholten commented Mar 21, 2025

@dn281090pdi Do you have a suggestion on how to optimize the query? In the past all queries were kept very generic because we had to support both MySQL and Postgres. Now that we have standardized on Postgres, there is more room for optimization, and even sticking to pure SQL there may still be gains to be had. I notice the instance in the screenshot has over 1M findings. That's a little above the median instance size we see and can support here from the small community team, but we're open to contributions.

@devGregA
Contributor

@dn281090pdi, this is very interesting, as the data appears to contradict what we typically see. I feel that 1M findings falls into an enterprise use case, so I'd gently suggest taking a look at the Pro version; it makes the open-source version possible, and it goes far beyond aggregation with high scalability, data enrichment, insights, custom dashboards, fully async processing, improved cross-tool deduplication, etc.

All the knobs are available in open source if workers / DB / etc. are the root cause of what you're experiencing, but unless you're running dedupe synchronously, there shouldn't be a performance hit with the proper workers and queues provisioned. However, the Similar Findings feature, which is great for tuning dedupe and is on by default, can look like dedupe to the DB. Generally speaking, the least performant aspect of the OWASP Edition is template rendering, which is why Pro has a totally different UI. Just for reference's sake, we have customers pushing around 30k scans a day, and we're performance testing with 22 million findings.

My guess is that Similar Findings is the cause of what you're experiencing; it can be disabled in the system settings. If that doesn't solve it, here is a link to the worker settings: https://docs.defectdojo.com/en/open_source/installation/running-in-production/#performance-adjustments. There are also additional DB settings to potentially adjust via settings.py (defaults are in settings.dist.py).
