Performance Issue: Review SQL Query for Duplicate Findings #12067


Open
dn281090pdi opened this issue Mar 21, 2025 · 4 comments

@dn281090pdi

dn281090pdi commented Mar 21, 2025

Description:
We are observing slow performance with the SQL query used to retrieve duplicate findings.

Observed SQL Query:

SELECT "dojo_finding"."id", "dojo_finding"."title", "dojo_finding"."date", "dojo_finding"."sla_start_date", "dojo_finding"."sla_expiration_date", "dojo_finding"."cwe", "dojo_finding"."cve", "dojo_finding"."epss_score", "dojo_finding"."epss_percentile", "dojo_finding"."cvssv3", "dojo_finding"."cvssv3_score", "dojo_finding"."url", "dojo_finding"."severity", "dojo_finding"."description", "dojo_finding"."mitigation", "dojo_finding"."impact", "dojo_finding"."steps_to_reproduce", "dojo_finding"."severity_justification", "dojo_finding"."refs", "dojo_finding"."test_id", "dojo_finding"."active", "dojo_finding"."verified", "dojo_finding"."false_p", "dojo_finding"."duplicate", "dojo_finding"."duplicate_finding_id", "dojo_finding"."out_of_scope", "dojo_finding"."risk_accepted", "dojo_finding"."under_review", "dojo_finding"."last_status_update", "dojo_finding"."review_requested_by_id", "dojo_finding"."under_defect_review", "dojo_finding"."defect_review_requested_by_id", "dojo_finding"."is_mitigated", "dojo_finding"."thread_id", "dojo_finding"."mitigated", "dojo_finding"."mitigated_by_id", "dojo_finding"."reporter_id", "dojo_finding"."numerical_severity", "dojo_finding"."last_reviewed", "dojo_finding"."last_reviewed_by_id", "dojo_finding"."param", "dojo_finding"."payload", "dojo_finding"."hash_code", "dojo_finding"."line", "dojo_finding"."file_path", "dojo_finding"."component_name", "dojo_finding"."component_version", "dojo_finding"."static_finding", "dojo_finding"."dynamic_finding", "dojo_finding"."created", "dojo_finding"."scanner_confidence", "dojo_finding"."sonarqube_issue_id", "dojo_finding"."unique_id_from_tool", "dojo_finding"."vuln_id_from_tool", "dojo_finding"."sast_source_object", "dojo_finding"."sast_sink_object", "dojo_finding"."sast_source_line", "dojo_finding"."sast_source_file_path", "dojo_finding"."nb_occurences", "dojo_finding"."publish_date", "dojo_finding"."service", "dojo_finding"."planned_remediation_date", "dojo_finding"."planned_remediation_version", 
"dojo_finding"."effort_for_fixing" 
FROM "dojo_finding" 
INNER JOIN "dojo_finding" t3 ON ("dojo_finding"."id" = t3."duplicate_finding_id") 
WHERE ("dojo_finding"."duplicate_finding_id" IS NOT NULL AND t3."id" IS NOT NULL) 
ORDER BY "dojo_finding"."id" DESC;

[Screenshots: pgBadger query statistics for the slow query]

Is your feature request related to a problem? Please describe
Yes, this feature request is related to a performance problem that directly impacts user experience and database resource utilization. The current SQL query used to identify duplicate findings exhibits slow performance, especially as the dojo_finding table grows.

Describe the solution you'd like
We request a review of the query to identify potential areas for optimization. Specifically, we suggest investigating the INNER JOIN clause; it might be possible to reformulate the query for improved performance, potentially by moving the join logic into the WHERE clause or using a different approach.

Describe alternatives you've considered
We have considered increasing database resources (CPU, memory), but believe that query optimization is a more sustainable solution.
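Another direction worth evaluating alongside a query rewrite (a sketch only, not verified against the actual DefectDojo schema or planner output; the index name is made up) is a Postgres partial index restricted to the rows that participate in the duplicate lookup:

```sql
-- Hypothetical partial index (name is illustrative). It covers only rows
-- with a non-NULL duplicate_finding_id, which are the only rows the
-- self-join and the WHERE clause ever need to read.
CREATE INDEX CONCURRENTLY IF NOT EXISTS dojo_finding_duplicate_partial_idx
    ON dojo_finding (duplicate_finding_id, id)
    WHERE duplicate_finding_id IS NOT NULL;
```

Whether Postgres actually picks such an index would need to be checked with EXPLAIN (ANALYZE, BUFFERS) on the query above.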

@manuel-sommer
Contributor

I am curious which tool you used to analyse this.

@dn281090pdi
Author

> I am curious which tool you used to analyse this.

pgBadger

@valentijnscholten
Member

valentijnscholten commented Mar 21, 2025

@dn281090pdi Do you have a suggestion on how to optimize the query? In the past all queries were kept very generic because we had to support both MySQL and Postgres. Now that we have standardized on Postgres, there is more room for optimization, and even sticking to pure SQL there may still be gains to be had. I notice the instance in the screenshot has over 1M findings. That's a little above the median instance size we see and can support here from the small community team, but we're open to contributions.

@devGregA
Contributor

@dn281090pdi, this is very interesting, as the data appears to contradict what we typically see. I feel that 1M findings falls into an enterprise use case, so I'd gently suggest taking a look at the Pro version; it makes the open-source version possible, and it goes far beyond aggregation with high scalability, data enrichment, insights, custom dashboards, fully async processing, improved cross-tool deduplication, etc.

All the knobs are available in open source if workers / DB / etc. are the root cause of what you're experiencing, but unless you're running dedupe synchronously, there shouldn't be a performance hit with the proper workers and queues provisioned. However, the Similar Findings feature, which is great for tuning dedupe and is on by default, can look like dedupe to the DB. Generally speaking, the least performant aspect of the OWASP Edition is template rendering, which is why Pro has a totally different UI. Just for reference's sake, we have customers pushing around 30k scans a day, and we're performance testing with 22 million findings.

My guess is that Similar Findings is the cause of what you're experiencing; it can be disabled in the system settings. If that doesn't solve it, here is a link to the worker settings: https://docs.defectdojo.com/en/open_source/installation/running-in-production/#performance-adjustments. There are also additional DB settings to potentially adjust via settings.py (defaults are in settings.dist.py).
