feat(scanner): Add submodule fetch strategy for nested repositories #2679
Introduce `submoduleFetchStrategy` config to control how the Scanner fetches Git submodules. When set to `TOP_LEVEL_ONLY`, only top-level submodules are fetched, avoiding timeouts on deeply nested repos. This mirrors the behavior already available in the Analyzer and makes it possible to resolve nested provenances even in repositories with a vast number of nested submodules.
Signed-off-by: Wolfgang Klenk <[email protected]>
cc97a2e to f068fce
@@ -85,7 +87,20 @@ class ScannerRunner(
?: listOf(SourceCodeOrigin.ARTIFACT, SourceCodeOrigin.VCS)
)

val workingTreeCache = DefaultWorkingTreeCache()
// If the submodule fetch strategy is set to TOP_LEVEL_ONLY, for git use a plugin config that prevents that
Is there some way to test this? Maybe with a constructor mock and a verification that the Git-specific plugin options have actually been set?
emptyMap()
}

val workingTreeCache = DefaultWorkingTreeCache().addVcsPluginConfigs(vcsPluginConfigs)
The VCS options are not part of the storage key for the nested provenance storage. This means that changing this setting has no effect for repositories that already have a stored resolution result, which could lead to unexpected results. I think that, to implement this correctly, the storage would have to be adapted as well, which might also require changes in ORT.
To be more precise: in the Analyzer this option only affects the project repository, but here it also affects the repositories of dependencies, which might also be dependencies of other projects that do not use the `TOP_LEVEL_ONLY` strategy.
To my understanding, you are talking about `ScanSummary`s that are persisted to storage, including things like `licenseFindings`, `copyrightFindings`, `snippetFindings` and `issues`? But doesn't each scanner (like ScanCode) check out the respective package to some "private" space before scanning it? Or am I wrong and all the scanners use this `workingTreeCache`?
No, I mean the `NestedProvenanceStorage`. The option you introduce influences the result when the scanner checks for submodules in a repository. The result will be stored in the `NestedProvenanceStorage` and then reused. But it will be reused independently of the configured strategy.
So let's say the strategy is set to `TOP_LEVEL_ONLY`, then the result will only contain direct submodules. Otherwise, it would also contain transitive submodules. Whatever strategy is configured when a dependency is scanned for the first time will then be reused, even if a future scan uses a different strategy.
This is an edge case of course, because deeply nested submodule structures are not that common, but if it happens, the issue will be very difficult to find and debug.
Also, for the problem you are trying to solve, the setting should only affect the scan of project repositories, not the scan of dependencies.
If you want, we could do a brainstorming session on Monday to check whether there is an easy way to avoid this issue.
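To make the concern above concrete, here is a minimal, self-contained sketch of a cache whose key omits the fetch strategy. It is not ORT code: `RepositoryKey`, `NestedProvenanceCache`, `resolveSubmodules`, the enum values and the submodule paths are all hypothetical stand-ins chosen for illustration, and the real `NestedProvenanceStorage` may be structured differently.

```kotlin
// Hypothetical, simplified model of the caching issue; none of these types are ORT's real API.
enum class SubmoduleFetchStrategy { TOP_LEVEL_ONLY, FULLY_RECURSIVE }

// The storage key only identifies the repository, not the strategy used to resolve it.
data class RepositoryKey(val url: String, val revision: String)

class NestedProvenanceCache(
    private val resolve: (RepositoryKey, SubmoduleFetchStrategy) -> List<String>
) {
    private val entries = mutableMapOf<RepositoryKey, List<String>>()

    // Because the strategy is not part of the key, the first resolution wins
    // and is returned for every later lookup of the same repository.
    fun get(key: RepositoryKey, strategy: SubmoduleFetchStrategy): List<String> =
        entries.getOrPut(key) { resolve(key, strategy) }
}

// Stand-in resolver: returns made-up submodule paths depending on the strategy.
fun resolveSubmodules(key: RepositoryKey, strategy: SubmoduleFetchStrategy): List<String> =
    when (strategy) {
        SubmoduleFetchStrategy.TOP_LEVEL_ONLY -> listOf("libs/a")
        SubmoduleFetchStrategy.FULLY_RECURSIVE -> listOf("libs/a", "libs/a/vendor/b")
    }

fun main() {
    val cache = NestedProvenanceCache(::resolveSubmodules)
    val key = RepositoryKey("https://example.com/repo.git", "abc123")

    // The first scan stores the TOP_LEVEL_ONLY result under the repository key.
    val first = cache.get(key, SubmoduleFetchStrategy.TOP_LEVEL_ONLY)
    // A later scan asks for the recursive result but receives the stored one.
    val second = cache.get(key, SubmoduleFetchStrategy.FULLY_RECURSIVE)

    println(first)
    println(second)
}
```

In this toy model both lookups return the same list, because the second one never reaches the resolver. Making the strategy part of the key, or restricting the option to project repositories, would be two ways to avoid that, assuming the real storage behaves like this sketch.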
Introduce `submoduleFetchStrategy` config to control how the Scanner fetches Git submodules. When set to `TOP_LEVEL_ONLY`, only top-level submodules are fetched, avoiding timeouts on deeply nested repos. This mirrors the behavior already available in the Analyzer and makes it possible to resolve nested provenances even in repositories with a vast number of nested submodules.
If activated, the logs no longer show the `--recursive` flag in the `git submodule update` command.