feat(scanner): Add submodule fetch strategy for nested repositories #2679

Open
wants to merge 1 commit into
base: main

wkl3nk
Contributor

@wkl3nk wkl3nk commented May 8, 2025

Introduce submoduleFetchStrategy config to control how the Scanner fetches Git submodules. When set to TOP_LEVEL_ONLY, only top-level submodules are fetched, avoiding timeouts on deeply nested repos.

This mirrors the behavior already available in the Analyzer and makes it possible to resolve nested provenances even in repositories with a large number of nested submodules.

If activated, the --recursive flag no longer appears in the git submodule update command in the logs:

Running 'git submodule update --init --depth 50' in '/tmp/ort-DefaultWorkingTreeCache13286267791034354700'...
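As a rough illustration of the effect described above, the following minimal sketch shows how a fetch strategy could translate into the arguments of `git submodule update`. The enum `SubmoduleFetchStrategy` and the function `submoduleUpdateArgs` are hypothetical stand-ins written for this example, not the actual implementation; only the Git flags `--init`, `--recursive` and `--depth` are real.

```kotlin
// Hypothetical stand-in for the strategy setting; the real type may differ.
enum class SubmoduleFetchStrategy { TOP_LEVEL_ONLY, FULLY_RECURSIVE }

// Builds the argument list for `git submodule update`.
// `--recursive` is only added when nested submodules should be fetched as well.
fun submoduleUpdateArgs(strategy: SubmoduleFetchStrategy, historyDepth: Int = 50): List<String> =
    buildList {
        add("submodule")
        add("update")
        add("--init")
        if (strategy == SubmoduleFetchStrategy.FULLY_RECURSIVE) add("--recursive")
        add("--depth")
        add(historyDepth.toString())
    }

fun main() {
    // Prints the command line for each strategy.
    for (strategy in SubmoduleFetchStrategy.values()) {
        println("$strategy: git " + submoduleUpdateArgs(strategy).joinToString(" "))
    }
}
```

With `TOP_LEVEL_ONLY`, the resulting arguments match the command shown in the log line above; the position of `--recursive` in the other case is an assumption of this sketch.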

Introduce `submoduleFetchStrategy` config to control how the Scanner
fetches Git submodules. When set to `TOP_LEVEL_ONLY`, only top-level
submodules are fetched, avoiding timeouts on deeply nested repos.

This mirrors the behavior already available in the Analyzer and
makes it possible to resolve nested provenances even in repositories
with a large number of nested submodules.

Signed-off-by: Wolfgang Klenk <[email protected]>
@wkl3nk wkl3nk force-pushed the wkl3nk/scanner-add-submodule-fetch-strategy branch from cc97a2e to f068fce May 8, 2025 12:07
@wkl3nk wkl3nk marked this pull request as ready for review May 8, 2025 12:33
@@ -85,7 +87,20 @@ class ScannerRunner(
?: listOf(SourceCodeOrigin.ARTIFACT, SourceCodeOrigin.VCS)
)

val workingTreeCache = DefaultWorkingTreeCache()
// If the submodule fetch strategy is set to TOP_LEVEL_ONLY, for git use a plugin config that prevents that
Contributor

Is there some way to test this? Maybe with a constructor mock and a verification that the Git-specific plugin options have actually been set?

emptyMap()
}

val workingTreeCache = DefaultWorkingTreeCache().addVcsPluginConfigs(vcsPluginConfigs)
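The hunk above is only partially visible, so here is a minimal, self-contained sketch of how the strategy might be mapped to the VCS plugin configurations that are passed to the working tree cache. Everything in it is an assumption made for illustration: `PluginConfig` is a simplified local stand-in, and the plugin ID `"Git"` and the option name `updateNestedSubmodules` are not confirmed by the diff.

```kotlin
// Hypothetical stand-in for the strategy setting; the real type may differ.
enum class SubmoduleFetchStrategy { TOP_LEVEL_ONLY, FULLY_RECURSIVE }

// Simplified local stand-in for a plugin configuration holding string options.
data class PluginConfig(val options: Map<String, String> = emptyMap())

// Maps the strategy to per-VCS plugin configurations.
// The key "Git" and the option "updateNestedSubmodules" are assumptions of this sketch.
fun vcsPluginConfigsFor(strategy: SubmoduleFetchStrategy): Map<String, PluginConfig> =
    if (strategy == SubmoduleFetchStrategy.TOP_LEVEL_ONLY) {
        mapOf("Git" to PluginConfig(mapOf("updateNestedSubmodules" to "false")))
    } else {
        // No override: the VCS plugin keeps its default behavior.
        emptyMap()
    }

fun main() {
    println(vcsPluginConfigsFor(SubmoduleFetchStrategy.TOP_LEVEL_ONLY))
    println(vcsPluginConfigsFor(SubmoduleFetchStrategy.FULLY_RECURSIVE))
}
```

Keeping the mapping in a pure function like this would be one way to verify the Git-specific options in a unit test without having to mock a constructor.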
Contributor

The VCS options are not part of the storage key for the nested provenance storage. This means that changing this setting has no effect for repositories that already have a stored resolution result, which could lead to unexpected results. I think that to implement this correctly, the storage would have to be adapted as well, which might also require changes in ORT.

Contributor

To be more precise, in the analyzer this option only affects the project repository, but here it also affects the repositories of dependencies, which might also be dependencies of other projects that do not use the TOP_LEVEL_ONLY strategy.

Contributor Author

To my understanding, you are talking about the ScanSummary instances that are persisted to storage, including things like licenseFindings, copyrightFindings, snippetFindings and issues? But doesn't each scanner (like ScanCode) check out the respective package to some "private" space before scanning it? Or am I wrong and all the scanners use this workingTreeCache?

Contributor

No, I mean the NestedProvenanceStorage. The option you introduce influences the result when the scanner checks for submodules in a repository. The result will be stored in the NestedProvenanceStorage and then reused. But it will be reused regardless of the configured strategy.
So let's say the strategy is set to TOP_LEVEL_ONLY, then the result will only contain direct submodules. Otherwise, it would also contain transitive submodules. Whatever strategy is configured when a dependency is scanned for the first time will then be reused even if a future scan uses a different strategy.
This is an edge case of course, because it's not so common that you have deeply nested submodule structures, but if it happens, the issue will be very difficult to find and debug.
Also, for the problem you are trying to solve the setting should only affect the scan of project repositories, not the scan of dependencies.
If you want, we could have a brainstorming session on Monday to check whether there is an easy way to avoid this issue.
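The reuse problem described in this comment can be reproduced with a minimal sketch. All types below are simplified, hypothetical stand-ins (they are not the actual NestedProvenanceStorage API), and including the strategy in the key is just one conceivable way to avoid stale results, not a design agreed on in this thread.

```kotlin
// Hypothetical stand-in for the strategy setting; the real type may differ.
enum class SubmoduleFetchStrategy { TOP_LEVEL_ONLY, FULLY_RECURSIVE }

// Simplified stand-ins for a repository provenance and a stored resolution result.
data class RepositoryProvenance(val url: String, val revision: String)
data class NestedProvenanceResult(val submodulePaths: List<String>)

typealias Resolver = (SubmoduleFetchStrategy) -> NestedProvenanceResult

// Keyed by provenance only: the first stored result wins, whatever strategy produced it.
class ProvenanceKeyedStorage {
    private val entries = mutableMapOf<RepositoryProvenance, NestedProvenanceResult>()

    fun resolve(provenance: RepositoryProvenance, strategy: SubmoduleFetchStrategy, resolver: Resolver) =
        entries.getOrPut(provenance) { resolver(strategy) }
}

// Keyed by provenance and strategy: each strategy gets its own stored result.
class StrategyAwareStorage {
    private val entries =
        mutableMapOf<Pair<RepositoryProvenance, SubmoduleFetchStrategy>, NestedProvenanceResult>()

    fun resolve(provenance: RepositoryProvenance, strategy: SubmoduleFetchStrategy, resolver: Resolver) =
        entries.getOrPut(provenance to strategy) { resolver(strategy) }
}

// Fake resolver: pretends the repository has one direct and one transitive submodule.
fun fakeResolve(strategy: SubmoduleFetchStrategy): NestedProvenanceResult =
    when (strategy) {
        SubmoduleFetchStrategy.TOP_LEVEL_ONLY -> NestedProvenanceResult(listOf("libs/a"))
        SubmoduleFetchStrategy.FULLY_RECURSIVE -> NestedProvenanceResult(listOf("libs/a", "libs/a/vendor/b"))
    }

fun main() {
    val provenance = RepositoryProvenance("https://example.com/repo.git", "abc123")

    val provenanceKeyed = ProvenanceKeyedStorage()
    provenanceKeyed.resolve(provenance, SubmoduleFetchStrategy.TOP_LEVEL_ONLY, ::fakeResolve)
    // The later FULLY_RECURSIVE request still gets the stored top-level-only result.
    println(provenanceKeyed.resolve(provenance, SubmoduleFetchStrategy.FULLY_RECURSIVE, ::fakeResolve))

    val strategyAware = StrategyAwareStorage()
    strategyAware.resolve(provenance, SubmoduleFetchStrategy.TOP_LEVEL_ONLY, ::fakeResolve)
    // Here the FULLY_RECURSIVE request is resolved and stored separately.
    println(strategyAware.resolve(provenance, SubmoduleFetchStrategy.FULLY_RECURSIVE, ::fakeResolve))
}
```

Note that a strategy-aware key alone would not address the second concern raised above, namely that the setting should only apply to project repositories and not to dependencies.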

3 participants