Skip to content

feat(scanner): Add submodule fetch strategy for nested repositories #2679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions api/v1/mapping/src/commonMain/kotlin/Mappings.kt
Original file line number Diff line number Diff line change
Expand Up @@ -578,7 +578,8 @@ fun ScannerJobConfiguration.mapToApi() = ApiScannerJobConfiguration(
skipExcluded,
sourceCodeOrigins?.map { it.mapToApi() },
config?.mapValues { it.value.mapToApi() },
keepAliveWorker
keepAliveWorker,
submoduleFetchStrategy.mapToApi()
)

fun ApiScannerJobConfiguration.mapToModel() = ScannerJobConfiguration(
Expand All @@ -590,7 +591,8 @@ fun ApiScannerJobConfiguration.mapToModel() = ScannerJobConfiguration(
skipExcluded,
sourceCodeOrigins?.map { it.mapToModel() },
config?.mapValues { it.value.mapToModel() },
keepAliveWorker
keepAliveWorker,
submoduleFetchStrategy.mapToModel()
)

fun Secret.mapToApi() = ApiSecret(name, description)
Expand Down
9 changes: 8 additions & 1 deletion api/v1/model/src/commonMain/kotlin/JobConfigurations.kt
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,14 @@ data class ScannerJobConfiguration(
* Keep the worker alive after it has finished. This is useful for manual problem analysis directly
* within the pod's execution environment.
*/
val keepAliveWorker: Boolean = false
val keepAliveWorker: Boolean = false,

/**
* Specifies how submodules are fetched when resolving provenances. Currently supported only for Git repositories.
* If set to [SubmoduleFetchStrategy.FULLY_RECURSIVE] (default), all submodules are fetched recursively. If set
* to [SubmoduleFetchStrategy.TOP_LEVEL_ONLY], only the top-level submodules are fetched.
*/
val submoduleFetchStrategy: SubmoduleFetchStrategy = SubmoduleFetchStrategy.FULLY_RECURSIVE
)

/**
Expand Down
9 changes: 8 additions & 1 deletion model/src/commonMain/kotlin/JobConfigurations.kt
Original file line number Diff line number Diff line change
Expand Up @@ -196,7 +196,14 @@ data class ScannerJobConfiguration(
* Keep the worker alive after it has finished. This is useful for manual problem analysis directly
* within the pod's execution environment.
*/
val keepAliveWorker: Boolean = false
val keepAliveWorker: Boolean = false,

/**
* Specifies how submodules are fetched when resolving provenances. Currently supported only for Git repositories.
* If set to [SubmoduleFetchStrategy.FULLY_RECURSIVE] (default), all submodules are fetched recursively. If set
* to [SubmoduleFetchStrategy.TOP_LEVEL_ONLY], only the top-level submodules are fetched.
*/
val submoduleFetchStrategy: SubmoduleFetchStrategy = SubmoduleFetchStrategy.FULLY_RECURSIVE
)

/**
Expand Down
17 changes: 16 additions & 1 deletion workers/scanner/src/main/kotlin/scanner/ScannerRunner.kt
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
package org.eclipse.apoapsis.ortserver.workers.scanner

import org.eclipse.apoapsis.ortserver.model.ScannerJobConfiguration
import org.eclipse.apoapsis.ortserver.model.SubmoduleFetchStrategy
import org.eclipse.apoapsis.ortserver.workers.common.OrtServerFileListStorage
import org.eclipse.apoapsis.ortserver.workers.common.context.WorkerContext
import org.eclipse.apoapsis.ortserver.workers.common.mapToOrt
Expand All @@ -33,6 +34,7 @@ import org.ossreviewtoolkit.model.PackageType
import org.ossreviewtoolkit.model.Provenance
import org.ossreviewtoolkit.model.ScannerRun
import org.ossreviewtoolkit.model.SourceCodeOrigin
import org.ossreviewtoolkit.model.VcsType
import org.ossreviewtoolkit.model.config.DownloaderConfiguration
import org.ossreviewtoolkit.model.config.ScannerConfiguration
import org.ossreviewtoolkit.model.utils.FileArchiver
Expand Down Expand Up @@ -85,7 +87,20 @@ class ScannerRunner(
?: listOf(SourceCodeOrigin.ARTIFACT, SourceCodeOrigin.VCS)
)

val workingTreeCache = DefaultWorkingTreeCache()
// If the submodule fetch strategy is set to TOP_LEVEL_ONLY, for git use a plugin config that prevents that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to test this? Maybe with a constructor mock and a verification that the Git-specific plugin options have actually been set?

// submodules are fetched recursively.
val vcsPluginConfigs = if (config.submoduleFetchStrategy == SubmoduleFetchStrategy.TOP_LEVEL_ONLY) {
mapOf(
VcsType.GIT.toString() to PluginConfig(
options = mapOf("updateNestedSubmodules" to "false")
)
)
} else {
emptyMap()
}

val workingTreeCache = DefaultWorkingTreeCache().addVcsPluginConfigs(vcsPluginConfigs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VCS options are not part of the storage key for the nested provenance storage. This means that changing this setting has no effect for repositories where there is already a stored resolution result which could lead to unexpected results. I think to implement this correctly, the storage would have to be adapted as well which might also require changes in ORT.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise, in the analyzer this option only affects the project repository, but here it affects also repositories of dependencies which might also be dependencies of other projects which do not use the TOP_LEVEL_ONLY strategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding: You are talking about ScanSummarys that are persisted to storage, including things like licenseFindings, copyrightFindings, snippetFindings and issues ? But does not each Scanner (like ScanCode) check out the respective package to some "private" space before scanning it? Or am I wrong and all the scanners use this workingTreeCache ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean the NestedProvenanceStorage. The option you introduce influences the result when the scanner checks for submodules in a repository. The result will be stored in the NestedProvenanceStorage and then reused. But it will be reused independent of the configured strategy.
So let's say the strategy is set to TOP_LEVEL_ONLY, then the result will only contain direct submodules. Otherwise, it would also contain transitive submodules. Whatever strategy is configured when a dependency is scanned for the first time will then be reused even if a future scan uses a different strategy.
This is an edge case of course, because it's not so common that you have deeply nested submodule structures, but if it happens, the issue will be very difficult to find and debug.
Also, for the problem you are trying to solve the setting should only affect the scan of project repositories, not the scan of dependencies.
If you want we could do a brainstorming on Monday to check if there is maybe an easy way to avoid this issue.


val provenanceDownloader = DefaultProvenanceDownloader(downloaderConfig, workingTreeCache)
val packageProvenanceResolver = DefaultPackageProvenanceResolver(
scanStorages.packageProvenanceStorage,
Expand Down
Loading