
Compression plugin #18314


Open · wants to merge 27 commits into develop2 from pgi/plugin/compression

Conversation

Contributor

@perseoGI perseoGI commented May 16, 2025

Changelog: Feature: Allow externalization of compression/decompression tasks to a user defined plugin
Docs: conan-io/docs#4107

Close #18259
Close #5209
Close #18185

Related issues:

Related PRs:

For simplicity, this PR keeps the current behaviour in some compression/decompression tasks, such as:

  • Cache save/restore
  • File uploads

It adds an extra layer that determines whether the compression.py plugin is defined, falling back to the existing behaviour if it is not.

The compression.py plugin interface follows this structure:

def tar_extract(archive_path, dest_dir, conf=None, *args, **kwargs) -> None:
    pass
def tar_compress(archive_path, files, recursive, conf=None, ref=None, *args, **kwargs) -> None:
    pass
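A dispatch layer along these lines could sit in front of the builtin tar handling. This is an illustrative sketch only, not Conan's actual internal API: `load_compression_plugin` and the `uncompress_file` fallback shown here are hypothetical names.

```python
# Hypothetical sketch of the dispatch layer: load compression.py if it
# exists in the cache extensions folder, otherwise fall back to the
# builtin tarfile-based behaviour. Names here are illustrative.
import importlib.util
import os
import tarfile


def load_compression_plugin(plugin_path):
    """Load compression.py as a module if the file exists, else None."""
    if not os.path.exists(plugin_path):
        return None
    spec = importlib.util.spec_from_file_location("compression", plugin_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def uncompress_file(archive_path, dest_dir, plugin=None, conf=None):
    """Delegate to the plugin when defined, falling back to plain tar.gz."""
    if plugin is not None and hasattr(plugin, "tar_extract"):
        plugin.tar_extract(archive_path, dest_dir, conf=conf)
    else:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(path=dest_dir)
```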

Plugin example with zstd:

# File: compression.py

import os
import tarfile

import zstandard

from conan.api.output import ConanOutput

# zstd compression
# Original author https://github.com/conan-io/conan/pull/14706
def tar_extract(archive_path, dest_dir, conf=None, *args, **kwargs):
    dctx = zstandard.ZstdDecompressor()
    ConanOutput().info(f"Decompressing {os.path.basename(archive_path)} with compression plugin (ZSTD)")
    with open(archive_path, "rb") as tarfile_obj:
        with dctx.stream_reader(tarfile_obj) as stream_reader:
            # The choice of bufsize=32768 comes from profiling decompression at various
            # values and finding that bufsize value consistently performs well.
            with tarfile.open(
                fileobj=stream_reader, bufsize=32768, mode="r|"
            ) as the_tar:
                the_tar.extractall(
                    path=dest_dir, filter=lambda tarinfo, _: tarinfo
                )


def tar_compress(archive_path, files, recursive, conf=None, ref=None, *args, **kwargs):
    ConanOutput(scope=str(ref or "")).info(
        f"Compressing {os.path.basename(archive_path)} with compression plugin (ZSTD)"
    )
    compresslevel = conf.get("user.zstd:compresslevel", check_type=int) if conf else None
    with open(archive_path, "wb") as tarfile_obj:
        # Only provide level if it was overridden by config.
        zstd_kwargs = {}
        if compresslevel is not None:
            zstd_kwargs["level"] = compresslevel

        dctx = zstandard.ZstdCompressor(write_checksum=True, threads=-1, **zstd_kwargs)

        # Create a zstd stream writer so tarfile writes uncompressed data to
        # the zstd stream writer, which in turn writes compressed data to the
        # output tar.zst file.
        with dctx.stream_writer(tarfile_obj) as stream_writer:
            # The choice of bufsize=32768 comes from profiling compression at various
            # values and finding that bufsize value consistently performs well.
            # The variance in compression times at bufsize<=64KB is small. It is only
            # when bufsize>=128KB that compression times start increasing.
            with tarfile.open(
                mode="w|",
                fileobj=stream_writer,
                bufsize=32768,
                format=tarfile.PAX_FORMAT,
            ) as tar:
                current_frame_bytes = 0
                for filename, abs_path in sorted(files.items()):
                    tar.add(abs_path, filename, recursive=recursive)

                    # Flush the current frame if it has reached a large enough size.
                    # There is no required size, but 128MB is a good starting point
                    # because it allows for faster random access to the file.
                    current_frame_bytes += os.path.getsize(abs_path)
                    if current_frame_bytes >= 134217728:
                        stream_writer.flush(zstandard.FLUSH_FRAME)
                        current_frame_bytes = 0
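For exercising the plugin mechanism without the zstandard dependency, a stand-in plugin with the same signatures can be written using only the standard library; gzip here is just a placeholder for zstd, purely to illustrate the interface contract.

```python
# File: compression.py (stdlib-only stand-in; gzip replaces zstd purely
# to illustrate the tar_extract/tar_compress signatures)
import tarfile


def tar_compress(archive_path, files, recursive, conf=None, ref=None, *args, **kwargs):
    # `files` maps archive names to absolute paths, as in the zstd example
    with tarfile.open(archive_path, mode="w:gz", format=tarfile.PAX_FORMAT) as tar:
        for filename, abs_path in sorted(files.items()):
            tar.add(abs_path, filename, recursive=recursive)


def tar_extract(archive_path, dest_dir, conf=None, *args, **kwargs):
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(path=dest_dir)  # consider filter="data" on Python >= 3.12
```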

memsharded added a commit that referenced this pull request May 20, 2025
…xisting function (#18320)

Changelog: omit
Docs: omit

As suggested in
#18314 (comment)

This PR will simplify #18314 by
modifying the compression functionality on cache save command.
Instead of creating a tar object and adding files one by one (an interface tightly
coupled to the tarfile module), it creates a list of files which is
passed to an already existing Conan method, `compress_files`.

This method may be moved to another module in a future PR.
Member

@memsharded memsharded left a comment


Looking much better, easier to assess that the risk is low.
Just some small concern about the pkglist.json of the conan cache restore

@perseoGI perseoGI marked this pull request as ready for review May 23, 2025 09:45
Member

@memsharded memsharded left a comment


It is looking good.

Before proceeding, I'd like to see a PoC of using zstd, and how fallbacks and error handling are done when mixing conan_package.tgz packages compressed with gz and with zstd.

@memsharded memsharded self-assigned this May 23, 2025
Member

@AbrilRBS AbrilRBS left a comment


A comment about not re-checking the file, otherwise this looks good :)

@memsharded memsharded added this to the 2.18.0 milestone May 26, 2025
@czoido czoido requested a review from Copilot May 26, 2025 09:18

@Copilot Copilot AI left a comment


Pull Request Overview

Introduces a plugin hook for compression/decompression by loading a user-supplied compression.py and delegating to it when present, with fallback to the builtin behavior.

  • Loaded compression.py via a new config.compression_plugin property and propagated it through all compress/uncompress calls.
  • Updated compress_files and uncompress_file to invoke the plugin if available.
  • Added integration tests for plugin loading, fallback behavior, and default output.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file:

  • test/integration/extensions/test_compression_plugin.py: added tests for plugin validity and end-to-end usage
  • test/integration/command/cache/test_cache_save_restore.py: asserted default compress/decompress outputs when no plugin is set
  • conan/internal/rest/rest_client_local_recipe_index.py: passed config_api into LocalRecipesIndexApp
  • conan/internal/rest/remote_manager.py: propagated compression_plugin into uncompress_file
  • conan/internal/cache/home_paths.py: added compression_plugin_path property
  • conan/internal/api/uploader.py: forwarded compression_plugin to compress_files
  • conan/internal/conan_app.py: injected config_api into RemoteManager and LocalRecipesIndexApp
  • conan/api/subapi/config.py: implemented compression_plugin loader property
  • conan/api/subapi/cache.py: integrated plugin hook into cache save/restore
Comments suppressed due to low confidence (1)

conan/api/subapi/config.py:248

  • The code references os.path but there’s no import os at the top of this module. Add import os to avoid a NameError.
if not os.path.exists(compression_plugin_path):

@@ -158,9 +158,13 @@ def test_cache_save_excluded_folders():

# exclude source
c.run("cache save * --no-source")
# Check default compression function is being used and not compression.py plugin one
assert "Compressing conan_cache_save.tgz\n" in c.out

Copilot AI May 26, 2025


[nitpick] Instead of matching the newline, assert on "Compressing conan_cache_save.tgz" in c.out to make the test more robust across platforms and output formats.

Suggested change
assert "Compressing conan_cache_save.tgz\n" in c.out
assert "Compressing conan_cache_save.tgz" in c.out


@perseoGI perseoGI changed the title [POC] Compression plugin Compression plugin May 27, 2025
Member

@AbrilRBS AbrilRBS left a comment


I would like to discuss the fact that any enterprise adding this plugin would not be able to use their old packaged libraries as-is.

I think the implementation should be robust enough that when the plugin is activated and the file to decompress was not compressed by the plugin, but by vanilla Conan, the workflow would then still be valid. This way we would avoid forcing users to recompile the world just to change the compression of their packages, which I think would kill the feature for any already-existing Conan user.

Note that this does NOT mean the inverse has to be true. I think it's ok for packages that have been compressed by the plugin to only be decompressible with the plugin too; anyone willing to use the plugin at scale more than likely already has the infra set up to ensure everyone uses the plugin.

@perseoGI
Contributor Author

@AbrilRBS

the plugin is activated and the file to decompress was not compressed by the plugin, but by vanilla Conan, the workflow would then still be valid

Yes, that is a great point!
This can easily be addressed by passing a constant filename to the compression plugin, so that it creates a tar with that name (file extension aside).
Having a constant filename has two benefits:

  1. It avoids name conflicts when the compression plugin uses the same extension Conan natively uses (.tgz)
  2. While decompressing, even when we do not have the plugin enabled, Conan can detect whether the decompressed tarfile contains our "constant filename" (in which case it should be passed to the compression plugin) or not (a plain Conan tarfile)

packages that have been compressed by the plugin to only be able to be decompressed with the plugin

Of course, if the plugin is not enabled and Conan finds while decompressing that our "constant filename" is present, it will raise an error, as the client wouldn't know how to decompress that file.
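The detection described above could look roughly like this; note that `CONAN_PLUGIN_MARKER` and `was_compressed_by_plugin` are hypothetical names for illustration, not the constants the PR actually uses.

```python
import tarfile

# Hypothetical marker filename; the real constant chosen by the PR may differ
CONAN_PLUGIN_MARKER = "conan_plugin_compressed"


def was_compressed_by_plugin(archive_path):
    """Peek at the outer .tgz members: if the constant filename is present,
    the payload must be handed to the compression plugin; otherwise it is a
    plain Conan tarball (or an error if the plugin is not enabled)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return CONAN_PLUGIN_MARKER in tar.getnames()
```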

@perseoGI perseoGI force-pushed the pgi/plugin/compression branch from 6fab859 to f986651 Compare June 23, 2025 15:51
Member

@memsharded memsharded left a comment


I think there are some minor interface/UI/UX issues, but overall this approach seems much better than the previous iteration.


def _tar_extract_with_plugin(fileobj, destination_dir, compression_plugin, conf):
"""First remove tar.gz wrapper and then call the plugin to extract"""
with tempfile.TemporaryDirectory() as temp_dir:
Member


Conan doesn't really use system TemporaryDirectories for anything; it tries to do everything in the Conan cache, in cache folders, so if something goes wrong the files are there, and also there is no "leakage" of packages, recipes or code anywhere else besides the Conan cache, which can be easily wiped.

Contributor Author


Okay, I didn't like the idea of messing up the Conan cache. Do we have a temporary folder in our Conan cache?
Also, there are some other temporary-file usages:

pkglist_path = os.path.join(tempfile.gettempdir(), "pkglist.json")

(do not look at the blame, it was me!) but this case is different, as it is only a pkglist extraction + removal.

I could also move it to the Conan cache, and maybe consider creating a tmp folder in it?

Member


Yes, I understand the concern.
Let me think about it.

@memsharded memsharded modified the milestones: 2.18.0, 2.19.0 Jun 26, 2025
Successfully merging this pull request may close these issues:

  • [feature] Use subdirectory for conan cache save content
  • [feature] Multithreaded conan cache save
  • Improve Conan untar speed

4 participants