Add Unlabeled Image Dataset for Unsupervised Training #9050

Open
mduszyk opened this issue May 2, 2025 · 5 comments

Comments

@mduszyk

mduszyk commented May 2, 2025

🚀 The feature

I’m proposing to add a dataset class for unsupervised learning (e.g., generative models), where the dataset consists of a flat folder of unlabeled images.

Introduce a new class, e.g. UnlabeledImageDataset, that:

  • Accepts a flat folder of image files
  • Returns only images (no labels)
  • Follows ImageFolder conventions where applicable
  • Resides in torchvision/datasets/folder.py and reuses existing utilities
  • Is kept separate from ImageFolder, so that class does not grow more complex

Motivation, pitch

torchvision.datasets.ImageFolder and DatasetFolder are designed for supervised tasks and require a specific directory structure and class-label mappings. For unsupervised work, I end up writing a custom dataset each time. A built-in dataset would improve usability and consistency across the PyTorch ecosystem.

This feature request is similar in spirit to Issue #660, where a user suggested supporting unlabeled or unsupervised datasets. The use case remains common, and a lightweight, built-in solution would reduce boilerplate and improve consistency.

Alternatives

An alternative would be to add an "unsupervised" mode to ImageFolder, as suggested in Issue #660. But that would make the class more complex, as pointed out in a comment on that issue.

Additional context

It feels like this functionality belongs in a common library, especially since ImageFolder is already present in torchvision.

@mduszyk
Author

mduszyk commented May 2, 2025

Would you be open to adding this? I’d be happy to contribute a PR if there’s interest.

Thanks!

@NicolasHug
Member

Thanks for the feature request @mduszyk. Can you share a bit more about the API you have in mind? Naively this sounds like a shallow wrapper around pathlib.Path.glob()?
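For readers following along, this is roughly what such a shallow wrapper amounts to for a flat folder. It is only a sketch: the helper name `list_flat_images` and the `IMAGE_SUFFIXES` set are hypothetical and chosen for illustration, not taken from torchvision.

```python
from pathlib import Path

# Hypothetical suffix set for illustration; not torchvision's own list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_flat_images(root):
    """Return image files directly inside `root`, sorted for a stable order."""
    return sorted(
        p for p in Path(root).glob("*")
        # Compare suffixes case-insensitively so "photo.JPG" is kept too.
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Subdirectories and non-image files are skipped, so the result is a plain list of paths that a dataset class would only need to load and transform.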

@mduszyk
Author

mduszyk commented Jun 2, 2025

I was thinking about something like this:

from pathlib import Path
from torchvision.io import read_image, ImageReadMode

class UnlabeledImageFolder:
    """Returns images only (no labels) from files matching the glob patterns."""

    def __init__(self, root_dir, patterns=('**/*.jpg', '**/*.png'), transform=None):
        self.root = Path(root_dir)
        self.images = []
        for pattern in patterns:
            self.images.extend(self.root.glob(pattern))
        # glob() order is not guaranteed; sort so indices are stable across runs.
        self.images.sort()
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # Pass the path as a str and decode to a 3-channel RGB tensor.
        img = read_image(str(self.images[i]), ImageReadMode.RGB)
        if self.transform is not None:
            img = self.transform(img)
        return img

It uses glob with multiple patterns, loads the image, and applies an optional transform.
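To make the behaviour of those default patterns concrete, here is a small standard-library sketch of just the file-collection step. The helper name `glob_images` is hypothetical, and the de-duplication is an addition for illustration rather than part of the proposal.

```python
from pathlib import Path


def glob_images(root, patterns=("**/*.jpg", "**/*.png")):
    """Collect files matching any pattern, de-duplicated and sorted."""
    root = Path(root)
    matches = set()
    for pattern in patterns:
        # "**/" matches zero or more directory levels, so files directly in
        # `root` are found as well as files in nested subfolders.
        matches.update(p for p in root.glob(pattern) if p.is_file())
    # A set avoids duplicates when user-supplied patterns overlap.
    return sorted(matches)
```

Two things follow from this: the default patterns work for both flat and nested folders, and extensions not listed in the patterns (for example `.jpeg`) are silently skipped unless the caller adds a pattern for them.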

One more idea is to unify the ImageFolder API: depending on its init parameters, it could internally instantiate LabeledImageFolder or UnlabeledImageFolder. This way both implementations would stay separate and the user would see a single API.
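One way this dispatch could look is a small factory function. This is only a sketch of the idea: the two classes are empty placeholders, and the function name `image_folder` and its `labeled` parameter are hypothetical, not an existing or agreed torchvision API.

```python
class LabeledImageFolder:
    """Placeholder for the existing class-per-subfolder behaviour."""

    def __init__(self, root):
        self.root = root


class UnlabeledImageFolder:
    """Placeholder for the proposed flat-folder, images-only dataset."""

    def __init__(self, root):
        self.root = root


def image_folder(root, labeled=True):
    """Hypothetical single entry point that selects the implementation."""
    cls = LabeledImageFolder if labeled else UnlabeledImageFolder
    return cls(root)
```

With this shape, `image_folder("data")` would return the labeled variant and `image_folder("data", labeled=False)` the unlabeled one, while each implementation stays in its own class.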

Looking forward to hearing your thoughts on this.

@NicolasHug
Member

Thanks for the details. I think this is reasonable but I hope we can support that with the existing ImageFolder or DatasetFolder, perhaps with some minor modifications. Can you check if allow_empty=True supports what you need already?

@mduszyk
Author

mduszyk commented Jun 5, 2025

DatasetFolder, and ImageFolder which extends it, assume a certain
directory structure where subdirectories of the dataset root directory are
treated as classes. The dataset then returns (input, target) pairs. I was
hoping to also be able to work with datasets that have no targets,
i.e. where __getitem__ returns only the image.

Here is a general view of this in the code:

class DatasetFolder(VisionDataset):

    ...

    def find_classes(self, directory: Union[str, Path]) -> tuple[list[str], dict[str, int]]:
        """Find the class folders in a dataset structured as follows::

            directory/
            ├── class_x
            │   ├── xxx.ext
            │   ├── xxy.ext
            │   └── ...
            │       └── xxz.ext
            └── class_y
                ├── 123.ext
                ├── nsdf3.ext
                └── ...
                └── asd932_.ext
        
            ...
        
        """

class ImageFolder(DatasetFolder):
    ...

allow_empty=True makes it treat empty folders as classes with zero
samples instead of raising an exception. So this does not help
if we want to let users load images from a flat folder.
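The point can be illustrated with a simplified, standard-library sketch of the class-discovery step. It is modelled on the behaviour described above (each subdirectory of the root is one class) and is an approximation for illustration, not the library code itself.

```python
import os


def find_classes(directory):
    """Simplified sketch: every subdirectory of `directory` is one class."""
    classes = sorted(e.name for e in os.scandir(directory) if e.is_dir())
    if not classes:
        # A flat folder of images has no subdirectories, so it ends up here.
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    class_to_idx = {name: i for i, name in enumerate(classes)}
    return classes, class_to_idx
```

In this sketch, image files sitting directly in the root are never looked at during class discovery, so a flat folder fails before any question of empty classes arises.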

Initially I was thinking about extending VisionDataset, to stay compatible
with torchvision datasets while keeping the implementation separate. However, it would
also make sense to modify DatasetFolder and keep the API consistent.

Let me know your thoughts on this, and if you are interested, I could propose
modifications to DatasetFolder.


2 participants