-
Notifications
You must be signed in to change notification settings - Fork 736
Fix: data split issue #404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: data split issue #404
Conversation
What's wrong with setting the seed to a value other than 0? If we make this change, we will have to ensure that lines like if config.project.seed != 0:
seed_everything(config.project.seed) are changed in other files - |
If seed is non-zero, there is no problem. Does seed 0 means that the seed is not fixed in this project? It seems that Now, the PR code solves the data contamination problem, but each traininig data and test data are created identically for the seed 0. |
@jeongHwarr not sure if I quite understand the contamination problem. Currently we treat seed 0 as having no seed. However, even with seed 0 I get no overlap between train split and the test split. Here are the sample lines of code I am using for testing this. This does not raise Assertion Error for multiple runs. from anomalib.data.folder import make_dataset
from pathlib import Path
dataset = make_dataset(normal_dir=Path("../datasets/bottle/good/"), abnormal_dir=Path('../datasets/bottle/broken_large'), split_ratio=0.33, seed=0)
train_split = dataset[dataset.split == "train"]
test_split = dataset[dataset.split == "test"]
assert train_split["image_path"].values.all() != test_split["image_path"].values.all() Here is another sample using Folder DataModule folder_dataset = Folder(root=Path("../datasets/bottle"),normal_dir="good", abnormal_dir='broken_large', split_ratio=0.1, seed=0, image_size=256)
folder_dataset.setup()
assert folder_dataset.test_data.samples["image_path"].values.all() != folder_dataset.train_data.samples["image_path"].values.all() |
There is no overlap in your test code because you called However, in the code of this project,
To solve this problem, |
@jeongHwarr I see what you mean. Thanks for spotting this. I'll look into the issue. But does it make sense to keep seed 0 as no seed or should we refactor the code to make |
@jeongHwarr Thanks again for spotting this issue and creating a PR for addressing it. Is it possible for you to make a few changes to this PR (https://github.com/openvinotoolkit/anomalib/pull/437/files) so that we can merge this and you can be a contributor to this repo 😀. Other option is to always merge this and I can create a new PR to address this. |
I think that I've reviewed your PR and nothing seems to change! |
Description
Provide a summary of the modification as well as the issue that has been resolved. List any dependencies that this modification necessitates.
Fixes When test_split is applied to a custom dataset, test data contamination occurs #403
Changes
Checklist