Add `train, val, test` folder paths in `data.yaml` at `save_data_yaml()` #1422

xaristeidou · 2024-08-01T12:50:15Z

Description

In the process of developing a notebook as scheduled in #1388, I used the sv.DetectionDataset().as_yolo() method which executes in the backed the save_data_yaml() function to create the data.yaml file need for the dataset.

As YOLO the model construction for training, and the documentation, the model needs prerequisite train, val arguments in the data.yaml file, and the test argument is not needed but could be passed also if a test dataset exists. When trying to run model.train() the following error raises:
SyntaxError: /content/dataset/data.yaml 'train:' key missing ❌. 'train' and 'val' are required in all data YAMLs.

Therefore in save_data_yaml() except the nc, names arguments we should export also the train, val, test paths in order to be ready for executing the model.train() process. I think we should export the default paths as follows:

train: train/images
val: valid/images
test: test/images

If someone has a different working directory than the root of the folder containing the data.yaml, should change these paths manually. At least it will be easier to debug and modify the path if needed than to add the arguments in the yaml file.

List any dependencies that are required for this change.

None

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

How has this change been tested, please provide a testcase or example of how you tested the change?

Using sv.DetectionDataset.as_yolo() exports the data.yaml file with prerequisite train, val, test paths.

Any specific deployment considerations

None

Docs

Docs updated? No need to

SkalskiP · 2024-08-05T09:11:00Z

Hi @xaristeidou 👋🏻! Thanks for staying active in the supervision repo! As for the PR you opened, the main problem I see is that YOLOv8 shares a single data.yaml file across train, test, and valid subsets. During the as_yolo call, we can only guarantee that one of the subsets is being saved.

xaristeidou · 2024-08-06T07:06:36Z

@SkalskiP Hello there 👋🏻! Yes indeed, if we want to create train, valid, test subset folders for the dataset training we have to run the process one time for each subset. Nevertheless, all 3 subset folder datasets contain the same names and nc values. Even if we export using as_yolo() for train, valid, or test, in the end all of them must have the same data.yaml. I think the problem we encounter is when someone doesn't want to export all 3 subset folders (train, valid, test). In this case, it will automatically add train, val, test paths in the data.yaml file. Even if the addition of these paths in the yaml file won't raise any error in supervision methods and functions, it kind of assumes that there are 3 subsets of the dataset, which contradicts with the build and usage of the as_yolo() method.

I think there are 3 scenarios in that case:

Add train, val, test folder paths in data.yaml in order to be ready for any training process even if it contradicts (the truth is that now I don't really like the concept that something out of the method scope will be executed)
We could add a log message in the execution of save_data_yaml() which runs internal in as_yolo(), to inform users to define the paths for training process if needed.
Change nothing and leave it as it is. In this case the user will have to manually find out for the subset folders paths to define if training process is needed. In my notebook I could add the train, val, test path using code, in order to showcase how could be done in this way also.

SkalskiP · 2024-08-06T07:20:08Z

@xaristeidou, the problem we encounter is when someone doesn't want to export all 3 subset folders, or it could be that the person simply doesn't know they should.

One idea I had is that as_yolo would create data.yaml if it doesn't exist, but append to it if it does. When you call data.yaml, you would also specify whether it's the train, test, or val subset. What do you think about it? IS it too complicated?

xaristeidou · 2024-08-06T08:41:52Z

@SkalskiP Well not too complicated. We could modify as_yolo() which is the function that users have access to when exporting, and we could add an argument subset_type: Optional[str] = None and if provided will accept one of the following ["train", "val", "test"] and run a brief check if the value is in this list. If not provided, won't write in yaml the path.

as_yolo would create data.yaml if it doesn't exist, but append to it if it does the problem in this case is that maybe there is a data.yaml already which has some previous configuration (maybe different names and nc). I think we should stick with the current implementation that modifies the yaml file in order to avoid not updating names and nc if needed. In the case that someone runs as yolo three times, one for train, valid, test folders respectively, to create a dataset, then names and nc will be the same (because of split() which keeps same attributes), so there will be no changes, if they are not the same this will be a mistake in the dataset from the user beforehand if he/she didn't use split() and loaded two different datasets with different configuration, so it should be fixed in a previous code section.

The function should modify the names, nc, and regarding the selection of the subset_type it will modify or append the corresponding field.

SkalskiP · 2024-08-06T09:31:15Z

@xaristeidou, wanna try to implement the PoC of this solution?

xaristeidou · 2024-08-06T09:52:23Z

@SkalskiP Yes! I will be back when it is ready.

xaristeidou · 2024-08-07T13:12:23Z

@SkalskiP I have added the new changes. Here is a Colab notebook for easy testing.

https://colab.research.google.com/drive/1BL7c2ycXkuCrEE5JOqh7u7Zf1kJzgb43?usp=sharing

xaristeidou · 2024-08-14T15:21:17Z

@SkalskiP Did you manage to take a look at the new committed changes and test with the notebook?

xaristeidou and others added 2 commits August 1, 2024 15:24

added train, val, test directory paths for yolo data.yaml

d5f358d

fix(pre_commit): 🎨 auto format pre-commit hooks

a190149

xaristeidou mentioned this pull request Aug 5, 2024

Create new cookbook for utilizing supervision methods to easily create YOLO datasets for training #1388

Open

2 tasks

xaristeidou added 5 commits August 7, 2024 13:19

added subset_type argument to as_yolo()

2770acc

added subset_type argument to save_data_yaml()

511578c

modified as_yolo() to pass subset_type argument

4d5fb6c

fixed typo in descriptions

c4419cf

🎨 pre-commit fix

a2be707

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `train, val, test` folder paths in `data.yaml` at `save_data_yaml()` #1422

Add `train, val, test` folder paths in `data.yaml` at `save_data_yaml()` #1422

xaristeidou commented Aug 1, 2024 •

edited

Loading

SkalskiP commented Aug 5, 2024

xaristeidou commented Aug 6, 2024

SkalskiP commented Aug 6, 2024

xaristeidou commented Aug 6, 2024 •

edited

Loading

SkalskiP commented Aug 6, 2024

xaristeidou commented Aug 6, 2024

xaristeidou commented Aug 7, 2024 •

edited

Loading

xaristeidou commented Aug 14, 2024

Add train, val, test folder paths in data.yaml at save_data_yaml() #1422

Are you sure you want to change the base?

Add train, val, test folder paths in data.yaml at save_data_yaml() #1422

Conversation

xaristeidou commented Aug 1, 2024 • edited Loading

Description

How has this change been tested, please provide a testcase or example of how you tested the change?

Any specific deployment considerations

Docs

SkalskiP commented Aug 5, 2024

xaristeidou commented Aug 6, 2024

SkalskiP commented Aug 6, 2024

xaristeidou commented Aug 6, 2024 • edited Loading

SkalskiP commented Aug 6, 2024

xaristeidou commented Aug 6, 2024

xaristeidou commented Aug 7, 2024 • edited Loading

xaristeidou commented Aug 14, 2024

Add `train, val, test` folder paths in `data.yaml` at `save_data_yaml()` #1422

Add `train, val, test` folder paths in `data.yaml` at `save_data_yaml()` #1422

xaristeidou commented Aug 1, 2024 •

edited

Loading

xaristeidou commented Aug 6, 2024 •

edited

Loading

xaristeidou commented Aug 7, 2024 •

edited

Loading