Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add train, val, test folder paths in data.yaml at save_data_yaml() #1422

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

xaristeidou
Copy link
Contributor

@xaristeidou xaristeidou commented Aug 1, 2024

Description

In the process of developing a notebook as scheduled in #1388, I used the sv.DetectionDataset().as_yolo() method which executes in the backed the save_data_yaml() function to create the data.yaml file need for the dataset.

As YOLO the model construction for training, and the documentation, the model needs prerequisite train, val arguments in the data.yaml file, and the test argument is not needed but could be passed also if a test dataset exists. When trying to run model.train() the following error raises:
SyntaxError: /content/dataset/data.yaml 'train:' key missing ❌. 'train' and 'val' are required in all data YAMLs.

Therefore in save_data_yaml() except the nc, names arguments we should export also the train, val, test paths in order to be ready for executing the model.train() process. I think we should export the default paths as follows:

  • train: train/images
  • val: valid/images
  • test: test/images

If someone has a different working directory than the root of the folder containing the data.yaml, should change these paths manually. At least it will be easier to debug and modify the path if needed than to add the arguments in the yaml file.

List any dependencies that are required for this change.

None

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How has this change been tested, please provide a testcase or example of how you tested the change?

Using sv.DetectionDataset.as_yolo() exports the data.yaml file with prerequisite train, val, test paths.

Any specific deployment considerations

None

Docs

  • Docs updated? No need to

@SkalskiP
Copy link
Collaborator

SkalskiP commented Aug 5, 2024

Hi @xaristeidou 👋🏻! Thanks for staying active in the supervision repo! As for the PR you opened, the main problem I see is that YOLOv8 shares a single data.yaml file across train, test, and valid subsets. During the as_yolo call, we can only guarantee that one of the subsets is being saved.

@xaristeidou
Copy link
Contributor Author

@SkalskiP Hello there 👋🏻! Yes indeed, if we want to create train, valid, test subset folders for the dataset training we have to run the process one time for each subset. Nevertheless, all 3 subset folder datasets contain the same names and nc values. Even if we export using as_yolo() for train, valid, or test, in the end all of them must have the same data.yaml. I think the problem we encounter is when someone doesn't want to export all 3 subset folders (train, valid, test). In this case, it will automatically add train, val, test paths in the data.yaml file. Even if the addition of these paths in the yaml file won't raise any error in supervision methods and functions, it kind of assumes that there are 3 subsets of the dataset, which contradicts with the build and usage of the as_yolo() method.

I think there are 3 scenarios in that case:

  1. Add train, val, test folder paths in data.yaml in order to be ready for any training process even if it contradicts (the truth is that now I don't really like the concept that something out of the method scope will be executed)
  2. We could add a log message in the execution of save_data_yaml() which runs internal in as_yolo(), to inform users to define the paths for training process if needed.
  3. Change nothing and leave it as it is. In this case the user will have to manually find out for the subset folders paths to define if training process is needed. In my notebook I could add the train, val, test path using code, in order to showcase how could be done in this way also.

@SkalskiP
Copy link
Collaborator

SkalskiP commented Aug 6, 2024

@xaristeidou, the problem we encounter is when someone doesn't want to export all 3 subset folders, or it could be that the person simply doesn't know they should.

One idea I had is that as_yolo would create data.yaml if it doesn't exist, but append to it if it does. When you call data.yaml, you would also specify whether it's the train, test, or val subset. What do you think about it? IS it too complicated?

@xaristeidou
Copy link
Contributor Author

xaristeidou commented Aug 6, 2024

@SkalskiP Well not too complicated. We could modify as_yolo() which is the function that users have access to when exporting, and we could add an argument subset_type: Optional[str] = None and if provided will accept one of the following ["train", "val", "test"] and run a brief check if the value is in this list. If not provided, won't write in yaml the path.

as_yolo would create data.yaml if it doesn't exist, but append to it if it does the problem in this case is that maybe there is a data.yaml already which has some previous configuration (maybe different names and nc). I think we should stick with the current implementation that modifies the yaml file in order to avoid not updating names and nc if needed. In the case that someone runs as yolo three times, one for train, valid, test folders respectively, to create a dataset, then names and nc will be the same (because of split() which keeps same attributes), so there will be no changes, if they are not the same this will be a mistake in the dataset from the user beforehand if he/she didn't use split() and loaded two different datasets with different configuration, so it should be fixed in a previous code section.

The function should modify the names, nc, and regarding the selection of the subset_type it will modify or append the corresponding field.

@SkalskiP
Copy link
Collaborator

SkalskiP commented Aug 6, 2024

@xaristeidou, wanna try to implement the PoC of this solution?

@xaristeidou
Copy link
Contributor Author

@SkalskiP Yes! I will be back when it is ready.

@xaristeidou
Copy link
Contributor Author

xaristeidou commented Aug 7, 2024

@SkalskiP I have added the new changes. Here is a Colab notebook for easy testing.

https://colab.research.google.com/drive/1BL7c2ycXkuCrEE5JOqh7u7Zf1kJzgb43?usp=sharing

@xaristeidou
Copy link
Contributor Author

@SkalskiP Did you manage to take a look at the new committed changes and test with the notebook?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants