
Added saving of best checkpoint #1858 (Closed)

Conversation

@sugeeth14 commented Aug 31, 2020

With reference to issue #1856, this adds saving of the best checkpoint based on perplexity.

@francoishernandez (Member) commented Sep 1, 2020

Not sure this is the best way to do it:

  • the model is actually saved twice, doubling the I/O
  • you only check for ppl, and probably not in the best place in the code

Can't we just have some flag in the EarlyStopping class that would prevent deleting the current best model when cleaning up?
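A minimal sketch of what such a flag could look like, assuming a simplified early stopper (the names EarlyStopper, best_step, and tolerance are illustrative, not OpenNMT-py's actual EarlyStopping API):

```python
# Minimal sketch: track which checkpoint step is the current best, so the
# cleanup step can skip it instead of copying the model. Hypothetical names.

class EarlyStopper:
    def __init__(self, tolerance):
        self.tolerance = tolerance   # validations without improvement allowed
        self.stalled = 0
        self.best_ppl = float("inf")
        self.best_step = None        # step of the current best checkpoint

    def __call__(self, valid_ppl, step):
        """Update state after a validation; return True when training should stop."""
        if valid_ppl < self.best_ppl:
            self.best_ppl = valid_ppl
            self.best_step = step    # cleanup should spare this checkpoint
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled >= self.tolerance
```

Checkpoint cleanup could then compare each file it is about to delete against best_step and leave that one in place.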

@sugeeth14 (Author) commented Sep 1, 2020

  1. It is inspired by fairseq, where the best checkpoint is also saved alongside the original checkpoint, though it is a copy: https://github.com/pytorch/fairseq/blob/983163494663e24b611f1ba8d5d47a3edc00e2e5/fairseq/checkpoint_utils.py#L60

When we use -keep_checkpoint, since only the last n checkpoints are saved, there is a possibility that the best one is lost. So isn't it a good idea to have the best checkpoint saved, so that we can resume training from the last checkpoints and also have the best one? (See the sketch after this list.)

  2. I wanted to check for validation loss to save the best, like fairseq, but then again we would need to validate at that point. In fairseq's case, validation and saving are done at the same time in all cases, even when using --save-interval-updates, so I thought of sticking to ppl.
    https://fairseq.readthedocs.io/en/latest/command_line_tools.html#Checkpointing
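For comparison, the fairseq behaviour referenced in point 1 boils down to copying the freshly saved checkpoint to a fixed "best" path whenever validation perplexity improves. A rough sketch, assuming a checkpoint_best.pt naming convention and an in-memory best_ppl tracker (neither taken from fairseq or OpenNMT-py code):

```python
import os
import shutil

best_ppl = float("inf")

def maybe_copy_best(ckpt_path, valid_ppl, save_dir):
    """fairseq-style: keep a copy of the best checkpoint alongside the
    rolling checkpoints. Note the model ends up on disk twice, which is
    the doubled I/O objection raised above."""
    global best_ppl
    if valid_ppl < best_ppl:
        best_ppl = valid_ppl
        shutil.copyfile(ckpt_path, os.path.join(save_dir, "checkpoint_best.pt"))
```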

Your thoughts?

@francoishernandez (Member)

  1. My point is that you don't need to copy the model; you can just "not remove it" when cleaning up to keep the last n (a sketch follows this list). And you can make a link to keep track of which checkpoint is currently the best in your training folder.
    As for resuming training, I'm not sure how to handle this anyway, since you won't have the metric of the "previous best" to compare to your new training metrics. Unless you embed it in the checkpoints; and if you do that, you can also embed the flag that says which model is the best, to prevent removing it.

  2. The cleanest way is probably to have something similar: force validation every -save_checkpoint_steps in the case of early stopping. Either way, this code does not belong where it is now.
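A sketch of the "do not remove it" cleanup from point 1, with a symlink recording which checkpoint is currently the best (the helper and argument names are made up for illustration):

```python
import os
from collections import deque

def cleanup_checkpoints(queue, new_ckpt, best_ckpt, keep_checkpoint):
    """Keep only the last keep_checkpoint checkpoints, but never delete
    the current best one; queue is a collections.deque of file paths."""
    queue.append(new_ckpt)
    while len(queue) > keep_checkpoint:
        victim = queue.popleft()
        if victim == best_ckpt:
            continue  # spare the best model instead of keeping a copy of it
        os.remove(victim)
    # a symlink records which checkpoint is currently the best
    link = os.path.join(os.path.dirname(best_ckpt), "checkpoint_best.pt")
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(os.path.basename(best_ckpt), link)
```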

@sugeeth14 (Author) commented Sep 2, 2020

  1. You think forcing validation every -save_checkpoint_steps just in the case of early stopping is enough? As I understand it, I would change it here, https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/trainer.py#L275, so that it validates every save_checkpoint_steps.

  2. In the case of model_saver.py, I want to add a lookup in a JSON file best_checkpoint.json having the values

{
  "best_checkpoint_step": step,
  "validation_ppl": ppl
}

and not delete that checkpoint when removing checkpoints, updating the file every time a new best checkpoint comes out of (2). This way, the info on which checkpoint in the directory is the best is saved for later lookup. Would that be okay? (A sketch follows below.)
Can you suggest where the new changes would be appropriate?
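A sketch of that bookkeeping, using the JSON layout above (the function names are hypothetical; they would sit near the cleanup logic in model_saver.py):

```python
import json
import os

BEST_INFO = "best_checkpoint.json"

def update_best_info(save_dir, step, ppl):
    """Overwrite best_checkpoint.json whenever a new best checkpoint appears."""
    info = {"best_checkpoint_step": step, "validation_ppl": ppl}
    with open(os.path.join(save_dir, BEST_INFO), "w") as f:
        json.dump(info, f, indent=2)

def is_best_checkpoint(save_dir, step):
    """Check a candidate-for-deletion step against the recorded best."""
    path = os.path.join(save_dir, BEST_INFO)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return json.load(f)["best_checkpoint_step"] == step
```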

@francoishernandez (Member)

> You think forcing validation every -save_checkpoint_steps just in the case of early stopping is enough? As I understand it, I would change it here, https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/trainer.py#L275, so that it validates every save_checkpoint_steps.

Sure, you can check whether self.earlystopper is None, and if not, you also have the save_checkpoint_steps value available.

Not sure about the json file, but it wouldn't hurt, I presume. As for where: you can probably put this in trainer.py. Retrieve a flag when self.earlystopper is called, and add a condition on this flag when model_saver is called a few lines down.
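Wiring this together on the trainer side might look roughly as follows; the loop shape, the attribute names (valid_steps, save_checkpoint_steps, earlystopper, model_saver), and the is_best keyword approximate OpenNMT-py's trainer.py rather than quote it, and the sketch reuses the hypothetical EarlyStopper from above:

```python
def maybe_validate_and_save(trainer, step):
    """Force validation every save_checkpoint_steps when early stopping is
    enabled, then pass an is_best flag down to the model saver."""
    do_validate = step % trainer.valid_steps == 0
    if trainer.earlystopper is not None:
        # validate at every save point, so a 'best' decision always exists
        do_validate = do_validate or step % trainer.save_checkpoint_steps == 0

    is_best = False
    if do_validate:
        valid_ppl = trainer.validate()
        trainer.earlystopper(valid_ppl, step)
        is_best = trainer.earlystopper.best_step == step

    if step % trainer.save_checkpoint_steps == 0:
        # the flag tells the saver which checkpoint to spare during cleanup
        trainer.model_saver.save(step, is_best=is_best)
```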

@francoishernandez (Member)

@raghava14 please note I just merged #1835, hence the conflicts.

@sugeeth14 (Author)

Hi, I created the PR here: #1859. Hope it is okay; please review.
