-
I don't see your data, but I would guess the energies in your dataset are very similar to each other, so the differences between these points are much smaller than the loss.
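A quick way to test this guess is to compare the spread of the reference energies against the reported loss. This is a minimal sketch with made-up numbers (the energy values and RMSE below are placeholders, not from the thread): if the standard deviation of the energies is smaller than the RMSE, the model cannot resolve the per-frame differences.

```python
import numpy as np

# Placeholder reference energies (substitute the values from your own data).
energies = np.array([-1250.12, -1250.15, -1250.11, -1250.14])

# Placeholder RMSE, e.g. read off a column of lcurve.out (same units as energies).
rmse_from_lcurve = 0.5

spread = energies.std()
print(f"energy spread: {spread:.4f}, reported RMSE: {rmse_from_lcurve}")
if spread < rmse_from_lcurve:
    print("Spread is below the loss: per-frame differences are unresolvable.")
```

If the check fires, the nearly flat, overlapping curves are expected: the loss is dominated by an offset the model cannot reduce by distinguishing frames.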
-
Hi, I'm training models with distributed training on 6 GPUs. No matter what batch size I use, I see the same strange behavior in the lcurve.out results.
The energy/force validation and training losses follow an almost identical pattern. Here is an example for the energy loss:
Should the results be this similar across the whole training run? It makes me think that the same, or very similar, frames are being fed into both the training and the validation steps.
The data has been shuffled and placed in a separate directory, and this directory is given in the input json.
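One way to rule out train/validation leakage before pointing the input json at the two directories is to build the split from a single shuffled index list and assert the two sets are disjoint. This is a generic sketch, not DeePMD-kit code; the frame count and fraction are hypothetical:

```python
import random

# Hypothetical frame count and validation fraction.
n_frames = 1000
val_fraction = 0.1

indices = list(range(n_frames))
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(indices)

# Split once, from the same shuffled list, so the sets cannot overlap.
n_val = int(n_frames * val_fraction)
val_idx = set(indices[:n_val])
train_idx = set(indices[n_val:])

assert not (train_idx & val_idx), "training and validation sets overlap"
print(len(train_idx), len(val_idx))  # 900 100
```

If the directories were instead populated by two independent shuffles of the same source data, duplicated frames would be likely, and the training and validation losses would track each other closely.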