What's the difference between active_bytes and reserved_bytes? #47

Open
nyngwang opened this issue Aug 27, 2022 · 3 comments
@nyngwang

I need to show that gradient checkpointing really saves GPU memory during backward propagation. In the profiler output there are two columns on the left, active_bytes and reserved_bytes. In my test the active bytes read 3.83G while the reserved bytes read 9.35G, so why does PyTorch still reserve that much GPU memory?

@Stonesjtu
Owner

PyTorch caches CUDA memory to avoid the cost of repeated memory allocations; you can find more information here:

https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

In your case, the reserved bytes should be the peak memory usage before checkpointing, while the active bytes should be the current memory usage after checkpointing.
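
A minimal sketch of the distinction (not from this thread; the tensor size is arbitrary and a CUDA device is assumed): once a tensor is freed, `torch.cuda.memory_allocated()` drops, but the caching allocator keeps the block, so `torch.cuda.memory_reserved()` stays high.

```python
import torch

# Allocate a large tensor, then free it: the caching allocator keeps the block.
x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32

print("allocated:", torch.cuda.memory_allocated())  # bytes backing live tensors
print("reserved: ", torch.cuda.memory_reserved())   # bytes held by the caching allocator

del x
print("after del -> allocated:", torch.cuda.memory_allocated())  # drops back toward 0
print("after del -> reserved: ", torch.cuda.memory_reserved())   # stays high (cached)
```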

@nyngwang
Author

nyngwang commented Aug 30, 2022

## VGG.forward

```
active_bytes reserved_bytes line  code
         all            all
        peak           peak
       5.71G         10.80G    50      @profile
                               51      def forward(self, x):
       3.86G          8.77G    52          out = self.features(x)
       2.19G          8.77G    53          out = self.classifier(out)
       2.19G          8.77G    54          return out
```

@Stonesjtu Could you help me re-check the code above? I checkpointed self.features internally (it is itself an nn.Module wrapping an nn.Sequential), but added the @profile decorator to the forward method of the outer class that uses the features (the conv2d layers), as shown above.
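
Roughly, the setup looks like the sketch below. This is only an illustration: the class name, layer sizes, and segment count are placeholders, and `checkpoint_sequential` stands in for however the segments are actually split inside `self.features`.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential
from pytorch_memlab import profile

class VGGLike(nn.Module):  # placeholder for the real VGG-style class
    def __init__(self, segments=2):
        super().__init__()
        self.segments = segments
        self.features = nn.Sequential(  # conv2d layers, checkpointed internally
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.classifier = nn.Linear(64 * 7 * 7, 1000)

    @profile  # pytorch_memlab profiles the outer forward, line by line
    def forward(self, x):
        out = checkpoint_sequential(self.features, self.segments, x)
        out = self.classifier(out.flatten(1))
        return out
```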

Q1: How can this be explained? If I keep the same batch size but change how I partition self.features internally (into checkpointed segments), the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

I also have two additional lines, printed before the stats above:

```
Max CUDA memory allocated on forward:  1.22G
Max CUDA memory allocated on backward: 5.71G
```

These are generated by the code appended below.

Q2: How should the reserved_bytes (10.80G and 8.77G) in the pytorch_memlab stats above be explained? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

```python
# Inside the training loop; i is the batch index, so the stats are only
# printed for the first batch.

# compute output
if i < 1:
    torch.cuda.reset_peak_memory_stats()
output = model(images)
loss = criterion(output, target)
if i < 1:
    print('Max CUDA memory allocated on forward: ', utils.readable_size(torch.cuda.max_memory_allocated()))

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.detach().item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
if i < 1:
    torch.cuda.reset_peak_memory_stats()
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i < 1:
    print('Max CUDA memory allocated on backward: ', utils.readable_size(torch.cuda.max_memory_allocated()))
```

@Stonesjtu
Owner

> Q1: How can this be explained? If I keep the same batch size but change how I partition self.features internally (into checkpointed segments), the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

The active_bytes all/peak column is actually the peak active bytes during the execution of that line; it is an accumulated value that depends on the active bytes already present before the line executes.

e.g. if you have 4 Linear layers in an nn.Sequential, checkpointing after a later layer would consume fewer active bytes than checkpointing after layers[0].
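
A small experiment along these lines (arbitrary sizes, using `checkpoint_sequential` to vary the partitioning) makes the dependence visible; how the stack is split changes which activations stay alive, and therefore the peak that later lines inherit:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).cuda()
x = torch.randn(4096, 4096, device="cuda", requires_grad=True)

for segments in (1, 2, 4):
    torch.cuda.reset_peak_memory_stats()
    out = checkpoint_sequential(layers, segments, x)
    out.sum().backward()
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"segments={segments}: peak allocated = {peak:.0f} MiB")
```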


> Q2: How should the reserved_bytes (10.80G and 8.77G) in the pytorch_memlab stats above be explained? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

According to the PyTorch documentation:

> PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi.

PyTorch actually needs the cached memory at a certain point of execution, but at the time of your torch.cuda.max_memory_allocated call it no longer needs that much memory. You can try torch.cuda.empty_cache() before reading torch.cuda.max_memory_allocated.
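
A hedged sketch of that suggestion (tensor size arbitrary): `torch.cuda.empty_cache()` returns the unused cached blocks to the driver, so `torch.cuda.memory_reserved()` drops back toward what is actually allocated and nvidia-smi reports less usage.

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
del x  # freed by PyTorch, but the block stays cached by the allocator

print("reserved before empty_cache:", torch.cuda.memory_reserved())
torch.cuda.empty_cache()  # hand the cached, unused blocks back to the CUDA driver
print("reserved after empty_cache: ", torch.cuda.memory_reserved())
```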
