What's the difference between active_bytes and reserved_bytes? #47

Open
nyngwang opened this issue Aug 27, 2022 · 3 comments
@nyngwang

I need to show that gradient checkpointing really saves GPU memory during backward propagation. In the profiler output there are two columns on the left, active_bytes and reserved_bytes. In my test the active bytes read 3.83G while the reserved bytes read 9.35G, so why does PyTorch still reserve that much GPU memory?

@Stonesjtu
Owner

PyTorch caches CUDA memory to avoid the cost of repeated memory allocations; you can find more information here:

https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management

In your case, the reserved bytes should be the peak memory usage before checkpointing, while the active bytes should be the current memory usage after checkpointing.
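
A minimal sketch of the distinction (not from this thread; the tensor size is arbitrary and a CUDA device is assumed): once a tensor is freed, `torch.cuda.memory_allocated()` drops, but the caching allocator keeps the block, so `torch.cuda.memory_reserved()` stays high.

```python
import torch

# Allocate a large tensor, then free it: the caching allocator keeps the block.
x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32

print("allocated:", torch.cuda.memory_allocated())  # bytes backing live tensors
print("reserved: ", torch.cuda.memory_reserved())   # bytes held by the caching allocator

del x
print("after del -> allocated:", torch.cuda.memory_allocated())  # drops back toward 0
print("after del -> reserved: ", torch.cuda.memory_reserved())   # stays high (cached)
```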

@nyngwang
Author

nyngwang commented Aug 30, 2022

## VGG.forward

```
active_bytes reserved_bytes line  code
         all            all
        peak           peak
       5.71G         10.80G    50      @profile
                               51      def forward(self, x):
       3.86G          8.77G    52          out = self.features(x)
       2.19G          8.77G    53          out = self.classifier(out)
       2.19G          8.77G    54          return out
```

@Stonesjtu Could you help me re-check the code above? I checkpointed self.features internally (it is itself an nn.Module wrapping an nn.Sequential), but added the @profile decorator to the forward method of the outer class that uses the features (the conv2d layers), as shown above.
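
Roughly, the setup looks like the sketch below. This is only an illustration: the class name, layer sizes, and segment count are placeholders, and `checkpoint_sequential` stands in for however the segments are actually split inside `self.features`.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential
from pytorch_memlab import profile

class VGGLike(nn.Module):  # placeholder for the real VGG-style class
    def __init__(self, segments=2):
        super().__init__()
        self.segments = segments
        self.features = nn.Sequential(  # conv2d layers, checkpointed internally
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.classifier = nn.Linear(64 * 7 * 7, 1000)

    @profile  # pytorch_memlab profiles the outer forward, line by line
    def forward(self, x):
        out = checkpoint_sequential(self.features, self.segments, x)
        out = self.classifier(out.flatten(1))
        return out
```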

Q1: How can this be explained? If I keep the same batch size but change how I partition self.features internally (into checkpointed segments), the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

I also have two additional lines, printed before the stats above:

```
Max CUDA memory allocated on forward:  1.22G
Max CUDA memory allocated on backward: 5.71G
```

These are generated by the code appended below.

Q2: How should the reserved_bytes (10.80G and 8.77G) in the pytorch_memlab stats above be explained? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

```python
# Inside the training loop; i is the batch index, so the stats are only
# printed for the first batch.

# compute output
if i < 1:
    torch.cuda.reset_peak_memory_stats()
output = model(images)
loss = criterion(output, target)
if i < 1:
    print('Max CUDA memory allocated on forward: ', utils.readable_size(torch.cuda.max_memory_allocated()))

# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.detach().item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))

# compute gradient and do SGD step
if i < 1:
    torch.cuda.reset_peak_memory_stats()
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i < 1:
    print('Max CUDA memory allocated on backward: ', utils.readable_size(torch.cuda.max_memory_allocated()))
```

@Stonesjtu
Owner

> Q1: How can this be explained? If I keep the same batch size but change how I partition self.features internally (into checkpointed segments), the active_bytes of the next, non-checkpointed line self.classifier(out) also changes.

The active_bytes all/peak column is actually the peak active bytes during the execution of that line; it is an accumulated value that depends on the active bytes already present before the line executes.

e.g. if you have 4 Linear layers in an nn.Sequential, checkpointing after a later layer would consume fewer active bytes than checkpointing after layers[0].
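
A small experiment along these lines (arbitrary sizes, using `checkpoint_sequential` to vary the partitioning) makes the dependence visible; how the stack is split changes which activations stay alive, and therefore the peak that later lines inherit:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).cuda()
x = torch.randn(4096, 4096, device="cuda", requires_grad=True)

for segments in (1, 2, 4):
    torch.cuda.reset_peak_memory_stats()
    out = checkpoint_sequential(layers, segments, x)
    out.sum().backward()
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"segments={segments}: peak allocated = {peak:.0f} MiB")
```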


> Q2: How should the reserved_bytes (10.80G and 8.77G) in the pytorch_memlab stats above be explained? Does it mean that PyTorch internally allocates much more GPU memory than it really needs?

According to the PyTorch documentation:

> PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi.

PyTorch actually needs the cached memory at a certain point of execution, but at the time of your torch.cuda.max_memory_allocated call it no longer needs that much memory. You can try torch.cuda.empty_cache() before reading torch.cuda.max_memory_allocated.
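
A hedged sketch of that suggestion (tensor size arbitrary): `torch.cuda.empty_cache()` returns the unused cached blocks to the driver, so `torch.cuda.memory_reserved()` drops back toward what is actually allocated and nvidia-smi reports less usage.

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
del x  # freed by PyTorch, but the block stays cached by the allocator

print("reserved before empty_cache:", torch.cuda.memory_reserved())
torch.cuda.empty_cache()  # hand the cached, unused blocks back to the CUDA driver
print("reserved after empty_cache: ", torch.cuda.memory_reserved())
```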
