only_train_once: oto.compress failed with "xs.append(param.data.view(cc.num_groups, -1))" in graphy.py
@tianyic Hi, when I tried OTO with the following case, oto.compress failed. Could you please give some advice?
import torch
import torch.nn as nn

from only_train_once import OTO


class DemoNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1)
        return self.fc(x)


if __name__ == "__main__":
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")

    oto = OTO(model=model, dummy_input=fake_input)
    oto.compress()
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 47 (26 by maintainers)
@songkq Thanks.
The next generation of OTO will be on another vertical. Vanilla transformer support can be considered an extension within the current OTOv2, and is actually ongoing in a PR. The key is to support the matmul operator. We have not merged that PR yet because it does not yet rigorously handle the bias stored in the add operator.
Another reason we are not urgently pushing transformer support is that standard structured pruning more easily causes regression on transformers than on CNNs. You might notice that some recent pruning works claim negligible performance regression on transformers, but they are typically unstructured pruning and therefore of little use in practice. We believe low-rank analysis should be leveraged in transformer pruning, so we are postponing transformer support, or more precisely matmul and add-bias support, until we have sufficient bandwidth to solve that problem fundamentally.
@tianyic Thanks for the information. I did some further debugging and found the problem is similar to the above. After the modification, the issue is resolved. Thanks!
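For concreteness, here is a minimal sketch (not part of OTO or the pending PR) of the MatMul/Add point above: exporting a single nn.Linear applied to a 3D activation and printing the resulting ONNX node types. Depending on the torch/opset version, the layer may be lowered to a single Gemm or to MatMul followed by Add, in which case the bias is only reachable through the Add operator.

# Minimal sketch: see where the Linear bias ends up in the exported ONNX graph.
# The file name "linear.onnx" and the 3D dummy shape are arbitrary choices.
import torch
import torch.nn as nn
import onnx

layer = nn.Linear(1024, 512)
dummy = torch.randn(1, 81, 1024)  # 3D input, as in the repro above
torch.onnx.export(layer, dummy, "linear.onnx")
print([node.op_type for node in onnx.load("linear.onnx").graph.node])
# e.g. ['MatMul', 'Add'] -- the bias lives in the Add node rather than a Gemm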
@tianyic Thanks for the fix. However, it doesn't work with torch=1.8.1 and onnx=1.10.1. Maybe a bug in torch 1.8.1.
Although there is a bug in torch 1.8.1, I've verified the effectiveness of OTO in my case with target_group_sparsity=0.1, where the pruned model has a negligible accuracy drop. Good job~ I will try to enlarge the target_group_sparsity with oto=2.0.10 and torch=1.11.0 later.
@tianyic Hi, I found that bias=False is not the root cause of this issue. Maybe the torch version (torch=1.8.1) or the default opset version causes the problem. When I try bias=False with torch=1.11.0+cu113 and onnx=1.10.1, everything is OK. I still suspect that the transpose and reshape operations under different opset versions cause the problem. If possible, the opset version could be exposed as an optional configuration of OTO.
torch = 1.11.0 with bias=False
torch = 1.8.1 with bias=False
However, when I check the _export_onnx_opset_version used in _optimize_trace, torch 1.11.0 and torch 1.8.1 have the same _export_onnx_opset_version. I'm confused about this …
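One way to take the guesswork out of this (a generic torch/onnx sketch, not an OTO API) is to export the traced model with an explicit opset_version and read back the opset actually recorded in the ONNX file, together with the Transpose/Reshape nodes that differ between torch versions. The file name "demo.onnx" and opset 11 are arbitrary choices here.

# Minimal sketch: export DemoNet (from the repro at the top of this issue) and
# inspect the recorded opset plus the Transpose/Reshape nodes in the graph.
import torch
import onnx

model = DemoNet().eval()
dummy = torch.randn(1, 512, 2, 81)
torch.onnx.export(model, dummy, "demo.onnx", opset_version=11)

proto = onnx.load("demo.onnx")
print([f"{imp.domain or 'ai.onnx'}:{imp.version}" for imp in proto.opset_import])
print([n.op_type for n in proto.graph.node if n.op_type in ("Transpose", "Reshape")])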
A good question.
The DHSPG optimizer is a hybrid optimizer and does have some computational overhead during pruning (while group sparsity is increasing). The overhead typically varies with the model and dataset. For the majority of models it is negligible, but not for all (in the worst case I have met, it roughly doubled the cost). Note, however, that the overhead is temporary and disappears once group sparsity reaches the target value (afterwards DHSPG performs the same as the baseline optimizer).
Therefore, to speed things up, I would suggest shrinking the pruning stage, i.e., making the group sparsity increase faster toward the target value, which can typically be achieved by tuning the hyperparameters related to group sparsity exploration. In fact, most of the experiments I conducted could shrink the pruning stage to just a few epochs, which largely mitigates the overhead. Meanwhile, there might be some engineering tricks in the official torch version that could be leveraged to further speed up DHSPG.
Hope the above helps.
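To make the suggestion concrete, here is a rough sketch of the overall flow, assuming the OTOv2-style oto.dhspg API. The variant and lr keywords are assumptions based on the OTOv2 README and may differ in your installed version; target_group_sparsity=0.1 is the value quoted earlier in this thread, and the training loop below is a placeholder for your own data and loss.

# Rough sketch, assuming the OTOv2-style API: build the DHSPG optimizer, keep
# the sparsity-exploration phase short, then compress once the target is hit.
import torch
from only_train_once import OTO

model = DemoNet()                           # module from the repro above
fake_input = torch.randn(1, 512, 2, 81)
oto = OTO(model=model, dummy_input=fake_input)

# Keyword names other than target_group_sparsity are assumptions.
optimizer = oto.dhspg(variant='sgd', lr=0.1, target_group_sparsity=0.1)

for step in range(100):                     # stands in for the real training loop
    optimizer.zero_grad()
    loss = model(fake_input).pow(2).mean()  # dummy loss, illustration only
    loss.backward()
    optimizer.step()
# Once group sparsity reaches the target, DHSPG costs the same per step as the
# baseline optimizer, so only the ramp-up phase pays the overhead.

oto.compress()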