Suspicious extra node with opset 13 #1543

Closed
spillerrec opened this issue May 30, 2021 · 12 comments

@spillerrec

Describe the bug
Converting a checkpoint model gives a suspiciously small ONNX file, and using --opset 13 produces an extra node compared to the default opset:
[screenshot: converted graph with the extra Split node]
The Split node to the left does not appear when using opset 9. I initially thought this node was unconnected, but I may be mistaken: the Split node seems to use a constant input which isn't optimized away with opset 13.

Urgency
None

System information

  • Windows 10
  • Tensorflow Version: 1.14.0 and 2.5 tested
  • Python version: 3.7.5

To Reproduce
Download this model: https://drive.google.com/open?id=1IMwzqZUuRnTv5jcuKdvZx-RZweknww5x
Run:

python -m tf2onnx.convert --checkpoint bar/Train_775000.meta --output bar.onnx --inputs Placeholder:0,Placeholder_1:0,Placeholder_2:0 --outputs add:0

Additional context
I do not know the specifics of the model as I haven't made it, but since the file size is 180 MB before conversion and only 14 MB after, it looks like something isn't working as it should. The conversion does give an error about the missing op ExtractImagePatches (#436):

2021-05-30 21:24:41,962 - INFO - Using tensorflow=2.5.0, onnx=1.8.1, tf2onnx=1.9.0/daeab0
2021-05-30 21:24:41,962 - INFO - Using opset <onnx, 9>
2021-05-30 21:24:42,365 - INFO - Computed 0 values for constant folding
2021-05-30 21:24:43,890 - ERROR - Tensorflow op [CB1/ExtractImagePatches_1: ExtractImagePatches] is not supported
2021-05-30 21:24:43,896 - ERROR - Tensorflow op [CB1/ExtractImagePatches: ExtractImagePatches] is not supported
2021-05-30 21:24:43,949 - ERROR - Unsupported ops: Counter({'ExtractImagePatches': 2})
2021-05-30 21:24:43,977 - INFO - Optimizing ONNX model
2021-05-30 21:24:45,641 - INFO - After optimization: Cast -11 (11->0), Const -198 (251->53), Exp -8 (16->8), Identity -3 (3->0), ReduceMean -8 (24->16), Reshape +7 (10->17), Sub -16 (42->26), Transpose -37 (99->62), Upsample -1 (5->4)
2021-05-30 21:24:45,689 - INFO -
2021-05-30 21:24:45,689 - INFO - Successfully converted TensorFlow model .\bar\Train_775000.meta to ONNX
2021-05-30 21:24:45,689 - INFO - Model inputs: ['Placeholder:0', 'Placeholder_1:0', 'Placeholder_2:0']
2021-05-30 21:24:45,690 - INFO - Model outputs: ['add:0']
2021-05-30 21:24:45,690 - INFO - ONNX model is saved at bar-updated.onnx

But since the node is actually placed in the graph, I assume the output is fine and it just means I can't run it. I want to take a closer look at ExtractImagePatches and see if there is a way to work around it, but before I spend too much time trying to understand it and how it could be mapped to other operators, I want to be more confident that the ONNX file actually is correct, as I can't test it in its current state.

The checkpoint saved state might contain a discriminator network as well (assuming GAN training), which could perhaps explain the file size discrepancy. But I do not know TensorFlow and I can't tell from the code, nor do I know how to extract it to see whether it accounts for the remaining 166 MB. However, I do know that this model replaces an older Keras-based model which I successfully converted to ONNX, and that one was ~128 MB, so I remain suspicious.

@guschmue
Contributor

guschmue commented Jun 4, 2021

Interesting, awesome catch! The graph for opset-9/10 looks much better.
It seems to be triggered by the Pad operator in opset-11, when the pads argument moved from an attribute to a dynamic input. I think when the operator used attributes, we took the split/concat output, which had been constant, and made it an attribute; because of this the split/concat had no consumer, so we optimized it away.
But with opset-11 we wire the split/concat output to the Pad.
Going to be a little tricky to fix - we need to create a bunch of constants now ... let me look at it.
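(For illustration, not from the thread: a minimal onnx.helper sketch of the difference described above. Through opset 10 Pad carries the pads as a node attribute, so the Split/Concat that computed them loses its consumer and can be optimized away; from opset 11 the pads are a second input to Pad, so their producer stays wired into the graph.)

from onnx import helper

# Opset <= 10: pads are baked into the node as an attribute, so whatever
# computed them (e.g. a Split/Concat chain) has no consumer left.
pad_v9 = helper.make_node(
    "Pad", inputs=["x"], outputs=["y"],
    mode="constant", pads=[0, 0, 1, 1, 0, 0, 1, 1])

# Opset >= 11: pads are a dynamic input, so the node producing "pads"
# (here the suspicious Split/Concat) remains in the graph.
pad_v11 = helper.make_node(
    "Pad", inputs=["x", "pads"], outputs=["y"], mode="constant")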

@TomWildenhain-Microsoft
Contributor

I can't speak to the size difference, but I'm working on PRs to fix that extra node: #1553, #1554

@TomWildenhain-Microsoft
Contributor

@spillerrec the extra node should be removed now. Have you been able to work around the ExtractImagePatches op?

@spillerrec
Author

I have managed to get the model running by splitting it into two parts and running tf.compat.v1.extract_image_patches manually to forward the result into the second part. While the dimensions appear to match throughout the network, the final output is not correct, however.
I did have to make some hacks to get this to work, as onnx.utils.extract_model does not work with models containing custom nodes such as ExtractImagePatches. So I will give this another go later this week to see if I can get it working or figure out what is wrong.
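(A rough sketch of that two-part approach, for illustration; the file names, tensor names, and ExtractImagePatches parameters below are placeholders, not values from the actual model.)

import numpy as np
import onnxruntime as ort
import tensorflow as tf

# Hypothetical cut: "part1.onnx" ends just before ExtractImagePatches,
# "part2.onnx" starts just after it.
sess1 = ort.InferenceSession("part1.onnx")
sess2 = ort.InferenceSession("part2.onnx")

image = np.zeros((1, 256, 256, 3), dtype=np.float32)  # placeholder input
feats = sess1.run(None, {"Placeholder:0": image})[0]

# Run the unsupported op outside ONNX, via TensorFlow.
patches = tf.compat.v1.extract_image_patches(
    feats, ksizes=[1, 3, 3, 1], strides=[1, 1, 1, 1],
    rates=[1, 1, 1, 1], padding="SAME").numpy()

out = sess2.run(None, {"patches:0": patches})[0]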

@TomWildenhain-Microsoft
Contributor

You could try using onnxruntime_extensions instead of cutting the graph. It will let you implement the missing op in python (just call TF).

https://github.com/microsoft/onnxruntime-extensions/blob/main/tutorials/tf2onnx_custom_ops_tutorial.ipynb
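(Roughly the pattern that tutorial shows, sketched here under the assumption that the attribute values are hard-coded into the Python implementation; the notebook has the actual details, including how the custom-op domain is matched.)

import onnxruntime as ort
import tensorflow as tf
from onnxruntime_extensions import onnx_op, PyCustomOpDef, get_library_path

@onnx_op(op_type="ExtractImagePatches",
         inputs=[PyCustomOpDef.dt_float],
         outputs=[PyCustomOpDef.dt_float])
def extract_image_patches(x):
    # ksizes/strides/rates hard-coded here because non-string attributes
    # are not passed through (microsoft/onnxruntime-extensions#74).
    return tf.compat.v1.extract_image_patches(
        x, ksizes=[1, 3, 3, 1], strides=[1, 1, 1, 1],
        rates=[1, 1, 1, 1], padding="SAME").numpy()

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
sess = ort.InferenceSession("bar.onnx", so)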

@spillerrec
Author

Thanks for the suggestion! Sadly I'm not having a lot of luck this way either. The ExtractImagePatches operator uses attributes which are not strings and this is apparently not (yet) supported: microsoft/onnxruntime-extensions#74
I'm getting the same kind of error message:

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: Unable to find attribute 'ksizes' due to 'Attibute name and type don't match'.

If I try to ignore those integer attributes and hardcode them (as they happen to match for the two instances), I'm getting another error:

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running ExtractImagePatches node. Name:'CB1/ExtractImagePatches_1' Status Message: TypeError: extract_image_patches() missing 3 required positional arguments: 'ksizes', 'rates', and 'strides'

And if I try to patch those attributes out of the network I'm getting another validation error:

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: Unable to find attribute 'ksizes' due to 'No attribute with name:'ksizes'is defined.'.

I'm assuming some kind of schema was created during the export that it is now no longer matching. Well, at least it gives error messages that are actually insightful.

@TomWildenhain-Microsoft
Contributor

Hmm, you should be able to use it. Like you said, it doesn't support attributes so you will need to remove them from the graph. The tutorial I sent you shows an example of converting the attributes to inputs, but hard-coding the values should be fine too.

Your second error looks the most promising. It is actually calling the extract_image_patches function, but you seem to have defined that function with 3 extra positional args. If you are hard-coding their values into the TF call anyway, removing those arguments should fix the issue.

@spillerrec
Author

Using a conversion rule to remove the attributes worked, and the output is the same as when splitting the graph in two, i.e. still not correct. After further investigation I found that the inputs/outputs are supposed to be normalized to [-1, 1] instead of [0, 1], and with that change the output is as expected.
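(For reference, the normalization fix amounts to a rescale on each side of the network; a sketch, assuming the original data is in [0, 1]:)

x = image * 2.0 - 1.0      # [0, 1] -> [-1, 1] before feeding the network
result = (y + 1.0) / 2.0   # [-1, 1] -> [0, 1] on the network output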

So the small file size is indeed correct, and the extra nodes with opset 13 are gone on master. 🎉 Now I just need to find a proper alternative to the ExtractImagePatches op that doesn't rely on TensorFlow. And clean up the graph further: there are a bunch of transposes and squeeze/unsqueeze nodes you can get rid of if you move nodes around. (And the entire subgraph in the middle is replicated 8 times because the batch size is fixed to 8. Not sure if it is stored like that in the checkpoint or if it happens during export, but it is quite inconvenient.)

@TomWildenhain-Microsoft
Contributor

@spillerrec Great, glad to hear you got it working! What is your intended use case? You can probably implement ExtractImagePatches in Python using numpy, or you could write a C++ implementation (and even contribute it to the onnxruntime_extensions repo). For Transposes/Unsqueezes, we try to get rid of those, so if there is anything obvious feel free to open another issue. The 8 subgraphs probably aren't us; we don't duplicate stuff like that, but it can be introduced by TF, especially if you use a Python loop in your TF code.

@spillerrec
Author

@TomWildenhain-Microsoft I'm trying to convert existing AI projects to use ONNX models along with a simple implementation. I'm hoping this will make it easier for application developers (myself included) to include them in their own projects without having to deal with TensorFlow, PyTorch, etc. And to learn a bit about what kinds of network architectures are used and how they work in the process, which is why I'm spending some effort on making the graph prettier.

While I intend to use onnxruntime myself (likely both a C++ and a Python implementation), I would really like to avoid non-standard extensions if possible, as they will only cause friction if someone wants to convert the model to another runtime. (I would like to try onnxjs for example, though the lack of support for ConvTranspose means I only have an ESRGAN model to try it with right now.)
So if I can replace it with standard ops, which I think might be possible without dynamic dimensions, I would want to go that route.

I will start with a numpy implementation of ExtractImagePatches and take it from there, but the TensorFlow documentation is quite lacking and I'm not sure I will end up with a complete implementation I can feel confident works for all cases. But I will try to see if I can find a more general solution for what can be done with ExtractImagePatches, either some kind of conversion or an argument for why ExtractImagePatches should be standardized. I will comment in #436 if I make any progress on that.
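(Not from the thread: a minimal NumPy sketch of what ExtractImagePatches computes in the simplest case, rates = [1, 1, 1, 1] with VALID padding; SAME padding would need the input padded explicitly first.)

import numpy as np

def extract_image_patches(images, ksize, stride):
    # images: [N, H, W, C]; ksize = (kh, kw); stride = (sh, sw).
    # Returns [N, out_h, out_w, kh*kw*C], matching TF's flattening order.
    n, h, w, c = images.shape
    kh, kw = ksize
    sh, sw = stride
    out_h = (h - kh) // sh + 1
    out_w = (w - kw) // sw + 1
    out = np.empty((n, out_h, out_w, kh * kw * c), dtype=images.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = images[:, i * sh:i * sh + kh, j * sw:j * sw + kw, :]
            out[:, i, j, :] = patch.reshape(n, -1)
    return out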

@TomWildenhain-Microsoft
Contributor

@spillerrec Cool! Where are you putting the results of your conversion project? ONNX has a model zoo but it can be lacking: https://github.com/onnx/models.

Sounds like you are primarily using ONNX for its interoperability, not the runtime perf gains of onnxruntime. Is that accurate? I'm informally surveying to see whether people care more about onnxruntime CPU or GPU perf, but it sounds like perf might not be the primary concern for you.

The ideal solution would probably be converting the op into a sequence of onnx ops if possible. It is probably sufficient to convert just the rate = [1, 1, 1, 1] case. I'm a bit busy the next two weeks but maybe I'll get a chance soon-ish.
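(One possible mapping, an assumption on my part rather than something settled in this thread: with rates of 1, ExtractImagePatches behaves like a convolution with a fixed one-hot kernel that copies each (dy, dx, channel) offset into its own output channel, so it could be expressed with a standard Conv plus layout Transposes. A NumPy sketch of that kernel:)

import numpy as np

def patch_conv_kernel(kh, kw, c):
    # One-hot Conv weight in OIHW layout: output channel dy*kw*c + dx*c + ch
    # copies input channel ch at kernel offset (dy, dx). Conv(x_NCHW, W) with
    # the ExtractImagePatches strides, followed by an NCHW->NHWC Transpose,
    # reproduces the op for rates=[1, 1, 1, 1].
    w = np.zeros((kh * kw * c, c, kh, kw), dtype=np.float32)
    for dy in range(kh):
        for dx in range(kw):
            for ch in range(c):
                w[dy * kw * c + dx * c + ch, ch, dy, dx] = 1.0
    return w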

@spillerrec
Author

@TomWildenhain-Microsoft It is still at quite an early stage, so currently I just have a shared folder with the models: https://drive.google.com/drive/folders/1Rc_VxLYXKapwpde_7haCI6j4_wiooitE?usp=sharing
Some of them don't have descriptions or license metadata yet, so they are not quite final either. My plan is just to have a GitHub project with the inference code and links to the models.

For this, yes, performance is not the main focus. GPU acceleration without complicated dependencies (such as non-distributable CUDA libraries) is a wish. I would much rather have 20% slower GPU performance and be able to support non-NVIDIA hardware.

At work, we are starting to use ML, and there we do care about GPU performance. Running CNNs on 50 Mpix images at full resolution takes time, and CPU performance is way too slow. GPU performance is usable, but not great. (We are using CoreML and WinML at the moment.)
