stb resize 2.08 #1649

Open
wants to merge 4 commits into
base: master


jeffrbig2

fix for RGB->BGR three channel flips and add SIMD (thanks to Ryan Salsbury)
fix for sub-rect resizes
use pragmas to control unrolling when they are available.

@ryanrsrs

ryanrsrs commented Jun 13, 2024

I tested this change on my Raspberry Pi 4B running Raspberry Pi OS in 32-bit mode:
$ uname -a
Linux raspberrypi 6.6.31+rpt-rpi-v7l #1 SMP Raspbian 1:6.6.31-1+rpt1 (2024-05-29) armv7l GNU/Linux

The color bug I noticed in stbir__simple_flip_3ch() is fixed, in both scalar and simd paths. On my platform, stbir__simdf_swiz2 is not defined and it selects the second SIMD code block, using stbir__simdf_swiz().
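For readers who have not looked at the routine under test: a three-channel flip swaps the first and third component of every pixel, which is what turns RGB into BGR. The sketch below is a plain scalar illustration on packed 8-bit pixels. It is not the library's stbir__simple_flip_3ch(), and the name flip_3ch_u8 and the 8-bit layout are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of stb_image_resize2): swap byte 0 and
   byte 2 of every tightly packed 3-byte pixel, in place. Applying it
   twice restores the original order. */
static void flip_3ch_u8(uint8_t *pixels, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = pixels + 3 * i;   /* start of pixel i */
        uint8_t first = p[0];
        p[0] = p[2];                   /* third channel moves to the front */
        p[2] = first;                  /* first channel moves to the back */
        /* p[1] (the middle channel) is left untouched */
    }
}
```

A color bug in this kind of routine typically shows up as red and blue being swapped or smeared while green stays correct, which makes it easy to spot on a test image.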

The change in speed from enabling SIMD is slight (but consistent). I have verified which code paths are executing using printfs.

With gcc, SIMD gave a 15% speedup.
With clang, SIMD gave a 3% slowdown.

GCC build options:
cc -std=gnu11 -Wall -I/usr/include/libdrm -Os -march=native -DSTBIR_USE_FMA -mfpu=neon-vfpv4 -mfp16-format=ieee -Wno-unused-function -c stb_impl.c

Clang build options:
clang -std=gnu11 -Wall -I/usr/include/libdrm -Os -march=native -DSTBIR_USE_FMA -mfpu=neon-vfpv4 -Wno-unused-function -c stb_impl.c

The fastest version, Clang with -DSTBIR_NO_SIMD (lol), performs as follows:
src: 6048 x 8064
dst: 900 x 1200
time: 1.003 seconds

I'm not sure why it's so slow since it's only 150 MB of pixels. Maybe the long scanlines are thrashing the cache in a maximally-bad way?
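The "150 MB" figure can be checked with a little arithmetic. The sketch below assumes tightly packed pixels with 3 channels of 1 byte each, which the comment does not state explicitly; image_bytes is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes of a tightly packed image.
   The caller supplies the channel count and bytes per channel;
   3 x 1 (8-bit RGB) is an assumption for the numbers in this thread. */
static uint64_t image_bytes(uint32_t w, uint32_t h,
                            uint32_t channels, uint32_t bytes_per_channel)
{
    /* Widen before multiplying so large images do not overflow 32 bits. */
    return (uint64_t)w * h * channels * bytes_per_channel;
}
```

Under that assumption, image_bytes(6048, 8064, 3, 1) is 146,313,216 bytes for the source and image_bytes(900, 1200, 3, 1) is 3,240,000 bytes for the destination, so each output pixel corresponds to roughly 45 input pixels before any filter overlap is counted.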

The speed is fine for my application, and matches the 2.07 non-SIMD speed, so I dunno if there's a problem. But if you expected a bigger difference on this platform, I can poke at it some more.

e: All times mentioned above are for the call to stbir_resize_extended(), which does much more work than just flip_3ch(). But even the core resizer math doesn't speed up with SIMD, really? Maybe I am doing something wrong here.

e2: just rechecked 2.08 times against 2.07, both scalar and SIMD. They're the same. So this does not seem like a regression, just something I noticed now, since I am comparing simd and not-simd back-to-back to see that the color was fixed in both.

@jeffrbig2
Author

That's a reasonably big downsample (depending on your filter) - 1 second doesn't seem nuts for a 32-bit platform that is reading 150 MB of input with a sample window of 27x20 (each output pixel has to read 27x20 of the input). 32-bit vs 64-bit is a huge hit here, btw. There are a couple things you can do:

  1. throw threads at it - this is a linear speed up - 2x cores, half the time.
  2. use linear pixel format - STBIR_TYPE_UINT8 instead of STBIR_TYPE_UINT8_SRGB
  3. don't use wrap edge mode
  4. use a simpler filter, STBIR_FILTER_BOX or STBIR_FILTER_TRIANGLE.
  5. to make better cache use, break the resize into vertical stripes (use the stbir_set_pixel_subrect function to do 128 vertical output pixels at a time). This will usually save 25% to 50%.

For option 5, you can also wait for 2.09 which will internally do the cache striping for you.
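A rough sketch of the bookkeeping behind option 5: it only computes the output sub-rectangles, reading "vertical stripes" as full-height columns of at most 128 output pixels wide (an interpretation, not something the comment spells out). The stripe_rect type and make_vertical_stripes function are hypothetical names and are not part of stb_image_resize2.

```c
#include <stddef.h>

/* One output sub-rectangle: a full-height column of the output image. */
typedef struct {
    int x, y, w, h;
} stripe_rect;

/* Hypothetical helper: split an out_w x out_h output into vertical
   stripes at most stripe_w pixels wide. Writes up to max_out rectangles
   into out and returns how many were written. */
static size_t make_vertical_stripes(int out_w, int out_h, int stripe_w,
                                    stripe_rect *out, size_t max_out)
{
    size_t n = 0;
    if (out_w <= 0 || out_h <= 0 || stripe_w <= 0)
        return 0;
    for (int x = 0; x < out_w && n < max_out; x += stripe_w) {
        int remaining = out_w - x;
        out[n].x = x;
        out[n].y = 0;
        out[n].w = remaining < stripe_w ? remaining : stripe_w; /* last stripe may be narrower */
        out[n].h = out_h;
        ++n;
    }
    return n;
}
```

For the 900 x 1200 destination in this thread, a stripe width of 128 gives eight rectangles, the last one only 4 pixels wide. Each rectangle would then be handed to stbir_set_pixel_subrect before a call to stbir_resize_extended; the exact arguments and call order should be checked against the header.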

But yeah, 32-bit arm is just pretty darn pokey in general.

@ryanrsrs

Yep, I'm not complaining about the performance, I just wanted to check that the numbers seemed sensible.

The application I'm testing is decode and display of 45MP iPhone 15 heic files on a Rasp Pi Zero 2 W 512MB. (It works!)

@jeffrbig2
Author

There's probably some more wins if you want to get fancy. Instead of decoding the HEIC into RGB and then resizing that, decode into YUV (where the U and V planes are smaller), resize those planes, and THEN convert to RGB in the smaller space.
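To put a number on that suggestion, the sketch below compares how many bytes the resizer has to read in each case. It assumes 8-bit samples and 4:2:0 chroma subsampling (U and V at half resolution in both axes); the thread only says the U and V planes are smaller, so the subsampling scheme and both helper names are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: bytes in a packed 8-bit RGB image. */
static uint64_t rgb8_bytes(uint32_t w, uint32_t h)
{
    return (uint64_t)w * h * 3u;
}

/* Hypothetical helper: bytes in planar 8-bit YUV with 4:2:0 subsampling.
   Y is full resolution; U and V are each half width and half height,
   rounded up for odd dimensions. */
static uint64_t yuv420_8_bytes(uint32_t w, uint32_t h)
{
    uint64_t cw = ((uint64_t)w + 1u) / 2u;   /* chroma plane width  */
    uint64_t ch = ((uint64_t)h + 1u) / 2u;   /* chroma plane height */
    return (uint64_t)w * h + 2u * cw * ch;   /* Y + U + V */
}
```

For the 6048 x 8064 source, rgb8_bytes gives 146,313,216 bytes and yuv420_8_bytes gives 73,156,608 bytes, so under these assumptions the resizer reads half as much input, and the color conversion then runs only on the small output.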

3 participants