Unify the Parquet dictionary value decoders #23612

yingsu00 · 2024-09-10T04:21:36Z

There are many dictionary ID value decoders in the Parquet batch reader. They usually allocate a buffer in every readNext call, which is bad for both reliability and performance. There is no need to create separate decoders that add unnecessary memory allocations and memory copies. It would be nice to send a new PR to unify the existing RLE dictionary decoders. After all, dictionary IDs can only be RLE/BP encoded, and that encoding does not depend on the data column type.

Ref: https://parquet.apache.org/docs/file-format/data-pages/encodings/
"Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width)."

See #23584
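
To make the idea concrete, here is a minimal sketch of a single type-independent decoder for the page layout quoted above (a 1-byte bit width followed by RLE/bit-packed hybrid runs). The class name `DictionaryIdDecoder` and its `readNext(int[], int, int)` signature are hypothetical and are not the existing reader API. The sketch writes IDs into a caller-supplied array so that no buffer is allocated per call, and it unpacks bit by bit for clarity rather than speed.

```java
import java.util.Arrays;

// Hypothetical sketch, not the existing Presto classes: one decoder for
// dictionary IDs regardless of the column's data type.
final class DictionaryIdDecoder {
    private final byte[] data;
    private final int bitWidth;
    private int pos;         // byte position of the next run header
    private int rleLeft;     // values remaining in the current RLE run
    private int rleValue;    // repeated value of the current RLE run
    private int packedLeft;  // values remaining in the current bit-packed run
    private long bitOffset;  // bit position of the next bit-packed value

    DictionaryIdDecoder(byte[] page) {
        this.data = page;
        this.bitWidth = page[0] & 0xFF; // first byte of the page is the bit width
        if (bitWidth > 32) {
            throw new IllegalArgumentException("bit width > 32: " + bitWidth);
        }
        this.pos = 1;
    }

    // Decodes `length` IDs into the caller-owned array; nothing is allocated here.
    void readNext(int[] ids, int offset, int length) {
        int end = offset + length;
        while (offset < end) {
            if (rleLeft == 0 && packedLeft == 0) {
                readRunHeader();
            }
            if (rleLeft > 0) {
                int n = Math.min(rleLeft, end - offset);
                Arrays.fill(ids, offset, offset + n, rleValue);
                rleLeft -= n;
                offset += n;
            } else {
                int n = Math.min(packedLeft, end - offset);
                for (int i = 0; i < n; i++) {
                    ids[offset + i] = readPackedValue();
                }
                packedLeft -= n;
                offset += n;
            }
        }
    }

    private void readRunHeader() {
        // The run header is a ULEB128 varint; its lowest bit selects the run type.
        int header = 0;
        int shift = 0;
        int b;
        do {
            b = data[pos++] & 0xFF;
            header |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        if ((header & 1) == 1) {
            // Bit-packed run: (header >> 1) groups of 8 values, bitWidth bytes per group.
            int groups = header >>> 1;
            packedLeft = groups * 8;
            bitOffset = (long) pos * 8;
            pos += groups * bitWidth; // the next header starts after the packed bytes
        } else {
            // RLE run: repeat count, then the value in ceil(bitWidth / 8) little-endian bytes.
            rleLeft = header >>> 1;
            rleValue = 0;
            int valueBytes = (bitWidth + 7) / 8;
            for (int i = 0; i < valueBytes; i++) {
                rleValue |= (data[pos++] & 0xFF) << (8 * i);
            }
        }
    }

    private int readPackedValue() {
        // Values are packed starting from the least significant bit of each byte.
        long value = 0;
        for (int bit = 0; bit < bitWidth; bit++, bitOffset++) {
            int byteIndex = (int) (bitOffset >>> 3);
            long oneBit = (data[byteIndex] >>> (int) (bitOffset & 7)) & 1;
            value |= oneBit << bit;
        }
        return (int) value;
    }
}
```

A column reader for any type could hold one such decoder and a reusable `int[]`, call `readNext(ids, 0, batchSize)`, and then look the IDs up in its own typed dictionary. Run state is kept across calls, so a batch boundary may fall in the middle of a run.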

yingsu00 · 2024-09-10T04:21:55Z

cc @ethanyzhang

yingsu00 added the feature request and beginner-task labels on Sep 10, 2024
