International Components for Unicode #2332

Open
Calorion opened this issue Aug 23, 2024 · 1 comment

Calorion commented Aug 23, 2024

Flavor Request

Please support ICU. This is the format supported natively by Apple devices, and is used in, e.g., Siri Shortcuts.

Calorion commented Sep 1, 2024

I see that this has been requested before.

Here are the differences from PCRE2 that I've run into:

Operators

No support for \K.

No support for conditionals.

Does support bounded quantifiers (such as ? and {2,5}) in lookbehind.

Does not support recursion (?R) (haven't run into this one, but Wikipedia lists it).
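For anyone porting a PCRE2 pattern that relies on \K, here is a rough sketch of the two workarounds I'd reach for. It is written against java.util.regex rather than ICU itself, on the assumption that both engines accept the same syntax for capture groups and bounded lookbehind; I have not verified the patterns in an ICU host such as Shortcuts. The pattern `price:\s*\K\d+`, the class name, and the bound of 5 are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepOutWorkaround {
    // Hypothetical PCRE2 original (uses \K, so not portable): price:\s*\K\d+

    // Option 1: capture the part that would follow \K and read group 1.
    static final Pattern WITH_GROUP = Pattern.compile("price:\\s*(\\d+)");

    // Option 2: move the prefix into a lookbehind. The quantifier has to be
    // bounded there, so \s* becomes \s{0,5} (the limit 5 is arbitrary).
    static final Pattern WITH_LOOKBEHIND = Pattern.compile("(?<=price:\\s{0,5})\\d+");

    static String viaGroup(String input) {
        Matcher m = WITH_GROUP.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    static String viaLookbehind(String input) {
        Matcher m = WITH_LOOKBEHIND.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "item: widget, price:  42 USD";
        System.out.println(viaGroup(line));      // prints 42
        System.out.println(viaLookbehind(line)); // prints 42
    }
}
```

The capture-group version is the closer match to \K because it keeps the unbounded \s*; the lookbehind version only helps where the whole match (not a group) has to be the result.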

Flags

These haven't caused issues for me, but they are differences.

Doesn't support the g flag, because there is no non-global mode. Ditto u.

Doesn't support the U, A, J, or D flags.

Supports w flag:

UREGEX_UWORD Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

Differences with Java Regular Expressions

  • ICU does not support UREGEX_CANON_EQ. See https://unicode-org.atlassian.net/browse/ICU-9111.
  • The behavior of \cx (Control-X) differs from Java when x is outside the range A-Z. See https://unicode-org.atlassian.net/browse/ICU-6068.
  • Java allows quantifiers (*, +, etc) on zero length tests. ICU does not. Occurrences of these in patterns are most likely unintended user errors, but it is an incompatibility with Java. See https://unicode-org.atlassian.net/browse/ICU-6080.
  • ICU recognizes all Unicode properties known to ICU, which is all of them. Java is restricted to just a few.
  • ICU case insensitive matching works with all Unicode characters, and, within string literals, does full Unicode matching (where matching strings may be different lengths.) Java does ASCII only by default, with Unicode aware case folding available as an option.
  • ICU has an extended syntax for set [bracket] expressions, including additional operators. Added for improved compatibility with the original ICU implementation, which was based on ICU UnicodeSet pattern syntax.
  • The property expression \p{punct} differs in what it matches. Java matches any of !"#$%&'()*+,-./:;<=>?@[]^_`{|}~. From that list, ICU omits $+<=>^`|~. ICU follows the recommendations from Unicode UTS-18, http://www.unicode.org/reports/tr18/#Compatibility_Properties. See also https://unicode-org.atlassian.net/browse/ICU-20095.
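To make the Java half of two of those bullets concrete, here is a small sketch using only java.util.regex. It shows Java's default behavior for \p{Punct} and for case-insensitive matching; it does not run ICU, so the ICU side is still only what the list above states. The class and method names are mine.

```java
import java.util.regex.Pattern;

final class JavaSideOfDifferences {
    // Java's default \p{Punct} is the ASCII list quoted above, so it includes '$'.
    static boolean matchesPunct(String s) {
        return Pattern.matches("\\p{Punct}", s);
    }

    // Case-insensitive match of a literal, optionally with Unicode-aware folding.
    static boolean matchesIgnoringCase(String literal, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE | (unicodeCase ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(Pattern.quote(literal), flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPunct("$"));                                // true
        System.out.println(matchesIgnoringCase("a", "A", false));             // true
        // U+00E9 is e-acute, U+00C9 is its uppercase form.
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", false));   // false
        System.out.println(matchesIgnoringCase("\u00e9", "\u00c9", true));    // true
    }
}
```

Per the list above, ICU would be expected to behave like the UNICODE_CASE row without any extra flag, and to reject '$' for \p{punct}.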
