python - Normalize unicode does not work as expected
I am facing problems with different Unicode representations of special characters, such as characters with accents or diaereses. I wrote a Python script that parses multiple database dumps and compares values between them. The problem is that these special characters are stored differently in different files: in some files they are composed, in others decomposed. Since I always want the string extracted from the dump to be in the composed representation, I tried adding the following:

import unicodedata

value = unicodedata.normalize("NFC", value)
However, this only solves the problem in some cases. For example, the umlauts work as expected. Nevertheless, some characters such as ë remain in the decomposed form (e͏̈).

I figured out that there is a COMBINING GRAPHEME JOINER character (U+034F) between the e and the diaeresis character. Is this normal, or is it the cause of the problem?

Does anyone know how to handle this issue?
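A minimal sketch reproducing the behavior described above (the exact code points are assumptions based on the description):

import unicodedata

# A plain decomposed umlaut composes under NFC as expected.
plain = "u\u0308"                            # u + COMBINING DIAERESIS
print(unicodedata.normalize("NFC", plain))   # -> 'ü' (single code point U+00FC)

# With a COMBINING GRAPHEME JOINER between base and mark, composition is blocked.
joined = "e\u034f\u0308"                     # e + CGJ + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", joined)
print([unicodedata.name(c) for c in composed])
# -> ['LATIN SMALL LETTER E', 'COMBINING GRAPHEME JOINER', 'COMBINING DIAERESIS']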
The purpose of U+034F
The COMBINING GRAPHEME JOINER exists to ensure that certain sequences remain distinct under searching, sorting, and normalization. It is required for the correct handling of some characters and combining marks used in various languages by the Unicode algorithms. From Section 23.2 of the Unicode Standard (page 805):

U+034F COMBINING GRAPHEME JOINER (CGJ) is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching. It is also used to distinguish sequences that would otherwise be canonically equivalent.
...
This, in turn, means that the insertion of a COMBINING GRAPHEME JOINER between two combining marks will prevent normalization from switching the positions of those two combining marks, regardless of their own combining classes.
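A short demonstration of this point (the choice of marks is an assumption for illustration; COMBINING DOT BELOW has combining class 220, COMBINING DIAERESIS has 230):

import unicodedata

# Without a CGJ, canonical ordering swaps the two marks under NFD.
marks = "a\u0308\u0323"          # a + DIAERESIS (ccc 230) + DOT BELOW (ccc 220)
print([hex(ord(c)) for c in unicodedata.normalize("NFD", marks)])
# -> ['0x61', '0x323', '0x308']  (the lower combining class now comes first)

# With a CGJ (ccc 0) between them, the original order is preserved.
blocked = "a\u0308\u034f\u0323"
print([hex(ord(c)) for c in unicodedata.normalize("NFD", blocked)])
# -> ['0x61', '0x308', '0x34f', '0x323']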
In general, you should not remove a CGJ without specific knowledge of why it was inserted in the first place.
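That said, if you do know the CGJ carries no meaning in your data (for example, it was introduced by a buggy export), here is a sketch of how you might strip it before normalizing (normalize_ignoring_cgj is a hypothetical helper, not a library function):

import unicodedata

def normalize_ignoring_cgj(value: str) -> str:
    # Hypothetical helper: only safe if the CGJ is known to be spurious.
    return unicodedata.normalize("NFC", value.replace("\u034f", ""))

print(normalize_ignoring_cgj("e\u034f\u0308"))   # -> 'ë' (U+00EB)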