regex - Test for filtering illegal characters from a string
I need to filter out illegal Unicode characters from a string, as outlined in the guide to preparing your data for Amazon CloudSearch.
Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors. (For more information, see Extensible Markup Language (XML) 1.0 (Fifth Edition).) You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/ .
I am trying to write a test for the success and failure cases, but I am having trouble writing Unicode characters in the prohibited range.
Edit 2: JavaScript is the language I am trying to write the tests in.
Edit 1: Link to the Amazon CloudSearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
In JavaScript you can use Unicode escape sequences to produce the invalid characters in strings, like so: "\ufffe", "\uffff", "\ud800", and so on.
Beware, though: "\ud83c\udf4c" is a JavaScript string that represents "🍌", the banana character, Unicode code point U+1F34C. The Amazon API only forbids lone surrogates directly encoded in UTF-8. The banana character (U+1F34C) encoded as UTF-8 is valid (as the bytes F0 9F 8D 8C), and therefore the surrogate pair is valid. What is invalid is the UTF-8 encoding of D83C by itself, i.e., the bytes ED A0 BC.
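A minimal sketch of how such test inputs might be set up. The first pattern is the regular expression quoted from the guide, with a `g` flag added so that every match is removed. Because a JavaScript regex without the `u` flag works on UTF-16 code units, that pattern also matches both halves of a valid pair such as "\ud83c\udf4c". The second pattern is an assumed variant, not taken from the Amazon documentation: it uses the `u` flag and additionally allows U+10000 to U+10FFFF, so paired surrogates are kept and only lone ones are removed. The helper and sample names are hypothetical.

```javascript
// Regular expression quoted from the CloudSearch guide. The `g` flag is an
// addition here so that replace() strips every match rather than only the first.
const INVALID_UNITS = /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]/g;

// Assumed variant: with the `u` flag the pattern works on code points, so a
// valid surrogate pair is read as one character in U+10000..U+10FFFF and kept,
// while a lone surrogate still matches.
const INVALID_CODE_POINTS =
  /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\u{10000}-\u{10ffff}]/gu;

// Hypothetical helper: strips everything the guide's regex rejects.
function stripWithGuideRegex(text) {
  return text.replace(INVALID_UNITS, "");
}

// Hypothetical helper: strips invalid code points but keeps valid pairs.
function stripInvalidCodePoints(text) {
  return text.replace(INVALID_CODE_POINTS, "");
}

// Failure cases: each string consists of one prohibited character.
const invalidSamples = ["\u0000", "\u000b", "\ufffe", "\uffff", "\ud800", "\udfff"];

// Success cases: legal control characters, range boundaries, ordinary text.
const validSamples = ["\u0009", "\u000a", "\u000d", "\ud7ff", "\ue000", "\ufffd", "plain text"];

// The banana character as a surrogate pair, and its high surrogate on its own.
const banana = "\ud83c\udf4c";
const loneHighSurrogate = "\ud83c";
```

A test could then assert that each entry of `invalidSamples` is removed by both helpers, that each entry of `validSamples` passes through unchanged, and that `banana` survives only the code-point variant while `loneHighSurrogate` is removed by both.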