regex - Test for filtering illegal characters from a string


I need to filter illegal Unicode characters out of a string, as outlined in the guide on preparing data for Amazon CloudSearch.

Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors. (For more information, see Extensible Markup Language (XML) 1.0 (Fifth Edition).) You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]/ .
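As a minimal sketch of how the quoted expression could be used to remove matches in JavaScript: the helper name stripInvalidChars is hypothetical, and the global flag is an addition so that every match is removed rather than only the first.

```javascript
// Regular expression quoted from the CloudSearch guide, with the `g` flag
// added (an assumption of this sketch) so that replace() removes all matches.
const INVALID_CHARS = /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]/g;

// Hypothetical helper: returns `input` with every matching character removed.
function stripInvalidChars(input) {
  return input.replace(INVALID_CHARS, "");
}

// NUL and U+FFFE are removed; tab, CR, and LF are kept.
console.log(JSON.stringify(stripInvalidChars("a\u0000b\ufffec\t\r\n")));
```

Without the `u` flag, a JavaScript regular expression works on UTF-16 code units, so this pattern also matches both halves of a surrogate pair.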

I am trying to write a test with success and failure cases, but I am having trouble writing Unicode characters in the prohibited range.

Edit 2: JavaScript is the language I am trying to write the tests in.

Edit 1: Link to the Amazon CloudSearch documentation: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html

In JavaScript you can use Unicode escape sequences to produce the invalid characters in strings, like so: "\ufffe", "\uffff", "\ud800", and so on. Beware, though: "\ud83c\udf4c" is a JavaScript string that represents "🍌", the banana character, Unicode code point 1F34C. What the Amazon API forbids is lone surrogates directly encoded in UTF-8. The banana character (1F34C) encoded in UTF-8 is valid (as the bytes F0 9F 8D 8C), and therefore the surrogate pair is valid. What is invalid is the UTF-8 encoding of D83C itself, i.e., the bytes ED A0 BC.
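The success and failure cases could be laid out roughly as follows. This is only a sketch: the names invalidCodeUnit, invalidCodePoint, cases, and failingCases are made up for illustration, and the second pattern is an assumed variant that is not taken from the Amazon guide. It adds the `u` flag and the range 10000–10FFFF (which XML 1.0 also allows) so that a well-formed surrogate pair is accepted while a lone surrogate is still rejected.

```javascript
// Pattern quoted from the CloudSearch guide. Without the `u` flag it is
// applied to UTF-16 code units, so both halves of a surrogate pair match.
const invalidCodeUnit = /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]/;

// Assumed variant (not from the guide): the `u` flag makes the pattern work
// on whole code points, and the added range U+10000-U+10FFFF accepts
// supplementary characters such as U+1F34C.
const invalidCodePoint =
  /[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\u{10000}-\u{10ffff}]/u;

// Each case: [description, input string, whether it should be rejected].
const cases = [
  ["tab, CR, LF and plain text", "a\tb\r\nc", false],
  ["banana as a surrogate pair", "\ud83c\udf4c", false],
  ["NUL control character", "\u0000", true],
  ["U+FFFE", "\ufffe", true],
  ["U+FFFF", "\uffff", true],
  ["lone high surrogate", "\ud83c", true],
  ["lone low surrogate", "\udf4c", true],
  ["low surrogate before high", "\udf4c\ud83c", true],
];

// Hypothetical helper: returns the descriptions of the cases that `pattern`
// classifies differently from the expectation.
function failingCases(pattern, testCases) {
  return testCases
    .filter(([, input, rejected]) => pattern.test(input) !== rejected)
    .map(([description]) => description);
}

console.log(failingCases(invalidCodePoint, cases)); // []
console.log(failingCases(invalidCodeUnit, cases)); // [ 'banana as a surrogate pair' ]
```

The second call shows the caveat in practice: the pattern as quoted flags the banana character, because without the `u` flag it sees the two surrogate code units separately.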

