Python - Issues while encoding and decoding Arabic text in the terminal


In my script I compute cosine similarity, so I first need to convert the Arabic strings to vectors. The problem: when I run the script in a terminal under Linux, the Arabic strings are displayed as escape sequences:

[u'\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627', u'\u0627\u0644\u0633\u0645\u0627\u0621 \u0632\u0631\u0642\u0627\u0621'] 

My script:

    # -*- coding: utf-8 -*-
    import numpy as np
    import numpy.linalg as la
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"]  # documents
    test_set = ["الشمس التى فى السماء مشرقه", "السماء زرقاء"]  # query
    stopwords = set(stopwords.words('english'))

    vectorizer = CountVectorizer(stop_words=stopwords)
    transformer = TfidfTransformer()

    trainvectorizerarray = vectorizer.fit_transform(train_set).toarray()
    testvectorizerarray = vectorizer.transform(test_set).toarray()
    print 'fit vectorizer train set', trainvectorizerarray
    print 'transform vectorizer test set', testvectorizerarray

    # cosine similarity of two vectors, rounded to 3 decimals
    cx = lambda a, b: round(np.inner(a, b) / (la.norm(a) * la.norm(b)), 3)

    for vector in trainvectorizerarray:
        print vector
        for testv in testvectorizerarray:
            print testv
            cosine = cx(vector, testv)
            print cosine
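For readers who want to check the numbers without installing nltk or scikit-learn, here is a rough standard-library sketch of the same idea, written in Python 3 syntax (an assumption, since the script above is Python 2). The helpers `to_vector` and `cosine` are hypothetical names, and splitting on whitespace is a simplification that does not reproduce CountVectorizer's tokenisation or stop-word handling.

```python
import math
from collections import Counter

train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"]          # documents
test_set = ["الشمس التى فى السماء مشرقه", "السماء زرقاء"]  # queries

# Build the vocabulary from the training documents only,
# so query words outside it are ignored.
vocabulary = sorted({word for doc in train_set for word in doc.split()})


def to_vector(text):
    """Hypothetical helper: word counts ordered by the vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    """Hypothetical helper: cosine similarity rounded to 3 decimals."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return round(dot / norm, 3) if norm else 0.0


# One row per training document, one column per query.
scores = [[cosine(to_vector(doc), to_vector(query)) for query in test_set]
          for doc in train_set]

if __name__ == "__main__":
    for row in scores:
        print(row)
```

Under these assumptions the second query matches the first document exactly (similarity 1.0) and shares no words with the second document (0.0). The vectors hold plain integer counts, so the Arabic text never has to be printed to compute the similarity.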

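Your result is a list of strings. Printing a list shows the repr of each element, which in Python 2 escapes non-ASCII characters. Join the list (here called a) into a single string and print that to get a clear sentence: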

    >>> print "\n".join(a)
    الشمس مشرقه نهارا
    السماء زرقاء
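The escaped list and the readable text are the same strings; only the way they are displayed differs. Here is a small sketch of that, in Python 3 syntax (an assumption, since the session above is Python 2). Python 3 no longer escapes Arabic when printing a list, so the built-in `ascii()` is used to reproduce the escaped view. The names `docs`, `escaped_view` and `readable` are only illustrative.

```python
# The list from the question, typed with the same escape sequences.
docs = [
    '\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627',
    '\u0627\u0644\u0633\u0645\u0627\u0621 \u0632\u0631\u0642\u0627\u0621',
]

# ascii() escapes every non-ASCII character, like the Python 2 list output.
escaped_view = ascii(docs)

# Joining gives one ordinary string; printing it shows the Arabic text,
# provided the terminal encoding supports it (for example UTF-8).
readable = "\n".join(docs)

if __name__ == "__main__":
    print(escaped_view)
    print(readable)
```

Nothing is decoded or converted here: the join only changes what gets printed, from a list's repr to the string itself.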
