python - Issues while encoding and decoding Arabic text in the terminal
In my script I need to convert Arabic strings to vectors before computing cosine similarity, and I run it in a terminal under Linux. The problem is that when the Arabic strings are converted, the output shows the Arabic text as escape sequences:
[u'\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627', u'\u0627\u0644\u0633\u0645\u0627\u0621 \u0632\u0631\u0642\u0627\u0621']
My script:

    # -*- coding: utf-8 -*-
    import numpy as np
    import numpy.linalg as la
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    train_set = ["السماء زرقاء", "الشمس مشرقه نهارا"]  # documents
    test_set = ["الشمس التى فى السماء مشرقه", "السماء زرقاء"]  # query
    # renamed so the variable does not shadow the nltk stopwords module
    stop_words = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stop_words)
    transformer = TfidfTransformer()

    trainvectorizerarray = vectorizer.fit_transform(train_set).toarray()
    testvectorizerarray = vectorizer.transform(test_set).toarray()
    print 'fit vectorizer train set', trainvectorizerarray
    print 'transform vectorizer test set', testvectorizerarray

    # cosine similarity of two count vectors, rounded to 3 decimals
    cx = lambda a, b: round(np.inner(a, b) / (la.norm(a) * la.norm(b)), 3)

    for vector in trainvectorizerarray:
        print vector
        for testv in testvectorizerarray:
            print testv
            cosine = cx(vector, testv)
            print cosine
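For reference, the cx lambda in the script is the usual cosine similarity: the inner product of the two vectors divided by the product of their norms. Below is a minimal plain-Python sketch of the same formula, written without numpy so it runs on its own. The function name cosine_similarity and the choice to return 0.0 for an all-zero vector are assumptions for illustration, not part of the original script (the numpy version divides by zero in that case).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a vector with no counts is treated as similarity 0.0
        # instead of dividing by zero.
        return 0.0
    return round(dot / (norm_a * norm_b), 3)
```

A test sentence that shares no words with the training vocabulary produces an all-zero count vector, which is the case the guard above is meant to cover.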
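Your result is a list of strings. Printing a list shows the repr of each element, which is why the Arabic characters appear as \u escapes. Join the list into a single string and print that to get the readable sentences: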
>>> print "\n".join(a)
الشمس مشرقه نهارا
السماء زرقاء
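The difference between the escaped list and the joined string can be sketched as follows. This is a Python 3 illustration rather than the Python 2 session above: the list a is copied from the question's output, and ascii() is used here only to reproduce the escaped form, since Python 3 no longer escapes Arabic characters when it prints a list.

```python
# The two strings from the question's output, written with \u escapes.
a = [
    "\u0627\u0644\u0634\u0645\u0633 \u0645\u0634\u0631\u0642\u0647 \u0646\u0647\u0627\u0631\u0627",
    "\u0627\u0644\u0633\u0645\u0627\u0621 \u0632\u0631\u0642\u0627\u0621",
]

# ascii() escapes every non-ASCII character, which mimics what Python 2
# shows when a list of unicode strings is printed.
escaped = ascii(a)

# Joining produces one plain string; printing it writes the characters
# themselves rather than their escaped representation.
joined = "\n".join(a)

print(escaped)
```

Calling print(joined) on a terminal with a UTF-8 locale should display the two Arabic sentences on separate lines.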