hadoop - How to efficiently find top-k elements?
I have a big SequenceFile storing the tf-idf values of documents. Each line represents a document, and the columns are the tf-idf values of each term (each row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The naive solution is to loop through all the columns of each row in the mapper and pick the top-k, but as the file becomes bigger and bigger I don't think this is a good solution. Is there a better way to do this in Hadoop?
1. In every mapper, calculate a top-k (this is the local top-k for each map task).
2. Spawn a single reducer; the local top-k lists from all the mappers flow into that one reducer, and hence the global top-k is evaluated.
Think of the problem this way:
1. You have been given the results of x number of horse races.
2. You need to find the top N fastest horses.
Each race already gives you a local ranking, so to find the overall fastest horses you only need to compare each race's top N against each other, not every horse that ever ran. A sketch of this pattern follows.
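Below is a minimal sketch of that pattern in plain Hadoop MapReduce, assuming the sparse vectors have already been flattened into text lines of the form term<TAB>tfidf. The class names, the topk.k configuration key, and the text input format are assumptions for illustration, not from the question. Each mapper keeps a size-k min-heap and emits its local top-k in cleanup(); a single reducer merges those candidates into the global top-k.

    // Hypothetical sketch: local top-k per mapper, global top-k in one reducer.
    import java.io.IOException;
    import java.util.PriorityQueue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TopK {
      // Heap entry ordered by tf-idf; a min-heap of size k keeps the k largest
      // values seen so far by always evicting the current minimum.
      static class Entry implements Comparable<Entry> {
        final String term; final double tfidf;
        Entry(String term, double tfidf) { this.term = term; this.tfidf = tfidf; }
        public int compareTo(Entry o) { return Double.compare(tfidf, o.tfidf); }
      }

      public static class TopKMapper
          extends Mapper<LongWritable, Text, DoubleWritable, Text> {
        private final PriorityQueue<Entry> heap = new PriorityQueue<>();
        private int k;

        protected void setup(Context ctx) {
          k = ctx.getConfiguration().getInt("topk.k", 10);
        }

        protected void map(LongWritable key, Text value, Context ctx) {
          // Assumed input line format: term<TAB>tfidf
          String[] parts = value.toString().split("\t");
          heap.add(new Entry(parts[0], Double.parseDouble(parts[1])));
          if (heap.size() > k) heap.poll();   // drop the current minimum
        }

        // Emit only the local top-k once the whole input split has been seen.
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
          for (Entry e : heap) {
            ctx.write(new DoubleWritable(e.tfidf), new Text(e.term));
          }
        }
      }

      public static class TopKReducer
          extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
        private final PriorityQueue<Entry> heap = new PriorityQueue<>();
        private int k;

        protected void setup(Context ctx) {
          k = ctx.getConfiguration().getInt("topk.k", 10);
        }

        protected void reduce(DoubleWritable score, Iterable<Text> terms, Context ctx) {
          // Merge all mapper candidates into one global size-k heap.
          for (Text t : terms) {
            heap.add(new Entry(t.toString(), score.get()));
            if (heap.size() > k) heap.poll();
          }
        }

        protected void cleanup(Context ctx) throws IOException, InterruptedException {
          for (Entry e : heap) {
            ctx.write(new Text(e.term), new DoubleWritable(e.tfidf));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("topk.k", 10);
        Job job = Job.getInstance(conf, "global top-k");
        job.setJarByClass(TopK.class);
        job.setMapperClass(TopKMapper.class);
        job.setReducerClass(TopKReducer.class);
        job.setNumReduceTasks(1);             // single reducer => global top-k
        job.setMapOutputKeyClass(DoubleWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the number of reduce tasks to 1 is safe here because each mapper forwards at most k records, so the single reducer only sees (number of map tasks) × k candidates rather than the whole file.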