hadoop - How to efficiently find top-k elements?


I have a big sequence file storing the tf-idf values of documents. Each line represents a document, and the columns are the tf-idf values of each term (the row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The naive solution is to loop through all the columns of each row in the mapper and pick the top-k, but as the file becomes bigger and bigger I don't think this is a good solution. Is there a better way to do this in Hadoop?

 1. In every map, calculate the top k (this is the local top k for that map).
 2. Spawn a single reducer. The local top k from all the mappers flows to this reducer, and hence the global top k is evaluated.
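The two steps above can be sketched outside Hadoop as plain Python. This is only an illustration of the pattern, not a Hadoop job: the function names `local_top_k` and `global_top_k`, the sample splits and the `(score, term)` pair layout are all assumptions made for the example. Each "mapper" keeps a min-heap of at most k pairs, so its memory use stays at O(k) however large its split is.

```python
import heapq
from typing import Iterable, List, Tuple

Pair = Tuple[float, str]  # (score, term); layout assumed for this sketch


def local_top_k(records: Iterable[Pair], k: int) -> List[Pair]:
    """Mapper side: keep only the k largest pairs seen in one split."""
    heap: List[Pair] = []  # min-heap; heap[0] is the smallest kept pair
    for pair in records:
        if len(heap) < k:
            heapq.heappush(heap, pair)
        elif pair > heap[0]:
            # New pair beats the weakest kept pair, so swap it in.
            heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True)


def global_top_k(partials: Iterable[List[Pair]], k: int) -> List[Pair]:
    """Single reducer: merge the per-mapper lists into the global top k."""
    merged = (pair for part in partials for pair in part)
    return heapq.nlargest(k, merged)


# Hypothetical input: three splits, one per mapper.
splits = [
    [(0.12, "apple"), (0.80, "hadoop"), (0.33, "map")],
    [(0.95, "tfidf"), (0.05, "the"), (0.41, "reduce")],
    [(0.60, "vector"), (0.59, "sparse")],
]
K = 2
partials = [local_top_k(split, K) for split in splits]
result = global_top_k(partials, K)
print(result)
```

The single reducer only ever sees k pairs per mapper, which is what keeps it from becoming a bottleneck. In a real Hadoop job the local list would typically be emitted once per mapper, for example from the mapper's `cleanup()` method, with the job configured to use one reduce task.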

Think of the problem like this:

 1. You have been given the results of x number of horse races.
 2. You need to find the top n fastest horses.
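To make the analogy concrete, here is a small worked example with made-up horses and finishing times (every name and number below is hypothetical). It takes the n fastest horses from each race, merges those short lists, and compares the outcome with a search over all horses at once. The two agree because a horse that is not among the n fastest in its own race cannot be among the n fastest overall.

```python
import heapq

# Hypothetical finishing times in seconds, one dict per race.
races = [
    {"Comet": 71.2, "Blaze": 69.8, "Dusty": 74.0},
    {"Arrow": 68.5, "Echo": 72.3, "Flint": 70.1},
    {"Gale": 69.9, "Hawk": 75.4},
]
N = 2


def fastest(results, n):
    """Return the n (horse, time) pairs with the lowest times."""
    return heapq.nsmallest(n, results, key=lambda pair: pair[1])


# Step 1: the n fastest horses of each race (the local top n).
per_race = [fastest(race.items(), N) for race in races]

# Step 2: merge the short lists and pick the n fastest overall.
overall = fastest((pair for best in per_race for pair in best), N)

# Check against looking at every horse from every race directly.
brute_force = fastest((pair for race in races for pair in race.items()), N)

print(overall)
print(overall == brute_force)
```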
