python - pandas: how to sort the results of a groupby using a pd.cut categorical variable


I have a data frame that is the output of a groupby using a categorical variable created with pd.cut.

    import pandas as pd
    import numpy as np

    di = pd.DataFrame({'earnings': np.random.choice(10000, 10000), 'counts': [1] * 10000})
    # Bracket edges: 0, 500, ..., 5000, plus one open-ended top bracket.
    brackets = np.append(np.arange(0, 5001, 500), 100000000)
    di['earncat'] = pd.cut(di['earnings'], brackets, right=False, retbins=True)[0]

    di_everyone = di.groupby('earncat').sum()[['counts']]
    di_everyone.sort_index(inplace=True)
    print(di_everyone.to_string())

And the output is:

    [0, 500)          83,005,823
    [1000, 1500)      11,995,255
    [1500, 2000)      13,943,052
    [2000, 2500)      11,967,696
    [2500, 3000)      10,741,178
    [3000, 3500)       9,749,914
    [3500, 4000)       6,833,928
    [4000, 4500)       7,150,125
    [4500, 5000)       4,655,773
    [500, 1000)        9,718,753
    [5000, 100000000) 26,588,622

I'm not sure why [500, 1000) appears on the second-to-last line. I decided not to label earncat because I want to see the breakdown. How can I sort on earncat?

Thanks in advance.

You are using pandas 0.15.x, which does not support this kind of operation on categorical dtypes (which the pd.cut function produces).
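The order in the question's output is consistent with the bracket labels being compared as plain strings rather than as ordered categories. A minimal sketch in plain Python, using a few labels copied from that output (no pandas involved, so it only illustrates the string comparison, not what pandas does internally):

```python
# A subset of the bracket labels from the question's output.
labels = ['[0, 500)', '[500, 1000)', '[1000, 1500)', '[1500, 2000)',
          '[4500, 5000)', '[5000, 100000000)']

# Strings are compared character by character, so '[500, ...' lands
# after '[4500, ...' ('4' < '5') and before '[5000, ...' (',' < '0').
string_sorted = sorted(labels)
print(string_sorted)
```

Here '[500, 1000)' ends up second to last, just as in the question, which suggests the fix is to sort on numeric bounds rather than on the label text.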

In the meantime, you can work around the problem like this:

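    # Parse the numeric bounds out of string labels such as '[500, 1000)'.
    di['earnlower'] = di['earncat'].apply(lambda x: int(x[1:].split(',')[0]))
    # Strip only the closing ')' so the upper bound keeps its last digit.
    di['earnhigher'] = di['earncat'].apply(lambda x: int(x[:-1].split(',')[1]))

    # Grouping on the integer bounds gives a numerically sorted index.
    di_everyone = di.groupby(['earnlower', 'earnhigher']).sum()[['counts']]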
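For comparison, here is a sketch of the same breakdown under the assumption that a recent pandas release is installed, where pd.cut returns an ordered categorical of interval objects and groupby keeps the brackets in bin order. The seeded generator and the `lower_edges` helper list are illustrative additions, not part of the original question or answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
di = pd.DataFrame({'earnings': rng.integers(0, 10000, size=10000),
                   'counts': 1})

brackets = np.append(np.arange(0, 5001, 500), 100000000)
di['earncat'] = pd.cut(di['earnings'], brackets, right=False)

# observed=False keeps every bracket in the result, even an empty one.
di_everyone = di.groupby('earncat', observed=False)[['counts']].sum()

# Each index entry is an Interval; collect the left edges to inspect the order.
lower_edges = [interval.left for interval in di_everyone.index]
print(di_everyone.to_string())
```

With this setup the [500, 1000) bracket should appear directly after [0, 500), so no string parsing is needed.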
