Python: Equal-Depth Binning, Equal-Width Binning, and a Data Analysis That Combines the Two



Data analysis, Python, equal-depth binning, equal-width binning

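In Python, samples can be binned with pandas' qcut (equal-depth binning: each bin holds roughly the same number of samples) and cut (equal-width binning: each bin spans the same range of values). See the code section below for details. The data in this article comes from the Internet, and part of the code is adapted from other sources; the comments and extensions were added here for the purpose of technical exchange. If anything here infringes on your rights, please contact the blogger so it can be dealt with promptly.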
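As a quick illustration of the difference, here is a minimal sketch that bins the same eight ages used later in the article's boxsplit() function, once by equal width and once by equal depth. The variable names are only for this example.

```python
import pandas as pd

# The same ages that boxsplit() uses further down.
ages = pd.Series([18, 19, 23, 25, 27, 29, 34, 45], name="age")

# Equal-width: three intervals that each span the same range of values.
equal_width = pd.cut(ages, 3)
# Equal-depth: three intervals that each hold roughly the same number of samples.
equal_depth = pd.qcut(ages, 3)

# Count the samples per bin, listed in bin order.
width_counts = equal_width.value_counts().sort_index()
depth_counts = equal_depth.value_counts().sort_index()
print(width_counts)
print(depth_counts)
```

With this data the equal-width bins hold 5, 2 and 1 samples, while the equal-depth bins hold 3, 2 and 3.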



# coding: utf-8
import datetime
import time

import pandas as pd
from sklearn import preprocessing


def RFM():
    trad_flow = pd.read_csv(r'../input/RFM_TRAD_FLOW.csv', encoding='GBK')
    trad_flow_new = trad_flow.copy()
    trad_flow_new['time_format'] = trad_flow_new.time.apply(timeFormat)

    pd.set_option('display.precision', 2)              # decimal places shown
    pd.set_option('display.max_columns', 18)           # maximum number of columns shown
    pd.set_option('display.expand_frame_repr', False)  # do not wrap wide frames
    pd.set_option('display.width', 200)                # maximum characters per line

    # F: count the transactions (transID) per customer (cumid) and type.
    F = trad_flow_new.groupby(['cumid', 'type'])[['transID']].count()
    # Pivot F so that every type becomes a column and every cumid a row.
    F_trans = pd.pivot_table(F, index='cumid', columns='type', values='transID')
    # A customer without transactions of some type gets NaN there; use 0 instead.
    # To fill a single column only:
    #   F_trans['Special_offer'] = F_trans['Special_offer'].fillna(0)
    # To count the rows with a missing value (shape[0] is rows, shape[1] is columns):
    #   print(F_trans[F_trans.returned_goods.isnull()].shape[0])
    F_trans = F_trans.fillna(0)
    # Interest ratio = special offer / (special offer + normal).
    F_trans['interest'] = F_trans.Special_offer / (F_trans.Special_offer + F_trans.Normal)
    # To inspect a single customer:
    #   print(trad_flow_new[trad_flow_new['cumid'] == 19021])

    # M: customer value, the summed amount per customer and type.
    M = trad_flow_new.groupby(['cumid', 'type'])[['amount']].sum()
    M_trans = pd.pivot_table(M, index='cumid', columns='type', values='amount')
    M_trans = M_trans.fillna(0)
    M_trans['value'] = M_trans.Normal + M_trans.Special_offer + M_trans.returned_goods

    # R: the most recent transaction time per customer, as a number.
    trad_flow_new['time_new'] = trad_flow_new.time.apply(to_time)
    R = trad_flow_new.groupby(['cumid'])[['time_new']].max()

    # Equal-depth binning (q = quantile) into two bins; with retbins=True the
    # second return value holds the bin edges, and edge [1] is the split point.
    threshold = pd.qcut(F_trans['interest'], 2, retbins=True)[1][1]
    print('F value binary-split right boundary:\t' + str(threshold))
    # Binarize F: values above the threshold become 1, all others 0.
    binarizer = preprocessing.Binarizer(threshold=threshold)
    # Extract the column as a 2-D ndarray; -1 lets NumPy infer the row count.
    single_row = F_trans['interest'].values.reshape(-1, 1)
    b_f_interest = pd.DataFrame(binarizer.transform(single_row))
    b_f_interest.index = F_trans.index
    b_f_interest.columns = ['interest']

    # Binarize M.
    threshold = pd.qcut(M_trans['value'], 2, retbins=True)[1][1]
    print('M value binary-split right boundary:\t' + str(threshold))
    binarizer = preprocessing.Binarizer(threshold=threshold)
    single_row = M_trans['value'].values.reshape(-1, 1)
    b_m_value = pd.DataFrame(binarizer.transform(single_row))
    b_m_value.index = M_trans.index
    b_m_value.columns = ['value']

    # Binarize R.
    threshold = pd.qcut(R['time_new'], 2, retbins=True)[1][1]
    print('R value binary-split right boundary:\t' + str(threshold))
    binarizer = preprocessing.Binarizer(threshold=threshold)
    single_row = R.time_new.values.reshape(-1, 1)
    b_r_time = pd.DataFrame(binarizer.transform(single_row))
    b_r_time.index = R.index
    b_r_time.columns = ['time']

    total = pd.concat([b_f_interest, b_m_value, b_r_time], axis=1)

    # Customer labels for the 2 * 2 * 2 = 8 combinations of (F, M, R).
    label = {
        (0, 0, 0): 'no interest-low value-silent',
        (1, 0, 0): 'interested-low value-silent',
        (1, 0, 1): 'interested-low value-active',
        (0, 0, 1): 'no interest-low value-active',
        (0, 1, 0): 'no interest-high value-silent',
        (1, 1, 0): 'interested-high value-silent',
        (1, 1, 1): 'interested-high value-active',
        (0, 1, 1): 'no interest-high value-active',
    }
    total['label'] = total[['interest', 'value', 'time']].apply(
        lambda x: label[tuple(int(v) for v in x)], axis=1)
    print(total.head())


def timeFormat(Str):
    return datetime.datetime.strptime(Str, '%d%b%y:%H:%M:%S')


def to_time(t):
    # Convert to a number (seconds since the epoch, which reads
    # 1970-01-01 08:00:00 in local time on the author's machine) so that
    # qcut can bin the values later.
    out_t = time.mktime(time.strptime(t, '%d%b%y:%H:%M:%S'))
    return out_t


def boxsplit():
    F_x = pd.DataFrame(data={'age': [18, 19, 23, 25, 27, 29, 34, 45]})
    # Method 1: equal-width binning. The width is w = (Max - Min) / N =
    # (45 - 18) / 3 = 9, so the right boundary of the i-th bin is Min + i * w:
    # 18 + 9 * 1 = 27, 18 + 9 * 2 = 36 and 18 + 9 * 3 = 45.
    print(pd.cut(F_x['age'], 3).value_counts())
    # Method 2: equal-depth (equal-frequency) binning. First compute the
    # critical value at each quantile, then split the data at those values.
    print(F_x.age.quantile([0, 1 / 3, 2 / 3, 1]))
    print(pd.qcut(F_x['age'], 3).value_counts())
    # The output shows that equal-depth bins hold roughly the same number of
    # samples each, while the counts in equal-width bins can differ a lot.
    # Chi-square merging: to be added.


def baseTime():
    t = (1970, 1, 1, 8, 0, 0, 3, 1, 0)
    secs = time.mktime(t)
    print('time.mktime(t) : %f' % secs)
    print('asctime(localtime(secs)): %s' % time.asctime(time.localtime(secs)))


if __name__ == '__main__':
    # print(timeFormat('14MAY10:13:31:23'))
    RFM()
    # print(to_time('14JUN09:17:58:34'))
    boxsplit()
    baseTime()
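The binarize-and-label step in RFM() can also be written with pandas alone, since sklearn's Binarizer simply maps values greater than the threshold to 1 and everything else to 0. The sketch below is an alternative illustration, not the article's code: the `scores` table is made-up data standing in for the columns derived from RFM_TRAD_FLOW.csv, and `median_split` and `names` are hypothetical helpers introduced only here.

```python
import pandas as pd

# Hypothetical per-customer scores (made-up numbers, not the article's data).
scores = pd.DataFrame(
    {
        "interest": [0.50, 0.00, 0.05, 0.20],
        "value": [5000.0, 800.0, 3100.0, 1200.0],
        "time": [1.30e9, 1.20e9, 1.25e9, 1.29e9],
    },
    index=pd.Index([1, 2, 3, 4], name="cumid"),
)


def median_split(column: pd.Series) -> pd.Series:
    """Flag values above the edge that separates two equal-depth bins."""
    # With retbins=True, qcut also returns the bin edges; edge [1] is the split.
    threshold = pd.qcut(column, 2, retbins=True)[1][1]
    return (column > threshold).astype(int)


# Apply the split to every column, giving a table of 0/1 flags.
flags = scores.apply(median_split)

# Wording for flag 0 and flag 1 in each dimension.
names = {
    "interest": ("no interest", "interested"),
    "value": ("low value", "high value"),
    "time": ("silent", "active"),
}
# Build one label per customer by joining the wording of its three flags.
flags["label"] = [
    "-".join(names[col][int(flag)] for col, flag in row.items())
    for _, row in flags.iterrows()
]
print(flags)
```

Building the label from three small tuples avoids writing out all eight combinations by hand, at the cost of fixing the wording pattern to "interest-value-activity".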

Result:

"F:\Program Files\Python37\python.exe" E:/DevData/GiteePython/com/shenl/ml/RFM/RFM.py
F value binary-split right boundary:    0.08333333333333333
M value binary-split right boundary:    2944.5
R value binary-split right boundary:    1284373750.0
       interest  value  time                          label
cumid
10001       1.0    1.0   1.0   interested-high value-active
10002       0.0    0.0   0.0   no interest-low value-silent
10003       0.0    1.0   0.0  no interest-high value-silent
10004       1.0    1.0   0.0   interested-high value-silent
10005       0.0    0.0   0.0   no interest-low value-silent
(17.973, 27.0]    5
(27.0, 36.0]      2
(36.0, 45.0]      1
Name: age, dtype: int64
0.00    18.00
0.33    23.67
0.67    28.33
1.00    45.00
Name: age, dtype: float64
(28.333, 45.0]      3
(17.999, 23.667]    3
(23.667, 28.333]    2
Name: age, dtype: int64
time.mktime(t) : 0.000000
asctime(localtime(secs)): Thu Jan  1 08:00:00 1970

Process finished with exit code 0
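The bin boundaries in the output can be reproduced by hand. The following is a rough sketch of the arithmetic, checked against the edges that pd.cut and pd.qcut return with retbins=True. It assumes the pandas behaviour of lowering the first equal-width edge by 0.1% of the range so that the minimum falls inside the first right-closed interval, which is where 17.973 comes from. The variable names are only for this example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 19, 23, 25, 27, 29, 34, 45], name="age")
n_bins = 3

# Equal-width: split [min, max] into n_bins intervals of width (max - min) / n_bins.
lo, hi = ages.min(), ages.max()
width_edges = np.linspace(lo, hi, n_bins + 1)  # 18, 27, 36, 45
# Lower the first edge by 0.1% of the range: 18 - 0.001 * 27 = 17.973.
width_edges[0] -= 0.001 * (hi - lo)

# Equal-depth: the edges are the 0, 1/3, 2/3 and 1 quantiles of the data.
depth_edges = ages.quantile(np.linspace(0, 1, n_bins + 1)).to_numpy()

# Edges as computed by pandas itself, for comparison.
_, cut_edges = pd.cut(ages, n_bins, retbins=True)
_, qcut_edges = pd.qcut(ages, n_bins, retbins=True)

print(width_edges, cut_edges)
print(depth_edges, qcut_edges)
```

The quantile edges fall between data points (for example 23.667 lies between 23 and 25) because the quantiles are linearly interpolated.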