Amazon Employee Access 数据分析报告
报告摘要
- 目标:本分析旨在利用Amazon的员工编号相关信息,来分析和预测当员工申请访问某个编号的资源时,是否被允许访问。
- 方法: 在原有部分变量的基础上,利用特征工程的方法,新增了单变量、双变量、三变量、四变量出现的频率和变量出现的条件概率等变量,利用随机森林模型,对目标变量进行预测。
- 结论:
- 一、对于训练集数据分析发现,各变量之间存在着一定的联系,其中ROLE_TITLE变量和ROLE_RODE变量存在一对一的关系,ROLE_TITLE变量和ROLE_FAMILY变量存在多对一的关系,其他变量之间也存在较强的对应关系。
- 二、根据这种方法建模,发现模型具有一定的预测效果。
目录
- 问题描述
- 数据加载
- 数据探索
- 特征工程
- 降维
- 新增单变量频率
- 新增双变量频率
- 新增三变量频率
- 新增四变量频率
- 新增各变量出现的条件概率
- 模型建立
- 模型预测与评价
一、问题描述
利用Amazon员工的编号信息,包括员工经理的编号、员工所在分类的编号、员工所在部门编号、员工职位编号、员工类别编号等信息,来预测当员工申请访问某个 编号的资源时,是否被允许访问。
变量名 |
含义 |
---|
ACTION |
1代表资源被授权访问,0代表资源未被授权访问 |
RESOURCE |
资源编号 |
MGR_ID |
员工经理的编号 |
ROLE_ROLLUP_1 |
公司员工分类1,如美国工程 |
ROLE_ROLLUP_2 |
公司员工分类2,如美国零售 |
ROLE_DEPTNAME |
公司部门描述,如零售 |
ROLE_TITLE |
职位名称,如高级工程零售经理 |
ROLE_FAMILY_DESC |
公司员工类别扩展描述,如零售经理,软件工程 |
ROLE_FAMILY |
公司员工类别描述,如零售经理 |
ROLE_CODE |
员工角色编号 |
二、数据加载
加载所需的python库
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
get_ipython().magic('matplotlib inline')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns
载入train数据集
amazon = pd.read_csv("C:/Users/cs/Desktop/Amazon/train.csv")
data =amazon
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
ROLE_FAMILY |
ROLE_CODE |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
290919 |
117908 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
308574 |
118539 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
19721 |
117880 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
290919 |
118322 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
19793 |
119325 |
---|
三、数据探索
3.1 描述性统计
train数据集共有32769个样本,不存在缺失值
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32769 entries, 0 to 32768
Data columns (total 10 columns):
ACTION 32769 non-null int64
RESOURCE 32769 non-null int64
MGR_ID 32769 non-null int64
ROLE_ROLLUP_1 32769 non-null int64
ROLE_ROLLUP_2 32769 non-null int64
ROLE_DEPTNAME 32769 non-null int64
ROLE_TITLE 32769 non-null int64
ROLE_FAMILY_DESC 32769 non-null int64
ROLE_FAMILY 32769 non-null int64
ROLE_CODE 32769 non-null int64
dtypes: int64(10)
memory usage: 2.5 MB
样本中,约有5.8%的员工授权申请没有通过,除ACTION外,各变量编号从个位数到六位数不等。
data.describe()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
ROLE_FAMILY |
ROLE_CODE |
---|
count |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
32769.000000 |
---|
mean |
0.942110 |
42923.916171 |
25988.957979 |
116952.627788 |
118301.823156 |
118912.779914 |
125916.152644 |
170178.369648 |
183703.408893 |
119789.430132 |
---|
std |
0.233539 |
34173.892702 |
35928.031650 |
10875.563591 |
4551.588572 |
18961.322917 |
31036.465825 |
69509.462130 |
100488.407413 |
5784.275516 |
---|
min |
0.000000 |
0.000000 |
25.000000 |
4292.000000 |
23779.000000 |
4674.000000 |
117879.000000 |
4673.000000 |
3130.000000 |
117880.000000 |
---|
25% |
1.000000 |
20299.000000 |
4566.000000 |
117961.000000 |
118102.000000 |
118395.000000 |
118274.000000 |
117906.000000 |
118363.000000 |
118232.000000 |
---|
50% |
1.000000 |
35376.000000 |
13545.000000 |
117961.000000 |
118300.000000 |
118921.000000 |
118568.000000 |
128696.000000 |
119006.000000 |
118570.000000 |
---|
75% |
1.000000 |
74189.000000 |
42034.000000 |
117961.000000 |
118386.000000 |
120535.000000 |
120006.000000 |
235280.000000 |
290919.000000 |
119348.000000 |
---|
max |
1.000000 |
312153.000000 |
311696.000000 |
311178.000000 |
286791.000000 |
286792.000000 |
311867.000000 |
311867.000000 |
308574.000000 |
270691.000000 |
---|
查看各变量上不同编号的种类数。可以发现,在30000多个样本中,RESOURCE、MGR_ID和ROLE_FAMILY上编号种类数较多,其他变量上编号种类数较少。
值得注意的是,ROLE_TITLE和ROLE_CODE种类数一致。
f = lambda x: x.unique().size
data.apply(f)
ACTION 2
RESOURCE 7518
MGR_ID 4243
ROLE_ROLLUP_1 128
ROLE_ROLLUP_2 177
ROLE_DEPTNAME 449
ROLE_TITLE 343
ROLE_FAMILY_DESC 2358
ROLE_FAMILY 67
ROLE_CODE 343
dtype: int64
3.2 变量间的对应关系探索
3.2.1 ROLE_TITLE与ROLE_CODE
画出ROLE_TITLE和ROLE_CODE变量的散点图,存在明显的正相关关系。
fig,ax = plt.subplots(nrows=1,ncols=1,figsize=(8,5))
plt.scatter(data.ROLE_TITLE,data.ROLE_CODE)
<matplotlib.collections.PathCollection at 0xabb0c50>

将两个变量的值合并,编号的种类数目仍为343,
TITLE_CODE = data.ROLE_TITLE*1000000+data.ROLE_CODE
TITLE_CODE.unique().size
343
f2 = lambda x: x[x!=0].count()
TICO = pd.crosstab(data.ROLE_TITLE,data.ROLE_CODE)
TICO.apply(f2).plot()
TICO.apply(f2)[TICO.apply(f2)>1]
Series([], dtype: int64)

观察交叉表中ROLE_TITLE变量对应的ROLE_CODE变量个数,也为0,说明两个变量间存在一一对应的关系
TICO.apply(f2,axis=1).plot()
TICO.apply(f2,axis=1)[TICO.apply(f2,axis=1)>1]
Series([], dtype: int64)

3.2.2 ROLE_ROLLUP_1与ROLE_DEPTNAME
RO1_DEP= data.ROLE_ROLLUP_1*10000000+data.ROLE_DEPTNAME
data.ROLE_ROLLUP_1.unique().size, data.ROLE_DEPTNAME.unique().size, RO1_DEP.unique().size
(128, 449, 1185)
3.2.3 ROLE_ROLLUP_2与ROLE_DEPTNAME
RO2_DEP= data.ROLE_ROLLUP_2*10000000+data.ROLE_DEPTNAME
data.ROLE_ROLLUP_2.unique().size, data.ROLE_DEPTNAME.unique().size, RO2_DEP.unique().size
(177, 449, 1398)
3.2.4 ROLE_ROLLUP_1与ROLE_ROLLUP_2
RO1_RO2= data.ROLE_ROLLUP_1*10000000+data.ROLE_ROLLUP_2
data.ROLE_ROLLUP_1.unique().size, data.ROLE_ROLLUP_2.unique().size, RO1_RO2.unique().size
(128, 177, 187)
ctRO12 = pd.crosstab(data.ROLE_ROLLUP_1,data.ROLE_ROLLUP_2)
ctRO12.apply(f2).plot()
ctRO12.apply(f2)[ctRO12.apply(f2)>1]
ROLE_ROLLUP_2
118164 2
118178 2
119256 9
dtype: int64

a = data.ROLE_ROLLUP_2[(data.ROLE_ROLLUP_2==118164) | (data.ROLE_ROLLUP_2==118178)| (data.ROLE_ROLLUP_2==119256)].count()
b = data.ACTION[(data.ROLE_ROLLUP_2==118164) | (data.ROLE_ROLLUP_2==118178)| (data.ROLE_ROLLUP_2==119256)].value_counts()
b,a
(1 380
0 36
Name: ACTION, dtype: int64, 416)
ctRO12.apply(f2,axis=1)[ctRO12.apply(f2,axis=1)>1].count()
32
3.2.5 ROLE_FAMILY与ROLE_FAMILY_DESC
FA_DESC= data.ROLE_FAMILY_DESC*1000000+data.ROLE_FAMILY
data.ROLE_FAMILY_DESC.unique().size,data.ROLE_FAMILY.unique().size, FA_DESC.unique().size
(2358, 67, 2586)
ctFAFA = pd.crosstab(data.ROLE_FAMILY,data.ROLE_FAMILY_DESC)
3.2.6 ROLE_TITLE和ROLE_FAMILY
TIFA = data.ROLE_TITLE*1000000+data.ROLE_FAMILY
data.ROLE_TITLE.unique().size, data.ROLE_FAMILY.unique().size, TIFA.unique().size
(343, 67, 343)
ctTIFA = pd.crosstab(data.ROLE_TITLE,data.ROLE_FAMILY)
ctTIFA.apply(f2,axis=1).plot()
ctTIFA.apply(f2,axis=1)[ctTIFA.apply(f2,axis=1)>1].count()
0

3.3 变量分布探索
fig,ax = plt.subplots(figsize=(8,5))
data.ACTION.value_counts().plot(kind="bar",color="lightblue")
ax.set_xticklabels(("Accessed","Not Accessed"), rotation= "horizontal" )
ax.set_title("Bar plot of Action")
<matplotlib.text.Text at 0xd173080>

fig,ax = plt.subplots(nrows=4,ncols=2,figsize=(20,40))
data.RESOURCE.hist(ax=ax[0,0],bins=100)
ax[0,0].set_title("Hist plot of RESOURCE")
data.MGR_ID.hist(ax=ax[0,1],bins=100)
ax[0,1].set_title("Hist plot of MGR_ID")
data.ROLE_ROLLUP_1.hist(ax=ax[1,0],bins=100)
ax[1,0].set_title("Hist plot of ROLE_ROLLUP_1")
data.ROLE_ROLLUP_2.hist(ax=ax[1,1],bins=100)
ax[1,1].set_title("Hist plot of ROLE_ROLLUP_2")
data.ROLE_DEPTNAME.hist(ax=ax[2,0],bins=100)
ax[2,0].set_title("ROLE_DEPTNAME")
data.ROLE_TITLE.hist(ax=ax[2,1],bins=100)
ax[2,1].set_title("Hist plot of ROLE_TITLE")
data.ROLE_FAMILY_DESC.hist(ax=ax[3,0],bins=100)
ax[3,0].set_title("Hist plot of ROLE_FAMILY_DESC")
data.ROLE_FAMILY.hist(ax=ax[3,1],bins=100)
ax[3,1].set_title("Hist plot of ROLE_FAMILY")
<matplotlib.text.Text at 0xdb2fa90>

cm = np.corrcoef(data.values.T)
sns.set(font_scale=1)
cols = data.columns
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 10},
yticklabels=cols,
xticklabels=cols)
plt.tight_layout()
plt.show()

四、特征工程
4.1 降维
data = amazon
del data["ROLE_CODE"]
del data["ROLE_FAMILY"]
amazon = pd.read_csv("C:/Users/cs/Desktop/Amazon/train.csv")
4.2 新增单变量的频率
one= ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(one)):
a=data[one[i]]
b=data[one[i]].value_counts()/32769
a=a.map(b)
data[one[i]+"_prob"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
ROLE_ROLLUP_1_prob |
ROLE_ROLLUP_2_prob |
ROLE_DEPTNAME_prob |
ROLE_TITLE_prob |
ROLE_FAMILY_DESC_prob |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
0.653270 |
0.135006 |
0.002197 |
0.109341 |
0.210443 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
0.653270 |
0.120388 |
0.004852 |
0.002472 |
0.000366 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
0.005615 |
0.005615 |
0.016662 |
0.038329 |
0.001007 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
0.653270 |
0.120388 |
0.005798 |
0.141872 |
0.037963 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
0.008423 |
0.004211 |
0.001373 |
0.002289 |
0.000580 |
---|
4.3 新增双变量的频率
two = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(two)):
for j in range(i+1,len(two)):
a=data[two[i]]+data[two[j]]*1000000
b=a.value_counts()/32769
a=a.map(b)
data[two[i]+"_"+two[j]+"_prob"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob |
ROLE_ROLLUP_1_ROLE_DEPTNAME_prob |
ROLE_ROLLUP_1_ROLE_TITLE_prob |
ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob |
ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |
ROLE_ROLLUP_2_ROLE_TITLE_prob |
ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |
ROLE_DEPTNAME_ROLE_TITLE_prob |
ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.135006 |
0.002014 |
0.089200 |
0.180659 |
0.002014 |
0.013855 |
0.033233 |
0.000671 |
0.001678 |
0.079557 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.120388 |
0.003815 |
0.002472 |
0.000366 |
0.003754 |
0.000580 |
0.000153 |
0.000153 |
0.000153 |
0.000366 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.005615 |
0.000397 |
0.001556 |
0.000061 |
0.000397 |
0.001556 |
0.000061 |
0.005615 |
0.000061 |
0.000061 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.120388 |
0.005401 |
0.125057 |
0.036956 |
0.005035 |
0.022460 |
0.007782 |
0.003052 |
0.001770 |
0.016204 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.004211 |
0.000671 |
0.000488 |
0.000305 |
0.000549 |
0.000244 |
0.000244 |
0.000183 |
0.000183 |
0.000519 |
---|
5 rows × 36 columns
4.4 新增三变量的频率
three = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(three)):
for j in range(i+1,len(three)):
for k in range(j+1,len(three)):
a = data[three[i]]*100000*100000+data[three[j]]*1000000+data[three[k]]
b = a.value_counts()/91690
a = a.map(b)
data[three[i]+"_"+three[j]+"_"+three[k]+"_"+"prob"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |
ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob |
ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |
ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob |
ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob |
ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.000720 |
0.004951 |
0.011877 |
0.000185 |
0.000556 |
0.023220 |
0.000185 |
0.000556 |
0.003937 |
0.000218 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.001341 |
0.000207 |
0.000055 |
0.000055 |
0.000055 |
0.000131 |
0.000055 |
0.000055 |
0.000055 |
0.000055 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.000142 |
0.000556 |
0.000022 |
0.000055 |
0.000022 |
0.000022 |
0.000055 |
0.000022 |
0.000022 |
0.000022 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.001800 |
0.008027 |
0.002781 |
0.001091 |
0.000633 |
0.005682 |
0.000971 |
0.000534 |
0.001451 |
0.000545 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.000196 |
0.000087 |
0.000087 |
0.000044 |
0.000044 |
0.000109 |
0.000022 |
0.000022 |
0.000087 |
0.000065 |
---|
5 rows × 71 columns
4.5 新增四变量的频率
four = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(1,len(four)):
for j in range(i+1,len(four)):
for k in range(j+1,len(four)):
a = data[four[0]]*100000*100000+data[four[i]]*1000000+data[four[j]]+data[four[k]]*0.000001
b = a.value_counts()/32769
a = a.map(b)
data[four[0]+"_"+four[i]+"_"+four[j]+"_"+four[k]+"_"+"prob"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |
RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob |
RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob |
RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob |
RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.000092 |
0.000092 |
0.000092 |
0.000031 |
0.000031 |
0.000061 |
0.000031 |
0.000031 |
0.000061 |
0.000031 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.000336 |
0.000305 |
0.000153 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.000061 |
0.000061 |
0.000031 |
0.000061 |
0.000031 |
0.000031 |
0.000061 |
0.000031 |
0.000031 |
0.000031 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.000061 |
0.000061 |
0.000061 |
0.000061 |
0.000061 |
0.000092 |
0.000031 |
0.000031 |
0.000061 |
0.000061 |
---|
5 rows × 91 columns
4.6 新增各变量出现频率的条件概率
resourcetwo = ['RESOURCE_MGR_ID_prob','RESOURCE_ROLE_ROLLUP_1_prob', 'RESOURCE_ROLE_ROLLUP_2_prob','RESOURCE_ROLE_DEPTNAME_prob',
'RESOURCE_ROLE_TITLE_prob','RESOURCE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcetwo)):
a = data[resourcetwo[i]]/data.RESOURCE_prob
data[resourcetwo[i]+"_"+"probre"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob |
RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |
RESOURCE_MGR_ID_prob_probre |
RESOURCE_ROLE_ROLLUP_1_prob_probre |
RESOURCE_ROLE_ROLLUP_2_prob_probre |
RESOURCE_ROLE_DEPTNAME_prob_probre |
RESOURCE_ROLE_TITLE_prob_probre |
RESOURCE_ROLE_FAMILY_DESC_prob_probre |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.000031 |
0.000031 |
0.000061 |
0.000031 |
1.000000 |
1.000000 |
1.000000 |
0.333333 |
0.666667 |
1.000000 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
0.033333 |
0.866667 |
0.366667 |
0.033333 |
0.033333 |
0.033333 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.000061 |
0.000031 |
0.000031 |
0.000031 |
0.500000 |
1.000000 |
1.000000 |
1.000000 |
1.000000 |
0.500000 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.000031 |
0.000031 |
0.000031 |
0.000031 |
1.000000 |
1.000000 |
1.000000 |
1.000000 |
1.000000 |
1.000000 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.000031 |
0.000031 |
0.000061 |
0.000061 |
0.250000 |
0.375000 |
0.250000 |
0.250000 |
0.500000 |
0.375000 |
---|
5 rows × 97 columns
resourcetwo = ['RESOURCE_MGR_ID_prob','RESOURCE_ROLE_ROLLUP_1_prob', 'RESOURCE_ROLE_ROLLUP_2_prob','RESOURCE_ROLE_DEPTNAME_prob',
'RESOURCE_ROLE_TITLE_prob','RESOURCE_ROLE_FAMILY_DESC_prob']
resourceone = [ 'MGR_ID_prob', 'ROLE_ROLLUP_1_prob','ROLE_ROLLUP_2_prob', 'ROLE_DEPTNAME_prob', 'ROLE_TITLE_prob','ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcetwo)):
a = data[resourcetwo[i]]/data[resourceone[i]]
data[resourcetwo[i]+"_"+"proboth"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
RESOURCE_ROLE_ROLLUP_2_prob_probre |
RESOURCE_ROLE_DEPTNAME_prob_probre |
RESOURCE_ROLE_TITLE_prob_probre |
RESOURCE_ROLE_FAMILY_DESC_prob_probre |
RESOURCE_MGR_ID_prob_proboth |
RESOURCE_ROLE_ROLLUP_1_prob_proboth |
RESOURCE_ROLE_ROLLUP_2_prob_proboth |
RESOURCE_ROLE_DEPTNAME_prob_proboth |
RESOURCE_ROLE_TITLE_prob_proboth |
RESOURCE_ROLE_FAMILY_DESC_prob_proboth |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
1.000000 |
0.333333 |
0.666667 |
1.000000 |
0.054545 |
0.000140 |
0.000678 |
0.013889 |
0.000558 |
0.000435 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.366667 |
0.033333 |
0.033333 |
0.033333 |
0.100000 |
0.001215 |
0.002788 |
0.006289 |
0.012346 |
0.083333 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
1.000000 |
1.000000 |
1.000000 |
0.500000 |
0.333333 |
0.010870 |
0.010870 |
0.003663 |
0.001592 |
0.030303 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
1.000000 |
1.000000 |
1.000000 |
1.000000 |
0.016129 |
0.000047 |
0.000253 |
0.005263 |
0.000215 |
0.000804 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.250000 |
0.250000 |
0.500000 |
0.375000 |
0.222222 |
0.010870 |
0.014493 |
0.044444 |
0.053333 |
0.157895 |
---|
5 rows × 103 columns
resourcethree = [ 'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob','RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob', 'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob',
'RESOURCE_MGR_ID_ROLE_TITLE_prob','RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob', 'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob',
'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob', 'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob', 'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob','RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcethree)):
a = data[resourcethree[i]]/data.RESOURCE_prob
data[resourcethree[i]+"_"+"probre"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_probre |
RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_probre |
RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob_probre |
RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_probre |
RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_probre |
RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob_probre |
RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_probre |
RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob_probre |
RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_probre |
RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob_probre |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.357389 |
0.119130 |
0.238259 |
0.357389 |
0.119130 |
0.238259 |
0.357389 |
0.119130 |
0.119130 |
0.238259 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.131043 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
0.011913 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.357389 |
0.357389 |
0.357389 |
0.178695 |
0.357389 |
0.357389 |
0.178695 |
0.357389 |
0.178695 |
0.178695 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
0.357389 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.089347 |
0.089347 |
0.134021 |
0.134021 |
0.044674 |
0.089347 |
0.089347 |
0.089347 |
0.089347 |
0.134021 |
---|
5 rows × 118 columns
resourcethree = [ 'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob','RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob', 'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob',
'RESOURCE_MGR_ID_ROLE_TITLE_prob','RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob', 'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob',
'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob', 'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob', 'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob','RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob']
othertwo = ['MGR_ID_ROLE_ROLLUP_1_prob','MGR_ID_ROLE_ROLLUP_2_prob','MGR_ID_ROLE_DEPTNAME_prob', 'MGR_ID_ROLE_TITLE_prob',
'MGR_ID_ROLE_FAMILY_DESC_prob', 'ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob', 'ROLE_ROLLUP_1_ROLE_DEPTNAME_prob', 'ROLE_ROLLUP_1_ROLE_TITLE_prob',
'ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob', 'ROLE_ROLLUP_2_ROLE_DEPTNAME_prob', 'ROLE_ROLLUP_2_ROLE_TITLE_prob',
'ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob', 'ROLE_DEPTNAME_ROLE_TITLE_prob','ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob',
'ROLE_TITLE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcethree)):
a = data[resourcethree[i]]/data[othertwo[i]]
data[othertwo[i]+"_"+"proboth"]=a
data.head()
|
ACTION |
RESOURCE |
MGR_ID |
ROLE_ROLLUP_1 |
ROLE_ROLLUP_2 |
ROLE_DEPTNAME |
ROLE_TITLE |
ROLE_FAMILY_DESC |
RESOURCE_prob |
MGR_ID_prob |
... |
ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_proboth |
ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_proboth |
ROLE_ROLLUP_1_ROLE_TITLE_prob_proboth |
ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_proboth |
ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_proboth |
ROLE_ROLLUP_2_ROLE_TITLE_prob_proboth |
ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_proboth |
ROLE_DEPTNAME_ROLE_TITLE_prob_proboth |
ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_proboth |
ROLE_TITLE_ROLE_FAMILY_DESC_prob_proboth |
---|
0 |
1 |
39353 |
85475 |
117961 |
118300 |
123472 |
117905 |
117906 |
0.000092 |
0.001678 |
... |
0.000242 |
0.005415 |
0.000245 |
0.000181 |
0.005415 |
0.001574 |
0.000985 |
0.016245 |
0.006498 |
0.000274 |
---|
1 |
1 |
17183 |
1540 |
117961 |
118343 |
123125 |
118536 |
118536 |
0.000915 |
0.000305 |
... |
0.000997 |
0.002859 |
0.004412 |
0.029782 |
0.002906 |
0.018810 |
0.071478 |
0.071478 |
0.071478 |
0.029782 |
---|
2 |
1 |
36724 |
14457 |
118219 |
118220 |
117884 |
117879 |
267952 |
0.000061 |
0.000092 |
... |
0.003885 |
0.054983 |
0.014015 |
0.178695 |
0.054983 |
0.014015 |
0.178695 |
0.003885 |
0.178695 |
0.178695 |
---|
3 |
1 |
36135 |
5396 |
117961 |
118343 |
119993 |
118321 |
240983 |
0.000031 |
0.001892 |
... |
0.000091 |
0.002019 |
0.000087 |
0.000295 |
0.002166 |
0.000486 |
0.001402 |
0.003574 |
0.006162 |
0.000673 |
---|
4 |
1 |
42680 |
5905 |
117929 |
117930 |
119569 |
119323 |
123932 |
0.000244 |
0.000275 |
... |
0.005180 |
0.032490 |
0.067010 |
0.107217 |
0.019855 |
0.089347 |
0.089347 |
0.119130 |
0.119130 |
0.063069 |
---|
5 rows × 133 columns
五、模型建立
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report
y = data.ACTION
X = data
del X["ACTION"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(criterion='entropy',
n_estimators=1000,
random_state=1,
n_jobs=2)
RFfit = forest.fit(X_train , y_train)
六、模型预测与评价
preds = RFfit.predict(X_test)
confusion_matrix(y_test,preds)
array([[ 138, 420],
[ 59, 9214]])
pre = RFfit.predict_proba(X_test)
roc_auc_score(y_test,pre[:,1])
0.8639483844684166
fpr,tpr,thresholds = roc_curve(y_test,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Randomforest")
<matplotlib.text.Text at 0x26395198>

利用Kaggle测试集得分为0.89,说明模型具有一定的效果。