• It is a historical fact that during the legendary voyage of the Titanic, the wireless telegraph machine delivered 6 warnings about the danger of icebergs. Each of the telegraph messages described the...
# The idea is to create a table which contains just 1's and 0's. The array will be a survival
# reference table, whereby you read in the test data, find each passenger's attributes, look them
# up in the survival table, and determine whether they should be predicted to survive or not.
# In the case of a model that uses gender, class, and ticket price, you will need a 2x3x4 array
# ([female/male], [1st/2nd/3rd class], [4 bins of prices]). The script will systematically loop
# through each combination and use the "where" function in Python to search for passengers that
# fit that combination of variables. Just as before, you can ask which indices in your data equal
# female, 1st class, and paid more than $30. For the sake of binning, let's say everything equal
# to and above 40 "equals" 39, so it falls in the last bin. You can then set the bins:

fare_ceiling = 40
# then modify the data in the Fare column: set it to 39 if it is greater than or equal to the ceiling

data[data[0::,9].astype(np.float) >= fare_ceiling, 9] = fare_ceiling - 1.0

# I know there were 1st, 2nd and 3rd classes on board
number_of_classes = 3

# but it is better practice to calculate this from the data directly:
# take the length of an array of unique values in column index 2
number_of_classes = len(np.unique(data[0::,2]))

# set the fare bins: 4 brackets of size 10, i.e. 0-9, 10-19, 20-29, 30-39
fare_bracket_size = 10
number_of_price_brackets = fare_ceiling // fare_bracket_size

# initialize the survival table with all zeros
survival_table = np.zeros((2, number_of_classes, number_of_price_brackets))

# now that these are set up, you can loop through each variable and find all those
# passengers that agree with the statements:
for i in xrange(number_of_classes):                # loop through each class
    for j in xrange(number_of_price_brackets):    # loop through each price bin

        women_only_stats = data[                               # which element
            (data[0::,4] == "female")                          # is a female
            & (data[0::,2].astype(np.float) == i+1)            # and was ith class
            & (data[0::,9].astype(np.float)                    # was greater
               >= j*fare_bracket_size)                         # than this bin
            & (data[0::,9].astype(np.float)                    # and less than
               < (j+1)*fare_bracket_size)                      # the next bin
            , 1]                                               # in the 2nd col

        men_only_stats = data[                                 # which element
            (data[0::,4] != "female")                          # is a male
            & (data[0::,2].astype(np.float) == i+1)            # and was ith class
            & (data[0::,9].astype(np.float)                    # was greater
               >= j*fare_bracket_size)                         # than this bin
            & (data[0::,9].astype(np.float)                    # and less than
               < (j+1)*fare_bracket_size)                      # the next bin
            , 1]                                               # in the 2nd col
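The loop above only collects the Survived column for each (gender, class, fare-bin) cell. A minimal, self-contained sketch of the remaining steps, averaging each cell into a 0/1 entry and looking test passengers up in the table, might look like this (Python 3, with a made-up six-passenger dataset; the `predict` helper is an assumption for illustration, not part of the original script):

```python
import numpy as np

# hypothetical mini-dataset: one entry per passenger
survived = np.array([1, 0, 1, 0, 1, 0])
pclass   = np.array([1, 1, 2, 3, 1, 3]).astype(float)
sex      = np.array(["female", "male", "female", "male", "female", "male"])
fare     = np.array([75.0, 60.0, 15.0, 7.0, 33.0, 8.0])

fare_ceiling, fare_bracket_size = 40, 10
number_of_classes = len(np.unique(pclass))
number_of_price_brackets = fare_ceiling // fare_bracket_size
fare = np.minimum(fare, fare_ceiling - 1.0)      # cap fares at 39, as in the text

survival_table = np.zeros((2, number_of_classes, number_of_price_brackets))
for i in range(number_of_classes):
    for j in range(number_of_price_brackets):
        in_bin = (pclass == i + 1) \
                 & (fare >= j * fare_bracket_size) \
                 & (fare < (j + 1) * fare_bracket_size)
        for s, who in enumerate([sex == "female", sex != "female"]):
            stats = survived[who & in_bin]
            # each cell holds the mean survival rate of its group
            survival_table[s, i, j] = 0.0 if stats.size == 0 else stats.mean()

# threshold every cell to a hard 0/1 prediction
survival_table[survival_table < 0.5] = 0
survival_table[survival_table >= 0.5] = 1

# predicting a test passenger is then a table lookup on (gender, class, fare bin)
def predict(sex_str, pclass_val, fare_val):
    j = int(min(fare_val, fare_ceiling - 1.0) // fare_bracket_size)
    s = 0 if sex_str == "female" else 1
    return int(survival_table[s, int(pclass_val) - 1, j])
```

With this toy data, `predict("female", 1, 75)` looks up the first-class women in the top fare bin and returns 1.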


    Background music: 保留 - 郭顶
Previous post: Titanic Survival Prediction 1, which covered the feature engineering.
This post covers how to train models to make the predictions.
%matplotlib inline
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

1. Load the data
path_data = '../../data/titanic/'

# df is the combined train+test DataFrame produced by the feature engineering in part 1
df_data_y = df['Survived']
df_data_x = df.drop(['Survived', 'PassengerId'], axis=1)

df_train_x = df_data_x.iloc[:891, :]  # the first 891 rows are the training set
df_train_y = df_data_y[:891]

2. Feature selection
I chose GBDT (gradient-boosted decision trees) for feature selection. This follows from how decision trees work: at each split the tree picks a feature by computing information gain (or another criterion), so while predicting it also "measures" each feature's contribution, which makes the result easy to visualize.
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
gbdt_rfe = feature_selection.RFECV(ensemble.GradientBoostingClassifier(random_state=2018), step = 1, scoring = 'accuracy', cv = cv_split)
gbdt_rfe.fit(df_train_x, df_train_y)
columns_rfe = df_train_x.columns.values[gbdt_rfe.get_support()]
print('Picked columns: {}'.format(columns_rfe))
print("Optimal number of features : {}/{}".format(gbdt_rfe.n_features_, len(df_train_x.columns)))
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(gbdt_rfe.grid_scores_) + 1), gbdt_rfe.grid_scores_)
plt.show()

The output:
Picked columns: ['Age' 'Fare' 'Pclass' 'SibSp' 'FamilySize' 'Family_Survival' 'Sex_Code' 'Title_Master' 'Title_Mr' 'Cabin_C' 'Cabin_E' 'Cabin_X']
Optimal number of features : 12/24


With about 5 or more features, the cross-validation score already levels off, which suggests that few of the existing features contribute much.
The best result appears at 12 features. Note, however, that the competition score is not decided by your cross-validation set, so there is some randomness involved. Since performance is similar over a long stretch of feature counts, every choice from 5 to 24 features is worth trying if you get the chance.
I personally compared 24 features against 12, and using all 24 worked best; I did not try the others.
Then standardize the features for training:
stsc = StandardScaler()
df_data_x = stsc.fit_transform(df_data_x)
print('mean:\n', stsc.mean_)
print('var:\n', stsc.var_)

df_train_x = df_data_x[:891]
df_train_y = df_data_y[:891]

df_test_x = df_data_x[891:]
df_test_output = df.iloc[891:, :][['PassengerId','Survived']]

3. Model ensembling
The usual machine-learning routine is:
Pick a base model, train and predict, to get a pipeline up as quickly as possible.
On that basis, tune the model with cross-validation and GridSearch, and check its performance.
Combine several models with model ensembling, predicting by voting (or some other scheme).
Generally, an ensemble performs better than any single model.
Here I skipped steps 1 and 2 and went straight to step 3.
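For completeness, steps 1 and 2 can be sketched as follows (random placeholder data standing in for the real features, and a hyperparameter grid chosen only for illustration):

```python
import numpy as np
from sklearn import model_selection, ensemble

# placeholder training data standing in for df_train_x / df_train_y
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.1 * rng.rand(200) > 0.5).astype(int)

# step 1: a single baseline model, to get a pipeline running quickly
base = ensemble.RandomForestClassifier(random_state=0)
base_score = model_selection.cross_val_score(base, X, y, cv=5, scoring='accuracy').mean()

# step 2: tune the model with cross-validation and GridSearchCV
grid = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [2, 4]},
    cv=5, scoring='accuracy')
grid.fit(X, y)

print(base_score, grid.best_score_, grid.best_params_)
```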
3.1 Set the basic parameters
vote_est = [
('ada', ensemble.AdaBoostClassifier()),
('bc', ensemble.BaggingClassifier()),
('etc', ensemble.ExtraTreesClassifier()),
('gbc', ensemble.GradientBoostingClassifier()),
('rfc', ensemble.RandomForestClassifier()),
('gpc', gaussian_process.GaussianProcessClassifier()),
('lr', linear_model.LogisticRegressionCV()),
('bnb', naive_bayes.BernoulliNB()),
('gnb', naive_bayes.GaussianNB()),
('knn', neighbors.KNeighborsClassifier()),
('svc', svm.SVC(probability=True)),
('xgb', XGBClassifier())
]

grid_n_estimator = [10, 50, 100, 300, 500]
grid_ratio = [.5, .8, 1.0]
grid_learn = [.001, .005, .01, .05, .1]
grid_max_depth = [2, 4, 6, 8, 10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]

grid_param = [
# AdaBoostClassifier
{
'n_estimators':grid_n_estimator,
'learning_rate':grid_learn,
'random_state':grid_seed
},
# BaggingClassifier
{
'n_estimators':grid_n_estimator,
'max_samples':grid_ratio,
'random_state':grid_seed
},
# ExtraTreesClassifier
{
'n_estimators':grid_n_estimator,
'criterion':grid_criterion,
'max_depth':grid_max_depth,
'random_state':grid_seed
},
# GradientBoostingClassifier
{
'learning_rate':grid_learn,
'n_estimators':grid_n_estimator,
'max_depth':grid_max_depth,
'random_state':grid_seed
},
# RandomForestClassifier
{
'n_estimators':grid_n_estimator,
'criterion':grid_criterion,
'max_depth':grid_max_depth,
'oob_score':[True],
'random_state':grid_seed
},
# GaussianProcessClassifier
{
'max_iter_predict':grid_n_estimator,
'random_state':grid_seed
},
# LogisticRegressionCV
{
'fit_intercept':grid_bool,  # default: True
'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
'random_state':grid_seed
},
# BernoulliNB
{
'alpha':grid_ratio,
},
# GaussianNB
{},
# KNeighborsClassifier
{
'n_neighbors':range(6, 25),
'weights':['uniform', 'distance'],
'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']
},
# SVC
{
'C':[1, 2, 3, 4, 5],
'gamma':grid_ratio,
'decision_function_shape':['ovo', 'ovr'],
'probability':[True],
'random_state':grid_seed
},
# XGBClassifier
{
'learning_rate':grid_learn,
'max_depth':[1, 2, 4, 6, 8, 10],
'n_estimators':grid_n_estimator,
'seed':grid_seed
}
]

3.2 Training
Each model is tuned before being combined. Some need many iterations, so to save time I used RandomizedSearchCV to simplify things (I have not yet had time to try full GridSearchCV for all of them).
start_total = time.perf_counter()
N = 0
for clf, param in zip(vote_est, grid_param):
    start = time.perf_counter()
    cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
    if 'n_estimators' not in param.keys():
        print(clf[1].__class__.__name__, 'GridSearchCV')
        best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'accuracy')
        best_search.fit(df_train_x, df_train_y)
        best_param = best_search.best_params_
    else:
        print(clf[1].__class__.__name__, 'RandomizedSearchCV')
        best_search = model_selection.RandomizedSearchCV(estimator = clf[1], param_distributions = param, cv = cv_split, scoring = 'accuracy')
        best_search.fit(df_train_x, df_train_y)
        best_param = best_search.best_params_
    run = time.perf_counter() - start

    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param)

run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))

4. Prediction
There are two voting schemes: soft voting and hard voting.
Hard voting: the majority rules.
Soft voting: I have not studied it closely, but some write-ups say it averages the predicted class probabilities (possibly weighted) and predicts the class with the higher average probability.
Without prior experience, it is best to run both schemes and compare the results.
For Titanic survival prediction, I found that hard voting came out ahead every time.
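To make the difference between the two schemes concrete, here is a toy illustration (probabilities invented for the example) of how they can disagree on the same passenger:

```python
import numpy as np

# toy predicted probabilities of class 1 from three models, for one passenger
probs = np.array([0.9, 0.4, 0.45])

# hard voting: each model first commits to a label, then the majority wins
labels = (probs >= 0.5).astype(int)                         # -> [1, 0, 0]
hard_vote = int(np.bincount(labels, minlength=2).argmax())  # two 0-votes beat one 1-vote

# soft voting: average the probabilities first, then threshold
soft_vote = int(probs.mean() >= 0.5)                        # mean is about 0.58

print(hard_vote, soft_vote)
```

One very confident model can flip the soft vote, while hard voting weighs every model equally; which behaves better depends on how well-calibrated the models' probabilities are.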
grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
grid_hard_cv = model_selection.cross_validate(grid_hard, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
grid_hard.fit(df_train_x, df_train_y)

print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
print('-'*10)

grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
grid_soft.fit(df_train_x, df_train_y)

print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))

The results:
Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 89.70
Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 85.97
Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.95
----------
Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 90.02
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 85.52
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 6.07

The hard-voting predictions score higher on the test folds and have a smaller standard deviation, so hard voting is the better choice.
5. Submitting the results
Use hard voting as the prediction scheme, generate the results, and submit.
df_test_output['Survived'] = grid_hard.predict(df_test_x)
df_test_output.to_csv('../../data/titanic/hardvote.csv', index = False)

Submitting on the official site gives a score of 0.81339.

Afterword
The Titanic project is well worth a try. Along the way I referred to several kernels shared by other competitors on kaggle and learned a great deal from them.
But as an introductory project, taking part is what counts; I will redo it when I have time and see whether I can improve.
Next, I plan to take on Dogs vs. Cats:
writing an algorithm to classify whether an image contains a dog or a cat.
That is easy for humans, dogs, and cats, but how do you do it with an algorithm? We shall see.

Dataset for the code: https://github.com/jsusu/Titanic_Passenger_Survival_Prediction_2/tree/master/titanic_data
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# 1. Load the data

sns.set_style('whitegrid')


   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

train_data.info()
print("-" * 40)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   PassengerId  891 non-null    int64
1   Survived     891 non-null    int64
2   Pclass       891 non-null    int64
3   Name         891 non-null    object
4   Sex          891 non-null    object
5   Age          714 non-null    float64
6   SibSp        891 non-null    int64
7   Parch        891 non-null    int64
8   Ticket       891 non-null    object
9   Fare         891 non-null    float64
10  Cabin        204 non-null    object
11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   PassengerId  418 non-null    int64
1   Pclass       418 non-null    int64
2   Name         418 non-null    object
3   Sex          418 non-null    object
4   Age          332 non-null    float64
5   SibSp        418 non-null    int64
6   Parch        418 non-null    int64
7   Ticket       418 non-null    object
8   Fare         417 non-null    float64
9   Cabin        91 non-null     object
10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

# From the above we can see that the Age, Cabin, Embarked and Fare features contain missing values.


# Plot the proportion of passengers who survived
train_data['Survived'].value_counts().plot.pie(labeldistance = 1.1,autopct = '%1.2f%%',
shadow = False,startangle = 90,pctdistance = 0.6)

# labeldistance: how far the labels sit from the center; 1.1 means 1.1x the radius
# autopct: format of the text inside the pie; %3.1f%% means a float with three digits, one after the decimal point
# startangle: starting angle; 0 starts the first wedge at 0 degrees and goes counterclockwise; starting at 90 usually looks better
# pctdistance: distance of the percentage text from the center
# patches, l_texts, p_texts: the pie chart's return values; p_texts are the texts inside the pie, l_texts the label texts outside



# 2. Handling missing values

# When analysing data, check whether it contains missing values. Some machine learning
# algorithms can handle missing values, such as neural networks; others cannot.
# Common ways of dealing with missing values:

# (1) If the dataset is large but has few missing values, drop the rows containing them.
# (2) If the attribute is not very important for learning, fill missing values with the mean or the mode.
# (3) For nominal attributes, assign a dedicated "missing" value such as 'U0', because missingness itself may carry hidden information. For example, a missing Cabin value may mean the passenger had no cabin.
train_data.Embarked[train_data.Embarked.isnull()] = train_data.Embarked.dropna().mode().values
# replace missing Cabin values with U0
train_data['Cabin'] = train_data.Cabin.fillna('U0')
#train_data.Cabin[train_data.Cabin.isnull()]='U0'
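Methods (1) and (2) are not shown in the code above; a small sketch on a made-up DataFrame (column names chosen for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 28.0],
                   'Embarked': ['S', 'C', None, 'S']})

# (1) drop every row that contains a missing value
dropped = df.dropna()

# (2) fill a numeric column with its mean, a categorical one with its mode
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
filled['Embarked'] = filled['Embarked'].fillna(filled['Embarked'].mode()[0])

print(len(dropped), filled['Age'].tolist(), filled['Embarked'].tolist())
```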


# (4) Use a model such as regression or a random forest to predict the missing attribute. Age is quite an important feature in this dataset (a quick analysis of Age shows this), so keeping the filled values reasonably accurate matters and has a noticeable impact on the result. Usually the rows with complete data serve as the training set for predicting the missing values. For this dataset, either a random forest or linear regression would work; here we use a random forest, taking the numeric attributes as features (sklearn models only handle numeric inputs, so only numeric features are selected here; in a real application the non-numeric features would need to be converted to numeric ones).

from sklearn.ensemble import RandomForestRegressor

#choose training data to predict age
age_df = train_data[['Age','Survived','Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[(train_data['Age'].notnull())]
age_df_isnull = age_df.loc[(train_data['Age'].isnull())]
X = age_df_notnull.values[:,1:]
Y = age_df_notnull.values[:,0]

# use RandomForestRegression to train data
RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
RFR.fit(X,Y)
predictAges = RFR.predict(age_df_isnull.values[:,1:])
train_data.loc[train_data['Age'].isnull(), ['Age']]= predictAges

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   PassengerId  891 non-null    int64
1   Survived     891 non-null    int64
2   Pclass       891 non-null    int64
3   Name         891 non-null    object
4   Sex          891 non-null    object
5   Age          891 non-null    float64
6   SibSp        891 non-null    int64
7   Parch        891 non-null    int64
8   Ticket       891 non-null    object
9   Fare         891 non-null    float64
10  Cabin        891 non-null    object
11  Embarked     891 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

# 3. Analysing relationships in the data
# 3.1 Sex vs. survival: Sex
print(train_data.groupby(['Sex','Survived'])['Survived'].count())

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

train_data[['Sex','Survived']].groupby(['Sex']).mean()


        Survived
Sex
female  0.742038
male    0.188908

train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()



# The survival rates by sex above show that in the Titanic disaster it really was "ladies first".

# 3.2 Cabin class vs. survival: Pclass
print(train_data.groupby(['Pclass','Survived'])['Pclass'].count())

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
Name: Pclass, dtype: int64

print(train_data[['Pclass','Survived']].groupby(['Pclass']).mean())

        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363

train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar()



# Survival rates of men and women in each cabin class:
train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()



print(train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count())

Sex     Pclass  Survived
female  1       0             3
                1            91
        2       0             6
                1            70
        3       0            72
                1            72
male    1       0            77
                1            45
        2       0            91
                1            17
        3       0           300
                1            47
Name: Survived, dtype: int64

# The chart and table show that, overall, women were given priority in the evacuation, but the different cabin classes still show clear differences.

# 3.3 Age vs. survival: Age
# Analyse the age distribution and survival separately by cabin class and by sex:

fig,ax = plt.subplots(1,2, figsize = (18,5))
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Pclass","Age",hue="Survived",data=train_data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')

ax[1].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age",hue="Survived",data=train_data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')

plt.show()


# Analyse the overall age distribution:
plt.figure(figsize=(15,5))
plt.subplot(121)
train_data['Age'].hist(bins=100)
plt.xlabel('Age')
plt.ylabel('Num')

plt.subplot(122)
train_data.boxplot(column='Age',showfliers=False)
plt.show()


# Distribution of survivors and non-survivors by age:
facet = sns.FacetGrid(train_data,hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade=True)
facet.set(xlim=(0,train_data['Age'].max()))
facet.add_legend()



# Average survival rate by age:
# average survived passengers by age
fig,axis1 = plt.subplots(1,1,figsize=(18,4))
train_data['Age_int'] = train_data['Age'].astype(int)
average_age = train_data[["Age_int", "Survived"]].groupby(['Age_int'],as_index=False).mean()
sns.barplot(x='Age_int',y='Survived',data=average_age)



print(train_data['Age'].describe())

count    891.000000
mean      29.658964
std       13.735787
min        0.420000
25%       21.000000
50%       28.000000
75%       37.000000
max       80.000000
Name: Age, dtype: float64

# There are 891 samples; the mean age is about 30, the standard deviation about 13.7, the minimum age 0.42 and the maximum 80.
# Split the passengers by age into children, teenagers, adults and the elderly, and analyse survival for each of the four groups:

bins = [0, 12, 18, 65, 100]
train_data['Age_group'] = pd.cut(train_data['Age'],bins)
by_age = train_data.groupby('Age_group')['Survived'].mean()
print(by_age)

Age_group
(0, 12]      0.506173
(12, 18]     0.466667
(18, 65]     0.364512
(65, 100]    0.125000
Name: Survived, dtype: float64

by_age.plot(kind = 'bar')



# 3.4 Title vs. survival
train_data['Title'] = train_data['Name'].str.extract(' ([A-Za-z]+)\.',expand=False)
pd.crosstab(train_data['Title'],train_data['Sex'])


Sex       female  male
Title
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1

# Survival rate by title:
train_data[['Title','Survived']].groupby(['Title']).mean().plot.bar()



# We can also look at whether name length relates to survival:

fig, axis1 = plt.subplots(1,1,figsize=(18,4))
train_data['Name_length'] = train_data['Name'].apply(len)
name_length = train_data[['Name_length','Survived']].groupby(['Name_length'], as_index=False).mean()
sns.barplot(x='Name_length', y='Survived',data=name_length)



# The chart above shows that name length does correlate with survival to some degree.

# 3.5 Siblings/spouses aboard vs. survival: SibSp

# Split the data into passengers with and without siblings/spouses aboard:
sibsp_df = train_data[train_data['SibSp'] != 0]
no_sibsp_df = train_data[train_data['SibSp'] == 0]

plt.figure(figsize=(11,5))
plt.subplot(121)
sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived','Survived'],autopct= '%1.1f%%')
plt.xlabel('sibsp')

plt.subplot(122)
no_sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived','Survived'],autopct= '%1.1f%%')
plt.xlabel('no_sibsp')

plt.show()


# 3.6 Parents/children aboard vs. survival: Parch
# The same analysis as for siblings/spouses gives:
parch_df = train_data[train_data['Parch'] != 0]
no_parch_df = train_data[train_data['Parch'] == 0]

plt.figure(figsize=(11,5))
plt.subplot(121)
parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct= '%1.2f%%')
plt.xlabel('parch')

plt.subplot(122)
no_parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.2f%%')
plt.xlabel('no_parch')

plt.show()


# 3.7 Number of relatives vs. survival: SibSp & Parch

fig, ax=plt.subplots(1,2,figsize=(15,5))
train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')



train_data['Family_Size'] = train_data['Parch'] + train_data['SibSp']+1
train_data[['Family_Size','Survived']].groupby(['Family_Size']).mean().plot.bar()



# The charts show that passengers travelling alone had a fairly low survival rate; but so did those with too many relatives aboard.

# 3.8 Fare distribution vs. survival: Fare

# First plot the fare distribution:
plt.figure(figsize=(10,5))
train_data['Fare'].hist(bins=70)

train_data.boxplot(column='Fare', by='Pclass', showfliers=False)
plt.show()


print(train_data['Fare'].describe())

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

# Plot the relationship between survival and the mean and variance of the fare:
fare_not_survived = train_data['Fare'][train_data['Survived'] == 0]
fare_survived = train_data['Fare'][train_data['Survived'] == 1]

average_fare = pd.DataFrame([fare_not_survived.mean(),fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(),fare_survived.std()])
average_fare.plot(yerr=std_fare,kind='bar',legend=False)

plt.show()


# The chart shows that fare correlates with survival to some degree: the mean fare of survivors is higher than that of non-survivors.

# 3.9 Cabin vs. survival: Cabin
# Cabin has far too many missing values, with only 204 valid entries, so it is hard to analyse how the different cabins relate to survival; during feature engineering this feature can simply be dropped. Still, we can analyse it here, grouping all the missing values into one class. First, simply treat whether a passenger has a Cabin record as the feature and compare it against survival:

# Replace missing values with "U0"
train_data.loc[train_data.Cabin.isnull(),'Cabin'] = 'U0'
train_data['Has_Cabin'] = train_data['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
train_data[['Has_Cabin','Survived']].groupby(['Has_Cabin']).mean().plot.bar()



# Analyse the different cabin types:

# create feature for the alphabetical part of the cabin number
train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
# convert the distinct cabin letters with incremental integer values
train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]
train_data[['CabinLetter','Survived']].groupby(['CabinLetter']).mean().plot.bar()




# Different cabins do have different survival rates, but the differences are not large, so this feature can simply be dropped during processing.

# 3.10 Embarkation port vs. survival: Embarked
# The Titanic sailed from Southampton in England, calling at Cherbourg in France and Queenstown in Ireland. Anyone who boarded before Queenstown could have disembarked at Cherbourg or Queenstown and would not have been caught in the disaster.

sns.countplot('Embarked',hue='Survived',data=train_data)
plt.title('Embarked and Survived')



sns.factorplot('Embarked','Survived',data = train_data, size=3, aspect=2)
plt.title('Embarked and Survived rate')
plt.show()


# The plots show that survival rates differ by embarkation port: C is highest, Q next, S lowest. That completes the analysis of the given features against survival. Reportedly the Titanic carried 2,224 passengers in total, while this training set describes only 891 of them. If this dataset was randomly drawn from the 2,224, then by the central limit theorem the sample is large enough for our analysis to be representative; if it was not drawn randomly, the conclusions may be less reliable.

# 3.11 Other features that may relate to survival
# Beyond the features in the dataset, we can imagine other factors that could influence the model: the passengers' nationality, height, weight, whether they could swim, their occupation, and so on.
# There are also features in the dataset we have not analysed: Ticket (ticket number) and Cabin (cabin number). Differences in these could affect a passenger's position on the ship and hence the order of escape. But the cabin numbers are mostly missing and the ticket numbers have too many categories to find patterns in, so during model ensembling later we let the models decide how important these factors are.

# 4. Variable transformation
# The goal of variable transformation is to turn the data into a form the models can use. Different models accept different kinds of data; Scikit-learn requires everything to be numeric, so we must convert non-numeric raw data into numeric form. The transformations below are introduced here so they can be used during feature engineering. All data can be divided into two classes:
# 1. Quantitative variables can be ordered in some way; Age is a good example.
# 2. Qualitative variables describe some aspect of an object that cannot be expressed numerically; Embarked is an example.

# 4.1 Dummy Variables
# These encode categorical or binary variables. When a qualitative variable takes a small number of frequently occurring independent values, dummy variables work well.
# Taking Embarked, which holds only the three values 'S', 'C' and 'Q', we can convert it to dummies with:

embark_dummies = pd.get_dummies(train_data['Embarked'])
train_data = train_data.join(embark_dummies)
train_data.drop(['Embarked'], axis=1, inplace=True)

embark_dummies = train_data[['S','C','Q']]


   S  C  Q
0  1  0  0
1  0  1  0
2  1  0  0
3  1  0  0
4  1  0  0

# 4.2 Factorizing
# Dummies don't handle a nominal attribute like Cabin (cabin number) well, because it takes too many distinct values. Pandas provides factorize(), which maps each category of a nominal variable to a numeric ID. Unlike dummies, this mapping produces a single feature rather than several.

# Replace missing values with "U0"
train_data['Cabin'][train_data.Cabin.isnull()] = 'U0'
# create feature for the alphabetical part of the cabin number
train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
# convert the distinct cabin letters with incremental integer values
train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]



  Cabin  CabinLetter
0    U0            0
1   C85            1
2    U0            0
3  C123            1
4    U0            0

# 4.3 Scaling
# Scaling maps values from a large range into a small one (typically -1 to 1, or 0 to 1). We often need to scale features into comparable ranges, otherwise the features with large ranges get higher weight. For example, Age may only span 0-100 while income may span 0-10,000,000, which would distort the results of models that are sensitive to magnitudes.

# Scaling Age:
from sklearn import preprocessing

assert np.size(train_data['Age']) == 891
# StandardScaler will subtract the mean from each value then scale to the unit variance
scaler = preprocessing.StandardScaler()
train_data['Age_scaled'] = scaler.fit_transform(train_data['Age'].values.reshape(-1,1))

print(train_data['Age_scaled'].head())

0   -0.557905
1    0.607590
2   -0.266531
3    0.389059
4    0.389059
Name: Age_scaled, dtype: float64

# 4.4 Binning

# Binning discretizes continuous data by looking at its "neighbours" (the surrounding values). The values are distributed into "buckets" or "bins", just as a histogram's bins divide the data into chunks.
# The code below bins Fare.

# Divide all fares into quintiles
train_data['Fare_bin'] = pd.qcut(train_data['Fare'],5)

0      (-0.001, 7.854]
1    (39.688, 512.329]
2        (7.854, 10.5]
3    (39.688, 512.329]
4        (7.854, 10.5]
Name: Fare_bin, dtype: category
Categories (5, interval[float64]): [(-0.001, 7.854] < (7.854, 10.5] < (10.5, 21.679] < (21.679, 39.688] < (39.688, 512.329]]

# After binning, the data should be either factorized or converted to dummies.
# qcut() creates a new variable that identifies the quantile range, but we can't use the string
# so either factorize or create dummies from the result

# factorize
train_data['Fare_bin_id'] = pd.factorize(train_data['Fare_bin'])[0]

# dummies
fare_bin_dummies_df = pd.get_dummies(train_data['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))
train_data = pd.concat([train_data, fare_bin_dummies_df], axis=1)

# 5. Feature engineering
# During feature engineering we need to process not only the training data but also the test data together with it, so that both end up with the same data types and distributions.
test_df_org['Survived'] = 0
combined_train_test = train_df_org.append(test_df_org)   #891+418=1309rows, 12columns
PassengerId = test_df_org['PassengerId']

# Feature engineering means extracting, from the raw attributes, the features that influence the output to a greater or lesser degree; these features become the basis for training the model. In general, start with the features that have missing values.

# 5.1 Embarked
# 'Embarked' has only a few missing values, so fill them with the mode:
combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0],inplace=True)

# For the three ports, the transformations introduced above give us two processing options: dummies and factorizing. Since there are only three ports, we can use dummies directly:

# For the later feature analysis, factorize the Embarked feature here
combined_train_test['Embarked'] = pd.factorize(combined_train_test['Embarked'])[0]

# Use pd.get_dummies for one-hot encoding
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'],prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)


# 5.2 Sex
# One-hot encode Sex as well, i.e. dummy processing:
# For the later feature analysis, also factorize the Sex feature
combined_train_test['Sex'] = pd.factorize(combined_train_test['Sex'])[0]

sex_dummies_df = pd.get_dummies(combined_train_test['Sex'],prefix=combined_train_test[['Sex']].columns[0])
combined_train_test = pd.concat([combined_train_test,sex_dummies_df],axis=1)

# 5.3 Name

# First extract the various titles from the names
# what is each person's title?
combined_train_test['Title'] = combined_train_test['Name'].map(lambda x: re.compile(",(.*?)\.").findall(x)[0])
combined_train_test['Title'] = combined_train_test['Title'].apply(lambda x:x.strip())

# Note: although the Title can be extracted in other equivalent ways, the statements below
# failed when the surrounding whitespace was not stripped first, hence the strip() step above.


# Normalize the various titles:
title_Dict = {}
title_Dict.update(dict.fromkeys(['Capt','Col','Major','Dr','Rev'],'Officer'))
title_Dict.update(dict.fromkeys(['Mme','Ms','Mrs'],'Mrs'))
title_Dict.update(dict.fromkeys(['Mlle','Miss'],'Miss'))
title_Dict.update(dict.fromkeys(['Mr'],'Mr'))
title_Dict.update(dict.fromkeys(['Master','Jonkheer'],'Master'))

combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)

# Use dummies to split the different titles into columns:
# For the later feature analysis, also factorize the Title feature
combined_train_test['Title'] = pd.factorize(combined_train_test['Title'])[0]
title_dummies_df = pd.get_dummies(combined_train_test['Title'],prefix=combined_train_test[['Title']].columns[0])
combined_train_test = pd.concat([combined_train_test,title_dummies_df],axis=1)

# Add a name-length feature
combined_train_test['Name_length'] = combined_train_test['Name'].apply(len)

# 5.4 Fare

# As the earlier analysis showed, Fare is missing one value in the test data, so it needs filling. Fill it with the mean fare of the corresponding cabin class:

# transform below applies the function np.mean within each group.
combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(combined_train_test.groupby('Pclass').transform(np.mean))

# Analysing Ticket shows that some ticket numbers are duplicated. Combining this with the relatives count and the names, and comparing fares against cabin classes, we can tell that the tickets include family and group tickets, so a group ticket's fare needs to be split across every person travelling on it.
combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
combined_train_test['Fare'] = combined_train_test['Fare']/combined_train_test['Group_Ticket']
combined_train_test.drop(['Group_Ticket'],axis=1,inplace=True)

# Use binning to grade the fares:
combined_train_test['Fare_bin'] = pd.qcut(combined_train_test['Fare'],5)

# The 5 fare grades can again be split into columns with dummies:

combined_train_test['Fare_bin_id'] = pd.factorize(combined_train_test['Fare_bin'])[0]

fare_bin_dummies_df = pd.get_dummies(combined_train_test['Fare_bin_id']).rename(columns=lambda x: 'Fare_' + str(x))
combined_train_test = pd.concat([combined_train_test,fare_bin_dummies_df],axis=1)
combined_train_test.drop(['Fare_bin'],axis=1, inplace=True)

# 5.5 Pclass
# Pclass itself really needs no further processing beyond converting it to dummy form. For a finer analysis, though, assume that within each class the fare also indicates the cabin's location, which may well relate to the order of escape. So work out the high and low fare levels within each class.

print(combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean())

Pclass
1    33.910500
2    11.411010
3     7.337571
Name: Fare, dtype: float64

from sklearn.preprocessing import LabelEncoder

# Build the Pclass Fare Category
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    if df['Pclass'] == 1:
        if df['Fare'] <= pclass1_mean_fare:
            return 'Pclass1_Low'
        else:
            return 'Pclass1_High'
    elif df['Pclass'] == 2:
        if df['Fare'] <= pclass2_mean_fare:
            return 'Pclass2_Low'
        else:
            return 'Pclass2_High'
    elif df['Pclass'] == 3:
        if df['Fare'] <= pclass3_mean_fare:
            return 'Pclass3_Low'
        else:
            return 'Pclass3_High'

Pclass1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(1)
Pclass2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(2)
Pclass3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(3)

# Build the Pclass_Fare Category
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(
    Pclass1_mean_fare, Pclass2_mean_fare, Pclass3_mean_fare), axis=1)
pclass_level = LabelEncoder()

# Fit a label for each category
pclass_level.fit(np.array(['Pclass1_Low','Pclass1_High','Pclass2_Low','Pclass2_High','Pclass3_Low','Pclass3_High']))

# Convert to numeric values
combined_train_test['Pclass_Fare_Category'] = pclass_level.transform(combined_train_test['Pclass_Fare_Category'])

# Dummy conversion
pclass_dummies_df = pd.get_dummies(combined_train_test['Pclass_Fare_Category']).rename(columns=lambda x: 'Pclass_' + str(x))
combined_train_test = pd.concat([combined_train_test,pclass_dummies_df],axis=1)

# At the same time, factorize the Pclass feature:
combined_train_test['Pclass'] = pd.factorize(combined_train_test['Pclass'])[0]

# 5.6 Parch and SibSp
# From the earlier analysis we know that having no relatives aboard, or too many, affects Survived. So we merge the two into a Family_Size feature while also keeping the original columns.

def family_size_category(family_size):
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'

combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)

le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])

family_size_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, family_size_dummies_df], axis=1)

# 5.7 Age

# Because Age has many missing values, we cannot simply fill it with the mode or the mean.

# Two filling strategies are common: one fills by the average age of passengers sharing the same Title (Mr, Master, Miss, and so on); the other combines several complete features such as Sex, Title and Pclass and predicts Age with a machine-learning model.

# Here we use the latter. With Age as the target, rows with Age present form the training set and rows with Age missing form the test set.

missing_age_df = pd.DataFrame(combined_train_test[
['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category','Fare', 'Fare_bin_id', 'Pclass']])

missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
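The Title-based alternative mentioned above (not used in this notebook) can be sketched in a couple of lines; the toy frame below is hypothetical, with column names matching `combined_train_test`:

```python
import numpy as np
import pandas as pd

# Fill each missing Age with the mean age of passengers sharing the same Title.
df = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'Age':   [30.0, np.nan, 22.0, np.nan, 4.0],
})
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
print(df['Age'].tolist())  # [30.0, 30.0, 22.0, 22.0, 4.0]
```

`transform('mean')` keeps the original index, so the group means line up row-by-row with the missing entries.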



missing_age_test.head()

    Age  Embarked  Sex  Title  Name_length  Family_Size  Family_Size_Category     Fare  Fare_bin_id  Pclass
5   NaN         2    0      0           16            1                     1   8.4583            2       0
17  NaN         0    0      0           28            1                     1  13.0000            3       2
19  NaN         1    1      1           23            1                     1   7.2250            4       0
26  NaN         1    0      0           23            1                     1   7.2250            4       0
28  NaN         2    1      2           29            1                     1   7.8792            0       0

# To build the Age prediction model we can predict with several models and then ensemble them to improve accuracy.

from sklearn import ensemble
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor

def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)

    # model 1: gbm (the estimator definition was missing from the listing)
    gbm_reg = ensemble.GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])

    # model 2: rf
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor:' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_RF'][:4])

    # merge the two models
    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LR']].mode(axis=1)

    # take the row-wise mean of the two predictions; a bare np.mean over both
    # columns collapses everything to one scalar and fills every row with it
    missing_age_test.loc[:, 'Age'] = missing_age_test[['Age_GB', 'Age_RF']].mean(axis=1)
    print(missing_age_test['Age'][:4])

    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)

    return missing_age_test

# Fill the missing Age values with the ensembled predictions:
combined_train_test.loc[(combined_train_test.Age.isnull()), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age']

Fitting 10 folds for each of 1 candidates, totalling 10 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done   5 out of  10 | elapsed:    4.3s remaining:    4.3s
[Parallel(n_jobs=25)]: Done  10 out of  10 | elapsed:    4.3s finished

Age feature Best GB Params:{'learning_rate': 0.01, 'max_depth': 4, 'max_features': 3, 'n_estimators': 2000}
Age feature Best GB Score:-128.38286366239385
GB Train Error for "Age" Feature Regressor:-65.2562037120689
5     37.508266
17    31.580052
19    34.597808
26    29.076996
Name: Age_GB, dtype: float64
Fitting 10 folds for each of 1 candidates, totalling 10 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done   5 out of  10 | elapsed:    1.6s remaining:    1.6s
[Parallel(n_jobs=25)]: Done  10 out of  10 | elapsed:    1.7s finished

Age feature Best RF Params:{'max_depth': 5, 'n_estimators': 200, 'random_state': 0}
Age feature Best RF Score:-119.64194051962507
RF Train Error for "Age" Feature Regressor-96.82296812792812
5     33.513123
17    33.098071
19    34.853983
26    28.148613
Name: Age_RF, dtype: float64
shape1 (263,) (263, 2)
5     29.97686
17    29.97686
19    29.97686
26    29.97686
Name: Age, dtype: float64

missing_age_test.head()


    Age       Embarked  Sex  Title  Name_length  Family_Size  Family_Size_Category     Fare  Fare_bin_id  Pclass
5   29.97686         2    0      0           16            1                     1   8.4583            2       0
17  29.97686         0    0      0           28            1                     1  13.0000            3       2
19  29.97686         1    1      1           23            1                     1   7.2250            4       0
26  29.97686         1    0      0           23            1                     1   7.2250            4       0
28  29.97686         2    1      2           29            1                     1   7.8792            0       0

# 5.8 Ticket
# Looking at Ticket values, we see they mix letters and digits. Different letter prefixes may largely reflect cabin class or cabin location, which can also affect Survived, so we split out the letter part of Ticket and lump the purely numeric tickets into one class.

combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)

# To extract the numeric part instead, you could do the following; for now we simply treat numeric tickets as one class.
# combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
# combined_train_test['Ticket_Number'].fillna(0, inplace=True)

# Factorize Ticket_Letter
combined_train_test['Ticket_Letter'] = pd.factorize(combined_train_test['Ticket_Letter'])[0]

# 5.9 Cabin

# Cabin has far too many missing values for meaningful analysis or prediction, so we could simply drop the feature. But as the earlier analysis showed, whether the value is present at all correlates with survival, so for now we keep the feature, reduced to a present/absent binary.

combined_train_test.loc[combined_train_test.Cabin.isnull(), 'Cabin'] = 'U0'
combined_train_test['Cabin'] = combined_train_test['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)

# 5.10 Correlation between features
# We pick some of the main features and draw the correlation map to inspect the relationships between them:

Correlation = pd.DataFrame(combined_train_test[['Embarked','Sex','Title','Name_length','Family_Size',
'Family_Size_Category','Fare','Fare_bin_id','Pclass',
'Pclass_Fare_Category','Age','Ticket_Letter','Cabin']])

colormap = plt.cm.viridis
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features',y=1.05,size=15)
sns.heatmap(Correlation.astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,linecolor='white',annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a26cf2d90>


# 5.11 Pairwise feature distributions
# (diag_kws was truncated in the listing; completed with the usual kde shading options)
g = sns.pairplot(combined_train_test[[u'Survived',u'Pclass',u'Sex',u'Age',u'Fare',u'Embarked',
                                      u'Family_Size',u'Title',u'Ticket_Letter']],hue='Survived',
                 palette='seismic',size=1.2,diag_kind='kde',diag_kws=dict(shade=True),plot_kws=dict(s=10))
g.set(xticklabels=[])

<seaborn.axisgrid.PairGrid at 0x1a23685110>


# 5.12 Some processing before feeding the model:
# 5.12.1 Standardizing some columns. Here we standardize Age, Fare and Name_length:

from sklearn import preprocessing
scale_age_fare = preprocessing.StandardScaler().fit(combined_train_test[['Age','Fare','Name_length']])
combined_train_test[['Age','Fare','Name_length']] = scale_age_fare.transform(combined_train_test[['Age','Fare','Name_length']])

# 5.12.2 Drop unused features
# The feature engineering above extracted many model-ready features from the raw columns, but we still need to drop the original columns we no longer use and the non-numeric ones. First back up the data for later re-analysis:

combined_data_backup = combined_train_test.copy()  # .copy() so the in-place drop below does not touch the backup

combined_train_test.drop(['PassengerId','Embarked','Sex','Name','Fare_bin_id','Pclass_Fare_Category',
                          'Parch','SibSp','Family_Size_Category','Ticket'],axis=1,inplace=True)

# 5.12.3 Split back into training and test data
train_data = combined_train_test[:891]
test_data = combined_train_test[891:]

titanic_train_data_X = train_data.drop(['Survived'],axis=1)
titanic_train_data_Y = train_data['Survived']
titanic_test_data_X = test_data.drop(['Survived'],axis=1)

titanic_train_data_X.shape

(891, 34)

titanic_train_data_X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 34 columns):
#   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
0   Pclass                  891 non-null    int64
1   Age                     891 non-null    float64
2   Fare                    891 non-null    float64
3   Cabin                   891 non-null    int64
4   Embarked_0              891 non-null    uint8
5   Embarked_1              891 non-null    uint8
6   Embarked_2              891 non-null    uint8
7   Sex_0                   891 non-null    uint8
8   Sex_1                   891 non-null    uint8
9   Title                   891 non-null    int64
10  Title_-1                891 non-null    uint8
11  Title_0                 891 non-null    uint8
12  Title_1                 891 non-null    uint8
13  Title_2                 891 non-null    uint8
14  Title_3                 891 non-null    uint8
15  Title_4                 891 non-null    uint8
16  Title_5                 891 non-null    uint8
17  Name_length             891 non-null    float64
18  Fare_0                  891 non-null    uint8
19  Fare_1                  891 non-null    uint8
20  Fare_2                  891 non-null    uint8
21  Fare_3                  891 non-null    uint8
22  Fare_4                  891 non-null    uint8
23  Pclass_0                891 non-null    uint8
24  Pclass_1                891 non-null    uint8
25  Pclass_2                891 non-null    uint8
26  Pclass_3                891 non-null    uint8
27  Pclass_4                891 non-null    uint8
28  Pclass_5                891 non-null    uint8
29  Family_Size             891 non-null    int64
30  Family_Size_Category_0  891 non-null    uint8
31  Family_Size_Category_1  891 non-null    uint8
32  Family_Size_Category_2  891 non-null    uint8
33  Ticket_Letter           891 non-null    int64
dtypes: float64(3), int64(5), uint8(26)
memory usage: 85.3 KB

# 6. Model ensembling and testing
# 6.1 Use several different models to screen the features and keep the more important ones:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):

    # random forest
    rf_est = RandomForestClassifier(random_state=0)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best RF Params:' + str(rf_grid.best_params_))
    print('Top N Features Best RF Score:' + str(rf_grid.best_score_))
    print('Top N Features RF Train Score:' + str(rf_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 10 Features from RF Classifier')
    print(str(features_top_n_rf[:10]))

    # AdaBoost (this block was missing from the listing; reconstructed to match the printed output below)
    ada_est = ensemble.AdaBoostClassifier(random_state=0)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best Ada Params:' + str(ada_grid.best_params_))
    print('Top N Features Best Ada Score:' + str(ada_grid.best_score_))
    print('Top N Features Ada Train Score:' + str(ada_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X),
                                           'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    print('Sample 10 Features from Ada Classifier:')
    print(str(features_top_n_ada[:10]))

    # ExtraTrees
    et_est = ExtraTreesClassifier(random_state=0)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [20]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best ET Params:' + str(et_grid.best_params_))
    print('Top N Features Best ET Score:' + str(et_grid.best_score_))
    print('Top N Features ET Train Score:' + str(et_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 10 Features from ET Classifier:')
    print(str(features_top_n_et[:10]))

    # GradientBoosting (the estimator definition was missing from the listing)
    gb_est = ensemble.GradientBoostingClassifier(random_state=0)
    gb_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1], 'max_depth': [20]}
    gb_grid = model_selection.GridSearchCV(gb_est, gb_param_grid, n_jobs=25, cv=10, verbose=1)
    gb_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best GB Params:' + str(gb_grid.best_params_))
    print('Top N Features Best GB Score:' + str(gb_grid.best_score_))
    print('Top N Features GB Train Score:' + str(gb_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_gb = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': gb_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
    print('Sample 10 Features from GB Classifier:')
    print(str(features_top_n_gb[:10]))

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0)
    dt_param_grid = {'min_samples_split': [2, 4], 'max_depth': [20]}
    dt_grid = model_selection.GridSearchCV(dt_est, dt_param_grid, n_jobs=25, cv=10, verbose=1)
    dt_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best DT Params:' + str(dt_grid.best_params_))
    print('Top N Features Best DT Score:' + str(dt_grid.best_score_))
    print('Top N Features DT Train Score:' + str(dt_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_dt = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': dt_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
    print('Sample 10 Features from DT Classifier:')
    print(str(features_top_n_dt[:10]))

    # merge the five models
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et,
                                features_top_n_gb, features_top_n_dt], ignore_index=True).drop_duplicates()
    features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_et,
                                     feature_imp_sorted_gb, feature_imp_sorted_dt], ignore_index=True)

    return features_top_n, features_importance


# 6.2 Build the training and test sets from the selected features
# When feature engineering produces a large number of features, some of them will be correlated with each other. Too many features slow down training and can also make the model overfit. So with this many features we can use several models to screen them and keep the top n we want.

feature_to_pick = 30
feature_top_n,feature_importance = get_top_n_features(titanic_train_data_X,titanic_train_data_Y,feature_to_pick)
titanic_train_data_X = pd.DataFrame(titanic_train_data_X[feature_top_n])
titanic_test_data_X = pd.DataFrame(titanic_test_data_X[feature_top_n])

Fitting 10 folds for each of 2 candidates, totalling 20 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    6.4s remaining:    3.4s
[Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    7.7s finished

Top N Features Best RF Params:{'max_depth': 20, 'min_samples_split': 3, 'n_estimators': 500}
Top N Features Best RF Score:0.8271785268414481
Top N Features RF Train Score:0.9764309764309764
Sample 10 Features from RF Classifier
1               Age
17      Name_length
2              Fare
8             Sex_1
9             Title
11          Title_0
7             Sex_0
29      Family_Size
0            Pclass
33    Ticket_Letter
Name: feature, dtype: object
Fitting 10 folds for each of 2 candidates, totalling 20 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    3.7s remaining:    2.0s
[Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    5.0s finished

Top N Features Best Ada Params:{'learning_rate': 0.01, 'n_estimators': 500}
Top N Features Best Ada Score:0.8181897627965042
Top N Features Ada Train Score:0.8204264870931538
Sample 10 Features from Ada Classifier:
11                   Title_0
2                       Fare
30    Family_Size_Category_0
29               Family_Size
7                      Sex_0
0                     Pclass
3                      Cabin
8                      Sex_1
17               Name_length
1                        Age
Name: feature, dtype: object
Fitting 10 folds for each of 2 candidates, totalling 20 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    3.5s remaining:    1.9s
[Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    4.0s finished

Top N Features Best ET Params:{'max_depth': 20, 'min_samples_split': 4, 'n_estimators': 500}
Top N Features Best ET Score:0.8237952559300874
Top N Features ET Train Score:0.9708193041526375
Sample 10 Features from ET Classifier:
11          Title_0
7             Sex_0
8             Sex_1
17      Name_length
1               Age
2              Fare
3             Cabin
9             Title
33    Ticket_Letter
13          Title_2
Name: feature, dtype: object
Fitting 10 folds for each of 2 candidates, totalling 20 fits

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:   12.3s remaining:    6.6s
[Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:   12.7s finished

Top N Features Best GB Params:{'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 500}
Top N Features Best GB Score:0.7835081148564295
Top N Features GB Train Score:0.9966329966329966
Sample 10 Feature from GB Classifier:
11                   Title_0
1                        Age
2                       Fare
17               Name_length
30    Family_Size_Category_0
29               Family_Size
0                     Pclass
9                      Title
28                  Pclass_5
33             Ticket_Letter
Name: feature, dtype: object
Fitting 10 folds for each of 2 candidates, totalling 20 fits
Top N Features Best DT Params:{'max_depth': 20, 'min_samples_split': 4}
Top N Features Best DT Score:0.7823220973782771
Top N Features DT Train Score:0.9607182940516273
Sample 10 Features from DT Classifier:
11                   Title_0
1                        Age
2                       Fare
17               Name_length
30    Family_Size_Category_0
16                   Title_5
28                  Pclass_5
0                     Pclass
33             Ticket_Letter
29               Family_Size
Name: feature, dtype: object

[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    0.1s finished

# Visualize the feature rankings produced by the different algorithms:

rf_feature_imp = feature_importance[:10]
# the AdaBoost half of this figure was missing from the listing; reconstructed
# symmetrically, assuming the Ada importances start right after the 34 RF rows
Ada_feature_imp = feature_importance[34:34 + 10].reset_index(drop=True)

# make importances relative to max importance
rf_feature_importance = 100.0 * (rf_feature_imp['importance'] / rf_feature_imp['importance'].max())
Ada_feature_importance = 100.0 * (Ada_feature_imp['importance'] / Ada_feature_imp['importance'].max())

# Get the indexes of all features over the importance threshold
rf_important_idx = np.where(rf_feature_importance)[0]
Ada_important_idx = np.where(Ada_feature_importance)[0]

pos = np.arange(rf_important_idx.shape[0]) + .5

plt.figure(1, figsize = (18, 8))

plt.subplot(121)
plt.barh(pos, rf_feature_importance[rf_important_idx][::-1])
plt.yticks(pos, rf_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('RandomForest Feature Importance')

plt.subplot(122)
plt.barh(pos, Ada_feature_importance[Ada_important_idx][::-1])
plt.yticks(pos, Ada_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('AdaBoost Feature Importance')

plt.show()


# 6.3 Model Ensemble
# Common ensembling methods include Bagging, Boosting, Stacking and Blending.

# 6.3.1 Bagging
# Bagging combines the predictions of several models (base learners) by simple weighted averaging or voting. Its advantage is that the base learners can be trained in parallel. Random Forest uses the Bagging idea.
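As a minimal illustration (not the notebook's own code, and on synthetic data), bagged decision trees with scikit-learn look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging sketch: many trees, each fit on a bootstrap resample of the
# training set, combined by voting. The data here is synthetic.
X, y = make_classification(n_samples=200, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```

Because every tree sees a different resample, the vote averages out the individual trees' variance.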

# 6.3.2 Boosting
# Boosting trains base learners sequentially, with each new learner focusing on the samples its predecessors got wrong, and combines them into a strong learner; AdaBoost and GBDT are typical examples.

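A minimal boosting sketch (again synthetic data, not the notebook's code), using AdaBoost from scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Boosting sketch: learners are added one at a time, each reweighting the
# samples the previous ones misclassified.
X, y = make_classification(n_samples=200, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```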
# 6.3.3 Stacking
# Stacking trains a new second-level learner to learn how to combine the previous level's base learners. If Bagging is a linear combination of base classifiers, Stacking is a non-linear one. Learners can be stacked layer upon layer into a network-like structure. Compared with the previous two, a Stacking framework does tend to improve accuracy, which is why we use Stacking for the model ensemble below.
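For comparison, scikit-learn (0.22+) ships a built-in `StackingClassifier` that handles the out-of-fold bookkeeping internally; the notebook below builds that machinery by hand instead. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stacking sketch: level-1 base learners plus a level-2 combiner that is
# trained on their cross-validated predictions.
X, y = make_classification(n_samples=200, random_state=0)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svc', SVC(random_state=0))],
    final_estimator=LogisticRegression(),  # the level-2 learner
    cv=5)
stack.fit(X, y)
print(stack.score(X, y))
```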

# 6.3.4 Blending
# Blending is very similar to Stacking, but it also guards against the information-leakage problem.
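A blending sketch on synthetic data (illustrative only): base learners fit on one split, and the level-2 learner trains only on their predictions for a held-out split, so no row's level-2 feature comes from a model that saw that row.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
# base split trains the level-1 model; holdout split trains the blender
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_base, y_base)
level2_features = rf.predict_proba(X_hold)[:, 1].reshape(-1, 1)
blender = LogisticRegression().fit(level2_features, y_hold)
print(blender.score(level2_features, y_hold))
```

The price of this leak-freedom is that each level only ever trains on half the data, which is why Stacking's K-fold scheme is often preferred on small sets like Titanic.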

# Stacking ensemble: here we use a two-level stack.

# Level 2 uses XGBoost, taking the first level's predictions as features to predict the final result.

# Level 1:

# A Stacking framework feeds the base classifiers' predictions into the training of the second-level model. However, we cannot simply train the base models on all of the training data, generate predictions, and use those as training input for the second level: training on the Train Data and then predicting on the Train Data causes label leakage. To avoid it, we use K-fold for each base learner and stitch together the K models' predictions on their validation folds as the input to the next level.

# So here we define the out-of-fold prediction helper:

from sklearn.model_selection import KFold

# Some useful parameters which will come in handy later on
ntrain = titanic_train_data_X.shape[0]
ntest = titanic_test_data_X.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 7  # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)  # random_state only takes effect with shuffle=True

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

# Build the base learners. Here we use seven: RandomForest, AdaBoost, ExtraTrees, GBDT, DecisionTree, KNN and SVM. (Their hyperparameters can be tuned with GridSearch as above.)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(n_estimators=500,warm_start=True,max_features='sqrt',max_depth=6,min_samples_split=3,min_samples_leaf=2,n_jobs=-1,verbose=0)

# ada and gb were missing from the listing; gb follows the gb_parameters used
# for the learning curves below, ada's settings are reasonable placeholders
ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)

et = ExtraTreesClassifier(n_estimators=500,n_jobs=-1,max_depth=8,min_samples_leaf=2,verbose=0)

gb = GradientBoostingClassifier(n_estimators=500, max_depth=5, min_samples_leaf=2, verbose=0)

dt = DecisionTreeClassifier(max_depth=8)

knn = KNeighborsClassifier(n_neighbors=2)

svm = SVC(kernel='linear',C=0.025)


# Convert pandas to arrays:
# Create Numpy arrays of train,test and target(Survived) dataframes to feed into our models
x_train = titanic_train_data_X.values   #Creates an array of the train data
x_test = titanic_test_data_X.values   #Creates an array of the test data
y_train = titanic_train_data_Y.values

# Create our OOF train and test predictions. These base results will be used as new features
rf_oof_train,rf_oof_test = get_out_fold(rf,x_train,y_train,x_test)  # Random Forest
ada_oof_train,ada_oof_test = get_out_fold(ada,x_train,y_train,x_test)  # AdaBoost
et_oof_train,et_oof_test = get_out_fold(et,x_train,y_train,x_test)  # Extra Trees
gb_oof_train,gb_oof_test = get_out_fold(gb,x_train,y_train,x_test)  # Gradient Boost
dt_oof_train,dt_oof_test = get_out_fold(dt,x_train,y_train,x_test)  # Decision Tree
knn_oof_train,knn_oof_test = get_out_fold(knn,x_train,y_train,x_test)  # KNeighbors
svm_oof_train,svm_oof_test = get_out_fold(svm,x_train,y_train,x_test)  # Support Vector

print("Training is complete")

Training is complete


# 6.4 Predict and generate the submission file
# Level 2: we use XGBoost, taking the first level's predictions as features to predict the final result.

x_train = np.concatenate((rf_oof_train,ada_oof_train,et_oof_train,gb_oof_train,dt_oof_train,knn_oof_train,svm_oof_train),axis=1)
x_test = np.concatenate((rf_oof_test,ada_oof_test,et_oof_test,gb_oof_test,dt_oof_test,knn_oof_test,svm_oof_test),axis=1)

from xgboost import XGBClassifier

# the remaining keyword arguments were truncated in the listing; completed with
# commonly used XGBoost settings for this kind of stack (assumed)
gbm = XGBClassifier(n_estimators=200,max_depth=4,min_child_weight=2,gamma=0.9,subsample=0.8,
                    colsample_bytree=0.8,objective='binary:logistic',nthread=-1,scale_pos_weight=1)
gbm.fit(x_train,y_train)
predictions = gbm.predict(x_test)

StackingSubmission = pd.DataFrame({'PassengerId':PassengerId,'Survived':predictions})
StackingSubmission.to_csv('StackingSubmission.csv',index=False,sep=',')

# 7. Validation: learning curves

# As we keep doing feature engineering, more and more features are produced. Training on a large number of features fits the training set better and better, but the model may gradually lose the ability to generalize and perform poorly on the test data: overfitting. Of course, a model may also perform badly on the prediction set simply because it already performs badly on the training set: underfitting. Below are the four kinds of learning curves given in Andrew Ng's machine learning course:
# Four curves: see the notes


# In the figures the red line is the test error (cross-validation error) and the blue line is the train error. We could also replace error rate with accuracy, in which case the curves flip upside down (score = 1 - error).

# Note that our figures show error curves.

# Top-left is the ideal case: as the sample count grows, the train error rises somewhat, but the test error drops markedly;

# Top-right is the worst case: the train error is large and the model has learned nothing from the features, so the test error is huge and the model can barely predict; the causes must be sought in the data itself and in the training stage;

# Bottom-left is high variance: the train error is low, but the model overfits and lacks generalization, so the test error is high;

# Bottom-right is high bias: the train error is high, and the model's parameters need adjusting to reduce it.

# So we use learning curves to see what state the model is in and decide how to act on it. Putting validation last here does not mean it is the last step: for each base learner in the first level of our Stacking framework we should inspect the learning curve, tune the hyperparameters accordingly, and so get a better final result. Build the function that draws the learning curve:

from sklearn.model_selection import learning_curve
# from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5), verbose=0):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    -------------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed; see the
        sklearn.model_selection module for the list of possible objects.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv,
                                                            n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt

X = x_train
Y = y_train

# RandomForest
rf_parameters = {'n_jobs':-1,'n_estimators':500,'warm_start':True,'max_depth':6,
'min_samples_leaf':2,'max_features':'sqrt','verbose':0}

# ExtraTrees
et_parameters = {'n_jobs':-1,'n_estimators':500,'max_depth':8,'min_samples_leaf':2,'verbose':0}

# GradientBoosting
gb_parameters = {'n_estimators':500,'max_depth':5,'min_samples_leaf':2,'verbose':0}

# DecisionTree
dt_parameters = {'max_depth':8}

# KNeighbors
knn_parameters = {'n_neighbors':2}

# SVM
svm_parameters = {'kernel':'linear','C':0.025}

# XGB
# the remaining entries were truncated in the listing; completed with the same assumed settings as the gbm stack above
gbm_parameters = {'n_estimators':2000,'max_depth':4,'min_child_weight':2,'gamma':0.9,'subsample':0.8,
                  'colsample_bytree':0.8,'objective':'binary:logistic','nthread':-1,'scale_pos_weight':1}

title = "Learning Curves"
plot_learning_curve(RandomForestClassifier(**rf_parameters),title,X,Y,cv=None,n_jobs=4,
train_sizes=[50,100,150,200,250,350,400,450,500])
plt.show()


# From the analysis above we can see that the RandomForest model has some problems here, so we need to adjust its hyperparameters to get a better result.

# 8. Hyperparameter tuning
# Submitting the generated file to Kaggle gives a score of 0.79425

# xgboost stacking: 0.78468

# voting bagging: 0.79904

# This shows our stacking model still has plenty of room for improvement, so we can work along the following lines to raise prediction accuracy:

# Feature engineering: find better features and drop the highly redundant ones;

# Model hyperparameter tuning: fix underfitting or overfitting;

# Improving the model framework: choose better models for each layer of the stacking framework;

# Tuning is a matter of patient trial and error......



• Kaggle_Titanic
Data source: https://www.kaggle.com/c/titanic
Training
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


train_data = pd.read_csv('train.csv')  # assumed path to the Kaggle training file

count_survivors = pd.value_counts(train_data['Survived'])
count_survivors.plot(kind='bar')
plt.xlabel('Is_survived')
plt.ylabel('Number of People')
plt.title('Survivor histogram')


from sklearn.preprocessing import StandardScaler

train_data['Sex'] = train_data['Sex'].map({'female':0, 'male':1})

age_avg = np.mean([0 if np.isnan(item) else item for item in train_data['Age']])
train_data['Age'] = [age_avg if np.isnan(item) else item for item in train_data['Age']]
train_data['Age'] = StandardScaler().fit_transform(train_data['Age'].values.reshape(-1,1))

train_data['SibSp'] = StandardScaler().fit_transform(train_data['SibSp'].values.reshape(-1,1))
train_data['Parch'] = StandardScaler().fit_transform(train_data['Parch'].values.reshape(-1,1))
train_data['Fare'] = StandardScaler().fit_transform(train_data['Fare'].values.reshape(-1,1))

train_data['Embarked'] = train_data['Embarked'].map({'S':1, 'C':2, 'Q':3})
pier = [0 if np.isnan(item) else item for item in train_data['Embarked']]
train_data['Embarked'] = [max(set(pier), key=pier.count) if item == 0 else item for item in pier]
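The three lines above map the ports to codes, turn NaN into 0, and then replace the 0s with the most frequent code. The same result comes out of a more idiomatic pandas sketch (toy frame, column name matching the notebook):

```python
import numpy as np
import pandas as pd

# Fill missing Embarked with the most frequent port, then map to codes.
df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Embarked'] = df['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
print(df['Embarked'].tolist())  # [1, 2, 1, 1, 3]
```

`mode()[0]` is the most frequent value; filling before mapping avoids the NaN-to-0 round trip entirely.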

train_data = train_data.drop(columns=['Name','Ticket','Cabin','PassengerId'])

c:\python27\lib\site-packages\sklearn\utils\validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)

X = train_data.ix[:, train_data.columns != 'Survived']
Y = train_data.ix[:, train_data.columns == 'Survived']

c:\python27\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
"""Entry point for launching an IPython kernel.

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn.metrics import recall_score,confusion_matrix

def getBestC(X, Y):
    folds = KFold(len(Y), 5)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # note: the original index=range(len(c_param_range), 2) produced an empty index
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    for i in range(len(c_param_range)):
        print '******** c_param = %.2f ********' % c_param_range[i]
        recall_accs = []
        for iteration, fold in enumerate(folds, start=1):
            lr = LogisticRegression(C=c_param_range[i], penalty='l1')
            lr.fit(X.iloc[fold[0]].values, Y.iloc[fold[0]].values)
            Y_hat = lr.predict(X.iloc[fold[1]].values)
            recall_acc = recall_score(Y.iloc[fold[1]].values, Y_hat)
            recall_accs.append(recall_acc)

            print 'Iteration %d: recall score = %f' % (iteration, recall_acc)

        results_table.ix[i, 'Mean recall score'] = np.mean(recall_accs)
        print '\nMean recall score %f\n' % np.mean(recall_accs)

    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    print '--------------------------------\nbest_c = %.2f' % best_c
    return best_c


c:\python27\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

best_c = getBestC(X, Y)

******** c_param = 0.01 ********
Iteration 1: recall score = 0.000000
Iteration 2: recall score = 0.000000
Iteration 3: recall score = 0.000000
Iteration 4: recall score = 0.000000
Iteration 5: recall score = 0.000000

Mean recall score 0.000000

******** c_param = 0.10 ********
Iteration 1: recall score = 0.694915
Iteration 2: recall score = 0.683544
Iteration 3: recall score = 0.681159
Iteration 4: recall score = 0.583333
Iteration 5: recall score = 0.698413

Mean recall score 0.668273

******** c_param = 1.00 ********
Iteration 1: recall score = 0.745763
Iteration 2: recall score = 0.708861
Iteration 3: recall score = 0.710145
Iteration 4: recall score = 0.597222
Iteration 5: recall score = 0.746032

Mean recall score 0.701604

******** c_param = 10.00 ********
Iteration 1: recall score = 0.745763
Iteration 2: recall score = 0.708861
Iteration 3: recall score = 0.739130
Iteration 4: recall score = 0.597222
Iteration 5: recall score = 0.761905

Mean recall score 0.710576

******** c_param = 100.00 ********
Iteration 1: recall score = 0.745763
Iteration 2: recall score = 0.708861
Iteration 3: recall score = 0.739130
Iteration 4: recall score = 0.597222
Iteration 5: recall score = 0.761905

Mean recall score 0.710576

--------------------------------
best_c = 10.00
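C in LogisticRegression is the inverse of regularization strength, which explains the zero recall at C = 0.01: with L1 regularization that strong, the coefficients are shrunk toward zero and the model effectively predicts the majority class for everyone, so recall for the positive class collapses. A quick sketch on made-up data (X_d and y_d are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 4 features, only the first one is informative
rng = np.random.RandomState(0)
X_d = rng.randn(200, 4)
y_d = (X_d[:, 0] > 0).astype(int)

strong = LogisticRegression(C=0.001, penalty='l1', solver='liblinear').fit(X_d, y_d)
weak = LogisticRegression(C=10.0, penalty='l1', solver='liblinear').fit(X_d, y_d)

# Stronger L1 regularization (smaller C) leaves fewer non-zero coefficients
print(np.count_nonzero(strong.coef_), '<=', np.count_nonzero(weak.coef_))
```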

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

import itertools

lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X.values, Y.values)
Y_hat = lr.predict(X.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y, Y_hat)
#np.set_printoptions(precision=2)

print "Recall value in training dataset: %f" % (1.0*cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

Recall value in training dataset: 0.710526
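The recall printed here is read straight off the confusion matrix: with confusion_matrix's convention (rows = true label, columns = predicted label), cm[1,1] counts true positives and cm[1,0] false negatives, so cm[1,1] / (cm[1,0] + cm[1,1]) matches sklearn's recall_score. A tiny check on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)               # rows: true, cols: predicted
recall_manual = cm[1, 1] / float(cm[1, 0] + cm[1, 1])

print(recall_manual, recall_score(y_true, y_pred))  # 0.75 0.75
```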


lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X.values, Y.values)
Y_hat_proba = lr.predict_proba(X.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    Y_hat = Y_hat_proba[:,1] > i

    plt.subplot(3,3,j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y, Y_hat)

    print "Recall value in training dataset: %f, with threshold = %.1f" % ((1.0*cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1])), i)

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix,
                          classes=class_names,
                          title='Threshold >= %s' % i)

Recall value in training dataset: 0.938596, with threshold = 0.1
Recall value in training dataset: 0.850877, with threshold = 0.2
Recall value in training dataset: 0.824561, with threshold = 0.3
Recall value in training dataset: 0.757310, with threshold = 0.4
Recall value in training dataset: 0.710526, with threshold = 0.5
Recall value in training dataset: 0.646199, with threshold = 0.6
Recall value in training dataset: 0.532164, with threshold = 0.7
Recall value in training dataset: 0.371345, with threshold = 0.8
Recall value in training dataset: 0.204678, with threshold = 0.9
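Recall can only fall (or stay flat) as the threshold rises, because raising it can only turn positive predictions into negative ones, never the reverse. The pattern in the table above can be checked on hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
proba = np.array([0.15, 0.35, 0.55, 0.75, 0.90, 0.40])
y_true = np.array([0, 1, 1, 1, 1, 0])

def recall_at(threshold):
    pred = proba > threshold
    tp = np.sum(pred & (y_true == 1))   # true positives at this threshold
    return tp / float(y_true.sum())

recalls = [recall_at(t) for t in [0.1, 0.3, 0.5, 0.7, 0.9]]
print(recalls)  # monotonically non-increasing: [1.0, 1.0, 0.75, 0.5, 0.0]
```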


Testing
test_data = pd.read_csv('test.csv')

test_data['Sex'] = test_data['Sex'].map({'female':0, 'male':1})

age_avg = np.nanmean(test_data['Age'])  # mean over non-missing ages; treating NaN as 0 would bias the mean downward
test_data['Age'] = [age_avg if np.isnan(item) else item for item in test_data['Age']]
test_data['Age'] = StandardScaler().fit_transform(test_data['Age'].values.reshape(-1,1))

test_data['SibSp'] = StandardScaler().fit_transform(test_data['SibSp'].values.reshape(-1,1))
test_data['Parch'] = StandardScaler().fit_transform(test_data['Parch'].values.reshape(-1,1))

fare_avg = np.nanmean(test_data['Fare'])  # mean over non-missing fares
test_data['Fare'] = [fare_avg if np.isnan(item) else item for item in test_data['Fare']]
test_data['Fare'] = StandardScaler().fit_transform(test_data['Fare'].values.reshape(-1,1))

test_data['Embarked'] = test_data['Embarked'].map({'S':1, 'C':2, 'Q':3})
pier = [0 if np.isnan(item) else item for item in test_data['Embarked']]
test_data['Embarked'] = [max(set(pier), key=pier.count) if item == 0 else item for item in pier]

test_data = test_data.drop(columns=['Name','Ticket','Cabin'])

test_data.head()


   PassengerId  Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
0          892       3    1  0.428099 -0.499470 -0.400248 -0.498403         3
1          893       3    0  1.399492  0.616992 -0.400248 -0.513271         1
2          894       2    1  2.565163 -0.499470 -0.400248 -0.465085         3
3          895       3    1 -0.154736 -0.499470 -0.400248 -0.483463         1
4          896       3    0 -0.543293  0.616992  0.619896 -0.418468         1
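One caveat with the preprocessing above: each StandardScaler is fit on the test set itself, so the test features end up standardized with different means and standard deviations than the training features the model saw. A safer pattern (a sketch with made-up train/test frames) fits the scaler on the training data only and reuses it:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the real train/test DataFrames
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
test = pd.DataFrame({'Age': [25.0, 35.0]})

scaler = StandardScaler().fit(train[['Age']])   # statistics from train only
train['Age'] = scaler.transform(train[['Age']])
test['Age'] = scaler.transform(test[['Age']])   # same mean/std applied to test
```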

train_data.head()


   Survived  Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
0         0       3    1 -0.494245  0.432793 -0.473674 -0.502445       1.0
1         1       1    0  0.717307  0.432793 -0.473674  0.786845       2.0
2         1       3    0 -0.191357 -0.474545 -0.473674 -0.488854       1.0
3         1       1    0  0.490141  0.432793 -0.473674  0.420730       1.0
4         0       3    1  0.490141 -0.474545 -0.473674 -0.486337       1.0

X_test = test_data.drop(['PassengerId'], axis=1)

X_test.head()


   Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
0       3    1  0.428099 -0.499470 -0.400248 -0.498403         3
1       3    0  1.399492  0.616992 -0.400248 -0.513271         1
2       2    1  2.565163 -0.499470 -0.400248 -0.465085         3
3       3    1 -0.154736 -0.499470 -0.400248 -0.483463         1
4       3    0 -0.543293  0.616992  0.619896 -0.418468         1

lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X.values, Y.values)
Y_hat_proba = lr.predict_proba(X_test.values)
Y_hat = [1 if y > 0.6 else 0 for y in Y_hat_proba[:,1]]

results = pd.DataFrame(Y_hat, columns=['Survived'])
results.insert(0, 'PassengerId', test_data['PassengerId'])
results.to_csv('results.csv', index=False)  # Kaggle expects only the PassengerId and Survived columns
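By default to_csv also writes the DataFrame index as an unnamed first column, while a Kaggle submission should contain exactly the PassengerId and Survived columns; passing index=False produces the expected layout (toy values below):

```python
import pandas as pd

# Toy submission frame (real PassengerIds start at 892 in test.csv)
results = pd.DataFrame({'PassengerId': [892, 893, 894],
                        'Survived': [0, 1, 0]})

csv_text = results.to_csv(index=False)
print(csv_text.splitlines()[0])  # PassengerId,Survived
```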


