  • Titanic

    2016-12-15 09:58:02
    It is a historical fact that during the legendary voyage of the "Titanic", the wireless telegraph machine delivered 6 warnings about the danger of icebergs. Each of the telegraph messages described the...
  • titanic survival 2

    2016-07-25 16:48:25

    # The idea is to create a table which contains just 1's and 0's. The array
    # will be a survival reference table, whereby you read in the test data,
    # find out a passenger's attributes, look them up in the survival table,
    # and determine whether they should be predicted to survive or not.

    # In the case of a model that uses gender, class, and ticket price, you
    # will need an array of 2x3x4 ([female/male], [1st/2nd/3rd class],
    # [4 bins of prices]). The script will systematically loop through each
    # combination and use the "where" function in Python to search for
    # passengers that fit that combination of variables. Just like before,
    # you can ask which indices in your data equal female, 1st class, and
    # paid more than $30. For the sake of binning, let's say everything equal
    # to and above 40 "equals" 39 so it falls in the top bin. So then you can
    # set the bins:

    # (this excerpt assumes numpy is imported as np and `data` is the numpy
    # array loaded from the Kaggle train.csv in the previous part)


    # so we add a ceiling
    fare_ceiling = 40

    # then modify the data in the fare column to 39 if it is greater than or
    # equal to the ceiling
    data[data[0::, 9].astype(float) >= fare_ceiling, 9] = fare_ceiling - 1.0

    # the excerpt never defines these; 4 bins under a ceiling of 40 imply a
    # bracket size of 10
    fare_bracket_size = 10
    number_of_price_brackets = fare_ceiling // fare_bracket_size


    # I know there were 1st, 2nd and 3rd classes on board
    number_of_classes = 3

    # but it is better practice to calculate this from the data directly:
    # take the length of an array of unique values in column index 2
    number_of_classes = len(np.unique(data[0::, 2]))


    # initialize the survival table with all zeros
    survival_table = np.zeros((2, number_of_classes, number_of_price_brackets))

    # now that these are set up, you can loop through each variable and find
    # all those passengers that agree with the statements:
    for i in range(number_of_classes):             # loop through each class
        for j in range(number_of_price_brackets):  # loop through each price bin


            women_only_stats = data[                                           # which elements
                (data[0::, 4] == "female")                                     # are female
                & (data[0::, 2].astype(float) == i + 1)                        # and were in the i-th class
                & (data[0::, 9].astype(float) >= j * fare_bracket_size)        # and paid at least this bin's fare
                & (data[0::, 9].astype(float) < (j + 1) * fare_bracket_size),  # but less than the next bin's
                1]                                                             # take the Survived column

            men_only_stats = data[                                             # which elements
                (data[0::, 4] != "female")                                     # are male
                & (data[0::, 2].astype(float) == i + 1)                        # and were in the i-th class
                & (data[0::, 9].astype(float) >= j * fare_bracket_size)        # and paid at least this bin's fare
                & (data[0::, 9].astype(float) < (j + 1) * fare_bracket_size),  # but less than the next bin's
                1]                                                             # take the Survived column
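
            # (Sketch, not in the original post: the usual next step is to
            # fill the table with each group's survival proportion, then
            # threshold at 50% so the table holds just the 1's and 0's the
            # text above describes.)
            survival_table[0, i, j] = np.mean(women_only_stats.astype(float))
            survival_table[1, i, j] = np.mean(men_only_stats.astype(float))

    # empty bins yield NaN means; treat those as non-survivors, then threshold
    survival_table[survival_table != survival_table] = 0.0
    survival_table[survival_table < 0.5] = 0
    survival_table[survival_table >= 0.5] = 1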




  • Titanic Survival Prediction 2

    2018-02-06 18:41:06

    Background music: 保留 - 郭顶

    Previous post: Titanic Survival Prediction 1, which covered the feature engineering.

    This post covers how to train models to make the predictions.

    %matplotlib inline
    from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
    from xgboost import XGBClassifier
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    from sklearn import feature_selection
    from sklearn import model_selection
    from sklearn import metrics
    import pandas as pd
    import time
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    

    1. Load the data

    path_data = '../../data/titanic/'
    df = pd.read_csv(path_data + 'fe_data.csv')
    
    df_data_y = df['Survived']
    df_data_x = df.drop(['Survived', 'PassengerId'], 1)
    
    df_train_x = df_data_x.iloc[:891, :]  # the first 891 rows are the training set
    df_train_y = df_data_y[:891]
    

    2. Feature selection

    I use GBDT (gradient boosted decision trees) for feature selection. This follows from how decision trees work: each split picks a feature by information gain (or another criterion), so training also "measures" each feature's contribution, which makes it easy to visualize.

    cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0) 
    gbdt_rfe = feature_selection.RFECV(ensemble.GradientBoostingClassifier(random_state=2018), step = 1, scoring = 'accuracy', cv = cv_split)
    gbdt_rfe.fit(df_train_x, df_train_y)
    columns_rfe = df_train_x.columns.values[gbdt_rfe.get_support()]
    print('Picked columns: {}'.format(columns_rfe))
    print("Optimal number of features : {}/{}".format(gbdt_rfe.n_features_, len(df_train_x.columns)))
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross validation score (nb of correct classifications)")
    plt.plot(range(1, len(gbdt_rfe.grid_scores_) + 1), gbdt_rfe.grid_scores_)
    plt.show()
    

    The output:

    Picked columns: ['Age' 'Fare' 'Pclass' 'SibSp' 'FamilySize' 'Family_Survival' 'Sex_Code' 'Title_Master' 'Title_Mr' 'Cabin_C' 'Cabin_E' 'Cabin_X']
    Optimal number of features : 12/24
    
    (Figure: cross-validation score vs. number of features selected)

    The cross-validation score levels off from about 5 features onward, which suggests that few of the existing features contribute much.

    The best result occurs at 12 features. Note, though, that the competition score is not decided by your cross-validation split, so there is some luck involved; since performance stays close over a long stretch of feature counts, every choice from 5 to 24 features seems worth a try, given the chance.

    I personally compared 24 features against 12, and using all 24 performed best; I did not try the other counts (a sketch of such a sweep follows).
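
    As a minimal sketch of that sweep (not from the original post; it reuses df_train_x, df_train_y and cv_split from above):

    # Score a GBDT on a fixed-size RFE selection for several feature counts.
    for k in [5, 8, 12, 18, 24]:
        rfe = feature_selection.RFE(
            ensemble.GradientBoostingClassifier(random_state=2018),
            n_features_to_select=k)
        x_k = rfe.fit_transform(df_train_x, df_train_y)
        scores = model_selection.cross_val_score(
            ensemble.GradientBoostingClassifier(random_state=2018),
            x_k, df_train_y, cv=cv_split, scoring='accuracy')
        print('{} features: {:.4f}'.format(k, scores.mean()))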

    Then standardize the features for training:

    stsc = StandardScaler()
    df_data_x = stsc.fit_transform(df_data_x)
    print('mean:\n', stsc.mean_)
    print('var:\n', stsc.var_)
    
    df_train_x = df_data_x[:891]
    df_train_y = df_data_y[:891]
    
    df_test_x = df_data_x[891:]
    df_test_output = df.iloc[891:, :][['PassengerId','Survived']]
    

    3. Model ensembling

    The usual machine-learning routine is:

    1. Pick a base model, train it, and predict, to get a pipeline running as quickly as possible.
    2. On that basis, tune the model with cross-validation and grid search, and check how it performs.
    3. Ensemble several models and predict by voting (or some other scheme).

    In general, an ensemble's result beats any single model's.

    Here I skip steps 1 and 2 and go straight to step 3 (a minimal sketch of the skipped steps follows below).
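
    For reference, a minimal sketch of the skipped steps (not from the original post; it assumes df_train_x and df_train_y as defined above, and a hypothetical parameter grid):

    # Steps 1-2 baseline: one model plus a small grid search.
    base = ensemble.RandomForestClassifier(random_state=0)
    search = model_selection.GridSearchCV(
        base, {'n_estimators': [100, 300], 'max_depth': [4, 8]},
        cv=5, scoring='accuracy')
    search.fit(df_train_x, df_train_y)
    print(search.best_params_, 'CV accuracy: {:.4f}'.format(search.best_score_))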

    3.1 Set up the base models and parameter grids

    vote_est = [
        ('ada', ensemble.AdaBoostClassifier()),
        ('bc', ensemble.BaggingClassifier()),
        ('etc', ensemble.ExtraTreesClassifier()),
        ('gbc', ensemble.GradientBoostingClassifier()),
        ('rfc', ensemble.RandomForestClassifier()),
        ('gpc', gaussian_process.GaussianProcessClassifier()),
        ('lr', linear_model.LogisticRegressionCV()),
        ('bnb', naive_bayes.BernoulliNB()),
        ('gnb', naive_bayes.GaussianNB()),
        ('knn', neighbors.KNeighborsClassifier()),
        ('svc', svm.SVC(probability=True)),
        ('xgb', XGBClassifier())
    ]
    
    grid_n_estimator = [10, 50, 100, 300, 500]
    grid_ratio = [.5, .8, 1.0]
    grid_learn = [.001, .005, .01, .05, .1]
    grid_max_depth = [2, 4, 6, 8, 10]
    grid_criterion = ['gini', 'entropy']
    grid_bool = [True, False]
    grid_seed = [0]
    
    grid_param = [
        # AdaBoostClassifier
        {
            'n_estimators':grid_n_estimator,
            'learning_rate':grid_learn,
            'random_state':grid_seed
        },
        # BaggingClassifier
        {
            'n_estimators':grid_n_estimator,
            'max_samples':grid_ratio,
            'random_state':grid_seed
        },
        # ExtraTreesClassifier
        {
            'n_estimators':grid_n_estimator,
            'criterion':grid_criterion,
            'max_depth':grid_max_depth,
            'random_state':grid_seed
        },
        # GradientBoostingClassifier
        {
            'learning_rate':grid_learn,
            'n_estimators':grid_n_estimator,
            'max_depth':grid_max_depth,
            'random_state':grid_seed,
    
        },
        # RandomForestClassifier
        {
            'n_estimators':grid_n_estimator,
            'criterion':grid_criterion,
            'max_depth':grid_max_depth,
            'oob_score':[True],
            'random_state':grid_seed
        },
        # GaussianProcessClassifier
        {
            'max_iter_predict':grid_n_estimator,
            'random_state':grid_seed
        },
        # LogisticRegressionCV
        {
            'fit_intercept':grid_bool,  # default: True
            'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'random_state':grid_seed
        },
        # BernoulliNB
        {
            'alpha':grid_ratio,
        },
        # GaussianNB
        {},
        # KNeighborsClassifier
        {
            'n_neighbors':range(6, 25),
            'weights':['uniform', 'distance'],
            'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']
        },
        # SVC
        {
            'C':[1, 2, 3, 4, 5],
            'gamma':grid_ratio,
            'decision_function_shape':['ovo', 'ovr'],
            'probability':[True],
            'random_state':grid_seed
        },
        # XGBClassifier
        {
            'learning_rate':grid_learn,
            'max_depth':[1, 2, 4, 6, 8, 10],
            'n_estimators':grid_n_estimator,
            'seed':grid_seed
        }
    ]
    

    3.2 Training

    Each model is tuned before being combined. Some have many parameter combinations to iterate over, so to save time I used RandomizedSearchCV for those (I have not yet had time to try full GridSearchCV everywhere).

    start_total = time.perf_counter()
    N = 0
    for clf, param in zip(vote_est, grid_param):
        start = time.perf_counter()     
        cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0) 
        if 'n_estimators' not in param.keys():
            print(clf[1].__class__.__name__, 'GridSearchCV')
            best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'accuracy')
            best_search.fit(df_train_x, df_train_y)
            best_param = best_search.best_params_
        else:
            print(clf[1].__class__.__name__, 'RandomizedSearchCV')
            best_search2 = model_selection.RandomizedSearchCV(estimator = clf[1], param_distributions = param, cv = cv_split, scoring = 'accuracy')
            best_search2.fit(df_train_x, df_train_y)
            best_param = best_search2.best_params_
        run = time.perf_counter() - start
    
        print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
        clf[1].set_params(**best_param) 
    
    run_total = time.perf_counter() - start_total
    print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
    

    4. Prediction

    There are two ways to vote: soft voting and hard voting.

    • Hard voting: majority rule over the models' predicted classes.
    • Soft voting: I have not studied it closely; some articles state that it computes a (weighted) average of the predicted probabilities and predicts the class with the higher average probability.

    Without prior experience, it is best to compute both votes and compare the results.

    For Titanic survival prediction, I found hard voting to come out better every time (a toy illustration of the two rules follows below).
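
    A toy numeric illustration of the two rules (made-up probabilities, not from the post):

    # Three classifiers' predicted P(survived) for one passenger.
    probs = [0.6, 0.55, 0.2]
    votes = [1 if p >= 0.5 else 0 for p in probs]      # per-model class votes
    hard = 1 if sum(votes) * 2 > len(votes) else 0     # majority of votes: 1
    soft = 1 if sum(probs) / len(probs) >= 0.5 else 0  # mean probability 0.45: 0
    print(hard, soft)  # 1 0: the two rules can disagree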

    grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
    grid_hard_cv = model_selection.cross_validate(grid_hard, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
    grid_hard.fit(df_train_x, df_train_y)
    
    print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100)) 
    print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
    print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
    print('-'*10)
    
    grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
    grid_soft_cv = model_selection.cross_validate(grid_soft, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
    grid_soft.fit(df_train_x, df_train_y)
    
    print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
    print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
    print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
    

    The results:

    Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 89.70
    Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 85.97
    Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.95
    ----------
    Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 90.02
    Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 85.52
    Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 6.07
    

    Hard voting scores higher on the held-out folds and has a smaller standard deviation, so hard voting is preferred.

    5. Submitting the results

    Use hard voting as the prediction scheme, generate the results, and submit.

    df_test_output['Survived'] = grid_hard.predict(df_test_x)
    df_test_output.to_csv('../../data/titanic/hardvote.csv', index = False)
    

    Submitting on the official site gives a score of 0.81339.


    Postscript

    The Titanic project is well worth a try. Along the way I studied several kernels that competitors shared on Kaggle and learned a great deal from them.

    As an introductory project, though, taking part is what matters; when I find time I will do it again and see whether I can improve.

    Next, I plan to try the Dogs vs. Cats competition (猫狗大战):
    writing an algorithm to classify whether an image contains a dog or a cat.
    That is easy for humans, dogs, and cats, but how would an algorithm do it? We shall see.

  • Titanic: Passenger Survival Prediction 2

    2020-03-14 17:57:22

    Dataset required by the code: https://github.com/jsusu/Titanic_Passenger_Survival_Prediction_2/tree/master/titanic_data

    import re
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    import warnings
    warnings.filterwarnings('ignore')
    
    %matplotlib inline
    
    # 1. Load the data
    train_data = pd.read_csv("./titanic_data/titanic_train.csv")
    test_data = pd.read_csv("./titanic_data/titanic_test.csv")
    
    sns.set_style('whitegrid')
    train_data.head()
    
    PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
    0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
    1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
    2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
    3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
    4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

    train_data.info()
    print("-" * 40)
    test_data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  891 non-null    int64  
     1   Survived     891 non-null    int64  
     2   Pclass       891 non-null    int64  
     3   Name         891 non-null    object 
     4   Sex          891 non-null    object 
     5   Age          714 non-null    float64
     6   SibSp        891 non-null    int64  
     7   Parch        891 non-null    int64  
     8   Ticket       891 non-null    object 
     9   Fare         891 non-null    float64
     10  Cabin        204 non-null    object 
     11  Embarked     889 non-null    object 
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB
    ----------------------------------------
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 418 entries, 0 to 417
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  418 non-null    int64  
     1   Pclass       418 non-null    int64  
     2   Name         418 non-null    object 
     3   Sex          418 non-null    object 
     4   Age          332 non-null    float64
     5   SibSp        418 non-null    int64  
     6   Parch        418 non-null    int64  
     7   Ticket       418 non-null    object 
     8   Fare         417 non-null    float64
     9   Cabin        91 non-null     object 
     10  Embarked     418 non-null    object 
    dtypes: float64(2), int64(4), object(5)
    memory usage: 36.0+ KB
    
    # From the above we can see that the Age, Cabin, Embarked and Fare features have missing values.
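
    # A quick direct check of the missing counts (a sketch, not in the original notebook):
    print(train_data.isnull().sum())
    print(test_data.isnull().sum())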
    
    
    # Plot the proportion of survivors vs. non-survivors
    train_data['Survived'].value_counts().plot.pie(labeldistance = 1.1,autopct = '%1.2f%%',
                                                   shadow = False,startangle = 90,pctdistance = 0.6)

    # labeldistance: distance of the labels from the center, in multiples of the radius (1.1 = 1.1x the radius)
    # autopct: format of the text inside the wedges; '%1.2f%%' is a percentage with two decimal places
    # shadow: whether the pie casts a shadow
    # startangle: starting angle; 0 starts the first wedge at 0 degrees going counterclockwise, starting at 90 usually looks better
    # pctdistance: distance of the percentage text from the center
    # patches, l_texts, p_texts: the pie plot's return values (p_texts: the text inside the pie; l_texts: the outer labels)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x121125b50>
    


    # 2. Handling missing values

    # When analysing data, watch for missing values. Some machine learning algorithms can handle
    # them, such as neural networks; some cannot.
    # Common ways of dealing with missing values:

    # (1) If the dataset is large and only a few values are missing, drop the rows with missing values
    #     (a one-line sketch follows the list);
    # (2) If the attribute is not very important for learning, fill the missing values with the mean or the mode;
    # (3) For nominal attributes, assign a value that stands for "missing", e.g. 'U0', since missingness itself
    #     may carry hidden information. For the cabin number Cabin, a missing value may mean no cabin at all.
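    # A sketch of option (1), not used further in this notebook (it would shrink the training set):
    train_data_dropped = train_data.dropna(subset=['Age', 'Embarked'])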
    train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].dropna().mode().values[0])
    # replace missing values with U0
    train_data['Cabin'] = train_data.Cabin.fillna('U0')
    # equivalent: train_data.loc[train_data.Cabin.isnull(), 'Cabin'] = 'U0'
    
    
    
    # (4) Use a model such as regression or a random forest to predict the missing attribute. Age is a fairly
    # important feature in this dataset (analysing Age first makes this clear), so a reasonably accurate fill
    # matters and has a visible effect on the result. Usually the complete rows serve as the training set for
    # predicting the missing values. For this data either a random forest or linear regression would work; here
    # a random forest is used, with the dataset's numeric attributes as features (sklearn models only accept
    # numeric input, so only numeric features are selected here; in real applications, non-numeric features
    # must first be converted to numeric ones).
    
    from sklearn.ensemble import RandomForestRegressor
    
    #choose training data to predict age
    age_df = train_data[['Age','Survived','Fare', 'Parch', 'SibSp', 'Pclass']]
    age_df_notnull = age_df.loc[(train_data['Age'].notnull())]
    age_df_isnull = age_df.loc[(train_data['Age'].isnull())]
    X = age_df_notnull.values[:,1:]
    Y = age_df_notnull.values[:,0]
    
    # use RandomForestRegression to train data
    RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    RFR.fit(X,Y)
    predictAges = RFR.predict(age_df_isnull.values[:,1:])
    train_data.loc[train_data['Age'].isnull(), ['Age']]= predictAges
    
    train_data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
     #   Column       Non-Null Count  Dtype  
    ---  ------       --------------  -----  
     0   PassengerId  891 non-null    int64  
     1   Survived     891 non-null    int64  
     2   Pclass       891 non-null    int64  
     3   Name         891 non-null    object 
     4   Sex          891 non-null    object 
     5   Age          891 non-null    float64
     6   SibSp        891 non-null    int64  
     7   Parch        891 non-null    int64  
     8   Ticket       891 non-null    object 
     9   Fare         891 non-null    float64
     10  Cabin        891 non-null    object 
     11  Embarked     891 non-null    object 
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB
    
    # 3. Relationships between the features and survival
    # 3.1 Sex vs. survival
    print(train_data.groupby(['Sex','Survived'])['Survived'].count())
    
    Sex     Survived
    female  0            81
            1           233
    male    0           468
            1           109
    Name: Survived, dtype: int64
    
    train_data[['Sex','Survived']].groupby(['Sex']).mean()
    
            Survived
    Sex
    female  0.742038
    male    0.188908

    train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a24032f10>
    


    # Survival rates by sex: the Titanic disaster clearly reflected "ladies first".

    # 3.2 Pclass vs. survival
    print(train_data.groupby(['Pclass','Survived'])['Pclass'].count())
    
    Pclass  Survived
    1       0            80
            1           136
    2       0            97
            1            87
    3       0           372
            1           119
    Name: Pclass, dtype: int64
    
    print(train_data[['Pclass','Survived']].groupby(['Pclass']).mean())
    
            Survived
    Pclass          
    1       0.629630
    2       0.472826
    3       0.242363
    
    train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a24bc0ad0>
    


    # Survival rates of men and women within each class:
    train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a24c2e710>
    


    print(train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count())
    
    Sex     Pclass  Survived
    female  1       0             3
                    1            91
            2       0             6
                    1            70
            3       0            72
                    1            72
    male    1       0            77
                    1            45
            2       0            91
                    1            17
            3       0           300
                    1            47
    Name: Survived, dtype: int64
    
    # The figures and tables show that evacuation broadly put women first, with clear differences between classes.

    # 3.3 Age vs. survival
    # Age distribution vs. survival, split by class and by sex:
    
    fig,ax = plt.subplots(1,2, figsize = (18,5))
    ax[0].set_yticks(range(0,110,10))
    sns.violinplot("Pclass","Age",hue="Survived",data=train_data,split=True,ax=ax[0])
    ax[0].set_title('Pclass and Age vs Survived') 
    
    ax[1].set_yticks(range(0,110,10))
    sns.violinplot("Sex","Age",hue="Survived",data=train_data,split=True,ax=ax[1])
    ax[1].set_title('Sex and Age vs Survived')
     
    plt.show()
    


    # Overall age distribution:
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    train_data['Age'].hist(bins=100)
    plt.xlabel('Age')
    plt.ylabel('Num')
     
    plt.subplot(122)
    train_data.boxplot(column='Age',showfliers=False)
    plt.show()
    


    # Distribution of survivors and non-survivors across ages:
    facet = sns.FacetGrid(train_data,hue="Survived",aspect=4)
    facet.map(sns.kdeplot,'Age',shade=True)
    facet.set(xlim=(0,train_data['Age'].max()))
    facet.add_legend()
    
    <seaborn.axisgrid.FacetGrid at 0x1a25263d90>
    


    # Average survival rate by age:
    fig,axis1 = plt.subplots(1,1,figsize=(18,4))
    train_data['Age_int'] = train_data['Age'].astype(int)
    average_age = train_data[["Age_int", "Survived"]].groupby(['Age_int'],as_index=False).mean()
    sns.barplot(x='Age_int',y='Survived',data=average_age)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a254e67d0>
    


    print(train_data['Age'].describe())
    
    count    891.000000
    mean      29.658964
    std       13.735787
    min        0.420000
    25%       21.000000
    50%       28.000000
    75%       37.000000
    max       80.000000
    Name: Age, dtype: float64
    
    # The sample has 891 passengers with mean age about 30, std 13.7, minimum 0.42, maximum 80.
    # Bin the passengers into children, teenagers, adults, and the elderly, and compare survival across the four groups:
    
    bins = [0, 12, 18, 65, 100]
    train_data['Age_group'] = pd.cut(train_data['Age'],bins)
    by_age = train_data.groupby('Age_group')['Survived'].mean()
    print(by_age)
    
    Age_group
    (0, 12]      0.506173
    (12, 18]     0.466667
    (18, 65]     0.364512
    (65, 100]    0.125000
    Name: Survived, dtype: float64
    
    by_age.plot(kind = 'bar')
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a253a8650>
    


    # 3.4 Title vs. survival
    # The Name field embeds each passenger's title, e.g. Mr, Miss, Mrs; titles convey the passenger's age and
    # sex, and some also convey social standing, such as Dr, Lady, Major, and Master.
    train_data['Title'] = train_data['Name'].str.extract(' ([A-Za-z]+)\.',expand=False)
    pd.crosstab(train_data['Title'],train_data['Sex'])
    
    Sex female male
    Title
    Capt 0 1
    Col 0 2
    Countess 1 0
    Don 0 1
    Dr 1 6
    Jonkheer 0 1
    Lady 1 0
    Major 0 2
    Master 0 40
    Miss 182 0
    Mlle 2 0
    Mme 1 0
    Mr 0 517
    Mrs 125 0
    Ms 1 0
    Rev 0 6
    Sir 0 1

    # Survival rate by title:
    train_data[['Title','Survived']].groupby(['Title']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a25a5e710>
    


    # For names, we can also check whether name length could be related to survival:
    
    fig, axis1 = plt.subplots(1,1,figsize=(18,4))
    train_data['Name_length'] = train_data['Name'].apply(len)
    name_length = train_data[['Name_length','Survived']].groupby(['Name_length'], as_index=False).mean()
    sns.barplot(x='Name_length', y='Survived',data=name_length)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a25b1b590>
    


    # The plot suggests name length is indeed somewhat correlated with survival.

    # 3.5 SibSp vs. survival

    # Split the data into passengers with and without siblings/spouses aboard:
    sibsp_df = train_data[train_data['SibSp'] != 0]
    no_sibsp_df = train_data[train_data['SibSp'] == 0]
    
    plt.figure(figsize=(11,5))
    plt.subplot(121)
    sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived','Survived'],autopct= '%1.1f%%')
    plt.xlabel('sibsp')
     
    plt.subplot(122)
    no_sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived','Survived'],autopct= '%1.1f%%')
    plt.xlabel('no_sibsp')
     
    plt.show()
    


    # 3.6 Parch vs. survival
    # The same analysis as for siblings/spouses gives:
    parch_df = train_data[train_data['Parch'] != 0]  
    no_parch_df = train_data[train_data['Parch'] == 0]  
     
    plt.figure(figsize=(11,5))  
    plt.subplot(121)  
    parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct= '%1.2f%%')  
    plt.xlabel('parch')  
     
    plt.subplot(122)  
    no_parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.2f%%')  
    plt.xlabel('no_parch') 
     
    plt.show()  
    


    # 3.7 Number of relatives aboard vs. survival (SibSp & Parch)
    
    fig, ax=plt.subplots(1,2,figsize=(15,5))
    train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
    ax[0].set_title('Parch and Survived')
    train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
    ax[1].set_title('SibSp and Survived')
    
    Text(0.5, 1.0, 'SibSp and Survived')
    


    train_data['Family_Size'] = train_data['Parch'] + train_data['SibSp']+1
    train_data[['Family_Size','Survived']].groupby(['Family_Size']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a25323310>
    


    # The charts show that passengers travelling alone had a fairly low survival rate, but so did those with too many relatives aboard.

    # 3.8 Fare vs. survival

    # First plot the fare distribution:
    plt.figure(figsize=(10,5))
    train_data['Fare'].hist(bins=70)
     
    train_data.boxplot(column='Fare', by='Pclass', showfliers=False)
    plt.show()
    


    print(train_data['Fare'].describe())
    
    count    891.000000
    mean      32.204208
    std       49.693429
    min        0.000000
    25%        7.910400
    50%       14.454200
    75%       31.000000
    max      512.329200
    Name: Fare, dtype: float64
    
    # Mean and standard deviation of the fare for survivors vs. non-survivors:
    fare_not_survived = train_data['Fare'][train_data['Survived'] == 0]
    fare_survived = train_data['Fare'][train_data['Survived'] == 1]
     
    average_fare = pd.DataFrame([fare_not_survived.mean(),fare_survived.mean()])
    std_fare = pd.DataFrame([fare_not_survived.std(),fare_survived.std()])
    average_fare.plot(yerr=std_fare,kind='bar',legend=False)
     
    plt.show()
    


    # The chart shows fare correlates with survival to some degree: survivors paid a higher average fare than non-survivors.

    # 3.9 Cabin vs. survival
    # Cabin has far too many missing values (only 204 present), so relating individual cabins to survival is
    # hard, and the feature could simply be dropped during feature engineering. Still, we can analyse it here by
    # grouping all missing values into one category. Start by reducing it to whether a Cabin record exists at all:
    
    # Replace missing values with "U0"
    train_data.loc[train_data.Cabin.isnull(),'Cabin'] = 'U0'
    train_data['Has_Cabin'] = train_data['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
    train_data[['Has_Cabin','Survived']].groupby(['Has_Cabin']).mean().plot.bar()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a26439910>
    


    # Analyse the different cabin types:
    
    # create feature for the alphabetical part of the cabin number
    train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    # convert the distinct cabin letters with incremental integer values
    train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]
    train_data[['CabinLetter','Survived']].groupby(['CabinLetter']).mean().plot.bar()
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a2651a7d0>
    


    # Survival differs somewhat across cabin letters, but not by much, so the feature can simply be dropped later.

    # 3.10 Embarked vs. survival
    # The Titanic sailed from Southampton, England, calling at Cherbourg, France, and Queenstown, Ireland.
    # Anyone who boarded earlier could have disembarked at Cherbourg or Queenstown and avoided the disaster.
    
    sns.countplot('Embarked',hue='Survived',data=train_data)
    plt.title('Embarked and Survived')
    
    Text(0.5, 1.0, 'Embarked and Survived')
    


    sns.factorplot('Embarked','Survived',data = train_data, size=3, aspect=2)
    plt.title('Embarked and Survived rate')
    plt.show()
    


    # The plots show survival differed by port of embarkation: C highest, then Q, then S. That concludes the
    # analysis of the given features against survival. Reportedly the Titanic carried 2,224 people in total, and
    # this training set covers only 891 of them. If those 891 were drawn at random from the 2,224, the sample is
    # large enough (by the central limit theorem) for the analysis to be representative; if the sampling was not
    # random, the conclusions may be unreliable.
    
    # 3.11 Other features possibly related to survival
    # Beyond the features in the dataset, one can imagine other factors that might influence the model:
    # nationality, height, weight, swimming ability, occupation, and so on.
    # Two given features were not analysed above: Ticket and Cabin. They may affect where a passenger was on the
    # ship and hence the order of escape, but Cabin is mostly missing and Ticket has too many categories to find
    # patterns by hand, so their importance is left for the models to decide during model fusion.
    
    # 4. Variable conversion
    # Variable conversion turns the data into a form the models can use. Different models accept different kinds
    # of data; scikit-learn requires everything to be numeric, so the non-numeric raw data has to be converted.
    # The conversions below are introduced for use during feature engineering. All data falls into two kinds:
    # 1. Quantitative variables can be ordered or measured in some numeric way; Age is a good example.
    # 2. Qualitative variables describe an aspect of an object that cannot be expressed numerically; Embarked is an example.
    
    # 4.1 Dummy Variables
    # These are categorical or binary indicator columns; dummy variables suit qualitative variables that take a
    # small number of frequent, distinct values.
    # Embarked contains only three values, 'S', 'C', and 'Q', so we can convert it to dummies with the code below:
    
    embark_dummies = pd.get_dummies(train_data['Embarked'])
    train_data = train_data.join(embark_dummies)
    train_data.drop(['Embarked'], axis=1, inplace=True)
    
    embark_dummies = train_data[['S','C','Q']]
    embark_dummies.head()
    
    S C Q
    0 1 0 0
    1 0 1 0
    2 1 0 0
    3 1 0 0
    4 1 0 0

    # 4.2 Factorizing
    # Dummies handle a nominal attribute like Cabin poorly because it has too many distinct values. Pandas offers
    # factorize(), which maps each category to an integer ID; unlike dummies, the mapping yields a single feature
    # instead of many.
    
    # Replace missing values with "U0"
    train_data.loc[train_data.Cabin.isnull(), 'Cabin'] = 'U0'
    # create feature for the alphabetical part of the cabin number
    train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    # convert the distinct cabin letters with incremental integer values
    train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]
    
    train_data[['Cabin','CabinLetter']].head()
    
    Cabin CabinLetter
    0 U0 0
    1 C85 1
    2 U0 0
    3 C123 1
    4 U0 0

    # 4.3 Scaling
    # Scaling maps values from a wide range into a small one (usually -1 to 1 or 0 to 1). We often need to scale
    # features to a common range, or the wide-ranging ones get an outsized weight. For example, Age may span
    # 0-100 while income spans 0-10,000,000, which distorts results in models sensitive to feature magnitude.

    # Scale Age as follows:
    from sklearn import preprocessing
     
    assert np.size(train_data['Age']) == 891
    # StandardScaler will subtract the mean from each value, then scale to unit variance
    scaler = preprocessing.StandardScaler()
    train_data['Age_scaled'] = scaler.fit_transform(train_data['Age'].values.reshape(-1,1))
    
    print(train_data['Age_scaled'].head())
    
    0   -0.557905
    1    0.607590
    2   -0.266531
    3    0.389059
    4    0.389059
    Name: Age_scaled, dtype: float64
    
    # 4.4 Binning

    # Binning discretizes continuous data by looking at "neighbours" (nearby values): the values are spread into
    # a number of "buckets" or "bins", just as a histogram's bins partition the data.
    # The code below bins Fare.
    
    # Divide all fares into five quantile bins
    train_data['Fare_bin'] = pd.qcut(train_data['Fare'],5)
    print(train_data['Fare_bin'].head())
    
    0      (-0.001, 7.854]
    1    (39.688, 512.329]
    2        (7.854, 10.5]
    3    (39.688, 512.329]
    4        (7.854, 10.5]
    Name: Fare_bin, dtype: category
    Categories (5, interval[float64]): [(-0.001, 7.854] < (7.854, 10.5] < (10.5, 21.679] < (21.679, 39.688] < (39.688, 512.329]]
    
    # After binning the data, either factorize it or convert it to dummies.
    # qcut() creates a new variable that identifies the quantile range, but we can't use the strings directly,
    # so either factorize the result or create dummies from it
     
    # factorize
    train_data['Fare_bin_id'] = pd.factorize(train_data['Fare_bin'])[0]
     
    # dummies
    fare_bin_dummies_df = pd.get_dummies(train_data['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))
    train_data = pd.concat([train_data, fare_bin_dummies_df], axis=1)
    
    # 5. Feature engineering
    # During feature engineering, the test data has to be processed together with the training data so that both
    # share the same data types and distributions.
    train_df_org = pd.read_csv("./titanic_data/titanic_train.csv")
    test_df_org = pd.read_csv('./titanic_data/titanic_test.csv')
    test_df_org['Survived'] = 0
    combined_train_test = train_df_org.append(test_df_org)   #891+418=1309rows, 12columns
    PassengerId = test_df_org['PassengerId']
    
    # Feature engineering means extracting, from the raw fields, the features that influence the output to a
    # greater or lesser degree, and using them as the basis for training. It usually makes sense to start with
    # the features that contain missing values.
    
    # 5.1 Embarked
    # "Embarked" has only a few missing values, so fill them with the mode:
    combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0],inplace=True)
    
    # For the three ports, the conversion section above gives us two options: dummies and factorizing. With only
    # three ports, dummies work directly:
    
    # for the later feature analysis, also factorize the Embarked feature
    combined_train_test['Embarked'] = pd.factorize(combined_train_test['Embarked'])[0]
     
    # use pd.get_dummies to obtain the one-hot encoding
    emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'],prefix=combined_train_test[['Embarked']].columns[0])
    combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
    
    
    # 5.2 Sex
    # One-hot encode Sex as well, i.e. dummy processing:
    # for the later feature analysis, also factorize the Sex feature
    combined_train_test['Sex'] = pd.factorize(combined_train_test['Sex'])[0]
     
    sex_dummies_df = pd.get_dummies(combined_train_test['Sex'],prefix=combined_train_test[['Sex']].columns[0])
    combined_train_test = pd.concat([combined_train_test,sex_dummies_df],axis=1)
    
    # 5.3 Name

    # First extract each passenger's title from the name
    # what is each person's title?
    combined_train_test['Title'] = combined_train_test['Name'].map(lambda x: re.compile(",(.*?)\.").findall(x)[0])
    combined_train_test['Title'] = combined_train_test['Title'].apply(lambda x:x.strip())
    
    # Note: the two ways of extracting the Title give the same values, but when
    # combined_train_test['Title'] = combined_train_test['Name'].map(lambda x: re.compile(",(.*?)\.").findall(x)[0])
    # was used, the statements further below ran into problems.
    
    
    # Normalize the various titles into a few groups:
    title_Dict = {}
    title_Dict.update(dict.fromkeys(['Capt','Col','Major','Dr','Rev'],'Officer'))
    title_Dict.update(dict.fromkeys(['Don','Sir','the Countess','Dona','Lady'],'Royalty'))
    title_Dict.update(dict.fromkeys(['Mme','Ms','Mrs'],'Mrs'))
    title_Dict.update(dict.fromkeys(['Mlle','Miss'],'Miss'))
    title_Dict.update(dict.fromkeys(['Mr'],'Mr'))
    title_Dict.update(dict.fromkeys(['Master','Jonkheer'],'Master'))
     
    combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)
    
    # Use dummies to split the different titles into columns:
    # for the later feature analysis, also factorize the Title feature
    combined_train_test['Title'] = pd.factorize(combined_train_test['Title'])[0]
    title_dummies_df = pd.get_dummies(combined_train_test['Title'],prefix=combined_train_test[['Title']].columns[0])
    combined_train_test = pd.concat([combined_train_test,title_dummies_df],axis=1)
    
    # Add a name-length feature
    combined_train_test['Name_length'] = combined_train_test['Name'].apply(len)
    
    # 5.4 Fare

    # From the earlier analysis, Fare is missing one value in the test data, so it needs filling. Fill it with
    # the mean fare of the corresponding class:

    # transform below applies np.mean within each group.
    combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(combined_train_test.groupby('Pclass').transform(np.mean))
    
    # Inspecting the Ticket data shows that some ticket numbers repeat. Combining this with the relative counts
    # and names, and comparing fares against cabin class, reveals family and group tickets, so a group ticket's
    # fare should be split across everyone sharing it:
    combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
    combined_train_test['Fare'] = combined_train_test['Fare']/combined_train_test['Group_Ticket']
    combined_train_test.drop(['Group_Ticket'],axis=1,inplace=True)
    
    # Use binning to grade the fares:
    combined_train_test['Fare_bin'] = pd.qcut(combined_train_test['Fare'],5)

    # For the five fare grades, we can again use dummies to split the grades into columns:
     
    combined_train_test['Fare_bin_id'] = pd.factorize(combined_train_test['Fare_bin'])[0]
     
    fare_bin_dummies_df = pd.get_dummies(combined_train_test['Fare_bin_id']).rename(columns=lambda x: 'Fare_' + str(x))
    combined_train_test = pd.concat([combined_train_test,fare_bin_dummies_df],axis=1)
    combined_train_test.drop(['Fare_bin'],axis=1, inplace=True)
    
    # 5.5 Pclass
    # Pclass itself would need no further processing beyond converting to dummy form. For a finer analysis,
    # though, assume that within each class the fare also reflects the cabin's location, which may well relate
    # to the order of escape. So work out the high-fare and low-fare halves of each class:
    
    print(combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean())
    
    Pclass
    1    33.910500
    2    11.411010
    3     7.337571
    Name: Fare, dtype: float64
    
    from sklearn.preprocessing import LabelEncoder
     
    # build the Pclass Fare Category
    def pclass_fare_category(df,pclass1_mean_fare,pclass2_mean_fare,pclass3_mean_fare):
        if df['Pclass'] == 1:
            if df['Fare'] <= pclass1_mean_fare:
                return 'Pclass1_Low'
            else:
                return 'Pclass1_High'
        elif df['Pclass'] == 2:
            if df['Fare'] <= pclass2_mean_fare:
                return 'Pclass2_Low'
            else:
                return 'Pclass2_High'
        elif df['Pclass'] == 3:
            if df['Fare'] <= pclass3_mean_fare:
                return 'Pclass3_Low'
            else:
                return 'Pclass3_High'
     
    Pclass1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(1)
    Pclass2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(2)
    Pclass3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get(3)
     
    # build the Pclass_Fare_Category feature
    combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category,args=(
            Pclass1_mean_fare,Pclass2_mean_fare,Pclass3_mean_fare),axis=1)
    pclass_level = LabelEncoder()
     
    # fit a label for each category
    pclass_level.fit(np.array(['Pclass1_Low','Pclass1_High','Pclass2_Low','Pclass2_High','Pclass3_Low','Pclass3_High']))
     
    # convert to numeric codes
    combined_train_test['Pclass_Fare_Category'] = pclass_level.transform(combined_train_test['Pclass_Fare_Category'])
     
    # dummy conversion
    pclass_dummies_df = pd.get_dummies(combined_train_test['Pclass_Fare_Category']).rename(columns=lambda x: 'Pclass_' + str(x))
    combined_train_test = pd.concat([combined_train_test,pclass_dummies_df],axis=1)
    
    # Also factorize the Pclass feature itself:
    combined_train_test['Pclass'] = pd.factorize(combined_train_test['Pclass'])[0]
    
    # 5.6 Parch and SibSp
    # The earlier analysis showed that having no relatives aboard, or too many, hurts Survived, so combine the
    # two into a Family_Size feature while also keeping the two originals.
    
    def family_size_category(family_size):
        if family_size <= 1:
            return 'Single'
        elif family_size <= 4:
            return 'Small_Family'
        else:
            return 'Large_Family'
    
    combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
    combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
    
    le_family = LabelEncoder()
    le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
    combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
    
    family_size_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                            prefix=combined_train_test[['Family_Size_Category']].columns[0])
    combined_train_test = pd.concat([combined_train_test, family_size_dummies_df], axis=1)
    
    # 5.7 Age

    # Age has many missing values, so simply filling with the mode or the mean will not do.

    # Two fill strategies are common: fill with the mean age of each Title group (Mr, Master, Miss, and so on),
    # or combine several complete columns such as Sex, Title, and Pclass and predict Age with a machine learning
    # algorithm.

    # Here the latter is used: with Age as the target, the rows where Age is present form the training set and
    # the rows where Age is missing form the test set.
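
    # For reference, the first strategy is a one-liner (a sketch, not used in this notebook, and therefore
    # commented out so the model-based fill below stays the active path):
    # combined_train_test['Age'] = combined_train_test.groupby('Title')['Age'] \
    #     .transform(lambda s: s.fillna(s.mean()))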
    
    missing_age_df = pd.DataFrame(combined_train_test[
        ['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category','Fare', 'Fare_bin_id', 'Pclass']])
    
    missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
    missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
    
    missing_age_test.head()
    
    
    Age Embarked Sex Title Name_length Family_Size Family_Size_Category Fare Fare_bin_id Pclass
    5 NaN 2 0 0 16 1 1 8.4583 2 0
    17 NaN 0 0 0 28 1 1 13.0000 3 2
    19 NaN 1 1 1 23 1 1 7.2250 4 0
    26 NaN 1 0 0 23 1 1 7.2250 4 0
    28 NaN 2 1 2 29 1 1 7.8792 0 0

    # Build the Age prediction model; we can predict with several models and then merge them to improve accuracy.
    
    from sklearn import ensemble
    from sklearn import model_selection
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.ensemble import RandomForestRegressor
    
    def fill_missing_age(missing_age_train, missing_age_test):
        missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
        missing_age_Y_train = missing_age_train['Age']
        missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    
        # model 1  gbm
        gbm_reg = GradientBoostingRegressor(random_state=42)
        gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
        gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
        gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
        print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
        print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
        print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
        missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
        print(missing_age_test['Age_GB'][:4])
    
        # model 2 rf
        rf_reg = RandomForestRegressor()
        rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
        rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
        rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
        print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
        print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
        print('RF Train Error for "Age" Feature Regressor' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
        missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
        print(missing_age_test['Age_RF'][:4])
    
        # two models merge
        print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
        # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LR']].mode(axis=1)

        # NOTE: np.mean over a list of two Series collapses to a single scalar, which is why every filled Age in
        # the output below is 29.97686; a row-wise mean would be missing_age_test[['Age_GB', 'Age_RF']].mean(axis=1)
        missing_age_test.loc[:, 'Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_RF']])
        print(missing_age_test['Age'][:4])
    
        missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)
    
        return missing_age_test
    
    # Fill the missing Age values with the merged models' predictions:
    combined_train_test.loc[(combined_train_test.Age.isnull()), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
    
    Fitting 10 folds for each of 1 candidates, totalling 10 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done   5 out of  10 | elapsed:    4.3s remaining:    4.3s
    [Parallel(n_jobs=25)]: Done  10 out of  10 | elapsed:    4.3s finished
    
    
    Age feature Best GB Params:{'learning_rate': 0.01, 'max_depth': 4, 'max_features': 3, 'n_estimators': 2000}
    Age feature Best GB Score:-128.38286366239385
    GB Train Error for "Age" Feature Regressor:-65.2562037120689
    5     37.508266
    17    31.580052
    19    34.597808
    26    29.076996
    Name: Age_GB, dtype: float64
    Fitting 10 folds for each of 1 candidates, totalling 10 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done   5 out of  10 | elapsed:    1.6s remaining:    1.6s
    [Parallel(n_jobs=25)]: Done  10 out of  10 | elapsed:    1.7s finished
    
    
    Age feature Best RF Params:{'max_depth': 5, 'n_estimators': 200, 'random_state': 0}
    Age feature Best RF Score:-119.64194051962507
    RF Train Error for "Age" Feature Regressor-96.82296812792812
    5     33.513123
    17    33.098071
    19    34.853983
    26    28.148613
    Name: Age_RF, dtype: float64
    shape1 (263,) (263, 2)
    5     29.97686
    17    29.97686
    19    29.97686
    26    29.97686
    Name: Age, dtype: float64
    
    missing_age_test.head()
    
    Age Embarked Sex Title Name_length Family_Size Family_Size_Category Fare Fare_bin_id Pclass
    5 29.97686 2 0 0 16 1 1 8.4583 2 0
    17 29.97686 0 0 0 28 1 1 13.0000 3 2
    19 29.97686 1 1 1 23 1 1 7.2250 4 0
    26 29.97686 1 0 0 23 1 1 7.2250 4 0
    28 29.97686 2 1 2 29 1 1 7.8792 0 0

    # 5.8 Ticket
    # The Ticket values mix letters and digits. Different letter prefixes may well indicate cabin class or
    # location on the ship, which could also affect Survived, so split off the letter part of Ticket and group
    # all the purely numeric tickets into one class.
    
    combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
    combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)
    
    # To extract the numeric part instead, one could do the following; for now the numeric tickets are simply
    # grouped into one class.
    # combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
    # combined_train_test['Ticket_Number'].fillna(0, inplace=True)
    
    # factorize Ticket_Letter
    combined_train_test['Ticket_Letter'] = pd.factorize(combined_train_test['Ticket_Letter'])[0]
    
    # 5.9 Cabin

    # Cabin really has too many missing values to analyse or predict from, so the feature could simply be
    # removed. But the analysis above showed that whether the value exists at all bears some relation to
    # survival, so keep it for now, reduced to the two categories "present" and "missing":
    
    combined_train_test.loc[combined_train_test.Cabin.isnull(), 'Cabin'] = 'U0'
    combined_train_test['Cabin'] = combined_train_test['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
    
    # 5.10 Correlation between features
    # Pick the main features and draw the correlation map between them to inspect pairwise correlations:
    
    Correlation = pd.DataFrame(combined_train_test[['Embarked','Sex','Title','Name_length','Family_Size',
                                                    'Family_Size_Category','Fare','Fare_bin_id','Pclass',
                                                    'Pclass_Fare_Category','Age','Ticket_Letter','Cabin']])
    
    colormap = plt.cm.viridis
    plt.figure(figsize=(14,12))
    plt.title('Pearson Correlation of Features',y=1.05,size=15)
    sns.heatmap(Correlation.astype(float).corr(),linewidths=0.1,vmax=1.0,square=True,cmap=colormap,linecolor='white',annot=True)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1a26cf2d90>
    


    # 5.11 Pairwise distributions of the features
    g = sns.pairplot(combined_train_test[[u'Survived',u'Pclass',u'Sex',u'Age',u'Fare',u'Embarked',
                                          u'Family_Size',u'Title',u'Ticket_Letter']],hue='Survived',
                                          palette = 'seismic',size=1.2,diag_kind ='kde',diag_kws=
                                          dict(shade=True),plot_kws=dict(s=10))
    g.set(xticklabels=[])
    
    <seaborn.axisgrid.PairGrid at 0x1a23685110>
    


    # 5.12 Final processing before feeding the models:
    # 5.12.1 Standardization: standardize Age, Fare, and Name_length:
    
    from sklearn import preprocessing
    scale_age_fare = preprocessing.StandardScaler().fit(combined_train_test[['Age','Fare','Name_length']])
    combined_train_test[['Age','Fare','Name_length']] = scale_age_fare.transform(combined_train_test[['Age','Fare','Name_length']])
    
    # 5.12.2 Drop unused features
    # The feature engineering above extracted many features to fold into the models, but the raw columns we no
    # longer use, and the non-numeric ones, must be removed. First back up the data for later re-analysis:
    
    combined_data_backup = combined_train_test.copy()

    combined_train_test.drop(['PassengerId','Embarked','Sex','Name','Fare_bin_id','Pclass_Fare_Category',
                              'Parch','SibSp','Family_Size_Category','Ticket'],axis=1,inplace=True)
    
    # 5.12.3 Split the training and test data back apart
    train_data = combined_train_test[:891]
    test_data = combined_train_test[891:]
     
    titanic_train_data_X = train_data.drop(['Survived'],axis=1)
    titanic_train_data_Y = train_data['Survived']
    titanic_test_data_X = test_data.drop(['Survived'],axis=1)
    
    titanic_train_data_X.shape
    
    (891, 34)
    
    titanic_train_data_X.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 0 to 890
    Data columns (total 34 columns):
     #   Column                  Non-Null Count  Dtype  
    ---  ------                  --------------  -----  
     0   Pclass                  891 non-null    int64  
     1   Age                     891 non-null    float64
     2   Fare                    891 non-null    float64
     3   Cabin                   891 non-null    int64  
     4   Embarked_0              891 non-null    uint8  
     5   Embarked_1              891 non-null    uint8  
     6   Embarked_2              891 non-null    uint8  
     7   Sex_0                   891 non-null    uint8  
     8   Sex_1                   891 non-null    uint8  
     9   Title                   891 non-null    int64  
     10  Title_-1                891 non-null    uint8  
     11  Title_0                 891 non-null    uint8  
     12  Title_1                 891 non-null    uint8  
     13  Title_2                 891 non-null    uint8  
     14  Title_3                 891 non-null    uint8  
     15  Title_4                 891 non-null    uint8  
     16  Title_5                 891 non-null    uint8  
     17  Name_length             891 non-null    float64
     18  Fare_0                  891 non-null    uint8  
     19  Fare_1                  891 non-null    uint8  
     20  Fare_2                  891 non-null    uint8  
     21  Fare_3                  891 non-null    uint8  
     22  Fare_4                  891 non-null    uint8  
     23  Pclass_0                891 non-null    uint8  
     24  Pclass_1                891 non-null    uint8  
     25  Pclass_2                891 non-null    uint8  
     26  Pclass_3                891 non-null    uint8  
     27  Pclass_4                891 non-null    uint8  
     28  Pclass_5                891 non-null    uint8  
     29  Family_Size             891 non-null    int64  
     30  Family_Size_Category_0  891 non-null    uint8  
     31  Family_Size_Category_1  891 non-null    uint8  
     32  Family_Size_Category_2  891 non-null    uint8  
     33  Ticket_Letter           891 non-null    int64  
    dtypes: float64(3), int64(5), uint8(26)
    memory usage: 85.3 KB
    
    # 6. Model fusion and testing
    # 6.1 Use several different models to rank the features and pick out the more important ones:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier
     
    def get_top_n_features(titanic_train_data_X,titanic_train_data_Y,top_n_features):
        
        #randomforest
        rf_est = RandomForestClassifier(random_state=0)
        rf_param_grid = {'n_estimators':[500],'min_samples_split':[2,3],'max_depth':[20]}
        rf_grid = model_selection.GridSearchCV(rf_est,rf_param_grid,n_jobs=25,cv=10,verbose=1)
        rf_grid.fit(titanic_train_data_X,titanic_train_data_Y)
        print('Top N Features Best RF Params:' + str(rf_grid.best_params_))
        print('Top N Features Best RF Score:' + str(rf_grid.best_score_))
        print('Top N Features RF Train Score:' + str(rf_grid.score(titanic_train_data_X,titanic_train_data_Y)))
        feature_imp_sorted_rf = pd.DataFrame({'feature':list(titanic_train_data_X),
                                              'importance':rf_grid.best_estimator_.feature_importances_}).sort_values('importance',ascending=False)
        features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
        print('Sample 10 Features from RF Classifier')
        print(str(features_top_n_rf[:10]))
        
        #AdaBoost
        ada_est = AdaBoostClassifier(random_state=0)
        ada_param_grid = {'n_estimators':[500],'learning_rate':[0.01,0.1]}
        ada_grid = model_selection.GridSearchCV(ada_est,ada_param_grid,n_jobs=25,cv=10,verbose=1)
        ada_grid.fit(titanic_train_data_X,titanic_train_data_Y)
        print('Top N Features Best Ada Params:' + str(ada_grid.best_params_))
        print('Top N Features Best Ada Score:' + str(ada_grid.best_score_))
        print('Top N Features Ada Train Score:' + str(ada_grid.score(titanic_train_data_X,titanic_train_data_Y)))
        feature_imp_sorted_ada = pd.DataFrame({'feature':list(titanic_train_data_X),
                                               'importance':ada_grid.best_estimator_.feature_importances_}).sort_values('importance',ascending=False)
        features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
        print('Sample 10 Features from Ada Classifier:')
        print(str(features_top_n_ada[:10]))
        
        #ExtraTree
        et_est = ExtraTreesClassifier(random_state=0)
        et_param_grid = {'n_estimators':[500],'min_samples_split':[3,4],'max_depth':[20]}
        et_grid = model_selection.GridSearchCV(et_est,et_param_grid,n_jobs=25,cv=10,verbose=1)
        et_grid.fit(titanic_train_data_X,titanic_train_data_Y)
        print('Top N Features Best ET Params:' + str(et_grid.best_params_))
        print('Top N Features Best ET Score:' + str(et_grid.best_score_))
        print('Top N Features ET Train Score:' + str(et_grid.score(titanic_train_data_X,titanic_train_data_Y)))
        feature_imp_sorted_et = pd.DataFrame({'feature':list(titanic_train_data_X),
                                              'importance':et_grid.best_estimator_.feature_importances_}).sort_values('importance',ascending=False)
        features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
        print('Sample 10 Features from ET Classifier:')
        print(str(features_top_n_et[:10]))
        
        # GradientBoosting
        gb_est = GradientBoostingClassifier(random_state=0)
        gb_param_grid = {'n_estimators':[500],'learning_rate':[0.01,0.1],'max_depth':[20]}
        gb_grid = model_selection.GridSearchCV(gb_est,gb_param_grid,n_jobs=25,cv=10,verbose=1)
        gb_grid.fit(titanic_train_data_X,titanic_train_data_Y)
        print('Top N Features Best GB Params:' + str(gb_grid.best_params_))
        print('Top N Features Best GB Score:' + str(gb_grid.best_score_))
        print('Top N Features GB Train Score:' + str(gb_grid.score(titanic_train_data_X,titanic_train_data_Y)))
        feature_imp_sorted_gb = pd.DataFrame({'feature':list(titanic_train_data_X),
                                              'importance':gb_grid.best_estimator_.feature_importances_}).sort_values('importance',ascending=False)
        features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
        print('Sample 10 Features from GB Classifier:')
        print(str(features_top_n_gb[:10]))
        
        # DecisionTree
        dt_est = DecisionTreeClassifier(random_state=0)
        dt_param_grid = {'min_samples_split':[2,4],'max_depth':[20]}
        dt_grid = model_selection.GridSearchCV(dt_est,dt_param_grid,n_jobs=25,cv=10,verbose=1)
        dt_grid.fit(titanic_train_data_X,titanic_train_data_Y)
        print('Top N Features Best DT Params:' + str(dt_grid.best_params_))
        print('Top N Features Best DT Score:' + str(dt_grid.best_score_))
        print('Top N Features DT Train Score:' + str(dt_grid.score(titanic_train_data_X,titanic_train_data_Y)))
        feature_imp_sorted_dt = pd.DataFrame({'feature':list(titanic_train_data_X),
                                              'importance':dt_grid.best_estimator_.feature_importances_}).sort_values('importance',ascending=False)
        features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
        print('Sample 10 Features from DT Classifier:')
        print(str(features_top_n_dt[:10]))
        
        # merge the five models' selections and importances
        features_top_n = pd.concat([features_top_n_rf,features_top_n_ada,features_top_n_et,features_top_n_gb,features_top_n_dt],
                                  ignore_index=True).drop_duplicates()
        features_importance = pd.concat([feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_et,
                                         feature_imp_sorted_gb,feature_imp_sorted_dt],ignore_index=True)
        
        return features_top_n,features_importance
    
    
    # 6.2 Build the training and test sets from the selected features
    # Feature engineering can generate a large number of features, and many of them are correlated with one another. Too many features slow down training and may also cause the model to overfit, so when there are too many we can use several different models to score the features and keep only the top n we want.
    
    feature_to_pick = 30
    feature_top_n,feature_importance = get_top_n_features(titanic_train_data_X,titanic_train_data_Y,feature_to_pick)
    titanic_train_data_X = pd.DataFrame(titanic_train_data_X[feature_top_n])
    titanic_test_data_X = pd.DataFrame(titanic_test_data_X[feature_top_n])
    
    Fitting 10 folds for each of 2 candidates, totalling 20 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    6.4s remaining:    3.4s
    [Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    7.7s finished
    
    
    Top N Features Best RF Params:{'max_depth': 20, 'min_samples_split': 3, 'n_estimators': 500}
    Top N Features Best RF Score:0.8271785268414481
    Top N Features RF Train Score:0.9764309764309764
    Sample 10 Features from RF Classifier:
    1               Age
    17      Name_length
    2              Fare
    8             Sex_1
    9             Title
    11          Title_0
    7             Sex_0
    29      Family_Size
    0            Pclass
    33    Ticket_Letter
    Name: feature, dtype: object
    Fitting 10 folds for each of 2 candidates, totalling 20 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    3.7s remaining:    2.0s
    [Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    5.0s finished
    
    
    Top N Features Best Ada Params:{'learning_rate': 0.01, 'n_estimators': 500}
    Top N Features Best Ada Score:0.8181897627965042
    Top N Features Ada Train Score:0.8204264870931538
    Sample 10 Features from Ada Classifier:
    11                   Title_0
    2                       Fare
    30    Family_Size_Category_0
    29               Family_Size
    7                      Sex_0
    0                     Pclass
    3                      Cabin
    8                      Sex_1
    17               Name_length
    1                        Age
    Name: feature, dtype: object
    Fitting 10 folds for each of 2 candidates, totalling 20 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    3.5s remaining:    1.9s
    [Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    4.0s finished
    
    
    Top N Features Best ET Params:{'max_depth': 20, 'min_samples_split': 4, 'n_estimators': 500}
    Top N Features Best ET Score:0.8237952559300874
    Top N Features ET Train Score:0.9708193041526375
    Sample 10 Features from ET Classifier:
    11          Title_0
    7             Sex_0
    8             Sex_1
    17      Name_length
    1               Age
    2              Fare
    3             Cabin
    9             Title
    33    Ticket_Letter
    13          Title_2
    Name: feature, dtype: object
    Fitting 10 folds for each of 2 candidates, totalling 20 fits
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:   12.3s remaining:    6.6s
    [Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:   12.7s finished
    
    
    Top N Features Best GB Params:{'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 500}
    Top N Features Best GB Score:0.7835081148564295
    Top N Features GB Train Score:0.9966329966329966
    Sample 10 Features from GB Classifier:
    11                   Title_0
    1                        Age
    2                       Fare
    17               Name_length
    30    Family_Size_Category_0
    29               Family_Size
    0                     Pclass
    9                      Title
    28                  Pclass_5
    33             Ticket_Letter
    Name: feature, dtype: object
    Fitting 10 folds for each of 2 candidates, totalling 20 fits
    Top N Features Best DT Params:{'max_depth': 20, 'min_samples_split': 4}
    Top N Features Best DT Score:0.7823220973782771
    Top N Features DT Train Score:0.9607182940516273
    Sample 10 Features from DT Classifier:
    11                   Title_0
    1                        Age
    2                       Fare
    17               Name_length
    30    Family_Size_Category_0
    16                   Title_5
    28                  Pclass_5
    0                     Pclass
    33             Ticket_Letter
    29               Family_Size
    Name: feature, dtype: object
    
    
    [Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
    [Parallel(n_jobs=25)]: Done  13 out of  20 | elapsed:    0.1s remaining:    0.0s
    [Parallel(n_jobs=25)]: Done  20 out of  20 | elapsed:    0.1s finished
    
    # Visualize the feature rankings produced by the different algorithms:
    
    n_feats = feature_importance.shape[0] // 5   # five models were concatenated, one block of rows per model
    rf_feature_imp = feature_importance[:10]
    Ada_feature_imp = feature_importance[n_feats:n_feats+10].reset_index(drop=True)
    
    # make importances relative to max importance
    rf_feature_importance = 100.0 * (rf_feature_imp['importance'] / rf_feature_imp['importance'].max())
    Ada_feature_importance = 100.0 * (Ada_feature_imp['importance'] / Ada_feature_imp['importance'].max())
    
    # Get the indices of all features with nonzero relative importance
    rf_important_idx = np.where(rf_feature_importance)[0]
    Ada_important_idx = np.where(Ada_feature_importance)[0]
    
    # Adapted from http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
    pos = np.arange(rf_important_idx.shape[0]) + .5
    
    plt.figure(1, figsize = (18, 8))
    
    plt.subplot(121)
    plt.barh(pos, rf_feature_importance[rf_important_idx][::-1])
    plt.yticks(pos, rf_feature_imp['feature'][::-1])
    plt.xlabel('Relative Importance')
    plt.title('RandomForest Feature Importance')
    
    plt.subplot(122)
    plt.barh(pos, Ada_feature_importance[Ada_important_idx][::-1])
    plt.yticks(pos, Ada_feature_imp['feature'][::-1])
    plt.xlabel('Relative Importance')
    plt.title('AdaBoost Feature Importance')
    
    plt.show()
    

    [Figure: RandomForest and AdaBoost relative feature importance bar charts]

    # 6.3 Model Ensemble
    # Common ensemble methods include Bagging, Boosting, Stacking, and Blending.
    
    # 6.3.1 Bagging
    # Bagging combines several models -- the base learners -- by simple weighted averaging or voting over their predictions. Its advantage is that the base learners can be trained in parallel. Random Forest is built on the Bagging idea; a minimal sketch follows.
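    # A minimal Bagging sketch (my illustration, not part of the original post): train
    # many bootstrap copies of a base learner in parallel and let them vote.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    
    bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=6),   # base learner
                                n_estimators=100,    # number of bootstrap copies
                                max_samples=0.8,     # fraction of samples drawn per copy
                                n_jobs=-1,           # copies are independent, so train in parallel
                                random_state=0)
    # usage (names from this notebook): bagging.fit(x_train, y_train); bagging.predict(x_test)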
    
    # 6.3.2 Boosting
    # Boosting works like learning from one's mistakes: each base learner builds on the previous one and tries to correct its errors. AdaBoost and Gradient Boosting, which we use below, follow this idea; a simplified reweighting sketch follows.
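    # A rough sketch of the boosting idea (my illustration, not the original post):
    # each new stump is fitted with larger weights on the samples the current
    # ensemble gets wrong -- a simplified AdaBoost-style reweighting loop.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    
    def boosting_sketch(X, y, n_rounds=10):
        w = np.ones(len(y)) / len(y)                 # start from uniform sample weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            miss = stump.predict(X) != y
            err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)   # weighted error rate
            alpha = 0.5 * np.log((1 - err) / err)    # more accurate stumps get a larger say
            w *= np.exp(alpha * miss)                # up-weight the misclassified samples
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas                        # final prediction = weighted vote of the stumps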
    
    # 6.3.3 Stacking
    # Stacking trains a new second-level learner to combine the first-level base learners. If Bagging is a linear combination of base classifiers, Stacking is a non-linear one; learners can be stacked layer upon layer into a network-like structure. Stacking generally improves accuracy somewhat over the previous two approaches, so it is the fusion method we use below.
    
    # 6.3.4 Blending
    # Blending is very similar to Stacking, but it also guards against information leakage; a minimal sketch follows.
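    # A minimal Blending sketch (my illustration, not the original post): instead of
    # K-fold out-of-fold predictions, hold out a slice of the training set; the base
    # learner never sees the holdout, so its predictions leak no labels into level 2.
    # (With several base learners you would concatenate one probability column each.)
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    
    def blending_sketch(X, y, X_test):
        X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
        base = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_base, y_base)
        meta_train = base.predict_proba(X_hold)[:, 1].reshape(-1, 1)   # level-2 features
        meta_test = base.predict_proba(X_test)[:, 1].reshape(-1, 1)
        level2 = LogisticRegression().fit(meta_train, y_hold)
        return level2.predict(meta_test)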
    
    # Stacking ensemble: here we use two levels of model fusion.
    
    # Level 1 uses seven models: Random Forest, AdaBoost, ExtraTrees, GBDT, Decision Tree, KNN, and SVM.
    
    # Level 2 uses XGBoost, which takes the first-level predictions as features to predict the final result.
    
    # Level 1:
    
    # In a Stacking framework, the predictions of the base classifiers are stacked and fed to the second-level model as training input. However, we cannot simply train the base models on the full training data, predict on the same data, and pass those outputs to the second level: training on the Train Data and then predicting on the Train Data would cause label leakage. To avoid leakage, we run K-fold cross-validation for each base learner and stitch together the K models' predictions on their validation folds as the input to the next level.
    
    # So here we define a function that produces these out-of-fold predictions:
    
    from sklearn.model_selection import KFold
     
    # Some useful parameters which will come in handy later on
    ntrain = titanic_train_data_X.shape[0]
    ntest = titanic_test_data_X.shape[0]
    SEED = 0 #for reproducibility
    NFOLDS = 7 # set folds for out-of-fold prediction
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)   # shuffle so random_state takes effect (newer sklearn rejects random_state with shuffle=False)
     
    def get_out_fold(clf,x_train,y_train,x_test):
        oof_train = np.zeros((ntrain,))
        oof_test = np.zeros((ntest,))
        oof_test_skf = np.empty((NFOLDS,ntest))
        
        for i, (train_index,test_index) in enumerate(kf.split(x_train)):
            x_tr = x_train[train_index]
            y_tr = y_train[train_index]
            x_te = x_train[test_index]
            
            clf.fit(x_tr,y_tr)
            
            oof_train[test_index] = clf.predict(x_te)
            oof_test_skf[i,:] = clf.predict(x_test)
            
        oof_test[:] = oof_test_skf.mean(axis=0)
        return oof_train.reshape(-1,1),oof_test.reshape(-1,1)
    
    # Build the base learners. We use seven: RandomForest, AdaBoost, ExtraTrees, GBDT, DecisionTree, KNN, and SVM. (Each model's hyperparameters could be tuned with the GridSearch approach shown above.)
    
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.tree import DecisionTreeClassifier
     
    rf = RandomForestClassifier(n_estimators=500,warm_start=True,max_features='sqrt',max_depth=6,min_samples_split=3,min_samples_leaf=2,n_jobs=-1,verbose=0)
     
    ada = AdaBoostClassifier(n_estimators=500,learning_rate=0.1)
     
    et = ExtraTreesClassifier(n_estimators=500,n_jobs=-1,max_depth=8,min_samples_leaf=2,verbose=0)
     
    gb = GradientBoostingClassifier(n_estimators=500,learning_rate=0.008,min_samples_split=3,min_samples_leaf=2,max_depth=5,verbose=0)
     
    dt = DecisionTreeClassifier(max_depth=8)
     
    knn = KNeighborsClassifier(n_neighbors=2)
     
    svm = SVC(kernel='linear',C=0.025)
    
    
    # Convert the pandas DataFrames of train, test and target (Survived) to NumPy arrays to feed into our models
    x_train = titanic_train_data_X.values   #Creates an array of the train data
    x_test = titanic_test_data_X.values   #Creates an array of the test data
    y_train = titanic_train_data_Y.values
    
    # Create our OOF train and test predictions. These base results will be used as new features
    rf_oof_train,rf_oof_test = get_out_fold(rf,x_train,y_train,x_test)  # Random Forest
    ada_oof_train,ada_oof_test = get_out_fold(ada,x_train,y_train,x_test)  # AdaBoost
    et_oof_train,et_oof_test = get_out_fold(et,x_train,y_train,x_test)  # Extra Trees
    gb_oof_train,gb_oof_test = get_out_fold(gb,x_train,y_train,x_test)  # Gradient Boost
    dt_oof_train,dt_oof_test = get_out_fold(dt,x_train,y_train,x_test)  #Decision Tree
    knn_oof_train,knn_oof_test = get_out_fold(knn,x_train,y_train,x_test)  # KNeighbors
    svm_oof_train,svm_oof_test = get_out_fold(svm,x_train,y_train,x_test)  # Support Vector
     
    print("Training is complete")
    
    Training is complete
    
    
    
    # 6.4 Predict and generate the submission file
    # Level 2: we use XGBoost, taking the first-level predictions as features to predict the final result.
    
    x_train = np.concatenate((rf_oof_train,ada_oof_train,et_oof_train,gb_oof_train,dt_oof_train,knn_oof_train,svm_oof_train),axis=1)
    x_test =np.concatenate((rf_oof_test,ada_oof_test,et_oof_test,gb_oof_test,dt_oof_test,knn_oof_test,svm_oof_test),axis=1)
    
    from xgboost import XGBClassifier
     
    gbm = XGBClassifier(n_estimators=200,max_depth=4,min_child_weight=2,gamma=0.9,subsample=0.8,
                        colsample_bytree=0.8,objective='binary:logistic',nthread=-1,scale_pos_weight=1).fit(x_train,y_train)
    predictions = gbm.predict(x_test)
    
    StackingSubmission = pd.DataFrame({'PassengerId':PassengerId,'Survived':predictions})
    StackingSubmission.to_csv('StackingSubmission.csv',index=False,sep=',')
    
    # 7. Validation: learning curves
    
    # As we keep doing feature engineering and producing more and more features, training the model on this large feature set fits the training data better and better, but the model may gradually lose its ability to generalize, perform poorly on the test data, and overfit. Of course, poor performance on the prediction set may also come from poor performance on the training set itself, i.e. underfitting. Below are the four kinds of learning curves shown in Andrew Ng's machine learning course:
    # [Figure: the four learning-curve shapes, from the course notes]
    
    
    
    # In the figures above, the red line is the test error (cross-validation error) and the blue line the train error. We could equally plot accuracy instead of error, in which case the curves would be flipped upside down (score = 1 - error).
    
    # Note that our figures show error curves.
    
    # Top left is the ideal case: as the number of samples grows, the train error rises somewhat while the test error drops markedly;
    
    # Top right is the worst case: the train error is large, the model has learned essentially nothing from the features, and the test error is huge; the model can barely predict anything, so we have to look for causes in the data itself and in the training stage;
    
    # Bottom left is high variance: the train error is low, but the model has overfitted and lacks generalization, so the test error is high;
    
    # Bottom right is high bias: the train error is high, and we need to adjust the model's parameters to bring it down.
    
    # The learning curve therefore tells us what state the model is in and what to do about it. Placing validation at the end of this write-up does not mean it is done last: for each first-level base learner in our Stacking framework we should inspect its learning curve so as to tune its hyperparameters better and improve the final result. Define a function to plot learning curves:
    
    from sklearn.model_selection import learning_curve
    # from sklearn.learning_curve import learning_curve
     
    def plot_learning_curve(estimator,title,X,y,ylim=None,cv=None,
                            n_jobs=1,train_sizes=np.linspace(.1,1.0,5),verbose=0):
        """
        Generate a simple plot of the test and training learning curve.
        
        Parameters
        ----------
        estimator : object type that implements the "fit" and "predict" methods
            An object of that type which is cloned for each validation.
        
        title : string
            Title for the chart.
        
        X : array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.
        
        y : array-like, shape (n_samples) or (n_samples, n_features), optional
            Target relative to X for classification or regression;
            None for unsupervised learning.
        
        ylim : tuple, shape (ymin, ymax), optional
            Defines the minimum and maximum y values plotted.
        
        cv : integer or cross-validation generator, optional
            If an integer is passed, it is the number of folds (defaults to 3).
            Specific cross-validation objects can also be passed; see the
            sklearn.model_selection module for the list of possible objects.
        
        n_jobs : integer, optional
            Number of jobs to run in parallel (default 1).
        """
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Training examples")
        plt.ylabel("Score")
        train_sizes,train_scores,test_scores = learning_curve(estimator,X,y,cv=cv,
                                                              n_jobs=n_jobs,train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores,axis=1)
        train_scores_std = np.std(train_scores,axis=1)
        test_scores_mean = np.mean(test_scores,axis=1)
        test_scores_std = np.std(test_scores,axis=1)
        plt.grid()
        
        plt.fill_between(train_sizes,train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std,alpha=0.1,color='r')
        plt.fill_between(train_sizes,test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std,alpha=0.1,color='g')
        plt.plot(train_sizes,train_scores_mean,'o-',color="r",label="Training score")
        plt.plot(train_sizes,test_scores_mean,'o-',color="g",label="Cross-validation score")
        
        plt.legend(loc="best")
        return plt
    
    X = titanic_train_data_X.values   # level-1 learning curves should use the original features; x_train was overwritten with the OOF matrix above
    Y = y_train
     
    # RandomForest
    rf_parameters = {'n_jobs':-1,'n_estimators':500,'warm_start':True,'max_depth':6,
                     'min_samples_leaf':2,'max_features':'sqrt','verbose':0}
     
    # AdaBoost
    ada_parameters = {'n_estimators':500,'learning_rate':0.1}
     
    # ExtraTrees
    et_parameters = {'n_jobs':-1,'n_estimators':500,'max_depth':8,'min_samples_leaf':2,'verbose':0}
     
    # GradientBoosting
    gb_parameters = {'n_estimators':500,'max_depth':5,'min_samples_leaf':2,'verbose':0}
     
    # DecisionTree
    dt_parameters = {'max_depth':8}
     
    # KNeighbors
    knn_parameters = {'n_neighbors':2}
     
    # SVM
    svm_parameters = {'kernel':'linear','C':0.025}
     
    # XGB
    gbm_parameters = {'n_estimators':2000,'max_depth':4,'min_child_weight':2,'gamma':0.9,'subsample':0.8,
                      'colsample_bytree':0.8,'objective':'binary:logistic','nthread':-1,'scale_pos_weight':1}
     
     
    title = "Learning Curves"
    plot_learning_curve(RandomForestClassifier(**rf_parameters),title,X,Y,cv=None,n_jobs=4,
                        train_sizes=[50,100,150,200,250,350,400,450,500])
    plt.show()
    

    [Figure: learning curve for the RandomForest base learner]

    # The analysis above shows that the RandomForest model has some problems here, so we need to tune its hyperparameters to get a better result.
    
    # 8. Hyperparameter tuning
    # Submitting the generated file to Kaggle gives a score of 0.79425.
    
    # xgboost stacking: 0.78468
    
    # voting bagging: 0.79904
    
    # This shows our stacking model still has plenty of room for improvement, so we can work on the following to raise prediction accuracy:
    
    # Feature engineering: look for better features and drop heavily redundant ones;
    
    # Model hyperparameter tuning: correct underfitting or overfitting;
    
    # Improving the model framework: make better choices for each layer of the stacking framework;
    
    # Tuning is a matter of patient trial and error... a grid-search starting point for the level-2 model is sketched below.
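    # One possible starting point (a sketch, not the original author's tuning): grid-search
    # the level-2 XGBoost over the stacked OOF features built above. The parameter ranges
    # here are illustrative guesses, not tuned values.
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier
    
    xgb_param_grid = {'n_estimators': [100, 200, 500],
                      'max_depth': [2, 4, 6],
                      'learning_rate': [0.01, 0.05, 0.1]}
    xgb_search = GridSearchCV(XGBClassifier(objective='binary:logistic'),
                              xgb_param_grid, cv=5, n_jobs=-1, verbose=1)
    xgb_search.fit(x_train, y_train)   # x_train here is the level-2 (OOF) feature matrix
    print('Best XGB params: ' + str(xgb_search.best_params_))
    print('Best XGB CV score: ' + str(xgb_search.best_score_))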
    
  • Logistic Regression in Practice — Kaggle_Titanic 2

    2017-12-13 09:13:07
    Kaggle_Titanic

    Data source: https://www.kaggle.com/c/titanic

    Training

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline
    
    train_data = pd.read_csv('train.csv')
    
    count_survivors = pd.value_counts(train_data['Survived'])
    count_survivors.plot(kind='bar')
    plt.xlabel('Is_survived')
    plt.ylabel('Number of People')
    plt.title('Survivor histogram')
    

    [Figure: survivor histogram]

    from sklearn.preprocessing import StandardScaler
    
    train_data['Sex'] = train_data['Sex'].map({'female':0, 'male':1})
    
    age_avg = np.nanmean(train_data['Age'])   # mean of the non-missing ages (the original counted NaNs as 0, biasing the mean low)
    train_data['Age'] = [age_avg if np.isnan(item) else item for item in train_data['Age']]
    train_data['Age'] = StandardScaler().fit_transform(train_data['Age'].values.reshape(-1,1))
    
    train_data['SibSp'] = StandardScaler().fit_transform(train_data['SibSp'].values.reshape(-1,1))
    train_data['Parch'] = StandardScaler().fit_transform(train_data['Parch'].values.reshape(-1,1))
    train_data['Fare'] = StandardScaler().fit_transform(train_data['Fare'].values.reshape(-1,1))
    
    train_data['Embarked'] = train_data['Embarked'].map({'S':1, 'C':2, 'Q':3})
    pier = [0 if np.isnan(item) else item for item in train_data['Embarked']]   # 0 marks a missing port
    train_data['Embarked'] = [max(set(pier), key=pier.count) if item == 0 else item for item in pier]   # fill missing with the mode
    
    train_data = train_data.drop(columns=['Name','Ticket','Cabin','PassengerId'])
    
    c:\python27\lib\site-packages\sklearn\utils\validation.py:475: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
      warnings.warn(msg, DataConversionWarning)
    
    X = train_data.loc[:, train_data.columns != 'Survived']
    Y = train_data.loc[:, train_data.columns == 'Survived']
    
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import KFold
    from sklearn.metrics import recall_score,confusion_matrix
    
    def getBestC(X, Y):
        folds = KFold(len(Y), 5)
        c_param_range = [0.01,0.1,1,10,100]
    
        results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
        results_table['C_parameter'] = c_param_range
    
    
        for i in range(len(c_param_range)):
            print '******** c_param = %.2f ********' % c_param_range[i]
            recall_accs = []
            for iteration, fold in enumerate(folds, start=1):
                lr = LogisticRegression(C = c_param_range[i], penalty = 'l1')
            lr.fit(X.iloc[fold[0]].values, Y.iloc[fold[0]].values.ravel())   # ravel: sklearn expects a 1-D target
                Y_hat = lr.predict(X.iloc[fold[1]].values)
                recall_acc = recall_score(Y.iloc[fold[1]].values, Y_hat)
                recall_accs.append(recall_acc)
    
                print 'Iteration %d: recall score = %f' % (iteration,recall_acc)
    
        results_table.loc[i,'Mean recall score'] = np.mean(recall_accs)
            print '\nMean recall score %f\n' % np.mean(recall_accs)
            
        best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
        print '--------------------------------\nbest_c = %.2f' % best_c
        return best_c
    
    
    c:\python27\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    
    best_c = getBestC(X, Y)
    
    ******** c_param = 0.01 ********
    Iteration 1: recall score = 0.000000
    Iteration 2: recall score = 0.000000
    Iteration 3: recall score = 0.000000
    Iteration 4: recall score = 0.000000
    Iteration 5: recall score = 0.000000
    
    Mean recall score 0.000000
    
    ******** c_param = 0.10 ********
    Iteration 1: recall score = 0.694915
    Iteration 2: recall score = 0.683544
    Iteration 3: recall score = 0.681159
    Iteration 4: recall score = 0.583333
    Iteration 5: recall score = 0.698413
    
    Mean recall score 0.668273
    
    ******** c_param = 1.00 ********
    Iteration 1: recall score = 0.745763
    Iteration 2: recall score = 0.708861
    Iteration 3: recall score = 0.710145
    Iteration 4: recall score = 0.597222
    Iteration 5: recall score = 0.746032
    
    Mean recall score 0.701604
    
    ******** c_param = 10.00 ********
    Iteration 1: recall score = 0.745763
    Iteration 2: recall score = 0.708861
    Iteration 3: recall score = 0.739130
    Iteration 4: recall score = 0.597222
    Iteration 5: recall score = 0.761905
    
    Mean recall score 0.710576
    
    ******** c_param = 100.00 ********
    Iteration 1: recall score = 0.745763
    Iteration 2: recall score = 0.708861
    Iteration 3: recall score = 0.739130
    Iteration 4: recall score = 0.597222
    Iteration 5: recall score = 0.761905
    
    Mean recall score 0.710576
    
    --------------------------------
    best_c = 10.00
    
    
    
    def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=0)
        plt.yticks(tick_marks, classes)
    
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j],
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
    
    import itertools
    
    lr = LogisticRegression(C = best_c, penalty = 'l1')
    lr.fit(X.values, Y.values.ravel())
    Y_hat = lr.predict(X.values)
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y, Y_hat)
    #np.set_printoptions(precision=2)
    
    print "Recall value in training dataset: %f" % (1.0*cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    
    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()
    
    Recall value in training dataset: 0.710526
    

    [Figure: confusion matrix on the training set]

    lr = LogisticRegression(C = best_c, penalty = 'l1')
    lr.fit(X.values, Y.values.ravel())
    Y_hat_proba = lr.predict_proba(X.values)
    
    thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    
    plt.figure(figsize=(10,10))
    
    j = 1
    for i in thresholds:
        Y_hat = Y_hat_proba[:,1] > i
        
        plt.subplot(3,3,j)
        j += 1
        
        # Compute confusion matrix
        cnf_matrix = confusion_matrix(Y, Y_hat)
    
        print "Recall value in training dataset: %f, with threshold = %.1f" % ((1.0*cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1])), i)
    
        # Plot non-normalized confusion matrix
        class_names = [0,1]
        plot_confusion_matrix(cnf_matrix
                              , classes=class_names
                              , title='Threshold >= %s'%i) 
    
    Recall value in training dataset: 0.938596, with threshold = 0.1
    Recall value in training dataset: 0.850877, with threshold = 0.2
    Recall value in training dataset: 0.824561, with threshold = 0.3
    Recall value in training dataset: 0.757310, with threshold = 0.4
    Recall value in training dataset: 0.710526, with threshold = 0.5
    Recall value in training dataset: 0.646199, with threshold = 0.6
    Recall value in training dataset: 0.532164, with threshold = 0.7
    Recall value in training dataset: 0.371345, with threshold = 0.8
    Recall value in training dataset: 0.204678, with threshold = 0.9
    

    [Figure: confusion matrices for thresholds 0.1 to 0.9]

    Testing

    test_data = pd.read_csv('test.csv')
    
    test_data['Sex'] = test_data['Sex'].map({'female':0, 'male':1})
    
    age_avg = np.nanmean(test_data['Age'])   # mean of the non-missing ages
    test_data['Age'] = [age_avg if np.isnan(item) else item for item in test_data['Age']]
    test_data['Age'] = StandardScaler().fit_transform(test_data['Age'].values.reshape(-1,1))   # strictly, the scaler fitted on the train set should be reused here
    
    test_data['SibSp'] = StandardScaler().fit_transform(test_data['SibSp'].values.reshape(-1,1))
    test_data['Parch'] = StandardScaler().fit_transform(test_data['Parch'].values.reshape(-1,1))
    
    fare_avg = np.nanmean(test_data['Fare'])   # mean of the non-missing fares
    test_data['Fare'] = [fare_avg if np.isnan(item) else item for item in test_data['Fare']]
    test_data['Fare'] = StandardScaler().fit_transform(test_data['Fare'].values.reshape(-1,1))
    
    test_data['Embarked'] = test_data['Embarked'].map({'S':1, 'C':2, 'Q':3})
    pier = [0 if np.isnan(item) else item for item in test_data['Embarked']]
    test_data['Embarked'] = [max(set(pier), key=pier.count) if item == 0 else item for item in pier]
    
    test_data = test_data.drop(columns=['Name','Ticket','Cabin'])
    
    test_data.head()
    
       PassengerId  Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
    0          892       3    1  0.428099 -0.499470 -0.400248 -0.498403         3
    1          893       3    0  1.399492  0.616992 -0.400248 -0.513271         1
    2          894       2    1  2.565163 -0.499470 -0.400248 -0.465085         3
    3          895       3    1 -0.154736 -0.499470 -0.400248 -0.483463         1
    4          896       3    0 -0.543293  0.616992  0.619896 -0.418468         1
    train_data.head()
    
       Survived  Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
    0         0       3    1 -0.494245  0.432793 -0.473674 -0.502445       1.0
    1         1       1    0  0.717307  0.432793 -0.473674  0.786845       2.0
    2         1       3    0 -0.191357 -0.474545 -0.473674 -0.488854       1.0
    3         1       1    0  0.490141  0.432793 -0.473674  0.420730       1.0
    4         0       3    1  0.490141 -0.474545 -0.473674 -0.486337       1.0
    X_test = test_data.drop(['PassengerId'], axis=1)
    
    X_test.head()
    
       Pclass  Sex       Age     SibSp     Parch      Fare  Embarked
    0       3    1  0.428099 -0.499470 -0.400248 -0.498403         3
    1       3    0  1.399492  0.616992 -0.400248 -0.513271         1
    2       2    1  2.565163 -0.499470 -0.400248 -0.465085         3
    3       3    1 -0.154736 -0.499470 -0.400248 -0.483463         1
    4       3    0 -0.543293  0.616992  0.619896 -0.418468         1
    lr = LogisticRegression(C = best_c, penalty = 'l1')
    lr.fit(X.values, Y.values.ravel())
    Y_hat_proba = lr.predict_proba(X_test.values)
    Y_hat = [1 if y > 0.6 else 0 for y in Y_hat_proba[:,1]]
    
    results = pd.DataFrame(Y_hat, columns=['Survived'])
    results.insert(0, 'PassengerId', test_data['PassengerId'])
    results.to_csv('results.csv', index=False)   # Kaggle submissions need only the PassengerId and Survived columns
    