
    This article runs experiments on the Adult dataset, using its 14 basic attributes to predict whether income is above 50K. Two methods are compared: a random forest and a support vector machine. The raw data is preprocessed by handling outliers and missing entries and by digitizing the attributes, yielding the experiment data. Experiment 1 runs ten-fold cross-validation, taking the average of the ten runs as the final result, with accuracy as the metric, and compares the two models. Experiment 2 applies the hold-out method twice to split the data into training, validation, and test sets; parameters are tuned using validation-set accuracy as the criterion, and the recall, precision, and F1 score of both models are reported. Both experiments show that on the Adult dataset the random forest classifies better than the support vector machine.

    1.  Adult dataset preprocessing (for the preprocessing code, see the dealdata function in Experiment 2)

    1) Remove missing values: the raw data contains missing entries, which are cleaned out by deleting the affected records. The raw data has 32,561 records; 30,162 remain after cleaning, with 7,508 positive and 22,654 negative examples.

    2) Digitize the attributes: the raw file describes attributes as text, which must be converted to numbers before being used as model input (see the sketch below).
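    A minimal illustration of these two steps, assuming the CSV marks missing fields with "?" as the UCI Adult distribution does; pd.factorize stands in for the enumerate-based mapping used in Experiment 2's dealdata:

    import pandas as pd

    # Read the raw file; treat the UCI "?" marker as a missing value (assumption)
    raw = pd.read_csv("adultdata.csv", header=None, na_values="?", skipinitialspace=True)
    cleaned = raw.dropna()  # 1) drop records with missing values

    # 2) digitize: map each distinct text category to an integer code
    digitized = cleaned.copy()
    for col in digitized.columns:
        if digitized[col].dtype == object:
            digitized[col] = pd.factorize(digitized[col])[0]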

    2.  Experiment 1: ten-fold cross-validation with accuracy as the metric

    Ten-fold cross-validation is a standard testing method. The dataset is split into ten parts; in turn, nine parts are used as training data and one part as test data, and the experiment is run ten times.

    Each run yields an accuracy (or error rate), and the average of the ten accuracies (or error rates) is taken as the estimate of the algorithm's performance.

    import pandas as pd
    import numpy as np
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold
    import warnings
    
    warnings.filterwarnings("ignore")
    # Load the cleaned, digitized data produced by the preprocessing step
    adult_digitization = pd.read_csv("data_cleaned.csv")
    # Build the inputs and the target
    X = adult_digitization[
        ['age', 'workclass', 'fnlwgt', 'education', 'education_number', 'marriage', 'occupation', 'relationship',
         'race',
         'sex', 'capital_gain', 'apital_loss', 'hours_per_week', 'native_country']]
    Y = adult_digitization[['income']]
    # Ten-fold cross-validation
    preaccrf = []
    preaccsvm = []
    num = 1
    kf = KFold(n_splits=10)
    for train, test in kf.split(X):
        # KFold yields positional indices, so use iloc rather than loc
        X_train, X_test = X.iloc[train], X.iloc[test]
        Y_train, Y_test = Y.iloc[train], Y.iloc[test]
        rf = RandomForestClassifier(oob_score=False, random_state=10, criterion='entropy', n_estimators=400)
        rf.fit(X_train, Y_train)
        test_predictions = rf.predict(X_test)
        accuracy = accuracy_score(Y_test, test_predictions)
        preaccrf.append(accuracy)
        print("随机森林"+str(num)+"测试集准确率:  %s " % accuracy)
        num = num + 1
    num = 1
    for train, test in kf.split(X):
        # KFold yields positional indices, so use iloc rather than loc
        X_train, X_test = X.iloc[train], X.iloc[test]
        Y_train, Y_test = Y.iloc[train], Y.iloc[test]
        clf = svm.SVC(kernel='rbf', C=1)
        clf.fit(X_train, Y_train)
        test_predictions = clf.predict(X_test)
        accuracy = accuracy_score(Y_test, test_predictions)
        preaccsvm.append(accuracy)
        print("支持向量机"+str(num)+"测试集准确率:  %s " % accuracy)
        num = num + 1
    print("随机森林十折交叉平均准确率:  %s " % np.mean(np.array(preaccrf)))
    print("支持向量机十折交叉平均准确率:  %s " % np.mean(np.array(preaccrf)))
    

    3.  Experiment 2: hold-out method with precision and recall as metrics

    To tune the model parameters and obtain the best classifier, Experiment 2 applies the hold-out method twice to construct training, validation, and test sets in a ratio of roughly 7:2:1 (a 70/30 split followed by a 70/30 split of the held-out part gives 70% / 21% / 9%). The training set is used to fit the classification models; accuracy on the validation set is compared across parameter settings and the best parameters are kept. The final evaluation on the test set reports recall, precision, and the F1 score.
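    For reference, with TP, FP, and FN counting true positives, false positives, and false negatives, the reported metrics are:

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F1        = 2 * precision * recall / (precision + recall)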

    import pandas as pd
    import numpy as np
    from sklearn import svm
    import joblib  # sklearn.externals.joblib is gone in modern scikit-learn
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    import warnings
    
    
    warnings.filterwarnings("ignore")
    # Preprocess the data and build the train/validation/test sets
    
    
    def dealdata(filename):
        # Load the raw data
        adult_raw = pd.read_csv(filename, header=None)
        # print(len(adult_raw))
        # Add column names ('apital_loss' is kept as-is; the same name is used throughout)
        adult_raw.rename(columns={0: 'age', 1: 'workclass', 2: 'fnlwgt', 3: 'education', 4: 'education_number',
                                  5: 'marriage', 6: 'occupation', 7: 'relationship', 8: 'race', 9: 'sex',
                                  10: 'capital_gain', 11: 'apital_loss', 12: 'hours_per_week', 13: 'native_country',
                                  14: 'income'}, inplace=True)
        # Clean the data: drop records with missing values
        adult_cleaned = adult_raw.dropna()
    
        # Digitize the categorical attributes
        adult_digitization = pd.DataFrame()
        target_columns = ['workclass', 'education', 'marriage', 'occupation', 'relationship', 'race', 'sex',
                          'native_country',
                          'income']
        for column in adult_cleaned.columns:
            if column in target_columns:
                unique_value = list(enumerate(np.unique(adult_cleaned[column])))
                dict_data = {key: value for value, key in unique_value}
                adult_digitization[column] = adult_cleaned[column].map(dict_data)
            else:
                adult_digitization[column] = adult_cleaned[column]
        # Optionally make sure every column is of int type
        # for column in adult_digitization:
        #     adult_digitization[column] = adult_digitization[column].astype(int)
        # adult_digitization.to_csv("data_cleaned.csv")
        # print(len(adult_cleaned))
        # Build the inputs and the target
        X = adult_digitization[
            ['age', 'workclass', 'fnlwgt', 'education', 'education_number', 'marriage', 'occupation', 'relationship',
             'race',
             'sex', 'capital_gain', 'apital_loss', 'hours_per_week', 'native_country']]
        Y = adult_digitization[['income']]
        # Class balance: 0: 22654, 1: 7508
        # print(Y.value_counts())
        # First split (0.7 : 0.3): training set vs. the rest
        X_train, X_t_v, Y_train, Y_t_v = train_test_split(X, Y, test_size=0.3, random_state=0)
        # Overall ratio ≈ 7 : 2 : 1  train : validation : test
        X_validation, X_test, Y_validation, Y_test = train_test_split(X_t_v, Y_t_v, test_size=0.3, random_state=0)
        # validation
        X_train.to_csv("X_train.csv", index=None)
        X_validation.to_csv("X_validation.csv", index=None)
        X_test.to_csv("X_test.csv", index=None)
        Y_train.to_csv("Y_train.csv", index=None)
        Y_validation.to_csv("Y_validation.csv", index=None)
        Y_test.to_csv("Y_test.csv", index=None)
    def randomforestmodel():
        # Build the random forest model
        rf = RandomForestClassifier(oob_score=False, random_state=10, criterion='entropy', n_estimators=400)
        rf.fit(X_train, Y_train['income'])
        joblib.dump(rf, "rf.m")
        # Tuning can be scripted; parameters are chosen by validation-set accuracy
        # validation_predictions = rf.predict(X_validation)
        # print("Validation accuracy:  %s " % accuracy_score(Y_validation, validation_predictions))
        # Tuning 1, criterion  gini: 0.835333122829; entropy: 0.840069466372
        # Tuning 2, n_estimators (gini) 10: 0.835333122829; 20: 0.841806125671; 30: 0.842911272498; 40: 0.842437638143; 50: 0.845279444269
        # Tuning 2, n_estimators (gini) 100: 0.848437006631; 150: 0.848910640985; 200: 0.849068519103; 300: 0.848279128513; 400: 0.848437006631
        # Tuning 2, n_estimators (entropy) 10: 0.840069466372; 20: 0.84275339438; 30: 0.843700663088; 40: 0.84433217556; 50: 0.846226712978
        # Tuning 2, n_estimators (entropy) 100: 0.848121250395; 150: 0.848752762867; 200: 0.849857909694; 300: 0.849542153458; 400: 0.851594568993
    
    
    def svmmodel():
        # Build the SVM model
        for C in range(1, 10, 1):
            # C = 1
            clf = svm.SVC(kernel='linear', C=C)
            clf.fit(X_train, Y_train)
            joblib.dump(clf, "linear"+str(C)+"svm.m")
            validation_predictions = clf.predict(X_validation)
            print("C="+str(C)+":  验证集准确率:  %s " % accuracy_score(Y_validation, validation_predictions))
        # Tuning
        # kernel='rbf': 0.743132301863, 'linear': 0.775023681718
        # C = 1:  validation accuracy: 0.743132301863
        # C = 2:  validation accuracy: 0.743290179981
        # C = 3:  validation accuracy: 0.743132301863
        # C = 4:  validation accuracy: 0.743132301863
        # C = 5:  validation accuracy: 0.743132301863
        # C = 6:  validation accuracy: 0.743132301863
        # C = 7:  validation accuracy: 0.743132301863
        # C = 8:  validation accuracy: 0.743132301863
        # C = 9:  validation accuracy: 0.743132301863
    
    
    dealdata("adultdata.csv")
    X_train = pd.read_csv('X_train.csv')
    X_validation = pd.read_csv("X_validation.csv")
    X_test = pd.read_csv('X_test.csv')
    Y_train = pd.read_csv('Y_train.csv')
    Y_validation = pd.read_csv("Y_validation.csv")
    Y_test = pd.read_csv('Y_test.csv')
    # print('Y_train: ')                        # 0: 15890, 1: 5223
    # print(Y_train['income'].value_counts())
    # print('Y_validation: ')                   # 0: 4709, 1: 1625
    # print(Y_validation['income'].value_counts())
    # print('Y_test: ')                         # 0: 2055, 1: 660
    # print(Y_test['income'].value_counts())
    
    # Train the random forest model and persist it
    # randomforestmodel()
    # rf = joblib.load("rf.m")
    # Y_predictions1 = rf.predict(X_test)
    # print(classification_report(Y_test, Y_predictions1))
    # confmat = confusion_matrix(Y_test, Y_predictions1)
    # print(confmat)
    # Train the SVM model and persist it
    svmmodel()
    # rbf = joblib.load("rbfsvm.m")
    # Y_predictions2 = rbf.predict(X_test)
    # print(classification_report(Y_test, Y_predictions2))
    # confmat = confusion_matrix(Y_test, Y_predictions2)
    # print(confmat)
    # Random forest test-set results
    #              precision    recall  f1-score   support
    #
    #           0       0.89      0.93      0.91      2055
    #           1       0.73      0.63      0.68       660
    #
    # avg / total       0.85      0.85      0.85      2715
    # [[1904  151] [ 246  414]]
    
    # SVM test-set results
    #              precision    recall  f1-score   support
    #
    #           0       0.76      1.00      0.86      2055
    #           1       1.00      0.01      0.01       660
    #
    # avg / total       0.82      0.76      0.66      2715
    # [[2055    0] [ 656    4]]
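
    As a quick sanity check on these reports: in the random forest confusion matrix [[1904 151] [246 414]], class 1 has precision 414 / (414 + 151) ≈ 0.73 and recall 414 / (414 + 246) ≈ 0.63, matching the 0.73 / 0.63 row above. The SVM matrix shows only 4 true positives for class 1, which explains its class-1 recall of 0.01 despite a nominal precision of 1.00, and is why the random forest wins on this dataset.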
  • Analyzing red wine data with Python

    2018-06-20 12:39:00

    In this analysis I use a random forest classifier, together with data standardization and hyperparameter tuning, for a binary classification of good versus not-so-good red wines.

    First, import the packages:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    Load the data:

    data = pd.read_csv('winequality-red.csv')
    data.head()

    data.describe()

    The twelve columns are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

    Every column has 1599 values, so there are no missing values. Let's check for duplicates:

    extra = data[data.duplicated()]
    extra.shape

    There are 240 duplicate rows, but we keep them for now, since the quality ratings come from different tasters.

    Data visualization

    sns.set()
    data.hist(figsize=(10,10), color='red')
    plt.show()

    Only quality is a discrete variable, concentrated mainly at 5 and 6. Next, look at the correlations between the variables:

    colormap = plt.cm.viridis
    plt.figure(figsize=(12,12))
    plt.title('Correlation of Features', y=1.05, size=15)
    sns.heatmap(data.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, 
                linecolor='white', annot=True)

     

     

    Observation:
    Alcohol has the highest correlation with wine quality, followed by the various acidity measures, sulphates, density, and chlorides.

    Using a classifier:

    Split the wines into two groups, treating quality > 5 as "good wine".

    y = data.quality                  # set 'quality' as target
    X = data.drop('quality', axis=1)  # rest are features
    print(y.shape, X.shape)           # check correctness

     

    # Create a new y1
    y1 = (y > 5).astype(int)
    y1.head()

     

     # plot histogram
    ax = y1.plot.hist(color='green')
    ax.set_title('Wine quality distribution', fontsize=14)
    ax.set_xlabel('aggregated target value')

    Train a prediction model with the random forest classifier

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.metrics import confusion_matrix

    Split the data into training and test sets

    seed = 8 # set seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size=0.2,
                                                        random_state=seed)
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

    Train and evaluate the random forest classifier with cross-validation

    # Instantiate the Random Forest Classifier
    RF_clf = RandomForestClassifier(random_state=seed)
    RF_clf

    # Run k-fold cross-validation on the training set and look at the mean accuracy
    cv_scores = cross_val_score(RF_clf,X_train, y_train, cv=10, scoring='accuracy')
    print('The accuracy scores for the iterations are {}'.format(cv_scores))
    print('The mean accuracy score is {}'.format(cv_scores.mean()))

    Make predictions

    RF_clf.fit(X_train, y_train)
    pred_RF = RF_clf.predict(X_test)
    # Print 5 results to see
    for i in range(0,5):
        print('Actual wine quality is ', y_test.iloc[i], ' and predicted is ', pred_RF[i])

    Among the first five, there is one error. Let's look at the metrics.

    print(accuracy_score(y_test, pred_RF))  # metrics for the random forest predictions
    print(log_loss(y_test, pred_RF))

    print(confusion_matrix(y_test, pred_RF))

    There are 81 misclassifications in total.

    Compared with a logistic regression classifier, the random forest classifier does better.

    Let's tune the random forest classifier's hyperparameters

    from sklearn.model_selection import GridSearchCV
    grid_values = {'n_estimators':[50,100,200],'max_depth':[None,30,15,5],
                   'max_features':['auto','sqrt','log2'],'min_samples_leaf':[1,20,50,100]}
    grid_RF = GridSearchCV(RF_clf,param_grid=grid_values,scoring='accuracy')
    grid_RF.fit(X_train, y_train)

    grid_RF.best_params_

    RF_clf = RandomForestClassifier(n_estimators=100,random_state=seed)
    RF_clf.fit(X_train,y_train)
    pred_RF = RF_clf.predict(X_test)
    print(accuracy_score(y_test,pred_RF))
    print(log_loss(y_test,pred_RF))

    print(confusion_matrix(y_test,pred_RF))

    With hyperparameter tuning, the random forest classifier's accuracy rises to 82.5%, the log-loss value drops correspondingly, and the number of misclassifications falls to 56.

    Using the random forest classifier as a basic recommender that labels a red wine "recommended" (quality 6 and above) or "not recommended" (quality 5 and below), a prediction accuracy of 82.5% seems reasonable.
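    As a quick consistency check: the test set is 20% of 1599 rows, i.e. 320 samples, and 1 − 56/320 = 0.825, which matches the reported 82.5% accuracy.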

    Reposted from: https://www.cnblogs.com/zqalq/p/9203207.html


    Machine learning: leaf classification and clustering

    1 Import packages

    import os
    import matplotlib.image as img
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns  # used below for the correlation heatmap
    import warnings
    
    # Back up the current warnings filters
    filters = warnings.filters[:]
    # Add a rule that ignores DeprecationWarning
    warnings.simplefilter('ignore', DeprecationWarning)
    # import sets
    # # Restore the original filters
    # warnings.filters = filters
    

    2 Inspect the data

    save_path='./filtered_imgs'
    if os.path.exists(save_path) is False:
        os.makedirs(save_path)
    
    os.listdir()
    
    ['.ipynb_checkpoints',
     'data_imgs',
     'data_imgs2',
     'filtered_imgs',
     'images',
     'ML_lesson10_KMeans聚类.ipynb',
     'notebook.tex',
     'render.html',
     'sample_submission.csv',
     'test.csv',
     'train.csv',
     'Untitled.ipynb',
     'Untitled1.ipynb',
     '树叶分类.html',
     '树叶分类实现.ipynb',
     '聚类.ipynb',
     '聚类测试.ipynb']
    
    # Sort the file names numerically
    img_path='./images'
    img_name_list=os.listdir(img_path)
    img_name_list.sort(key=lambda x: int(x.split('.')[0]))
    
    ## Read the first 12 images in order
    DImage=[]
    for img_name in img_name_list[:12]:
        img_full_path=os.path.join(img_path, img_name)
        DImage.append(img.imread(img_full_path))
    
    plt.style.use('ggplot')
    
    ## Visualize the leaf images
    f=plt.figure(figsize=(8,6))
    for i in range(12):
        plt.subplot(3,4,i+1)
        plt.axis("off")
        plt.title("image_ID:{0}".format(img_name_list[i].split('.jpg')[0]))
        plt.imshow(DImage[i],cmap='hot')
    plt.show()
    

    [figure: the first 12 leaf images]

    3 Load the training and test sets

    Train = pd.read_csv('train.csv')
    Train_id = Train['id']
    Test = pd.read_csv('test.csv')
    Test_id = Test['id']
    Test.drop(['id'],inplace = True, axis = 1)
    
    # Summary statistics of the training set
    Train.describe()
    
    id margin1 margin2 margin3 margin4 margin5 margin6 margin7 margin8 margin9 ... texture55 texture56 texture57 texture58 texture59 texture60 texture61 texture62 texture63 texture64
    count 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 ... 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000 990.000000
    mean 799.595960 0.017412 0.028539 0.031988 0.023280 0.014264 0.038579 0.019202 0.001083 0.007167 ... 0.036501 0.005024 0.015944 0.011586 0.016108 0.014017 0.002688 0.020291 0.008989 0.019420
    std 452.477568 0.019739 0.038855 0.025847 0.028411 0.018390 0.052030 0.017511 0.002743 0.008933 ... 0.063403 0.019321 0.023214 0.025040 0.015335 0.060151 0.011415 0.039040 0.013791 0.022768
    min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
    25% 415.250000 0.001953 0.001953 0.013672 0.005859 0.001953 0.000000 0.005859 0.000000 0.001953 ... 0.000000 0.000000 0.000977 0.000000 0.004883 0.000000 0.000000 0.000000 0.000000 0.000977
    50% 802.500000 0.009766 0.011719 0.025391 0.013672 0.007812 0.015625 0.015625 0.000000 0.005859 ... 0.004883 0.000000 0.005859 0.000977 0.012695 0.000000 0.000000 0.003906 0.002930 0.011719
    75% 1195.500000 0.025391 0.041016 0.044922 0.029297 0.017578 0.056153 0.029297 0.000000 0.007812 ... 0.043701 0.000000 0.022217 0.009766 0.021484 0.000000 0.000000 0.023438 0.012695 0.029297
    max 1584.000000 0.087891 0.205080 0.156250 0.169920 0.111330 0.310550 0.091797 0.031250 0.076172 ... 0.429690 0.202150 0.172850 0.200200 0.106450 0.578130 0.151370 0.375980 0.086914 0.141600

    8 rows × 193 columns

    Train.head()
    
    id species margin1 margin2 margin3 margin4 margin5 margin6 margin7 margin8 ... texture55 texture56 texture57 texture58 texture59 texture60 texture61 texture62 texture63 texture64
    0 1 Acer_Opalus 0.007812 0.023438 0.023438 0.003906 0.011719 0.009766 0.027344 0.0 ... 0.007812 0.000000 0.002930 0.002930 0.035156 0.0 0.0 0.004883 0.000000 0.025391
    1 2 Pterocarya_Stenoptera 0.005859 0.000000 0.031250 0.015625 0.025391 0.001953 0.019531 0.0 ... 0.000977 0.000000 0.000000 0.000977 0.023438 0.0 0.0 0.000977 0.039062 0.022461
    2 3 Quercus_Hartwissiana 0.005859 0.009766 0.019531 0.007812 0.003906 0.005859 0.068359 0.0 ... 0.154300 0.000000 0.005859 0.000977 0.007812 0.0 0.0 0.000000 0.020508 0.002930
    3 5 Tilia_Tomentosa 0.000000 0.003906 0.023438 0.005859 0.021484 0.019531 0.023438 0.0 ... 0.000000 0.000977 0.000000 0.000000 0.020508 0.0 0.0 0.017578 0.000000 0.047852
    4 6 Quercus_Variabilis 0.005859 0.003906 0.048828 0.009766 0.013672 0.015625 0.005859 0.0 ... 0.096680 0.000000 0.021484 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.031250

    5 rows × 194 columns

    Test.head()
    
    margin1 margin2 margin3 margin4 margin5 margin6 margin7 margin8 margin9 margin10 ... texture55 texture56 texture57 texture58 texture59 texture60 texture61 texture62 texture63 texture64
    0 0.019531 0.009766 0.078125 0.011719 0.003906 0.015625 0.005859 0.0 0.005859 0.023438 ... 0.006836 0.000000 0.015625 0.000977 0.015625 0.0 0.0 0.000000 0.003906 0.053711
    1 0.007812 0.005859 0.064453 0.009766 0.003906 0.013672 0.007812 0.0 0.033203 0.023438 ... 0.000000 0.000000 0.006836 0.001953 0.013672 0.0 0.0 0.000977 0.037109 0.044922
    2 0.000000 0.000000 0.001953 0.021484 0.041016 0.000000 0.023438 0.0 0.011719 0.005859 ... 0.128910 0.000000 0.000977 0.000000 0.000000 0.0 0.0 0.015625 0.000000 0.000000
    3 0.000000 0.000000 0.009766 0.011719 0.017578 0.000000 0.003906 0.0 0.003906 0.001953 ... 0.012695 0.015625 0.002930 0.036133 0.013672 0.0 0.0 0.089844 0.000000 0.008789
    4 0.001953 0.000000 0.015625 0.009766 0.039062 0.000000 0.009766 0.0 0.005859 0.000000 ... 0.000000 0.042969 0.016602 0.010742 0.041016 0.0 0.0 0.007812 0.009766 0.007812

    5 rows × 192 columns

    Train['species'].value_counts().head()
    
    Quercus_Agrifolia               10
    Quercus_Chrysolepis             10
    Alnus_Cordata                   10
    Viburnum_x_Rhytidophylloides    10
    Ginkgo_Biloba                   10
    Name: species, dtype: int64
    
    print("树叶种类数目为:",len(set(Train['species'])))
    
    Number of leaf species: 99
    
    ## Convert species to class labels, i.e. map each leaf name to an integer
    map_dic = {}
    i = -1
    for _ in Train['species']:
        if _ in map_dic:
            pass
        else:
            i += 1
            map_dic[_] = map_dic.get(_, i)
    
    [(key, value) for key, value in map_dic.items()][:5]
    
    [('Acer_Opalus', 0),
     ('Pterocarya_Stenoptera', 1),
     ('Quercus_Hartwissiana', 2),
     ('Tilia_Tomentosa', 3),
     ('Quercus_Variabilis', 4)]
    
    len(map_dic)
    
    99
    
    Train['species'].replace(map_dic.keys(), map_dic.values(), inplace=True)
    
    Train.drop(['id'], inplace = True, axis = 1)
    
    Train_ture = Train['species']
    

    3.1 Plot the correlation matrix (used to pick features for feature engineering)

    corr = Train.corr()
    f, ax = plt.subplots(figsize=(25, 25))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5)
    plt.show()
    

    [figure: correlation heatmap]

    # Check for missing values; if this prints False, there are none
    np.all(np.any(pd.isnull(Train)))
    
    False
    
    X = Train.drop(['species'], axis=1)
    y = Train['species']
    print(y.head())
    print("训练集尺寸:", X.shape) 
    
    0    0
    1    1
    2    2
    3    3
    4    4
    Name: species, dtype: int64
    Training set shape: (990, 192)
    
    ## Split into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=42)
    print("Training data shape", X_train.shape)
    print("Test data shape", X_test.shape)
    print("Training target shape", y_train.shape)
    print("Test target shape", y_test.shape)
    
    Training data shape (792, 192)
    Test data shape (198, 192)
    Training target shape (792,)
    Test target shape (198,)
    

    3.2 Standardize the data

    from sklearn.preprocessing import StandardScaler
    
    
    # Standardize the features (fit on train, transform both)
    standerScaler = StandardScaler()
    X_train = standerScaler.fit_transform(X_train)
    X_test = standerScaler.transform(X_test)
    

    4 Is PCA worthwhile?

    4.1 KNN without PCA

    %%time
    from sklearn.preprocessing import StandardScaler
    
    X_train_shape1 = X_train.shape
    X_test_shape1 = X_test.shape
    print(X_train.shape[1])
    
    192
    Wall time: 0 ns
    
    %%time
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    
    clf=KNeighborsClassifier(2)
    clf.fit(X_train, y_train)
    train_predictions = clf.predict(X_test)
    notpca_score = accuracy_score(y_test, train_predictions)
    print(notpca_score)
    
    0.9696969696969697
    Wall time: 134 ms
    

    4.2 KNN with PCA

    from sklearn.decomposition import PCA
    
    pca = PCA(n_components=0.95)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    X_train_shape2 = X_train.shape
    X_test_shape2 = X_test.shape
    print(X_train.shape[1])
    
    68
    
    %%time
    
    clf=KNeighborsClassifier(2)
    clf.fit(X_train, y_train)
    train_predictions = clf.predict(X_test)
    pca_score = accuracy_score(y_test, train_predictions)
    print(pca_score)
    
    0.9646464646464646
    Wall time: 34.9 ms
    

    4.3 Comparison before and after PCA

    data_score = pd.DataFrame([[X_train_shape1, X_train_shape2],
                              [X_test_shape1, X_test_shape2],
                              [notpca_score, pca_score]])
    # Row labels
    data_score.index = ["X_train shape", "X_test shape", "Accuracy"]
    # Column labels
    data_score.columns = ["before PCA", "after PCA"]
    print("KNN")
    data_score
    
    before PCA after PCA
    X_train shape (792, 192) (792, 68)
    X_test shape (198, 192) (198, 68)
    Accuracy 0.969697 0.964646

    Discussion: with 95% of the variance retained, PCA gives slightly lower accuracy than the raw features, but the gap is small and the computation is noticeably faster (34.9 ms vs. 134 ms above).
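    One way to make this trade-off concrete is to inspect the cumulative explained variance of the fitted pca object from above; a minimal sketch (the variable names are those defined earlier):

    import numpy as np

    # Cumulative share of variance explained by the first k components
    cum = np.cumsum(pca.explained_variance_ratio_)
    # Smallest k that retains at least 95% of the variance
    k95 = int(np.searchsorted(cum, 0.95)) + 1
    print("components kept:", pca.n_components_, "; 95% reached at k =", k95)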

    5 Comparing classification algorithms

    score_classify_list = []
    model_classify_list = []
    model_predict_list = []
    

    5.1 KNN

    from sklearn.neighbors import KNeighborsClassifier
    
    knn_clf0 = KNeighborsClassifier()
    knn_clf0.fit(X_train, y_train)
    print("*"*30)
    print('KNeighborsClassifier')
    
    y_predict = knn_clf0.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("KNN")
    model_predict_list.append(y_predict)
    

    Output:
    ******************************
    KNeighborsClassifier
    Accuracy: 97.9798%

    5.2 KNN with grid search

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import GridSearchCV
    
    
    param_grid = [
        {
            'weights':["uniform"],
            'n_neighbors':[i for i in range(2, 4)]
        },
        {
            'weights':["distance"],
            'n_neighbors':[i for i in range(2, 4)],
            'p':[i for i in range(1, 3)]
        }
    ]
    
    knn_clf = KNeighborsClassifier()
    grid_search = GridSearchCV(knn_clf, param_grid)
    grid_search.fit(X_train, y_train)
    print("Best hyperparameters: ", grid_search.best_params_)
    knn_clf = grid_search.best_estimator_
    print("*"*30)
    print('KNeighborsClassifier grid search')
    
    y_predict = knn_clf.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("KNN grid search")
    model_predict_list.append(y_predict)
    

    Output:
    Best hyperparameters: {'n_neighbors': 2, 'p': 1, 'weights': 'distance'}
    ******************************
    KNeighborsClassifier grid search
    Accuracy: 98.9899%

    5.3 SVC

    from sklearn.svm import SVC
    svc_clf = SVC(probability=True)
    svc_clf.fit(X_train, y_train)
    
    print("*"*30)
    print('SVC')
    
    y_predict = svc_clf.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("SVC")
    model_predict_list.append(y_predict)
    

    Output:
    ******************************
    SVC
    Accuracy: 97.4747%

    5.4 Logistic regression

    from sklearn.linear_model import LogisticRegressionCV
    
    lr = LogisticRegressionCV(multi_class="ovr", 
                              fit_intercept=True, 
                              Cs=np.logspace(-2,2,20), 
                              cv=2, 
                              penalty="l2", 
                              solver="lbfgs", 
                              tol=0.01)
    
    lr.fit(X_train,y_train)
    
    print("*"*30)
    print('逻辑回归')
    
    y_predict = lr.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("逻辑回归")
    model_predict_list.append(y_predict)
    

    Output:
    ******************************
    Logistic regression
    Accuracy: 98.9899%

    5.5 Voting

    warnings.simplefilter('ignore', DeprecationWarning)
    
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import VotingClassifier
    
    voting_clf = VotingClassifier(estimators=[
        ('log_clf', LogisticRegression()),
        ('svm_clf', SVC(probability=True)),
    ], voting='hard')
    
    voting_clf.fit(X_train,y_train)
    
    print("*"*30)
    print('voting')
    
    y_predict = voting_clf.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("voting")
    model_predict_list.append(y_predict)
    

    Output:
    ******************************
    voting
    Accuracy: 97.4747%

    5.6 Random forest

    from sklearn.ensemble import RandomForestClassifier
    
    rf_clf = RandomForestClassifier(n_estimators=250,random_state=666, oob_score=True)
    rf_clf.fit(X_train, y_train)
    
    print("*"*30)
    print('随机森林')
    
    y_predict = rf_clf.predict(X_test)
    score = accuracy_score(y_test, y_predict)
    print("Accuracy: {:.4%}".format(score))
    
    score_classify_list.append(score)
    model_classify_list.append("随机森林")
    model_predict_list.append(y_predict)
    

    Output:
    ******************************
    Random forest
    Accuracy: 96.9697%

    5.7 Compare the classifiers

    data_score = pd.DataFrame(score_classify_list)
    models = model_classify_list
    features = ["accuracy_score"]
    # Row labels: model names
    data_score.index = models
    # Column label
    data_score.columns = features
    data_score
    

    Scores of the different classifiers:

    accuracy_score
    KNN 0.979798
    KNN grid search 0.989899
    SVC 0.974747
    Logistic regression 0.989899
    voting 0.974747
    Random forest 0.969697
    from pyecharts import options as opts
    from pyecharts.charts import Bar
    from pyecharts.globals import ThemeType
    
    models_score = [round(i, 4) for i in score_classify_list]
    
    def bar_reversal_axis() -> Bar:
        c = (
            Bar(init_opts=opts.InitOpts(width="800px",height="300px", theme=ThemeType.WONDERLAND))
            .add_xaxis(models)
            .add_yaxis("model", models_score)
            .reversal_axis()
            .set_series_opts(label_opts=opts.LabelOpts(position="right"))
            .set_global_opts(title_opts=opts.TitleOpts(title="分类算法", subtitle="得分"))
        )
        return c
    c = bar_reversal_axis()
    c.render("分类算法得分.html")
    c.render_notebook()
    

    Visualization of the different classifiers' scores:
    [figure: bar chart of classifier accuracies]

    # (6, 198)
    model_predict_array = np.array(model_predict_list)
    test_id_list = ["test_"+str(i) for i in range(model_predict_array.shape[1])]
    
    
    def bar_datazoom_slider() -> Bar:
        c = (
            Bar(init_opts=opts.InitOpts(width="1000px",height="200px",theme=ThemeType.WONDERLAND))
            .add_xaxis(test_id_list[:])
            .add_yaxis(models[0], model_predict_array[0,:].tolist())
            .add_yaxis(models[1], model_predict_array[1,:].tolist())
            .add_yaxis(models[2], model_predict_array[2,:].tolist())
            .add_yaxis(models[3], model_predict_array[3,:].tolist())
            .add_yaxis(models[4], model_predict_array[4,:].tolist())
            .add_yaxis(models[5], model_predict_array[5,:].tolist())
            .set_global_opts(
                title_opts=opts.TitleOpts(title="分类", subtitle="预测类别"),
                datazoom_opts=[opts.DataZoomOpts(type_="slider")]
            )
        )
        return c
    
    c = bar_datazoom_slider()
    c.render("分类预测类别.html")
    c.render_notebook()
    

    Visualization of the different classifiers' predictions:
    [figure: bar chart of per-sample predicted classes]

    warnings.simplefilter('ignore', DeprecationWarning)
    def plot_learning_curve(model, x_start, x_stop, metrics, X_train_xx, X_test_xx, y_train_xx, y_test_xx, model_name):
    
        train_score = []
        test_score = []
        for i in range(x_start, x_stop, 100):
            # Fit on the first i training samples, then score on train and test
            model.fit(X_train_xx[:i], y_train_xx[:i])
            y_train_predict = model.predict(X_train_xx[:i])
            train_score.append(metrics(y_train_xx[:i], y_train_predict))
    
            y_test_predict = model.predict(X_test_xx)
            test_score.append(metrics(y_test_xx, y_test_predict))
    
        plt.plot([i for i in range(x_start, x_stop, 100)], train_score, label="train")
        plt.plot([i for i in range(x_start, x_stop, 100)], test_score, label="test")
        plt.legend()
        plt.title(model_name)
        plt.xlabel('number of training samples')
        plt.ylabel('accuracy')
        plt.axis([x_start, x_stop, 0., 1.1])
    
    X_train_xx = X_train.copy()
    X_test_xx = X_test.copy()
    y_train_xx = y_train.copy()
    y_test_xx = y_test.copy()
    
    %%time
    model_clf_list = [knn_clf0, knn_clf, svc_clf, lr, voting_clf, rf_clf]
    figure, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 25), dpi=80)
    warnings.simplefilter('ignore', Warning)
    for i in range(1, 7):
        plt.subplot(3,2,i)
        plot_learning_curve(model_clf_list[i-1], 22, len(X_train_xx)+1, accuracy_score, X_train_xx, X_test_xx, y_train_xx, y_test_xx, models[i-1])
    plt.show()
    

    Learning curves: all six models end up with accuracy close to 100%. The training and test curves of KNN with grid search and of the voting ensemble stay close together, so their overfitting is very small. Since KNN with grid search also scores highest, it yields the most satisfactory model.

    Wall time: 2min 21s
    

    6 Classify the leaf images

    6.1 Test set: predict the 99 classes

    ### warnings.simplefilter('ignore', DeprecationWarning)
    # Predict the 99 classes on the test set
    Test_sta = standerScaler.transform(Test)
    Test_pca = pca.transform(Test_sta)
    Test_predict = knn_clf.predict(Test_pca)
    print("Shape of the predictions:")
    Test_predict.shape
    
    Shape of the predictions:
    (594,)
    

    6.2 Initialize Test_label_dic and Train_label_dic

    Test_label_dic={}
    for i in range(99):
        Test_label_dic[i]=np.where(Test_predict==i)[0]
        
    Train_label_dic={}
    for i in range(99):
        Train_label_dic[i]=np.where(Train_ture==i)[0]
    

    6.3 Build the dataset of all images

    DImage=[]
    for img_name in img_name_list:
        img_full_path=os.path.join(img_path, img_name)
        DImage.append(img.imread(img_full_path))
    train_y = {}
    test_y = {}
    

    6.4 Write the images out by class

    # import pysnooper
    
    # @pysnooper.snoop(r'C:\Users\linyihua\Desktop\mylog/file.log')
    def imagesclassifier(DImage,Train_id,Test_id,Train_label_dic,Test_label_dic,root_path, train_y={}, test_y={}):
        '''DImage is the image dataset (the 1584 images read in a loop); Train_id and Test_id are
        the id columns of train.csv and test.csv; Train_label_dic and Test_label_dic map each leaf
        class label to the indices of its images'''
        if os.path.exists(root_path) is False:
            os.makedirs(root_path)
        save_path=root_path
        for i in range(99):
            train_val=Train_id.values[np.array(Train_label_dic[i]).reshape(-1)]-1
            test_val=Test_id.values[np.array(Test_label_dic[i]).reshape(-1)]-1
            train_y[i] = train_val
            test_y[i] = test_val
            Train_imgs=np.array(DImage)[train_val]
            Test_imgs=np.array(DImage)[test_val]
            for index, _ in enumerate(Train_imgs):
                img_name='train'+str(train_val[index])+'.jpg'
                save_path=os.path.join(save_path,str(i))
                if os.path.exists(save_path) is False:
                    os.makedirs(save_path)
                img.imsave(os.path.join(save_path,img_name),_,cmap='binary')
                save_path=root_path
    
            for index, _ in enumerate(Test_imgs):
                img_name='test'+str(test_val[index])+'.jpg'
                save_path=os.path.join(save_path,str(i))
                if os.path.exists(save_path) is False:
                    os.makedirs(save_path)
                img.imsave(os.path.join(save_path,img_name),_,cmap='binary')
                save_path=root_path
        return train_y, test_y
    
    %%time
    train_y, test_y = imagesclassifier(DImage,Train_id,Test_id,Train_label_dic,Test_label_dic,save_path)
    
    Wall time: 1min 16s
    

    6.5 The 99 folders generated under filtered_imgs (images from folder 0 shown below)

    ## Read the first 12 images in order
    img_name_list1=os.listdir('./filtered_imgs/0')
    img_path1='./filtered_imgs/0'
    DImage1=[]
    for img_name1 in img_name_list1[:12]:
        img_full_path1=os.path.join(img_path1,img_name1)
        DImage1.append(img.imread(img_full_path1))
    ## Visualize the leaf images
    f=plt.figure(figsize=(15,6))
    for i in range(12):
        plt.subplot(3,4,i+1)
        plt.axis("off")
        plt.title("image_ID:{0}".format(img_name_list1[i].split('.jpg')[0]))
        plt.imshow(DImage1[i],cmap='hot')
    plt.show()
    

    [figure: images in filtered_imgs/0]

    6.6 The 99 folders generated under filtered_imgs (images from folder 7 shown below)

    ## Read the first 12 images in order
    img_name_list2=os.listdir('./filtered_imgs/7')
    img_path2='./filtered_imgs/7'
    DImage2=[]
    for img_name2 in img_name_list2[:12]:
        img_full_path2=os.path.join(img_path2,img_name2)
        DImage2.append(img.imread(img_full_path2))
    ## Visualize the leaf images
    f=plt.figure(figsize=(15, 10))
    for i in range(12):
        plt.subplot(3,4,i+1)
        plt.axis("off")
        plt.title("image_ID:{0}".format(img_name_list2[i].split('.jpg')[0]))
        plt.imshow(DImage2[i],cmap='hot')
    plt.show()
    

    [figure: images in filtered_imgs/7]

    7 Comparing clustering algorithms

    7.1 Load the data

    # Load the data
    Train2 = pd.read_csv('train.csv')
    Test2 = pd.read_csv('test.csv')
    print(Train2.shape, Test2.shape)
    
    (990, 194) (594, 193)
    

    7.2 Data preparation

    ## Data preparation: drop the labels, merge train and test, sort by id
    Train2.drop(['species'],inplace = True, axis = 1)
    data = np.concatenate((Train2,Test2), axis=0)
    data = pd.DataFrame(data)
    columns = Test2.columns
    data.columns = columns
    data = data.sort_values(by="id", ascending=True)
    data.head()
    
    id margin1 margin2 margin3 margin4 margin5 margin6 margin7 margin8 margin9 ... texture55 texture56 texture57 texture58 texture59 texture60 texture61 texture62 texture63 texture64
    0 1.0 0.007812 0.023438 0.023438 0.003906 0.011719 0.009766 0.027344 0.0 0.001953 ... 0.007812 0.000000 0.002930 0.002930 0.035156 0.0 0.0 0.004883 0.000000 0.025391
    1 2.0 0.005859 0.000000 0.031250 0.015625 0.025391 0.001953 0.019531 0.0 0.000000 ... 0.000977 0.000000 0.000000 0.000977 0.023438 0.0 0.0 0.000977 0.039062 0.022461
    2 3.0 0.005859 0.009766 0.019531 0.007812 0.003906 0.005859 0.068359 0.0 0.000000 ... 0.154300 0.000000 0.005859 0.000977 0.007812 0.0 0.0 0.000000 0.020508 0.002930
    990 4.0 0.019531 0.009766 0.078125 0.011719 0.003906 0.015625 0.005859 0.0 0.005859 ... 0.006836 0.000000 0.015625 0.000977 0.015625 0.0 0.0 0.000000 0.003906 0.053711
    3 5.0 0.000000 0.003906 0.023438 0.005859 0.021484 0.019531 0.023438 0.0 0.013672 ... 0.000000 0.000977 0.000000 0.000000 0.020508 0.0 0.0 0.017578 0.000000 0.047852

    5 rows × 193 columns

    X = data.iloc[:, 1:]
    y_id = data["id"]
    
    X.shape
    
    (1584, 192)
    

    7.2.1 Standardize the data

    from sklearn.preprocessing import StandardScaler
    
    standerScaler2 = StandardScaler()
    X = standerScaler2.fit_transform(X)
    

    7.2.2 PCA down to 2 dimensions for easy visualization

    from sklearn.decomposition import PCA
    
    pca2 = PCA(n_components=2)
    X_reduction = pca2.fit_transform(X)
    
    y_predict_list = []
    

    7.3 scatter_cluster(cluster_num, plt, model_name, X_reduction, y_predict)

    def scatter_cluster(cluster_num, plt, model_name, X_reduction, y_predict):
        colors = [
                    "#64A600","#A6A600", 
                    "#C6A300","#EA7500", 
                    "#AD5A5A","#A5A552",
                    "#5CADAD","#8080C0",
                    "#EA0000","#FF359A",
                    "#D200D2","#9F35FF",
                    "#2828FF","#0080FF",
                    "#00CACA","#02DF82",
                 ]
        markers = ["o", "^", "s", "p", "x", "+", "d", "*"] * 2
    
    #     plt.figure(figsize=(), dpi=80)
        plt.grid(linestyle="--", alpha=0.5)
        for i in range(cluster_num):
            plt.scatter(X_reduction[y_predict==i, 0], X_reduction[y_predict==i, 1],
                        color=colors[i], marker=markers[i], label=str(i))
    
        plt.title(model_name+":聚类数量为"+str(cluster_num))
        plt.legend()
    #     plt.show()
    
    model_names = []
    

    7.4 KMeans

    from sklearn.cluster import KMeans
    
    # km = KMeans()
    # km.fit(X_reduction, y)
    # y_predict = km.predict(X_reduction)
    km = KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, random_state=None, copy_x=True)
    # n_clusters: number of clusters
    # max_iter: maximum number of iterations per initialization
    # n_init: number of random initializations to try
    # tol: tolerance on the within-cluster sum of squares to declare convergence
    # init='k-means++': spreads the initial centroids far apart
    km.fit(X_reduction, y)
    y_predict = km.predict(X_reduction)
    y_predict_list.append(y_predict)
    model_names.append("KMeans")
    

    7.5 Birch

    warnings.simplefilter('ignore', FutureWarning)
    
    from sklearn.cluster import Birch
    
    y_predict = Birch(n_clusters = 8).fit_predict(X_reduction)
    y_predict_list.append(y_predict)
    model_names.append("Birch")
    

    7.6 MiniBatchKMeans

    from sklearn.cluster import MiniBatchKMeans
    
    y_predict = MiniBatchKMeans(n_clusters = 8).fit_predict(X_reduction)
    y_predict_list.append(y_predict)
    model_names.append("MiniBatchKMeans")
    

    7.7 Gaussian mixture clustering

    from sklearn.mixture import GaussianMixture  # mixture.GMM was removed from scikit-learn
    
    y_predict = GaussianMixture(n_components=8).fit_predict(X_reduction)
    y_predict_list.append(y_predict)
    model_names.append("Gaussian mixture")
    

    7.8 Compare the different clusterings

    figure, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12), dpi=80)
    cluster_num = 8
    for i in range(1, 5):
        plt.subplot(2,2,i)
        scatter_cluster(cluster_num, plt, model_names[i-1], X_reduction, y_predict_list[i-1])
    # Show grid lines
    plt.grid(linestyle="--", alpha=0.8)
    plt.show()
    

    Visualizing the different clusterings: the features were reduced to two dimensions with PCA for plotting. By the principle that points within a cluster should be as close as possible and clusters as far apart as possible, KMeans looks the best.
    [figure: scatter plots of the four clusterings]
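    This visual judgment can be backed with a number via the silhouette coefficient; a minimal sketch, assuming the X_reduction and y_predict_list defined above:

    from sklearn.metrics import silhouette_score

    # Mean silhouette per clustering: higher means tighter, better-separated clusters
    for name, labels in zip(model_names, y_predict_list):
        print(name, silhouette_score(X_reduction, labels))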

    8 How many clusters are appropriate?

    def KMeans_clusters(clusters_num, X):
        km = KMeans(n_clusters=clusters_num, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
                    random_state=None, copy_x=True)
        km.fit(X, y)
        y_predict = km.predict(X)
        return y_predict
    
    cluster_num = 4
    cluster_name = "KMeans"
    figure, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12), dpi=80)
    for i in range(1, 5):
        y_predict = KMeans_clusters(cluster_num*i, X_reduction)
        plt.subplot(2,2,i)
        scatter_cluster(cluster_num*i, plt, cluster_name, X_reduction, y_predict)
    
    # Show grid lines
    plt.grid(linestyle="--", alpha=0.8)
    plt.show()
    

    Visualizing different cluster counts: by inspection, 16 clusters looks the most reasonable.

    [figure: KMeans scatter plots for 4, 8, 12, and 16 clusters]
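    The eyeball choice can also be checked with the elbow method; a minimal sketch that plots the KMeans inertia (within-cluster sum of squares) against k on the same X_reduction:

    inertias = []
    ks = range(2, 25)
    for k in ks:
        inertias.append(KMeans(n_clusters=k, n_init=10).fit(X_reduction).inertia_)
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()  # look for the elbow where the curve flattens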

    9 Clustering results

    9.1 imagesclassifier2(cluster_num, DImage, y_predict, root_path)

    # Cluster into 16 groups
    cluster_num = 16
    y_predict = KMeans_clusters(cluster_num, X)
    
    def imagesclassifier2(cluster_num, DImage, y_predict ,root_path):
        if os.path.exists(root_path) is False:
            os.makedirs(root_path)
        save_path=root_path
        for i in range(cluster_num):
            data_val=np.where(y_predict==i)[0]
            data_imgs=np.array(DImage)[data_val]
            for index, _ in enumerate(data_imgs):
                img_name='data'+str(data_val[index])+'.jpg'
                save_path=os.path.join(save_path,str(i))
                if os.path.exists(save_path) is False:
                    os.makedirs(save_path)
                img.imsave(os.path.join(save_path,img_name),_,cmap='binary')
                save_path=root_path
    
    imagesclassifier2(cluster_num, DImage, y_predict, r"./data_imgs")
    

    9.2 Read the first 12 images of one cluster in order

    ## Read the first 12 images in order
    img_name_list22=os.listdir('./data_imgs/5')
    img_path22='./data_imgs/5'
    DImage22=[]
    for img_name22 in img_name_list22[:12]:
        img_full_path22=os.path.join(img_path22,img_name22)
        DImage22.append(img.imread(img_full_path22))
    ## Visualize the leaf images
    f=plt.figure(figsize=(15, 10))
    for i in range(12):
        plt.subplot(3,4,i+1)
        plt.axis("off")
        plt.title("image_ID:{0}".format(img_name_list22[i].split('.jpg')[0]))
        plt.imshow(DImage22[i],cmap='hot')
    plt.show()
    

    [figure: images in data_imgs/5]
