  • Python implementation of genetic-algorithm feature selection

    (Original post: 2019-07-02, source: https://blog.csdn.net/Joseph__Lagrange/article/details/94451214)
    Contents

    1. Basic principle of genetic-algorithm feature selection

    2. Fitness function choice and environment requirements

    (1) Fitness function choice

    (2) Third-party package dependencies

    3. Python implementation


    1. Basic principle of genetic-algorithm feature selection

    The basic idea of genetic-algorithm (GA) feature selection is to use a genetic algorithm to search for an optimal binary string in which each bit corresponds to one feature: if the i-th bit is "1", the corresponding feature is selected and enters the estimator; if it is "0", the feature is not selected and does not appear in the model. The basic steps are:

    (1) Encoding. Binary encoding is used: each bit of the string is "0" if the corresponding feature is not selected and "1" if it is selected.

    (2) Generation of the initial population. N random strings are generated to form the initial population; a population size of 50-100 is typical.

    (3) Fitness function. The fitness function measures how good an individual (candidate solution) is. For feature selection the design of the fitness function is crucial; it is usually based on a class-separability criterion or on the predictive power of the selected features.

    (4) The individual with the highest fitness, i.e. the best member of the population, is copied unchanged into the next generation (elitism); selection, crossover and mutation operators are then applied to the parent population to breed the remaining n-1 strings of the new generation. Roulette-wheel selection is commonly used: strings with higher fitness are more likely to be selected and carried over to the next generation, while strings with low fitness are more likely to be eliminated. Crossover and mutation are the operators that create new individuals. If the crossover rate is too high, high-fitness strings are destroyed quickly; if it is too low, the search stagnates; it is usually set to 0.5-0.9. If the mutation rate is too high, the GA degenerates into random search; if it is too low, no new individuals are produced; it is usually set to 0.01-0.1.

    (5) If the preset number of generations has been reached, return the best string, use it as the selected feature subset, and stop. Otherwise go back to step (4) and breed the next generation. (A minimal code sketch of this loop is given right below.)
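    Before moving to the library-based implementation in section 3, here is a minimal, self-contained sketch of the loop just described (written for these notes; it is not the sklearn-genetic code). The choice of Ridge regression with 3-fold cross-validated R² as the fitness function is only an illustrative assumption:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def fitness(mask, X, y):
        # an all-zero chromosome selects no features and gets the worst score
        if not mask.any():
            return -1e9
        return cross_val_score(Ridge(), X[:, mask], y, cv=3).mean()

    def ga_select(X, y, n_pop=50, n_gen=30, pc=0.7, pm=0.05, seed=0):
        rng = np.random.default_rng(seed)
        n_feat = X.shape[1]
        pop = rng.random((n_pop, n_feat)) < 0.5                       # (2) random initial population
        for _ in range(n_gen):
            scores = np.array([fitness(ind, X, y) for ind in pop])    # (3) fitness of each string
            best = pop[scores.argmax()].copy()                        # (4) keep the best string (elitism)
            w = scores - scores.min() + 1e-9                          # roulette-wheel selection weights
            parents = pop[rng.choice(n_pop, size=n_pop, p=w / w.sum())]
            children = []
            for a, b in zip(parents[0::2], parents[1::2]):
                if rng.random() < pc:                                 # single-point crossover
                    cut = int(rng.integers(1, n_feat))
                    a, b = np.r_[a[:cut], b[cut:]], np.r_[b[:cut], a[cut:]]
                children += [a, b]
            pop = np.array(children)[:n_pop - 1]
            pop ^= rng.random(pop.shape) < pm                         # bit-flip mutation
            pop = np.vstack([best, pop])
        scores = np.array([fitness(ind, X, y) for ind in pop])
        return pop[scores.argmax()]                                   # (5) best chromosome = feature mask

    # usage sketch: selected_mask = ga_select(x_train_scaled, y_train_scaled.ravel())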

    2. Fitness function choice and environment requirements

    (1) Fitness function choice

    The classification and regression metrics of scikit-learn's model-evaluation module can be used; this write-up uses the mean_squared_error metric.

    There are many scorers to choose from:

    SCORERS = dict(explained_variance=explained_variance_scorer,
                   r2=r2_scorer,
                   max_error=max_error_scorer,
                   neg_median_absolute_error=neg_median_absolute_error_scorer,
                   neg_mean_absolute_error=neg_mean_absolute_error_scorer,
                   neg_mean_squared_error=neg_mean_squared_error_scorer,
                   neg_mean_squared_log_error=neg_mean_squared_log_error_scorer,
                   accuracy=accuracy_scorer, roc_auc=roc_auc_scorer,
                   balanced_accuracy=balanced_accuracy_scorer,
                   average_precision=average_precision_scorer,
                   neg_log_loss=neg_log_loss_scorer,
                   brier_score_loss=brier_score_loss_scorer,
                   # Cluster metrics that use supervised evaluation
                   adjusted_rand_score=adjusted_rand_scorer,
                   homogeneity_score=homogeneity_scorer,
                   completeness_score=completeness_scorer,
                   v_measure_score=v_measure_scorer,
                   mutual_info_score=mutual_info_scorer,
                   adjusted_mutual_info_score=adjusted_mutual_info_scorer,
                   normalized_mutual_info_score=normalized_mutual_info_scorer,
                   fowlkes_mallows_score=fowlkes_mallows_scorer)
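    Any of the names in this dictionary can be passed as a string to a scoring argument; scikit-learn resolves it internally. A quick check (a small illustration added here, assuming some already-fitted estimator est and data X, y):

    from sklearn.metrics import get_scorer

    scorer = get_scorer("neg_mean_squared_error")
    # score = scorer(est, X, y)   # higher (i.e. less negative) is better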

    (2) Third-party package dependencies

    Genetic-algorithm package: sklearn-genetic

    Other Python packages: scikit-learn, numpy, scipy, matplotlib
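    All of these are available from PyPI; for example, `pip install sklearn-genetic scikit-learn numpy scipy matplotlib` should pull them in (exact package versions are not pinned in the original post).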

    3. Python implementation

    from __future__ import print_function
    from genetic_selection import GeneticSelectionCV
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    import scipy.io as sio
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import r2_score, mean_squared_error
    import matplotlib.pyplot as plt
    
    
    def main():
        # 1. Load the data
        mat = sio.loadmat('NDFNDF_smote.mat')
        data = mat['NDFNDF_smote']
        x, y = data[:, :1050], data[:, 1050]
        print(x.shape, y.shape)
    
        # 2. Train/test split and preprocessing
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
    
        x_scale, y_scale = StandardScaler(), StandardScaler()
        x_train_scaled = x_scale.fit_transform(x_train)
        x_test_scaled = x_scale.transform(x_test)
        y_train_scaled = y_scale.fit_transform(y_train.reshape(-1, 1))
        y_test_scaled = y_scale.transform(y_test.reshape(-1, 1))
        print(x_train_scaled.shape, y_train_scaled.shape)
        print(x_test_scaled.shape, y_test_scaled.shape)
    
        # 3. Hyperparameter scan (number of hidden neurons)
        base, size = 30, 21
        wavelengths_save, wavelengths_size, r2_test_save, mse_test_save = [], [], [], []
        for hidden_size in range(base, base+size):
            print('Number of hidden neurons: ', hidden_size)
            estimator = MLPRegressor(hidden_layer_sizes=hidden_size,
                                     activation='relu',
                                     solver='adam',
                                     alpha=0.0001,
                                     batch_size='auto',
                                     learning_rate='constant',
                                     learning_rate_init=0.001,
                                     power_t=0.5,
                                     max_iter=1000,
                                     shuffle=True,
                                     random_state=1,
                                     tol=0.0001,
                                     verbose=False,
                                     warm_start=False,
                                     momentum=0.9,
                                     nesterovs_momentum=True,
                                     early_stopping=False,
                                     validation_fraction=0.1,
                                     beta_1=0.9, beta_2=0.999,
                                     epsilon=1e-08)
    
            selector = GeneticSelectionCV(estimator,
                                          cv=5,
                                          verbose=1,
                                          scoring="neg_mean_squared_error",
                                          max_features=5,
                                          n_population=200,
                                          crossover_proba=0.5,
                                          mutation_proba=0.2,
                                          n_generations=200,
                                          crossover_independent_proba=0.5,
                                          mutation_independent_proba=0.05,
                                          tournament_size=3,
                                          n_gen_no_change=10,
                                          caching=True,
                                          n_jobs=-1)
            selector = selector.fit(x_train_scaled, y_train_scaled.ravel())
            print('Number of selected variables:', selector.n_features_)
            print(np.array(selector.population_).shape)
            print(selector.generation_scores_)
    
            x_train_s, x_test_s = x_train_scaled[:, selector.support_], x_test_scaled[:, selector.support_]
            estimator.fit(x_train_s, y_train_scaled.ravel())
    
            # y_train_pred = estimator.predict(x_train_s)
            y_test_pred = estimator.predict(x_test_s)
            # y_train_pred = y_scale.inverse_transform(y_train_pred)
            y_test_pred = y_scale.inverse_transform(y_test_pred.reshape(-1, 1)).ravel()  # reshape: newer sklearn expects 2-D input here
            r2_test = r2_score(y_test, y_test_pred)
            mse_test = mean_squared_error(y_test, y_test_pred)
    
            wavelengths_save.append(list(selector.support_))  
            wavelengths_size.append(selector.n_features_)  
            r2_test_save.append(r2_test)
            mse_test_save.append(mse_test)
            print('R2:', r2_test, 'MSE:', mse_test)
    
        print('Numbers of selected variables:', wavelengths_size)
    
        # 4. Save intermediate results
        dict_name = {'wavelengths_size': wavelengths_size, 'r2_test_save': r2_test_save,
                     'mse_test_save': mse_test_save, 'wavelengths_save': wavelengths_save}
        f = open('bpnn_ga.txt', 'w')
        f.write(str(dict_name))
        f.close()
    
        # 5. Plot the curves
        plt.figure(figsize=(6, 4), dpi=300)
        fonts = 8
        xx = np.arange(base, base+size)
        plt.plot(xx, r2_test_save, color='r', linewidth=2, label='r2')
        plt.plot(xx, mse_test_save, color='k', linewidth=2, label='mse')
        plt.xlabel('number of hidden neurons', fontsize=fonts)  # x-axis is the hidden-layer size scanned above
        plt.ylabel('metric value', fontsize=fonts)
        plt.grid(True)
        plt.legend(fontsize=fonts)
        plt.tight_layout(pad=0.3)
        plt.show()
    
    
    if __name__ == "__main__":
        main()
    
    

     

     

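    A small aside on step 4 above (an addition to these notes, not part of the original post): the results dictionary is written with str(), which is awkward to parse back later. Inside main(), the json module is one more convenient alternative, provided the numpy values are first converted to plain Python types:

    import json

    with open('bpnn_ga.json', 'w') as f:
        json.dump({'wavelengths_size': [int(n) for n in wavelengths_size],
                   'r2_test_save': [float(v) for v in r2_test_save],
                   'mse_test_save': [float(v) for v in mse_test_save],
                   'wavelengths_save': [[bool(b) for b in w] for w in wavelengths_save]},
                  f)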
  • Decision trees: basic concepts, ID3, and scikit-learn examples

    1. What is a decision tree?

    A classification decision tree is a tree-structured model that describes how instances are classified. A decision tree consists of nodes and directed edges. There are two types of node: internal nodes and leaf nodes. An internal node represents a feature or attribute; a leaf node represents a class.

    A decision tree, also called a judgment tree, is a predictive model expressed as a tree structure (binary or multiway).

    An instance is classified by routing it from the root node down to some leaf node.

    The leaf node gives the class the instance belongs to.

    Each node of the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one possible value of that attribute.

    2. Structure of a decision tree

    (Figure: 决策树结构.png — schematic structure of a decision tree)

    3. Types of decision tree

    Classification tree: a decision tree for a discrete target variable

    Regression tree: a decision tree for a continuous target variable

    4. Decision-tree algorithms (greedy algorithms)

    Supervised learning

    Non-parametric learning algorithm

    The tree is built top-down and recursively

    At every step, the choice that is best/optimal in the current state is taken (greedy)

    A decision-tree learning algorithm typically selects the optimal feature recursively and splits the training data on that feature, so that each resulting subset is classified as well as possible.

    Among decision-tree algorithms, ID3 uses information gain as the attribute-selection measure, C4.5 uses the information gain ratio, and CART uses the Gini index.

    5. The decision-tree learning process

    Feature selection

    Tree generation: a recursive procedure, corresponding to local optimisation of the model

    Tree pruning: reduces the size of the tree to mitigate overfitting, corresponding to global model selection

    6. Strengths and weaknesses of decision trees

    Strengths:

    (1) Fast: the computational cost is relatively low, and the tree is easy to convert into classification rules; walking from the root down to a leaf, the split conditions along the path uniquely determine a classification predicate.

    (2) Accurate and interpretable: the mined classification rules are accurate and easy to understand, and the tree clearly shows which fields matter most, i.e. it produces human-readable rules.

    (3) Can handle both continuous and categorical fields

    (4) Requires no domain knowledge or parameter assumptions

    (5) Suitable for high-dimensional data

    Weaknesses:

    (1) For data with imbalanced class sizes, information gain is biased towards features with many distinct values

    (2) Prone to overfitting

    (3) Ignores correlations between attributes

    5.2 Mathematical background for decision trees

    1. Information theory:

    If an event has k possible outcomes with probabilities $p_1, p_2, \ldots, p_k$, the amount of information I obtained once the event has occurred is

    $$I = -\sum_{i=1}^{k} p_i \log_2 p_i$$

    2. Entropy:

    Given a sample set S containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is

    $$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$

    where $p_{+}$ is the proportion of positive examples and $p_{-}$ the proportion of negative examples.

    3. Conditional entropy:

    Let (X, Y) be a pair of random variables with joint distribution $P(X=x_i, Y=y_j)=p_{ij}$, $i=1,2,\ldots,n$; $j=1,2,\ldots,m$.

    The conditional entropy H(Y|X) measures the remaining uncertainty of Y when X is known; it is defined as the expectation over X of the entropy of the conditional distribution of Y given X:

    $$H(Y\mid X) = \sum_{i=1}^{n} P(X=x_i)\, H(Y\mid X=x_i)$$
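    As a small worked example (added here for illustration): a sample set with 9 positive and 5 negative examples has entropy $-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$; a set containing only one class has entropy 0, and an evenly split set has entropy 1.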

    5.3 Hunt's algorithm

    Hunt's algorithm builds the decision tree recursively.

    If all records in the data set D belong to the same class, the node is marked as a leaf with that class label.

    If D contains records belonging to several classes, an attribute test is chosen to partition the records into smaller subsets; a child node is created for each outcome of the test and the records of D are distributed to the children according to the test results. Steps 1-2 are then repeated for each child node, recursing on the children of the children, until the procedure stops. (A small recursive sketch is given below.)
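    A minimal recursive sketch of this procedure (written for these notes, not Hunt's original formulation); best_split is a hypothetical helper that chooses the attribute test, e.g. by information gain, and returns the chosen attribute together with the partitioned records:

    from collections import Counter

    def hunt(records, labels, best_split, depth=0, max_depth=5):
        # Case 1: all records belong to one class (or no further split is possible) -> leaf node
        if not records or len(set(labels)) == 1 or depth == max_depth:
            return {'leaf': Counter(labels).most_common(1)[0][0] if labels else None}
        # Case 2: choose an attribute test and distribute the records to child nodes
        attr, partitions = best_split(records, labels)  # hypothetical helper
        node = {'attr': attr, 'children': {}}
        for value, (sub_records, sub_labels) in partitions.items():
            node['children'][value] = hunt(sub_records, sub_labels,
                                           best_split, depth + 1, max_depth)
        return node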

    5.4 The ID3 algorithm

    1. Entropy of the classification system

    $$H(C) = -\sum_{i} P(c_i)\log_2 P(c_i)$$

    2. Conditional entropy (entropy remaining after splitting on a feature)

    3. Definition of information gain Gain(S, A)

    $$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

    4. Attribute-selection measure

    Information gain is used: the attribute with the highest information gain is chosen as the test attribute of the current node.

    5. Shortcomings of the algorithm

    When the numbers of distinct values taken by different attributes differ widely, the classification accuracy of a tree built with ID3 suffers, because information gain favours attributes with many values.

    ID3 itself does not specify how to handle continuous data.

    ID3 cannot handle data sets with missing values, so missing values must be imputed before mining.

    ID3 only grows the tree (no pruning), so the resulting tree easily overfits.

    6. Algorithm flow

    ID3(Examples, Target_attribute, Attributes)

    Examples is the set of training examples. Target_attribute is the attribute whose value the tree should predict. Attributes is the list of other attributes that the learned decision tree may test. The procedure returns a decision tree that correctly classifies the given Examples.

    Create a Root node for the tree

    If all Examples are positive, return the single-node tree Root with label = +

    If all Examples are negative, return the single-node tree Root with label = -

    If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples

    Otherwise:

    A ← the attribute in Attributes that best* classifies Examples (*best here means highest information gain)

    7. Python implementation

    Computing entropy in Python:

    from math import log

    def calcShanNonEnt(dataSet):
        numEntries = len(dataSet)
        labelCounts = {}
        for featVec in dataSet:
            currentLabel = featVec[-1]
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries
            shannonEnt -= prob*log(prob, 2)
        return shannonEnt

    # example
    dataset = [[1], [2], [3], [3]]
    sne = calcShanNonEnt(dataset)
    print(sne)
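    Building on calcShanNonEnt, the information gain Gain(S, A) defined above can be computed as in the following sketch (added for these notes, not part of the original post), for records whose last element is the class label:

    def calcInfoGain(dataSet, axis):
        # entropy of the whole set
        baseEnt = calcShanNonEnt(dataSet)
        newEnt = 0.0
        # weighted entropy of the subsets obtained by splitting on feature column `axis`
        for value in set(row[axis] for row in dataSet):
            subset = [row[:axis] + row[axis+1:] for row in dataSet if row[axis] == value]
            newEnt += len(subset) / float(len(dataSet)) * calcShanNonEnt(subset)
        return baseEnt - newEnt

    # example: gain of splitting a toy set of [feature, label] records on feature 0
    print(calcInfoGain([[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no'], [1, 'no']], 0))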

    Parameters of sklearn.tree.DecisionTreeClassifier and usage suggestions

    class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

    # Examples
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(random_state=0)
    iris = load_iris()
    cross_val_score(clf, iris.data, iris.target, cv=10)

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
    res = clf.fit(X_train, y_train)
    pre = clf.predict(X_test)
    sco = clf.score(X_test, y_test)
    print(y_test)
    print(pre)
    print(sco)

    clf.apply(X_train)
    clf.apply(X_test)
    clf.decision_path(X_train)
    type(clf.decision_path(X_train))
    X_train.shape
    clf.feature_importances_

    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    clf.feature_importances_
    clf.get_params()
    clf.predict_log_proba(X_test)
    clf.predict_proba(X_test)
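    The fitted tree can also be dumped as text for inspection. A small addition to these notes (assuming scikit-learn >= 0.21, which provides sklearn.tree.export_text):

    from sklearn.tree import export_text

    # print the learned decision rules using the iris feature names
    print(export_text(clf, feature_names=list(iris.feature_names)))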

    A DecisionTreeClassifier instance with the tree depth limited to 4

    from itertools import product

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn import datasets
    from sklearn.tree import DecisionTreeClassifier

    # use the iris data
    iris = datasets.load_iris()
    X = iris.data[:, [0, 2]]
    y = iris.target

    # train the model, limiting the maximum depth of the tree to 4
    clf = DecisionTreeClassifier(max_depth=4)
    clf.fit(X, y)

    # Plot
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .1),
                         np.arange(y_min, y_max, .1))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, alpha=.8)
    plt.show()

    (Figure: output_12_0.png — decision regions of the depth-4 tree over two iris features)

    This plot compares the decision surfaces learned by a decision tree classifier (first column), a random forest classifier (second column), an extra-trees classifier (third column) and an AdaBoost classifier (fourth column).

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    from sklearn import clone
    from sklearn.datasets import load_iris
    from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                                  AdaBoostClassifier)
    from sklearn.tree import DecisionTreeClassifier

    # Parameters
    n_classes = 3
    n_estimators = 30
    cmap = plt.cm.RdYlBu
    plot_step = 0.02
    plot_step_coarser = 0.5
    RANDOM_SEED = 13

    # Load data
    iris = load_iris()

    plot_idx = 1

    models = [DecisionTreeClassifier(max_depth=None),
              RandomForestClassifier(n_estimators=n_estimators),
              ExtraTreesClassifier(n_estimators=n_estimators),
              AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=n_estimators)]

    for pair in ([0, 1], [0, 2], [2, 3]):
        for model in models:
            # print(pair, model)
            # only take the two corresponding features
            X = iris.data[:, pair]
            y = iris.target

            # Shuffle
            idx = np.arange(X.shape[0])
            np.random.seed(RANDOM_SEED)
            np.random.shuffle(idx)
            X = X[idx]
            y = y[idx]

            # Standardize
            mean = X.mean(axis=0)
            std = X.std(axis=0)
            X = (X - mean) / std

            # Train
            clf = clone(model)
            clf = model.fit(X, y)
            scores = clf.score(X, y)

            # Create a title for each column and the console by using str() and
            # slicing away useless parts of the string
            model_title = str(type(model)).split(".")[-1][:-2][:-len('Classifier')]
            model_details = model_title
            if hasattr(model, "estimators_"):
                model_details += " with {} estimators".format(len(model.estimators_))
            print(model_details + " with features", pair,
                  "has a score of", scores)

            plt.subplot(3, 4, plot_idx)
            if plot_idx <= len(models):
                # Add a title at the top of each column
                plt.title(model_title)

            # Now plot the decision boundary using a fine mesh as input to a filled contour plot
            x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
            y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
            xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                                 np.arange(y_min, y_max, plot_step))

            # Plot either a single DecisionTreeClassifier or alpha blend the
            # decision surfaces of the ensemble of classifiers
            if isinstance(model, DecisionTreeClassifier):
                Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, cmap=cmap)
            else:
                # Choose alpha blend level with respect to the number of estimators
                # that are in use (noting that AdaBoost can use fewer estimators
                # than its maximum if it achieves a good enough fit early on)
                estimator_alpha = 1.0 / len(model.estimators_)
                print(len(model.estimators_))
                for tree in model.estimators_:
                    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                    Z = Z.reshape(xx.shape)
                    cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)

            # Build a coarser grid to plot a set of ensemble classifications
            # to show how these are different to what we see in the decision
            # surfaces. These points are regularly spaced and do not have a
            # black outline
            xx_coarser, yy_coarser = np.meshgrid(
                np.arange(x_min, x_max, plot_step_coarser),
                np.arange(y_min, y_max, plot_step_coarser))
            Z_points_coarser = model.predict(np.c_[xx_coarser.ravel(),
                                                   yy_coarser.ravel()]
                                             ).reshape(xx_coarser.shape)
            cs_points = plt.scatter(xx_coarser, yy_coarser, s=15,
                                    c=Z_points_coarser, cmap=cmap,
                                    edgecolors="none")

            plt.scatter(X[:, 0], X[:, 1], c=y,
                        cmap=ListedColormap(['r', 'y', 'b']),
                        edgecolor='k', s=20)
            plot_idx += 1

    Output:

    Automatically created module for IPython interactive environment
    DecisionTree with features [0, 1] has a score of 0.9266666666666666
    RandomForest with 30 estimators with features [0, 1] has a score of 0.9266666666666666
    30
    ExtraTrees with 30 estimators with features [0, 1] has a score of 0.9266666666666666
    30
    AdaBoost with 30 estimators with features [0, 1] has a score of 0.84
    30
    DecisionTree with features [0, 2] has a score of 0.9933333333333333
    RandomForest with 30 estimators with features [0, 2] has a score of 0.9933333333333333
    30
    ExtraTrees with 30 estimators with features [0, 2] has a score of 0.9933333333333333
    30
    AdaBoost with 30 estimators with features [0, 2] has a score of 0.9933333333333333
    30
    DecisionTree with features [2, 3] has a score of 0.9933333333333333
    RandomForest with 30 estimators with features [2, 3] has a score of 0.9933333333333333
    30
    ExtraTrees with 30 estimators with features [2, 3] has a score of 0.9933333333333333
    30
    AdaBoost with 30 estimators with features [2, 3] has a score of 0.9933333333333333
    30

    (Figure: output_14_1.png — decision surfaces of the single tree and the three ensembles on pairs of iris features)

    A comparison of several classifiers in scikit-learn on synthetic datasets.

    The point of this example is to illustrate the nature of the decision boundaries of different classifiers.

    Particularly in high-dimensional spaces, data can more easily be separated linearly, and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.

    print(__doc__)

    # Code source: Gaël Varoquaux
    #              Andreas Müller
    # Modified for documentation by Jaques Grobler
    # License: BSD 3 clause

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_moons, make_circles, make_classification
    # classifiers
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    h = .02  # step size in the mesh

    names = ['Nearest Neighbors', 'Linear SVM', 'RBF SVM', 'Gaussian Process',
             'Decision Tree', 'Random Forest', 'Neural Net', 'AdaBoost',
             'Naive Bayes', 'QDA']

    classifiers = [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025),  # C is the penalty parameter
        SVC(gamma=2, C=1),  # kernel: rbf (default), gamma: kernel coefficient
        GaussianProcessClassifier(1.0 * RBF(1.0)),
        DecisionTreeClassifier(max_depth=5),
        RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
        MLPClassifier(alpha=1),  # multilayer perceptron
        AdaBoostClassifier(),
        GaussianNB(),  # Gaussian naive Bayes
        QuadraticDiscriminantAnalysis()
    ]

    X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                               random_state=1, n_clusters_per_class=1)
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)

    datasets = [make_moons(noise=0.3, random_state=0),
                make_circles(noise=0.2, factor=0.5, random_state=1),
                linearly_separable
                ]

    figure = plt.figure(figsize=(27, 9))
    i = 1
    # iterate over datasets
    for ds_cnt, ds in enumerate(datasets):
        # preprocess dataset, split into training and test part
        X, y = ds
        X = StandardScaler().fit_transform(X)
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=.4, random_state=42)

        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # just plot the dataset first
        cm = plt.cm.RdBu
        cm_bright = ListedColormap(['#FF0000', '#0000FF'])
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        if ds_cnt == 0:
            ax.set_title("Input data")
        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
                   edgecolors='k')
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1

        # iterate over classifiers
        for name, clf in zip(names, classifiers):
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)

            # plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]*[y_min, y_max].
            if hasattr(clf, 'decision_function'):
                Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            else:
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

            # plot also the training points
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                       edgecolors='k')
            # and testing points
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       edgecolors='k', alpha=.6)

            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if ds_cnt == 0:
                ax.set_title(name)
            ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                    size=15, horizontalalignment='right')
            i += 1

    plt.tight_layout()
    plt.show()

    (Figure: Classifier comparison.png — input data and decision boundaries of the ten classifiers on the three synthetic datasets)

    This example fits an AdaBoost decision stump on a non-linearly separable classification dataset composed of two "Gaussian quantiles" clusters and plots the decision boundary and decision scores.

    print(__doc__)

    # Author: Noel Dawe
    #
    # License: BSD 3 clause

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_gaussian_quantiles

    # Construct dataset
    X1, y1 = make_gaussian_quantiles(cov=2.,
                                     n_samples=200, n_features=2,
                                     n_classes=2, random_state=1)
    X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                     n_samples=300, n_features=2,
                                     n_classes=2, random_state=1)
    X = np.concatenate((X1, X2))
    y = np.concatenate((y1, -y2 + 1))

    # Create and fit an AdaBoosted decision tree
    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             algorithm='SAMME',
                             n_estimators=200)
    bdt.fit(X, y)

    plot_colors = 'br'
    plot_step = .02
    class_names = 'AB'

    plt.figure(figsize=(10, 5))

    # plot the decision boundaries
    plt.subplot(121)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.axis("tight")

    # Plot the training points
    for i, n, c in zip(range(2), class_names, plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1],
                    c=c, cmap=plt.cm.Paired,
                    s=20, edgecolor='k',
                    label=("Class %s" % n))
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.legend(loc='upper right')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Decision Boundary')

    # Plot the two-class decision scores
    twoclass_output = bdt.decision_function(X)
    plot_range = (twoclass_output.min(), twoclass_output.max())
    plt.subplot(122)
    for i, n, c in zip(range(2), class_names, plot_colors):
        plt.hist(twoclass_output[y == i],
                 bins=10,
                 range=plot_range,
                 facecolor=c,
                 label=('Class %s' % n),
                 alpha=.5,
                 edgecolor='k')
    x1, x2, y1, y2 = plt.axis()
    plt.axis((x1, x2, y1, y2 * 1.2))
    plt.legend(loc='upper right')
    plt.ylabel('Samples')
    plt.xlabel('Score')
    plt.title('Decision Scores')

    plt.tight_layout()
    plt.subplots_adjust(wspace=0.35)
    plt.show()

    Output:

    Automatically created module for IPython interactive environment

    (Figure: Two-class AdaBoost.png — decision boundary and decision-score histograms of the boosted stumps)
