  • Random forest regression on the HR dataset with R² scoring and 10-fold cross-validation (scikit-learn):

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    dataset = pd.read_csv("HR_comma_sep.csv")
    x = dataset.iloc[:, :-1].values   # independent variables
    y = dataset.iloc[:, 9].values     # dependent variable

    # Encoding the categorical variables (columns 7 and 8)
    le_x1 = LabelEncoder()
    x[:, 7] = le_x1.fit_transform(x[:, 7])
    le_x2 = LabelEncoder()
    x[:, 8] = le_x2.fit_transform(x[:, 8])

    # One-hot encode the two categorical columns. OneHotEncoder's old
    # categorical_features argument has been removed from scikit-learn;
    # ColumnTransformer is the current way to apply it to selected columns.
    ct = ColumnTransformer([("ohe", OneHotEncoder(), [7, 8])],
                           remainder="passthrough", sparse_threshold=0)
    x = ct.fit_transform(x)

    # Splitting the dataset into training and testing data
    # (sklearn.cross_validation was renamed to sklearn.model_selection)
    from sklearn.model_selection import train_test_split
    y = pd.factorize(dataset['left'].values)[0].reshape(-1, 1)   # use the binary 'left' column as the target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

    # Feature scaling (x only; y is a binary label, so predictions and
    # y_test stay on the same scale for r2_score)
    from sklearn.preprocessing import StandardScaler
    sc_x = StandardScaler()
    x_train = sc_x.fit_transform(x_train)
    x_test = sc_x.transform(x_test)

    from sklearn.ensemble import RandomForestRegressor
    regressor = RandomForestRegressor(n_estimators=10, random_state=0)
    regressor.fit(x_train, y_train.ravel())
    y_pred = regressor.predict(x_test)
    print(y_pred)

    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)

    from sklearn.model_selection import cross_val_score
    accuracies = cross_val_score(estimator=regressor, X=x_train, y=y_train.ravel(), cv=10)
    accuracies.mean()
    accuracies.std()
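    For a regressor, cross_val_score falls back on the estimator's own score method, which for RandomForestRegressor is R². The metric can be made explicit, or swapped for an error measure, via the scoring argument; a minimal illustration, continuing from the block above:

    # "r2" and "neg_mean_squared_error" are standard scikit-learn scorer names
    r2_cv = cross_val_score(regressor, x_train, y_train.ravel(), cv=10, scoring="r2")
    mse_cv = cross_val_score(regressor, x_train, y_train.ravel(), cv=10,
                             scoring="neg_mean_squared_error")
    print(r2_cv.mean(), -mse_cv.mean())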

  • Feature selection for multiple linear regression in R: best subset selection, forward/backward stepwise selection, and cross-validation (ISLR Hitters example):

    In multiple linear regression, more features are not always better: choosing a small, well-chosen set of features both guards against overfitting and makes the model easier to interpret. This post covers three ways to select features: best subset selection, forward/backward stepwise selection, and cross-validation.

    Best subset selection

    The idea is simple: try every possible combination of features, fit a model for each, and keep the best one. The basic procedure:

    For p features, for each k from 1 to p:

    fit all C(p,k) models that use exactly k of the p features and keep the best one (smallest RSS or largest R²);

    then pick the overall winner among the p best models (one per k) using cross-validation error, Cp, BIC, Adjusted R², or a similar criterion.

    The advantage is obvious: every possibility has been tried, so the selected model is guaranteed to be the best. The drawback is just as obvious: as p grows, the amount of computation grows explosively (2^p models), so the method is only practical when p is small.

    The example below uses the Hitters dataset from the ISLR package in R to build a multiple linear regression model for baseball players' salaries.

    > library(ISLR)
    > Hitters = na.omit(Hitters)   # drop the rows with missing Salary
    > dim(Hitters)   # besides Salary (the response), 19 features remain
    [1] 263 20
    > library(leaps)
    > regfit.full = regsubsets(Salary~., Hitters, nvmax = 19)   # best subset selection with up to 19 features
    > reg.summary = summary(regfit.full)   # shows which features are selected at each model size
    > plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")   # more features, smaller RSS
    > plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
    > points(which.max(reg.summary$adjr2), reg.summary$adjr2[11], col = "red", cex = 2, pch = 20)   # Adjusted R2 peaks at 11 features
    > plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
    > points(which.min(reg.summary$cp), reg.summary$cp[10], col = "red", cex = 2, pch = 20)   # Cp is smallest at 10 features
    > plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
    > points(which.min(reg.summary$bic), reg.summary$bic[6], col = "red", cex = 2, pch = 20)   # BIC is smallest at 6 features
    > plot(regfit.full, scale = "r2")   # R2 keeps rising as features are added, which is no surprise
    > plot(regfit.full, scale = "adjr2")
    > plot(regfit.full, scale = "Cp")
    > plot(regfit.full, scale = "bic")

    Adjusted R², Cp, and BIC are three statistics used to evaluate models (their formulas are recalled below, without derivation): an Adjusted R² closer to 1 indicates a better fit, while for the other two, smaller is better.
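    For reference, the standard forms of these criteria (in the notation used in ISLR, where d is the number of predictors in the model, $\hat\sigma^2$ an estimate of the error variance, and TSS the total sum of squares) are:

    $$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat\sigma^2\right), \qquad \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}.$$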

    Note that the three criteria do not select the same number of features. Taking Adjusted R² as the criterion here, 11 features are selected:

    As the plot shows, at the maximum Adjusted R² (admittedly only a little above 0.5, so nothing impressive), the 11 selected features are: AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, LeagueN, DivisionW, PutOuts, Assists.

    The model coefficients can be inspected directly:

    > coef(regfit.full,11)

    (Intercept) AtBat Hits Walks CAtBat

    135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914

    CRuns CRBI CWalks LeagueN DivisionW

    1.4553310 0.7852528 -0.8228559 43.1116152 -111.1460252

    PutOuts Assists

    0.2894087 0.2688277

    These 11 features match the plot; with the features selected and their coefficients estimated, the model is built.

    Stepwise selection

    The idea here can be summed up as "one road, walked to the end": each iteration can only continue in the direction set by the previous one; you cannot backtrack, and you cannot drop a feature once it has been chosen. Taking forward stepwise selection as an example, the basic procedure is:

    Starting from the model with no features, for each step k = 1 to p:

    keep the k-1 features already chosen and try adding each of the remaining features one at a time, fitting the candidate models and keeping the best one (smallest RSS or largest R²);

    repeat until all p features have been added (k = p);

    finally, choose the best of the resulting p models (one per size) using cross-validation error, Cp, BIC, Adjusted R², and so on.

    Backward stepwise selection works the same way in reverse: it starts from the model containing all p features and, at each iteration, drops the single feature whose removal improves the model the most.

    The difference from best subset selection is that best subset selection may pick any combination of (k+1) features, whereas stepwise selection can only build the (k+1)-feature model on top of the k features already chosen. Stepwise selection therefore cannot guarantee the optimal model: a not-so-important feature picked early on has to be carried through every later iteration, which can rule out the truly best combination. Its advantage is a drastically smaller computational cost, roughly p(p+1)/2 model fits instead of 2^p, which makes it far more practical.
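    To make the computational gap concrete for the Hitters data used here (p = 19):

    $$2^{19} = 524{,}288 \ \text{models for best subset selection, versus } 1 + \frac{19 \cdot 20}{2} = 191 \ \text{fits for forward stepwise selection.}$$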

    > regfit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
    > summary(regfit.fwd)   # shows the forward selection path

    Subset selection object

    Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")

    Selection Algorithm: forward

    AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits

    1 ( 1 ) " " " " " " " " " " " " " " " " " "

    2 ( 1 ) " " "*" " " " " " " " " " " " " " "

    3 ( 1 ) " " "*" " " " " " " " " " " " " " "

    4 ( 1 ) " " "*" " " " " " " " " " " " " " "

    5 ( 1 ) "*" "*" " " " " " " " " " " " " " "

    6 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    7 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    8 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " "

    13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " "

    14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " "

    15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*"

    16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"

    17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"

    18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"

    19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"

    CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts

    1 ( 1 ) " " " " "*" " " " " " " " "

    2 ( 1 ) " " " " "*" " " " " " " " "

    3 ( 1 ) " " " " "*" " " " " " " "*"

    4 ( 1 ) " " " " "*" " " " " "*" "*"

    5 ( 1 ) " " " " "*" " " " " "*" "*"

    6 ( 1 ) " " " " "*" " " " " "*" "*"

    7 ( 1 ) " " " " "*" "*" " " "*" "*"

    8 ( 1 ) " " "*" "*" "*" " " "*" "*"

    9 ( 1 ) " " "*" "*" "*" " " "*" "*"

    10 ( 1 ) " " "*" "*" "*" " " "*" "*"

    11 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    12 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    13 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    14 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    15 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    16 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    17 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    18 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

    Assists Errors NewLeagueN

    1 ( 1 ) " " " " " "

    2 ( 1 ) " " " " " "

    3 ( 1 ) " " " " " "

    4 ( 1 ) " " " " " "

    5 ( 1 ) " " " " " "

    6 ( 1 ) " " " " " "

    7 ( 1 ) " " " " " "

    8 ( 1 ) " " " " " "

    9 ( 1 ) " " " " " "

    10 ( 1 ) "*" " " " "

    11 ( 1 ) "*" " " " "

    12 ( 1 ) "*" " " " "

    13 ( 1 ) "*" "*" " "

    14 ( 1 ) "*" "*" " "

    15 ( 1 ) "*" "*" " "

    16 ( 1 ) "*" "*" " "

    17 ( 1 ) "*" "*" "*"

    18 ( 1 ) "*" "*" "*"

    19 ( 1 ) "*" "*" "*"

    > regfit.bwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "backward")
    > summary(regfit.bwd)   # shows the backward selection path

    Subset selection object

    Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")

    Selection Algorithm: backward

    AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits

    1 ( 1 ) " " " " " " " " " " " " " " " " " "

    2 ( 1 ) " " "*" " " " " " " " " " " " " " "

    3 ( 1 ) " " "*" " " " " " " " " " " " " " "

    4 ( 1 ) "*" "*" " " " " " " " " " " " " " "

    5 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    6 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    7 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    8 ( 1 ) "*" "*" " " " " " " "*" " " " " " "

    9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " "

    12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " "

    13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " "

    14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " "

    15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*"

    16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"

    17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*"

    18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"

    19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"

    CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts

    1 ( 1 ) " " "*" " " " " " " " " " "

    2 ( 1 ) " " "*" " " " " " " " " " "

    3 ( 1 ) " " "*" " " " " " " " " "*"

    4 ( 1 ) " " "*" " " " " " " " " "*"

    5 ( 1 ) " " "*" " " " " " " " " "*"

    6 ( 1 ) " " "*" " " " " " " "*" "*"

    7 ( 1 ) " " "*" " " "*" " " "*" "*"

    8 ( 1 ) " " "*" "*" "*" " " "*" "*"

    9 ( 1 ) " " "*" "*" "*" " " "*" "*"

    10 ( 1 ) " " "*" "*" "*" " " "*" "*"

    11 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    12 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    13 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    14 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    15 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    16 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    17 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    18 ( 1 ) " " "*" "*" "*" "*" "*" "*"

    19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

    Assists Errors NewLeagueN

    1 ( 1 ) " " " " " "

    2 ( 1 ) " " " " " "

    3 ( 1 ) " " " " " "

    4 ( 1 ) " " " " " "

    5 ( 1 ) " " " " " "

    6 ( 1 ) " " " " " "

    7 ( 1 ) " " " " " "

    8 ( 1 ) " " " " " "

    9 ( 1 ) " " " " " "

    10 ( 1 ) "*" " " " "

    11 ( 1 ) "*" " " " "

    12 ( 1 ) "*" " " " "

    13 ( 1 ) "*" "*" " "

    14 ( 1 ) "*" "*" " "

    15 ( 1 ) "*" "*" " "

    16 ( 1 ) "*" "*" " "

    17 ( 1 ) "*" "*" "*"

    18 ( 1 ) "*" "*" "*"

    19 ( 1 ) "*" "*" "*"

    Note that best subset selection, forward stepwise selection, and backward stepwise selection may select different features; for example, their 7-variable models differ:

    > coef(regfit.full,7)

    (Intercept) Hits Walks CAtBat CHits

    79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073

    CHmRun DivisionW PutOuts

    1.4420538 -129.9866432 0.2366813

    > coef(regfit.fwd,7)

    (Intercept) AtBat Hits Walks CRBI

    109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622

    CWalks DivisionW PutOuts

    -0.3053070 -127.1223928 0.2533404

    > coef(regfit.bwd,7)

    (Intercept) AtBat Hits Walks CRuns

    105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095

    CWalks DivisionW PutOuts

    -0.7163346 -116.1692169 0.3028847

    Cross-validation

    Cross-validation is a general-purpose technique for assessing a model's bias and variance; it is not tied to any particular model. The usual compromise, k-fold cross-validation, works as follows:

    randomly assign the samples to k folds of roughly equal size (k is typically 10);

    for each i (1 <= i <= k), use fold i as the validation set and train the model on the remaining folds;

    take the mean of the k validation errors as the model's overall validation error.
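    In symbols, with $\mathrm{MSE}_i$ denoting the mean squared error on the $i$-th held-out fold, the k-fold estimate is simply

    $$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i.$$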

    k-fold CV has two advantages over leave-one-out cross-validation (LOOCV). First, it is cheaper: LOOCV requires n model fits, while k-fold needs only k. Second, LOOCV holds out a single observation at a time, so each fit uses nearly the whole dataset; the n fitted models are therefore almost identical and highly correlated, which makes the averaged validation error a high-variance, overfitting-prone estimate. k-fold CV trains on visibly smaller subsets, which keeps the variance of the estimate under control.

    So for every candidate number of features we can compute a k-fold cross-validation error and then look at how that error changes with the number of features (again, this idea is not limited to linear models).

    > set.seed(1)
    > # randomly split the samples into a training set and a test set
    > train = sample(c(T,F), nrow(Hitters), rep = T)
    > test = !train
    >
    > # best subset selection on the training set
    > regfit.best = regsubsets(Salary~., data = Hitters[train,], nvmax = 19)
    > test.mat = model.matrix(Salary~., data = Hitters[test,])
    >
    > val.errors = rep(NA,19)
    > for(i in 1:19){
    +   coefi = coef(regfit.best, id = i)
    +   pred = test.mat[, names(coefi)] %*% coefi   # test-set predictions via matrix multiplication
    +   val.errors[i] = mean((Hitters$Salary[test] - pred)^2)   # test MSE
    + }

    >

    > val.errors

    [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6

    [7] 162377.1 157909.3 154055.7 148162.1 151156.4 151742.5

    [13] 152214.5 157358.7 158541.4 158743.3 159972.7 159859.8

    [19] 160105.6

    > which.min(val.errors)

    [1] 10

    > coef(regfit.best,10)

    (Intercept) AtBat Hits Walks CAtBat

    -80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698

    CHits CHmRun CWalks LeagueN DivisionW

    1.1053238 1.3844863 -0.7483170 84.5576103 -53.0289658

    PutOuts

    0.2381662

    The example above randomly splits the sample into a training set and a test set, runs best subset selection on the training set, and computes the test-set MSE for each number of features; the MSE is smallest with 10 features.

    Next, k-fold cross-validation is used to choose the number of features:
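    One caveat before the loop below: regsubsets objects have no built-in predict method, so the call predict(best.fit, Hitters[folds==j,], id=i) assumes a small helper has been defined beforehand. A minimal version (essentially the one used in the ISLR lab) looks like this:

    predict.regsubsets = function(object, newdata, id, ...) {
      form = as.formula(object$call[[2]])   # recover the formula from the regsubsets call
      mat = model.matrix(form, newdata)     # design matrix for the new data
      coefi = coef(object, id = id)         # coefficients of the model with `id` features
      mat[, names(coefi)] %*% coefi         # predictions via matrix multiplication
    }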

    > k = 10
    > set.seed(1)
    > folds = sample(1:k, nrow(Hitters), replace = T)   # assign each sample to one of the 10 folds
    > table(folds)   # fold sizes are roughly comparable
    folds
    1 2 3 4 5 6 7 8 9 10
    13 25 31 32 33 27 26 30 22 24
    > cv.errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))   # a k x 19 matrix to hold the test errors: one row per fold, one column per number of features
    >
    > for(j in 1:k){
    +   best.fit = regsubsets(Salary~., data = Hitters[folds != j,], nvmax = 19)   # best subset selection on everything outside fold j
    +   for(i in 1:19){   # MSE for models with 1 to 19 features
    +     pred = predict(best.fit, Hitters[folds == j,], id = i)
    +     cv.errors[j,i] = mean((Hitters$Salary[folds == j] - pred)^2)
    +   }
    + }

    >

    > cv.errors

    1 2 3 4 5 6

    [1,] 187479.08 141652.61 163000.36 169584.40 141745.39 151086.36

    [2,] 96953.41 63783.33 85037.65 76643.17 64943.58 56414.96

    [3,] 165455.17 167628.28 166950.43 152446.17 156473.24 135551.12

    [4,] 124448.91 110672.67 107993.98 113989.64 108523.54 92925.54

    [5,] 136168.29 79595.09 86881.88 94404.06 89153.27 83111.09

    [6,] 171886.20 120892.96 120879.58 106957.31 100767.73 89494.38

    [7,] 56375.90 74835.19 72726.96 59493.96 64024.85 59914.20

    [8,] 93744.51 85579.47 98227.05 109847.35 100709.25 88934.97

    [9,] 421669.62 454728.90 437024.28 419721.20 427986.39 401473.33

    [10,] 146753.76 102599.22 192447.51 208506.12 214085.78 224120.38

    7 8 9 10 11 12

    [1,] 193584.17 144806.44 159388.10 138585.25 140047.07 158928.92

    [2,] 63233.49 63054.88 60503.10 60213.51 58210.21 57939.91

    [3,] 137609.30 146028.36 131999.41 122733.87 127967.69 129804.19

    [4,] 104522.24 96227.18 93363.36 96084.53 99397.85 100151.19

    [5,] 86412.18 77319.95 80439.75 75912.55 81680.13 83861.19

    [6,] 94093.52 86104.48 84884.10 80575.26 80155.27 75768.73

    [7,] 62942.94 60371.85 61436.77 62082.63 66155.09 65960.47

    [8,] 90779.58 77151.69 75016.23 71782.40 76971.60 77696.55

    [9,] 396247.58 381851.15 369574.22 376137.45 373544.77 382668.48

    [10,] 214037.26 169160.95 177991.11 169239.17 147408.48 149955.85

    13 14 15 16 17 18

    [1,] 161322.76 155152.28 153394.07 153336.85 153069.00 152838.76

    [2,] 59975.07 58629.57 58961.90 58757.55 58570.71 58890.03

    [3,] 133746.86 135748.87 137937.17 140321.51 141302.29 140985.80

    [4,] 103073.96 106622.46 106211.72 107797.54 106288.67 106913.18

    [5,] 85111.01 84901.63 82829.44 84923.57 83994.95 84184.48

    [6,] 76927.44 76529.74 78219.76 78256.23 77973.40 79151.81

    [7,] 66310.58 70079.10 69553.50 68242.10 68114.27 67961.32

    [8,] 78460.91 81107.16 82431.25 82213.66 81958.75 81893.97

    [9,] 375284.60 376527.06 374706.25 372917.91 371622.53 373745.20

    [10,] 194397.12 194448.21 174012.18 172060.78 184614.12 184397.75

    19

    [1,] 153197.11

    [2,] 58949.25

    [3,] 140392.48

    [4,] 106919.66

    [5,] 84284.62

    [6,] 78988.92

    [7,] 67943.62

    [8,] 81848.89

    [9,] 372365.67

    [10,] 183156.97

    >

    > mean.cv.errors = apply(cv.errors, 2, mean)   # average MSE over the 10 folds for each number of features

    > mean.cv.errors

    1 2 3 4 5 6 7

    160093.5 140196.8 153117.0 151159.3 146841.3 138302.6 144346.2

    8 9 10 11 12 13 14

    130207.7 129459.6 125334.7 125153.8 128273.5 133461.0 133974.6

    15 16 17 18 19

    131825.7 131882.8 132750.9 133096.2 132804.7

    > plot(mean.cv.errors,type = "b")

    Cross-validation therefore selects 11 features.

    Best subset selection can now be rerun on the full dataset and the 11-variable model reported:

    > reg.best = regsubsets(Salary~.,data = Hitters,nvmax = 19)

    > coef(reg.best,11)

    (Intercept) AtBat Hits Walks CAtBat

    135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914

    CRuns CRBI CWalks LeagueN DivisionW

    1.4553310 0.7852528 -0.8228559 43.1116152 -111.1460252

    PutOuts Assists

    0.2894087 0.2688277

  • A short snippet showing DecisionTreeRegressor with cross-validation and a grid search (scikit-learn):

    Here is a small code snippet illustrating how to use DecisionTreeRegressor together with cross-validation.

    A. The first block uses cross_val_score. Note, however, that the R² score can be negative, which tells us the model is learning poorly.

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeRegressor
    import numpy as np

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.20, random_state=0)

    dt = DecisionTreeRegressor(random_state=0, criterion="absolute_error")  # called "mae" in older scikit-learn versions
    dt_fit = dt.fit(X_train, y_train)

    dt_scores = cross_val_score(dt_fit, X_train, y_train, cv=5)
    print("mean cross validation score: {}".format(np.mean(dt_scores)))
    print("score without cv: {}".format(dt_fit.score(X_train, y_train)))

    # on the test or hold-out set
    from sklearn.metrics import r2_score
    print(r2_score(y_test, dt_fit.predict(X_test)))
    print(dt_fit.score(X_test, y_test))
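    On the remark above that the cross-validated R² can be negative: R² compares a model's squared error against that of simply predicting the mean of the targets, so any model worse than that baseline scores below zero. A quick illustration with toy numbers (not taken from the snippet above):

    from sklearn.metrics import r2_score
    # Predictions that track the targets worse than the constant mean (2.0) does:
    print(r2_score([1, 2, 3], [3, 1, 2]))   # -2.0
    # Predicting the mean everywhere gives exactly 0:
    print(r2_score([1, 2, 3], [2, 2, 2]))   # 0.0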

    B. Next, a grid search over the parameter min_samples_split is run with cross-validation, and the best estimator is then used to score the validation/hold-out set.

    # Using GridSearchCV:
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import make_scorer
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import r2_score

    scoring = make_scorer(r2_score)

    g_cv = GridSearchCV(DecisionTreeRegressor(random_state=0),

    param_grid={'min_samples_split': range(2, 10)},

    scoring=scoring, cv=5, refit=True)

    g_cv.fit(X_train, y_train)

    g_cv.best_params_

    result = g_cv.cv_results_

    # print(result)

    r2_score(y_test, g_cv.best_estimator_.predict(X_test))
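    For completeness, the selected parameter value and its cross-validated score can also be printed (continuing from the block above):

    print(g_cv.best_params_)   # the chosen min_samples_split value
    print(g_cv.best_score_)    # mean cross-validated R² for that setting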

    Hope this is useful.

    Reference:

  • Overfitting and underfitting in machine learning, learning curves, and cross-validation:

    Overfitting and underfitting in machine learning

    1. Take polynomial regression (multiple linear regression on polynomial features) as a typical example of fitting a dataset. As the maximum degree of the polynomial is raised, the R² of the fitted model keeps increasing and the mean squared error keeps shrinking, so the fit looks more and more accurate. In reality, only the error on the existing data is getting smaller; the model is not necessarily suitable for new samples. This is overfitting. If, on the other hand, the polynomial degree is set too low, the overall R² is too small and the mean squared error too large, so the fit is inadequate: that is underfitting.

    Overfitting and underfitting are usually compared and measured in terms of the mean squared error. Wrapping the learning curve of a given learner into a function makes it easy to plot:

    # Wrap the learning curves of different learners in a function for convenient plotting
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LinearRegression

    def plot_learning_curve(algo, x_train, x_test, y_train, y_test):
        train_score = []
        test_score = []
        for i in range(1, len(x_train)):
            algo.fit(x_train[:i], y_train[:i])
            y_train_pre = algo.predict(x_train[:i])
            y_test_pre = algo.predict(x_test)
            train_score.append(mean_squared_error(y_train[:i], y_train_pre))
            test_score.append(mean_squared_error(y_test, y_test_pre))
        plt.figure()
        plt.plot([i for i in range(1, len(x_train))], np.sqrt(train_score), "g", label="train_error")
        plt.plot([i for i in range(1, len(x_train))], np.sqrt(test_score), "r", label="test_error")
        plt.legend()
        plt.axis([0, len(x_train) + 1, 0, 5])
        plt.show()

    plot_learning_curve(LinearRegression(), x_train, x_test, y_train, y_test)
    plot_learning_curve(polynomialRegression(degree=1), x_train, x_test, y_train, y_test)   # underfitting
    plot_learning_curve(polynomialRegression(degree=2), x_train, x_test, y_train, y_test)   # the best fit
    plot_learning_curve(polynomialRegression(degree=10), x_train, x_test, y_train, y_test)  # overfitting
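    The polynomialRegression constructor used above is not defined in this excerpt; it is presumably a pipeline of polynomial feature expansion followed by linear regression. A minimal sketch of what it could look like (the name, the scaling step, and the pipeline layout are assumptions, not taken from the original post):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import LinearRegression

    def polynomialRegression(degree):
        # Expand the features to the given degree, scale them, then fit ordinary least squares
        return Pipeline([
            ("poly", PolynomialFeatures(degree=degree)),
            ("scaler", StandardScaler()),
            ("lin_reg", LinearRegression()),
        ])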

    The mean-squared-error learning curves for the overfitting and underfitting cases are shown in the figures below:

    
    

    Comparing the two figures, the characteristics of the learning curves under overfitting and underfitting can be summarized as follows:

    (1) Underfitting: as more training data is used, the training MSE keeps rising and the test MSE keeps falling, and both level off at roughly the same value, but that value is clearly large (well above 1 on this example's scale);
    (2) Best fit: the training MSE rises and the test MSE falls, both levelling off at similar values close to 1;
    (3) Overfitting: the training MSE starts out at almost 0 and grows towards a plateau as the dataset grows, while the test MSE fluctuates but trends downward towards its own plateau; a clear gap remains between the two plateaus, and early on the test MSE can be extremely large (essentially unbounded).

    Therefore, when training a model, watch out for and try to avoid both underfitting and overfitting.
    2. Handling this in practice is mostly a matter of curing overfitting; an overfit model is essentially one whose ability to generalize is poor. An effective first step is to split the data into a training set and a test set and to train on one while evaluating on the other, which is exactly what sklearn's train_test_split function is for.
    3. As the complexity of the fitted model increases, accuracy on the training data keeps improving, but accuracy on new test data first rises and then falls. The reason is noise in the underlying data: an overly complex model ends up expressing relationships among the noise. What we want is the model with the best accuracy on the test data, that is, the model that generalizes best.


    4. Splitting the data into a training set and a test set is an effective way to limit overfitting, but it is not the best one, because a model tuned against a single split can still end up overfitting that split. A more effective approach is to use a validation set with cross-validation, which goes much further in avoiding overfitting during training and gives a more reliable estimate of model accuracy.


    5. Cross-validation can be run with sklearn's cross_val_score(knn1, x_train, y_train, cv=...) function, where the cv argument is the number of folds the training data is split into. It is essentially the same validation scheme that grid search uses: GridSearchCV has its own cv argument with exactly the same meaning.
    6. Training with cross-validation gives comparatively trustworthy results. Splitting the training data into k parts for this purpose is called k-fold cross validation, and the larger k is, the more reliable the estimate. The downside is that k models are trained each time, so overall it is roughly k times slower.
    7. Leave-one-out cross-validation (LOOCV): if the training set x_train contains m samples, it is split into m folds; each time, m-1 samples are used for training and the remaining one for validation. Its great advantage is that it involves no randomness at all, so the performance estimate is as close as possible to the true one; its drawback is the huge amount of computation, which makes it very slow. It is rarely used in practice, although it does appear in academic work when especially reliable and precise results are needed (a minimal sklearn sketch follows below).
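    A minimal sketch of LOOCV with scikit-learn, assuming the same digits data and kNN classifier as in the code further below (this snippet is illustrative and not part of the original post):

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # LOOCV is k-fold CV with k equal to the number of training samples (slow on large data)
    loo_scores = cross_val_score(KNeighborsClassifier(), x_train, y_train, cv=LeaveOneOut())
    print(loo_scores.mean())   # each fold's score is 0 or 1, so the mean is the LOOCV accuracy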

    The full cross-validation code is as follows:

    # Cross-validation in scikit-learn:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

    digits = datasets.load_digits()
    x = digits.data
    y = digits.target

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=666)

    # 1-1 Plain train/test tuning
    from sklearn.neighbors import KNeighborsClassifier
    best_score = 0
    best_p = 0
    best_k = 0
    for k in range(2, 11):
        for p in range(1, 6):
            knn1 = KNeighborsClassifier(weights="distance", p=p, n_neighbors=k)
            knn1.fit(x_train, y_train)
            score = knn1.score(x_test, y_test)
            if score > best_score:
                best_p = p
                best_k = k
                best_score = score
    print("best_score:", best_score)
    print("best_k:", best_k)
    print("best_p:", best_p)

    # 1-2 Tuning with cross-validation
    from sklearn.model_selection import cross_val_score
    best_p = 0
    best_k = 0
    best_score = 0
    for k in range(2, 11):
        for p in range(1, 6):
            knn2 = KNeighborsClassifier(weights="distance", p=p, n_neighbors=k)
            knn2.fit(x_train, y_train)
            scores = cross_val_score(knn2, x_train, y_train, cv=5)   # cv is the number of folds the training set is split into
            score = np.mean(scores)
            if score > best_score:
                best_p = p
                best_k = k
                best_score = score
    print("best_score:", best_score)
    print("best_k:", best_k)
    print("best_p:", best_p)
    knn11 = KNeighborsClassifier(weights="distance", p=2, n_neighbors=2)
    knn11.fit(x_train, y_train)
    print(knn11.score(x_test, y_test))

    # 1-3 Grid search: finding the best hyperparameter combination is itself cross-validation on the training set
    from sklearn.model_selection import GridSearchCV
    knn3 = KNeighborsClassifier()
    param = [
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(2, 11)],
            "p": [k for k in range(1, 6)]
        }
    ]
    grid1 = GridSearchCV(knn3, param, verbose=1, cv=5)   # cv is the number of folds used for cross-validation
    grid1.fit(x_train, y_train)
    print(grid1.best_score_)
    print(grid1.best_params_)
    kn2 = grid1.best_estimator_
    print(kn2.score(x_test, y_test))

    The actual outputs are as follows:

    (1) Actual learning-curve output for the overfitting and best-fit cases

    (2) Actual learning-curve output for the underfitting and best-fit cases

    Reposted from: https://www.cnblogs.com/Yanjy-OnlyOne/p/11343381.html

  • The analysis shows that the model has good stability and predictive ability, with a cross-validated Q² of 0.527 and a non-cross-validated R² of 0.995. The model's contour maps explain the relationship between the structure of the chalcone derivatives and their antitumor activity and can be used to design antitumor chalcone derivatives. ...
  • After a careful, strict cleaning procedure to prevent cross-contamination from laboratory glassware, the PAE standards showed signals with good specificity. The detection limits for DEHP, DBP, and DnOP were 130, 122, and 89 ng/g, respectively, so the calculated quantification limits for DEHP, DBP, and DnOP were 394, 370, ...
  • From a population of 120 samples of two crops, covering multiple sampling dates within the treatments, spectral information in the visible and NIR regions from 61-85 randomly selected samples was used with modified partial least squares (MPLS) regression and internal cross-validation. In the MPLS protocol, we ... in the calibration ...
  • Table of contents: the model-testing step of building a machine learning application; 1.1 metrics evaluation measures; 1.2 testing regression models: 1.2.1 r2_score, 1.2.2 explained_variance_score; 1.3 testing classification models: 1.3.1 accuracy, 1.3.2 precision, 1.3.3 recall, 1.3.4 F1, 1.3.5 ROC ...
  • By statistically comparing five spatial interpolation models and validating the results with cross-validation, universal kriging (UK) was found to be the best method for representing the groundwater level in the Salman area, since this model has the lowest root mean square error (RMSE), the lowest mean error (ME), and the highest ...
  • Combining a simulated-annealing-based grid search with 10-fold cross-validation, the optimal model parameters were found to be γ = 3.8334×10⁴ and σ² = 0.7176. Predictions were made for 57 data sets; the relative prediction error was within ±15%, the mean squared error was 4.01×10⁻⁵, and the coefficient of determination R² was 0.9842. The model's predictions ...
  • You can also generate regression or classification plots, or evaluate prediction accuracy (classification) or RMSE/R² (regression) in repeated cross-validation experiments. Installing ugtm is simple: pip install ugtm. If you get an error message, try upgrading the packages: pip install --upgrade pip numpy ...
  • Cross-validation gives an optimal number of principal components of 1, yet the model got better: R² on both the prediction set and the test set is above 0.9 and the RPD is above 3. What could be the reason, and is a model built this way valid? Asking experienced research colleagues for help. ...
  • For the absorption spectra of the mixed solution in the 400-800 nm band, synergy interval selection was used to screen characteristic intervals for Zn(II) and Co(II) region by region, and the optimal intervals were chosen by minimizing the leave-one-out cross-validation root mean square error (RMSECV) and maximizing the coefficient of determination R²; ...
  • Crowding_model (source code)

    2021-02-27 06:12:26
    tbl_sloan_pelli.mat - a Matlab table with Sloan and Pelli data in the periphery; tbl_sloan.mat - a Matlab table with Sloan data in the periphery; VF.mat - matrices used for randomized cross-validation and for the repeatability of R²; fit_linear_models.m - script that fits the crowding models (supports models m3, m7, m8 ...
  • Model evaluation methods: 1. the score method of Estimator objects; score calls predict to obtain the predicted response, compares it with the true values passed in, and computes a score; classifiers inherit from the ClassifierMixin class, with criteria from sklearn.metrics ... 2. the scoring parameter used in cross-validation ...
  • When using r2 (the coefficient of determination R²) from sklearn.metrics, the warning "UndefinedMetricWarning: R^2 score is not well-defined with less than two samples." is raised ... meaning there is only one Y value per fold (so R² is not applicable to leave-one-out cross-validation).
  • Basic OSPF configuration simulation: topology, purpose of the simulation, the specific commands for R1, R2, R3, and R4, post-configuration verification, summary and questions. Learning networking from scratch, starting with simulation experiments. Topology and purpose: as shown, four routers form the network, with R1 and R2 connected at both ends by a crossover cable ...
  • house_prices (source code)

    2021-03-25 19:27:36
    Use cross-validation to evaluate your model. (b) Try out preprocessing techniques. (c) In the lectures you learned about three types of feature selection; choose two of them and implement at least one method for each, then evaluate your feature selection experimentally. (d) To make the modelling and preprocessing steps reach ...
  • For example, m independent variables give 2^m - 1 subset regression equations to fit, from which one is chosen using statistics of the regression equation as criteria (such as cross-validation error, Cp, BIC, adjusted R²). The R package used is leaps and the function is regsubsets(). Using a linear regression example, this post shares how to use R ...
  • Worked solutions for linear regression practice

    2020-02-28 20:37:57
    Table of contents: Example 1 - importing packages, reading the data, training the model, tuning hyperparameters, plotting, explaining the meaning of three statements, the parameters in linear regression, the value of R² in practice; Example 2 - preprocessing, packages, coefficients, line wrapping for plotting, accessing Chinese fonts, generating a color gradient; Example 3; Example 4 - plotting the ROC curve ...
  • The experimental results show that, with the variable range 10796.2-8246.6 cm⁻¹ and multiplicative scatter correction applied to preprocess the spectra, the partial least squares calibration model for serum triglycerides achieved a correlation coefficient R² of 0.9454 and a cross-validation standard error (RMSECV) of 0.146; the prediction standard error on the validation set ...
  • Performance evaluation metrics: 1. metrics for classification algorithms 1.1 ...; 2. metrics for regression algorithms: 2.1 mean absolute error (MAE), 2.2 mean squared error (MSE), 2.3 root mean squared error (RMSE), 2.4 coefficient of determination (R² score), 2.5 cross-validation. Machine lear...
  • 139.03 Determining the keys for cross retrieval; 140.04 Summary of using Redis; 141.05 Some small issues; 142.06 Introduction to search technology; 143.07 Solr text-oriented cache-database search web application ...
  • 137.01 Cross retrieval with Redis; 138.02 Cross retrieval with Redis; 139.03 Determining the keys for cross retrieval; 140.04 Summary of using Redis; 141.05 Some small issues ...
  • CruiseYoung provides a catalog of e-books with detailed bookmarks: http://blog.csdn.net/fksec/article/details/7888251 - Oracle Database 11g RMAN backup and recovery. Basic information - original title: Oracle RMAN 11g Backup and Recovery; original publisher: ...
