-
2019-09-20 15:49:51
Hands-on example:
def print_best_score(gsearch, param_test):
    # print the best score
    print("Best score: %0.3f" % gsearch.best_score_)
    print("Best parameters set:")
    # print the parameters actually used by the best estimator
    best_parameters = gsearch.best_estimator_.get_params()
    for param_name in sorted(param_test.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

params = {'depth': [4, 6, 10],
          'learning_rate': [0.05, 0.1, 0.15],
          # 'l2_leaf_reg': [1, 4, 9],
          # 'iterations': [1200],
          # 'early_stopping_rounds': [1000],
          # 'task_type': ['GPU'],
          # 'loss_function': ['MultiClass'],
          }

# cb = cbt.CatBoostClassifier()
estimator = cbt.CatBoostClassifier(iterations=2000, verbose=400, early_stopping_rounds=200,
                                   task_type='GPU', loss_function='MultiClass')
cbt_model = GridSearchCV(estimator, param_grid=params, scoring="accuracy", cv=3)
# cbt_model = cbt.CatBoostClassifier(iterations=1200, learning_rate=0.05, verbose=300,
#                                    early_stopping_rounds=1000, task_type='GPU',
#                                    loss_function='MultiClass')
cbt_model.fit(train_x, train_y, eval_set=(train_x, train_y))
# cbt_model.grid_scores_, gsearch.best_params_, gsearch.best_score_
print_best_score(cbt_model, params)
oof = cbt_model.predict_proba(test_x)
Background for the code above:
GridSearchCV (grid search): parameters and methods
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
(1) estimator
The estimator (classifier) to tune, passed with every parameter fixed except the ones being searched. Each estimator needs a scoring parameter or a score method, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10).
(2) param_grid
The values of the parameters to optimize, given as a dict or a list of dicts, e.g. param_grid = param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring=None
The model evaluation criterion. It defaults to None, in which case the estimator's own score method is used. It can also be a string such as scoring='roc_auc' (the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y); if None, the estimator's own error estimate is used.
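As a minimal sketch of the scorer(estimator, X, y) signature described above (the estimator and grid here are illustrative, not taken from the post), a callable can be passed in place of a metric string:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# a custom scorer is any callable with the signature scorer(estimator, X, y)
def macro_f1_scorer(estimator, X, y):
    return f1_score(y, estimator.predict(X), average='macro')

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=10),
    param_grid={'n_estimators': list(range(10, 71, 10))},
    scoring=macro_f1_scorer,  # equivalent to scoring='f1_macro'
    cv=3)
# search.fit(train_x, train_y)  # train_x / train_y as in the example above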
CatBoostClassifier/CatBoostRegressor
Common parameters
learning_rate (eta) = set automatically: the learning rate
depth (max_depth) = 6: depth of the trees
l2_leaf_reg (reg_lambda) = 3: L2 regularization coefficient
n_estimators (num_boost_round, num_trees) = 1000: the maximum number of trees built
one_hot_max_size = 2: use one-hot encoding for categorical features with at most this many distinct values
loss_function = 'Logloss': the loss function to optimize
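A minimal sketch (values shown only for illustration) of how these parameters appear in the constructor, reusing the cbt alias and cat_features_index from the snippets in this post:

import catboost as cbt

clf = cbt.CatBoostClassifier(
    learning_rate=0.05,      # eta; chosen automatically if omitted
    depth=6,                 # max_depth
    l2_leaf_reg=3,           # reg_lambda
    n_estimators=1000,       # num_boost_round / num_trees
    one_hot_max_size=2,      # one-hot encode low-cardinality categorical features
    loss_function='Logloss')
# clf.fit(train_x, train_y, cat_features=cat_features_index)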
CatBoost has two major advantages: first, it handles categorical features during training rather than in a separate preprocessing step; second, the algorithm used to compute leaf values when choosing the tree structure helps avoid overfitting. Note:
When tuning CatBoost it is hard to attribute the metric to the categorical features, so tuning results are also reported without passing the categorical features, and two models are evaluated: one with categorical features and one without. one_hot_max_size was tuned separately because it does not interact with the other parameters.
If nothing is passed in the cat_features parameter, CatBoost treats every column as numeric. Note that if a column contains string values the algorithm will raise an error, and int columns are treated as numeric by default as well; a column must be declared in cat_features for CatBoost to handle it as categorical.

cat_features_index = [2, 3, 4, 5, 6, 7]
# With categorical features
clf = cbt.CatBoostClassifier(iterations=2000, learning_rate=0.05, verbose=300,
                             early_stopping_rounds=1000, task_type='GPU',
                             loss_function='MultiClass', depth=4)
clf.fit(train_x, train_y, cat_features=cat_features_index)
References:
https://www.kaggle.com/manrunning/catboost-for-titanic-top-7
https://blog.csdn.net/weixin_41988628/article/details/83098130
https://blog.csdn.net/linxid/article/details/80723811
http://www.atyun.com/4650.html
https://www.cnblogs.com/nxf-rabbit75/p/10923549.html

-
ML: How feature_importances_ is computed in LGBMClassifier, XGBClassifier and CatBoostClassifier, a detailed source walkthrough
2022-04-30 16:16:35
Contents
How feature_importances_ is computed in LGBMClassifier, XGBClassifier and CatBoostClassifier: a detailed source walkthrough
LGBMClassifier
LGBMClassifier.feature_importances_ is computed with the 'split' importance type by default (importance_type='split').
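Before reading the source, a short usage sketch (the dataset is just a placeholder) of how the property and the underlying Booster call relate; the source excerpts that follow show where these numbers come from:

from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = LGBMClassifier(importance_type='split').fit(X, y)

print(clf.feature_importances_[:5])  # number of splits that use each feature
# the property simply delegates to the underlying Booster
print(clf.booster_.feature_importance(importance_type='gain')[:5])  # total gain instead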
def feature_importances_(self):
    """Get feature importances.

    Note
    ----
    Feature importance in sklearn interface used to normalize to 1,
    it's deprecated after 2.0.4 and is the same as Booster.feature_importance() now.
    ``importance_type`` attribute is passed to the function
    to configure the type of importance values to be extracted.
    """
    if self._n_features is None:
        raise LGBMNotFittedError('No feature_importances found. Need to call fit beforehand.')
    return self.booster_.feature_importance(importance_type=self.importance_type)

@property
def booster_(self):
    """Get the underlying lightgbm Booster of this model."""
    if self._Booster is None:
        raise LGBMNotFittedError('No booster found. Need to call fit beforehand.')
    return self._Booster

def num_feature(self):
    """Get number of features.

    Returns
    -------
    num_feature : int
        The number of features.
    """
    out_num_feature = ctypes.c_int(0)
    _safe_call(_LIB.LGBM_BoosterGetNumFeature(
        self.handle,
        ctypes.byref(out_num_feature)))
    return out_num_feature.value

self.booster_.feature_importance(importance_type=self.importance_type)
def feature_importance(self, importance_type='split', iteration=None):
    """Get feature importances.

    Parameters
    ----------
    importance_type : string, optional (default="split")
        How the importance is calculated.
        If "split", result contains numbers of times the feature is used in a model.
        If "gain", result contains total gains of splits which use the feature.
    iteration : int or None, optional (default=None)
        Limit number of iterations in the feature importance calculation.
        If None, if the best iteration exists, it is used; otherwise, all trees are used.
        If <= 0, all trees are used (no limits).

    Returns
    -------
    result : numpy array
        Array with feature importances.
    """
    if iteration is None:
        iteration = self.best_iteration
    if importance_type == "split":
        importance_type_int = 0
    elif importance_type == "gain":
        importance_type_int = 1
    else:
        importance_type_int = -1
    result = np.zeros(self.num_feature(), dtype=np.float64)
    _safe_call(_LIB.LGBM_BoosterFeatureImportance(
        self.handle,
        ctypes.c_int(iteration),
        ctypes.c_int(importance_type_int),
        result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
    if importance_type_int == 0:
        return result.astype(int)
    else:
        return result

XGBClassifier
XGBClassifier.feature_importances_ is computed with the 'weight' importance type by default; gain, weight, cover, total_gain and total_cover are also available through importance_type.
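A short usage sketch (placeholder data) showing that the property below is the booster's importance scores normalized to sum to 1, while get_score exposes the raw values for any importance type; the source that follows shows how they are computed:

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
clf = XGBClassifier(n_estimators=50).fit(X, y)

print(clf.feature_importances_.sum())                    # normalized, sums to 1.0
booster = clf.get_booster()
print(booster.get_score(importance_type='weight'))       # raw split counts per feature
print(booster.get_score(importance_type='total_gain'))   # total gain per feature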
def feature_importances_(self):
    """Feature importances property.

    .. note:: Feature importance is defined only for tree boosters.
        Feature importance is only defined when the decision tree model is chosen as
        base learner (`booster=gbtree`). It is not defined for other base learner types,
        such as linear learners (`booster=gblinear`).

    Returns
    -------
    feature_importances_ : array of shape ``[n_features]``
    """
    if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
        raise AttributeError('Feature importance is not defined for Booster type {}'
                             .format(self.booster))
    b = self.get_booster()
    score = b.get_score(importance_type=self.importance_type)
    all_features = [score.get(f, 0.) for f in b.feature_names]
    all_features = np.array(all_features, dtype=np.float32)
    return all_features / all_features.sum()
Booster.get_score:

def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.

    Importance type can be defined as:
    * 'weight': the number of times a feature is used to split the data across all trees.
    * 'gain': the average gain across all splits the feature is used in.
    * 'cover': the average coverage across all splits the feature is used in.
    * 'total_gain': the total gain across all splits the feature is used in.
    * 'total_cover': the total coverage across all splits the feature is used in.

    .. note:: Feature importance is defined only for tree boosters.
        Feature importance is only defined when the decision tree model is chosen as
        base learner (`booster=gbtree`). It is not defined for other base learner types,
        such as linear learners (`booster=gblinear`).

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file.
    importance_type: str, default 'weight'
        One of the importance types defined above.
    """
    if getattr(self, 'booster', None) is not None and self.booster not in {'gbtree', 'dart'}:
        raise ValueError('Feature importance is not defined for Booster type {}'
                         .format(self.booster))

    allowed_importance_types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
    if importance_type not in allowed_importance_types:
        msg = ("importance_type mismatch, got '{}', expected one of " +
               repr(allowed_importance_types))
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap
    else:
        average_over_splits = True
        if importance_type == 'total_gain':
            importance_type = 'gain'
            average_over_splits = False
        elif importance_type == 'total_cover':
            importance_type = 'cover'
            average_over_splits = False

        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')
                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])
                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        if average_over_splits:
            for fid in gmap:
                gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
CatBoostClassifier
CatBoostClassifier.feature_importances_ chooses what to return based on is_groupwise_metric(loss):

def feature_importances_(self):
    loss = self._object._get_loss_function_name()
    if loss and is_groupwise_metric(loss):
        return np.array(getattr(self, "_loss_value_change", None))
    else:
        return np.array(getattr(self, "_prediction_values_change", None))

CatBoost simply compares the metric (loss function) obtained with the model in the normal situation (with the feature included) against the metric for a model without that feature (approximately, as if the feature were removed from every tree in the ensemble). The larger the difference, the more important the feature.
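A small usage sketch (placeholder data) of how these values surface through the public API:

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = CatBoostClassifier(iterations=100, verbose=False).fit(X, y)

# for a non-ranking loss this is PredictionValuesChange, the same array
# that backs the feature_importances_ property shown above
print(model.get_feature_importance()[:5])
print(model.feature_importances_[:5])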
-
[CASE] San Francisco crime dataset (CatBoostClassifier)
2019-12-10 21:15:47
Reference: top 2% based on CatBoostClassifier
Import libraries and data

import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
import catboost
import gensim

data_train = pd.read_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\kaggle\\sf-crime\\train.csv")
data_test = pd.read_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\kaggle\\sf-crime\\test.csv")
Feature engineering

def transformTimeDataset(dataset):
    dataset['Dates'] = pd.to_datetime(dataset['Dates'])
    dataset['Date'] = dataset['Dates'].dt.date
    dataset['n_days'] = (dataset['Date'] - dataset['Date'].min()).apply(lambda x: x.days)
    dataset['Year'] = dataset['Dates'].dt.year
    dataset['DayOfWeek'] = dataset['Dates'].dt.dayofweek
    dataset['WeekOfYear'] = dataset['Dates'].dt.weekofyear
    dataset['Month'] = dataset['Dates'].dt.month
    dataset['Hour'] = dataset['Dates'].dt.hour
    return dataset

data_train = transformTimeDataset(data_train)
data_test = transformTimeDataset(data_test)

def transformdGeoDataset(dataset):
    dataset['Block'] = dataset['Address'].str.contains('block', case=False)
    dataset['Block'] = dataset['Block'].map(lambda x: 1 if x == True else 0)
    dataset = pd.get_dummies(data=dataset, columns=['PdDistrict'], drop_first=True)
    return dataset

data_train = transformdGeoDataset(data_train)
data_test = transformdGeoDataset(data_test)

data_train = data_train.drop(["Descript", "Resolution", "Address", "Dates", "Date"], axis=1)
data_test = data_test.drop(["Address", "Dates", "Date"], axis=1)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data_train.Category = le.fit_transform(data_train.Category)
Set up features and target

X = data_train.drop("Category", axis=1)
y = data_train['Category']
print(X.head())
DayOfWeek X Y n_days Year WeekOfYear Month Hour Block PdDistrict_CENTRAL PdDistrict_INGLESIDE PdDistrict_MISSION PdDistrict_NORTHERN PdDistrict_PARK PdDistrict_RICHMOND PdDistrict_SOUTHERN PdDistrict_TARAVAL PdDistrict_TENDERLOIN 0 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 1 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 2 2 -122.424363 37.800414 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 3 2 -122.426995 37.800873 4510 2015 20 5 23 1 0 0 0 1 0 0 0 0 0 4 2 -122.438738 37.771541 4510 2015 20 5 23 1 0 0 0 0 1 0 0 0 0
Save the processed data to disk first (to save time later)

data_train = pd.DataFrame(data_train)
data_train.to_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\Results\\Crimedatatrain.csv")
Reload the data

data_train = pd.read_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\Results\\Crimedatatrain.csv")
data_test = pd.read_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\Results\\Crimedatatest.csv")
print(data_train.head())
Category DayOfWeek X Y n_days Year WeekOfYear Month Hour Block PdDistrict_CENTRAL PdDistrict_INGLESIDE PdDistrict_MISSION PdDistrict_NORTHERN PdDistrict_PARK PdDistrict_RICHMOND PdDistrict_SOUTHERN PdDistrict_TARAVAL PdDistrict_TENDERLOIN 0 37 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 1 21 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 2 21 2 -122.424363 37.800414 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 3 16 2 -122.426995 37.800873 4510 2015 20 5 23 1 0 0 0 1 0 0 0 0 0 4 16 2 -122.438738 37.771541 4510 2015 20 5 23 1 0 0 0 0 1 0 0 0 0
Rename the X and Y columns

def XYrename(data):
    # note: the original post maps X to 'lat' and Y to 'lon', although X is actually the longitude
    data.rename(columns={'X': 'lat', 'Y': 'lon'}, inplace=True)
    return data

XYrename(data_train)
XYrename(data_test)
print(data_test.head())
Concatenate the datasets and inspect the geographic coordinates

all_data = pd.concat((data_train, data_test), ignore_index=True)
print(all_data['lat'].min(), all_data['lat'].max())
print(all_data['lon'].min(), all_data['lon'].max())
-122.51364209999998 -120.5 37.70787902 90.0
To improve the model, expand the geographic features.
On dimensionality reduction with PCA:

pca = PCA(n_components=2)
pca.fit(data_train[['lat', 'lon']])
XYt = pca.transform(data_train[['lat', 'lon']])
print(XYt)
[[ 0.00345383 0.00340625] [ 0.00345383 0.00340625] [ 0.02930857 0.00284016] ... [ 0.00995497 -0.01886836] [ 0.01077519 -0.03170572] [-0.03175461 -0.02889356]]
Apply PCA and Gaussian mixture clustering to the geographic coordinates to add dimensions and improve model performance.
Some introductions to Gaussian mixture clustering:
Machine Learning (17): the GMM algorithm
Gaussian mixture models

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

pca = PCA(n_components=2)
pca.fit(data_train[['lat', 'lon']])
XYt = pca.transform(data_train[['lat', 'lon']])
data_train['Spatialpca1'] = XYt[:, 0]
data_train['Spatialpca2'] = XYt[:, 1]

clf = GaussianMixture(n_components=150, covariance_type='diag', random_state=0).fit(data_train[['lat', 'lon']])
data_train['Spatialpcacluster'] = clf.predict(data_train[['lat', 'lon']])
print(data_train.Spatialpcacluster.head())
Result:
0 85 1 85 2 8 3 8 4 147 Name: Spatialpcacluster, dtype: int64
DEMO (wrap it in a function so the same transform can be applied to both the train and test sets)

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def Spatialtransfer(dataset):
    # note: 'lan' below is a typo for 'lat'; the corrected version appears further down
    lan_median = dataset[dataset['lat'] < -120.5]['lat'].median()
    lon_median = dataset[dataset['lon'] < 90]['lon'].median()
    dataset.loc[dataset['lat'] >= -120.5, 'lan'] = lan_median
    dataset.loc[dataset['lon'] >= 90, 'lon'] = lon_median
    dataset['lat+lon'] = dataset['lat'] + dataset['lon']
    dataset['lat-lon'] = dataset['lat'] - dataset['lon']
    dataset['Spatial30_1'] = dataset['lat']*np.cos(np.pi/6) + dataset['lon']*np.sin(np.pi/6)
    dataset['Spatial30_2'] = dataset['lon']*np.cos(np.pi/6) - dataset['lan']*np.sin(np.pi/6)
    dataset['Spatial60_1'] = dataset['lat']*np.cos(np.pi/3) + dataset['lon']*np.sin(np.pi/3)
    dataset['Spatial60_2'] = dataset['lon']*np.cos(np.pi/3) - dataset['lan']*np.sin(np.pi/3)
    dataset['Spatial1'] = (dataset['lat'] - dataset['lan'].min()) ** 2 + (dataset['lon'] - dataset['lon'].min()) ** 2
    dataset['Spatial2'] = (dataset['lat'].max() - dataset['lan']) ** 2 + (dataset['lon'] - dataset['lon'].min()) ** 2
    dataset['Spatial3'] = (dataset['lat'] - dataset['lan'].min()) ** 2 + (dataset['lon'].max() - dataset['lon']) ** 2
    dataset['Spatial4'] = (dataset['lat'].max() - dataset['lan']) ** 2 + (dataset['lon'].max() - dataset['lon']) ** 2
    dataset['Spatial5'] = (dataset['lat'] - lan_median) ** 2 + (dataset['lon'] - lon_median) ** 2
    pca = PCA(n_components=2)
    pca.fit(dataset[['lat', 'lon']])
    XYt = pca.transform(dataset[['lat', 'lon']])
    dataset['Spatialpca1'] = XYt[:, 0]
    dataset['Spatialpca2'] = XYt[:, 1]
    clf = GaussianMixture(n_components=150, covariance_type='diag', random_state=0).fit(dataset[['lat', 'lon']])
    dataset['Spatialpcacluster'] = clf.predict(dataset[['lat', 'lon']])
    return dataset

Spatialtransfer(data_train)
print(data_train.head())
It runs and produces:
Category DayOfWeek lat lon n_days Year WeekOfYear Month Hour Block PdDistrict_CENTRAL PdDistrict_INGLESIDE PdDistrict_MISSION PdDistrict_NORTHERN PdDistrict_PARK PdDistrict_RICHMOND PdDistrict_SOUTHERN PdDistrict_TARAVAL PdDistrict_TENDERLOIN lan lat+lon lat-lon Spatial30_1 Spatial30_2 Spatial60_1 Spatial60_2 Spatial1 Spatial2 Spatial3 Spatial4 Spatial5 Spatialpca1 Spatialpca2 Spatialpcacluster 0 37 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 NaN -84.651293 -160.200490 -87.136633 NaN -28.499184 NaN 0.004541 NaN 0.002149 NaN 0.000090 -0.001242 -0.008148 83 1 21 2 -122.425892 37.774599 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 NaN -84.651293 -160.200490 -87.136633 NaN -28.499184 NaN 0.004541 NaN 0.002149 NaN 0.000090 -0.001242 -0.008148 83 2 21 2 -122.424363 37.800414 4510 2015 20 5 23 0 0 0 0 1 0 0 0 0 0 NaN -84.623949 -160.224777 -87.122401 NaN -28.476062 NaN 0.008626 NaN 0.000446 NaN 0.000688 0.006807 -0.032724 101 3 16 2 -122.426995 37.800873 4510 2015 20 5 23 1 0 0 0 1 0 0 0 0 0 NaN -84.626123 -160.227868 -87.124452 NaN -28.476982 NaN 0.008760 NaN 0.000477 NaN 0.000760 0.004378 -0.033838 101 4 16 2 -122.438738 37.771541 4510 2015 20 5 23 1 0 0 0 0 1 0 0 0 0 NaN -84.667196 -160.210279 -87.149287 NaN -28.508255 NaN 0.004551 NaN 0.002844 NaN 0.000513 -0.014443 -0.008461 88
There are still a few small mistakes (the NaN values come from 'lan' not having been changed to 'lat'; the function itself is fine and is corrected below).
Find and replace in PyCharm: Ctrl + R.
After a bulk find-and-replace in PyCharm, the corrected version:
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def Spatialtransfer(dataset):
    lat_median = dataset[dataset['lat'] < -120.5]['lat'].median()
    lon_median = dataset[dataset['lon'] < 90]['lon'].median()
    dataset.loc[dataset['lat'] >= -120.5, 'lat'] = lat_median
    dataset.loc[dataset['lon'] >= 90, 'lon'] = lon_median
    dataset['lat+lon'] = dataset['lat'] + dataset['lon']
    dataset['lat-lon'] = dataset['lat'] - dataset['lon']
    dataset['Spatial30_1'] = dataset['lat']*np.cos(np.pi/6) + dataset['lon']*np.sin(np.pi/6)
    dataset['Spatial30_2'] = dataset['lon']*np.cos(np.pi/6) - dataset['lat']*np.sin(np.pi/6)
    dataset['Spatial60_1'] = dataset['lat']*np.cos(np.pi/3) + dataset['lon']*np.sin(np.pi/3)
    dataset['Spatial60_2'] = dataset['lon']*np.cos(np.pi/3) - dataset['lat']*np.sin(np.pi/3)
    dataset['Spatial1'] = (dataset['lat'] - dataset['lat'].min()) ** 2 + (dataset['lon'] - dataset['lon'].min()) ** 2
    dataset['Spatial2'] = (dataset['lat'].max() - dataset['lat']) ** 2 + (dataset['lon'] - dataset['lon'].min()) ** 2
    dataset['Spatial3'] = (dataset['lat'] - dataset['lat'].min()) ** 2 + (dataset['lon'].max() - dataset['lon']) ** 2
    dataset['Spatial4'] = (dataset['lat'].max() - dataset['lat']) ** 2 + (dataset['lon'].max() - dataset['lon']) ** 2
    dataset['Spatial5'] = (dataset['lat'] - lat_median) ** 2 + (dataset['lon'] - lon_median) ** 2
    pca = PCA(n_components=2)
    pca.fit(dataset[['lat', 'lon']])
    XYt = pca.transform(dataset[['lat', 'lon']])
    dataset['Spatialpca1'] = XYt[:, 0]
    dataset['Spatialpca2'] = XYt[:, 1]
    clf = GaussianMixture(n_components=150, covariance_type='diag', random_state=0).fit(dataset[['lat', 'lon']])
    dataset['Spatialpcacluster'] = clf.predict(dataset[['lat', 'lon']])
    return dataset

Spatialtransfer(data_train)
print(data_train.head())
Save the processed results to disk

data_train.to_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\Results\\Crimedatatrain2.csv")
data_test.to_csv("C:\\Users\\Nihil\\Documents\\pythonlearn\\data\\Results\\Crimedatatest2.csv")
Check the data types
RangeIndex: 878049 entries, 0 to 878048 Data columns (total 33 columns): Category 878049 non-null int64 DayOfWeek 878049 non-null int64 lat 878049 non-null float64 lon 878049 non-null float64 n_days 878049 non-null int64 Year 878049 non-null int64 WeekOfYear 878049 non-null int64 Month 878049 non-null int64 Hour 878049 non-null int64 Block 878049 non-null int64 PdDistrict_CENTRAL 878049 non-null int64 PdDistrict_INGLESIDE 878049 non-null int64 PdDistrict_MISSION 878049 non-null int64 PdDistrict_NORTHERN 878049 non-null int64 PdDistrict_PARK 878049 non-null int64 PdDistrict_RICHMOND 878049 non-null int64 PdDistrict_SOUTHERN 878049 non-null int64 PdDistrict_TARAVAL 878049 non-null int64 PdDistrict_TENDERLOIN 878049 non-null int64 lat+lon 878049 non-null float64 lat-lon 878049 non-null float64 Spatial30_1 878049 non-null float64 Spatial30_2 878049 non-null float64 Spatial60_1 878049 non-null float64 Spatial60_2 878049 non-null float64 Spatial1 878049 non-null float64 Spatial2 878049 non-null float64 Spatial3 878049 non-null float64 Spatial4 878049 non-null float64 Spatial5 878049 non-null float64 Spatialpca1 878049 non-null float64 Spatialpca2 878049 non-null float64 Spatialpcacluster 878049 non-null int64 dtypes: float64(15), int64(18) memory usage: 221.1 MB
Train a CatBoostClassifier

X = data_train.drop(['Category'], axis=1)
y = data_train.Category
categorical_features_indices = np.where(X.dtypes != np.float)[0]

from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, random_state=42, train_size=0.3)

from catboost import CatBoostClassifier
from catboost import Pool
from catboost import cv
from sklearn.metrics import accuracy_score

model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation))
Training was too slow, so it was not run to completion.
-
CatBoost: principles and applications
2022-07-10 00:15:25
CatBoost (categorical boosting) is a gradient boosting library that is particularly good at handling categorical features. This article gives a detailed introduction to the basic principles of CatBoost together with worked application examples. Several of its important features, such as categorical feature handling, feature selection, text feature handling, hyperparameter tuning and multi-label targets, will be covered in dedicated follow-up posts.
An overview of gradient boosting
To understand boosting, we first need to understand ensemble learning: to obtain better predictive performance, ensemble learning combines the predictions of several (weak) models. The idea is strength in numbers, because an effective combination of weak learners produces a more accurate and more robust model. Ensemble methods fall into three broad families:
Bagging: builds different models in parallel on random subsets of the data and aggregates the predictions of all the predictors.
Boosting: iterative, sequential and adaptive, because each predictor corrects the errors of the previous model.
Stacking: a meta-learning technique that combines the predictions of several machine learning algorithms, for example bagging and boosting.
What is CatBoost
CatBoost (categorical boosting) is a machine learning algorithm open-sourced by Yandex. It integrates easily with deep learning frameworks and handles many data types, which helps solve a wide range of problems companies face today. CatBoost, XGBoost and LightGBM are known as the three mainstream GBDT libraries; all of them are improved implementations within the GBDT framework. XGBoost is widely used in industry, LightGBM greatly improves the computational efficiency of GBDT, and Yandex's CatBoost claims to outperform both XGBoost and LightGBM in accuracy. CatBoost is a GBDT framework built on oblivious (symmetric) decision trees as base learners; it has few parameters, supports categorical variables and reaches high accuracy. Its main selling point is efficient and principled handling of categorical features, which is reflected in its name: CatBoost = Categorical + Boosting. In addition, CatBoost addresses gradient bias and prediction shift, which reduces overfitting and improves accuracy and generalization. Its training algorithm has a GPU implementation, while scoring runs on the CPU.
Key features of CatBoost
Some key characteristics that set CatBoost apart from similar libraries:
01 Symmetric trees
Unlike XGBoost and LightGBM, CatBoost builds symmetric (balanced) trees. At every step, the leaves from the previous level are split using the same condition: the feature-split pair with the lowest loss is selected and used for all nodes of that level. This balanced structure allows an efficient CPU implementation, shortens prediction time, and acts as a form of regularization that helps prevent overfitting.
In an oblivious decision tree, a single feature is used to build all the branches of a given tree level. The plots below make the three split types clearer: "FloatFeature", "OneHotFeature" and "OnlineCtr".

FloatFeature
When the model has no categorical features, the visualized tree contains only "FloatFeature" nodes. The node of a "FloatFeature" split stores the feature index and the border value used to split the objects.

boston = load_boston()
y = boston['target']
X = boston['data']
pool = catboost.Pool(data=X, label=y)
model = CatBoostRegressor(depth=2, verbose=False, iterations=1).fit(X, y)
model.plot_tree(tree_idx=0,
                # pool=pool,
                )
In this example, the depth-0 node splits the objects on their feature 0 with a border value; similarly, the depth-1 nodes split the objects on their second feature with a border value.
OneHotFeature
titanic_df = titanic()
X = titanic_df[0].drop('Survived', axis=1)
y = titanic_df[0].Survived
# missing values in categorical columns are filled with "NAN" (code omitted)
pool = Pool(X, y, cat_features=cat_features_index, feature_names=list(X.columns))
model = CatBoostClassifier(
    max_depth=2, verbose=False,
    max_ctr_complexity=1, random_seed=42, iterations=2).fit(pool)
model.plot_tree(
    tree_idx=0,
    pool=pool  # "pool" is a required argument when the tree contains one-hot encoded features
)
The first tree contains a single split produced by a "OneHotFeature" feature. This split puts the objects with "Sex=female" on the left and the "other" objects on the right.

OnlineCtr
model.plot_tree(tree_idx=1, pool=pool)
02 Ordered Boosting
Classic boosting algorithms suffer from prediction shift and easily overfit on small or noisy datasets. When estimating the gradient for a data instance, they use the same instances the model was built on, so the model never gets to see unseen data.
CatBoost instead uses ordered boosting, a permutation-driven approach that trains the model on one subset of the data while computing residuals on another, which prevents target leakage and overfitting.
03 Robustness
CatBoost reduces the need for extensive hyperparameter tuning and lowers the chance of overfitting, which also leads to more general models. It still exposes a number of tunable parameters, including the number of trees, the learning rate, regularization, tree depth, fold size and bagging temperature, among others; all of them are described in the documentation.
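A minimal sketch (parameter values are illustrative, not recommendations) of how the parameters named above map onto the Python API:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,          # number of trees
    learning_rate=0.05,
    l2_leaf_reg=3,            # regularization
    depth=6,                  # tree depth
    fold_len_multiplier=2,    # controls fold size in ordered boosting
    bagging_temperature=1.0,  # intensity of the Bayesian bootstrap
    verbose=False)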
04 Native feature support and ease of use
CatBoost supports numerical, categorical and text features out of the box, which saves preprocessing time and effort. It can be used from the command line and has user-friendly APIs for Python and R.
Basic usage of CatBoost
Loading the example data
We use the classic titanic dataset that ships with the CatBoost library.
from catboost.datasets import titanic
import numpy as np

train_df, test_df = titanic()

# fill missing values with a number far outside the normal range
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

# split features and label
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

# train / validation split
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)

X_test = test_df
The feature columns of the Titanic data have different types: some are numeric, some are categorical, and some are plain strings, which would normally have to be handled in a specific way (for example encoded with a bag-of-words representation). Here we can simply treat those string columns as categorical features, and all the heavy lifting is done inside CatBoost. The baseline below also needs the indices of the categorical columns, built as sketched next.
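The baseline uses categorical_features_indices without defining it; a minimal sketch of one common way to build it (treating every non-float column as categorical, as the kernel this post follows does) is:

import numpy as np

# indices of the columns CatBoost should treat as categorical
categorical_features_indices = np.where(X.dtypes != float)[0]
print(categorical_features_indices)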
Building a baseline model
We start from a simple baseline model to see how CatBoost is used for prediction.
# !pip install catboost
from catboost import CatBoostClassifier, Pool, metrics, cv
from sklearn.metrics import accuracy_score

model = CatBoostClassifier(
    custom_loss=[metrics.Accuracy()],  # tracked alongside Logloss; smoother on a dataset of this size
    random_seed=42,
    logging_level='Silent'
)

# train the model
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    # logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);
We can follow the model's learning through verbose output or through the built-in plots; here we use CatBoost's interactive visualization to watch the training process.

Feature statistics
Float feature

feature = 'Fare'
res = model.calc_feature_statistics(X_train, y_train, feature, plot=True)
One-hot feature
feature = 'Sex'
res = model.calc_feature_statistics(X_train, y_train, feature, plot=True)
Cross-validating the model
To check whether this is really the best-performing model, we use cross-validation; it may even produce a better model.

cv_params = model.get_params()
cv_params.update({'loss_function': metrics.Logloss()})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)
We now have the loss value at each boosting step, averaged over the 3 folds, which gives a more accurate estimate of model performance.

print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])
))
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))
Best validation accuracy score: 0.83±0.02 on step 355 Precise validation accuracy score: 0.8294051627384961
As shown above, the initial estimate from a single validation split was not particularly reliable; cross-validation improves on it.

Applying the model
After training finishes, the model can be saved and used for prediction.

predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])
[0 0 0 0 1 0 1 0 1 0] [[0.85473931 0.14526069] [0.76313031 0.23686969] [0.88972889 0.11027111] [0.87876173 0.12123827] [0.3611047 0.6388953 ] [0.90513381 0.09486619] [0.33434185 0.66565815] [0.78468564 0.21531436] [0.39429048 0.60570952] [0.94047549 0.05952451]]
A CatBoost application case
For demonstration we again use the amazon dataset bundled with catboost.

import pandas as pd
import os
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import *
from catboost import datasets

(train_df, test_df) = catboost.datasets.amazon()
train_df.head()
Data preprocessing
Extracting the label

y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Checking label balance

print('Labels: {}'.format(set(y)))
print('Zero count = {}, One count = {}'.format(len(y) - sum(y), sum(y)))
Labels: {0, 1}
Zero count = 1897, One count = 30872
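The snippets below pass cat_features without defining it; a minimal sketch, assuming (consistently with the train.cd file created further down) that every feature column of this dataset is categorical:

# every feature column of the amazon dataset is treated as categorical
cat_features = list(range(0, X.shape[1]))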
Saving the data to disk

dataset_dir = './amazon'
if not os.path.exists(dataset_dir):
    os.makedirs(dataset_dir)

train_df.to_csv(
    os.path.join(dataset_dir, 'train.csv'),
    index=False, sep=',', header=True
)
test_df.to_csv(
    os.path.join(dataset_dir, 'test.csv'),
    index=False, sep=',', header=True
)
Creating Pool objects

from catboost.utils import create_cd

feature_names = dict()
for column, name in enumerate(train_df):
    if column == 0:
        continue
    feature_names[column - 1] = name

create_cd(
    label=0,
    cat_features=list(range(1, train_df.columns.shape[0])),
    feature_names=feature_names,
    output_path=os.path.join(dataset_dir, 'train.cd')
)

!cat amazon/train.cd
0	Label
1	Categ	RESOURCE
2	Categ	MGR_ID
3	Categ	ROLE_ROLLUP_1
4	Categ	ROLE_ROLLUP_2
5	Categ	ROLE_DEPTNAME
6	Categ	ROLE_TITLE
7	Categ	ROLE_FAMILY_DESC
8	Categ	ROLE_FAMILY
9	Categ	ROLE_CODE
Several different ways of creating a Pool are shown here; in practice you would pick whichever one suits you.

pool1 = Pool(data=X, label=y, cat_features=cat_features)

pool2 = Pool(
    data=os.path.join(dataset_dir, 'train.csv'),
    delimiter=',',
    column_description=os.path.join(dataset_dir, 'train.cd'),
    has_header=True
)

pool3 = Pool(data=X, cat_features=cat_features)

# The fastest way to create a Pool is from a numpy matrix.
# Use this if you need fast predictions or want to load data into python as quickly as possible.
X_prepared = X.values.astype(str).astype(object)
# for the FeaturesData class, categorical features must have str type
pool4 = Pool(
    data=FeaturesData(
        cat_feature_data=X_prepared,
        cat_feature_names=list(X)
    ),
    label=y.values
)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) +
      '\ndataset 2:' + str(pool2.shape) +
      '\ndataset 3:' + str(pool3.shape) +
      '\ndataset 4:' + str(pool4.shape))
print('\n')
print('Column names')
print('dataset 1:')
print(pool1.get_feature_names())
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())
print('\ndataset 4:')
print(pool4.get_feature_names())
Dataset shape dataset 1:(32769, 9) dataset 2:(32769, 9) dataset 3:(32769, 9) dataset 4:(32769, 9) Column names dataset 1: ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE'] dataset 2: ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE'] dataset 3: ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE'] dataset 4: ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']
Train/validation split
This step should be familiar by now, so we won't dwell on it.

from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, train_size=0.8, random_state=1234)
Choosing the objective function
For a binary classification dataset the objective can be Logloss; if you want to predict the probability of the target label, the cross-entropy objective CrossEntropy is recommended.

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=5,
    learning_rate=0.1,
    # loss_function='CrossEntropy'
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    verbose=False
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())
Model is fitted: True Model params: {'iterations': 5, 'learning_rate': 0.1}
Training the model
Training works much like any sklearn model: instantiate the model and call fit.

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=15,
    # verbose=5,
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)
Evaluating the model
Model evaluation in CatBoost is a little different from the usual workflow: pass the validation subset to the eval_set parameter of model.fit() and set plot=True, and the model is evaluated on the validation set while it trains, with an interactive visualization of the whole process, which is very convenient.

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=50,
    random_seed=63,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy']
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    verbose=False,
    plot=True
)
Comparing models
As in the evaluation above, we use the same CatBoostClassifier but with different learning_rate values, setting train_dir to 'learing_rate_0.7' and 'learing_rate_0.01' respectively.

model1 = CatBoostClassifier(
    learning_rate=0.7,
    iterations=100,
    random_seed=0,
    train_dir='learing_rate_0.7'
)
model2 = CatBoostClassifier(
    learning_rate=0.01,
    iterations=100,
    random_seed=0,
    train_dir='learing_rate_0.01'
)
model1.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
model2.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
We then compare the two runs with catboost's MetricVisualizer. It plots information about training, metric evaluation or cross-validation runs on a single chart; depending on the input, one chart can show one or several runs, and it can be drawn live while training is in progress or after training has finished.

from catboost import MetricVisualizer
MetricVisualizer(['learing_rate_0.01', 'learing_rate_0.7']).start()
Cross-validation
As mentioned earlier, cross-validation can yield a better model and therefore better predictions. Compared with the cross-validation utilities in sklearn, CatBoost's own cv function is simple and flexible, and it can directly visualize the cross-validation process and its results.

from catboost import cv

# parameter settings
params = {}
params['loss_function'] = 'Logloss'
params['iterations'] = 80
params['custom_loss'] = 'AUC'
params['random_seed'] = 63
params['learning_rate'] = 0.5

# catboost's built-in cv
cv_data = cv(
    params=params,
    pool=Pool(X, label=y, cat_features=cat_features),  # the Pool to cross-validate on
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    plot=True,         # visualize the process
    stratified=False,
    verbose=False
)
Everything recorded during cross-validation is returned as a DataFrame, so it can be inspected directly or reused later, which is very convenient.

cv_data.head()

In practice we usually only care about the best score, which is easy to obtain:

best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)
Best validation Logloss score, not stratified: 0.1581±0.0104 on step 52
Overfitting detection
When creating the CatBoostClassifier, set early_stopping_rounds=20 (adjust to your own case) and the model will keep looking for the best iteration within that window of rounds. The metric used to judge the best iteration is set with eval_metric (Logloss by default, "AUC" here), and custom_metric can be used to supply additional evaluation metrics.

model_with_early_stop = CatBoostClassifier(
    eval_metric='AUC',
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    early_stopping_rounds=20
)
model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    verbose=False,
    plot=True
)
print(model_with_early_stop.tree_count_)
30
The tree_count_ attribute shows at which iteration training stopped.

Choosing a probability decision boundary
Plotting the ROC curve
First use catboost's utility function get_roc_curve to obtain the fpr and tpr values on the validation pool, then pass them to sklearn's auc function to compute the roc_auc area. To make this more intuitive, we plot the curve.

from catboost.utils import get_roc_curve
import sklearn
from sklearn import metrics
import matplotlib.pyplot as plt  # needed for the plots below

eval_pool = Pool(X_validation, y_validation, cat_features=cat_features)
curve = get_roc_curve(model, eval_pool)
(fpr, tpr, thresholds) = curve
roc_auc = sklearn.metrics.auc(fpr, tpr)

lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc, alpha=0.5)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', alpha=0.5)
Besides the FPR/TPR pair used for the ROC curve above, FPR and FNR can also be plotted against the threshold.

from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve

(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)

lw = 2
plt.plot(thresholds, fpr, color='blue', lw=lw, label='FPR', alpha=0.5)
plt.plot(thresholds, fnr, color='green', lw=lw, label='FNR', alpha=0.5)
select_threshold returns the probability boundary needed to reach a specified FNR or FPR.

from catboost.utils import select_threshold
print(select_threshold(model=model, data=eval_pool, FNR=0.01))
print(select_threshold(model=model, data=eval_pool, FPR=0.01))
0.48689529945049076 0.9899713850692811
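A small follow-up sketch (reusing the names above; not part of the original post) of how such a boundary can be applied to turn predicted probabilities into class decisions:

# binarize the probabilities with the boundary chosen for FPR <= 0.01
threshold = select_threshold(model=model, data=eval_pool, FPR=0.01)
probs = model.predict_proba(X_validation)[:, 1]
y_pred = (probs > threshold).astype(int)
print('Positive predictions at this boundary:', y_pred.sum())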
Prediction
CatBoost offers four prediction methods: predict, staged_predict, predict_proba and staged_predict_proba. Let's look at the differences between them. First predict and predict_proba: both apply the model to the given dataset; predict returns the final result directly (0 or 1 for binary classification), while predict_proba returns the probability of belonging to each class.

print(model.predict_proba(X=X_validation))
[[0.0608 0.9392] [0.0141 0.9859] [0.0126 0.9874] ... [0.0148 0.9852] [0.0215 0.9785] [0.0333 0.9667]]
print(model.predict(data=X_validation))
[1 1 1 ... 1 1 1]
Unlike the plain prediction call, predict() also has a prediction_type parameter that supports several prediction types:
Probability
Class
RawFormulaVal
Exponent
LogProbability

raw_pred = model.predict(
    data=X_validation,
    prediction_type='RawFormulaVal'
)
print(raw_pred)
[2.7374 4.2445 4.3614 ... 4.1992 3.8198 3.3681]
The raw values above can be converted into probabilities with the sigmoid function.

from numpy import exp

sigmoid = lambda x: 1 / (1 + exp(-x))
probabilities = sigmoid(raw_pred)
print(probabilities)
[0.9392 0.9859 0.9874 ... 0.9852 0.9785 0.9667]
The other pair, staged_predict and staged_predict_proba, produce staged predictions that only use the trees in the range [0; i). The range is controlled through the eval_period parameter: to reduce the number of trees used when applying the model or computing metrics, set the tree index range to [ntree_start; ntree_end) and the step between used tree counts to eval_period. This parameter defines the step through the iteration range [ntree_start; ntree_end). For example, suppose the following values are set:
ntree_start set to 0
ntree_end set to N (the total number of trees)
eval_period set to 2
In this case results are returned for the following tree ranges: [0, 2), [0, 4), ..., [0, N).

predictions_gen = model.staged_predict_proba(
    data=X_validation,
    ntree_start=0,
    ntree_end=5,
    eval_period=1
)
try:
    for iteration, predictions in enumerate(predictions_gen):
        print('Iteration ' + str(iteration) + ', predictions:')
        print(predictions)
except Exception:
    pass
Iteration 0, predictions: [[0.3726 0.6274] ... [0.3726 0.6274]] ... Iteration 4, predictions: [[0.1388 0.8612] ... [0.175 0.825 ]]
Evaluating the model on new data
The eval_metrics method computes the specified metrics on the specified dataset.

metrics = model.eval_metrics(
    data=pool1,
    metrics=['Logloss', 'AUC'],
    ntree_start=0,
    ntree_end=0,
    eval_period=1,
    plot=True
)
In the visualization, eval_metrics only draws the Eval curves; since we asked for metrics=['Logloss', 'AUC'], the plot contains the 'Logloss' and 'AUC' evaluation curves.

print('AUC values:')
print(np.array(metrics['AUC']))
Feature importance
Use the model's own get_feature_importance method:

model.get_feature_importance(prettified=True)

The third-party explanation library Shap can also be used. Unlike the usual Shap workflow, here you call model.get_feature_importance() with type='ShapValues', which returns the shap_values directly; these can then be printed or fed into the usual Shap visualizations.

shap_values = model.get_feature_importance(pool1, type='ShapValues')
expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]
print(shap_values.shape)
(32769, 9)
import shap
shap.initjs()
shap.force_plot(expected_value, shap_values[3, :], X.iloc[3, :])

shap.initjs()
shap.force_plot(expected_value, shap_values[91, :], X.iloc[91, :])

shap.summary_plot(shap_values, X)

X_small = X.iloc[0:200]
shap_small = shap_values[:200]
shap.force_plot(expected_value, shap_small, X_small)
Feature evaluation
Another powerful CatBoost capability is evaluating specified features and reporting whether they help or hurt.

from catboost.eval.catboost_evaluation import *

learn_params = {'iterations': 20,          # 2000
                'learning_rate': 0.5,      # a large learning rate because we use few iterations
                'random_seed': 0,
                'verbose': False,
                'loss_function': 'Logloss',
                'boosting_type': 'Plain'}

evaluator = CatboostEvaluation(
    'amazon/train.tsv',
    fold_size=10000,  # <= 50% of dataset
    fold_count=20,
    column_description='amazon/train.cd',
    partition_random_seed=0,
    # working_dir=...
)
result = evaluator.eval_features(
    learn_config=learn_params,
    eval_metrics=['Logloss', 'Accuracy'],
    features_to_eval=[6, 7, 8])
Here features [6, 7, 8] were evaluated. The results below show a positive verdict for feature 6, a negative one for feature 8, and no clear signal from any metric for feature 7.

from catboost.eval.evaluation_result import *

logloss_result = result.get_metric_results('Logloss')
logloss_result.get_baseline_comparison(
    ScoreConfig(ScoreType.Rel, overfit_iterations_info=False)
)
Saving and loading models
Once a reasonably good model has been obtained it should be saved for later use, so this step matters. Saving a CatBoost model is very convenient: no third-party library such as pickle is needed; just call its save_model method. save_model supports several formats:
cbm: the CatBoost binary format.
coreml: Apple CoreML format (currently only for datasets without categorical features).
json: JSON format. See the CatBoost JSON model tutorial[1] for details of the format.
python: standalone Python code (multiclass models are not currently supported). See the Python[2] section for details on applying the resulting model.
cpp: standalone C++ code (multiclass models are not currently supported). See the C++[3] section for details on applying the resulting model.
onnx: ONNX-ML format (currently only for datasets without categorical features). See https://onnx.ai/ for details, and the ONNX[4] section for applying the resulting model.
pmml: PMML version 4.3[5] format. If the training dataset contains categorical features, they must be interpreted as one-hot encoded during training; this can be done by setting the --one-hot-max-size / one_hot_max_size parameter to a value greater than the largest number of unique values among all categorical features in the dataset. See the PMML[6] section for applying the resulting model.
my_best_model.save_model('catboost_model.bin')
my_best_model.save_model('catboost_model.json', format='json')
Loading a model is just as convenient, via the load_model method:

my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
References
[1] CatBoost JSON model tutorial: https://github.com/catboost/tutorials/blob/master/model_analysis/model_export_as_json_tutorial.ipynb
[2] Python: https://catboost.ai/en/docs/concepts/python-reference_apply_catboost_model
[3] C++: https://catboost.ai/en/docs/concepts/c-plus-plus-api_applycatboostmodel
[4] ONNX: https://catboost.ai/en/docs/concepts/apply-onnx-ml
[5] PMML version 4.3: http://dmg.org/pmml/pmml-v4-3.html
[6] PMML: https://catboost.ai/en/docs/concepts/apply-pmml
-
Related posts from the same listing: "CatBoost parameters" (2021-08-05), "CatBoost learning examples" (2021-04-14), "Implementing a machine learning algorithm: CatBoost" (2022-06-24), "Summary: definitions of model evaluation metrics (based on the CatBoost docs)" (2022-04-06).