  • CatBoost Parameters
    2021-08-05 11:55:08

    CatBoost is a GBDT framework that uses oblivious (fully symmetric) decision trees as base learners. It needs relatively little parameter tuning, supports categorical variables natively, and delivers high accuracy. Its main selling point is handling categorical features efficiently and sensibly, which the name itself reflects: CatBoost is a contraction of Categorical and Boosting. CatBoost also addresses gradient bias and prediction shift, which reduces overfitting and improves the algorithm's accuracy and generalization.
    Compared with XGBoost and LightGBM, CatBoost's main innovations are:

    • A built-in algorithm that automatically converts categorical features into numerical ones. It first computes statistics over the categorical features, such as how often each category occurs, and then combines them with a prior hyperparameter to generate new numerical features.
    • Combinations of categorical features, which capture interactions between features and greatly enrich the feature space.
    • Ordered boosting to counter noisy points in the training set, avoiding biased gradient estimates and thereby solving the prediction-shift problem.
    • Fully symmetric (oblivious) trees as the base model.

    How CatBoost handles categorical features, in summary

    • First, it computes statistics over the data: the occurrence counts of each category are combined with a prior hyperparameter to produce new numerical features. This strategy requires that rows with the same label are not grouped together (e.g., all of one label followed by all of the other), so the dataset must be shuffled before training.
    • Second, it uses several different random permutations of the data. Before building each tree, it rolls the dice to decide which permutation to use for that tree.
    • Third, it considers combinations of categorical features. For example, color and breed can be combined to form a feature like "blue dog". Since the number of possible combinations explodes, CatBoost only considers a subset of them: when choosing the first split it considers single features only, say A; for the next split it considers combinations of A with any other categorical feature and picks the best one. Combinations are thus built greedily.
    • Fourth, except for very low-cardinality features such as gender, it is not recommended to build one-hot encodings yourself; it is better to leave that to the algorithm.
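The statistic described in the first step can be sketched in a few lines of pure Python. This is an illustrative simplification, not CatBoost's actual implementation: the function name, the prior value, and the exact smoothing formula are assumptions made for the example.

```python
def ordered_target_statistic(categories, labels, prior=0.05):
    """Simplified sketch of an ordered target statistic: each row is
    encoded using only the rows that precede it in a (shuffled)
    permutation, plus a prior, so a row's own label never leaks
    into its encoding."""
    encoded = []
    running_sum = {}    # category -> sum of labels seen so far
    running_count = {}  # category -> number of rows seen so far
    for cat, label in zip(categories, labels):
        s = running_sum.get(cat, 0.0)
        c = running_count.get(cat, 0)
        encoded.append((s + prior) / (c + 1))  # prior acts as a smoothing pseudo-count
        running_sum[cat] = s + label
        running_count[cat] = c + 1
    return encoded

# The first occurrence of a category is encoded purely by the prior:
enc = ordered_target_statistic(['a', 'a', 'b', 'a'], [1, 0, 1, 1], prior=0.05)
# enc[0] == 0.05 / 1; enc[1] uses one earlier 'a' row: (1 + 0.05) / 2
```

This also makes the shuffling requirement above concrete: if all positive rows came first, the running statistics would be badly skewed for every later row.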

    Parameters:

    loss_function The loss to optimize. Supported values: RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, Poisson. Default: RMSE.
    custom_metric Metric values to output during training. These are not optimized and are displayed for informational purposes only. Default: None.
    eval_metric The metric used for overfitting detection (if enabled) and best-model selection (if enabled). Supported values include RMSE, Logloss, MAE, CrossEntropy, Recall, Precision, F1, Accuracy, AUC, R2.
    iterations Maximum number of trees. Default: 1000.
    learning_rate Learning rate. Default: 0.03.
    random_seed Random seed for training.
    l2_leaf_reg L2 regularization coefficient. Default: 3.
    bootstrap_type Defines how sample weights are computed. Options: Poisson (supported for GPU only), Bayesian, Bernoulli, No. Default: Bayesian.
    bagging_temperature Controls the intensity of Bayesian bagging, in [0, 1]. Default: 1.
    subsample Sample rate for bagging; used when bootstrap_type is Poisson or Bernoulli. Default: 0.66.
    sampling_frequency How often to sample when building trees. Options: PerTree, PerTreeLevel. Default: PerTreeLevel.
    random_strength Multiplier for the standard deviation of the score noise. Default: 1.
    use_best_model When set, an evaluation set must be provided; the number of trees is chosen from the iteration that optimizes the loss function on it. Default: False.
    best_model_min_trees Minimum number of trees the best model must have.
    depth Tree depth, at most 16; values from 1 to 10 are recommended. Default: 6.
    ignored_features Features in the dataset to ignore. Default: None.
    one_hot_max_size Features with at most this many distinct values are one-hot encoded; features above the threshold are converted to numerical (CTR) features. Default: 2.
    has_time Process the input rows in their given order when converting categorical features to numerical ones and choosing the tree structure. Default: False (random order).
    rsm Random subspace method: the fraction of features sampled per split. Default: 1.
    nan_mode How to handle missing values in the input: Forbidden (missing values raise an error), Min (impute a value below the minimum), Max (impute a value above the maximum). Default: Min.
    fold_permutation_block_size Rows are grouped into blocks before being randomly permuted; this parameter sets the block size. Smaller values slow training down; larger values can hurt quality.
    leaf_estimation_method Method for computing leaf values: Newton or Gradient. Default: Gradient.
    leaf_estimation_iterations Number of gradient steps when computing leaf values.
    leaf_estimation_backtracking Type of backtracking used during gradient descent.
    fold_len_multiplier Growth factor for the fold length; must be greater than 1, with values close to 1 giving the best results. Default: 2.
    approx_on_full_history How approximated values are computed. False: use only 1/fold_len_multiplier of the rows; True: use all preceding rows in the fold. Default: False.
    class_weights Class weights. Default: None.
    scale_pos_weight Weight of class 1 in binary classification; used as a multiplier for the weights of class-1 objects.
    boosting_type Boosting scheme (Ordered or Plain).
    allow_const_label Allow training on a dataset where all objects have the same label value. Default: False.
    od_type Type of overfitting detector to use. Possible values: 'IncToDec', 'Iter'.
    od_wait Number of iterations to continue training after the iteration with the best metric value. Its meaning depends on the detector type: 1. IncToDec — ignore the detector once the threshold is reached, and keep training for the specified number of iterations after the best-metric iteration. 2. Iter — consider the model overfitted and stop training the specified number of iterations after the best-metric iteration.
    grow_policy Tree growth policy; defines how the greedy tree construction proceeds. Possible values:

    • SymmetricTree — build the tree level by level until the specified depth is reached. At each step, every leaf on the last level is split with the same condition, so the resulting tree is always symmetric.
    • Depthwise — build the tree level by level until the specified depth is reached. At each step, all non-terminal leaves on the last level are split, each with the condition that gives the best loss improvement for that leaf.
    • Lossguide — build the tree leaf by leaf until the specified maximum number of leaves is reached. At each step, the non-terminal leaf with the best loss improvement is split.
      Note: models built with the Depthwise and Lossguide policies cannot be analyzed with PredictionDiff feature importance, and can only be exported to json and cbm.
      task_type=CPU: processing unit used for training
      devices=None: IDs of the GPU devices used for training
    class Pool(data, 
               label=None, cat_features=None, column_description=None,
               pairs=None, delimiter='\t', has_header=False,
               weight=None, group_id=None, group_weight=None,
               subgroup_id=None, pairs_weight=None, baseline=None,
               feature_names=None, thread_count=-1)
    

    Simple classification:

    from catboost import CatBoostClassifier, Pool
    
    train_data = Pool(data=[[1, 4, 5, 6],
                            [4, 5, 6, 7],
                            [30, 40, 50, 60]],
                      label=[1, 1, -1],
                      weight=[0.1, 0.2, 0.3])
    train_data 
    # <catboost.core.Pool at 0x1a22af06d0>
    
    model = CatBoostClassifier(iterations=10)
    model.fit(train_data)
    preds_class = model.predict(train_data)
    

    Several kinds of prediction:

    # Get predicted classes
    preds_class = model.predict(test_data)
    # Get predicted probabilities for each class
    preds_proba = model.predict_proba(test_data)
    # Get predicted RawFormulaVal
    preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
    

    use_best_model: keep the best model found on the validation set

    # data preparation is covered in the library/dataset setup section
    params = {
        'iterations': 500,
        'learning_rate': 0.1,
        'eval_metric': 'Accuracy',
        'random_seed': 666,
        'logging_level': 'Silent',
        'use_best_model': False
    }
    # train
    train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
    # validation
    validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)
    
    # train with 'use_best_model': False
    model = CatBoostClassifier(**params)
    model.fit(train_pool, eval_set=validate_pool)
    
    # train with 'use_best_model': True
    best_model_params = params.copy()
    best_model_params.update({'use_best_model': True})
    best_model = CatBoostClassifier(**best_model_params)
    best_model.fit(train_pool, eval_set=validate_pool);
    
    # show result
    print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
        accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
    print('')
    print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
        accuracy_score(y_validation, best_model.predict(X_validation)),best_model.tree_count_))
    

    Early stopping

    earlystop_model_1 = CatBoostClassifier(**params)
    earlystop_model_1.fit(train_pool, eval_set=validate_pool, early_stopping_rounds=200, verbose=20)
    

    Displaying feature importances:

    feature_importances = model.get_feature_importance(train_pool)
    feature_names = X_train.columns
    for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
        print('{}: {}'.format(name, score))
    

    Continuing training on top of a previously trained model

    model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
    # Get baseline (only with prediction_type='RawFormulaVal')
    baseline = model.predict(X_train, prediction_type='RawFormulaVal')
    # Fit new model
    model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);
    

    Telling the model which features are categorical

    categorical_features_indices = np.where(X.dtypes != np.float64)[0]
    # categorical_features_indices can be passed to fit, or baked into a Pool
    # if the data passed to fit is already a Pool, cat_features must be None
    
  • CatBoost parameters explained in detail, with a worked example (highly recommended)

    2022-07-02 11:21:13
    The most detailed CatBoost parameter reference, plus a hands-on example of Bayesian hyperparameter tuning

    Contents

    Part 1: Parameter reference

    Part 2: Hands-on example

    1 Imports

    2 Loading the data

    3 Label distribution (default rate ≈ 20%)

    4 Preprocessing

    5 Feature distributions

    6 Feature grouping

    7 Initial parameters

    8 CatBoost modeling function

    9 Baseline model

    10 Feature importance

    11 Bayesian tuning


    Part 1: Parameter reference

            CatBoost has a great many parameters, so only the important and commonly used ones are listed here. (To use one of these blocks directly, replace each ` : ` comment marker with `#`.)

    '''
    Common parameters
    '''
    params={
        'loss_function': , : loss to optimize; one of RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, Poisson. Default Logloss.
        'custom_loss': , : losses computed and displayed during training; e.g. Logloss, CrossEntropy, Precision, Recall, F, F1, BalancedAccuracy, AUC, etc.
        'eval_metric': , : metric used for overfitting detection and best-model selection; same value range as custom_loss
        'iterations': , : maximum number of iterations, default 500. Aliases: num_boost_round, n_estimators, num_trees
        'learning_rate': , : learning rate, default 0.03. Alias: eta
        'random_seed': , : random seed for training. Alias: random_state
        'l2_leaf_reg': , : L2 regularization term. Alias: reg_lambda
        'bootstrap_type': , : how sample weights are drawn; one of Bayesian, Bernoulli, MVS (CPU only), Poisson (GPU only), No (plain random sampling per tree). Default Bayesian on GPU, MVS on CPU
        'bagging_temperature': , : used when bootstrap_type=Bayesian; 1 draws sample weights from an exponential distribution, 0 sets all sample weights to 1. Range [0, inf); larger values make bagging more aggressive
        'subsample': , : row sampling ratio
        'sampling_frequency': , : sampling frequency; PerTree (sample before building each new tree) or PerTreeLevel (default; sample before each split). CPU only
        'random_strength': , : perturbation added to the split-score (information-gain) calculation to fight overfitting. Normally the split with the highest gain is chosen; here a noise term is added to each candidate's gain first. The noise is normally distributed with mean 0, and random_strength scales its variance; default 1 (standard normal), 0 disables the noise
        'use_best_model': , : keep the number of trees that performed best on the validation set (requires eval_metric and eval_set); boolean 0/1 (1 requires validation data)
        'best_model_min_trees': , : minimum number of trees; used together with use_best_model
        'depth': , : tree depth, default 6
        'grow_policy': , : tree growth policy; SymmetricTree (default, oblivious trees), Depthwise (level-wise, like XGBoost), Lossguide (leaf-wise, like LightGBM)
        'min_data_in_leaf': , : minimum number of samples per leaf
        'max_leaves': , : maximum number of leaves
        'one_hot_max_size': , : one-hot encode categorical features with fewer than one_hot_max_size unique values
        'rsm': , : column sampling ratio. Alias colsample_bylevel; range (0, 1], default 1
        'nan_mode': , : missing-value handling; Forbidden (missing values raise an error), Min (impute a value below the column minimum), Max (likewise, above the maximum)
        'input_borders': , : min/max borders for feature values; affects missing-value handling when nan_mode is Min or Max. Default None: borders are taken from the training data
        'class_weights': , : class weights for the label, used for imbalanced data; default 1 for every class
        'auto_class_weights': , : automatically compute balancing class weights
        'scale_pos_weight': , : weight of class 1 in binary classification, default 1 (cannot be combined with class_weights or auto_class_weights)
        'boosting_type': , : boosting scheme; Ordered (CatBoost's ordered boosting, often better on small data but slower) or Plain (classic boosting)
        'feature_weights': , : per-feature weights; at each split, a feature's information gain is multiplied by its weight before the best split is chosen. Set either as feature_weights = [0.1, 1, 3] or feature_weights = {"Feature2":1.1,"Feature4":0.3}
    }
    
    '''
    Categorical-feature parameters
    '''
    params={
        'max_ctr_complexity': , : maximum number of categorical features that can be combined (crossed), default 4
    }
    
    
    '''
    Output parameters
    '''
    params={
        'logging_level': , : verbosity of training output; Silent (no output), Verbose (default; metric values, elapsed and remaining time), Info (extra information such as the number of trees), Debug (debug output)
        'metric_period': , : how often the objective and metrics are computed, default 1 (every iteration)
        'verbose': , : logging level, similar to logging_level (set only one of the two); True corresponds to Verbose above, False to Silent
    }
    
    
    '''
    Overfitting-detection parameters
    '''
    params={
        'early_stopping_rounds': , : stop training n iterations after the best evaluation result was reached (n is the parameter value); disabled by default
        'od_type': , : overfitting detector type; IncToDec (default) or Iter
        'od_pval': , : threshold for the IncToDec detector; training stops when it is reached. Requires a validation set; recommended range [1e-10, 1e-2]. Default 0 (detector disabled)
        'od_wait': , : similar to early_stopping_rounds; the number of iterations to continue after the best metric value. With IncToDec the model keeps training n more iterations after the best value (n is the parameter value); with Iter training stops at that point. Default 20
    }
    
    '''
    Device parameters
    '''
    params={
        'task_type': , : processing unit for training; CPU (default) or GPU
        'devices': , : GPU device IDs
    }
    
    '''
    Numerical-feature binning parameters
    '''
    params={
        'border_count': , : number of bins for numerical features. Alias max_bin; range [1, 65535], default 254 (on CPU)
        'feature_border_type': , : binning method for numerical features; Median, Uniform, UniformAndQuantiles, MaxLogSum, MinEntropy, or GreedyLogSum (default)
    }
    
    '''
    Text-feature parameters
    '''
    params={
        'tokenizers': , : tokenizers; given one tokenizer, three dictionaries and two feature calcers, a total of 1·3·2 = 6 groups of new features are created for each original text feature.
        '''
        Example:
            tokenizers = [{
            'tokenizerId': 'Space',
            'delimiter': ' ',
            'separator_type': 'ByDelimiter',
            },{
            'tokenizerId': 'Sense',
            'separator_type': 'BySense',
            }]
        '''
        
        'dictionaries': , : dictionaries used to preprocess text features
        'feature_calcers': , : list of feature calcers applied to text features
        'text_processing': , : full text-processing configuration; set either this parameter or the three above
    }

    Part 2: Hands-on example

    1 Imports

    import re
    import os
    import pandas as pd
    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve,roc_auc_score
    import matplotlib.pyplot as plt
    import gc
    from bayes_opt import BayesianOptimization
    from catboost import Pool, cv

    2 Loading the data

    df=pd.read_csv('E:/train.csv',engine='python').head(80000)
    print(df.shape)
    df.head()

    3 Label distribution (default rate ≈ 20%)

    df_copy = df.copy()
    pd.concat([df_copy['isDefault'].value_counts()
                ,df_copy['isDefault'].value_counts(normalize=True)],axis=1)

    4 Preprocessing

            The employmentLength column holds the length of employment; extract the number of years.

    df_copy['employmentLength']=df_copy['employmentLength'].str.replace(' years','').str.replace(' year','')
    dic={'< 1':0,'10+':20}
    df_copy['employmentLength']=df_copy['employmentLength'].replace(dic).astype('float')

    5 Feature distributions

    import seaborn as sns
    sns.pairplot(df_copy.loc[:,'loanAmnt':'isDefault'].drop(['issueDate'],axis=1)
                 , kind="scatter",hue="isDefault"
                 , plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))

    6 Feature grouping

    float_col=list(df_copy.select_dtypes(exclude=['string','object']).drop(['id','isDefault'],axis=1).columns).copy()
    cate_col=['grade', 'subGrade']
    all_fea=float_col+cate_col

    7 Initial parameters

    params={
        'loss_function': 'Logloss', # loss to optimize; default Logloss
        'custom_loss': 'AUC', # losses computed and displayed during training
        'eval_metric': 'AUC', # metric for overfitting detection and best-model selection
        'iterations': 50, # maximum number of iterations, default 500
        'learning_rate': 0.1, # learning rate, default 0.03 (alias: eta)
        'random_seed': 123, # random seed (alias: random_state)
        'l2_leaf_reg': 5, # L2 regularization term (alias: reg_lambda)
        'bootstrap_type': 'Bernoulli', # sampling scheme; default Bayesian on GPU, MVS on CPU
    #     'bagging_temperature': 0,  # only with bootstrap_type=Bayesian; larger is more aggressive
        'subsample': 0.6, # row sampling ratio
        'sampling_frequency': 'PerTree', # PerTree or PerTreeLevel (default); CPU only
        'use_best_model': True, # keep the best iteration on the validation set (requires eval_set)
        'best_model_min_trees': 50, # minimum number of trees, used with use_best_model
        'depth': 4, # tree depth, default 6
        'grow_policy': 'SymmetricTree', # SymmetricTree (default), Depthwise (like xgb), Lossguide (like lgb)
        'min_data_in_leaf': 500, # minimum samples per leaf
    #     'max_leaves': 12, # maximum number of leaves
        'one_hot_max_size': 4, # one-hot encode categorical features with fewer unique values than this
        'rsm': 0.6, # column sampling ratio (alias colsample_bylevel), range (0,1], default 1
        'nan_mode': 'Max', # missing values: Forbidden (error), Min (below minimum) or Max (above maximum)
        'input_borders': None, # feature value borders; affects Min/Max imputation; default None (taken from data)
        'boosting_type': 'Ordered', # Ordered (CatBoost's ordered boosting, slower) or Plain (classic)
        'max_ctr_complexity': 2, # maximum order of categorical feature crosses, default 4
        'logging_level':'Verbose', # Silent, Verbose (default), Info or Debug
        'metric_period': 1, # how often metrics are computed, default 1 (every iteration)
        'early_stopping_rounds': 20, # stop n iterations after the best evaluation result; disabled by default
        'border_count': 254, # number of bins for numerical features (alias max_bin); default 254 on CPU
        'feature_border_type': 'GreedyLogSum', # binning method; GreedyLogSum is the default
    }

    8 CatBoost modeling function

    import catboost
    from catboost import CatBoostClassifier
    def catboost_model(df,y_name,params,cate_col=[]):
        x_train,x_test, y_train, y_test =train_test_split(df.drop(y_name,axis=1),df[y_name],test_size=0.2, random_state=123)
        
        model = CatBoostClassifier(**params)
        model.fit(x_train, y_train,eval_set=[(x_train, y_train),(x_test,y_test)],cat_features=cate_col)
        
        train_pred = [pred[1] for pred in  model.predict_proba(x_train)]
        train_auc= roc_auc_score(list(y_train),train_pred)
        
        test_pred = [pred[1] for pred in  model.predict_proba(x_test)]
        test_auc= roc_auc_score(list(y_test),test_pred)
        
        result={
            'train_auc':train_auc,
            'test_auc':test_auc,
        }
        return model,result

    9 Baseline model

    model,model_result=catboost_model(df_copy[all_fea+['isDefault']]
                                        ,'isDefault',params,cate_col)

    10 Feature importance

    def feature_importance_catboost(model):
        result=pd.DataFrame(model.get_feature_importance(),index=model.feature_names_,columns=['FeatureImportance'])
        return result.sort_values('FeatureImportance',ascending=False)
    feature_importance_catboost(model)

     

    11 Bayesian tuning

    (1) Define the tuning objective; here we use the test-set AUC as the target

    def catboost_cv(iterations,learning_rate,depth,subsample,rsm):
        # same parameter set as section 7; see the comments there
        params={
            'loss_function': 'Logloss',
            'custom_loss': 'AUC',
            'eval_metric': 'AUC',
            'iterations': 50,
            'learning_rate': 0.1,
            'random_seed': 123,
            'l2_leaf_reg': 5,
            'bootstrap_type': 'Bernoulli',
        #     'bagging_temperature': 0,
            'subsample': 0.6,
            'sampling_frequency': 'PerTree',
            'use_best_model': True,
            'best_model_min_trees': 50,
            'depth': 4,
            'grow_policy': 'SymmetricTree',
            'min_data_in_leaf': 500,
        #     'max_leaves': 12,
            'one_hot_max_size': 4,
            'rsm': 0.6,
            'nan_mode': 'Max',
            'input_borders': None,
            'boosting_type': 'Ordered',
            'max_ctr_complexity': 2,
            'logging_level':'Silent',  # keep tuning runs quiet
            'metric_period': 1,
            'early_stopping_rounds': 20,
            'border_count': 254,
            'feature_border_type': 'GreedyLogSum',
    #         'cat_features':cate_col
        }
        params.update({'iterations':int(iterations),'depth':int(depth),'learning_rate':learning_rate,'subsample':subsample,'rsm':rsm})
    #     model = CatBoostClassifier(**params)
    #     cv_result=cross_validate(model,df_copy[all_fea],df_copy['isDefault'], cv=5,scoring='roc_auc',return_train_score=True)
        
        model,result=catboost_model(df_copy[all_fea+['isDefault']],'isDefault',params,cate_col)
        
        return result.get('test_auc')

     (2) Run the tuning

    param_value_dics={
                    'iterations':(20, 50),
                    'learning_rate':(0.02,0.2),
                    'depth':(3, 6),
                    'subsample':(0.6, 1.0),
                    'rsm':(0.6, 1.0)
                    }
    
    cat_bayes = BayesianOptimization(
            catboost_cv,
            param_value_dics
        )        
    cat_bayes.maximize(init_points=1,n_iter=20) # init_points: number of initial random probes; n_iter: number of optimization iterations

    cat_bayes.max.get('params')

    (3) Set the best parameters and retrain the model

    params.update(
        {
            'depth': 5,
            'iterations': 45,
            'learning_rate': 0.189,
            'rsm': 0.707,
            'subsample': 0.890 
        }
    )
    model,model_result=catboost_model(df_copy[all_fea+['isDefault']],'isDefault',params,cate_col)
    model_result


  • The most detailed CatBoost parameter reference, with worked examples

    2020-12-16 12:39:22


    Two guiding principles of ensemble learning: accuracy and diversity of the base learners.
    Algorithms: sequential Boosting and parallel Bagging. The former repeatedly retrains by re-weighting misclassified training samples, improving base-learner accuracy and reducing bias; the latter trains diverse base learners via resampling, reducing variance.



    1. Introduction to CatBoost

    1.1 About CatBoost

    The name CatBoost comes from the words "Category" and "Boosting". As noted above, the library handles all kinds of categorical data well: it is a gradient boosting library with strong built-in support for categorical features.

    CatBoost was open-sourced in 2017 by Yandex, the Russian search giant, and is a member of the Boosting family. CatBoost, XGBoost and LightGBM are often called the three mainstream GBDT implementations; all of them refine the GBDT framework. XGBoost is widely used in industry, LightGBM greatly improved GBDT's computational efficiency, and Yandex claims CatBoost beats both XGBoost and LightGBM on accuracy.

    CatBoost's core algorithms are described in the following two papers:

    • Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev,
      Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev “Fighting biases with dynamic boosting”. arXiv:1706.09516, 2017
    • Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin “CatBoost:
      gradient boosting with categorical features support”. Workshop on ML
      Systems at NIPS 2017

    1.2 Strengths of CatBoost

    1. Excellent performance: competitive with any state-of-the-art machine learning algorithm
    2. Robustness: it reduces the need for extensive hyperparameter tuning and lowers the chance of overfitting, making models more general
    3. Easy to use: a Python interface integrated with scikit-learn, plus R and command-line interfaces
    4. Practical: handles both categorical and numerical features
    5. Extensible: supports custom loss functions
    6. Native categorical support: no preprocessing needed for non-numeric features
    7. A fast, scalable GPU version: train your model with a GPU implementation of gradient boosting, with multi-card support
    8. Fast inference: models can be deployed efficiently even for latency-critical applications


    1.3 Installing CatBoost

    With pip:
    pip install catboost
    or with conda:
    conda install -c conda-forge catboost
    (conda can be extremely slow and may fail to download.)

    To install the Jupyter notebook widgets used for interactive plotting:
    pip install ipywidgets
    jupyter nbextension enable --py widgetsnbextension

    Installing from the Tsinghua mirror is much faster:
    pip install catboost -i https://pypi.tuna.tsinghua.edu.cn/simple

    2. Parameter reference

    See the official documentation: https://catboost.ai/

    2.1 Common parameters:

    1. loss_function The loss to optimize. Supported values: RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, Poisson. Default RMSE.
    2. custom_metric Metric values to output during training; not optimized, displayed for informational purposes only. Default None.
    3. eval_metric The metric used for overfitting detection (if enabled) and best-model selection (if enabled).
    4. iterations Maximum number of trees. Default 1000.
    5. learning_rate Learning rate. Default 0.03.
    6. random_seed Random seed for training.
    7. l2_leaf_reg L2 regularization coefficient. Default 3.
    8. bootstrap_type Defines how sample weights are computed: Poisson (supported for GPU only), Bayesian, Bernoulli, No. Default Bayesian.
    9. bagging_temperature Intensity of Bayesian bagging, in [0, 1]. Default 1.
    10. subsample Sample rate for bagging; used when bootstrap_type is Poisson or Bernoulli. Default 0.66.
    11. sampling_frequency Sampling frequency when building trees: PerTree or PerTreeLevel. Default PerTreeLevel.
    12. random_strength Multiplier for the standard deviation of the score noise. Default 1.
    13. use_best_model Requires an evaluation set; the number of trees is chosen by optimizing the loss function on it. Default False.
    14. best_model_min_trees Minimum number of trees the best model must have.
    15. depth Tree depth, at most 16; values from 1 to 10 are recommended. Default 6.
    16. ignored_features Features in the dataset to ignore. Default None.
    17. one_hot_max_size One-hot encode features with at most this many distinct values; features above the threshold are converted to numerical (CTR) features. Default 2.
    18. has_time Process input rows in their given order when converting categorical features to numerical ones and choosing the tree structure. Default False (random order).
    19. rsm Random subspace method: the fraction of features sampled per split. Default 1.
    20. nan_mode Missing-value handling: Forbidden (missing values raise an error), Min (impute below the minimum), Max (impute above the maximum). Default Min.
    21. fold_permutation_block_size Rows are grouped into blocks before being randomly permuted; this sets the block size. Smaller values slow training down; larger values can hurt quality.
    22. leaf_estimation_method Method for computing leaf values: Newton or Gradient. Default Gradient.
    23. leaf_estimation_iterations Number of gradient steps when computing leaf values.
    24. leaf_estimation_backtracking Type of backtracking used during gradient descent.
    25. fold_len_multiplier Growth factor for the fold length; must be greater than 1, with values close to 1 giving the best results. Default 2.
    26. approx_on_full_history How approximated values are computed. False: use only 1/fold_len_multiplier of the rows; True: use all preceding rows in the fold. Default False.
    27. class_weights Class weights. Default None.
    28. scale_pos_weight Weight of class 1 in binary classification; used as a multiplier for the weights of class-1 objects.
    29. boosting_type Boosting scheme (Ordered or Plain).
    30. allow_const_label Allow training on a dataset where all objects have the same label value. Default False.

    2.2 Default parameters

    CatBoost's default parameters:

    'iterations': 1000,
    'learning_rate': 0.03,
    'l2_leaf_reg': 3,
    'bagging_temperature': 1,
    'subsample': 0.66,
    'random_strength': 1,
    'depth': 6,
    'rsm': 1,
    'one_hot_max_size': 2,
    'leaf_estimation_method': 'Gradient',
    'fold_len_multiplier': 2,
    'border_count': 128,
    

    2.3 Performance parameters

    1. thread_count=-1: number of CPU threads used for training
    2. used_ram_limit=None: memory limit for CTR computation
    3. gpu_ram_part=None: fraction of GPU memory to use

    2.4 Parameter tuning

    Use GridSearchCV to search for the best parameters automatically.
    Example:

    from catboost import CatBoostRegressor
    from sklearn.model_selection import GridSearchCV
    # columns of category type; can be given as indices or column names
    cat_features = [0,1,2,3,4,5,6,7,8,9,10,11,12,13]
    X = df_ios_droped.iloc[:,:-1]
    y = df_ios_droped.iloc[:,-1]
    cv_params = {'iterations': [500,600,700,800]}
    other_params = {
        'iterations': 1000,
        'learning_rate':0.03,
        'l2_leaf_reg':3,
        'bagging_temperature':1,
        'random_strength':1,
        'depth':6,
        'rsm':1,
        'one_hot_max_size':2,
        'leaf_estimation_method':'Gradient',
        'fold_len_multiplier':2,
        'border_count':128,
    }
    model_cb = CatBoostRegressor(**other_params)
    optimized_cb = GridSearchCV(estimator=model_cb, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=2)
    optimized_cb.fit(X, y, cat_features=cat_features)
    print('Best parameter values: {0}'.format(optimized_cb.best_params_))
    print('Best model score: {0}'.format(optimized_cb.best_score_))
    print(optimized_cb.cv_results_['mean_test_score'])
    print(optimized_cb.cv_results_['params'])
    

    3. CatBoost in practice

    CatBoost can be used for both classification and regression. For detailed usage, see the examples on the official website.

    3.1 Regression example

    from catboost import CatBoostRegressor
    # Initialize data
    
    train_data = [[1, 4, 5, 6],
                  [4, 5, 6, 7],
                  [30, 40, 50, 60]]
    
    eval_data = [[2, 4, 6, 8],
                 [1, 4, 50, 60]]
    
    train_labels = [10, 20, 30]
    # Initialize CatBoostRegressor
    model = CatBoostRegressor(iterations=2,
                              learning_rate=1,
                              depth=2)
    # Fit model
    model.fit(train_data, train_labels)
    # Get predictions
    preds = model.predict(eval_data)
    

    Training on a GPU

    from catboost import CatBoostClassifier
    
    train_data = [[0, 3],
                  [4, 1],
                  [8, 1],
                  [9, 1]]
    train_labels = [0, 0, 1, 1]
    
    model = CatBoostClassifier(iterations=1000, 
                               task_type="GPU",
                               devices='0:1')
    model.fit(train_data,
              train_labels,
              verbose=False)
    

    3.2 Loading data with a Pool and predicting

    Pool is catboost's own container for organizing data. numpy arrays and dataframes also work, but Pool is recommended: it is better in both memory footprint and speed.

    from catboost import CatBoostClassifier, Pool
    
    train_data = Pool(data=[[1, 4, 5, 6],
                            [4, 5, 6, 7],
                            [30, 40, 50, 60]],
                      label=[1, 1, -1],
                      weight=[0.1, 0.2, 0.3])
    
    model = CatBoostClassifier(iterations=10)
    
    model.fit(train_data)
    preds_class = model.predict(train_data)
    

    3.3 Multiclass example

    from catboost import Pool, CatBoostClassifier
    
    train_data = [["summer", 1924, 44],
                  ["summer", 1932, 37],
                  ["winter", 1980, 37],
                  ["summer", 2012, 204]]
    
    eval_data = [["winter", 1996, 197],
                 ["winter", 1968, 37],
                 ["summer", 2002, 77],
                 ["summer", 1948, 59]]
    
    cat_features = [0]
    
    train_label = ["France", "USA", "USA", "UK"]
    eval_label = ["USA", "France", "USA", "UK"]
    
    
    train_dataset = Pool(data=train_data,
                         label=train_label,
                         cat_features=cat_features)
    
    eval_dataset = Pool(data=eval_data,
                        label=eval_label,
                        cat_features=cat_features)
    
    # Initialize CatBoostClassifier
    model = CatBoostClassifier(iterations=10,
                               learning_rate=1,
                               depth=2,
                               loss_function='MultiClass')
    # Fit model
    model.fit(train_dataset)
    # Get predicted classes
    preds_class = model.predict(eval_dataset)
    # Get predicted probabilities for each class
    preds_proba = model.predict_proba(eval_dataset)
    # Get predicted RawFormulaVal
    preds_raw = model.predict(eval_dataset, 
                              prediction_type='RawFormulaVal')
    

    4. Summary

    Posts on LightGBM will follow. I still recommend studying the theory behind these methods; it helps you use the models better. (Let's keep learning together!)

    Written: 2020-12-16

  • The complete CatBoost parameter list

    2019-07-06 22:29:52

     

    Official CatBoost documentation (in English): https://catboost.ai/docs/concepts/python-reference_parameters-list.html

    Training parameters

    Python package training parameters

    Several parameters have aliases. For example, the iterations parameter has the following synonyms: num_boost_round, n_estimators, num_trees. Simultaneous usage of different names of one parameter raises an error.

    Training on GPU requires NVIDIA Driver of version 390.xx or higher.

     
    Parameter | Type | Description | Default value | Supported processing units
    Common parameters

    loss_function

    Alias: objective

    • string

    • object

    The metric to use in training. The specified value also determines the machine learning problem to solve. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

    Format:
    
    <Metric>[:<parameter 1>=<value>;..;<parameter N>=<value>]
    Supported metrics:
    • RMSE

    • Logloss

    • MAE

    • CrossEntropy

    • Quantile

    • LogLinQuantile

    • Lq

    • MultiClass

    • MultiClassOneVsAll

    • MAPE

    • Poisson

    • PairLogit

    • PairLogitPairwise

    • QueryRMSE

    • QuerySoftMax

    • YetiRank

    • YetiRankPairwise

    A custom python object can also be set as the value of this parameter (see an example).

    For example, use the following construction to calculate the value of Quantile with the coefficient 0.1:
    
    Quantile:alpha=0.1

    Depends on the class

    CPU and GPU

    custom_metric

    • string

    • list of strings

    Metric values to output during training. These functions are not optimized and are displayed for informational purposes only. Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

    Format:
    <Metric>[:<parameter 1>=<value>;..;<parameter N>=<value>]
    Supported metrics:
    • RMSE

    • Logloss

    • MAE

    • CrossEntropy

    • Quantile

    • LogLinQuantile

    • Lq

    • MultiClass

    • MultiClassOneVsAll

    • MAPE

    • Poisson

    • PairLogit

    • PairLogitPairwise

    • QueryRMSE

    • QuerySoftMax

    • SMAPE

    • Recall

    • Precision

    • F1

    • TotalF1

    • Accuracy

    • BalancedAccuracy

    • BalancedErrorRate

    • Kappa

    • WKappa

    • LogLikelihoodOfPrediction

    • AUC

    • R2

    • NumErrors

    • MCC

    • BrierScore

    • HingeLoss

    • HammingLoss

    • ZeroOneLoss

    • MSLE

    • MedianAbsoluteError

    • Huber

    • PairAccuracy

    • AverageGain

    • PFound

    • NDCG

    • PrecisionAt

    • RecallAt

    • MAP

    • CtrFactor

    Examples:
    • Calculate the value of CrossEntropy:

      CrossEntropy
    • Calculate the value of Quantile with the coefficient 0.1:
      Quantile:alpha=0.1
    • Calculate the values of Logloss and AUC:
      ['Logloss', 'AUC']

    Values of all custom metrics for learn and validation datasets are saved to the Metric output files (learn_error.tsv and test_error.tsv respectively). The directory for these files is specified in the --train-dir (train_dir) parameter.

    Use the visualization tools to see a live chart with the dynamics of the specified metrics.

    None

    CPU and GPU

    eval_metric

    • string

    • object

    The metric used for overfitting detection (if enabled) and best model selection (if enabled). Some metrics support optional parameters (see the Objectives and metrics section for details on each metric).

    Format:
    <Metric>[:<parameter 1>=<value>;..;<parameter N>=<value>]
    Supported metrics:
    • RMSE

    • Logloss

    • MAE

    • CrossEntropy

    • Quantile

    • LogLinQuantile

    • Lq

    • MultiClass

    • MultiClassOneVsAll

    • MAPE

    • Poisson

    • PairLogit

    • PairLogitPairwise

    • QueryRMSE

    • QuerySoftMax

    • SMAPE

    • Recall

    • Precision

    • F1

    • TotalF1

    • Accuracy

    • BalancedAccuracy

    • BalancedErrorRate

    • Kappa

    • WKappa

    • LogLikelihoodOfPrediction

    • AUC

    • R2

    • NumErrors

    • MCC

    • BrierScore

    • HingeLoss

    • HammingLoss

    • ZeroOneLoss

    • MSLE

    • MedianAbsoluteError

    • Huber

    • PairAccuracy

    • AverageGain

    • PFound

    • NDCG

    • PrecisionAt

    • RecallAt

    • MAP

    A user-defined function can also be set as the value (see an example).

    Examples:
    R2
    Optimized objective is used

    CPU and GPU

    iterations

    Aliases:
    • num_boost_round

    • n_estimators

    • num_trees

    int

    The maximum number of trees that can be built when solving machine learning problems.

    When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.

    1000

    CPU and GPU

    learning_rate

    Alias: eta

    float

    The learning rate.

    Used for reducing the gradient step.

    The default value is defined automatically for binary classification based on the dataset properties and the number of iterations if none of these parameters is set. In this case, the selected learning rate is printed to stdout and saved in the model.

    In other cases, the default value is 0.03.

    CPU and GPU

    random_seed

    Alias: random_state

    int

    The random seed used for training.

    None (0)

    CPU and GPU

    l2_leaf_reg

    Alias: reg_lambda

    float

    Coefficient at the L2 regularization term of the cost function.

    Any positive value is allowed.

    3.0

    CPU and GPU

    bootstrap_type

    string

    Bootstrap type. Defines the method for sampling the weights of objects.

    Supported methods:

    • Bayesian

    • Bernoulli

    • MVS

    • Poisson (supported for GPU only)

    • No

    Bayesian

    CPU and GPU

    bagging_temperature

    float

    Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes.

    Use the Bayesian bootstrap to assign random weights to objects.

    The weights are sampled from exponential distribution if the value of this parameter is set to “1”. All weights are equal to 1 if the value of this parameter is set to “0”.

    Any non-negative value is allowed. The higher the value, the more aggressive the bagging is.

    This parameter can be used if the selected bootstrap type is Bayesian.

    1

    CPU and GPU

    subsample

    float

    Sample rate for bagging.

    This parameter can be used if one of the following bootstrap types is selected:
    • Poisson

    • Bernoulli

    0.66

    CPU and GPU

    sampling_frequency

    string

    Frequency to sample weights and objects when building trees.

    Supported values:

    • PerTree — Before constructing each new tree

    • PerTreeLevel — Before choosing each new split of a tree

    PerTreeLevel

    CPU and GPU

    sampling_unit

    String

    The sampling scheme.

    Possible values:
    • Object — The weight  of the i-th object  is used for sampling the corresponding object.

    • Group — The weight  of the group  is used for sampling each object  from the group .

    Object

    CPU and GPU

    mvs_head_fraction

    float

    Controls the fraction of the highest (by absolute value) gradients taken for the minimal variance sampling. Possible values are in the range (0; 1].

    This parameter can be used if the selected bootstrap type is MVS.

    1.0

    CPU

    random_strength

    float

    The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model.

    The value of this parameter is used when selecting splits. On every iteration each possible split gets a score (for example, the score indicates how much adding this split will improve the loss function for the training dataset). The split with the highest score is selected.

    The scores themselves are deterministic; a normally distributed random variable is added to the score of each feature. It has a zero mean and a variance that decreases during the training. The value of this parameter is the multiplier of the variance.

    Note. This parameter is not supported for the following loss functions:
    • QueryCrossEntropy

    • YetiRankPairwise

    • PairLogitPairwise

    1

    CPU

    use_best_model

    bool

    If this parameter is set, the number of trees that are saved in the resulting model is defined as follows:
    1. Build the number of trees defined by the training parameters.

    2. Use the validation dataset to identify the iteration with the optimal value of the metric specified in  --eval-metric (eval_metric).

    No trees are saved after this iteration.

    This option requires a validation dataset to be provided.

    True if a validation set is input (the eval_set parameter is defined) and at least one of the label values of objects in this set differs from the others. False otherwise.

    CPU and GPU

    best_model_min_trees

    int

    The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees.

    Should be used with the use_best_model parameter.

    None (The minimal number of trees for the best model is not set)

    CPU and GPU

    depth

    Alias: max_depth

    int

    Depth of the tree.

    The range of supported values depends on the processing unit type and the type of the selected loss function:
    • CPU — Any integer up to 16.

    • GPU — Any integer up to 8 for pairwise modes (YetiRank, PairLogitPairwise, and QueryCrossEntropy) and up to 16 for all other loss functions.

    6 (16 if the growing policy is set to Lossguide)

    CPU and GPU

    grow_policy

    string

    The tree growing policy. Defines how to perform greedy tree construction.

    Possible values:
    • SymmetricTree —A tree is built level by level until the specified depth is reached. On each iteration, all leaves from the last tree level are split with the same condition. The resulting tree structure is always symmetric.

    • Depthwise — A tree is built level by level until the specified depth is reached. On each iteration, all non-terminal leaves from the last tree level are split. Each leaf is split by condition with the best loss improvement.

    • Lossguide — A tree is built leaf by leaf until the specified maximum number of leaves is reached. On each iteration, non-terminal leaf with the best loss improvement is split.

    Note. The Depthwise and Lossguide growing policies are currently supported only in training and prediction modes. They are not supported for model analysis (such as feature importance and SHAP values) or for exporting to other model formats (such as Apple CoreML, ONNX, and JSON).

    SymmetricTree

    GPU

    min_data_in_leaf

    int

    The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with a sample count less than the specified value.

    Can be used only with the Lossguide and Depthwise growing policies.

    1

    GPU

    max_leaves

    int

    The maximum number of leaves in the resulting tree. Can be used only with the Lossguide growing policy.

    Tip. It is not recommended to use values greater than 64, since it can significantly slow down the training process.

    31

    GPU

    ignored_features

    list

    Feature indices or names to exclude from the training. It is assumed that all passed values are feature names if at least one of the passed values cannot be converted to a number or a range of numbers. Otherwise, it is assumed that all passed values are feature indices.

    Specifics:

    • Non-negative indices that do not match any features are successfully ignored. For example, if five features are defined for the objects in the dataset and this parameter is set to “42”, the corresponding non-existing feature is successfully ignored.

    • The identifier corresponds to the feature's index. Feature indices used in train and feature importance are numbered from 0 to featureCount – 1. If a file is used as input data then any non-feature column types are ignored when calculating these indices. For example, each row in the input file contains data in the following order: cat feature<\t>label value<\t>num feature. So for the row rock<\t>0<\t>42, the identifier for the “rock” feature is 0, and for the “42” feature it's 1.

    • The addition of a non-existing feature name raises an error.

    For example, use the following construction if the features indexed 1, 2, 7, 42, 43, 44, and 45 should be ignored:
    [1,2,7,42,43,44,45]
    None (use all features)

    CPU and GPU

    one_hot_max_size

    int

    Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value. Ctrs are not calculated for such features.

    See details.

    The default value depends on various conditions:

    • N/A if training is performed on CPU in Pairwise scoring mode

    • 255 if training is performed on GPU and the selected Ctr types require target data that is not available during the training

    • 10 if training is performed in Ranking mode

    • 2 if none of the conditions above is met

    CPU and GPU

    has_time

    bool

    Use the order of objects in the input data (do not perform random permutations during the Transforming categorical features to numerical features and Choosing the tree structure stages).

    The Timestamp column type is used to determine the order of objects if specified in the input data.

    False (not used; generates random permutations)

    CPU and GPU

    rsm

    Alias: colsample_bylevel

    float (0;1]

    Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.

    The value must be in the range (0;1].

    None (set to 1)

    CPU

    nan_mode

    string

    The method for  processing missing values in the input dataset.

    Possible values:
    • “Forbidden” — Missing values are not supported, their presence is interpreted as an error.

    • “Min” — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

    • “Max” — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

    Using the  Min or Max value of this parameter guarantees that a split between missing values and other values is considered when selecting a new split in the tree.

    Note.

    The method for processing missing values can be set individually for each feature in the Custom quantization borders and missing value modes input file. Such values override the ones specified in this parameter.

    Min

    CPU and GPU

    input_borders

    string

    Load Custom quantization borders and missing value modes from a file (do not generate them).

    Borders are automatically generated before training if this parameter is not set.

    None

    CPU and GPU

    output_borders

    string

    Save quantization borders for the current dataset to a file.

    Refer to the file format description.

    None

    CPU and GPU

    fold_permutation_block

    int

    Objects in the dataset are grouped in blocks before the random permutations. This parameter defines the size of the blocks. The smaller the value, the slower the training. Large values may result in quality degradation.

    1

    CPU and GPU

    leaf_estimation_method

    string

    The method used to calculate the values in leaves.

    Possible values:
    • Newton

    • Gradient

    Gradient

    CPU and GPU

    leaf_estimation_iterations

    int

    The number of gradient steps when calculating the values in leaves.

    None (Depends on the training objective)

    CPU and GPU

    leaf_estimation_backtracking

    string

    The type of backtracking to use during the gradient descent.

    Possible values:
    • No — Do not use backtracking. Supported on CPU and GPU.

    • AnyImprovement — Reduce the descent step up to the point when the loss function value is smaller than it was on the previous step. Supported on CPU and GPU.

    • Armijo — Reduce the descent step until the Armijo condition is met. Supported only on GPU.

    AnyImprovement

    Depends on the selected value

    fold_len_multiplier

    float

    Coefficient for changing the length of folds.

    The value must be greater than 1. The best validation result is achieved with minimum values.

    With values close to 1, each iteration takes a quadratic amount of memory and time for the number of objects in the iteration. Thus, low values are possible only when there is a small number of objects.

    2

    CPU and GPU

    approx_on_full_history

    bool

    The principles for calculating the approximated values.

    Possible values:
    • “False” — Use only a fraction of the fold for calculating the approximated values. The size of the fraction is calculated as 1/X, where X is the specified coefficient for changing the length of folds. This mode is faster and in rare cases slightly less accurate.

    • “True” — Use all the preceding rows in the fold for calculating the approximated values. This mode is slower and in rare cases slightly more accurate.

    False

    CPU

    class_weights

    list

    Class weights. The values are used as multipliers for the object weights. This parameter can be used for solving classification and multiclassification problems.

    Tip.

    For imbalanced datasets with binary classification, the weight multiplier can be set to 1 for class 0 and to the class-size ratio (the number of class 0 objects divided by the number of class 1 objects) for class 1.

    For example, class_weights=[0.1, 4] multiplies the weights of objects from class 0 by 0.1 and the weights of objects from class 1 by 4.

    None (the weight for all classes is set to 1)

    CPU and GPU

    scale_pos_weight

    float

    The weight for class 1 in binary classification. The value is used as a multiplier for the weights of objects from class 1.

    Tip. For imbalanced datasets, the weight multiplier can be set to the ratio of the total weight of class 0 objects to the total weight of class 1 objects.

    1.0

    CPU and GPU

    boosting_type

    string

    Boosting scheme.

    Possible values:
    • Ordered — Usually provides better quality on small datasets, but it may be slower than the Plain scheme.

    • Plain — The classic gradient boosting scheme.

    Depends on the number of objects in the training dataset and the selected learning mode

    CPU and GPU

    Only the Plain mode is supported for the MultiClass loss on GPU

    allow_const_label

    bool

    Use it to train models with datasets that have equal label values for all objects.

    False

    CPU and GPU

    score_function

    string

    The score type used to select the next split during the tree construction.

    Possible values:

    • Cosine (do not use this score type with the Lossguide tree growing policy)

    • L2

    • LOOL2

    • NewtonCosine (do not use this score type with the Lossguide tree growing policy)

    • NewtonL2

    • SatL2

    • SolarL2

    Correlation (NewtonL2 if the growing policy is set to Lossguide)

    GPU

    Overfitting detection settings
    early_stopping_rounds

    int

    Set the overfitting detector type to Iter and stop the training after the specified number of iterations since the iteration with the optimal metric value.

    False

    CPU and GPU

    od_type

    string

    The type of the overfitting detector to use.

    Possible values:
    • IncToDec

    • Iter

    IncToDec

    CPU and GPU

    od_pval

    float

    The threshold for the IncToDec overfitting detector type. The training is stopped when the specified value is reached. Requires that a validation dataset is provided.

    For best results, it is recommended to set a value in the range [1e-10; 1e-2].

    The larger the value, the earlier overfitting is detected.

    Restriction.

    Do not use this parameter with the Iter overfitting detector type.

    0 (the overfitting detection is turned off)

    CPU and GPU

    od_wait

    int

    The number of iterations to continue the training after the iteration with the optimal metric value. The purpose of this parameter differs depending on the selected overfitting detector type:
    • IncToDec — Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal metric value.

    • Iter — Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal metric value.

    20

    CPU and GPU

    Quantization settings
    target_border

    float

    If set, defines the border for converting target values to 0 and 1.

    Depending on the specified value:

    • Target values less than or equal to the border are converted to 0

    • Target values greater than the border are converted to 1

    None

    CPU and GPU

    border_count

    Alias: max_bin

    int

    The number of splits for numerical features. Allowed values depend on the processing unit type:
    • CPU — integers from 1 to 65535 inclusively.

    • GPU — integers from 1 to 255 inclusively.

    254 (if training is performed on CPU) or 128 (if training is performed on GPU)

    CPU and GPU

    feature_border_type

    string

    The quantization mode for numerical features.

    Possible values:
    • Median

    • Uniform

    • UniformAndQuantiles

    • MaxLogSum

    • MinEntropy

    • GreedyLogSum

    GreedyLogSum

    CPU and GPU

    Multiclassification settings
    classes_count

    int

    The upper limit for the numeric class label. Defines the number of classes for multiclassification.

    Only non-negative integers can be specified. The given integer should be greater than any of the label values.

    If this parameter is specified, the labels for all classes in the input dataset should be smaller than the given value.

    None.

    Calculation principles

    CPU and GPU

    Performance settings
    thread_count

    int

    The number of threads to use during training.

    • For CPU

      Optimizes the speed of execution. This parameter doesn't affect results.

    • For GPU

      The given value is used for reading the data from the hard drive and does not affect the training.

      During the training one main thread and one thread for each GPU are used.

    -1 (the number of threads is equal to the number of processor cores)

    CPU and GPU

    used_ram_limit

    int

    Attempt to limit the amount of used CPU RAM.

    Restriction.
    • This option affects only the CTR calculation memory usage.

    • In some cases it is impossible to limit the amount of CPU RAM used in accordance with the specified value.

    Format:
    <size><measure of information>
    Supported measures of information (case-insensitive):
    • MB

    • KB

    • GB

    For example:
    2gb
    None (memory usage is not limited)

    CPU

    gpu_ram_part

    float

    How much of the GPU RAM to use for training.

    0.95

    GPU

    pinned_memory_size

    int

    How much pinned (page-locked) CPU RAM to use per GPU.

    1073741824

    GPU

    gpu_cat_features_storage

    string

    The method for storing the categorical features' values.

    Possible values:
    • CpuPinnedMemory

    • GpuRam

    Tip.

    Use the CpuPinnedMemory value if feature combinations are used and the available GPU RAM is not sufficient.

    None (set to GpuRam)

    GPU

    data_partition

    string

    The method for splitting the input dataset between multiple workers.

    Possible values:
    • FeatureParallel — Split the input dataset by features and calculate the value of each of these features on a certain GPU.

      For example:

      • GPU0 is used to calculate the values of features indexed 0, 1, 2

      • GPU1 is used to calculate the values of features indexed 3, 4, 5, etc.

    • DocParallel — Split the input dataset by objects and calculate all features for each of these objects on a certain GPU. It is recommended to use powers of two as the value for optimal performance.

      For example:
      • GPU0 is used to calculate all features for objects indexed object_1, object_2

      • GPU1 is used to calculate all features for objects indexed object_3, object_4, etc.

    Depends on the learning mode and the input dataset

    GPU

    Processing unit settings
    task_type

    string

    The processing unit type to use for training.

    Possible values:
    • CPU

    • GPU

    CPU

    CPU and GPU

    devices

    string

    IDs of the GPU devices to use for training (indices are zero-based).

    Format

    • <unit ID> for one device (for example, 3)

    • <unit ID1>:<unit ID2>:..:<unit IDN> for multiple devices (for example, devices='0:1:3')

    • <unit ID1>-<unit IDN> for a range of devices (for example, devices='0-3')

    NULL (all GPU devices are used if the corresponding processing unit type is selected)

    GPU

    Visualization settings
    name

    string

    The experiment name to display in visualization tools.

    experiment

    CPU and GPU

    Output settings
    logging_level

    string

    The logging level to output to stdout.

    Possible values:
    • Silent — Do not output any logging information to stdout.

    • Verbose — Output the following data to stdout:

      • optimized metric

      • elapsed time of training

      • remaining time of training

    • Info — Output additional information and the number of trees.

    • Debug — Output debugging information.

    None (corresponds to the Verbose logging level)

    CPU and GPU

    metric_period

    int

    The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer.

    The usage of this parameter speeds up the training.

    Note.

    It is recommended to increase the value of this parameter to maintain training speed if a GPU processing unit type is used.

    1

    CPU and GPU

    verbose

    Alias: verbose_eval

    • bool

    • int

    The purpose of this parameter depends on the type of the given value:

    • bool — Defines the logging level:
      • “True”  corresponds to the Verbose logging level

      • “False” corresponds to the Silent logging level

    • int — Use the Verbose logging level and set the logging period to the value of this parameter.

    Restriction. Do not use this parameter with the logging_level parameter.

    1

    CPU and GPU

    train_dir

    string

    The directory for storing the files generated during training.

    catboost_info

    CPU and GPU

    model_size_reg

    float

    The model size regularization coefficient. The larger the value, the smaller the model size. Refer to the Model size regularization coefficient section for details.

    Any non-negative value is allowed.

    This regularization is needed only for models with categorical features (other models are small). Models with categorical features might weigh tens of gigabytes or more if categorical features have a lot of values. If the value of the regularizer differs from zero, then the usage of categorical features or feature combinations with a lot of values has a penalty, so fewer of them are used in the resulting model.

    Note that the resulting quality of the model can be affected. Set the value to 0 to turn off the model size optimization option.

    None (Turned on and set to 0.5 on CPU and turned off for GPU)

    CPU

    allow_writing_files

    bool

    Allow to write analytical and snapshot files during training.

    If set to “False”, the snapshot and data visualization tools are unavailable.

    True

    CPU and GPU

    save_snapshot

    bool

    Enable snapshotting for restoring the training progress after an interruption. If enabled, the default period for making snapshots is 600 seconds. Use the snapshot_interval parameter to change this period.

    Note. This parameter is not supported in the params parameter of the cv function.

    None

    CPU and GPU

    snapshot_file

    string

    The name of the file to save the training progress information in. This file is used for recovering training after an interruption.

    Depending on whether the specified file exists in the file system:
    • Missing — Write information about training progress to the specified file.

    • Exists — Load data from the specified file and continue training from where it left off.

    Note. This parameter is not supported in the params parameter of the cv function.

    experiment...

    CPU and GPU

    snapshot_interval

    int

    The interval between saving snapshots in seconds.

    The first snapshot is taken after the specified number of seconds since the start of training. Every subsequent snapshot is taken after the specified number of seconds since the previous one. The last snapshot is taken at the end of the training.

    Note. This parameter is not supported in the params parameter of the cv function.

    600

    CPU and GPU

    roc_file

    string

    The name of the output file to save the ROC curve points to. This parameter can only be set in cross-validation mode if the Logloss loss function is selected. The ROC curve points are calculated for the test fold.

    The output file is saved to the catboost_info directory.

    None (the file is not saved)

    CPU and GPU

    CTR settings
    simple_ctr

    string

    Quantization settings for simple categorical features. Use this parameter to specify the principles for defining the class of the object for regression tasks. By default, it is considered that an object belongs to the positive class if its label value is greater than the median of all label values of the dataset.

    Format:

    ['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
     'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
      ...]
    Components:
    • CtrType — The method for transforming categorical features to numerical features.

      Supported methods for training on CPU:

      • Borders

      • Buckets

      • BinarizedTargetMeanValue

      • Counter

      Supported methods for training on GPU:

      • Borders

      • Buckets

      • FeatureFreq

      • FloatTargetMeanValue

    • TargetBorderCount — The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1.

      This option is available for training on CPU only.

    • TargetBorderType — The quantization type for the label value. Only used for regression problems.

      Possible values:

      • Median

      • Uniform

      • UniformAndQuantiles

      • MaxLogSum

      • MinEntropy

      • GreedyLogSum

      By default, MinEntropy.

      This option is available for training on CPU only.

    • CtrBorderCount — The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusively.

    • CtrBorderType — The quantization type for categorical features.

      Supported values for training on CPU:
      • Uniform

      Supported values for training on GPU:

      • Median

      • Uniform

      • UniformAndQuantiles

      • MaxLogSum

      • MinEntropy

      • GreedyLogSum

    • Prior — Use the specified priors during training (several values can be specified).

      Possible formats:
      • One number — Adds the value to the numerator.

      • Two slash-delimited numbers (for GPU only) — Use this format to set a fraction. The first number is added to the numerator and the second is added to the denominator.

    Examples
    • simple_ctr='Borders:TargetBorderCount=2'

      Two new features with differing quantization settings are generated. The first one concludes that an object belongs to the positive class when the label value exceeds the first border. The second one concludes that an object belongs to the positive class when the label value exceeds the second border.

      For example, if the label takes three different values (0, 1, 2), the first border is 0.5 while the second one is 1.5.

    • simple_ctr='Buckets:TargetBorderCount=2'

      The number of features depends on the number of different labels. For example, three new features are generated if the label takes three different values (0, 1, 2). In this case, the first one concludes that an object belongs to the positive class when the value of the feature is equal to 0 or belongs to the bucket indexed 0. The second one concludes that an object belongs to the positive class when the value of the feature is equal to 1 or belongs to the bucket indexed 1, and so on.

     

    CPU and GPU

    combinations_ctr

    string

    Quantization settings for combinations of categorical features.

    ['CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
     'CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
      ...]
    Components:
    • CtrType — The method for transforming categorical features to numerical features.

      Supported methods for training on CPU:

      • Borders

      • Buckets

      • BinarizedTargetMeanValue

      • Counter

      Supported methods for training on GPU:

      • Borders

      • Buckets

      • FeatureFreq

      • FloatTargetMeanValue

    • TargetBorderCount — The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusively. The default value is 1.

      This option is available for training on CPU only.

    • TargetBorderType — The quantization type for the label value. Only used for regression problems.

      Possible values:

      • Median

      • Uniform

      • UniformAndQuantiles

      • MaxLogSum

      • MinEntropy

      • GreedyLogSum

      By default, MinEntropy.

      This option is available for training on CPU only.

    • CtrBorderCount — The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.

    • CtrBorderType — The quantization type for categorical features.

      Supported values for training on CPU:
      • Uniform

      Supported values for training on GPU:
      • Uniform

      • Median

    • Prior — Use the specified priors during training (several values can be specified).

      Possible formats:
      • One number — Adds the value to the numerator.

      • Two slash-delimited numbers (for GPU only) — Use this format to set a fraction. The first number is added to the numerator and the second to the denominator.

     

    Supported processing units: CPU and GPU
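A combinations_ctr entry with several priors can be assembled as a sketch like the following (the `ctr_spec_with_priors` helper is hypothetical; only the string format comes from the template above):

```python
def ctr_spec_with_priors(ctr_type, priors=(), **components):
    # Hypothetical helper: builds
    # 'CtrType[:Key=Value...][:Prior=num/denum...]'
    parts = [ctr_type]
    parts += [f"{k}={v}" for k, v in components.items()]
    parts += [f"Prior={num}/{denum}" for num, denum in priors]
    return ":".join(parts)

combinations_ctr = [
    ctr_spec_with_priors("Borders", priors=[(0, 1), (1, 1)],
                         TargetBorderCount=2, CtrBorderType="Uniform"),
]
# combinations_ctr[0] ==
#   'Borders:TargetBorderCount=2:CtrBorderType=Uniform:Prior=0/1:Prior=1/1'
```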

    per_feature_ctr (string)

    Per-feature quantization settings for categorical features.

    ['FeatureId:CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
     'FeatureId:CtrType[:TargetBorderCount=BorderCount][:TargetBorderType=BorderType][:CtrBorderCount=Count][:CtrBorderType=Type][:Prior=num_1/denum_1]..[:Prior=num_N/denum_N]',
      ...]
    Components:
    • FeatureId — A zero-based feature identifier.

    • CtrType — The method for transforming categorical features to numerical features.

      Supported methods for training on CPU:

      • Borders

      • Buckets

      • BinarizedTargetMeanValue

      • Counter

      Supported methods for training on GPU:

      • Borders

      • Buckets

      • FeatureFreq

      • FloatTargetMeanValue

    • TargetBorderCount — The number of borders for label value quantization. Only used for regression problems. Allowed values are integers from 1 to 255 inclusive. The default value is 1.

      This option is available for training on CPU only.

    • TargetBorderType — The quantization type for the label value. Only used for regression problems.

      Possible values:

      • Median

      • Uniform

      • UniformAndQuantiles

      • MaxLogSum

      • MinEntropy

      • GreedyLogSum

      By default, MinEntropy.

      This option is available for training on CPU only.

    • CtrBorderCount — The number of splits for categorical features. Allowed values are integers from 1 to 255 inclusive.

    • CtrBorderType — The quantization type for categorical features.

      Supported values for training on CPU:
      • Uniform

      Supported values for training on GPU:

      • Median

      • Uniform

      • UniformAndQuantiles

      • MaxLogSum

      • MinEntropy

      • GreedyLogSum

    • Prior — Use the specified priors during training (several values can be specified).

      Possible formats:
      • One number — Adds the value to the numerator.

      • Two slash-delimited numbers (for GPU only) — Use this format to set a fraction. The first number is added to the numerator and the second to the denominator.

     

    Supported processing units: CPU and GPU
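A per_feature_ctr list pairs a zero-based feature index with a CTR description, as the template above shows. A small sketch (the feature indices and settings are made up for illustration):

```python
# Each entry is 'FeatureId:CtrType[:options...]'; indices are zero-based.
per_feature_ctr = [
    f"{feature_id}:{spec}"
    for feature_id, spec in [
        (0, "Borders:CtrBorderCount=15"),
        (3, "Counter"),
    ]
]
# → ['0:Borders:CtrBorderCount=15', '3:Counter']
```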

    ctr_target_border_count (int)

    The maximum number of borders to use in target quantization for categorical features that need it. Allowed values are integers from 1 to 255 inclusive.

    The value of the TargetBorderCount component overrides this parameter if it is specified for one of the following parameters:

    • simple_ctr

    • combinations_ctr

    • per_feature_ctr

    Default: Number_of_classes - 1 for multiclassification problems when training on CPU, 1 otherwise

    Supported processing units: CPU and GPU
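The documented default can be expressed as a small sketch (the `default_target_border_count` function is hypothetical; the real logic lives inside CatBoost):

```python
def default_target_border_count(n_classes, loss_function, task_type):
    # Documented default: Number_of_classes - 1 for multiclassification
    # when training on CPU, 1 otherwise.
    if loss_function == "MultiClass" and task_type == "CPU":
        return n_classes - 1
    return 1

default_target_border_count(3, "MultiClass", "CPU")  # → 2
default_target_border_count(3, "MultiClass", "GPU")  # → 1
```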

    counter_calc_method (string)

    The method for calculating the Counter CTR type.

    Possible values:
    • SkipTest — Objects from the validation dataset are not considered at all.

    • Full — All objects from both learn and validation datasets are considered

    Default: None (Full is used)

    Supported processing units: CPU and GPU
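The difference between the two modes can be sketched as follows (the `counter_pool` helper is hypothetical; it only illustrates which objects contribute to the Counter statistics):

```python
def counter_pool(learn_objects, validation_objects, method="Full"):
    # Hypothetical sketch of counter_calc_method:
    # 'SkipTest' counts learn objects only, 'Full' counts both.
    if method == "SkipTest":
        return list(learn_objects)
    return list(learn_objects) + list(validation_objects)

len(counter_pool(["a", "b"], ["c"], method="SkipTest"))  # → 2
len(counter_pool(["a", "b"], ["c"], method="Full"))      # → 3
```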

    max_ctr_complexity (int)

    The maximum number of features that can be combined.

    Each resulting combination consists of one or more categorical features and can optionally contain binary features in the following form: “numeric feature > value”.

    Default: 4

    Supported processing units: CPU and GPU
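For intuition, the number of possible feature combinations grows quickly with this limit. An illustrative sketch of the search space (CatBoost actually builds combinations greedily while growing the tree, not by exhaustive enumeration):

```python
from itertools import combinations

def candidate_combinations(cat_features, max_ctr_complexity):
    # Enumerate all subsets of up to max_ctr_complexity categorical
    # features -- purely illustrative of the size of the search space.
    combos = []
    for size in range(1, max_ctr_complexity + 1):
        combos.extend(combinations(cat_features, size))
    return combos

len(candidate_combinations(["color", "breed", "size"], 2))  # → 6
```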

    ctr_leaf_count_limit (int)

    The maximum number of leaves with categorical features. If the number of leaves exceeds the specified value, some of the leaves are discarded.

    The leaves to be discarded are selected as follows:

    1. The leaves are sorted by the frequency of the values.

    2. The top N leaves are selected, where N is the value specified in the parameter.

    3. All leaves starting from N+1 are discarded.

    This option reduces the resulting model size and the amount of memory required for training. Note that the resulting quality of the model can be affected.

    Default: None (the number of different category values is not limited)

    Supported processing units: CPU
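The three-step selection procedure above can be sketched in a few lines (the `limit_leaves` helper is hypothetical, using the standard library's `collections.Counter`):

```python
from collections import Counter

def limit_leaves(category_values, limit):
    # 1. Count how often each category value occurs.
    # 2. Keep the `limit` most frequent values.
    # 3. Everything else (rank limit+1 onward) is discarded.
    ranked = Counter(category_values).most_common(limit)
    return {value for value, _ in ranked}

limit_leaves(["a", "a", "a", "b", "b", "c"], 2)  # → {'a', 'b'}
```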

    store_all_simple_ctr (bool)

    Ignore categorical features that are not used in feature combinations when choosing candidates for exclusion.

    Use this parameter only together with ctr_leaf_count_limit.

    Default: None (set to False; both simple features and feature combinations are taken into account when limiting the number of leaves with categorical features)

    Supported processing units: CPU

    final_ctr_computation_mode (string)

    Final CTR computation mode.

    Possible values:
    • Default — Compute final CTRs for learn and validation datasets.

    • Skip — Do not compute final CTRs for learn and validation datasets. In this case, the resulting model cannot be applied. This mode decreases the size of the resulting model. It can be useful for research purposes when only the metric values have to be calculated.

    Default: Default

    Supported processing units: CPU and GPU
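Putting the section together: all the CTR parameters described above are plain keyword arguments. A sketch of a combined parameter dictionary (the values are illustrative, not tuning recommendations), which would typically be passed as `CatBoostClassifier(**params)`:

```python
# Illustrative combination of the CTR parameters documented above.
params = {
    "simple_ctr": ["Borders:TargetBorderCount=2"],
    "combinations_ctr": ["Counter:CtrBorderCount=15"],
    "max_ctr_complexity": 2,
    "counter_calc_method": "SkipTest",
    "final_ctr_computation_mode": "Default",
}
```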
