  • jar package for converting an lgbm model to PMML format: jpmml-lightgbm-executable-1.3-SNAPSHOT.jar
  • Problems encountered using lgbm models

    2019-07-16 10:49:18

    While using lgbm models I have run into a few problems:

    1. Estimator not fitted, call `fit` before exploiting the model.

        Multiple estimators (parameter sets) were created during training, and by the time the final pkl file was generated the parameter versions had gotten mixed up.

    2. No module named 'pandas.core.indexes'

        The training environment and the deployment environment differ. Python is strict about package versions: the lightgbm and pandas versions must match across environments, otherwise writing and unpickling the pkl file breaks with this error.
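    One way to catch the second problem early is to store the library versions next to the model and check them before unpickling. This is a sketch of an assumed workflow; `dump_with_versions` and `load_with_version_check` are hypothetical helpers, not lightgbm API:

```python
import pickle

def dump_with_versions(model, path, versions):
    # versions: e.g. {"pandas": pd.__version__, "lightgbm": lgb.__version__}
    with open(path, "wb") as f:
        pickle.dump({"model": model, "versions": versions}, f)

def load_with_version_check(path, current_versions):
    # refuse to use the model if the serving environment's versions differ
    with open(path, "rb") as f:
        payload = pickle.load(f)
    mismatched = {name: (saved, current_versions.get(name))
                  for name, saved in payload["versions"].items()
                  if current_versions.get(name) != saved}
    if mismatched:
        raise RuntimeError("library version mismatch: %s" % mismatched)
    return payload["model"]
```

    Failing loudly at load time is easier to debug than the `No module named 'pandas.core.indexes'` error surfacing deep inside unpickling.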

  • Solution: set the parameter verbose=-1 in both places — when initializing the model (the sklearn-style LGBMRegressor() or the native lgbm both work) and in the fit() call

    Solution

    In both of these places:

    1. When initializing the model (the sklearn-style LGBMRegressor() and the native lgbm interface both work)
    2. In the fit() call

    set the parameter verbose=-1.

  • 1: Registration link ... 2: Leaderboard score ... 4: Model source code — without further ado, the source: import pandas as pd import numpy as np import pickle # load data raw=pd.read_csv('./train.csv') train_raw=raw[raw['order...

    1: Registration link

         https://aistudio.baidu.com/aistudio/competition/detail/51

    2: Leaderboard score

    3: Certificate

    Placeholder for now

    4: Model source code

    Without further ado, here is the source:

    import pandas as pd
    import numpy as np
    import pickle
    
    # load the data
    raw=pd.read_csv('./train.csv')
    train_raw=raw[raw['order_pay_time']<='2013-07-31 23:59:59']
    raw.sort_values('order_pay_time',ascending=True,inplace=True)
    
    # set of users who made a purchase in the following month (August)
    label_raw=set(raw[raw['order_pay_time']>'2013-07-31 23:59:59']['customer_id'].dropna())
    
    # data preprocessing
    def preprocess(raw,train = 'train'):
        # aggregate features per customer_id
        data = pd.DataFrame(
            # if gender is missing, fill with 0
            raw.groupby('customer_id')['customer_gender'].last().fillna(0)
        )  
        # user-goods interaction features (last action)
        data[['goods_id_last','goods_status_last','goods_price_last','goods_has_discount_last','goods_list_time_last',
              'goods_delist_time_last']]= \
            raw.groupby('customer_id')[['goods_id','goods_status','goods_price','goods_has_discount','goods_list_time',
                                        'goods_delist_time']].last() 
        
        # user-order interaction features (last action)
        data[['order_total_num_last','order_amount_last','order_total_payment_last','order_total_discount_last','order_pay_time_last',
              'order_status_last','order_count_last','is_customer_rate_last','order_detail_status_last', 'order_detail_goods_num_last', 
              'order_detail_amount_last','order_detail_payment_last', 'order_detail_discount_last']]= \
            raw.groupby('customer_id')[['order_total_num', 'order_amount','order_total_payment', 'order_total_discount', 'order_pay_time',
                   'order_status', 'order_count', 'is_customer_rate','order_detail_status', 'order_detail_goods_num', 
                    'order_detail_amount','order_detail_payment', 'order_detail_discount']].last()     
        
        # user-membership interaction features (last action) ++
        data[['member_id_last','member_status_last','is_member_actived_last']]= \
            raw.groupby('customer_id')[['member_id','member_status','is_member_actived']].last() 
        
        # goods list price (several aggregate statistics)
        data[['goods_price_min','goods_price_max','goods_price_mean','goods_price_std']]= \
            raw.groupby('customer_id')['goods_price'].agg(['min','max','mean','std'])
        
        # actual order payment (several aggregate statistics)
        data[['order_total_payment_min','order_total_payment_max','order_total_payment_mean','order_total_payment_std']]= \
            raw.groupby('customer_id')['order_total_payment'].agg(['min','max','mean','std'])
        
        # number of orders per user
        data['order_count'] = raw.groupby('customer_id')['order_id'].count()
        
        # number of goods purchased per user
        data['goods_count'] = raw.groupby('customer_id')['goods_id'].count()
        
        # user's province
        data['customer_province'] = raw.groupby('customer_id')['customer_province'].last()
        
        # user's city
        data['customer_city'] = raw.groupby('customer_id')['customer_city'].last()
        
        # whether the user left a review (mean and sum)
        data[['is_customer_rate_mean','is_customer_rate_sum']]=raw.groupby('customer_id')['is_customer_rate'].agg([
            ('is_customer_rate_mean',np.mean),
            ('is_customer_rate_sum',np.sum)
        ])
        
        # amount due divided by amount actually paid ++; the larger the discount ratio, the more likely a purchase
        data['discount']=data['order_detail_amount_last']/data['order_detail_payment_last']
        
        # user's membership status ++
        data[['member_status_mean','member_status_sum']]=raw.groupby('customer_id')['member_status'].agg([
            ('member_status_mean',np.mean),
            ('member_status_sum',np.sum)
        ])
        
        # order discount amount; the bigger the discount, the more likely a purchase
        data[['order_detail_discount_mean','order_detail_discount_sum']]=raw.groupby('customer_id')['order_detail_discount'].agg([
            ('order_detail_discount_mean',np.mean),
            ('order_detail_discount_sum',np.sum)
        ])      
        
        # goods status
        data[['goods_status_mean','goods_status_sum']]=raw.groupby('customer_id')['goods_status'].agg([
            ('goods_status_mean',np.mean),
            ('goods_status_sum',np.sum)
        ])   
        
        # membership activation status
        data[['is_member_actived_mean','is_member_actived_sum']]=raw.groupby('customer_id')['is_member_actived'].agg([
            ('is_member_actived_mean',np.mean),
            ('is_member_actived_sum',np.sum)
        ])  
        
        # order status
        data[['order_status_mean','order_status_sum']]=raw.groupby('customer_id')['order_status'].agg([
            ('order_status_mean',np.mean),
            ('order_status_sum',np.sum)
        ])
        
        # number of order-detail records per user
        data['order_detail_count'] = raw.groupby('customer_id')['customer_id'].count()
        
        # goods discount statistics
        data[['goods_has_discount_mean','goods_has_discount_sum']]= raw.groupby('customer_id')['goods_has_discount'].agg([
            ('goods_has_discount_mean',np.mean),
            ('goods_has_discount_sum',np.sum)
        ])
        
        # actual order payment statistics (this overwrites the order_total_payment_mean computed above)
        data[['order_total_payment_mean','order_total_payment_sum']]= raw.groupby('customer_id')['order_total_payment'].agg([
            ('order_total_payment_mean',np.mean),
            ('order_total_payment_sum',np.sum)
        ])
            
        # order item count statistics
        data[['order_total_num_mean','order_total_num_sum']]= raw.groupby('customer_id')['order_total_num'].agg([
            ('order_total_num_mean',np.mean),
            ('order_total_num_sum',np.sum)
        ])    
        # decompose the last payment time into calendar features
        data['order_pay_time_last'] = pd.to_datetime(data['order_pay_time_last'])
        data['order_pay_time_last_m'] = data['order_pay_time_last'].dt.month
        data['order_pay_time_last_d'] = data['order_pay_time_last'].dt.day
        data['order_pay_time_last_h'] = data['order_pay_time_last'].dt.hour
        data['order_pay_time_last_min'] = data['order_pay_time_last'].dt.minute
        data['order_pay_time_last_s'] = data['order_pay_time_last'].dt.second
        data['order_pay_time_last_weekday'] = data['order_pay_time_last'].dt.weekday
        
        # days between order_pay_time_last and the dataset start time
        t_min=pd.to_datetime('2012-10-11 00:00:00')
        data['order_pay_time_last_diff'] = (data['order_pay_time_last']-t_min).dt.days
        
        # latest goods listing-time diff (assume the start time is 2012-10-11 00:00:00)
        data['goods_list_time_last'] =pd.to_datetime(data['goods_list_time_last'])    
        data['goods_list_time_diff'] = (data['goods_list_time_last']-t_min).dt.days
        
        # latest goods delisting-time diff (assume the start time is 2012-10-11 00:00:00)
        data['goods_delist_time_last'] =pd.to_datetime(data['goods_delist_time_last'])    
        data['goods_delist_time_diff'] = (data['goods_delist_time_last']-t_min).dt.days
        
        # goods on-shelf duration (delist time minus list time)
        data['goods_time_diff'] =  data['goods_delist_time_diff']-data['goods_list_time_diff']
        return data
    
    # preprocess the training set
    train_raw2=preprocess(train_raw)
    train_raw2['label']=train_raw2.index.map(lambda x:int(x in label_raw))
    train_raw2.drop(['goods_list_time_last','goods_delist_time_last','order_pay_time_last'],axis=1,inplace=True)
    
    # preprocess the test set
    test=preprocess(raw)
    test.drop(['goods_list_time_last','goods_delist_time_last','order_pay_time_last'],axis=1,inplace=True)
    
    # LabelEncode province and city for both the training and test sets
    test['customer_province'] = test['customer_province'].astype('str') 
    test['customer_city'] = test['customer_city'].astype('str')
    train_raw2['customer_province'] = train_raw2['customer_province'].astype('str') 
    train_raw2['customer_city'] = train_raw2['customer_city'].astype('str')
    from sklearn.preprocessing import LabelEncoder
    lel=LabelEncoder()
    test['customer_province']=lel.fit_transform(test['customer_province'])
    train_raw2['customer_province']=lel.fit_transform(train_raw2['customer_province'])
    le2=LabelEncoder()
    test['customer_city']=le2.fit_transform(test['customer_city'])
    train_raw2['customer_city']=le2.fit_transform(train_raw2['customer_city'])
    
    
    # save the preprocessed data for later
    import pickle
    train_raw2.to_pickle('./train_raw.pkl')
    test.to_pickle('./test.pkl')
    
    
    # load the preprocessed files
    with open('./train_raw.pkl', 'rb') as file:
        train_raw2 = pickle.load(file)
    
    with open('./test.pkl', 'rb') as file:
        test = pickle.load(file)
    
    # concatenate train and test so their columns align, then split again on label
    train_raw2=train_raw2.reset_index()
    test=test.reset_index()
    all_df=pd.concat([train_raw2,test],axis=0)
    train_raw2=all_df[all_df['label'].notnull()]
    test=all_df[all_df['label'].isnull()]
    
    # LightGBM modeling
    import lightgbm as lgb
    # empirical LGBMClassifier parameters
    clf = lgb.LGBMClassifier(
                num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25, objective='binary',
                max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2021,
                n_estimators=2500, subsample=1, colsample_bytree=1,
            )
    clf.fit(train_raw2.drop(['label','customer_id'],axis=1),train_raw2['label'])
    
    
    # post-process the results
    # leaderboard scores for different buy_num values:
    # 0.70457, 300000
    # 0.7139,  400000
    # 0.71512, 500000
    # 0.70902, 600000
    # 0.71555, 450000
    cols=train_raw2.columns.tolist()
    cols.remove('label')
    cols.remove('customer_id')
    
    y_pred=clf.predict_proba(test.drop(['label','customer_id'],axis=1))[:,1] 
    result=pd.read_csv('./submission.csv')
    result['result']=y_pred
    result2=result.sort_values('result',ascending=False).copy()
    buy_num=450000
    result2.index=range(len(result2))
    # mark the top buy_num customers (by predicted probability) as purchasers
    result2.loc[result2.index<=buy_num,'result']=1
    result2.loc[result2.index>buy_num,'result']=0
    result2.sort_values('customer_id',ascending=True,inplace=True)
    result2.to_csv('./baseline_0.7155.csv',index=False)

    5: Tips for improving the score

    1: The buy_num parameter has a large impact on the final score

            At first buy_num was set to around 10000 and the score was low, about 0.5.

            When tuning a value like this, a doubling strategy works well: start with buy_num=10000, then try 20000, 40000, 80000, ... increasing exponentially each time. Once the score starts to drop, narrow the range and step toward the model's best score.
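    The doubling-then-narrowing idea above can be sketched as plain code. Here `score()` is a hypothetical stand-in for one leaderboard submission (in the real competition each evaluation is a manual submit):

```python
def doubling_probe(score, start=10000):
    # double buy_num until the score drops, then return the bracketing pair
    prev_n, prev_s = start, score(start)
    n = start * 2
    while score(n) > prev_s:
        prev_n, prev_s = n, score(n)
        n *= 2
    return prev_n, n  # the best buy_num lies somewhere in [prev_n, n]
```

    From the returned bracket you can then probe midpoints (e.g. bisection) to home in on the best value with only a handful of extra submissions.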

    2: Feature engineering is always the key to a better score

            Finding good features mostly comes down to how well you understand the problem domain. Since this is a purchase-prediction task, think about what you yourself pay attention to when buying something, then analyze from there.

            For example discounts, stock, and coupon amounts are all things a buyer weighs.

    6: Related knowledge

    Empirical LGBMClassifier parameters

    import lightgbm as lgb
    
    clf = lgb.LGBMClassifier(
        num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25, objective='binary',
        max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2021,
        n_estimators=2000, subsample=1, colsample_bytree=1,
    )

    num_leaves=2**5-1  # maximum number of leaves per tree; compared with XGBoost, roughly 2^(max_depth)

    reg_alpha: L1 regularization coefficient

    reg_lambda: L2 regularization coefficient

    max_depth: maximum depth of each tree

    n_estimators: number of trees, i.e. the number of boosting rounds

    subsample: sampling rate of training rows (row subsampling)

    colsample_bytree: sampling rate of training features (column subsampling)

  • 1. LGBM regression task code: import numpy as np import lightgbm as lgbm from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error from sklearn.datasets import make_...

    1. LGBM regression task code

    import numpy as np
    import lightgbm as lgbm
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import make_regression
    import nni
    
    
    def load_data(): # loads the data
        reg_x, reg_y = make_regression(n_samples=1000, n_features=5)
        x_train, x_test, y_train, y_test = train_test_split(reg_x, reg_y, train_size=0.8, random_state=0, shuffle=True)
        return x_train, y_train, x_test, y_test
    
    
    def get_model(param, x_train, y_train, x_test, y_test): # trains, evaluates, and reports the result to the NNI framework
        model = lgbm.LGBMRegressor(**param)
        model.fit(x_train, np.array(y_train), eval_set=[(x_test, np.array(y_test))], early_stopping_rounds=100, verbose=-1)
        value = float(mean_squared_error(y_test, model.predict(x_test)))  # must be a plain int/float, not a numpy type such as numpy.ndarray
        nni.report_final_result(value)
    
    
    if __name__ == '__main__':
        RECEIVED_PARAMS = nni.get_next_parameter() # parameters sampled by the framework
        params = {'num_leaves': 120,
                  'min_data_in_leaf': 60,
                  'objective': 'regression',
                  'max_depth': -1,
                  'learning_rate': 0.08,
                  "min_child_samples": 40,
                  "boosting": "gbdt",
                  "feature_fraction": 0.95, } # initial parameters
        x_train, y_train, x_test, y_test = load_data()
        params.update(RECEIVED_PARAMS) # update the initial parameters with the ones sampled by the framework
        get_model(params, x_train, y_train, x_test, y_test)  # train and report the result
    

    2. Configure search_space.json

    This file defines the tuning range for each parameter:

    {
      "num_leaves":{"_type":"randint","_value":[ 20,50 ]},
      "learning_rate":{"_type":"uniform","_value":[0.05 , 0.5 ]},
      "n_estimators":{"_type":"randint","_value":[ 50 , 200 ]},
      "min_child_samples":{"_type":"randint","_value":[ 5, 30 ]}
    }
    

    3. Configure config.yml

    This is the framework's configuration file.

    Create a config.yml file and write:

    searchSpaceFile: search_space.json # the search space; the file name just has to match
    trialCommand: python my_nni_lgbm.py # the python script to run; the file name just has to match
    trialConcurrency: 10 # run 10 trials concurrently
    maxTrialNumber: 10000 # up to 10000 tuning trials
    maxExperimentDuration: 8h # run for at most 8 hours
    tuner:
      name: TPE
      classArgs:
        optimize_mode: minimize
    trainingService: # For other platforms, check mnist-pytorch example
      platform: local
    

    Then run in the console:

    nnictl create --config config.yml -p 12387
    

    On success you will see a log like:

    INFO:  Starting restful server...
    INFO:  Successfully started Restful server!
    INFO:  Starting experiment...
    INFO:  Successfully started experiment!
    ------------------------------------------------------------------------------------
    The experiment id is me5YV9Q8
    The Web UI urls are: http://127.0.0.1:12387   http://10.8.74.190:12387
    ------------------------------------------------------------------------------------
    
    You can use these commands to get more information about the experiment
    ------------------------------------------------------------------------------------
             commands                       description
    1. nnictl experiment show        show the information of experiments
    2. nnictl trial ls               list all of trial jobs
    3. nnictl top                    monitor the status of running experiments
    4. nnictl log stderr             show stderr log content
    5. nnictl log stdout             show stdout log content
    6. nnictl stop                   stop an experiment
    7. nnictl trial kill             kill a trial job by id
    8. nnictl --help                 get help information about nnictl
    ------------------------------------------------------------------------------------
    Command reference document https://nni.readthedocs.io/en/latest/Tutorial/Nnictl.html
    ------------------------------------------------------------------------------------
    

    Visit http://127.0.0.1:12387 locally to see the experiment running in the web UI.

  • Say you have an lgbm_model: import joblib # save joblib.dump(lgbm_model, "lgbm_model.pkl") # load my_model = joblib.load("lgbm_model.pkl") Method two (for the native interface: import lightgbm): this step tends to raise an error: ...
  • Prediction with an LGBM classification model

    2021-05-28 16:56:36
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2) # model training gbm = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=20) gbm.fit(x_train, y_train, eval_set=[(x_test...
  • Bayesian hyperparameter tuning with LGBM

    2021-08-26 08:38:28
    For background on Bayesian tuning and the code format see: ... Build the LGBM model: model = LGBMRegressor( num_leaves=31, learning_rate=learning_rate, n_estimators=int(n_estimators), silent=True, n_jobs=5
  • How to use lgbm.LGBMRegressor: 1. Install the package: pip install lightgbm 2. Prepare your input data ... 3. Set up the model: def fit_lgbm(x_train, y_train, x_valid, y_valid, num, params: dict=None, verbose=100): # check whether there is a trained mo...
  • ROC curve and AUC computation for lgbm

    2021-08-12 23:03:15
    Plotting an ROC curve for an lgbm model. 1. Get the classification probabilities: import numpy as np import pandas as pd import lightgbm as lgb from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = \ ...
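    The probability-to-ROC step itself only needs sklearn. A sketch with hand-made scores standing in for lgbm's predict_proba(X_test)[:, 1]:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])          # stand-in for predict_proba(...)[:, 1]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points for the ROC plot
auc = roc_auc_score(y_true, y_prob)
```

    fpr and tpr can then be passed straight to matplotlib's plot() to draw the curve.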
  • [Machine Learning] Model ensembling with GBDT (xgb/lgbm/rf) + LR: principles and practice

    2019-01-23 16:21:37
    Contents: 1. Principles ... GBDT + LR model improvement 3. Practice 1. How to find which leaf node a sample falls into 2. Examples 2.2.1 Preparing the training set 2.2.2 RF+LR 2.2.3 GBDT+LR 2.2.4 Xgboost+LR 2.2.5 Using RF, GBDT and Xgbo...
  • For the data in attachment 1, we built a multiple linear regression model relating total mileage and related factors to the transaction price, guide price, and route cost; to improve route-price prediction accuracy we then built a multi-factor LightGBM regression model and compared it with the linear regression; finally, using historical successful transactions...
  • The LGBM algorithm

    2019-04-13 21:42:41
    LGBM: algorithm definition, algorithm practice, and more. Concept: Light GBM is a gradient boosting framework that uses tree based learning algorithm. Problems with the traditional GBDT algorithm: how to reduce the training data. A common way to reduce the amount of training data is...
  • The original LGBM paper

    2020-04-26 09:22:38
    The original paper in which LGBM was published
  • Learning how to tune LGBM

    2019-04-30 18:24:12
    1. Understanding LGBM parameters: LGBM is Microsoft's light gradient boosting machine; its main selling point is speed, and it offers regression and classification tree models. To use LGBM, first look up what its parameters mean: the official documentation on Microsoft's GitHub: ... the LGBM Chinese manual: ...
  • A demonstration of solving classification and regression problems with the decision tree models imported directly from sklearn. For the detailed parameters of sklearn's decision tree models, see the article "Detailed explanation of the important parameters of classification decision trees in sklearn". All the data and code for the examples below are here. If...
  • This article records and shares my summary after studying ensemble learning models. It skips the rigorous mathematical derivations and focuses on the ideas behind the different models and how they relate and differ, to make them easier to understand and remember ------ Ensemble modeling holds that a single learner's capability (including its learning direction...
  • Classifying text with lgbm

    2019-11-03 21:20:13
    Load the packages: import lightgbm as lgb import pandas as pd from sklearn.model_selection import GridSearchCV from sklearn.model_selection import train_test_split import gensim import jieba import os ...
