  • Kaggle Lesson 5: energy prediction

    2018-04-07 13:22:29
    Kaggle Lesson 5, energy prediction; well suited as an introduction to machine learning.
  • kaggle 实战 lecture05 能源预测与分配问题.zip (Kaggle in practice, lecture 05: the energy prediction and allocation problem)
  • Some columns are ~75% missing; the missing-value handling is fairly basic, using the median. Find outliers where the meter reading is 0. log1p reference: https://www.zhihu.com/question/28676215 ... Data preprocessing reference: https://www.cnblogs.com/p...

    Some columns are about 75% missing, and the handling here is fairly basic: fill with the median.
    Finding outliers: cases where the meter reading is 0.
    log1p references:
    https://www.zhihu.com/question/28676215
    https://blog.csdn.net/qq_36523839/article/details/82422865
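    In short: meter readings are zero-inflated and heavily right-skewed, so log1p/expm1 are used instead of log/exp. A tiny illustration with toy values:

    import numpy as np

    y = np.array([0.0, 10.0, 1000.0])   # readings include zeros and large spikes
    y_log = np.log1p(y)                 # log(1 + y) is well-defined at y = 0
    y_back = np.expm1(y_log)            # expm1 inverts the transform exactly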
    Data preprocessing:
    Reference: https://www.cnblogs.com/purple5252/p/11343835.html
    Timestamp alignment:
    based on the time at which air temperature reaches its daily maximum.
    Training takes 8-10 hours.
    The blend weights 0.3 / 0.3 / 0.4 came from repeated trials.

    Aligning all the timestamps so the temperature peak falls at 2 pm might improve the score (sketch below).
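    A rough sketch of that idea (my own reconstruction, not the course's code; column names follow the ASHRAE weather files): estimate each site's clock offset from the hour at which air temperature peaks on average, then shift timestamps so every site peaks at 14:00.

    import pandas as pd

    def align_timestamps(weather_df, peak_hour=14):
        df = weather_df.copy()
        df["timestamp"] = pd.to_datetime(df["timestamp"])
        df["hour"] = df["timestamp"].dt.hour
        # mean temperature per (site, hour); the argmax hour approximates the local peak
        peak = (df.groupby(["site_id", "hour"])["air_temperature"]
                  .mean().unstack().idxmax(axis=1))
        offsets = df["site_id"].map(peak - peak_hour)
        df["timestamp"] = df["timestamp"] - pd.to_timedelta(offsets, unit="h")
        return df.drop(columns="hour")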

  • This is a Kaggle competition I entered in November-December 2019: ASHRAE - Great Energy Predictor III. Final result: top 1%, 22/3614. Just short of a gold medal; every place gained near the top feels like alchemy, with some luck involved... Tough going...

    This is a Kaggle competition I entered during November-December 2019: ASHRAE - Great Energy Predictor III.

    Final result: top 1%, ranked 22/3614. Just short of a gold medal; every place gained near the top feels like alchemy, with a fair amount of luck involved. Tough going.
    I won't go over the competition requirements in detail here; see the official competition page.

    I put together a notebook at the time recording my solution. It isn't easy to share in full, so the key code follows:

    # coding: utf-8
    
    # ## 1. Competition introduction: ASHRAE - Great Energy Predictor III
    
    # > In real life, many buildings consume a lot of energy; in summer, for example, air conditioning is needed for cooling. That brings not only financial cost but also environmental impact. To reduce energy consumption, we need to predict a building's energy use. This competition combines weather data, building data, and hot-water/chilled-water consumption data to predict energy usage. The official description follows.
    # 
    # ### How much energy will a building consume?
    #  
    # * Q: How much does it cost to cool a skyscraper in the summer?
    # * A: A lot! And not just in dollars, but in environmental impact.
    # 
    
    # > Below are the files used in this competition. I won't translate the official descriptions; read them on the competition site.
    # 
    
    
    # # <a id='1-2'>1.2 Evaluation metric</a> (<a href='#9'>back</a>)
    
    # The metric for this competition is RMSLE, Root Mean Squared Logarithmic Error.
    # It can be computed as follows:
    # ```
    # import numpy as np
    # from sklearn.metrics import mean_squared_log_error
    # loss = np.sqrt(mean_squared_log_error(y_test, predictions))
    # ```
    
    # # 2. Imports
    
    # In[2]:
    
    
    import random
    import datetime
    import os,gc,math
    import numpy as np 
    import pandas as pd
    import seaborn as sns
    import lightgbm as lgb
    import matplotlib.pyplot as plt
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import LabelEncoder
    from pandas.api.types import is_categorical_dtype
    from pandas.api.types import is_datetime64_any_dtype as is_datetime
    get_ipython().run_line_magic('matplotlib', 'inline')
    
    
    # # <a id='3'>3. Reading and compressing the data</a> (<a href='#9'>back</a>)
    
    # In[3]:
    
    
    # %%time in the original notebook; unpacked here for readability
    data_path = './ashrae-energy-prediction/'
    train_df = pd.read_csv(data_path + 'train.csv')
    test_df = pd.read_csv(data_path + 'test.csv')
    weather_train_df = pd.read_csv(data_path + 'weather_train.csv')
    weather_test_df = pd.read_csv(data_path + 'weather_test.csv')
    building_meta_df = pd.read_csv(data_path + 'building_metadata.csv')
    sample_submission = pd.read_csv(data_path + 'sample_submission.csv')
    
    
    # In[4]:
    
    
    train_df.head()
    
    
    # In[5]:
    
    
    train_df.columns
    
    
    # In[6]:
    
    
    weather_train_df.columns
    
    
    # In[7]:
    
    
    ## Downcast numeric columns to save memory (optional: with plenty of RAM you can skip this; float16 costs a little accuracy)
    def reduce_mem_usage(df, verbose=True):
        numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
        start_mem = df.memory_usage().sum() / 1024**2    
        for col in df.columns:
            col_type = df[col].dtypes
            if col_type in numerics:
                c_min = df[col].min()
                c_max = df[col].max()
                if str(col_type)[:3] == 'int':
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)  
                else:
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        df[col] = df[col].astype(np.float64)    
                        
        end_mem = df.memory_usage().sum() / 1024**2
        if verbose: 
            print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
            
        return df
    
    
    # In[8]:
    
    
    train_df = reduce_mem_usage(train_df)
    test_df = reduce_mem_usage(test_df)
    
    
    # # <a id='4'>4. Data analysis</a> (<a href='#9'>back</a>)
    
    # In[9]:
    
    
    train_df.head()
    
    
    # In[10]:
    
    
    test_df.head()
    
    
    # In[10]:
    
    
    weather_train_df.head()
    
    
    # In[11]:
    
    
    weather_test_df.head()
    
    
    # In[12]:
    
    
    building_meta_df.head()
    
    
    # In[13]:
    
    
    train_df.dtypes
    
    
    # In[14]:
    
    
    weather_train_df.dtypes
    
    
    # In[15]:
    
    
    building_meta_df.dtypes
    
    
    # In[16]:
    
    
    # Distribution of the target (meter_reading)
    a = plt.hist(train_df['meter_reading'], range(0,500))
    
    
    # In[17]:
    
    
    train_df.groupby(["meter"])["meter_reading"].agg(['mean','std'])
    
    
    # In[18]:
    
    
    import seaborn as sns
    
    
    # In[19]:
    
    
    # Correlation analysis (weather)
    num_cols = ['air_temperature','cloud_coverage','dew_temperature','precip_depth_1_hr','sea_level_pressure','wind_direction','wind_speed' ]
    plt.figure(figsize=(5,5))
    sns.heatmap(weather_train_df[num_cols].dropna(inplace=False).corr(), annot=True)
    
    
    # In[20]:
    
    
    # Correlation analysis (building)
    num_cols2 = ['square_feet','floor_count']
    plt.figure(figsize=(5,5))
    sns.heatmap(building_meta_df[num_cols2].dropna(inplace=False).corr(),annot=True)
    
    
    # In[21]:
    
    
    # Plot the distribution of primary_use
    plt.figure(figsize = (15,8))
    data = building_meta_df['primary_use'].value_counts()  # Series
    sns.barplot(data.index,data)
    
    
    # # <a id='5'>5. Feature engineering</a> (<a href='#9'>back</a>)
    
    # In[22]:
    
    
    def fill_weather_dataset(weather_df):
        
        # From the min and max timestamps, compute how many hours the data should span
        time_format = "%Y-%m-%d %H:%M:%S"
        start_date = datetime.datetime.strptime(weather_df['timestamp'].min(),time_format)
        end_date = datetime.datetime.strptime(weather_df['timestamp'].max(),time_format)
        total_hours = int(((end_date - start_date).total_seconds() + 3600) / 3600)
        hours_list = [(end_date - datetime.timedelta(hours=x)).strftime(time_format) for x in range(total_hours)]
        
        # Add rows for the hours missing from each site
        missing_hours = []
        for site_id in range(16):
            site_hours = np.array(weather_df[weather_df['site_id'] == site_id]['timestamp'])
            new_rows = pd.DataFrame(np.setdiff1d(hours_list,site_hours),columns=['timestamp'])
            new_rows['site_id'] = site_id
            weather_df = pd.concat([weather_df,new_rows])
    
            weather_df = weather_df.reset_index(drop=True)           
    
        # Add day, week and month features
        weather_df["datetime"] = pd.to_datetime(weather_df["timestamp"])
        weather_df["day"] = weather_df["datetime"].dt.day
        weather_df["week"] = weather_df["datetime"].dt.week
        weather_df["month"] = weather_df["datetime"].dt.month
        
        # Re-index by (site_id, day, month)
        weather_df = weather_df.set_index(['site_id','day','month'])
        
        # Fill missing air temperature with the (site_id, day, month) mean
        air_temperature_filler = pd.DataFrame(weather_df.groupby(['site_id','day','month'])['air_temperature'].mean(),columns=["air_temperature"])
        weather_df.update(air_temperature_filler,overwrite=False)
    
        # Fill missing cloud coverage the same way, forward-filling any remaining gaps
        cloud_coverage_filler = weather_df.groupby(['site_id','day','month'])['cloud_coverage'].mean()
        cloud_coverage_filler = pd.DataFrame(cloud_coverage_filler.fillna(method='ffill'),columns=["cloud_coverage"])
        weather_df.update(cloud_coverage_filler,overwrite=False)
        
        # Fill missing dew temperature with the (site_id, day, month) mean
        due_temperature_filler = pd.DataFrame(weather_df.groupby(['site_id','day','month'])['dew_temperature'].mean(),columns=["dew_temperature"])
        weather_df.update(due_temperature_filler,overwrite=False)
    
        # Fill missing sea-level pressure, forward-filling any remaining gaps
        sea_level_filler = weather_df.groupby(['site_id','day','month'])['sea_level_pressure'].mean()
        sea_level_filler = pd.DataFrame(sea_level_filler.fillna(method='ffill'),columns=['sea_level_pressure'])
        weather_df.update(sea_level_filler,overwrite=False)
        
        # Fill missing wind direction
        wind_direction_filler =  pd.DataFrame(weather_df.groupby(['site_id','day','month'])['wind_direction'].mean(),columns=['wind_direction'])
        weather_df.update(wind_direction_filler,overwrite=False)
        
        # Fill missing wind speed
        wind_speed_filler =  pd.DataFrame(weather_df.groupby(['site_id','day','month'])['wind_speed'].mean(),columns=['wind_speed'])
        weather_df.update(wind_speed_filler,overwrite=False)
    
        # Fill missing hourly precipitation depth, forward-filling any remaining gaps
        precip_depth_filler = weather_df.groupby(['site_id','day','month'])['precip_depth_1_hr'].mean()
        precip_depth_filler = pd.DataFrame(precip_depth_filler.fillna(method='ffill'),columns=['precip_depth_1_hr'])
        weather_df.update(precip_depth_filler,overwrite=False)
        
        # Drop the helper columns
        weather_df = weather_df.reset_index()
        weather_df = weather_df.drop(['datetime','day','week','month'],axis=1)
            
        return weather_df
    
    
    # In[23]:
    
    
    # Add rolling lag features (window=24 gives the mean air temperature over the past 24 hours)
    def add_lag_feature(weather_df, window=3):
        group_df = weather_df.groupby(['site_id','building_id'])
        cols = ['air_temperature'] 
        rolled = group_df[cols].rolling(window=window, min_periods=0)
        lag_mean = rolled.mean().reset_index().astype(np.float16)
    #     lag_std = rolled.std().reset_index().astype(np.float16)
        for col in cols:
            weather_df[f'{col}_mean_lag{window}'] = lag_mean[col]
    #         weather_df[f'{col}_std_lag{window}'] = lag_std[col]
    
    # Frequency (count) encoding of categorical columns
    def encode_FE(df,cols):
        for col in cols:
            vc = df[col].value_counts(dropna=True, normalize=True).to_dict()
            vc[-1] = -1
            nm = col+'_FE'
            df[nm] = df[col].map(vc)
            df[nm] = df[nm].astype('float16')
            print(nm,', ',end='')
            
    # Main feature engineering
    def features_engineering(df,categorical_features):
        
        # Sort by timestamp (assign back: sort_values and reset_index are not in-place)
        df = df.sort_values("timestamp")
        df = df.reset_index(drop=True)
        
        # Time features (note: the 'weekend' column actually holds the weekday number)
        df["timestamp"] = pd.to_datetime(df["timestamp"],format="%Y-%m-%d %H:%M:%S")
        df["hour"] = df["timestamp"].dt.hour
        df["weekend"] = df["timestamp"].dt.weekday
        
        # US-holiday feature (as written this only matches the midnight row of each holiday, since timestamps carry the hour)
        holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
                        "2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-26",
                        "2017-01-02", "2017-01-16", "2017-02-20", "2017-05-29", "2017-07-04",
                        "2017-09-04", "2017-10-09", "2017-11-10", "2017-11-23", "2017-12-25",
                        "2018-01-01", "2018-01-15", "2018-02-19", "2018-05-28", "2018-07-04",
                        "2018-09-03", "2018-10-08", "2018-11-12", "2018-11-22", "2018-12-25",
                        "2019-01-01"]
        df["is_holiday"] = (df.timestamp.isin(holidays)).astype(int)
        
        # Log-transform the floor area
        df['square_feet'] =  np.log1p(df['square_feet'])
        # Frequency-encode primary_use
        encode_FE(df,['primary_use'])
        
        # Drop features we will not use
        drop = ["timestamp","sea_level_pressure", "wind_direction", "wind_speed","year_built","floor_count"]
        df = df.drop(drop, axis=1)
        gc.collect()
        
        # Label encode
        for c in categorical_features:
            le = LabelEncoder()
            df[c] = le.fit_transform(df[c])
        add_lag_feature(df,24)
        return df
    
    
    # In[26]:
    
    
    # Some buildings have bad readings for a stretch of time; drop those known outliers
    train_df = train_df [ train_df['building_id'] != 1099 ]
    train_df = train_df.query('not (building_id <= 104 & meter == 0 & timestamp <= "2016-05-20")')
    weather_train_df = fill_weather_dataset(weather_train_df)
    train_df = train_df.merge(building_meta_df, left_on='building_id',right_on='building_id',how='left')
    train_df = train_df.merge(weather_train_df,how='left',left_on=['site_id','timestamp'],right_on=['site_id','timestamp'])
    train_df = features_engineering(train_df,['primary_use','primary_use_FE'])
    train_df.head(10)
    
    # weather_train_df = reduce_mem_usage(weather_train_df)
    # weather_test_df = reduce_mem_usage(weather_test_df)
    train_df = reduce_mem_usage(train_df)
    train_df.to_pickle("../input/train.pickle")  # to_pickle takes no index argument
    del train_df, weather_train_df
    gc.collect()
    
    row_ids = test_df["row_id"]
    test_df.drop("row_id", axis=1, inplace=True)
    weather_test_df = fill_weather_dataset(weather_test_df)
    test_df = test_df.merge(building_meta_df, left_on='building_id',right_on='building_id',how='left')
    test_df = test_df.merge(weather_test_df,how='left',left_on=['site_id','timestamp'],right_on=['site_id','timestamp'])
    test_df = features_engineering(test_df,['primary_use','primary_use_FE'])
    test_df.head(10)
    test_df = reduce_mem_usage(test_df)
    test_df.to_pickle("../input/test.pickle")  # to_pickle takes no index argument
    del test_df, weather_test_df
    gc.collect()
    
    
    # # <a id='6'>6. Training a regression model</a> (<a href='#9'>back</a>)
    
    # In[ ]:
    
    
    train_df = pd.read_pickle("../input/train.pickle")  # reload: train_df was deleted above after pickling
    target = np.log1p(train_df["meter_reading"])
    features = train_df.drop(['meter_reading'], axis = 1)
    categorical_features = ["building_id", "site_id", "meter", "primary_use", "weekend","primary_use_FE"]
    selected_features = ['building_id',
     'meter',
     'site_id',
     'primary_use',
     'square_feet',
     'primary_use_FE',
     'air_temperature',
     'cloud_coverage',
     'dew_temperature',
     'hour',
     'weekend',
     'air_temperature_mean_lag24']
    
    
    # In[ ]:
    
    
    params = {
        "objective": "regression",
        "boosting": "gbdt",
        "num_leaves": 1280,
        "learning_rate": 0.01,
        "feature_fraction": 0.85,
        "reg_lambda": 2,
        "metric": "rmse",
        "random_seed":10
    }
    kf = KFold(n_splits=3)
    
    
    # In[ ]:
    
    
    models = []
    history = []
    for train_index,test_index in kf.split(features):
        train_features = features.loc[train_index][selected_features]
        train_target = target.loc[train_index]
        
        test_features = features.loc[test_index][selected_features]
        test_target = target.loc[test_index]
        
        d_training = lgb.Dataset(train_features, label=train_target,categorical_feature=categorical_features, free_raw_data=False)
        d_test = lgb.Dataset(test_features, label=test_target,categorical_feature=categorical_features, free_raw_data=False)
        
        model = lgb.train(params, train_set=d_training, num_boost_round=1000,valid_sets=[d_training,d_test], verbose_eval=25, early_stopping_rounds=50)
        models.append(model)
        del train_features, train_target, test_features, test_target, d_training, d_test
        gc.collect()
    
    
    # In[ ]:
    
    
    cv_scores = np.mean([model.best_score['valid_1']['rmse'] for model in models])
    cv_scores
    
    
    # # <a  id='7'>7. Plotting feature importance</a> (<a href='#9'>back</a>)
    
    # In[ ]:
    
    
    for model in models:
        lgb.plot_importance(model)
        plt.show()
    
    
    # In[ ]:
    
    
  • A study of the top solutions to ASHRAE - Great Energy Predictor III, including a Python implementation of Bayesian target encoding (GaussianTargetEncoder with fit/transform over train.csv and test.csv)...


    1 Overview

    First, the overview figure from the 1st-place analysis (figure not reproduced here).

    2 Lessons from the top solutions

    2.1 Removing outliers

    Two kinds of anomalies were targeted:

    1. Long streaks of constant values
    2. Large positive/negative spikes

    They validated potential anomalies against all the buildings in the data: if an anomaly appeared in several buildings at the same time, they could be reasonably sure it was a genuine anomaly.

    Takeaway: verify a suspected outlier from several angles before treating it as real; a sketch of the cross-building check follows.
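    A sketch of that cross-building check, assuming the competition's train.csv and building_metadata.csv layouts and an illustrative 50% threshold:

    import pandas as pd

    train = pd.read_csv("train.csv")
    meta = pd.read_csv("building_metadata.csv")

    # share of a site's electricity meters (meter == 0) reading zero at each timestamp
    elec = train[train["meter"] == 0].merge(meta[["building_id", "site_id"]], on="building_id")
    zero_share = (elec.assign(is_zero=elec["meter_reading"] == 0)
                      .groupby(["site_id", "timestamp"])["is_zero"].mean())
    suspect_hours = zero_share[zero_share > 0.5].index   # likely site-wide outages, not real usage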

    2.2 Missing values

    The weather metadata is missing many values. They found that imputing the gaps with linear interpolation helped their models; a minimal sketch is below.
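    A minimal sketch of that imputation, assuming the standard weather_train.csv columns: interpolate each weather variable linearly within a site along time.

    import pandas as pd

    weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])
    weather = weather.sort_values(["site_id", "timestamp"])
    cols = ["air_temperature", "dew_temperature", "cloud_coverage",
            "sea_level_pressure", "wind_speed"]
    # linear interpolation inside each site; limit_direction="both" also fills the ends
    weather[cols] = (weather.groupby("site_id")[cols]
                            .transform(lambda s: s.interpolate(limit_direction="both")))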

    2.3 Target variable

    Most competitors predicted log1p(meter_reading); unusually, this team predicted log1p(meter_reading)/square_feet instead.
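    The two target definitions side by side, as a toy sketch (values invented; the winners' exact scaling may differ in detail):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"meter_reading": [0.0, 120.0, 3500.0],
                       "square_feet":   [8000, 8000, 50000]})

    df["target"] = np.log1p(df["meter_reading"])            # what most competitors used
    df["target_sqft"] = df["target"] / df["square_feet"]    # the per-square-foot variant

    # inverting a model's prediction of target_sqft back to a meter reading
    pred = df["target_sqft"]                                # stand-in for model output
    meter_pred = np.expm1(pred * df["square_feet"])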

    2.4 Feature engineering

    1. Categorical interactions, e.g. concatenating building_id and meter into a new categorical feature building_id_meter
    2. Count frequency of features
    3. Smoothed and 1st- and 2nd-order differentiated temperature features using a Savitzky-Golay filter
    4. Cyclic encoding of periodic features: hour maps to hour_x = cos(2π·hour/24) and hour_y = sin(2π·hour/24); a neat trick for cyclic features (runnable sketch after this list)
    5. Bayesian target encoding, a target encoding scheme the author wrote themselves; detailed below
    6. The 3rd-place approach: lacking time, the author only removed some outliers, dropping readings that were 0 across the same site at the same time plus the largest spikes
    7. Temperature lags, mentioned by several high-placing teams
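    A runnable sketch of items 1 and 4 (toy DataFrame; column names follow this competition's schema):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"hour": range(24), "building_id": 1, "meter": 0})

    # item 4: cyclic encoding -- hour 23 and hour 0 end up close together
    df["hour_x"] = np.cos(2 * np.pi * df["hour"] / 24)
    df["hour_y"] = np.sin(2 * np.pi * df["hour"] / 24)

    # item 1: categorical interaction by concatenation
    df["building_id_meter"] = df["building_id"].astype(str) + "_" + df["meter"].astype(str)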

    2.4.1 Savitzky-Golay filter

    • Savitzky-Golay convolution smoothing is a refinement of moving-average smoothing.
    • The key to Savitzky-Golay smoothing is solving for the convolution matrix operator.
      Summary: first compute the coefficient matrix B, then compute the fitted Y via matrix operations. It shouldn't be hard; I'll post code when I get around to reproducing it.
      Back to the competition and their processing results (figure not reproduced here):
    • In the first plot, the blue line is the raw data
    • In the first plot, the yellow line is the Savitzky-Golay smoothed data
    • In the second plot, the blue line is the 1st derivative of the smoothed data
    • In the second plot, the yellow line is the 2nd derivative of the smoothed data
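    A hedged sketch of item 3 using SciPy's implementation rather than the hand-rolled matrix math (synthetic temperature series; window length and polynomial order are illustrative):

    import numpy as np
    from scipy.signal import savgol_filter

    rng = np.random.default_rng(0)
    temps = 20 + 5 * np.sin(np.linspace(0, 8 * np.pi, 200)) + rng.normal(0, 0.5, 200)

    smooth = savgol_filter(temps, window_length=11, polyorder=3)        # smoothed series
    d1 = savgol_filter(temps, window_length=11, polyorder=3, deriv=1)   # 1st derivative
    d2 = savgol_filter(temps, window_length=11, polyorder=3, deriv=2)   # 2nd derivative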

    2.4.2 Bayesian target encoding (Python implementation)

    import gc
    import numpy as np
    import pandas as pd 
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import mean_squared_error
    
    # rmsle() is called in the evaluation loop below but was missing from the
    # original snippet; a minimal definition consistent with its usage:
    def rmsle(y_pred, y_true):
        return np.sqrt(mean_squared_error(np.log1p(y_pred), np.log1p(y_true)))
    
    PRIOR_PRECISION = 10
    class GaussianTargetEncoder():
            
        def __init__(self, group_cols, target_col="target", prior_cols=None):
            self.group_cols = group_cols
            self.target_col = target_col
            self.prior_cols = prior_cols
    
        def _get_prior(self, df):
            if self.prior_cols is None:
                prior = np.full(len(df), df[self.target_col].mean())
            else:
                prior = df[self.prior_cols].mean(1)
            return prior
                        
        def fit(self, df):
            self.stats = df.assign(mu_prior=self._get_prior(df), y=df[self.target_col])
            self.stats = self.stats.groupby(self.group_cols).agg(
                n        = ("y", "count"),
                mu_mle   = ("y", np.mean),
                sig2_mle = ("y", np.var),
                mu_prior = ("mu_prior", np.mean),
            )        
        
        def transform(self, df, prior_precision=1000, stat_type="mean"):
            
            precision = prior_precision + self.stats.n/self.stats.sig2_mle
            
            if stat_type == "mean":
                numer = prior_precision*self.stats.mu_prior\
                        + self.stats.n/self.stats.sig2_mle*self.stats.mu_mle
                denom = precision
            elif stat_type == "var":
                numer = 1.0
                denom = precision
            elif stat_type == "precision":
                numer = precision
                denom = 1.0
            else: 
                raise ValueError(f"stat_type={stat_type} not recognized.")
            
            mapper = dict(zip(self.stats.index, numer / denom))
            if isinstance(self.group_cols, str):
                keys = df[self.group_cols].values.tolist()
            elif len(self.group_cols) == 1:
                keys = df[self.group_cols[0]].values.tolist()
            else:
                keys = zip(*[df[x] for x in self.group_cols])
            
            values = np.array([mapper.get(k) for k in keys]).astype(float)
            
            prior = self._get_prior(df)
            values[~np.isfinite(values)] = prior[~np.isfinite(values)]
            
            return values
        
        def fit_transform(self, df, *args, **kwargs):
            self.fit(df)
            return self.transform(df, *args, **kwargs)
    
    # load data
    train = pd.read_csv("/kaggle/input/ashrae-energy-prediction/train.csv")
    test  = pd.read_csv("/kaggle/input/ashrae-energy-prediction/test.csv")
    
    # create target
    train["target"] = np.log1p(train.meter_reading)
    test["target"] = train.target.mean()
    
    # create time features
    def add_time_features(df):
        df.timestamp = pd.to_datetime(df.timestamp)    
        df["hour"]    = df.timestamp.dt.hour
        df["weekday"] = df.timestamp.dt.weekday
        df["month"]   = df.timestamp.dt.month
    
    add_time_features(train)
    add_time_features(test)
    
    # define groupings and corresponding priors
    groups_and_priors = {
        
        # single encodings
        ("hour",):        None,
        ("weekday",):     None,
        ("month",):       None,
        ("building_id",): None,
        ("meter",):       None,
        
        # second-order interactions
        ("meter", "hour"):        ["gte_meter", "gte_hour"],
        ("meter", "weekday"):     ["gte_meter", "gte_weekday"],
        ("meter", "month"):       ["gte_meter", "gte_month"],
        ("meter", "building_id"): ["gte_meter", "gte_building_id"],
            
        # higher-order interactions
        ("meter", "building_id", "hour"):    ["gte_meter_building_id", "gte_meter_hour"],
        ("meter", "building_id", "weekday"): ["gte_meter_building_id", "gte_meter_weekday"],
        ("meter", "building_id", "month"):   ["gte_meter_building_id", "gte_meter_month"],
    }
    
    features = []
    for group_cols, prior_cols in groups_and_priors.items():
        features.append(f"gte_{'_'.join(group_cols)}")
        gte = GaussianTargetEncoder(list(group_cols), "target", prior_cols)    
        train[features[-1]] = gte.fit_transform(train, PRIOR_PRECISION)
        test[features[-1]]  = gte.transform(test,  PRIOR_PRECISION)
    
    train_preds = np.zeros(len(train))
    test_preds = np.zeros(len(test))
    
    for m in range(4):
        
        print(f"Meter {m}", end="") 
        
        # instantiate model
        model = RidgeCV(
            alphas=np.logspace(-10, 1, 25), 
            normalize=True,
        )    
        
        # fit model
        model.fit(
            X=train.loc[train.meter==m, features].values, 
            y=train.loc[train.meter==m, "target"].values
        )
    
        # make predictions 
        train_preds[train.meter==m] = model.predict(train.loc[train.meter==m, features].values)
        test_preds[test.meter==m]   = model.predict(test.loc[test.meter==m, features].values)
        
        # transform predictions
        train_preds[train_preds < 0] = 0
        train_preds[train.meter==m] = np.expm1(train_preds[train.meter==m])
        
        test_preds[test_preds < 0] = 0 
        test_preds[test.meter==m] = np.expm1(test_preds[test.meter==m])
        
        # evaluate model
        meter_rmsle = rmsle(
            train_preds[train.meter==m],
            train.loc[train.meter==m, "meter_reading"].values
        )
        
        print(f", rmsle={meter_rmsle:0.5f}")
    
    print(f"Overall rmsle={rmsle(train_preds, train.meter_reading.values):0.5f}")
    del train, train_preds, test
    gc.collect()
    

    2.5 Model ensembling

    • The 2nd-place view: Due to the size of the dataset and difficulty in setting up a robust validation framework, we did not focus much on feature engineering, fearing it might not extrapolate cleanly to the test data. Instead we chose to ensemble as many different models as possible to capture more information and help the predictions to be stable across years.
      Because of the dataset's size and the difficulty of building a robust validation framework, they worried engineered features might not extrapolate cleanly to the test data, so they did not focus much on feature engineering. Instead they ensembled as many different models as possible to capture more information and keep predictions stable across years.
      In their past experience, building good features without a reliable validation framework is very tricky.
    • The 2nd-place view: We bagged a bunch of boosting models XGB, LGBM, CB at various levels of data: Models for every site+meter, models for every building+meter, models for every building-type+meter and models using entire train data. It was very useful to build a separate model for each site so that the model could capture site-specific patterns and each site could be fitted with a different parameter set suitable for it. It also automatically solved for issues like timestamp alignment and feature measurement scale being different across sites so we didn't have to solve for them separately.
      A separate model for every slice: in total they built over 5000 models for this competition and blended them.
    • The 3rd-place view: Given diverse experiments with different CV schemes I did over the period of the competition, I decided to simply combine all the results (over 30), I got into a single submission using a simple average after selection by pearson correlation (6th on private LB).
      Short on time, the author simply blended all of their results, over 30 of them, with a simple average, using the Pearson correlation coefficient to choose which results to include.
    • It has to be said: even if LightGBM outperforms XGBoost and CatBoost, all three get used in competitions, presumably because they extract different information. Some competitors also use CNNs and FFNNs.

    2.5.1 Pearson correlation (with Python usage)

    The Pearson correlation coefficient measures whether two datasets lie along a line; it quantifies the linear relationship between interval-scale variables.

    # How to use
    from scipy.stats.stats import pearsonr
    pearsonr(x, y)
    
    # To view the full documentation
    from pydoc import help
    from scipy.stats.stats import pearsonr
    help(pearsonr)
    
    >>>
    Help on function pearsonr in module scipy.stats.stats:
    
    pearsonr(x, y)
     Calculates a Pearson correlation coefficient and the p-value for testing
     non-correlation.
    
     The Pearson correlation coefficient measures the linear relationship
     between two datasets. Strictly speaking, Pearson's correlation requires
     that each dataset be normally distributed. Like other correlation
     coefficients, this one varies between -1 and +1 with 0 implying no
     correlation. Correlations of -1 or +1 imply an exact linear
     relationship. Positive correlations imply that as x increases, so does
     y. Negative correlations imply that as x increases, y decreases.
    
     The p-value roughly indicates the probability of an uncorrelated system
     producing datasets that have a Pearson correlation at least as extreme
     as the one computed from these datasets. The p-values are not entirely
     reliable but are probably reasonable for datasets larger than 500 or so.
    
     Parameters
     ----------
     x : 1D array
     y : 1D array the same length as x
    
     Returns
     -------
     (Pearson's correlation coefficient,
      2-tailed p-value)
    
     References
     ----------
     http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
    
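    To make the selection idea from section 2.5 concrete, a toy sketch (random vectors standing in for submissions; the 0.95 cutoff is illustrative, not from the writeup):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    subs = [rng.random(1000) for _ in range(5)]   # stand-ins for submission vectors

    blend = [subs[0]]
    for s in subs[1:]:
        # keep a candidate only if it is not too correlated with the current blend
        if max(pearsonr(s, b)[0] for b in blend) < 0.95:
            blend.append(s)
    final = np.mean(blend, axis=0)                # simple average, as 3rd place did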

    2.6 Why does postprocessing work? 2nd place magic


  • Currently 3400+ people are participating and I'm ranked 55th. Three days left. Hoping to sprint to a gold medal. After it ends I'll write up the problems I hit and my solutions, from the data through algorithm improvements and model tuning, together with my source code. ...

    Currently 3400+ people are participating and I'm ranked 55th. Three days left.
    Hoping to sprint to a gold medal.
    After the competition ends I'll write up the problems I ran into and my solutions in detail, from introducing the data to algorithm improvements and model tuning, along with my source code.

  • The following repository contains several notebooks for working with the ASHRAE data in the Great Energy Predictor III competition on Kaggle. We compare three models, where the third is framed slightly differently from the other two. Their network natures differ; the first two...
  • House-price prediction is a competition on the Kaggle site and something of an entry-level machine learning project. Kaggle link: link. For the workflow of a Kaggle competition project, see this blog: link. 1. About Kaggle: Kaggle mainly provides developers and data scientists with hosted machine learning...
  • Kaggle competition: sales data prediction

    1,000+ views 2019-08-05 19:02:58
    Back when I was in Guangzhou I saw a Kaggle competition to predict next month's sales from historical sales data, about 2.93 million rows. I wanted to try it at the time, but work kept me too busy to focus on it; it's now been 3 months since I left my job, so taking advantage of...
  • Feature extraction: weather-type features; time of day; day of year; day of week; and some historical data, such as readings from 48/72 hours earlier
  • Kaggle competition project: Rossmann sales prediction, top 1%

    1,000+ views, popular discussion 2018-07-10 16:39:02
    # ## Rossmann sales prediction # 1. Data analysis # In[1]: # import the required libraries import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt get_ipython().run_line_magic('matplotlib.....
  • Forecasting future data is very important in the energy sector, since information about future consumption and generation trends helps plan power-plant operations. This analysis compares various time-series forecasting models to determine which works best. Data from Kaggle: ://...
  • This was a Kaggle competition in which our approach ranked 67/2256. Project goal: find a mathematical model to improve the effectiveness of new-restaurant investment, allowing TFI to invest more in other important business areas such as sustainability, innovation, and new staff...
  • Kaggle machine learning in practice

    2018-06-27 20:29:21
    Overview of machine learning algorithms, tools and workflows; applications in economics and finance; ranking and CTR estimation; natural language processing problems; energy prediction and allocation; on to deep learning; recommendation and sales-forecasting problems; financial risk control
  • Kaggle was founded in Melbourne in 2010 by co-founder and CEO Anthony Goldbloom, mainly to provide developers and data scientists with hosted... energy prediction and allocation; on to deep learning; recommendation and sales-forecasting problems; financial risk control
  • Kaggle case studies in practice

    1,000+ views 2018-09-18 09:15:51
    Lesson 1: the general machine-learning workflow through classic Kaggle cases. Topic 1: data competitions and the feature engineering / model tuning workflow with sklearn and xgboost. Hands-on project: Titanic disaster (classification). Hands-on project: bike-rental demand prediction (regression). Lesson 2: economics and finance...
  • An introduction to Kaggle

    2018-11-21 15:04:37
    A year has passed, and Kaggle's competition format and points system have changed somewhat, but this article still describes an effective way to get started with Kaggle or any other data science project. This article is released under the CC BY-NC-ND 3.0 China license...
  • A beginner's guide to Kaggle

    2017-12-27 20:08:29
    This is the summary I wrote last April after finishing my first Kaggle competition with a top-5% result. The English version was retweeted and recommended by Kaggle's official Twitter account at the time. A year has passed, and Kaggle's competition format and points system have changed somewhat, but what this article describes is still...
  • Energy prediction III (top 36%): in this competition, using ASHRAE data, develop accurate models of metered building energy use in the following areas: chilled water, electricity, hot water, and steam meters, with data from more than 1000 buildings over three years. With better estimates of these energy-saving investments, large investors and...
