
    Machine Learning Prediction Models

    Data Science, Machine Learning

    Preface

    Cardiovascular diseases are diseases of the heart and blood vessels and they typically include heart attacks, strokes, and heart failures [1]. According to the World Health Organization (WHO), cardiovascular diseases like ischaemic heart disease and stroke have been the leading causes of deaths worldwide for the last decade and a half [2].

    Motivation

    A few months ago, a new heart failure dataset was uploaded on Kaggle. This dataset contained health records of 299 anonymized patients and had 12 clinical and lifestyle features. The task was to predict heart failure using these features.

    Through this post, I aim to document my workflow on this task and present it as a research exercise. So this would naturally involve a bit of domain knowledge, references to journal papers, and deriving insights from them.

    Warning: This post is nearly 10 minutes long and things may get a little dense as you scroll down, but I encourage you to give it a shot.

    About the data

    The dataset was originally released by Ahmed et al., in 2017 [3] as a supplement to their analysis of survival of heart failure patients at Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad, Pakistan. The dataset was subsequently accessed and analyzed by Chicco and Jurman in 2020 to predict heart failures using a bunch of machine learning techniques [4]. The dataset hosted on Kaggle cites these authors and their research paper.

    The dataset primarily consists of clinical and lifestyle features of 105 female and 194 male heart failure patients. You can find each feature explained in the figure below.

    Fig. 1 — Clinical and lifestyle features of 299 patients in the dataset (credit: author)

    Project Workflow

    The workflow would be pretty straightforward —

    1. Data Preprocessing — Cleaning the data, imputing missing values, creating new features if needed, etc.

    2. Exploratory Data Analysis — This would involve summary statistics, plotting relationships, mapping trends, etc.

    3. Model Building — Building a baseline prediction model, followed by at least 2 classification models to train and test.

    4. Hyper-parameter Tuning — Fine-tune the hyper-parameters of each model to arrive at acceptable levels of prediction metrics.

    5. Consolidating Results — Presenting relevant findings in a clear and concise manner.

    The entire project can be found as a Jupyter notebook on my GitHub repository.

    Let's begin!

    Data Preprocessing

    Let’s read in the .csv file into a dataframe —

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

    df.info() is a quick way to get a summary of the dataframe data types. We see that the dataset has no missing or spurious values and is clean enough to begin data exploration.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 299 entries, 0 to 298
    Data columns (total 13 columns):
    # Column Non-Null Count Dtype
    --- ------ -------------- -----
    0 age 299 non-null float64
    1 anaemia 299 non-null int64
    2 creatinine_phosphokinase 299 non-null int64
    3 diabetes 299 non-null int64
    4 ejection_fraction 299 non-null int64
    5 high_blood_pressure 299 non-null int64
    6 platelets 299 non-null float64
    7 serum_creatinine 299 non-null float64
    8 serum_sodium 299 non-null int64
    9 sex 299 non-null int64
    10 smoking 299 non-null int64
    11 time 299 non-null int64
    12 DEATH_EVENT 299 non-null int64
    dtypes: float64(3), int64(10)
    memory usage: 30.5 KB

    But before that, let us rearrange and rename some of the features, add another feature called chk (which will be useful later during EDA), and replace the binary values in the categorical features with their labels (again, useful during EDA).

    df = df.rename(columns={'smoking': 'smk',
                            'diabetes': 'dia',
                            'anaemia': 'anm',
                            'platelets': 'plt',
                            'high_blood_pressure': 'hbp',
                            'creatinine_phosphokinase': 'cpk',
                            'ejection_fraction': 'ejf',
                            'serum_creatinine': 'scr',
                            'serum_sodium': 'sna',
                            'DEATH_EVENT': 'death'})

    df['chk'] = 1

    df['sex'] = df['sex'].apply(lambda x: 'Female' if x==0 else 'Male')
    df['smk'] = df['smk'].apply(lambda x: 'No' if x==0 else 'Yes')
    df['dia'] = df['dia'].apply(lambda x: 'No' if x==0 else 'Yes')
    df['anm'] = df['anm'].apply(lambda x: 'No' if x==0 else 'Yes')
    df['hbp'] = df['hbp'].apply(lambda x: 'No' if x==0 else 'Yes')
    df['death'] = df['death'].apply(lambda x: 'No' if x==0 else 'Yes')

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 299 entries, 0 to 298
    Data columns (total 14 columns):
    # Column Non-Null Count Dtype
    --- ------ -------------- -----
    0 sex 299 non-null object
    1 age 299 non-null float64
    2 smk 299 non-null object
    3 dia 299 non-null object
    4 hbp 299 non-null object
    5 anm 299 non-null object
    6 plt 299 non-null float64
    7 ejf 299 non-null int64
    8 cpk 299 non-null int64
    9 scr 299 non-null float64
    10 sna 299 non-null int64
    11 time 299 non-null int64
    12 death 299 non-null object
    13 chk 299 non-null int64
    dtypes: float64(3), int64(5), object(6)
    memory usage: 32.8+ KB

    We observe that sex, dia, anm, hbp, smk and death are categorical features (object), while age, plt, cpk, ejf, scr, time and sna are numerical features (int64 or float64). All features except death would be potential predictors, and death would be the target for our prospective ML model.

    Exploratory Data Analysis

    A. Summary Statistics of Numerical Features

    Since our dataset has many numerical features, it would be helpful to look at some aggregate measures of the data in hand with the help of df.describe(). (Usually, this method gives values up to 6 decimal places, so it is better to round them off to two with df.describe().round(2).)

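    The summary table itself was an image in the original; a one-line sketch that reproduces the rounded table is:

    df.describe().round(2)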
    • Age: We can see that the average age of the patients is 60 years, with most of the patients (<75%) below 70 years and above 40 years. The follow-up time after their heart failure also varies from 4 days to 285 days, with an average of 130 days.

    • Platelets: These are a type of blood cells that are responsible for repairing damaged blood vessels. A normal person has a platelet count of 150,000–400,000 kiloplatelets/mL of blood [5]. In our dataset, 75% of the patients have a platelet count well within this range.

    • Ejection fraction: This is a measure (in %) of how much blood is pumped out of a ventricle in each contraction. To brush up a little human anatomy — the heart has 4 chambers, of which the atria receive blood from different parts of the body and the ventricles pump it back out. The left ventricle is the thickest chamber and pumps blood to the rest of the body, while the right ventricle pumps blood to the lungs. In a healthy adult, this fraction is 55%, and heart failure with reduced ejection fraction implies a value < 40% [6]. In our dataset, 75% of the patients have this value < 45%, which is expected because they are all heart failure patients in the first place.

    • Creatinine Phosphokinase: This is an enzyme that is present in the blood and helps in repairing damaged tissues. A high level of CPK implies heart failure or injury. The normal levels in males are 55–170 mcg/L and in females are 30–135 mcg/L [7]. In our dataset, since all patients have had heart failure, the average value (550 mcg/L) and median (250 mcg/L) are higher than normal.

    • Serum creatinine: This is a waste product that is produced as a part of muscle metabolism, especially during muscle breakdown. This creatinine is filtered by the kidneys, and increased levels are indicative of poor cardiac output and possible renal failure [8]. The normal levels are between 0.84 and 1.21 mg/dL [9], and in our dataset the average and median are above 1.10 mg/dL, which is pretty close to the upper limit of the normal range.

    • Serum sodium: This refers to the level of sodium in the blood; a low level of < 135 mEq/L is called hyponatremia, which is considered typical in heart failure patients [10]. In our dataset, we find that the average and the median are > 135 mEq/L.

    A neat way to visualize these statistics is with a boxenplot, which shows the spread and distribution of values (the line in the center is the median and the diamonds at the end are the outliers).

    fig, ax = plt.subplots(3, 2, figsize=[10,10])
    num_features_set1 = ['age', 'scr', 'sna']
    num_features_set2 = ['plt', 'ejf', 'cpk']
    for i in range(0,3):
        sns.boxenplot(df[num_features_set1[i]], ax=ax[i,0], color='steelblue')
        sns.boxenplot(df[num_features_set2[i]], ax=ax[i,1], color='steelblue')
    Fig. 2 — Visualising the summary statistics for numerical features of the dataset

    B. Summary Statistics of Categorical Features

    The number of patients belonging to each of the lifestyle categorical features can be summarised with a simple bar plot.

    fig = plt.subplots(figsize=[10,6])

    bar1 = df.smk.value_counts().values
    bar2 = df.hbp.value_counts().values
    bar3 = df.dia.value_counts().values
    bar4 = df.anm.value_counts().values

    ticks = np.arange(0, 3, 2)
    width = 0.3
    plt.bar(ticks, bar1, width=width, color='teal', label='smoker')
    plt.bar(ticks+width, bar2, width=width, color='darkorange', label='high blood pressure')
    plt.bar(ticks+2*width, bar3, width=width, color='limegreen', label='diabetes')
    plt.bar(ticks+3*width, bar4, width=width, color='tomato', label='anaemic')

    plt.xticks(ticks+1.5*width, ['Yes', 'No'])
    plt.ylabel('Number of patients')
    plt.legend()
    Fig. 3 — Total number of patients in each lifestyle categorical feature

    Additional summaries can be generated using the crosstab function in pandas. An example is shown for the categorical feature smk. The results can be normalized with respect to either the total number of smokers ('index') or the total number of deaths ('columns'). Since our interest is in predicting survival, we normalize with respect to death.

    pd.crosstab(index=df['smk'], columns=df['death'], values=df['chk'],
                aggfunc=np.sum, margins=True)

    pd.crosstab(index=df['smk'], columns=df['death'], values=df['chk'],
                aggfunc=np.sum, margins=True, normalize='columns').round(2)*100

    We see that 68% of all heart failure patients did not smoke while 32% did. Of those who died, 69% were non-smokers while 31% were smokers. Of those who survived, 67% were non-smokers and 33% were smokers. At this point, it is difficult to say, conclusively, that heart failure patients who smoked have a greater chance of dying.

    In a similar manner, let’s summarise the rest of the categorical features and normalize the results with respect to deaths.

    • 65% of the Male and 35% of the Female heart patients died.

    • 48% of the patients who died were anemic while 41% of the patients who survived were anemic as well.

    • 42% of the patients who died and 42% who survived were diabetic.

    • 31% of the dead were smokers while 33% of the survivors were smokers.

    • 41% of those who died had high blood pressure, while 33% of those who survived had high blood pressure as well.

    Based on these statistics, we get a rough idea that the lifestyle features are almost similarly distributed amongst those who died and those who survived. The difference is the greatest in the case of high blood pressure, which could perhaps have a greater influence on the survival of heart patients.

    C. Exploring relationships between numerical features

    The next step is to visualize the relationships between features. We start with the numerical features, using a single line of code to create a pair-wise plot of the features with seaborn's pairplot —

    sns.pairplot(df[['plt', 'ejf', 'cpk', 'scr', 'sna', 'death']], 
    hue='death', palette='husl', corner=True)
    Fig. 4 — Pair-wise scatterplots between numerical features in the dataset

    We observe a few interesting points —

    • Most of the patients who died following a heart failure seem to have a lower Ejection Fraction than those who survived. They also seem to have slightly higher levels of Serum Creatinine and Creatine Phosphokinase, and they tend to be on the higher side of 80 years.

    • There are no strong correlations between features, and this can be validated by calculating the Spearman R correlation coefficient (we consider the Spearman because we are not sure about the population distribution from which the feature values are drawn).

    df[['plt', 'ejf', 'cpk', 'scr', 'sna']].corr(method='spearman')

    • As observed, the correlation coefficients are moderately encouraging for age-serum creatinine and serum creatinine-serum sodium. From literature, we see that with age, the serum creatinine content increases [11], which explains their slightly positive relationship. Literature also tells us [12] that the sodium to serum creatinine ratio is high in the case of chronic kidney disease, which implies a negative relationship between the two. The slight negative correlation coefficient also implies the prevalence of renal issues in the patients.

    D. Exploring relationships between categorical features

    One way of relating categorical features is to create a pivot table and pivot about a subset of the features. This would give us the number of values for a particular subset of feature values. For this dataset, let’s look at the lifestyle features — smoking, anemic, high blood pressure, and diabetes.

    lifestyle_surv = pd.pivot_table(df.loc[df.death=='No'],
                                    values='chk',
                                    columns=['hbp','dia'],
                                    index=['smk','anm'],
                                    aggfunc=np.sum)

    lifestyle_dead = pd.pivot_table(df.loc[df.death=='Yes'],
                                    values='chk',
                                    columns=['hbp','dia'],
                                    index=['smk','anm'],
                                    aggfunc=np.sum)

    fig, ax = plt.subplots(1, 2, figsize=[15,6])
    sns.heatmap(lifestyle_surv, cmap='Greens', annot=True, ax=ax[0])
    ax[0].set_title('Survivors')
    sns.heatmap(lifestyle_dead, cmap='Reds', annot=True, ax=ax[1])
    ax[1].set_title('Deceased')
    Fig. 5 — Heatmap of the number of patients in each subset of lifestyle features

    A few insights can be drawn —

    • A large number of the patients did not smoke, were not anemic and did not suffer from high blood pressure or diabetes.

    • There were very few patients who had all the four lifestyle features.

    • Many of the survivors were either only smokers or only diabetic.

    • The majority of the deceased had none of the lifestyle features, or at the most were anemic.

    • Many of the deceased were anemic and diabetic as well.

    E. Exploring relationships between all features

    An easy way to combine categorical and numerical features into a single graph is by passing the categorical feature as a hue input. In this case, we use the binary death feature and plot violin plots to visualize the relationships across all features.

    fig, ax = plt.subplots(6, 5, figsize=[20,22])
    cat_features = ['sex','smk','anm', 'dia', 'hbp']
    num_features = ['age', 'scr', 'sna', 'plt', 'ejf', 'cpk']
    for i in range(0,6):
        for j in range(0,5):
            sns.violinplot(data=df, x=cat_features[j], y=num_features[i],
                           hue='death', split=True, palette='husl', ax=ax[i,j])
            ax[i,j].legend(title='death', loc='upper center')
    Fig. 6 — Violinplots for relating numerical and categorical features in the dataset

    Here are a few insights from these plots —

    • Sex: Of the patients who died, the ejection fraction seems to be lower in males than in females. Also, the creatinine phosphokinase seems to be higher in males than in females.

    • Smoking: A slightly lower ejection fraction was seen in the smokers who died than in the non-smokers who died. The creatinine phosphokinase levels seem to be higher in smokers who survived, than in non-smokers who survived.

    • Anemia: The anemic patients tend to have lower creatinine phosphokinase levels and higher serum creatinine levels, than non-anemic patients. Among the anemic patients, the ejection fraction is lower in those who died than in those who survived.

    • Diabetes: The diabetic patients tend to have lower sodium levels and again, the ejection fraction is lower in those who died than in the survivors.

    • High Blood Pressure: The ejection fraction seems to show greater variation in deceased patients with high blood pressure than in deceased patients without high blood pressure.

    I hope you found this useful. The steps for building ML models, tuning their hyper-parameters, and consolidating the results will be shown in the next post.

    Ciao!

    Translated from: https://medium.com/towards-artificial-intelligence/predicting-heart-failure-survival-with-machine-learning-models-part-i-7ff1ab58cff8

    Predicting Monthly Temperature with Machine Learning Models

    A Practical Machine Learning Workflow Example

    Problem Introduction

    The problem we will tackle is predicting the average global land and ocean temperature using over 100 years of past weather data. We are going to act as if we don't have access to any weather forecasts. What we do have access to is a century's worth of historical global temperature averages, including global maximum temperatures, global minimum temperatures, and global land and ocean temperatures. Having all of this, we know that this is a supervised, regression machine learning problem.

    It's supervised because we have both the features and the target that we want to predict. During training, we will give the regression models both the features and targets, and they must learn how to map the data to a prediction. Moreover, this is a regression task because the target value is continuous (as opposed to the discrete classes in classification).

    That’s pretty much all the background we need, so let’s start!

    这几乎是我们需要的所有背景,所以让我们开始吧!

    ML工作流程 (ML Workflow)

    Before we jump right into programming, we should outline exactly what we want to do. The following steps are the basis of my machine learning workflow now that we have our problem and model in mind:

    1. State the question and determine the required data (completed)

    2. Acquire the data

    3. Identify and correct missing data points/anomalies

    4. Prepare the data for the machine learning model by cleaning/wrangling

    5. Establish a baseline model

    6. Train the model on the training data

    7. Make predictions on the test data

    8. Compare predictions to the known test set targets and calculate performance metrics

    9. If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique

    10. Interpret model and report results visually and numerically

    Data Acquisition

    First, we need some data. To use a realistic example, I retrieved temperature data from the Berkeley Earth Climate Change: Earth Surface Temperature Dataset found on Kaggle.com. Being that this dataset was created from one of the most prestigious research universities in the world, we will assume data in the dataset is truthful.

    Dataset link: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

    After importing some important libraries and modules, the code below loads in the CSV data which I store into a variable we can use later:

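    The original code was a screenshot and is not reproduced here. A minimal sketch of the loading step, assuming the Kaggle file is named GlobalTemperatures.csv, might look like this:

    import pandas as pd

    # Load the Berkeley Earth global temperatures file into a dataframe
    # (file name assumed from the Kaggle dataset listing)
    globalTemp = pd.read_csv('GlobalTemperatures.csv')
    globalTemp.head()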

    Following are explanations of each column:

    dt: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures

    LandAverageTemperature: global average land temperature in celsius

    LandAverageTemperatureUncertainty: the 95% confidence interval around the average

    LandMaxTemperature: global average maximum land temperature in celsius

    LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature

    LandMinTemperature: global average minimum land temperature in celsius

    LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature

    LandAndOceanAverageTemperature: global average land and ocean temperature in celsius

    LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature

    Identify Anomalies / Missing Data

    Looking through the data (shown above) from Berkeley Earth, I noticed several missing data points, which is a great reminder that data collected in the real-world will never be perfect. Missing data can impact analysis immensely, as can incorrect data or outliers.

    To identify anomalies, we can quickly find missing values using the info() method on our DataFrame.


    Also, we can use the “.isnull()” and “.sum()” methods directly on our dataframe to find the total amount of missing values in each column.

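    A sketch of both checks described above (the method names are standard pandas; the variable name globalTemp carries over from the loading step):

    # Summary of dtypes and non-null counts per column
    globalTemp.info()

    # Count of missing values in each column
    globalTemp.isnull().sum()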

    Data Preparation

    Unfortunately, we aren’t quite at the point where we can just feed the raw data into a model and have it return an answer (although you could, it would not be the most accurate)! We will need to do some minor modification to put our data into machine-understandable terms.

    不幸的是,我们还不能完全将原始数据馈入模型并让其返回答案(尽管可以,但这并不是最准确的)! 我们将需要做一些小的修改,以使我们的数据成为机器可理解的术语。

    The exact steps for preparation of the data will depend on the model used and the data gathered, but some amount of data manipulation will be required.

    准备数据的确切步骤将取决于所使用的模型和收集的数据,但是将需要一定数量的数据处理。

    First things first, I will be creating a function called wrangle() in which I will call our dataframe.

    首先,我将创建一个名为wrangle()的函数,在其中将调用我们的数据框。

    Image for post

    We want to make a copy of the dataframe so we do not corrupt the original. After that, we are going to drop columns that hold high cardinality.

    High cardinality refers to columns with values that are very uncommon or unique. Given how common high-cardinality data is within most time-series datasets, we are going to address this problem directly by removing these high-cardinality columns from our dataset completely, so as not to confuse our model in the future.

    Next in the set of instructions for our function, we are going to create a function within our pending wrangle function, called convertTemp(). Essentially this convertTemp function is just for my own eyes (and maybe yours) and being that I am from the United States, our official measurement for temperature is in Fahrenheit and the dataset I have used is measured in Celsius.

    So just for ease purposes, not that it will affect our model results or predictions in any way, I chose to apply that function to the remaining columns which hold Celsius temperature:


    Finally, the last step in our data wrangling function would be to convert the dt(Date) column to a DateTime object. After which we will create subsequent columns for the month and year, eventually dropping the dt and Month columns.

    Now if you remember we also had missing values which we saw earlier in our dataset. From just analyzing the dataset and from what I described about the Date column, the LandAverageTemperature column starts in 1750 while the other 4 columns we chose to keep in our wrangle function start in 1850.

    So I think we will solve much of the missing value problem by just splicing the dataset by the year, creating a new dataset that starts from the year 1850 and above. We will also call the dropna(), just in case there are any other missing values in our dataset:

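    The wrangle() screenshots are not reproduced here. Below is a hedged reconstruction following the steps described above: copy the dataframe, drop the high-cardinality columns (assumed here to be the *Uncertainty columns), convert Celsius to Fahrenheit with convertTemp(), derive Year and Month from dt, drop dt and Month, keep rows from 1850 onward, and drop any remaining missing rows. The column names come from the dataset description above; which columns were actually dropped is an assumption.

    def wrangle(df):
        df = df.copy()

        # Drop high-cardinality columns (assumption: the uncertainty columns)
        df = df.drop(columns=[col for col in df.columns if 'Uncertainty' in col])

        # Celsius -> Fahrenheit, for readability only
        def convertTemp(celsius):
            return celsius * 9 / 5 + 32

        temp_cols = ['LandAverageTemperature', 'LandMaxTemperature',
                     'LandMinTemperature', 'LandAndOceanAverageTemperature']
        df[temp_cols] = df[temp_cols].apply(convertTemp)

        # Convert dt to a datetime, keep the year, then drop dt and Month
        df['dt'] = pd.to_datetime(df['dt'])
        df['Year'] = df['dt'].dt.year
        df['Month'] = df['dt'].dt.month
        df = df.drop(columns=['dt', 'Month'])

        # Keep 1850 onwards and drop any remaining missing rows
        df = df[df['Year'] >= 1850].dropna()
        return df

    globalTemp = wrangle(globalTemp)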

    Let's see how it looks:

    (Preview of the cleaned globalTemp dataframe)

    After calling our wrangle function on our globalTemp dataframe, we can now see a new cleaned-up version of our globalTemp dataframe, free of any missing values.

    It looks like we are ready for the next step, Setting up our target and features, train/test split, and establishing our baseline…

    Quick Correlation Visualization

    One thing I like to do when working with regression problems is to look at the cleaned dataframe and to see if we can truly use one column as our target and the others as our features.

    One way I loosely determine that is by plotting a correlation matrix, just to get an understanding of how related each column is to each other:

    Global Temps Correlation Matrix Plot
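    A sketch of code that could produce the matrix plot named above (the original snippet was an image; seaborn's heatmap is an assumption):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Pairwise correlations between the remaining numeric columns
    corr = globalTemp.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.title('Global Temps Correlation Matrix')
    plt.show()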

    As we can see, and as some of you probably guessed, the columns we chose to keep moving forward are HIGHLY correlated to one another. So we should have pretty strong and positive predictions just from glancing at this plot.

    Separating our Target From Our Features

    Now, we need to separate the data into the features and targets. The target, also known as Y, is the value we want to predict, in this case the actual land and ocean average temperature; the features are all the columns (minus our target) the model uses to make a prediction:

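    A sketch of the separation step (the target column name comes from the dataset description):

    # The target is the land-and-ocean average temperature;
    # every other column is a feature
    target = 'LandAndOceanAverageTemperature'
    y = globalTemp[target]
    X = globalTemp.drop(columns=[target])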

    Train-Test Split

    Now we are on the final step of the data preparation part of our ML workflow: splitting data into training and testing sets.

    During training, we let the model 'see' the answers, in this case, the actual temperature, so it can learn how to predict the temperature from the features. As we know, there is a relationship between all the features and the target value, and the model's job is to learn this relationship during training. Then, when it comes time to evaluate the model, we ask it to make predictions on a testing set where it only has access to the features (not the target)!

    Generally, when training a regression model, we randomly split the data into training and testing sets to get a representation of all data points.

    For example, if we trained the model on the first nine months of the year and then used the final three months for prediction, our algorithm would not perform well because it has not seen any data from those last three months.

    Make sense?

    The following code splits the data sets:
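    A sketch of the split, assuming scikit-learn's train_test_split and the X_train/X_val names used in the prose; the shape check described below is included:

    from sklearn.model_selection import train_test_split

    # Random 80/20 split into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Column counts should match across X_train and X_val,
    # and row counts should match between features and targets
    print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)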

    We can look at the shape of all the data to make sure we did everything correctly. We expect the number of columns in the training features (X_train) to match the number of columns in the testing features (X_val), and the number of rows to match between the respective training and testing features and targets.


    It looks as if everything is in order! Just to recap, we:

    1. Got rid of missing values and unneeded columns

    2. Split data into features and target

    3. Split data into training and testing sets

    These steps may seem tedious at first, but once you get the basic ML workflow, it will be generally the same for any machine learning problem. It’s all about taking human-readable data and putting it into a form that can be understood by a machine learning model.

    Establish Baseline Mean Absolute Error

    Before we can make and evaluate predictions, we need to establish a baseline, a sensible measure that we hope to beat with our model. If our model cannot improve upon the baseline, then it will be a failure and we should try a different model or admit that machine learning is not right for our problem.

    The baseline prediction for our case will be the yearly average temperature. In other words, our baseline is the error we would get if we simply predicted the average temperature for our target dataset (y_train).

    In order to find out the MAE very easily, we can import the mean_absolute_error method from the scikit-learn library, which will calculate it for us:

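    A sketch of the baseline computation (predicting the training-set mean temperature for every validation row):

    import numpy as np
    from sklearn.metrics import mean_absolute_error

    # Baseline: always predict the average temperature seen in training
    baseline_preds = np.full(len(y_val), y_train.mean())
    baseline_mae = mean_absolute_error(y_val, baseline_preds)
    print(f'Baseline MAE: {baseline_mae:.2f} degrees')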

    We now have our goal! If we can’t beat an average error of 2 degrees, then we need to rethink our approach.

    Train Model

    After all the work of data preparation, creating and training the model is pretty simple using scikit-learn. For this problem we could try a multitude of models, but here we are going to use two different ones: a Linear Regression model and a Random Forest Regressor model.

    Linear Regression Model

    Linear regression is a statistical approach that models the relationship between input features and output. Our goal here is to predict the value of the output based on the input features.

    In the code below, I created what is called a pipeline, which allows stacking multiple processes into a single scikit-learn estimator. Here the only processes we are using are StandardScaler(), which subtracts the mean from each feature and then scales it to unit variance, and obviously the LinearRegression() process:

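    A sketch of the pipeline described above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression

    # Scale each feature to zero mean / unit variance, then fit a linear model
    lr_model = make_pipeline(StandardScaler(), LinearRegression())
    lr_model.fit(X_train, y_train)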

    Random Forest Regressor Model

    A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique commonly known as bagging.

    The basic idea behind bagging is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset, forming sample datasets for every model:

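    A sketch of the pipeline, with SelectKBest scoring all features; the hyperparameter values are placeholders, since the originals are not shown:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.ensemble import RandomForestRegressor

    # Score all features, then fit a random forest
    # (n_estimators / max_depth values here are placeholders)
    rf_model = make_pipeline(
        SelectKBest(score_func=f_regression, k='all'),
        RandomForestRegressor(n_estimators=100, max_depth=20,
                              n_jobs=-1, random_state=42))
    rf_model.fit(X_train, y_train)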

    A little information on what's going on in the code snippet above:

    n_estimators represents the number of trees in the random forest.

    max_depth represents the depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data.

    n_jobs refers to the number of cores the regressor will use. -1 means it will use all cores available to run the regressor.

    SelectKBest just scores the features using an internal function. In this case, I chose to score all the features.

    After creating our pipelines and having fit our training data into our pipeline models, we now need to make some predictions.

    Make Predictions on the Test Set

    Our model has now been trained to learn the relationships between the features and the targets. The next step is figuring out how good the model is! To do this we make predictions on the test features and compare the predictions to the known answers.

    When performing regression predictions, we need to make sure to use the absolute error because we expect some of our answers to be low and some to be high. We are interested in how far away our average prediction is from the actual value so we take the absolute value (as we also did when establishing the original baseline earlier in this blog):

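    A sketch of the evaluation for the linear model:

    from sklearn.metrics import mean_absolute_error

    lr_mae = mean_absolute_error(y_val, lr_model.predict(X_val))
    print(f'Linear Regression MAE: {lr_mae:.2f} degrees')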

    Let's look at our Random Forest Regressor MAE:

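    The same computation, sketched for the forest:

    rf_mae = mean_absolute_error(y_val, rf_model.predict(X_val))
    print(f'Random Forest MAE: {rf_mae:.2f} degrees')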

    Our average temperature prediction estimate is off by 0.28 degrees in our Linear Regression MAE and 0.24 for our Random Forest MAE. That is almost a 2-degree average improvement over the baseline of 2.03 degrees.

    Although this might not seem significant, it is nearly 95% better than the baseline, which, depending on the field and the problem, could represent millions of dollars to a company.

    Determine Performance Metrics

    To put our predictions in perspective, we can calculate an accuracy as the mean absolute percentage error subtracted from 100%.

    Linear Regression Test/Train Accuracy:

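    A sketch of the accuracy computation described above (100 minus the mean absolute percentage error), applied to the linear model:

    def regression_accuracy(model, X, y):
        # 100% minus the mean absolute percentage error
        errors = abs(model.predict(X) - y)
        mape = 100 * (errors / y).mean()
        return 100 - mape

    print('LR train:', regression_accuracy(lr_model, X_train, y_train))
    print('LR test: ', regression_accuracy(lr_model, X_val, y_val))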

    Random Forest Regressor Train/Test Accuracy:

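    The same helper applied to the forest:

    print('RF train:', regression_accuracy(rf_model, X_train, y_train))
    print('RF test: ', regression_accuracy(rf_model, X_val, y_val))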

    By looking at the error metric values we got, we can say that our model performs well and is able to give accurate predictions, given a new set of records (y_pred).

    Our model has learned how to predict the average temperature for the next year with 99% accuracy in both our models.

    Nice!!

    Model Tuning

    In the usual machine learning workflow, we would stop here after achieving 99% accuracy. But in most cases, as I stated before, the dataset would not be as clean, this would be when to start hyperparameter tuning the model.

    Hyperparameter tuning is a complicated phrase that means “adjust the settings to improve performance”. The most common way to do this is to simply make a bunch of models with different settings, evaluate them all on the same validation set, and see which one does best.

    An accuracy of 99% is obviously satisfactory for this problem, but it is known that the first model built will almost never be the model that makes it to production. So let us try to reach 100% accuracy if that is possible.

    RandomizedSearchCV

    In the beginning, I decided I wanted to use GridSearchCV to hyper tune my model, but GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters.

    The most efficient way to find an optimal set of hyperparameters for a machine learning model is to use random search. A solution to this is another scikit-learn method named RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.

    Now being that we only have 5 columns in total, there is really no need for us to use RandomizedSearchCV, but for blogging purposes, we will see how to use RandomizedSearchCV to tune your model.

    Let’s see if we have any gains in our prediction accuracy score and MAE:

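    A sketch of a randomized search over the forest pipeline; the parameter distributions and iteration count are placeholders, with step-prefixed names as make_pipeline requires:

    from scipy.stats import randint
    from sklearn.model_selection import RandomizedSearchCV

    # Placeholder distributions for two forest hyperparameters
    param_distributions = {
        'randomforestregressor__n_estimators': randint(50, 500),
        'randomforestregressor__max_depth': randint(5, 40),
    }

    search = RandomizedSearchCV(rf_model,
                                param_distributions=param_distributions,
                                n_iter=10, cv=10, n_jobs=-1, random_state=42)
    search.fit(X_train, y_train)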

    A little information on the code snippet above:

    n_iter: represents the number of iterations. Each iteration represents a new model trained on a new draw from your dictionary of hyperparameter distributions.

    param_distributions: specifies the parameters and distributions to sample from.

    cv: 10-fold cross-validation. The number of cross-validation folds chosen determines how many times each model will be trained on a different subset of data.

    n_jobs: refers to the number of cores the regressor will use. -1 means it will use all cores available to run the regressor.

    best_estimator_: an attribute holding the instance of the specified model type that has the 'best' combination of the given parameters from the params variable.

    We then use the best set of hyperparameter values chosen in the RandomizedSearchCV in the actual model which we named best_model as shown:

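    A sketch of pulling out and evaluating the winning model:

    best_model = search.best_estimator_
    best_mae = mean_absolute_error(y_val, best_model.predict(X_val))
    print(f'Best model MAE: {best_mae:.2f} degrees')
    print('Accuracy:', regression_accuracy(best_model, X_val, y_val))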

    As suspected, after running our predict method on our best_model, we can see RandomizedSearchCV output the same prediction results and accuracy score percentage as our Random Forest Regressor model earlier.

    Although there was no need for it here, we have seen how hyperparameter tuning could essentially help improve model scores if needed.

    Visualizations

    Partial Dependence Plots

    PDPbox is a partial dependence plot toolbox written in Python. The goal of pdpbox is to visualize the impact of certain features towards model prediction for any supervised learning algorithm.

    The problem is when using machine learning algorithms like random forest, it is hard to understand the relations between predictors and model outcomes. For example, in terms of random forest, all we get is the feature importance. Although we can know which feature is significantly influencing the outcome based on the importance calculation, we really don’t know in which direction it is influencing.

    This is where PDPbox comes into play:

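    A sketch using the pdpbox 0.2.x API (the choice of feature is an assumption based on the plot discussion below):

    from pdpbox import pdp

    feature = 'LandAverageTemperature'
    isolated = pdp.pdp_isolate(model=best_model, dataset=X_val,
                               model_features=X_val.columns.tolist(),
                               feature=feature)
    pdp.pdp_plot(isolated, feature_name=feature)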

    A little background on what's going on in the code above:

    feature: the feature column we want to compare against our model to see the effect it has on the model prediction (our target)

    isolated: pdp_isolate is what we call to create our PDP. We are only comparing one feature, hence the name isolated.

    All other parameters should be self-explanatory.

    Now let us look at our plot:

    (Partial dependence plot)

    From this plot, we can see that as the average LandAndOceanTemperature rises and LandAverageTemperature increases, the predicted temperature tends to increase.

    We also created another PDPbox plot in which we used two features (LandMinTemperature and LandMaxTemperature) to see how they affect the model's prediction of our target column (LandAndOceanAverageTemperature):

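    A sketch of the two-feature version, again assuming the pdpbox 0.2.x API:

    features = ['LandMinTemperature', 'LandMaxTemperature']
    interaction = pdp.pdp_interact(model=best_model, dataset=X_val,
                                   model_features=X_val.columns.tolist(),
                                   features=features)
    pdp.pdp_interact_plot(interaction, feature_names=features,
                          plot_type='contour')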

    From this plot, we can see the same results as well. As the average LandMaxTemperature rises and LandMinTemperature increases, the predicted target, LandAndOcean, temperature tends to increase.

    Conclusion

    We have now completed an entire end-to-end machine learning example!

    At this point, if we want to improve our model, we could try different hyperparameters (RandomizedSearchCV, or something new like GridSearchCV), try a different algorithm, or, the best approach of all, just gather more data! The performance of any model is directly related to how much data it can learn from, and we were using a very limited amount of information for training.

    I hope everyone who made it through has seen how accessible machine learning has become and the uses it has.

    Until next time, my friends…


    Translated from: https://medium.com/swlh/predicting-weather-temperature-change-using-machine-learning-models-4f98c8983d08

    A Machine-Learning Model for Breast Cancer Prediction (with Python Code)

    Project Introduction

    This project classifies breast cancer using the breast cancer dataset, which has the following characteristics:
    ① All feature values are numeric, so no special thought is needed about how to import or process the data
    ② The first column is the user ID, which takes no part in building the machine learning model
    ③ This is a classification problem that can be solved with supervised learning algorithms; it is a binary classification problem, so no special handling is required
    ④ All features use the same units: each feature is graded from 1 to 10, so no rescaling is needed
    ⑤ Some feature columns have missing data, which must be handled
    Next, we analyze these characteristics through the following steps:
    ① Import the data
    ② Summarize the data and handle missing values
    ③ Visualize the data
    ④ Evaluate algorithms
    ⑤ Make predictions

    Importing the Data

    First, we import the libraries and the dataset the project needs. The libraries include numpy, pandas, sklearn, and matplotlib; the dataset is the breast cancer dataset introduced earlier, which we load with Pandas. Pandas is also used in the following sections for descriptive statistics and visualization. Note that when importing the data we assign a short name to each feature, which works around the very long medical names and helps when presenting the data later.

    Summarizing the Data

    Next we build some understanding of the data so that we can choose suitable algorithms. We review the data from the following angles (the overview results are produced by the code listed at the end of this post):
    I. Dimensions of the data
    II. The data itself and its description
    III. Statistical description of all the features
    IV. Class distribution of the data

    Data Visualization

    Now that we have a basic understanding of the data we are working with, we use charts to look at the distribution of the data and the relationships between features: univariate charts help us understand each individual feature, while multivariate charts show the relationships between different features.
    3.1 Box plots
    We start with univariate charts showing each feature on its own. Since every feature is numeric, box plots can show each feature's spread around its median.
    3.2 Histograms
    Next, histograms show the distribution of each feature; in the output we can see that some features follow a roughly Gaussian distribution.
    3.3 Scatter-plot matrix
    Finally, we use a multivariate chart to look at the relationships between features: a scatter-plot matrix shows how each pair of features relates to each other.

    Evaluating Algorithms

    We will create models with six different algorithms and evaluate their accuracy in order to find a suitable one. The steps are:
    I. Separate out a validation dataset.
    II. Evaluate the algorithms with 10-fold cross-validation.
    III. Build the models to predict new data.
    IV. Select the best model.
    4.1 Separating out a validation dataset
    How do we know whether the model we create is good enough? Later we will evaluate it with statistical methods, but we also want to see how it performs on truly unseen data, which is the main reason to hold back part of the data for evaluating the algorithms. We split the data into 80% for training and 20% for validation: X_train and y_train are used to train the models, while X_validation and y_validation are used later to validate and evaluate them.
    4.2 Evaluating the models
    We use 10-fold cross-validation on the training set to estimate the accuracy of each algorithm. 10-fold cross-validation randomly splits the data into 10 parts; in each round, 9 parts are used to train the model and 1 part to evaluate it. In what follows, every algorithm is trained and evaluated on the same data, and the best model is chosen from among them.
    4.3 Building the models
    We do not know in advance which algorithm will work well on this problem. The charts above show that some features have roughly linear distributions, so we can expect fairly good results. We evaluate six different algorithms:
    · Logistic Regression (LR)
    · Linear Discriminant Analysis (LDA)
    · k-Nearest Neighbors (KNN)
    · Classification and Regression Trees (CART)
    · Naive Bayes (NB)
    · Support Vector Machines (SVM)
    This list contains linear algorithms (LR, LDA) and nonlinear algorithms (KNN, CART, NB, SVM). Before each evaluation, the random seed is reset so that every algorithm is evaluated on the same data splits, keeping the comparison fair. We then create and evaluate these six models.
    4.4 Selecting the best model
    We now have six models and an accuracy estimate for each. We compare them and pick the most accurate. The results show that for this breast cancer prediction model, KNN (a nonlinear algorithm) achieves higher accuracy than the other algorithms. To compare the algorithms further, we draw a box plot of their accuracy distributions, from which we can conclude that KNN gives the highest accuracy and the best model.

    Making Predictions

    The evaluation showed that KNN is the most accurate algorithm. Now we validate this model on the held-back validation dataset, which gives us a final, independent report of its quality. Running the code, we see an accuracy of 0.97. Looking at the confusion matrix, there are only two prediction errors for each of the benign and malignant classes; the precision, recall, and F1 values follow.

    Code

    # Import libraries
    import numpy as np
    import pandas as pd
    from pandas.plotting import scatter_matrix
    from matplotlib import pyplot
    from sklearn.model_selection import train_test_split, KFold, cross_val_score
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    
    # Load the data with short column names in place of the long medical ones
    # (the commented-out line loads the same file from the UCI repository)
    #breast_cancer_data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None,
    #                                 names=['C_D','C_T','U_C_Si','U_C_Sh','M_A','S_E_C_S',
    #                                        'B_N','B_C','N_N','M','Class'])
    breast_cancer_data = pd.read_csv('breast_data.csv', header=None,
                                     names=['C_D','C_T','U_C_Si','U_C_Sh','M_A','S_E_C_S',
                                            'B_N','B_C','N_N','M','Class'])
    # Print the data
    print(breast_cancer_data)
    
    # Dimensions of the data
    print('Dimensions:')
    print(breast_cancer_data.shape)
    # Inspect the data and its description
    print('Data overview:')
    breast_cancer_data.info()
    breast_cancer_data.head(25)
    
    # Statistical description
    print('Statistical description:')
    print(breast_cancer_data.describe())
    # Class distribution
    print('Class distribution:')
    print(breast_cancer_data.groupby('Class').size())
    
    # Handle missing data: 'B_N' uses '?' for missing entries,
    # so replace them with the mean of the remaining values
    mean_value = breast_cancer_data[breast_cancer_data["B_N"] != "?"]["B_N"].astype(int).mean()
    breast_cancer_data = breast_cancer_data.replace('?', mean_value)
    breast_cancer_data["B_N"] = breast_cancer_data["B_N"].astype(np.int64)
    
    # Data visualization
        # Univariate chart: box plots
    breast_cancer_data.plot(kind='box', subplots=True, layout=(3,4), sharex=False, sharey=False)
    pyplot.show()
        # Univariate chart: histograms
    breast_cancer_data.hist()
    pyplot.show()
    # Multivariate chart
        # Scatter-plot matrix
    scatter_matrix(breast_cancer_data)
    pyplot.show()
    
    
    # Evaluate algorithms
        # Separate out a validation dataset
    array = breast_cancer_data.values
    X = array[:, 1:9]
    y = array[:, 10]
    
    validation_size = 0.2
    seed = 7
    # train for training, validation for final checking
    X_train, X_validation, y_train, y_validation = train_test_split(
        X, y, test_size=validation_size, random_state=seed)
    
        # Spot-check the algorithms
    models = {}
    models['LR'] = LogisticRegression()
    models['LDA'] = LinearDiscriminantAnalysis()
    models['KNN'] = KNeighborsClassifier()
    models['CART'] = DecisionTreeClassifier()
    models['NB'] = GaussianNB()
    models['SVM'] = SVC()
    
    num_folds = 10
    seed = 7
    # shuffle=True is required when random_state is set
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    
        # Evaluate each algorithm with 10-fold cross-validation
    results = []
    for name in models:
        result = cross_val_score(models[name], X_train, y_train, cv=kfold, scoring='accuracy')
        results.append(result)
        msg = '%s: %.3f (%.3f)' % (name, result.mean(), result.std())
        print(msg)
        # Compare the algorithms with a box plot
    fig = pyplot.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)
    pyplot.boxplot(results)
    ax.set_xticklabels(models.keys())
    pyplot.show()
    
    
    # Make predictions
    # Validate the chosen algorithm on the validation dataset
    knn = KNeighborsClassifier()
    knn.fit(X=X_train, y=y_train)
    predictions = knn.predict(X_validation)
    
    print('Final model: KNN')
    print(accuracy_score(y_validation, predictions))
    print(confusion_matrix(y_validation, predictions))
    print(classification_report(y_validation, predictions))

    References

    https://read.douban.com/reader/column/6939417/chapter/35867190/#
    https://blog.csdn.net/ruoyunliufeng/article/details/79369142

    If there is any copyright infringement, please get in touch promptly.
    
    Full text: 2,723 words, estimated reading time 9 minutes

    Image source: unsplash

    Many interviewers love to ask this question: "Suppose I'm a five-year-old; explain [some technology] to me." Explaining machine learning to a kindergartner may be a stretch; what the question really demands is that you explain a technology as simply as possible.

    That is what I try to do in this article: explain what machine learning is and the different types of machine learning, then introduce the common models. There will be no math here, so beginners can read on with confidence.

    It should be easy to follow for adults with little or no data science background (if it isn't, please tell me in the comments).

    Defining machine learning

    (Machine learning diagram)

    Machine learning means loading a large amount of data into a computer program and choosing a model to "fit" the data, so that the computer (without your help) can make predictions. The computer builds the model by way of an algorithm, which can be anything from a simple equation (like the equation of a line) to a very complex logical/mathematical system that gets the computer to the best predictions.

    Machine learning is aptly named: once you choose the model to use and tune it (that is, improve it through adjustments), the machine uses the model to learn the patterns in your data. Then you can give it new conditions (observations) and it will predict the outcome!

    Image source: unsplash

    Defining supervised machine learning

    Supervised learning is machine learning in which the data fed to the model is "labeled". Put simply, labeled means the outcome of each observation (each row of data) is known.

    For example, if your model is trying to predict whether your friends will go golfing, you might have variables such as the temperature and the day of the week. If your data is labeled, you will also have a variable whose value is 1 when your friends actually went golfing and 0 when they did not.

    Defining unsupervised machine learning

    Unsupervised learning is the opposite of supervised learning when it comes to labeling. In the unsupervised case you do not know whether your friends went golfing; the computer has to find patterns through the model to guess what has happened or predict what will happen.

    Supervised machine learning models

    Logistic regression

    Logistic regression is used for classification problems, meaning the target variable (the thing being predicted) consists of categories. The categories can be "yes/no", or something like a number from 1 to 10 representing customer satisfaction.

    The logistic regression model uses an equation to fit a curve to the data and then uses that curve to predict the outcome of a new observation.

    (Logistic regression diagram)

    In the figure above, the new observation is predicted as 0 because it lies on the left side of the curve. Looking at the data on the curve, this makes sense: in the "predict 0" region of the plot, most data points have a y value of 0.

    Linear regression

    Linear regression is often one of the first machine learning models people learn. That is because, when only one x variable is used, the algorithm (the equation behind the scenes) is relatively easy to understand: draw a line of best fit, something taught in grade school. That best-fit line can then predict new data points (see the figure below).

    (Linear regression diagram)

    Linear regression is similar to logistic regression, but it is used when the target variable is continuous, meaning it can take essentially any numeric value. In fact, any model with a continuous target variable can be classed as "regression". An example of a continuous variable is the sale price of a house.

    Linear regression is also easy to interpret. The model equation contains a coefficient for each variable, and each coefficient indicates how much the target variable changes with each unit change in the corresponding independent (x) variable.

    Using house prices as an example, you could look at the regression equation and say: "Oh, this tells me that for every additional square foot of area (the x variable), the sale price (the target variable) increases by $25."

    Image source: unsplash

    K-nearest neighbors (KNN)

    This model can be used for classification or regression! The name "K-nearest neighbors" is not meant to confuse: the model starts by plotting all the data, and the "K" refers to the number of nearest neighboring data points the model uses to determine its prediction (see the figure below). You choose K, and you can try different values to see which gives the best predictions.

    (K-nearest neighbors diagram)

    All the data points inside the K = __ circle get to "vote" on the target value of the new data point. The value with the most votes is what KNN predicts for the new point.

    In the figure above, two of the nearest points belong to class 1 and one belongs to class 2, so the model predicts class 1 for this data point. If the model is predicting a numeric value rather than a category, all the "votes" are numeric values that get averaged into the prediction.

    Support vector machines

    A support vector machine works by establishing a boundary between the data points, with most of one class falling on one side of the boundary (a line, in the 2D case) and most of the other class falling on the other side.

    (Support vector machine diagram)

    It works by having the machine look for the boundary with the largest margin, where the margin is the distance between the nearest point of each class and the boundary. New data points are then plotted and classified according to which side of the boundary they fall on.

    I have explained this model in terms of classification, but you can also use SVMs for regression.

    Decision trees and random forests

    Image source: unsplash

    I already explained these in my previous article, "Data science concepts explained to a five-year-old" (decision trees and random forests are near the end).

    Link: https://towardsdatascience.com/data-science-concepts-explained-to-a-five-year-old-ad440c7b3cbd

    Unsupervised machine learning models

    Now we head into deeper water: unsupervised learning. As a reminder, this means the dataset is unlabeled, so the outcomes of the observations are unknown.

    K-means clustering

    With K-means you must first assume there are K clusters in the dataset. Since you do not know how many groups the data actually contains, you have to try different values of K and use visualizations and metrics to see which K works. K-means is best suited to round clusters of similar size.

    The K-means algorithm first picks the best K data points to form the center of each of the K clusters. It then repeats the following two steps for every point:

    1. Assign the data point to the nearest cluster center

    2. Create a new center by taking the mean of all the data points in that cluster

    (K-means clustering diagram)

    DBSCAN clustering

    The DBSCAN clustering model differs from K-means in that it does not require a value of K as input, and it can find clusters of any shape. Instead of specifying the number of clusters, you provide the minimum number of data points needed to form a cluster and a radius around data points within which to search.

    DBSCAN finds the clusters for you; you can then change the values used to build the model until you get clusters that make sense for your dataset.

    In addition, the DBSCAN model classifies "noise" points (points far away from all other observations). DBSCAN works better than K-means when the data points are very close together.

    Neural networks

    Image source: unsplash

    In my view, neural networks are the coolest and most mysterious models. They are called "neural networks" because they are modeled on how the neurons in our brains work. These models search for patterns in a dataset; sometimes they find patterns humans might never recognize.

    Neural networks handle complex data such as images and audio very well. They are the logic behind software we now see all the time, from facial recognition to text classification.

    Image source: unsplash

    Sometimes there is a puzzling side: even experts cannot fully understand why a computer reached a given conclusion. In some cases all we care about is that it predicts well!

    But sometimes we do care about how the computer arrives at its prediction, for instance when a model is used to decide which job applicants get a first-round interview.

    I hope this article has deepened your understanding of these models, and made you realize just how cool they are!

  • The purpose of this cheat sheet is to give you ideas for improving machine learning performance. One of them may be all you need for a breakthrough. Find the one you need, then come back and find the next one. I have divided the list into 4 sub-topics: improving performance with data, improving performance with algorithms...
  • Building machine learning prediction models with ensemble learning: Some time ago I interviewed at a quantitative investment firm and used an ensemble learning algorithm there with very good results. I am now publishing the code so that beginners can learn from it... 3) Distinguish high- from low-investment users by investment amount, and with this as the target variable build one or two machine learning models (G...
  • Predicting stock movements with machine learning models: Using machine learning models to predict the next-day movement of individual CSI 300 stocks; the approach and each day's predictions are updated under this public account, anyone interested is welcome to take a look...
  • A small example of a machine-learning activity prediction model based on RDKit and Python 3, using compound structure and activity data. Code example: #import required packages #!/usr/bin/env python3 from rdkit.Ch...
  • This article summarizes, from an engineering point of view, which tips we can use to improve final prediction results when applying machine learning in practice, split into three parts: Data Cleaning, Feature Engineering, and Model Training. Data Cleaning: remove redundant...
  • See section 1.5 of the original book on the general process of building predictive models... Machine learning: the whole process of developing a model that can actually be deployed, including understanding machine learning algorithms and hands-on practice. Usually there are very practical reasons why certain algorithms are used so often; understanding what lies behind...
  • The full name of the ARIMA model is the Autoregressive Integrated Moving Average Model... The core idea of ARIMA is to treat the data series that the forecast target forms over time as a random sequence and, based on autocorrelation analysis of the time series, use a certain amount of data...
  • This excerpt from the book "Python Machine Learning..." covers section 1.5, the process of building predictive models. Using machine learning requires several different skills. One is programming, which is not the focus of that book; the other skills are needed to obtain suitable models for training and deployment. Those other skills will be...
  • Without a doubt, machine learning is the hottest topic in big data analysis today, and it is the basic concept behind some of the most exciting technology areas... "Building a machine learning model to predict rental prices with Python" aims to introduce you to the basic concepts of machine learning. As you go on, you will build your first...
  • [Machine learning] A disease prediction model based on naive Bayes: Contents: Chapter 1, Naive Bayes; 1.1 Introduction to naive Bayes; 1.2 Conditional probability; 1.3 The naive Bayes classifier; 1.4 The classification process of the naive Bayes classifier; 1.5 ...
  • Prediction models are widely used in LinkedIn products such as Feed, ads, job recommendations, email marketing, and people search, and they play an important role in improving the user experience. To meet its modeling needs, LinkedIn developed and open-sourced the Photon-ML large-scale machine learning library...
  • Using machine learning to predict stock returns is nothing new and has produced very good results, as in "When Empirical Asset Pricing Meets Machine Learning". So can the same methods be applied to bonds? In 2020, in the Review of Financial Studies, "Bond Risk Premiums...
  • Introduction: evaluation metrics are tied to the machine learning task... Classification, regression, and ranking are examples of supervised learning, which covers most machine learning applications. This article focuses on the metrics for supervised machine learning modules. What is model evaluation? Evaluating a model is part of the whole machine...
  • Random forests were used to develop prediction models with different molecular descriptors, activity thresholds, and training-set compositions. Compared with previously reported studies on the extracted datasets, the model showed excellent performance in external validation. #import dependencies import pandas as pd impor...
  • log P (the octanol-water partition coefficient) is one of the most important properties for deciding whether a compound is suitable as a drug... Unfortunately, there is currently a lack of publicly available experimental log P datasets for training better prediction tools. This test uses the experimental log P data published in the paper "Large...
  • PCA is a very classic dimensionality-reduction algorithm; it is unsupervised, and anyone doing machine learning should know it. But beyond the basic PCA derivation and applications there are also SparsePCA, KernelPCA, TruncatedSVD, and so on, as well as the relationship between PCA and eigenvalues/singular values, and between SparsePCA and dictionary...
  • While making predictions with a machine learning model, the problem arose that only one kind of result was output. Comparing the feature values used to train the model with the feature values at prediction time: 1, some feature values differed greatly in magnitude; 2, some had opposite signs; either can cause the prediction to output the same value...
  • [Clinical prediction models] ---- Building a prediction model in 10 steps. Many beginners are at a loss when they first construct a clinical prediction model. To solve this problem, I thought it over for a long time and decided to summarize it in one sitting: a quick 5-minute overview telling confused readers what to do at each step...
  • So how well do traditional machine learning classification models do in this area? Considering only six indicators (the 5-, 10-, and 20-day moving averages and exponential moving averages), this article compares how well support vector machines, decision trees, and random forests predict price rises and falls. Generally, the more technical indicators...
  • Model predictive control and machine learning: Contents: 1. Principles and applications of model predictive control; 2. A brief discussion of machine learning techniques; 3. Model predictive control based on machine learning; 4. Summary. In recent years, artificial intelligence and machine learning have been embraced by every industry and are no longer the "patent" of computer science departments; even I...
  • Note: before feature engineering, a separate function should be written and packaged for data preprocessing, which also helps with later model prediction, mainly because feature engineering cannot yet handle the preprocessing steps. Reposted from:...
  • [Machine learning] PMML, the Predictive Model Markup Language: PMML lets you easily share predictive analytics models between different applications. You can train a model in one system, express it in PMML, and then move it to another system without worrying about the implementation details of analysis and prediction, which makes model deployment...
  • The new year is always inseparable from gold... Can machine learning predict its price? The answer is yes: let's use machine learning regression algorithms to predict the price of gold, one of the most precious metals in the world. We will build a machine learning linear regression model that learns from...
  • Airbnb's trust and safety team builds machine learning models for fraud prediction; this article introduces its design ideas. Imagine a model that predicts whether certain fictional characters are "villains". The basic steps: set model expectations, build training and test sets, learn features, and evaluate model performance. Among these, features...
  • Machines Don't Learn (jqbxx.com) -... Fortunately, combining/fusing/integrating (integration/combination/fusion) multiple machine learning models can often improve overall predictive power. This is a very effective boosting technique, used in multi-classifier systems and ens...
