  • Understanding the Curse of Dimensionality

    2019-02-26 15:54:26

    Reposted from http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

    The Curse of Dimensionality in Classification

    Introduction

    In this article we will discuss the so-called "curse of dimensionality" and explain why it matters when designing a classifier. In the sections that follow I will provide an intuitive explanation of this concept, illustrated by a clear example of overfitting caused by the curse of dimensionality.

    Consider an example in which we have a set of images, each of which depicts either a cat or a dog. We would like to create a classifier that can automatically distinguish dogs from cats. To do so, we first need to think of a descriptor for each object class that can be expressed as numbers, so that a mathematical algorithm (i.e. the classifier) can use these numbers to recognize the object. We could, for instance, argue that cats and dogs generally differ in color. A possible descriptor that discriminates these two classes could then consist of three numbers: the average red, average green and average blue color of the image under consideration. A simple linear classifier, for example, could combine these features linearly to decide on the class label:

    If 0.5*red + 0.3*green + 0.2*blue > 0.6 : return cat;
    else return dog;

    However, these three color-describing numbers, called features, will obviously not suffice to obtain a perfect classification. We could therefore decide to add some features that describe the texture of the image, for instance by calculating the average edge or gradient intensity in both the X and Y direction. We now have 5 features that, combined, could be used by a classification algorithm to distinguish cats from dogs.

    To obtain an even more accurate classification, we could add more features based on color or texture histograms, statistical moments, etc. Maybe we can obtain a perfect classification by carefully defining a few hundred of these features? The answer to this question might sound a bit counter-intuitive: no, we cannot! In fact, after a certain point, increasing the dimensionality of the problem by adding new features would actually degrade the performance of the classifier. This is illustrated by Figure 1, and is often referred to as "the curse of dimensionality".

    Feature dimensionality versus classifier performance

    Figure 1. As the dimensionality increases, the classifier's performance increases until the optimal number of features is reached. Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.

    In the next sections we will review why the above is the case, and how the curse of dimensionality can be avoided.

    The Curse of Dimensionality and Overfitting

    In the cats-and-dogs example that was introduced earlier, let us assume there is an infinite number of cats and dogs living on our planet. However, due to our limited time and processing power, we were only able to obtain 10 pictures of cats and dogs. The end goal of classification is then to train a classifier based on these 10 training instances that is able to correctly classify the infinite number of dog and cat instances that we do not have any information about.

    Now let us use a simple linear classifier and try to obtain a perfect classification. We can start with a single feature, e.g. the average "red" color in the image:

    A 1-D classification problem

    Figure 2. A single feature does not result in a perfect separation of our training data.

    Figure 2 shows that no perfect classification result can be obtained if only a single feature is used. Therefore, we might decide to add another feature, e.g. the average "green" color in the image:

    A 2-D classification problem

    Figure 3. Adding a second feature still does not result in a linearly separable classification problem: no single line can separate all cats from all dogs in this example.

    Finally we decide to add a third feature, e.g. the average "blue" color in the image, yielding a three-dimensional feature space:

    A 3-D classification problem

    Figure 4. Adding a third feature results in a linearly separable classification problem in our example. A plane exists that perfectly separates dogs from cats.

    In the three-dimensional feature space we can now find a plane that perfectly separates dogs from cats. This means that a linear combination of the three features can be used to obtain perfect classification results on our training data of 10 images:

    A linearly separable classification problem

    Figure 5. The more features we use, the higher the likelihood that we can successfully separate the classes.

    The above illustrations might seem to suggest that increasing the number of features until a perfect classification result is obtained is the best way to train a classifier, whereas in the introduction, illustrated by Figure 1, we argued that this is not the case. Note, however, how the density of the training samples decreases exponentially when we increase the dimensionality of the problem.

    In the 1-D case (Figure 2), 10 training instances cover the complete 1-D feature space, the width of which is 5 unit intervals. Therefore, in the 1-D case, the sample density is 10/5 = 2 samples per interval. In the 2-D case (Figure 3), however, we still have 10 training instances, which now cover a 2-D feature space with an area of 5x5 = 25 unit squares. Therefore, in the 2-D case, the sample density is 10/25 = 0.4 samples per interval. Finally, in the 3-D case, the 10 samples have to cover a feature space volume of 5x5x5 = 125 unit cubes. Therefore, in the 3-D case, the sample density is 10/125 = 0.08 samples per interval.
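    The same calculation is easy to script. A minimal R sketch, assuming the numbers used above (10 training samples and a feature range of 5 unit intervals per dimension):

    # Sample density for a fixed training set as the dimensionality grows.
    n_samples <- 10
    range_per_dim <- 5
    dims <- 1:5
    density <- n_samples / range_per_dim^dims
    data.frame(dimension = dims, density = density)
    # dimension 1: 2.00, dimension 2: 0.40, dimension 3: 0.08, ...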

    If we keep adding features, the dimensionality of the feature space grows and becomes sparser and sparser. Due to this sparsity, it becomes much easier to find a separating hyperplane, because the likelihood that a training sample lies on the wrong side of the best hyperplane becomes infinitely small when the number of features becomes infinitely large. However, if we project the highly dimensional classification result back to a lower-dimensional space, a serious problem associated with this approach becomes evident:

    Overfitting

    Figure 6. Using too many features results in overfitting. The classifier starts learning exceptions that are specific to the training data and does not generalize well when new data is encountered.

    Figure 6 shows the 3-D classification results projected onto a 2-D feature space. Whereas the data was linearly separable in the 3-D space, this is not the case in the lower-dimensional feature space. In fact, adding the third dimension to obtain a perfect classification result simply corresponds to using a complicated non-linear classifier in the lower-dimensional feature space. As a result, the classifier learns the specific instances and exceptions of our training dataset. Because of this, the resulting classifier would fail on real-world data, consisting of an infinite amount of unseen cats and dogs that generally do not adhere to these exceptions.

    This concept is called overfitting and is a direct result of the curse of dimensionality. Figure 7 shows the result of a linear classifier that has been trained using only 2 features instead of 3:

    A linear classifier

    Figure 7. Although the training data is not classified perfectly, this classifier achieves better results on unseen data than the one from Figure 5.

    Although the simple linear classifier with its decision boundary shown in Figure 7 seems to perform worse than the non-linear classifier in Figure 5, this simple classifier generalizes much better to unseen data, because it did not learn specific exceptions that were only present in our training data by coincidence. In other words, by using fewer features, the curse of dimensionality was avoided such that the classifier did not overfit the training data.

    The explanation below is a classic one.

    Figure 8 illustrates the above in a different manner. Suppose we want to train a classifier using only a single feature whose value ranges from 0 to 1, and let's assume this feature is unique for each cat and dog. If we want our training data to cover 20% of this range, then the amount of training data needed is 20% of the complete population of cats and dogs. Now, if we add another feature, resulting in a 2-D feature space, things change: to cover 20% of the 2-D feature range, we now need to obtain 45% of the complete population of cats and dogs in each dimension (0.45^2 = 0.2). In the 3-D case this gets even worse: to cover 20% of the 3-D feature range, we need to obtain 58% of the population in each dimension (0.58^3 = 0.2).

    The amount of training data grows exponentially with the number of dimensions

    Figure 8. The amount of training data needed to cover 20% of the feature range grows exponentially with the number of dimensions.
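    These per-dimension fractions can be checked directly. A small R sketch, assuming we always want to cover 20% of the feature range:

    # Fraction of the population needed per dimension to cover 20% of the
    # feature range, i.e. the d-th root of 0.2.
    coverage <- 0.2
    dims <- 1:5
    round(coverage^(1 / dims), 2)
    # 0.20 0.45 0.58 0.67 0.72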

    In other words, if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.

    In the above example, we showed that the curse of dimensionality introduces sparseness of the training data. The more features we use, the sparser the data becomes, such that accurately estimating the classifier's parameters (i.e. its decision boundaries) becomes more difficult. Another effect of the curse of dimensionality is that this sparseness is not uniformly distributed over the search space. In fact, data around the origin (at the center of the hypercube) is much more sparse than data in the corners of the search space. This can be understood as follows:

    Imagine a unit square that represents the 2-D feature space. The average of the feature space is the center of this unit square, and all points within unit distance from this center are inside a unit circle that inscribes the unit square. The training samples that do not fall within this unit circle are closer to the corners of the search space than to its center. These samples are difficult to classify because their feature values differ greatly (e.g. samples in opposite corners of the unit square). Therefore, classification is easier if most samples fall inside the inscribed unit circle, as illustrated by Figure 9:

    Features within unit distance of their average fall inside a unit circle

    Figure 9. Training samples that fall outside the unit circle are in the corners of the feature space and are harder to classify than samples near the center of the feature space.

    An interesting question is now how the volume of the circle (hypersphere) changes relative to the volume of the square (hypercube) when we increase the dimensionality of the feature space. The volume of a unit hypercube of dimension d is always 1^d = 1. The volume of the inscribed hypersphere of dimension d and radius 0.5 can be calculated as:

    (1) V(d) = \frac{\pi ^{\frac{d}{2}}}{\Gamma (\frac{d}{2}+1)}0.5^d

    Figure 10 shows how the volume of this hypersphere changes when the dimensionality increases:

    The volume of the hypersphere tends to zero as the dimensionality increases

    Figure 10. As the number of dimensions grows, the volume of the hypersphere tends to zero.
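    The trend in Figure 10 is easy to reproduce numerically. A short R sketch of equation (1), assuming a hypersphere of radius 0.5 inscribed in the unit hypercube; gamma() is R's Gamma function:

    # Volume of the radius-0.5 hypersphere inscribed in the unit hypercube,
    # following equation (1). The unit hypercube itself always has volume 1.
    hypersphere_volume <- function(d, r = 0.5) {
      pi^(d / 2) / gamma(d / 2 + 1) * r^d
    }
    round(hypersphere_volume(1:10), 4)
    # 1.0000 0.7854 0.5236 0.3084 0.1645 0.0807 0.0369 0.0159 0.0064 0.0025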

    This shows that the volume of the hypersphere tends to zero as the dimensionality tends to infinity, whereas the volume of the surrounding hypercube remains constant. This surprising and rather counter-intuitive observation partially explains the problems associated with the curse of dimensionality in classification: in high-dimensional spaces, most of the training data resides in the corners of the hypercube that defines the feature space. As mentioned before, instances in the corners of the feature space are much more difficult to classify than instances around the centroid of the hypersphere. This is illustrated by Figure 11, which shows a 2-D unit square, a 3-D unit cube, and a creative visualization of an 8-D hypercube with 2^8 = 256 corners:

    Highly dimensional feature spaces are sparse around their origin

    Figure 11. As the dimensionality increases, a larger percentage of the training data resides in the corners of the feature space.

    For an 8-dimensional hypercube, about 98% of the data is concentrated in its 256 corners. As a result, when the dimensionality of the feature space goes to infinity, the ratio of the difference between the minimum and maximum Euclidean distance from a sample point to the centroid, and the minimum distance itself, tends to zero:

    (2) \lim_{d\rightarrow \infty }\frac{dist_{max}-dist_{min}}{dist_{min}}\rightarrow 0

    Therefore, distance measures start losing their effectiveness as a measure of dissimilarity in highly dimensional spaces. Since classifiers depend on these distance measures (e.g. Euclidean distance, Mahalanobis distance, Manhattan distance), classification is often easier in lower-dimensional spaces where fewer features are used to describe the object of interest. Similarly, Gaussian likelihoods become flat and heavy-tailed distributions in high-dimensional spaces, such that the ratio of the difference between the minimum and maximum likelihood and the minimum likelihood itself tends to zero.
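    The distance concentration described by equation (2) can also be observed empirically. A minimal R sketch, assuming points drawn uniformly from the unit hypercube and Euclidean distances measured to its centroid:

    # Ratio of (max - min) to min distance from random points to the centroid;
    # it shrinks steadily as the dimensionality d grows.
    set.seed(42)
    concentration <- function(d, n = 1000) {
      x <- matrix(runif(n * d), nrow = n)           # n points in the unit hypercube
      dist_to_center <- sqrt(rowSums((x - 0.5)^2))  # Euclidean distance to centroid
      (max(dist_to_center) - min(dist_to_center)) / min(dist_to_center)
    }
    sapply(c(2, 10, 100, 1000), concentration)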

    How to Avoid the Curse of Dimensionality?

    Figure 1 showed that the performance of a classifier decreases when the dimensionality of the problem becomes too large. The question then is what "too large" means, and how overfitting can be avoided. Regrettably, there is no fixed rule that defines how many features should be used in a classification problem. In fact, this depends on the amount of training data available (the number of features and the number of samples are related), the complexity of the decision boundaries, and the type of classifier used.

    If a theoretically infinite number of training samples were available, the curse of dimensionality would not apply and we could simply use an infinite number of features to obtain a perfect classification. The smaller the size of the training data, the fewer features should be used. If N training samples suffice to cover a 1-D feature space of unit interval size, then N^2 samples are needed to cover a 2-D feature space with the same density, and N^3 samples are needed in a 3-D feature space. In other words, the number of training instances needed grows exponentially with the number of dimensions used.

    Furthermore, classifiers that tend to model non-linear decision boundaries very accurately (e.g. neural networks, KNN classifiers, decision trees) do not generalize well and are prone to overfitting. Therefore, the dimensionality should be kept relatively low when these classifiers are used. If a classifier that generalizes easily is used (e.g. naive Bayes, a linear classifier), then the number of features can be higher, since the classifier itself is less expressive. Figure 6 showed that using a simple classifier model in a high-dimensional space corresponds to using a complex classifier model in a lower-dimensional space.

    Therefore, overfitting occurs both when estimating relatively few parameters in a highly dimensional space and when estimating a lot of parameters in a lower-dimensional space. As an example, consider a Gaussian density function, parameterized by its mean and covariance matrix. Say we are operating in a 3-D space, such that the covariance matrix is a 3x3 symmetric matrix consisting of 6 unique elements (3 variances on the diagonal and 3 covariances off-diagonal). Together with the 3-D mean of the distribution, this means that we need to estimate 9 parameters from the training data to obtain the Gaussian density that represents the likelihood of our data. In the 1-D case, only 2 parameters need to be estimated (mean and variance), whereas in the 2-D case 5 parameters are needed (the 2-D mean, two variances and a covariance). Once again we see that the number of parameters to be estimated grows with the number of dimensions.
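    The pattern generalizes: a d-dimensional Gaussian with a full covariance matrix has d mean parameters plus d(d+1)/2 unique covariance entries. A quick R check of the counts quoted above:

    # Free parameters of a d-dimensional Gaussian with full covariance:
    # d for the mean plus d*(d+1)/2 unique entries of the symmetric covariance.
    gaussian_params <- function(d) d + d * (d + 1) / 2
    sapply(1:3, gaussian_params)
    # 2 5 9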

    In a previous article we showed that the variance of a parameter estimate increases if the number of parameters to be estimated increases (and if the bias of the estimate and the amount of training data are kept constant). This means that the quality of our parameter estimates decreases as the dimensionality goes up, due to this increase in variance. An increase in classifier variance corresponds to overfitting.

    Another interesting question is which features should be used. Given a set of N features, how do we select an optimal subset of M features such that M < N? One approach would be to search for the optimum in the curve shown in Figure 1. Since it is often intractable to train and test classifiers for all possible combinations of all features, several methods exist that try to find this optimum in different ways. These methods are called feature selection algorithms and often employ heuristics (greedy methods, best-first methods, etc.) to locate the optimal number and combination of features.

    Another approach is to replace the set of N features by a set of M features, each of which is a combination of the original feature values. Algorithms that try to find the optimal linear or non-linear combination of original features to reduce the dimensionality of the final problem are called feature extraction methods. A well-known dimensionality reduction technique is Principal Component Analysis (PCA), which yields uncorrelated linear combinations of the original N features. PCA tries to find a linear subspace of lower dimensionality such that the largest variance of the original data is kept. Note, however, that the largest variance of the data does not necessarily represent the most discriminative information.

    Finally, a valuable technique that is used during classifier training to detect and avoid overfitting is cross-validation. Cross-validation approaches split the original training data into one or more training subsets. During classifier training, one subset is used to test the accuracy and precision of the resulting classifier, while the others are used for parameter estimation. If the classification results on the subsets used for training differ greatly from the results on the subset used for testing, overfitting is at play. Several types of cross-validation, such as k-fold cross-validation and leave-one-out cross-validation, can be used when only a limited amount of training data is available.

    Diagram of k-fold cross-validation with k=4
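    A minimal R sketch of the k-fold procedure in the diagram, assuming k = 4, a hypothetical data frame df, and hypothetical train_model() / evaluate() helpers standing in for whichever classifier is being validated:

    # k-fold cross-validation skeleton; df, train_model() and evaluate()
    # are hypothetical placeholders for your own data and classifier.
    k <- 4
    folds <- sample(rep(1:k, length.out = nrow(df)))  # assign each row to a fold
    scores <- numeric(k)
    for (i in 1:k) {
      train <- df[folds != i, ]           # k-1 folds for parameter estimation
      test  <- df[folds == i, ]           # held-out fold for testing
      model <- train_model(train)         # hypothetical training routine
      scores[i] <- evaluate(model, test)  # hypothetical held-out accuracy
    }
    mean(scores)  # a large train/test gap signals overfitting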

    Conclusion

    In this article we discussed the importance of feature selection, feature extraction and cross-validation in order to avoid overfitting due to the curse of dimensionality. Using a simple example, we reviewed an important effect of the curse of dimensionality in classifier training, namely overfitting.

  • The Curse of Dimensionality

    The Curse of Dimensionality


    The curse of dimensionality! What on earth is that? Besides being a prime example of shock-and-awe names in machine learning jargon (which often sound far fancier than they are), it’s a reference to the effect that adding more features has on your dataset. In a nutshell, the curse of dimensionality is all about loneliness.

    维度诅咒!那到底是什么?除了是机器学习术语中那些唬人名字的典型代表(这类名字听起来往往比实际内容花哨得多)之外,它指的是向数据集添加更多特征所带来的影响。简而言之,维度诅咒讲的就是孤独。

    In a nutshell, the curse of dimensionality is all about loneliness.

    简而言之,维度的诅咒全都与孤独有关。

    Before I explain myself, let’s get some basic jargon out of the way. What’s a feature? It’s the machine learning word for what other disciplines might call a predictor / (independent) variable / attribute / signal. Information about each datapoint, in other words. Here’s a jargon intro if none of those words felt familiar.

    在我解释之前,先弄清楚一些基本术语。什么是特征(feature)?这是机器学习里的说法,其他学科可能称之为预测变量/(自)变量/属性/信号。换句话说,就是描述每个数据点的信息。如果这些词听起来都很陌生,这里有一份术语入门。

    Data social distancing is easy: just add a dimension. But for some algorithms, you may find that this is a curse…

    让数据保持“社交距离”很容易:只需添加一个维度。但对某些算法来说,你可能会发现这是一种诅咒……

    When a machine learning algorithm is sensitive to the curse of dimensionality, it means the algorithm works best when your datapoints are surrounded in space by their friends. The fewer friends they have around them in space, the worse things get. Let’s take a look.

    当一个机器学习算法对维度诅咒敏感时,意味着这个算法在数据点被空间中的“朋友”包围时效果最好。它们在空间中周围的朋友越少,情况就越糟。让我们来看看。

    一维 (One dimension)

    Imagine you’re sitting in a large classroom, surrounded by your buddies.

    想象一下,您坐在一个大教室里,周围被好友们包围着。

    You’re a datapoint, naturally. Let’s put you in one dimension by making the room dark and shining a bright light from the back of the room at you. Your shadow is projected onto a line on the front wall. On that line, it’s not lonely at all. You and your crew are sardines in a can, all lumped together. It’s cozy in one dimension! Perhaps a little too cozy.

    你自然就是一个数据点。把房间调暗,从房间后面向你打一束强光,这样就把你放进了一个维度:你的影子被投射到前墙上的一条线上。在那条线上一点也不孤单——你和你的小伙伴们就像罐头里的沙丁鱼,全挤在一起。一维空间很温馨!也许有点太温馨了。


    二维 (Two dimensions)

    To give you room to breathe, let’s add a dimension. We’re in 2D and the plane is the floor of the room. In this space, you and your friends are more spread out. Personal space is a thing again.

    为了给你一点喘息的空间,让我们添加一个维度。现在我们处于二维空间,这个平面就是房间的地板。在这个空间里,你和你的朋友们分散开了,个人空间又回来了。


    Note: If you prefer to follow along in an imaginary spreadsheet, think of adding/removing a dimension as inserting/deleting a column of numbers.

    注意: 如果您喜欢在虚构的电子表格中进行操作,请考虑将尺寸添加/删除视为插入/删除数字列。

    三维 (Three dimensions)

    Let’s add a third dimension by randomly sending each of you to one of the floors of the 5-floor building you were in.

    让我们通过将每个人随机发送到您所在的5层建筑物的一层来增加第三个维度。


    All of a sudden, you’re not so densely surrounded by friends anymore. It’s lonely around you. If you enjoyed the feeling of a student in nearly every seat, chances are you’re now mournfully staring at quite a few empty chairs. You’re beginning to get misty eyed, but at least one of your buddies is probably still near you…

    突然之间,你不再被朋友们紧紧包围了,周围变得冷清。如果你喜欢几乎每个座位都坐着同学的感觉,那么你现在多半正哀伤地盯着一排排空椅子。你的眼眶开始湿润,不过至少还有一个伙伴可能仍在你身边……


    四个维度 (Four dimensions)

    Not for long! Let’s add another dimension. Time.

    不是很长! 让我们添加另一个维度。 时间。


    The students are spread among 60min sections of this class (on various floors) at various times — let’s limit ourselves to 9 sessions because lecturers need sleep and, um, lives. So, if you were lucky enough to still have a companion for emotional support before, I’m fairly confident you’re socially distanced now. If you can’t be effective when you’re lonely, boom! We have our problem. The curse of dimensionality has struck!

    学生们被分散到这门课在不同时间段的 60 分钟课次中(还在不同的楼层)——我们把课次限制在 9 节,因为讲师也需要睡觉,呃,还需要生活。所以,如果你之前还有幸留着一个能互相打气的同伴,我敢肯定你现在已经被彻底“社交隔离”了。如果你一孤独就无法正常发挥,砰!问题就来了。维度诅咒降临了!


    更多维度 (MOAR dimensions)

    As we add dimensions, you get lonely very, very quickly. If we want to make sure that every student is just as surrounded by friends as they were in 2D, we’re going to need students. Lots of them.

    随着维度的增加,你会非常非常快地变得孤独。如果我们想确保每个学生都像在二维时那样被朋友包围,我们就需要更多的学生。非常多的学生。


    The most important idea here is that we have to recruit more friends exponentially, not linearly, to keep your blues at bay.

    这里最重要的一点是:我们必须以指数方式而不是线性方式招募更多的朋友,才能驱散你的孤独感。

    If we add two dimensions, we can’t simply compensate with two more students… or even two more classrooms’ worth of students. If we started with 50 students in the room originally and we added 5 floors and 9 classes, we need 5x9=45 times more students to keep one another as much company as 50 could have done. So, we need 45x50=2,250 students to avoid loneliness. That’s a whole lot more than one extra student per dimension! Data requirements go up quickly.

    如果我们增加两个维度,就不能简单地只补充两名学生……甚至补充两个教室的学生都不够。如果教室里最初有 50 名学生,而我们又增加了 5 个楼层和 9 节课,那么我们需要 5x9=45 倍的学生,才能让大家享有原来 50 个人时的那种陪伴。也就是说,我们需要 45x50=2250 名学生来避免孤独。这可远不止每个维度多加一名学生!数据需求增长得很快。
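    The arithmetic can be written out directly. A tiny R check, assuming the 50 students, 5 floors and 9 time slots used above:

    # Students needed to keep the original 2-D level of company after adding
    # two dimensions with 5 and 9 possible values respectively.
    50 * 5 * 9   # 2250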

    When you add dimensions, minimum data requirements can grow rapidly.

    添加维度时,最低数据需求量可能会迅速增长。

    We need to recruit many, many more students (datapoints) every time we go up a dimension. If data are expensive for you, this curse is really no joke!

    每增加一个维度,我们都需要招募多得多的学生(数据点)。如果获取数据对你来说代价高昂,这个诅咒可真不是开玩笑!

    维度“戏精” (Dimensional divas)

    Not all machine learning algorithms get so emotional when confronted with a bit of me-time. Methods like k-NN are complete divas, of course. It’s hardly a surprise for a method whose name abbreviation stands for k-Nearest Neighbors — it’s about computing things about neighboring datapoints, so it’s rather important that the datapoints are neighborly.

    并非所有机器学习算法在需要“独处”时都会如此情绪化。当然,像 k-NN 这样的方法绝对是十足的“戏精”。对于一个名字缩写就代表 k 近邻(k-Nearest Neighbors)的方法来说,这并不意外——它靠计算相邻数据点的信息来工作,所以数据点彼此相邻非常重要。

    Other methods are a lot more robust when it comes to dimensions. If you’ve taken a class on linear regression, for example, you’ll know that once you have a respectable number of datapoints, gaining or dropping a dimension isn’t going to making anything implode catastrophically. There’s still a price — it’s just more affordable.*

    在维度方面,其他一些方法要稳健得多。比如,如果你上过线性回归的课,你就会知道,只要数据点的数量足够可观,增加或减少一个维度并不会带来灾难性的崩溃。代价仍然存在——只是更容易承受。*

    *Which doesn’t mean it is resilient to all abuse! If you’ve never known the chaos that including a single outlier or adding one near-duplicate feature can unleash on the least squares approach (the Napoleon of crime, Multicollinearity, strikes again!) then consider yourself warned. No method is perfect for every situation. And, yes, that includes neural networks.

    *这并不意味着它经得起任何折腾!如果你还没见识过单个离群点或一个近乎重复的特征能给最小二乘法带来怎样的混乱(犯罪界的拿破仑——多重共线性——再次出击!),那就当我提醒过你了。没有哪种方法适用于所有情况。是的,神经网络也不例外。

    你应该怎么做? (What should you do about it?)

    What are you going to do about the curse of dimensionality in practice? If you’re a machine learning researcher, you’d better know if your algorithm has this problem… but I’m sure you already do. You’re probably not reading this article, so we’ll just talk about you behind your back, shall we? But yeah, you might like to think about whether it’s possible to design the algorithm you’re inventing to be less sensitive to dimension. Many of your customers like their matrices on the full-figured side**, especially if things are getting textual.

    在实践中,你要怎么应对维度诅咒?如果你是机器学习研究者,你最好清楚自己的算法有没有这个问题……不过我相信你早就知道了。你大概也不会读这篇文章,所以我们就在背后议论你一下,好吗?但说真的,你不妨想想能否把你正在发明的算法设计得对维度不那么敏感。你的许多用户都喜欢“丰满”的矩阵**,尤其是在处理文本数据的时候。

    **Conventionally, we arrange data in a matrix so that the rows are examples and the columns are features. In that case, a tall and skinny matrix has lots of examples spread over few dimensions.

    **按照惯例,我们把数据排成一个矩阵,行是样本,列是特征。在这种排法下,一个又高又瘦的矩阵意味着大量样本分布在少数几个维度上。

    If you’re an applied data science enthusiast, you’ll do what you always do — get a benchmark of the algorithm’s performance using just one or a few promising features before attempting to throw the kitchen sink at it. (I’ll explain why you need that habit in another post, if you want a clue in the meantime, look up the term overfitting.)

    如果你是应用数据科学爱好者,你会做你一贯会做的事——先只用一两个有希望的特征为算法建立性能基准,然后再把所有东西一股脑塞进去。(我会在另一篇文章里解释为什么需要这个习惯;如果你现在就想要线索,可以去查“过拟合(overfitting)”这个词。)

    Some methods only work well on tall, skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

    有些方法只在又高又瘦的数据集上表现良好,所以如果你觉得自己被诅咒了,可能需要给数据集“瘦瘦身”。

    If your method works decently on a limited number of features and then blows a raspberry at you when you increase the dimensions, that’s your cue to either stick to a few features you handpick (or even stepwise-select if you’re getting crafty) or first make a few superfeatures out of your original kitchen sink by running some cute feature engineering techniques (you could try anything from old school things like principal component analysis (PCA) — still relevant today, eigenvectors never go out of fashion — to more modern things like autoencoders and other neural network funtimes). You don’t really need to know the term curse of dimensionality to get your work done because your process — start small and build up the complexity — should take care of it for you, but if it was bothering you… now you can shrug off the worry.

    如果你的方法在少量特征上表现不错,而一旦增加维度就开始对你“翻脸”,那就是在提示你:要么只保留少数手工挑选的特征(讲究一点的话,也可以做逐步选择),要么先用一些巧妙的特征工程技术,把原始的“大杂烩”压缩成几个超级特征(可以尝试的东西很多,从主成分分析(PCA)这类老派方法——至今仍然有用,特征向量永远不会过时——到自动编码器等更现代的神经网络玩法)。其实你并不需要知道“维度诅咒”这个词也能把工作做好,因为你的流程——从小处着手、逐步增加复杂度——会替你把它处理好;但如果它曾让你困扰……现在你可以放下这份担忧了。


    To summarize: As you add more and more features (columns), you need an exponentially-growing amount of examples (rows) to overcome how spread out your datapoints are in space. Some methods only work well on long skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

    总结:随着你添加越来越多的特征(列),你需要数量呈指数增长的样本(行),才能弥补数据点在空间中的分散。有些方法只在又长又瘦的数据集上表现良好,所以如果你觉得自己被诅咒了,可能需要给数据集“瘦瘦身”。


    谢谢阅读! 喜欢作者吗? (Thanks for reading! Liked the author?)

    If you’re keen to read more of my writing, most of the links in this article take you to my other musings. Can’t choose? Try this one:

    如果您希望阅读更多我的作品,那么本文中的大多数链接都将带您进入我的其他想法。 无法选择? 试试这个:

    翻译自: https://towardsdatascience.com/the-curse-of-dimensionality-minus-the-curse-of-jargon-520da109fc87


  • The Curse of Dimensionality

    The Curse of Dimensionality

    How do machines ‘see’? Or, in general, how can computers reduce an input of complex, high-dimensional data into a more manageable number of features?

    机器如何“看到”? 或者,通常,计算机如何将复杂的高维数据的输入减少为更易于管理的功能?

    Extend your open hand in front of a nearby light-source, so that it casts a shadow against the nearest surface. Rotate your hand and study how its shadow changes. Note that from some angles it casts a narrow, thin shadow. Yet from other angles, the shadow looks much more recognizably like the shape of a hand.

    将您的张开的手伸到附近的光源前面,以使阴影投射到最近的表面上。 旋转您的手并研究其阴影如何变化。 请注意,从某些角度来看,它会投射出狭窄的细阴影。 但是从其他角度看,阴影看起来更像是手的形状。

    See if you can find the angle which best projects your hand. Preserve as much of the information about its shape as possible.

    看看你能不能找到最能呈现你这只手的投影角度,尽可能多地保留它的形状信息。

    Behind all the linear algebra and computational methods, this is what dimensionality reduction seeks to do with high-dimensional data. Through rotation you can find the optimal angle which represents your 3-D hand as a 2-D shadow.

    在所有线性代数和计算方法的背后, 就是降维试图对高维数据进行的处理。 通过旋转,您可以找到将您的3-D手表示为2-D阴影的最佳角度。

    There are statistical techniques which can find the best representation of data in a lower-dimensional space than that in which it was originally provided.

    有一些统计技术可以在比最初提供数据的维度更低的空间中找到最佳的数据表示形式。

    In this article, we will see why this is an often necessary procedure, via a tour of mind-bending geometry and combinatorics. Then, we will examine the code behind a range of useful dimensionality reduction algorithms, step-by-step.

    在本文中,我们将通过弯曲思维的几何学和组合学来了解为什么这是经常必要的过程。 然后,我们将逐步检查一系列有用的降维算法背后的代码。

    My aim is to make these often difficult concepts more accessible to the general reader — anyone with an interest in how data science and machine learning techniques are fast changing the world as we know it.

    我的目的是让普通读者——任何对数据科学和机器学习技术如何迅速改变我们所熟知的世界感兴趣的人——更容易理解这些通常颇有难度的概念。

    Semi-supervised machine learning is a hot topic in the field of data science, and for good reason. Combining the latest theoretical advances with today’s powerful hardware is a recipe for exciting breakthroughs and science fiction evoking headlines.

    半监督机器学习是数据科学领域的热门话题,而且理由充分。把最新的理论进展与当今强大的硬件相结合,既能带来激动人心的突破,也能造就科幻色彩十足的新闻头条。

    We may attribute some of its appeal to how it approximates our own human experience of learning about the world around us.

    我们可以将它的某些吸引力归因于它如何近似我们人类对周围世界的学习经验。

    The high-level idea is straightforward: given information about a set of labelled “training” data, how can we generalize and make accurate inferences about a set of previously “unseen” data?

    高层次的想法很简单:给定有关一组标记的“训练”数据的信息,我们如何才能对一组先前“看不见的”数据进行归纳并做出准确的推断?

    Machine learning algorithms are designed to implement this idea. They use a range of different assumptions and input data types. These may be simplistic like K-means clustering. Or complex like Latent Dirichlet Allocation.

    机器学习算法就是为实现这一想法而设计的。它们使用各种不同的假设和输入数据类型,可能像 K-means 聚类一样简单,也可能像隐含狄利克雷分配(Latent Dirichlet Allocation)一样复杂。

    Behind all semi-supervised algorithms though are two key assumptions: continuity and embedding. These relate to the nature of the feature space in which the data are described. Below is a visual representation of data points in a 3-D feature space.

    在所有半监督算法的背后,有两个关键假设: 连续性嵌入 。 这些与描述数据的特征空间的性质有关。 下面是3-D特征空间中数据点的直观表示。

    Higher dimensional feature spaces can be thought of as scatter graphs with more axes than we can draw or visualize. The math remains more or less the same!

    高维特征空间可视为散点图,其散布图的轴数超出了我们可以绘制或可视化的范围。 数学基本保持不变!

    Continuity is the idea that similar data points such as those which are near to each other in ‘feature space’ are more likely to share the same label. Did you notice in the scatter graph above that nearby points are similarly colored? This assumption is the basis for a set of machine learning algorithms called clustering algorithms.

    连续性是这样的想法,即类似的数据点(例如在“特征空间”中彼此靠近的数据点)更有可能共享相同的标签。 您是否在上方的散点图中注意到附近的点也有类似的颜色? 该假设是一组称为聚类算法的机器学习算法的基础

    Embedding is the assumption that although the data may be described in a high-dimensional feature space such as a ‘scatter-graph-with-too-many-axes-to-draw’, the underlying structure of the data is likely much lower-dimensional.

    嵌入假设是指:尽管数据可能是在高维特征空间(例如“坐标轴多到画不出来的散点图”)中描述的,但数据的底层结构很可能维度要低得多。

    For example, in the scatter graph above we have shown the data in 3-D feature space. But the points fall more or less along a 2-D plane.

    例如,在上面的散点图中,我们把数据画在三维特征空间里,但这些点大致都落在一个二维平面上。

    Embedding allows us to effectively simplify our data by looking for its underlying structure.

    嵌入使我们能够通过查找其底层结构来有效简化数据。

    那么,关于这个诅咒……? (So, about this curse…?)

    Apart from having both the coolest and scariest sounding name in all data science, the phenomena collectively known as the Curse of Dimensionality also pose real challenges to practitioners in the field.

    除了在所有数据科学中都拥有最酷,最恐怖的名字外,被统称为“维诅咒”的现象也给该领域的从业人员带来了真正的挑战。

    Although somewhat on the melodramatic side, the title reflects an unavoidable reality of working with high-dimensional data sets. This includes those where each point of data is described by many measurements or ‘features’.

    尽管这个名字多少有些夸张,但它反映了处理高维数据集时无法回避的现实——也就是那些每个数据点都由许多测量值或“特征”来描述的数据集。

    The general theme is simple — the more dimensions you work with, the less effective standard computational and statistical techniques become. This has repercussions that need some serious workarounds when machines are dealing with Big Data. Before we dive into some of these solutions, let’s discuss the challenges raised by high-dimensional data in the first place.

    总体主题很简单-您使用的维度越多,标准的计算和统计技术的效力就越差。 当机器处理大数据时,这会产生一些需要认真解决的后果。 在深入探讨其中一些解决方案之前,让我们首先讨论高维数据带来的挑战。

    计算工作量 (Computational Workload)

    Working with data becomes more demanding as the number of dimensions increases. Like many challenges in data science, this boils down to combinatorics.

    随着维度数量的增加,处理数据的要求也越来越高。 像数据科学中的许多挑战一样,这可以归结为组合学

    With n = 1, there are only 5 boxes to search. With n = 2, there are now 25 boxes; and with n = 3, there are 125 boxes to search. As n gets bigger, it becomes difficult to sample all the boxes. This makes the treasure harder to find — especially as many of the boxes are likely to be empty!

    n = 1时,仅搜索5个框。 当n = 2时,现在有25个盒子; 在n = 3的情况下,有125个搜索框。 随着n变大,对所有盒子进行采样变得困难。 这使宝藏更难找到-尤其是许多盒子可能是空的!

    In general, with n dimensions each allowing for m states, we will have m^n possible combinations. Try plugging in a few different values and you will be convinced that this presents a workload-versus-sampling challenge to machines tasked with repeatedly sampling different combinations of variables.

    通常,在n个维中每个都允许m个状态的情况下,我们将有m ^ n个可能的组合。 尝试插入几个不同的值,您将确信,这对负责重复采样不同变量组合的机器提出了工作量与采样挑战。
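    To see how quickly this blows up, a quick R sketch, assuming m = 5 states per dimension as in the treasure-hunt example above:

    # Number of boxes to search with m states per dimension and n dimensions.
    m <- 5
    n <- 1:8
    data.frame(dimensions = n, boxes = m^n)
    # 5, 25, 125, 625, 3125, 15625, 78125, 390625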

    With high-dimensional data, we simply cannot comprehensively sample all the possible combinations, leaving vast regions of feature space in the dark.

    对于高维数据,我们根本无法对所有可能的组合进行全面采样,而将广阔的特征空间区域留在黑暗中。

    维度冗余 (Dimensional Redundancy)

    We may not even need to subject our machines to such demanding work. Having many dimensions is no guarantee that every dimension is especially useful . A lot of the time, we may be measuring the same underlying pattern in several different ways.

    我们甚至不需要使我们的机器经受如此艰巨的工作。 拥有多个维度并不能保证每个维度都特别有用。 很多时候,我们可能会以几种不同的方式来衡量相同的基础模式。

    For instance, we could look at data about professional football or soccer players. We may describe each player in six dimensions.

    例如,我们可以查看有关职业足球或足球运动员的数据。 我们可以用六个维度来描述每个玩家。

    This could be in terms of:

    可以是:

    • number of goals scored

      进球数
    • number of of shots attempted

      尝试拍摄的次数
    • number of chances created

      创造的机会数
    • number of tackles won

      铲球数量
    • number of blocks made

      块数
    • number of clearances made

      清关次数

    There are six dimensions. Yet you might see that we are actually only describing two underlying qualities — offensive and defensive ability — from a number of angles.

    有六个维度。 但是您可能会看到,我们实际上只是从多个角度描述了两种基本素质,即进攻能力和防守能力。

    This is an example of the embedding assumption we discussed earlier. High dimensional data often has a much lower-dimensional underlying structure.

    这是我们前面讨论的嵌入假设的示例。 高维数据通常具有低维的基础结构。

    In this case, we’d expect to see strong correlations between some of our dimensions. Goals scored and shots attempted will unlikely be independent of one another. Much of the information in each dimension is already contained in some of the others.

    在这种情况下,我们希望看到我们的某些维度之间有很强的相关性。 进球射门都不可能彼此独立。 每个维度中的许多信息已经包含在其他一些维度中。

    Often high-dimensional data will show such behavior. Many of the dimensions are, in some sense, redundant.

    高维数据通常会显示这种行为。 从某种意义上说,许多维度都是多余的。

    Highly correlated dimensions can harmfully impact other statistical techniques which rely upon assumptions of independence. This could lead to much-dreaded problems such as over-fitting.

    高度相关的维度可能会对依赖独立性假设的其他统计技术产生有害影响。 这可能会导致严重的问题,例如过度拟合

    Many high-dimensional data sets are actually the results of lower-dimensional generative processes. The classic example is the human voice. It can produce very high-dimensional data from the movement of only a small number of vocal chords.

    许多高维数据集实际上是低维生成过程的结果。 典型的例子是人的声音 。 它仅通过少量声带的移动就可以产生非常高维的数据。

    High-dimensionality can mask the generative processes. These are often what we’re interested in learning more about.

    高维可以掩盖生成过程。 这些通常是我们有兴趣了解的更多信息。

    Not only does high-dimensionality pose computational challenges, it often does so without bringing much new information to the show.

    高维不仅带来计算上的挑战,而且往往并不能为我们带来多少新信息。

    And there’s more! Here’s where things start getting bizarre.

    还有更多! 这是事情开始变得怪异的地方。

    几何疯狂 (Geometric Insanity)

    Another problem arising from high-dimensional data concerns the effectiveness of different distance metrics, and the statistical techniques which depend upon them.

    高维数据引起的另一个问题涉及不同距离度量的有效性以及依赖于它们的统计技术。

    This is a tricky concept to grasp, because we’re so used to thinking in everyday terms of three spatial dimensions. This can be a bit of a hindrance for us humans.

    这是一个很难理解的概念,因为我们已经习惯于每天从三个空间维度来思考。 对我们人类来说这可能是一个障碍。

    Geometry starts getting weird in high-dimensional space. Not only hard-to-visualize weird, but more “WTF-is-that?!” weird.

    几何在高维空间里开始变得怪异——不只是难以可视化的那种怪,而是“这到底是什么鬼?!”级别的怪。

    Let’s begin with an example in a more familiar number of dimensions. Say you’re mailing a disc with a diameter of 10cm to a friend who likes discs. You could fit it snugly into a square envelope with sides of 10cm, leaving only the corners unused. What percentage of space in the envelope remains unused?

    让我们从一个更熟悉的维度示例开始。 假设您是将直径10厘米的光盘邮寄给喜欢该光盘的朋友。 您可以将其紧紧地塞入一个10厘米长的正方形信封中,而只剩下未使用的角落。 信封中有多少百分比的空间未使用?

    Well, the envelope has an area of 100cm² inside it, and the disc takes up 78.5398… cm² (recall the area of a circle equals πr²). In other words, the disc takes up ~78.5% of the space available. Less than a quarter remains empty in the four corners.

    好了,封套内部有一个100cm²的区域,光盘占用了78.5398…cm²(回想起来,一个圆面积等于πr² )。 换句话说,光盘占用了约78.5%的可用空间。 在四个角落中只有不到四分之一的地方空着。

    Now say you’re packaging up a ball which also has a diameter of 10cm, this time into a cube shaped box with sides of 10cm. The box has a total volume of 10³ = 1000cm³, while the ball has a volume of 523.5988… cm³ (the volume of a 3-D sphere can be calculated using 4/3 * πr³). This represents almost 52.4% of the total volume available. In other words, almost half of the box’s volume is empty space in the eight corners.

    现在,您要包装一个直径为10cm的球,这次是将其包装成一个边长为10cm的立方体形状的盒子。 盒子的总体积为10³=1000cm³,而球的体积为523.5988…cm³( 3-D球体积可以使用4/3 *πr³计算)。 这几乎占可用总量的52.4%。 换句话说,盒子的体积几乎有一半是八个角的空白区域。
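    Both percentages are quick to verify. A short R sketch, assuming a radius of 5 cm inside a 10 cm square/cube as in the example:

    # Fraction of a square/cube filled by an inscribed circle/sphere of radius r.
    r <- 5                                        # cm, half the 10 cm side
    circle_fill <- (pi * r^2) / (2 * r)^2         # 2-D: pi*r^2 over side^2
    sphere_fill <- (4/3 * pi * r^3) / (2 * r)^3   # 3-D: 4/3*pi*r^3 over side^3
    round(c(circle_fill, sphere_fill), 4)
    # 0.7854 0.5236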

    See these examples below:

    请参阅以下示例:

    The volume of a sphere in 3-D is smaller in example B than that of a circle in the 2-D example B. The center of a cube is smaller than the center of a square with the same length side. Does this pattern continue in more than three dimensions? Or when we’re dealing with hyper-spheres and hyper-cubes? Where do we even begin?

    在示例3中,3-D中的球体的体积小于示例2B中的圆的体积。立方体的中心小于在相同长度边上的正方形的中心。 这种模式是否会在三个以上的维度上持续下去? 或者,当我们处理超球体超立方体时 ? 我们什至从哪里开始?

    Let us think about what a sphere actually is, mathematically speaking. We can define an n-dimensional sphere as the surface formed by rotating a radius of fixed length r about a central point in (n+1)-dimensional space.

    从数学上来讲,让我们考虑一个球体实际上是什么。 我们可以将n维球体定义为围绕( n + 1)维空间中的中心点旋转固定长度r的半径而形成的表面。

    In 2-D, this traces out the edge of circle which is a 1-D line. In 3-D this traces out the 2-D surface of an everyday sphere. In 4-D+, which we cannot easily visualize, this process draws out a hyper-sphere.

    在2-D中,这将描绘出一维线的圆的边缘。 在3-D中,它可以描绘出日常球体的2-D表面。 在我们无法轻易可视化的4-D +中,此过程绘制出一个超球体。

    It’s harder to picture this concept in higher dimensions, but the pattern which we saw earlier continues . The relative volume of the sphere diminishes.

    在更高的维度里,这个概念很难再画出来,但我们前面看到的模式仍在继续:球体的相对体积在不断缩小。

    The generalized formula for the volume of a hyper-sphere with radius r in n dimensions is shown below:

    n 维空间中半径为 r 的超球体体积的一般公式如下所示:
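    V(n, r) = \frac{\pi ^{\frac{n}{2}}}{\Gamma (\frac{n}{2}+1)} r^n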

    Γ is the Gamma function, described here. Technically, we should be calling volume in > 3 dimensions hyper-content.

    Γ 是伽马函数,在这里有说明。严格来讲,超过三维的“体积”应该叫作超容积(hyper-content)。

    The volume of a hyper-cube with sides of length 2r in n dimensions is simply (2r)^n. If we extend our sphere-packaging example into higher dimensions, we find the percentage of overall space filled can be found by the general formula:

    n个维度上边长为2r的超立方体的体积为(2 r )^ n。 如果将球包装示例扩展到更高的维度,我们发现可以通过以下通用公式找到填充的总空间的百分比:
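    \frac{V_{sphere}}{V_{cube}} = \frac{\pi ^{\frac{n}{2}}}{\Gamma (\frac{n}{2}+1)} \cdot \frac{1}{2^n}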

    We’ve taken the first formula, multiplied by 1 / (2r)^n and then cancelled where r^n appears on both sides of the fraction.

    我们取第一个公式,乘以 1/(2r)^n,然后把分子和分母中同时出现的 r^n 约去。

    Look at how we have n/2 and n as exponents on the numerator (“top”) and denominator (“bottom”) of that fraction respectively. We can see that as n increases, the denominator will grow quicker than the numerator. This means the fraction gets smaller and smaller. That’s not to mention the fact the denominator also contains a Gamma function featuring n.

    看看我们如何分别在该分数的分子(“顶部”)和分母(“底部”)上分别具有n / 2和n作为指数。 我们可以看到,随着n的增加,分母将比分子增长更快。 这意味着分数越来越小。 更不用说分母也包含具有n的Gamma函数的事实

    The Gamma function is like the factorial function… you know, the one where (n! = 2 x 3 x … x n). The Gamma function also tends to grow really quickly. In fact, Γ(n) = (n-1)!.

    Gamma 函数有点像阶乘函数……你知道的,就是 n! = 2 x 3 x … x n 那个。Gamma 函数同样增长得非常快。事实上,Γ(n) = (n-1)!。

    This means that as the number of dimensions increases, the denominator grows much faster than the numerator. So the volume of the hyper-sphere decreases towards zero.

    这意味着随着维数的增加,分母的增长比分子的增长快得多。 因此,超球体的体积朝零减小。

    In case you don’t much feel like calculating Gamma functions and hyper-volumes in high dimensional space, I’ve made a quick graph:

    如果您不太想在高维空间中计算Gamma函数和超体积,我制作了一张快速图表:
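    For readers who do want to compute it, a short R sketch of the filled-space fraction from the formula above, which reproduces the shape of that graph:

    # Fraction of the unit hypercube filled by the inscribed hypersphere,
    # i.e. pi^(n/2) / (2^n * gamma(n/2 + 1)).
    filled_fraction <- function(n) pi^(n / 2) / (2^n * gamma(n / 2 + 1))
    n <- 1:12
    plot(n, filled_fraction(n), type = "b",
         xlab = "dimensions", ylab = "fraction of space filled")
    round(filled_fraction(n), 4)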

    The volume of the hyper-sphere (relative to the space in which it lives) rapidly plummets towards zero. This has serious repercussions in the world of Big Data.

    超球体的体积(相对于它所处的空间)迅速跌向零。这在大数据的世界里影响重大。

    …Why?

    …为什么?

    Recall our 2-D and 3-D examples. The empty space corresponded to the “corners” or “outlying regions” of the overall space.

    回顾我们的2D和3D示例。 空的空间对应于整个空间的“角落”或“外围区域”。

    For the 2-D case, our square had 4 corners which were 21.5% of the total space.

    对于二维情况,我们的正方形有4个角,占总空间的21.5%。

    In the 3-D case, our cube now had 8 corners which accounted for 47.6% of the total space.

    在3-D情况下,我们的立方体现在有8个角,占总空间的47.6%。

    As we move into higher dimensions, we will find even more corners. This will make an ever increasing percentage of the total space available.

    随着我们迈向更高的维度,我们将发现更多的角落。 这将使可用总空间的百分比不断增加。

    Now imagine we have data spread across some multidimensional space. The higher the dimensionality, the higher the total proportion of our data will be “flung out” in the corners, and the more similar the distances will be between the minimum and maximum distances between points.

    现在想象我们的数据散布在某个多维空间里。维度越高,被“甩”到角落里的数据所占的比例就越高,点与点之间的最小距离和最大距离也会变得越来越接近。

    In higher dimensions our data are more sparse and more similarly spaced apart. This makes most distance functions less effective.

    在更高的维度上,我们的数据更稀疏,点与点之间的间距也更趋于一致。这会让大多数距离函数的效果大打折扣。

    逃避诅咒! (Escaping the Curse!)

    There are a number of techniques which can project our high-dimensional data into a lower dimensional space. Recall the analogy of a 3-D object placed in front of a light source projects a 2-D shadow against a wall.

    有许多技术可以将我们的高维数据投影到低维空间中。 回想一下放置在光源前面的3D对象的类比,将2D阴影投射在墙上。

    By reducing the dimensionality of our data, we make three gains:

    通过减少数据的维数,我们获得了三点收获:

    • lighter computational workload

      减轻计算量
    • less dimensional redundancy

      较少的尺寸冗余
    • more effective distance metrics

      更有效的距离指标

    No wonder dimensionality reduction is so crucial in advanced machine learning applications such as computer vision, NLP and predictive modelling.

    难怪降维在高级机器学习应用(如计算机视觉NLP预测建模 )中如此重要。

    We’ll walk through five methods which are commonly applied to high-dimensional data sets. We’ll be restricting ourselves to feature extraction methods. They try to identify new features underlying the original data.

    我们将逐一介绍五种常用于高维数据集的方法。这里只讨论特征提取(feature extraction)方法,它们试图找出隐藏在原始数据背后的新特征。

    Feature selection methods choose which of the original features are worth keeping. We’ll leave those for a different article!

    特征选择(feature selection)方法则是挑选哪些原始特征值得保留。这些我们留到另一篇文章再谈!

    This is a long read with plenty of worked examples. So open your favorite code editor, put the kettle on, and let’s get started!

    这是一篇很长的文章,里面有大量可动手实践的示例。所以打开你喜欢的代码编辑器,烧上一壶水,我们开始吧!

    多维缩放(MDS) (Multidimensional Scaling (MDS))

    视觉总结 (Visual Summary)

    MDS refers to family of techniques used to reduce dimensionality. They project the original data in a lower-dimensional space, while preserving the distances between the points as much as possible. This is usually achieved by minimizing a loss-function (often called stress or strain) via an iterative algorithm.

    MDS是指用于减少尺寸的一系列技术。 他们将原始数据投影在较低维的空间中,同时尽可能保留点之间的距离。 这通常是通过迭代算法将损失函数(通常称为应力应变 )最小化来实现的。

    Stress is a function which measures how much of the original distance between points has been lost. If our projection does a good job at retaining the original distances, the returned value will be low.

    应力是一种测量点之间原始距离损失了多少的函数。 如果我们的投影在保留原始距离方面做得很好,则返回值将很低。

    工作实例 (Worked Example)

    If you have R installed, whack it open in your IDE of choice. Otherwise, if you want to follow along anyway, check this R-fiddle.

    如果你已经安装了 R,就在你常用的 IDE 里打开它。如果没有,但你仍想跟着练习,可以使用这个 R-fiddle。

    We’ll be looking at CMDS (Classical MDS) in this example. It will give an identical output to PCA (Principal Components Analysis), which we’ll discuss later.

    在此示例中,我们将研究CMDS(经典MDS)。 它将为PCA(主成分分析)提供相同的输出,我们将在后面讨论。

    We’ll be making use of two of R’s strengths in this example:

    在此示例中,我们将利用R的两个优势:

    • working with matrix multiplication

      使用矩阵乘法
    • the existence of inbuilt data sets

      内置数据集的存在

    Start with defining our input data:

    首先定义我们的输入数据:

    M <- as.matrix(UScitiesD)

    We want to begin with a distance matrix where each element represents the Euclidean distance (think Pythagoras’ Theorem) between our observations. The UScitiesD and eurodist data sets in R are straight-line and road distance matrices between a selection of U.S. and European cities.

    我们想从一个距离矩阵开始,其中每个元素代表我们的观测值之间的欧几里得距离 (认为​​毕达哥拉斯定理)。 R中的UScitiesDeurodist数据集是选择的美国和欧洲城市之间的直线和道路距离矩阵。

    With non-distance input data, we would need a preliminary step to calculate the distance matrix first.

    对于非距离输入数据,我们需要一个初步步骤来首先计算距离矩阵。

    M <- as.matrix(dist(raw_data))

    With MDS, we seek to find a low-dimensional projection of the data that best preserves the distances between the points. In Classical MDS, we aim to minimize a loss-function called Strain.

    借助MDS,我们寻求找到最能保留点之间距离的数据的低维投影 。 在经典MDS中,我们旨在最小化称为Strain的损失函数

    Strain is a function that works out how much a given low-dimensional projection distorts the original distances between the points.

    应变是一种函数,可以计算出给定的低维投影有多大程度地扭曲了点之间的原始距离。

    With MDS, iterative approaches (for example, via gradient descent) are usually used to edge our way towards an optimal solution. But with CMDS, there’s an algebraic way of getting there.

    使用MDS时,通常使用迭代方法(例如,通过梯度下降法 )来逐步实现最佳解决方案。 但是有了CMDS,就有一种代数的方式到达那里。

    Time to bring in some linear algebra. If this stuff is new to you, don’t worry — you’ll pick things up with a little practice. A good starting point is to see matrices as blocks of numbers that we can manipulate all at once, and work from there.

    是时候引入一些线性代数了。 如果这些东西对您来说是新手,请不要担心-您将通过一些练习来掌握。 一个很好的起点是将矩阵视为数字块,我们可以一次操纵所有数字,然后从那里开始工作。

    Matrices follow certain rules for operations such as addition and multiplication. Matrices can be broken down, or decomposed, into eigenvalues and corresponding eigenvectors.

    矩阵的加法和乘法等运算遵循特定的规则。矩阵可以被分解(decompose)为特征值和对应的特征向量。

    Eigen-what now?

    特征(Eigen)——那又是什么?

    A simple way of thinking about all this eigen-stuff is in terms of transformations. Transformations can change both the direction and length of vectors upon which they act.

    对所有这些本征材料的简单思考方法是变换 。 变换可以改变其作用的向量的方向和长度。

    Shown below, matrix A describes a transformation, which is applied to two vectors by multiplying A x v. The blue vector’s direction of 1 unit across and 3 units up remains unchanged. Only it’s length changes, here it doubles. This makes the blue vector an eigenvector of A with an eigenvalue of 2.

    如下所示,矩阵 A 描述了一个变换,通过计算 A x v 把它作用到两个向量上。蓝色向量的方向(横向 1 个单位、向上 3 个单位)保持不变,只是长度发生了变化——在这里变成了两倍。这使得蓝色向量成为 A 的一个特征向量,对应的特征值为 2。

    The orange vector does change direction when multiplied by A, so it cannot be an eigenvector of A.

    橙色向量乘以A确实会改变方向,因此它不能是A的特征向量

    Back to CMDS — our first move is to define a centering matrix that lets us double center our input data. In R, we can implement this as below:

    回到CMDS-我们的第一步是定义一个居中矩阵 ,该矩阵使我们可以对输入数据进行两次居中 。 在R中,我们可以如下实现:

    n <- nrow(M)
    C <- diag(n) - (1/n) * matrix(rep(1, n^2), nrow = n)

    We then use R’s support for matrix multiplication %*% to apply the centering matrix to our original data to form a new matrix, B.

    然后,我们使用R对矩阵乘法%*%的支持将定心矩阵应用于原始数据,以形成新矩阵B。

    B <- -(1/2) * C %*% M %*% C

    Nice! Now we can begin building our 2-D projection matrix. To do this, we define two more matrices using the eigenvectors associated with the two largest eigenvalues of matrix B.

    很好!现在我们可以开始构建二维投影矩阵了。为此,我们再定义两个矩阵,分别由矩阵 B 的两个最大特征值所对应的特征向量构成。

    Like so:

    像这样:

    E <- eigen(B)$vectors[,1:2]
    L <- diag(2) * eigen(B)$values[1:2]

    Let’s calculate our 2-D output matrix X, and plot the data according to the new co-ordinates.

    让我们计算二维输出矩阵X ,然后根据新坐标绘制数据。

    X <- E %*% L^(1/2)
    plot(-X, pch=4)
    text(-X, labels = rownames(M), cex = 0.5)

    How does that look? Pretty good, right? We have recovered the underlying 2-D layout of the cities from our original input distance matrix. Of course, this technique lets us use distance matrices calculated from even higher-dimensional data sets.

    看起来怎么样? 还不错吧? 我们已经从原始输入距离矩阵中恢复了城市的基本二维布局。 当然,这种技术使我们可以使用从更高维数据集计算出的距离矩阵。

    Learn more about the variety of techniques which come under the label of MDS.

    了解有关MDS标签下的各种技术的更多信息。

    主成分分析(PCA) (Principal Components Analysis (PCA))

    视觉总结 (Visual Summary)

    In a large data set with many dimensions, some of the dimensions may well be correlated and essentially describe the same underlying information. We can use linear algebra to project our data into a lower-dimensional space, while retaining as much of the underlying information as possible.

    在具有多个维度的大型数据集中,某些维度可能很相关,并且本质上描述了相同的基础信息。 我们可以使用线性代数将数据投影到较低维的空间,同时保留尽可能多的基础信息。

    The visual summary above provides a low-dimensional explanation. In the plot on the left, our data are described by two axes, x and y.

    上面的视觉摘要提供了一个低维度的解释。 在左侧的图中,我们的数据由xy两个轴描述。

    In the middle plot, we rotate the axes through the data in the direction that captures as much variation as possible. The new PC1 axis describes much more of the variation than axis PC2. In fact, we could ignore PC2 and still keep a large percentage of the variation in the data.

    在中间的图中,我们沿数据方向旋转轴,以捕获尽可能多的变化。 新的PC1轴比PC2轴描述了更多的变化。 实际上,我们可以忽略PC2,而仍然保留很大一部分数据变化。

    工作实例 (Worked Example)

    Let’s use a small scale example to illustrate the core idea. In an R session or in this snippet at R-fiddle), let’s load one of the in-built data sets.

    让我们用一个小规模的例子来说明核心思想。 在R会话中或R-fiddle的此代码段中 ,让我们加载其中一个内置数据集。

    data <- as.matrix(mtcars)
    head(data)
    dim(data)

    Here we have 32 observations of different cars across 11 dimensions. They include features and measurements such as mpg, cylinders, horsepower….

    在这里,我们对11个维度的不同汽车进行了32次观测。 它们包括功能和测量值,例如mpg,汽缸,马力……。

    But how many of those 11 dimensions do we actually need? Are some of them correlated?

    但是,我们实际上需要这11个维度中的多少个? 它们中的一些相关吗?

    Let’s calculate the correlation between the number of cylinders and horsepower. Without any prior knowledge, what might we expect to find?

    让我们计算缸数与马力之间的相关性。 如果没有任何先验知识,我们可能会发现什么?

    cor(mtcars$cyl, mtcars$hp)

    That’s an interesting result . At +0.83, we find the correlation coefficient is pretty high. This suggests that number of cylinders and horsepower are both describing the same underlying feature. Are more of our dimensions doing something similar?

    那是一个有趣的结果。 在+0.83处,我们发现相关系数非常高。 这表明缸数和马力都描述了相同的基本特征。 我们有更多的维度在做类似的事情吗?

    Let’s correlate all pairs of our dimensions and build a correlation matrix. Because life’s too short.

    让我们计算所有维度两两之间的相关性,建立一个相关矩阵。因为人生苦短。

    cor(data)

    Each cell contains the correlation coefficient between the dimensions at each row and column. The diagonal always equals 1.

    每个单元格包含每一行和每一列的尺寸之间的相关系数。 对角线始终等于1。

    Correlation coefficients near +1 show strong positive correlation. Coefficients near -1 show strong negative correlation. We can see some values close to -1 and +1 in our correlation matrix. This shows we have some correlated dimensions in our data set.

    接近+1的相关系数显示强正相关。 -1附近的系数显示出很强的负相关性。 我们可以在相关矩阵中看到一些接近-1和+1的值。 这表明我们的数据集中有一些相关的维度。

    This is cool, but we still have the same number of dimensions we started with. Let’s throw out a few!

    这很酷,但是我们仍然拥有与开始时相同的尺寸数。 让我们扔掉一些!

    To do this, we can get out the linear algebra again. One of the strong points of the R language is that it is good at linear algebra, and we’re gonna make use of that in our code. Our first step is to take our correlation matrix and find its eigenvalues.

    为此,我们可以再次求出线性代数。 R语言的强项之一是它擅长线性代数,我们将在我们的代码中使用它。 我们的第一步是获取相关矩阵并找到其特征值。

    e <- eigen(cor(data))

    Let’s inspect the eigenvalues:

    让我们检查特征值:

    e$values
    barplot(e$values/sum(e$values),
        main="Proportion Variance explained")

    We see 11 values which decrease pretty dramatically on the bar plot! We see that the eigenvector associated with the largest eigenvalue explains about 60% of the variation in our data. The eigenvector associated with the second largest eigenvalue explains about 24% of the variation in our original data. That’s already 84% of the variation in the data, explained by two dimensions!

    我们看到11个值在条形图上显着下降! 我们看到与最大特征值相关的特征向量解释了我们数据中约60%的变化。 与第二大特征值相关的特征向量解释了原始数据中约24%的变化。 这已经是数据变化的84%,由两个维度来解释!

    OK, let’s say we want to keep 90% of the variation in our original data set. How many dimensions do we need to keep to achieve this?

    好的,假设我们要在原始数据集中保留90%的变化。 为了达到这个目的,我们需要保持多少个维度?

    cumulative <- cumsum(e$values/sum(e$values))
    print(cumulative)
    
    i <- which(cumulative >= 0.9)[1]
    print(i)

    We calculate the cumulative sum of our eigenvalues’ relative proportion of the total variance. We see that the eigenvectors associated with the 4 largest eigenvalues can describe 92.3% of the original variation in our data.

    我们计算特征值相对于总方差的相对比例的累积总和。 我们看到与4个最大特征值相关的特征向量可以描述数据中原始变化的92.3%。

    This is useful! We can retain >90% of the original structure using only 4 dimensions. Let’s project the original data set onto a 4-D space. To do this, we need to create a matrix of weights, which we’ll call W.

    这很有用!我们只用 4 个维度就能保留原始结构 90% 以上的信息。让我们把原始数据集投影到一个 4 维空间上。为此,我们需要创建一个权重矩阵,记作 W。

    W <- e$vectors[1:ncol(data),1:i]

    W is an 11 x 4 matrix. Remember, 11 is the number of dimensions in our original data, and 4 is the number we want to have for our transformed data. Each column in W is given by the eigenvectors corresponding to the four largest eigenvalues we saw earlier.

    W是11 x 4矩阵。 请记住,11是我们原始数据中的维数,4是我们想要转换后的数据中的维数。 W中的每一列由对应于我们之前看到的四个最大特征值的特征向量给出。

    To get our transformed data, we multiply the original data set by the weights matrix W. In R, we perform matrix multiplication with the %*% operator.

    为了获得转换后的数据,我们将原始数据集乘以权重矩阵W。在R中,我们使用%*%运算符执行矩阵乘法。

    tD <- data %*% W
    head(tD)

    We can view our transformed data set . Now each car is described in terms of 4 principal components instead of the original 11 dimensions. To get a better understanding of what these principal components are actually describing, we can correlate them against the original 11 dimensions.

    我们可以查看转换后的数据集。 现在,每辆汽车都是用4个主要部件而不是原始的11个尺寸来描述的。 为了更好地理解这些主要成分的实际含义,我们可以将它们与原始的11个维度相关联。

    cor(data, tD[,1:i])

    We see that component 1 is negatively correlated with cylinders, horsepower and displacement. It is also positively correlated with mpg and possessing a straight (as opposed to V-shaped) engine. This suggests that component 1 is a measure of engine type.

    我们看到组件1与汽缸,马力和排量负相关。 它也与mpg呈正相关,并具有直式(与V形相反)的引擎。 这表明组件1是发动机类型的量度。

    Cars with large, powerful engines will have a negative score for component 1. Smaller engines and more-fuel efficient cars will have a positive score. Recall that this component describes approximately 60% of the variation in the original data.

    具有大型,强劲发动机的汽车在组件1中的得分为负。较小的发动机和燃油效率更高的汽车的得分为正。 回想一下,此组件描述了原始数据中大约60%的变化。

    Likewise, we can interpret the remaining components in this manner. It can become trickier (if not impossible) to do so as we proceed. Each subsequent component describes a smaller and smaller proportion of the overall variation in the data. Nothing beats a little domain-specific expertise!

    同样,我们可以按这种方式解释其余的成分。不过越往后越难(有时甚至不可能),因为每个后续成分所描述的只是数据整体变化中越来越小的一部分。这时没有什么比得上一点领域专业知识!

    There are several aspects in which PCA can vary to the method described here. You can read an entire book on the subject.

    PCA可以在多个方面改变此处描述的方法。 您可以阅读有关该主题的整本书

    线性判别分析(LDA) (Linear Discriminant Analysis (LDA))

    视觉总结 (Visual Summary)

    On the original axis, the red and blue classes overlap. Through rotation, we can find a new axis which better separates the classes. We may choose to use this axis to project our data into a lower-dimensional space.

    在原始轴上,红色和蓝色类别重叠。 通过旋转,我们可以找到一个更好地分隔类的新轴。 我们可以选择使用该轴将数据投影到低维空间中。

    PCA seeks axes that best describe the variation within the data. Linear Discriminant Analysis (LDA) seeks axes that best discriminate between two or more classes within the data.

    PCA寻求最能描述数据变化的轴。 线性判别分析(LDA)会寻找可最佳区分数据中两个或多个类别的轴。

    This is achieved by calculating two measures

    这是通过计算两个度量来实现的

    • within-class variance

      组内方差

    • between-class variance.

      类间差异

    The objective is to optimize the ratio between them. There is minimal variance within each class and maximal variance between the classes. We can do this with algebraic methods.

    目的是优化它们之间的比率。 每个类别中的方差最小,而各个类别之间的方差最大。 我们可以用代数方法做到这一点。

    As shown above, A is the within-class scatter. B is the between-class scatter.

    如上所示, A是类内散布。 B是类间散布。

    它是如何工作的? (How Does It Work?)

    Let’s generate a simple data set for this example (for the R-fiddle, click here).

    让我们为该示例生成一个简单的数据集(对于R小提琴, 请单击此处 )。

    require(dplyr)
    languages <- data.frame(
      HTML = c(22,20,15, 5, 5, 5, 0, 2, 0),
      JavaScript = c(20,25,25,20,20,15, 5, 5, 0),
      Java = c(15, 5, 0,15,30,30,10,10,15),
      Python = c( 5, 0, 2, 5,10, 5,40,35,30),
      job = c("Web","Web","Web","App","App","App","Data","Data","Data")
      )
    
    View(languages)

    We have a fictional data set describing nine developers in terms of the number of hours they spend working in each of four languages:

    我们有一个虚构的数据集,以他们用四种语言中的每种语言工作的时间来描述九个开发人员:

    • HTML

      HTML
    • JavaScript

      JavaScript
    • Java

      Java
    • Python

      Python

    Each developer is classed in one of three job roles:

    每个开发人员都被划分为以下三个职位之一:

    • web developer

      Web开发人员
    • app developer

      应用程式开发人员
    • and data scientist

      和数据科学家
    cor(select(languages, -job))

    We use the select() function from the dplyr package to drop the class labels from the data set. This allows us to inspect the correlations between the different languages.

    我们使用dplyr包中的select()函数从数据集中删除类标签。 这使我们可以检查不同语言之间的相关性。

    Unsurprisingly, we see some patterns. There is a strong, positive correlation between HTML and JavaScript. This indicates developers who use one of these languages have a tendency to also use the other.

    毫不奇怪,我们看到了一些模式。 HTML和JavaScript之间有很强的正相关关系。 这表明使用其中一种语言的开发人员也倾向于使用另一种语言。

    We suspect that there is some lower-dimensional structure beneath this 4-D data set. Remember, four languages = four dimensions.

    我们怀疑在此4-D数据集下存在一些低维结构。 请记住,四种语言=四个维度。

    Let’s use LDA to project our data into a lower-dimensional space that best separates the three classes of job roles.

    让我们使用LDA将我们的数据投影到一个较低维度的空间中,该空间可以最好地将三类工作角色分开。

    First, we need to build within-class scatter matrices for each class. Let’s use dplyr’s filter() and select() methods to break down our data by job role.

    首先,我们需要为每个类建立类内散布矩阵。 让我们使用dplyrfilter()select()方法按工作角色细分数据。

    Web <- as.data.frame(
      scale(filter(languages, job == "Web") %>% 
        select(., -job),T))
    
    App <- as.data.frame(
      scale(filter(languages, job == "App") %>%
        select(., -job),T))
    
    Data <- as.data.frame(
      scale(filter(languages, job == "Data") %>%
        select(., -job),T))

    So now we have three new data sets, one for each job role. For each of these, we can find a covariance matrix. This is closely related to the correlation matrix. It also describes the trends between how languages are used together.

    因此,现在我们有了三个新数据集,每个工作角色一个。 对于这些中的每一个,我们都可以找到一个协方差矩阵 。 这与相关矩阵密切相关。 它还描述了如何一起使用语言之间的趋势。

    We find the within-class scatter matrix by summing the each of the three covariance matrices. This gives us a matrix describing the scatter within each class.

    我们发现内部 通过求和三个协方差矩阵中的每一个来分散矩阵。 这为我们提供了一个矩阵,用于描述每个类中的分散。

    within <- cov(Web) + cov(App) + cov(Data)

    Now we want to find the between-class scatter matrix which describes the scatter between the classes. To do this, we must first find the center of each class, by calculating the average features of each. This lets us form a data.frame where each column describes the average developer for each class.

    现在我们要找到类间散布矩阵,它描述了类之间的散布。 为此,我们必须首先通过计算每个类别的平均特征来找到每个类别的中心。 这使我们形成一个data.frame ,其中每一列描述每个类的平均开发人员。

    means <- t(data.frame(
      mean_Web <- sapply(Web, mean),
      mean_App <- sapply(App, mean),
      mean_Data <- sapply(Data, mean)))

    To get our between-class scatter matrix, we find the covariance of this matrix.:

    为了获得类间散布矩阵,我们找到该矩阵的协方差:

    between <- cov(means)

    Now we have two matrices:

    • our within-class scatter matrix
    • the between-class scatter matrix

    We want to find new axes for our data that minimize the ratio of within-class scatter to between-class scatter.
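
    (Equivalently, in the standard textbook formulation, which the article leaves implicit: we look for a direction w that maximizes Fisher's criterion

    J(w) = \frac{w^{T} S_{B} w}{w^{T} S_{W} w}

    where S_W is the within-class and S_B the between-class scatter matrix. The maximizing directions are the leading eigenvectors of S_W^{-1} S_B, which is exactly the matrix built in the code below.)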

    We do this by finding the eigenvectors of the matrix formed by:

    # eigen-decomposition of solve(within) %*% between, i.e. of S_W^{-1} S_B
    e <- eigen(solve(within) %*% between)

    barplot(e$values/sum(e$values),
      main='Variance explained')

    W <- e$vectors[,1:2]   # keep the two leading eigenvectors as the new axes

    By plotting the eigenvalues, we can see that the first two eigenvectors will explain more than 95% of the variation in the data.

    Let’s transform the original data set and plot the data in its new, lower-dimensional space.

    LDA <- scale(select(languages, -job), T) %*% W
      
    plot(LDA, pch="", 
      main='Linear Discriminant Analysis')
    
    text(LDA[,1],LDA[,2],cex=0.75,languages$job,
      col=unlist(lapply(c(2,3,4),rep, 3)))

    There you go! See how the new axes do an amazing job separating the different classes? This reduces the dimensionality of the data and could also prove useful for classification purposes.

    To begin interpreting the new axes, we can correlate them against the original data:

    cor(select(languages,-job),LDA)

    This reveals how Axis 1 is negatively correlated with JavaScript and HTML, and positively correlated with Python. This axis separates the Data Scientists from the Web and App developers.

    Axis 2 is correlated with HTML and Java in opposite directions. This separates the Web developers from the App developers. It would be an interesting insight, if the data weren’t fictional…

    We have assumed the three classes are all equal in size, which simplifies things a bit. LDA can be applied to 2 or more classes, and can be used as a classification method as well.
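
    (If you just want the end result rather than the step-by-step construction, the MASS package that ships with R provides a ready-made implementation. A minimal sketch, shown purely as an illustration on this tiny fictional data set, where it may warn about collinear variables:)

    fit <- MASS::lda(job ~ ., data = languages)   # MASS:: avoids masking dplyr's select()
    head(predict(fit)$x)                          # the discriminant scores: our new axes
    predict(fit)$class                            # predicted job role for each developer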

    Get the full picture and coverage of LDA’s use in classification.

    Non-linear Dimensionality Reduction

    The techniques covered so far are pretty good in many use-cases, but they make a key assumption: that we are working in the context of linear geometry.

    Sometimes, this is an assumption we need to drop.

    Non-linear dimensionality reduction (NLDR) opens up a fascinating world of advanced mathematics and mind-bending possibilities in applications such as computer vision and autonomy.

    There are many NLDR methods available. We’ll take a look at a couple of techniques relating to manifold learning. These will approximate the underlying structure of high-dimensional data. Manifolds are one of the many mathematical concepts that might sound impenetrable but which are actually seen every day.

    Take this map of the world:

    We’re all fine with the idea of representing the surface of a sphere on a flat sheet of paper. Recall from before that a sphere is defined as a 2-D surface traced at a fixed distance around a point in 3-D space. The earth’s surface is a 2-D manifold embedded, or wrapped around, in a 3-D space.

    With high-dimensional data, we can use the concept of manifolds to reduce the number of dimensions we need to describe the data.

    Think back to the surface of the earth. Earth exists in a 3-D space, so we should describe the location, such as a city, in three dimensions. However, we have no trouble using only two dimensions of latitude and longitude instead.

    Manifolds can be more complex and higher-dimensional than the earth example here. Isomap and Laplacian Eigenmapping are two closely related methods used to apply this thinking to high-dimensional data.

    Isomap

    Visual Summary

    We can see our original data as a U-shaped underlying structure. The straight-line distance, as shown by the black arrow, between A and B won’t reflect the fact they lie at opposite ends, as shown by the red line.

    We can build a nearest-neighbors graph to find the shortest path between the points. This lets us build a distance matrix that can be used as an input for MDS to find a lower-dimensional representation of the original data that preserves the non-linear structure.

    We can approximate distances on the manifold using techniques in graph theory. We can do this by building a graph or network by connecting each of our original data points to a set of neighboring points.

    By using a shortest-paths algorithm, we can find the geodesic distance between each point. We can use this to form a distance matrix which can be an input for a linear dimensionality reduction method.

    Worked Example

    We’re going to implement a simple Isomap algorithm using an artificially generated data set. We’ll keep things in low dimensions, to help visualize what is going on. Here’s the code.

    Let’s start by generating some data:

    # a noisy spiral: 1000 points wound around the origin
    x <- y <- c(); a <- b <- 1

    for(i in 1:1000){
      theta <- 0.01 * i
      x <- append(x, (a+b*theta)*(cos(theta)+runif(1,-1,1)))
      y <- append(y, (a+b*theta)*(sin(theta)+runif(1,-1,1)))
    }

    color <- rainbow(1200)[1:1000]
    spiral <- data.frame(x,y,color)
    plot(y~x, pch=20, col=color)

    Nice! That’s an interesting shape, with a clear, non-linear structure. Our data could be seen as scattered along a 1-D line, running between red and violet, coiled up (or embedded) in a 2-D space. Under the assumption of linearity, distance metrics and other statistical techniques won’t take this into account.

    How can we unravel the data to find its underlying 1-D structure?

    pc <- prcomp(spiral[,1:2])
    plot(data.frame(
      pc$x[,1],1),col=as.character(spiral$color))

    PCA won’t help us, as it is a linear dimensionality reduction technique. See how it has collapsed all the points onto an axis running through the spiral? Instead of revealing the underlying red-violet spectrum of points, we only see the blue points scattered along the whole axis.

    Let’s try implementing an Isomap algorithm. We begin by building a graph from our data points, by connecting each to its n-nearest neighboring points. n is a hyper-parameter that we need to set in advance of running the algorithm. For now, let’s use n = 5.

    We can represent the n-nearest neighbors graph as an adjacency matrix A.

    The element at the intersection of each row and column can be either 1 or 0 depending on whether the corresponding points are connected.

    Let’s build this with the code below:

    n <- 5
    distance <- as.matrix(dist(spiral[,1:2]))
    A <- matrix(0,ncol=ncol(distance),nrow=nrow(distance))

    for(i in 1:nrow(A)){
      # positions 2:(n+1) of the sorted row are the n nearest points,
      # skipping position 1, which is the point itself (distance 0)
      neighbours <- as.integer(
        names(sort(distance[i,])[2:(n+1)]))
      A[i,neighbours] <- 1
    }

    Now we have our n-nearest neighbors graph, we can start working with the data in a non-linear way. For example, we can begin to approximate the distances between points on the spiral by finding their geodesic distance — calculating the length of the shortest path between them.

    Dijkstra’s algorithm is a famous algorithm which can be used to find the shortest path between any two points in a connected graph. We could implement our own version here but to remain on-topic, I will use the distances() function from R’s igraph library.

    install.packages('igraph'); require(igraph)
    
    graph <- graph_from_adjacency_matrix(A)
    geo <- distances(graph, algorithm = 'dijkstra')

    This gives us a distance matrix. Each element represents the smallest number of edges, or links, required to get from one point to another.
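
    (As a quick, illustrative sanity check, with exact values depending on the random noise in the spiral: an immediate neighbour should be only a hop or two away, while a point at the far end of the spiral should be many hops away.)

    geo[1, 2]      # a close neighbour: very few hops
    geo[1, 1000]   # the opposite end of the spiral: many hops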

    Here’s an idea… why not use MDS to find some co-ordinates for the points represented in this distance matrix? It worked earlier for the cities data.

    We could wrap our earlier MDS example in a function and apply our own, homemade version. However, you’ll be pleased to know that R provides an in-built MDS function we can use as well. Let’s scale to one dimension.

    md <- data.frame(
      'scaled'=cmdscale(geo,1),
      'color'=spiral$color)
    
    plot(data.frame(
      md$scaled,1), col=as.character(md$color), pch=20)

    We’ve reduced from 2-D to 1-D, without ignoring the underlying manifold structure.

    For advanced, non-linear machine learning purposes, this is a big deal. Often enough, high-dimensional data arises as a result of a lower-dimensional generative process. Our spiral example illustrates this.

    The original spiral was plotted as a data.frame of x and y co-ordinates. But we generated those with a for-loop, in which our index variable i incremented by +1 each iteration.

    By applying our Isomap algorithm, we have recapitulated the steady increase in i with each iteration of the loop. Pretty good going.
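
    (One way to see this, as an illustrative check rather than part of the original walk-through: the recovered 1-D coordinate should be almost perfectly correlated, up to sign, with the loop index that generated the spiral.)

    abs(cor(md$scaled, 1:1000))   # close to 1 if the unrolling worked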

    The version of Isomap we implemented here has been a little simplified in parts. For example, we could have weighted our adjacency matrix to account for Euclidean distances between the points. This would give us a more nuanced measure of geodesic distance.
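
    (A minimal sketch of that weighted variant, my own illustration rather than the article's code: keep the Euclidean distances as edge weights, so that Dijkstra sums actual lengths instead of hop counts.)

    W_adj <- A * distance                                # keep only n-nearest-neighbour edges, weighted by Euclidean length
    graph_w <- graph_from_adjacency_matrix(W_adj, weighted = TRUE)
    geo_w <- distances(graph_w, algorithm = 'dijkstra')  # geodesics measured in spiral units, not hop counts
    md_w <- cmdscale(geo_w, 1)                           # the same MDS step as before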

    One drawback of methods like this is the need to establish suitable hyper-parameter values. If the nearest-neighbors threshold n is too low, you will end up with a fragmented graph. If it is too high, the algorithm will be insensitive to detail. That spiral could become an ellipse if we start connecting points on different layers.

    This means these methods work best with dense data. That requires the manifold structure to be pretty well defined in the first place.

    Laplacian Eigenmapping

    Visual Summary

    Using ideas from Spectral Graph Theory, we can find a lower dimensional projection of the data while retaining the non-linear structure.

    Again, we can approximate distances on the manifold using techniques in graph theory. We can do this by building a graph connecting each of our original data points to a set of neighboring points.

    Laplacian Eigenmapping takes this graph and applies ideas from spectral graph theory to find a lower-dimensional embedding of the original data.

    Worked Example

    OK, you’ve made it this far. Your reward is the chance to nerd out with our fourth and final dimensionality reduction algorithm. We’ll be exploring another non-linear technique. Like Isomap, it uses graph theory to approximate the underlying structure of the manifold. Check out the code.

    Let’s start with similar spiral-shaped data to that we used before. But let’s make it even more tightly wound.

    set.seed(100)

    # the same construction as before, but wound twice as tightly
    x <- y <- c();
    a <- b <- 1

    for(i in 1:1000){
      theta <- 0.02 * i
      x <- append(x, (a+b*theta)*(cos(theta)+runif(1,-1,1)))
      y <- append(y, (a+b*theta)*(sin(theta)+runif(1,-1,1)))
    }

    color <- rainbow(1200)[1:1000]
    spiral <- data.frame(x,y,color)
    plot(y~x, pch=20, col=color)

    The naive straight-line distance between A and B is much shorter than the distance from one end of the spiral to the other. Linear techniques won’t stand a chance!

    Again, we begin by constructing the adjacency matrix A of an n-nearest neighbors graph. n is a hyper-parameter we need to choose in advance.

    Let’s try n = 10:

    n <- 10
    distance <- as.matrix(dist(spiral[,1:2]))
    A <- matrix(0,ncol=ncol(distance),
      nrow=nrow(distance))

    for(i in 1:nrow(A)){
      # the n nearest points, skipping the point itself
      neighbours <- as.integer(
        names(sort(distance[i,])[2:(n+1)]))
      A[i,neighbours] <- 1
    }

    # force the adjacency matrix to be symmetric:
    # if i lists j as a neighbour, connect j back to i as well
    for(j in 1:nrow(A)){
      for(k in 1:ncol(A)){
        if(A[j,k] == 1){
          A[k,j] <- 1
        }
      }
    }

    So far, so much like Isomap. We’ve added an extra few lines of logic to force the matrix to be symmetric. This will allow us to use ideas from spectral graph theory in the next step. We will define the Laplacian matrix of our graph.

    We do this by building the degree matrix D.

    D <- diag(nrow(A))
    
    for(i in 1:nrow(D)){   
      D[i,i] = sum(A[,i])
    }

    This is a matrix the same size as A, where every element is equal to zero — except those on the diagonal, which equal the sum of the corresponding column of matrix A.

    Next, we form the Laplacian matrix L with the simple subtraction:

    L <- D - A

    The Laplacian matrix is another matrix representation of our graph particularly suited to linear algebra. It allows us to calculate a whole range of interesting properties.
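
    (One such property is worth a quick illustrative check of our construction: every row of L sums to zero, because each vertex's degree exactly cancels its adjacency entries.)

    range(rowSums(L))   # both values should be 0, up to floating point error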

    To find our 1-D embedding of the original data, we need to find a vector x and eigenvalue λ.

    This will solve the generalized eigenvalue problem:

    Lx = λDx

    Thankfully, you can put away the pencil and paper, because R provides a package to help us do this.

    install.packages('geigen'); require(geigen)
    eig <- geigen(L,D)
    eig$values[1:10]

    We see that the geigen() function has returned the eigenvalue solutions from smallest to largest. Note how the first value is practically zero.

    This is one of the properties of the Laplacian matrix: its number of zero eigenvalues tells us how many connected components we have in the graph. Had we used a lower value for n, we might have built a fragmented graph in, say, three separate, disconnected parts, in which case we’d have found three zero eigenvalues.
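
    (As an illustrative cross-check using igraph, which is already loaded from the Isomap example: the count of near-zero generalized eigenvalues should match the number of connected components of the n-nearest neighbors graph.)

    sum(abs(eig$values) < 1e-8)                      # near-zero generalized eigenvalues
    components(graph_from_adjacency_matrix(A))$no    # connected components of the graph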

    To find our low-dimensional embedding, we can take the eigenvectors associated with the lowest non-zero eigenvalues. Since we are projecting from 2-D into 1-D, we will only need one such eigenvector.

    embedding <- eig$vectors[,2]   # eigenvector for the smallest non-zero eigenvalue
    plot(data.frame(embedding,1), col=as.character(spiral$color), pch=20)

    And there we have it — another non-linear data set successfully embedded in lower dimensions. Perfect!

    We have implemented a simplified version of Laplacian Eigenmapping. We ignored choosing another hyper-parameter t, which would have had the effect of weighting our nearest-neighbors graph.
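
    (For reference, the weighting in the original Laplacian Eigenmaps paper uses a heat kernel with bandwidth t. A minimal sketch of how that could slot into the code above, with the bandwidth chosen arbitrarily for illustration:)

    heat_t <- 10                              # the paper's bandwidth parameter t
    W_heat <- exp(-distance^2 / heat_t) * A   # heat-kernel weights, kept only on n-nearest-neighbour edges
    D_heat <- diag(colSums(W_heat))           # weighted degree matrix
    L_heat <- D_heat - W_heat                 # weighted Laplacian, used in place of L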

    Take a look at the original paper for the full details and mathematical justification.

    Conclusion

    There we are — a run through of four dimensionality reduction techniques that we can apply to linear and non-linear data. Don’t worry if you didn’t quite follow all the math (although congrats if you did!). Remember, we always need to strike a balance between theory and practice when it comes to data science.

    These algorithms and several others are available in various packages of R, and in scikit-learn for Python.

    Why, then, did we run through each one step-by-step? In my experience, rebuilding something from scratch is a great way to understand how it works.

    Dimensionality reduction touches upon several branches of mathematics which are useful within data science and other disciplines. Putting these into practice is a great exercise for turning theory into application.

    There are, of course, other techniques that we haven’t covered. But if you still have an appetite for more machine learning, then try out the links below:

    Linear techniques:

    Non-linear:

    Thanks for reading! If you have any feedback or questions, please leave a response below!

    Translated from: https://www.freecodecamp.org/news/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335/
