  • K-nearest neighbors

    2018-07-27 11:19:13

    After k is fixed, compute the distance from the query point to every training sample and find the k nearest ones; the class that appears most often among these k points is the final prediction for the query point. For regression, take the mean of the k values instead.

    KDTree TODO
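
    The procedure above maps directly onto scikit-learn's classifier and regressor. A minimal sketch is shown below; the toy data is illustrative, not from the original post, and the neighbor search can be backed by the KD-tree mentioned above.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])   # training samples
    y_cls = np.array([0, 0, 0, 1, 1, 1])                         # class labels
    y_reg = np.array([0.1, 0.9, 2.1, 9.2, 9.8, 11.1])            # continuous targets

    # Classification: majority vote among the k nearest samples.
    clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree').fit(X, y_cls)
    print(clf.predict([[1.5]]))    # -> [0]

    # Regression: mean of the targets of the k nearest samples.
    reg = KNeighborsRegressor(n_neighbors=3, algorithm='kd_tree').fit(X, y_reg)
    print(reg.predict([[1.5]]))    # mean of the three nearest y values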

  • Using the KNN (k-nearest neighbors) algorithm in sklearn

    Full-Stack Engineer Development Manual (author: 栾鹏)

    Python data mining tutorial series

    KNN stands for k-nearest neighbors; for background on the method, see
    http://blog.csdn.net/luanpeng825485697/article/details/78796773

    This post only covers how to use the KNN algorithm in sklearn.

    Unsupervised nearest neighbors

    NearestNeighbors implements unsupervised nearest neighbors learning. It provides a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of algorithm is controlled with the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is used, the algorithm attempts to determine the best approach from the training data.

    # ======== Unsupervised nearest neighbor search (often used in clustering, e.g. the Chameleon clustering algorithm) ==========
    
    from sklearn.neighbors import NearestNeighbors
    import numpy as np  # fast operations on structured arrays
    
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])  # sample data
    test_x = np.array([[-3.2, -2.1], [-2.6, -1.3], [1.4, 1.0], [3.1, 2.6], [2.5, 1.0], [-1.2, -1.3]])  # test data
    # test_x=X  # Setting the test data equal to the sample data searches for each sample's neighbors within the sample set itself.
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)  # build a knn model for X
    distances, indices = nbrs.kneighbors(test_x)  # find the model's neighbors for each point in test_x
    print('neighbor indices:', indices)
    print('neighbor distances:', distances)
    
    # ============================== Unsupervised nearest neighbor search with a KD tree or Ball tree ========================
    
    from sklearn.neighbors import KDTree, BallTree
    import numpy as np  # fast operations on structured arrays
    
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    # test_x = np.array([[-3.2, -2.1], [-2.6, -1.3], [1.4, 1.0], [3.1, 2.6], [2.5, 1.0], [-1.2, -1.3]])  # test data
    test_x=X  # The test data equals the sample data, so each sample's neighbors are searched within the sample set itself.
    kdt = KDTree(X, leaf_size=30, metric='euclidean')
    distances, indices = kdt.query(test_x, k=2, return_distance=True)
    print('neighbor indices:', indices)
    print('neighbor distances:', distances)
    
    

    Nearest neighbors classification

    scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier is based on the k nearest neighbors of each query point, where k is an integer specified by the user. RadiusNeighborsClassifier is based on the number of neighbors within a fixed radius r of each query point, where r is a floating-point value specified by the user.

    k-neighbors classification in KNeighborsClassifier is the more commonly used of the two techniques. The optimal value of k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.

    If the data is not uniformly sampled, radius-based neighbors classification with RadiusNeighborsClassifier may be the better choice.

    In RadiusNeighborsClassifier the user specifies a fixed radius r, so that points in sparser neighborhoods use fewer nearest neighbors for the classification.

    For high-dimensional parameter spaces this method becomes less effective due to the so-called "curse of dimensionality".

    In both k-nearest-neighbor classifiers, the basic classification uses uniform weights: the value assigned to a query point is computed from a simple majority vote of its nearest neighbors. Under some circumstances it is better to weight the neighbors so that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword.

    The default, weights = 'uniform', assigns uniform weights to each neighbor, while weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.

    # ========== k-nearest neighbors classification ==========
    import numpy as np  # fast operations on structured arrays
    from sklearn.neighbors import KNeighborsClassifier, KDTree   # import the knn classifier
    
    
    # Data set: 4 attributes, 3 classes
    data=[
        [ 5.1,  3.5,  1.4,  0.2, 0],
        [ 4.9,  3.0,  1.4,  0.2, 0],
        [ 4.7,  3.2,  1.3,  0.2, 0],
        [ 4.6,  3.1,  1.5,  0.2, 0],
        [ 5.0,  3.6,  1.4,  0.2, 0],
        [ 7.0,  3.2,  4.7,  1.4, 1],
        [ 6.4,  3.2,  4.5,  1.5, 1],
        [ 6.9,  3.1,  4.9,  1.5, 1],
        [ 5.5,  2.3,  4.0,  1.3, 1],
        [ 6.5,  2.8,  4.6,  1.5, 1],
        [ 6.3,  3.3,  6.0,  2.5, 2],
        [ 5.8,  2.7,  5.1,  1.9, 2],
        [ 7.1,  3.0,  5.9,  2.1, 2],
        [ 6.3,  2.9,  5.6,  1.8, 2],
        [ 6.5,  3.0,  5.8,  2.2, 2],
    ]
    
    # Build the sample matrix and label vector
    dataMat = np.array(data)
    X = dataMat[:,0:4]
    y = dataMat[:,4]
    
    knn = KNeighborsClassifier(n_neighbors=2, weights='distance')    # initialize a knn model with k=2; weights='distance' weights each sample by the inverse of its distance, 'uniform' gives equal weights
    knn.fit(X, y)                                          # fit the knn model on the samples and labels
    result = knn.predict([[3, 2, 2, 5]])                   # predict the class of a new sample with knn
    print(result)
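
    The example above uses the built-in 'distance' weighting; weights also accepts a user-defined callable, and RadiusNeighborsClassifier provides the fixed-radius alternative described earlier. A minimal sketch reusing X and y from the block above (the weighting function, radius and query point are illustrative, not from the original post):

    from sklearn.neighbors import RadiusNeighborsClassifier

    # A weights callable receives an array of distances and must return
    # an array of weights with the same shape (here, inverse-square weighting).
    def inv_square(dist):
        return 1.0 / (dist ** 2 + 1e-9)   # small constant avoids division by zero

    knn_w = KNeighborsClassifier(n_neighbors=3, weights=inv_square).fit(X, y)
    print(knn_w.predict([[6.0, 3.0, 4.5, 1.5]]))

    # Fixed-radius variant: classify from every neighbor within radius r of the query.
    rnn = RadiusNeighborsClassifier(radius=3.0, weights='distance').fit(X, y)
    print(rnn.predict([[6.0, 3.0, 4.5, 1.5]]))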
    
    

    Nearest neighbors regression

    Nearest neighbors regression is used when the data labels are continuous rather than discrete variables. The label assigned to a query point is computed from the mean of the labels of its nearest neighbors.

    scikit-learn implements two different nearest neighbors regressors: KNeighborsRegressor is based on the k nearest neighbors of each query point, where k is an integer specified by the user. RadiusNeighborsRegressor is based on the number of neighbors within a fixed radius r of each query point, where r is a floating-point value specified by the user.

    The basic nearest neighbors regression uses uniform weights: each neighbor in the local neighborhood contributes equally to the prediction for the query point. Under some circumstances it can be advantageous to weight the neighbors so that nearby points contribute more to the regression than distant points. This can be accomplished through the weights keyword. The default, weights = 'uniform', assigns equal weights to all points, while weights = 'distance' assigns weights inversely proportional to the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.

    # ============================== k-nearest neighbors regression ========================
    
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import neighbors
    
    np.random.seed(0)
    X = np.sort(5 * np.random.rand(40, 1), axis=0)
    T = np.linspace(0, 5, 500)[:, np.newaxis]
    y = np.sin(X).ravel()
    
    # add noise to the targets
    y[::5] += 1 * (0.5 - np.random.rand(8))
    
    # fit the regression models
    n_neighbors = 5
    
    for i, weights in enumerate(['uniform', 'distance']):
        knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
        y_ = knn.fit(X, y).predict(T)
    
        plt.subplot(2, 1, i + 1)
        plt.scatter(X, y, c='k', label='data')
        plt.plot(T, y_, c='g', label='prediction')
        plt.axis('tight')
        plt.legend()
        plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,weights))
    
    plt.show()
    

    [Figure: KNeighborsRegressor predictions with 'uniform' and 'distance' weights]
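
    RadiusNeighborsRegressor, mentioned above, follows the same pattern with a fixed radius instead of a fixed k. A minimal sketch reusing X, y and T from the example above (the radius value is illustrative):

    from sklearn.neighbors import RadiusNeighborsRegressor

    # Average the targets of every training point within distance 1.0 of each query point.
    rnr = RadiusNeighborsRegressor(radius=1.0, weights='uniform')
    y_radius = rnr.fit(X, y).predict(T)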

  • k-nearest neighbors

    2018-04-15 16:44:01

    K-Nearest Neighbors
    The algorithm stores all training samples (with known labels). To predict a new sample (with unknown label), it analyzes the similarity between the new sample and the labeled training samples, selects the K most similar training samples, and obtains the new sample's label by voting among them (or by computing a weighted sum, etc.). The method is sometimes called "learning by example", because it always judges the class of a new sample from the similarity between its feature vector and the feature vectors of the labeled samples.

    CvKNearest 
    class CvKNearest : public CvStatModel 
    This class implements the K-Nearest Neighbors model.

    CvKNearest::CvKNearest

    Constructors

    Default and training constructors.
    C++: CvKNearest::CvKNearest()
    C++: CvKNearest::CvKNearest(const Mat& trainData, const Mat& responses, const Mat& sampleIdx=Mat(), bool isRegression=false, int max_k=32 )
    C++: CvKNearest::CvKNearest(const CvMat* trainData, const CvMat* responses, const CvMat* sampleIdx=0, bool isRegression=false, int max_k=32 )

    Training method
    CvKNearest::train
    C++: bool CvKNearest::train(const Mat& trainData, const Mat& responses, const Mat& sampleIdx=Mat(), bool isRegression=false, int maxK=32, bool updateBase=false )
    C++: bool CvKNearest::train(const CvMat* trainData, const CvMat* responses, const CvMat* sampleIdx=0, bool is_regression=false, int maxK=32, bool updateBase=false)
    Python: cv2.KNearest.train(trainData, responses[, sampleIdx[, isRegression[, maxK[, updateBase]]]]) -> retval

    Parameters:

        isRegression – Type of the problem: true for regression and false for classification.
        maxK – Number of maximum neighbors that may be passed to the method CvKNearest::find_nearest()
        updateBase – Specifies whether the model is trained from scratch (update_base=false), or it is updated using the new training data (update_base=true). In the latter case, the parameter maxK must not be larger than the original value.

    The method trains the K-Nearest model. It follows the conventions of the generic CvStatModel::train() approach with the following limitations:

        • Only CV_ROW_SAMPLE data layout is supported.
        • Input variables are all ordered.
        • Output variables can be either categorical ( is_regression=false ) or ordered ( is_regression=true ).

        • Variable subsets (var_idx) and missing measurements are not supported.


    Finding neighbors and predicting responses for input vectors

    CvKNearest::find_nearest

    Finds the neighbors and predicts responses for input vectors.
    C++: float CvKNearest::find_nearest(const Mat& samples, int k, Mat* results=0, const float** neighbors=0, Mat* neighborResponses=0, Mat* dist=0 ) const
    C++: float CvKNearest::find_nearest(const Mat& samples, int k, Mat& results, Mat& neighborResponses, Mat& dists) const
    C++: float CvKNearest::find_nearest(const CvMat* samples, int k, CvMat* results=0, const float** neighbors=0, CvMat* neighborResponses=0, CvMat* dist=0) const
    Python: cv2.KNearest.find_nearest(samples, k[, results[, neighborResponses[, dists]]]) -> retval, results, neighborResponses, dists

    Parameters
    samples – Input samples stored by rows. It is a single-precision floating-point matrix of number_of_samples × number_of_features size.
    k – Number of used nearest neighbors.
    results – Vector with results of prediction (regression or classification) for each input sample. It is a single-precision floating-point vector with number_of_samples elements.
    neighbors – Optional output pointers to the neighbor vectors themselves. It is an array of k*samples->rows pointers.
    neighborResponses – Optional output values for corresponding neighbors. It is a single-precision floating-point matrix of number_of_samples × k size.
    dist – Optional output distances from the input vectors to the corresponding neighbors. It is a single-precision floating-point matrix of number_of_samples × k size.
    For each input vector (a row of the matrix samples), the method finds the k nearest neighbors. In case of regression, the predicted result is a mean value of the particular vector's neighbor responses. In case of classification, the class is determined by voting.
    For each input vector, the neighbors are sorted by their distances to the vector.
    In case of the C++ interface you can use output pointers to empty matrices and the function will allocate memory itself.
    If only a single input vector is passed, all output matrices are optional and the predicted value is returned by the method.
    The function is parallelized with the TBB library.


    Example:

    #include <opencv2/core.hpp>
    #include <opencv2/ml.hpp>
    #include <iostream>
    using namespace cv;
    using namespace std;
    
    Ptr<ml::KNearest>  knn(ml::KNearest::create());
    Mat_<float> trainFeatures(6,4);
    trainFeatures << 2,2,2,2,
                     3,3,3,3,
                     4,4,4,4,
                     5,5,5,5,
                     6,6,6,6,
                     7,7,7,7;
    
    Mat_<int> trainLabels(1,6);
    trainLabels << 2,3,4,5,6,7;
    
    knn->train(trainFeatures, ml::ROW_SAMPLE, trainLabels);
    
    Mat_<float> testFeature(1,4);
    testFeature<< 3,3,3,3;
    
    int K=1;
    Mat response,dist;
    knn->findNearest(testFeature, K, noArray(), response, dist);
    cerr << response << endl;
    cerr << dist<< endl;
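
    For reference, the equivalent call pattern through the legacy Python bindings listed above (cv2.KNearest in OpenCV 2.x) might look like the sketch below. The data values are illustrative, and this assumes the old 2.x Python API rather than the newer ml::KNearest interface used in the C++ example.

    import numpy as np
    import cv2   # assumes the legacy OpenCV 2.x bindings that expose cv2.KNearest

    train_data = np.array([[2, 2, 2, 2],
                           [3, 3, 3, 3],
                           [4, 4, 4, 4]], dtype=np.float32)
    responses = np.array([[2], [3], [4]], dtype=np.float32)

    knn = cv2.KNearest()
    knn.train(train_data, responses)

    sample = np.array([[3, 3, 3, 3]], dtype=np.float32)
    retval, results, neighbor_responses, dists = knn.find_nearest(sample, 1)
    print(results, dists)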

  • K-nearest neighbors algorithm - source code

    2021-02-28 05:42:58
    The k-nearest neighbors algorithm
  • k-Nearest Neighbors

    Machine Learning models and the curse of dimensionality

    There is always a trade-off between things in life: if you take a certain path, there is always a possibility that you will have to compromise on some other parameter. Machine learning models are no different. In the case of k-Nearest Neighbors there has always been a problem with a huge impact on classifiers that rely on pairwise distances, and that problem is nothing but the “Curse of Dimensionality”. By the end of this article you will be able to create your own k-Nearest Neighbors model and observe the impact of increasing the dimension of the data set it is fit to. Let's dig in!

    Creating a k-Nearest Neighbor model:

    Right before we get our hands dirty with the technical part, we need to lay the groundwork for our analysis, which is nothing but the libraries.

    Thanks to inbuilt machine learning packages, this part of the job is quite easy.

    Nearest neighbors classifier:

    Let’s begin with a simple nearest neighbor classifier in which we have been posed with a binary classification task: we have a set of labeled inputs, where the labels are all either 0 or 1. Our goal is to train a classifier to predict a 0 or 1 label for new, unseen test data. One conceptually simple approach is to simply find the sample in the training data that is “most similar” to our test sample (a “neighbor” in the feature space), and then give the test sample the same label as the “most similar” training sample. This is the nearest neighbors classifier.

    After running a few lines of code we can visualize our data set, with training data shown in blue (negative class) and red (positive class). A test sample is shown in green. To keep things simple, I have used a simple linear boundary for classification.

    [Image: training data with test sample]

    To find the nearest neighbor, we need a distance metric. For our case, I chose to use the L2 norm. There certainly are a few perks to using the L2 norm as a distance metric: given that we don't have any outliers, the L2 norm minimizes the mean cost and treats every feature equally.

    The nearest neighbor to the test sample is circled, and its label is applied as the prediction for the test sample:

    [Image: Nearest neighbor classified]

    Using the nearest neighbor we successfully classified our test value as label “0”, but again we assumed there were no outliers and we also moderated the noise.

    The nearest neighbor classifier works by “memorizing” the training data. One interesting consequence of this is that it will have zero prediction error (or equivalently, 100% accuracy) on the training data, since each training sample’s nearest neighbor is itself:

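    The original post's code is not reproduced here, but a minimal NumPy-only sketch of this nearest neighbor rule with an L2 distance looks as follows (the toy data is illustrative, not the data set plotted above):

    import numpy as np

    # Toy training data: two features, binary labels.
    X_train = np.array([[1.0, 1.2], [0.8, 0.5], [3.0, 3.1], [3.4, 2.8]])
    y_train = np.array([0, 0, 1, 1])

    def predict_1nn(x):
        # L2 distances from the query to every training sample.
        dists = np.linalg.norm(X_train - x, axis=1)
        # The nearest neighbor's label becomes the prediction.
        return y_train[np.argmin(dists)]

    print(predict_1nn(np.array([0.9, 1.0])))   # -> 0
    # Each training sample's nearest neighbor is itself, so training accuracy is 100%:
    print(all(predict_1nn(x) == y for x, y in zip(X_train, y_train)))   # -> True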

    Now we look to overcome the shortcomings of the nearest neighbor model, and the answer lies in the model known as the k-Nearest Neighbors classifier.

    K nearest neighbors classifier:

    To make this approach less sensitive to noise, we might choose to look for multiple similar training samples to each new test sample, and classify the new test sample using the mode of the labels of the similar training samples. This is k nearest neighbors, where k is the number of “neighbors” that we search for.

    In the following plot, we show the same data as in the previous example. Now, however, the 3 closest neighbors to the test sample are circled, and the mode of their labels is used as the prediction for the new test sample. Feel free to play with the parameter k and observe the changes.

    [Image: k-NN classifier with k=3]

    The following image shows a set of test points plotted on top of the training data. The size of each test point indicates the confidence in its label, which we approximate by the proportion of the k neighbors sharing that label.

    [Image: Confidence score]

    The bigger the dots, the higher the confidence score for those points.
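
    Continuing in the same spirit, a NumPy-only sketch of the k-nearest-neighbor vote together with this confidence score (the fraction of the k neighbors that share the winning label); the data and query are illustrative:

    import numpy as np

    X_train = np.array([[1.0, 1.2], [0.8, 0.5], [3.0, 3.1], [3.4, 2.8], [2.9, 2.5]])
    y_train = np.array([0, 0, 1, 1, 1])
    x_query = np.array([1.8, 1.6])
    k = 3

    dists = np.linalg.norm(X_train - x_query, axis=1)
    neighbor_labels = y_train[np.argsort(dists)[:k]]      # labels of the k closest samples

    # Majority vote (mode of the neighbor labels) and its confidence.
    values, counts = np.unique(neighbor_labels, return_counts=True)
    prediction = values[np.argmax(counts)]
    confidence = counts.max() / k
    print(prediction, confidence)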

    Also note that the training error for k nearest neighbors is not necessarily zero (though it can be!), since a training sample may have a different label than its k closest neighbors.

    Feature scaling:

    One important limitation of k nearest neighbors is that it does not “learn” anything about which features are most important for determining y. Every feature is weighted equally in finding the nearest neighbor.

    The first implication of this is:

    • If all features are equally important, but they are not all on the same scale, they must be normalized, i.e. rescaled onto the interval [0,1]; otherwise, the features with the largest magnitudes will dominate the total distance. (A minimal rescaling sketch follows this list.)

    The second implication is:

    • Even if some features are more important than others, they will all be considered equally important in the distance calculation. If uninformative features are included, they may dominate the distance calculation.

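    As referenced in the first point above, a minimal min-max rescaling sketch (NumPy only; scikit-learn's MinMaxScaler is the more common route, and the values here are illustrative):

    import numpy as np

    X = np.array([[1.0, 200.0],
                  [2.0, 800.0],
                  [3.0, 500.0]])   # the second feature would dominate raw L2 distances

    # Rescale each feature (column) onto the interval [0, 1].
    X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    print(X_scaled)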

    Contrast this with our logistic regression classifier. In the logistic regression, the training process involves learning coefficients. The coefficients weight each feature’s effect on the overall output.

    Let’s see how our model performs for an image classification problem. Consider the following images from CIFAR10, a dataset of low-resolution images in ten classes:

    [Image: images classified as car]

    The images above show a test sample and two training samples with their distances to the test sample.

    The background pixels in the test sample “count” just as much as the foreground pixels, so the image of the deer is considered a very close neighbor, while the image of the car is not. As stated before, we used the L2 norm, and our model treats every pixel equally, which makes it difficult for the nearest neighbor approach to classify real images.

    [Image: images classified as car]

    We also see here that Euclidean distance is not a good metric of visual similarity — the frog on the right is almost as similar to the car as the deer in the middle!

    K nearest neighbors regression:

    K nearest neighbors can also be used for regression, with just a small change: instead of using the mode of the nearest neighbors to predict the label of a new sample, we use the mean. Consider the following training data:

    [Image: training data]

    We can add a test sample, then use k nearest neighbors to predict its value:

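    A from-scratch sketch of this prediction step (NumPy only; the one-dimensional data and the query value are illustrative):

    import numpy as np

    xs = np.array([0.5, 1.0, 1.8, 2.5, 3.3, 4.0])   # 1-D training inputs
    ys = np.array([1.1, 1.9, 3.0, 2.4, 1.2, 0.3])   # continuous targets
    x_test, k = 2.0, 3

    # Predicted value: mean of the targets of the k nearest training points.
    nearest = np.argsort(np.abs(xs - x_test))[:k]
    print(ys[nearest].mean())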

    The “curse of dimensionality”:

    Classifiers that rely on pairwise distance between points, like the k neighbors methods, are heavily impacted by a problem known as the “curse of dimensionality”. In this section, I will illustrate the problem. We will look at a problem with data uniformly distributed in each dimension of the feature space, and two classes separated by a linear boundary.

    We will generate a test point and show the k nearest neighbors to that test point. We will also show the length (or area, or volume) that we had to search to find those k neighbors, and observe the radius required to find the nearest neighbors as the dimension of the space increases.

    Pay special attention to how that length (or area, or volume) changes as we increase the dimensionality of the feature space.

    First, let's observe the 1D problem:

    [Image: 1D space radius search]

    Now, the 2D equivalent:

    [Image: 2D space radius search]

    Finally, the 3D equivalent:

    [Image: 3D space radius search]

    We can see that as the dimensionality of the problem grows, the higher-dimensional space is less densely occupied by the training data, and we need to search a large volume of space to find neighbors of the test point. The pair-wise distance between points grows as we add additional dimensions.

    And in that case, the neighbors may be so far away that they don’t actually have much in common with the test point.

    In general, the length of the smallest hyper-cube that contains all k-nearest neighbors of a test point is:

    (k/N)^(1/d)

    for N samples with dimensionality d.

    From the expression above, we can see that as the number of dimensions increases linearly, the number of training samples must increase exponentially to counter the “curse”.

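    A quick numeric illustration of that expression (the values of k and N are chosen only for illustration): with k = 10 neighbors and N = 1000 samples filling a unit hyper-cube, the edge length that must be searched grows quickly with the dimension d.

    k, N = 10, 1000
    for d in (1, 2, 3, 10, 100):
        edge = (k / N) ** (1 / d)
        print(d, round(edge, 3))   # 0.01, 0.1, 0.215, 0.631, 0.955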

    Alternatively, we can reduce d — either by feature selection or by transforming the data into a lower-dimensional space.

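    One common way to reduce d, sketched here with scikit-learn (the number of components and neighbors are illustrative; X_train and y_train stand for whatever labeled data is at hand):

    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Project the features onto a lower-dimensional space before the neighbor search.
    model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
    # model.fit(X_train, y_train); model.predict(X_test)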

    Translated from: https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-7d64634015d9

  • k-nearest neighbors algorithm. Algorithms From Scratch. Introduction: A non-parametric algorithm capable of performing Classification and Regression; Thomas Cover, a professor at Stanford ...
  • Implementing the k-nearest neighbors algorithm in Python

    2020-09-20 20:16:58
    A detailed walkthrough of implementing the k-nearest neighbors algorithm in Python; it has some reference value for interested readers.
  • The k-nearest neighbors algorithm

    2019-10-04 12:19:00
    Principle of the k-nearest neighbors algorithm: within a data set, a new data point belongs to the same class as the point it is closest to. Usage of the k-nearest neighbors algorithm: it can be used for both classification and regression. Applying the k-nearest neighbors algorithm to a classification task: # import the data set generation tool from sklearn.datasets import make_blobs #...
  • k-nearest neighbors algorithm. Chapter 8 of Programming Collective Intelligence (PCI) covers the usage and implementation of the k-nearest neighbors algorithm (k-NN). Simply put: k-NN is a classification algorithm that uses (k) as the number of neighbors to determine which class an item belongs to. To determine which neighbors to use, the algorithm uses ...
  • K-nearest neighbors project - source code

    2021-02-14 10:21:34
    K-nearest neighbors project. Introduction: k-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure. In this project, I used an artificial data set and improved the model with a better value of K. Installation: import pandas as pd import numpy as ...
  • In this chapter: learn to build a classification system with the k-nearest neighbors algorithm; learn feature extraction; learn regression, i.e. predicting numeric values such as tomorrow's stock price or how much a user will like a certain movie; learn application cases and limitations of the k-nearest neighbors algorithm. How do we distinguish different classes? For example, telling an orange from ...
  • MATLAB classifiers: a k-nearest neighbors classifier.
  • 2.2 The k-nearest neighbors algorithm

    2021-01-29 14:32:56
    What is the k-nearest neighbors algorithm? The nearest neighbor algorithm first takes in a labeled training set and then does nothing else. When each test sample comes in, it finds the closest training sample and gives the test sample the same label; in the figure below, for example, the green point should be a triangle. How is "closest"/distance defined? The difference between the L1 and L2 spaces ...
  • K-Nearest Neighbor. KNN finds the k nearest examples (neighbors) and lets them vote for its prediction. In the picture, we want to classify the green circle. If k=3, the green circle belongs to the red ...
  • ML: the k-nearest neighbors algorithm

    2019-02-19 19:54:29
    ML: the k-nearest neighbors algorithm. The main points covered are: the principle of the k-nearest neighbors algorithm; the k-nearest neighbors algorithm applied to classification tasks; the k-nearest neighbors algorithm applied to regression analysis; modeling wine classification with the k-nearest neighbors algorithm. 1. Principle of the k-nearest neighbors algorithm: the principle of the k-nearest neighbors algorithm, just as ...
  • This article presents a Python machine learning case tutorial on implementing the k-nearest neighbors algorithm, covering the concept and examples of the algorithm in detail; it has some reference value for interested readers.
  • The neighbors algorithm, or k-nearest neighbors (kNN, k-NearestNeighbor) classification algorithm, is one of the simplest methods in data mining classification. "K nearest neighbors" means the k closest neighbors: every sample can be represented by its k closest neighbors. The idea of the method is: if a ...
  • An efficient system for simultaneously processing moving k-nearest-neighbor and spatial keyword queries
  • This article was originally written in May; I revised its content today, so it now shows today's publication date (apparently a CSDN bug)... 1. Principle of the k-nearest neighbors classifier. First, a figure is given to help understand the nearest neighbor classifier, as follows: as shown in the figure above, there are two classes of sample data, shown respectively in blue
