What is Unsupervised Learning?
You may have heard about Market Segmentation, where we take a database of customers and group them into different market segments so as to serve them better, or about Social Network Analysis, which finds the coherent groups of friends in a social network. But can you imagine any labels in the training examples used for either of these tasks? If not, then how do we develop Machine Learning algorithms to make predictions on unlabeled data? The answer is Unsupervised Learning. Unsupervised Learning algorithms learn from unlabeled data by finding some structure in the given training set. The following figure illustrates the concept:
Clustering and the Working of the K-Means Algorithm
The above figure gives us an intuition about how an unsupervised learning algorithm can automatically group unlabeled data into coherent clusters. The K-Means algorithm is the most widely used clustering algorithm. So how does the K-Means Algorithm work?
K-Means is an iterative algorithm. Let us say we have an unlabeled dataset as shown:
Our objective is to group the data into 2 clusters. So, first we randomly initialize two points (since we want to group the data into 2 clusters), known as cluster centroids.
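This initialization can be sketched in a few lines of NumPy. This is only an illustrative sketch, assuming the training set sits in an array `X` of shape `(m, n)`; a common choice is to pick K distinct training examples at random as the initial centroids:

```python
import numpy as np

def init_centroids(X, K, seed=0):
    # Pick K distinct training examples at random to serve as the
    # initial cluster centroids.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].copy()
```

Initializing from actual data points avoids starting a centroid in empty space where no example would be assigned to it.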
Now the algorithm goes through each of the examples and, depending on whether it is closer to the Red or the Blue centroid, assigns each data point to one of the cluster centroids. This step is called the “Cluster Assignment Step”.
The second part of the iteration is called the “Move Centroid Step”. The two cluster centroids are taken and moved to the average of their corresponding data points. Specifically, we are going to compute the mean of the points’ location for each of the two clusters already formed and move the centroids to their respective cluster means.
As I have stated earlier, K-Means is an iterative algorithm. So now we go back to the Cluster Assignment Step, go through the training examples, and assign each data point to the nearest cluster centroid.
The algorithm continues with yet another Move Centroid Step, computing the average of the data points in each of the 2 clusters and moving the centroids to their new positions.
As we can see, 2 clusters have been formed quite distinctly. The algorithm will continue to run for as many iterations as have been specified, but intuitively neither of the 2 steps will change anything further, since K-Means has converged.
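The two alternating steps above can be captured in a minimal NumPy implementation. This is a sketch under simple assumptions: the data sits in an array `X` of shape `(m, n)`, clusters are indexed from 0 in code, and a fixed iteration count stands in for a proper convergence check:

```python
import numpy as np

def kmeans(X, K, n_iters=10, seed=0):
    """Minimal K-Means: alternate Cluster Assignment and Move Centroid steps."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K cluster centroids from the data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster Assignment Step: index of the nearest centroid per point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move Centroid Step: each centroid moves to the mean of its points.
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```

On well-separated data the assignments stop changing after a couple of iterations, at which point the remaining iterations are no-ops, exactly as the intuition above suggests.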
Generalized Explanation
To write more formally, the K-Means Algorithm takes two inputs. The first is a parameter (K) for the number of clusters we want to find, and the second is our unlabeled training set having ‘m’ examples.
Each of my training examples has ‘n’ features, hence each of them is taken as an n-dimensional vector in R(n). Now that the inputs have been taken, we run the K-Means Algorithm.
First we randomly initialize K cluster centroids u(1) through u(K). Then our iterative process of Cluster Assignment and Move Centroid begins.
The first inner loop (Cluster Assignment Step) computes a variable ‘c(i)’ to store the index of the cluster centroid closest to training example ‘x(i)’. So, ‘c(i)’ is a number between 1 and K which denotes which cluster centroid ‘k’ (1 through K) the ‘i’th training example is closest to. Now the question arises: how is this operation done?
The loop runs over each of the training examples from 1 through m. To compute c(i), the ‘i’th example x(i) is taken and its distance to each of the cluster centroids u(k) is measured; the value of ‘k’ that gives the minimum distance between x(i) and u(k) is what gets stored in c(i). Hence, c(i) is set to the value of ‘k’ that minimizes the squared norm ||x(i) − u(k)||², thereby selecting the cluster centroid having the shortest distance from the training example x(i).
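The assignment for a single example can be sketched directly from that definition. Note that the code uses 0-based centroid indices, whereas the text numbers clusters 1 through K:

```python
import numpy as np

def assign_cluster(x_i, centroids):
    # c(i) = argmin_k ||x(i) - u(k)||^2 : index of the closest centroid.
    sq_dists = np.sum((centroids - x_i) ** 2, axis=1)
    return int(np.argmin(sq_dists))
```

Minimizing the squared norm gives the same answer as minimizing the distance itself, and skips the square root.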
The other inner loop, for the Move Centroid Step, computes ‘u(k)’ for each cluster as the mean of the training examples assigned to it. Suppose the examples x(1), x(7), x(9) and x(15) have been assigned to cluster centroid 1. This implies that c(1) = c(7) = c(9) = c(15) = 1. The Move Centroid Step would thus compute u(1) as:

u(1) = (1/4) [x(1) + x(7) + x(9) + x(15)]
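A sketch of that computation, assuming the examples sit in a NumPy array `X` and the current assignments in an array `c`:

```python
import numpy as np

def move_centroid(X, c, k):
    # u(k) = mean of all training examples currently assigned to cluster k.
    return X[c == k].mean(axis=0)
```

The boolean mask `c == k` selects exactly the rows of X assigned to cluster k, and the mean is taken feature-wise, so the result has the same dimension ‘n’ as the examples.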
As a result, the cluster centroids move to the mean of the points assigned to them. The iterative K-Means algorithm keeps performing the two steps until the cluster centroids stop moving, at which point it has converged.
The K-Means algorithm has a cost function that it tries to optimize, just like supervised algorithms, which helps in debugging and in avoiding bad local optima. We have used the variables c(i) and u(k) for the Cluster Assignment and Move Centroid steps, respectively. We also need notation for the cluster centroid of the cluster to which a particular training example x(i) has been assigned: u(c(i)).
Suppose x(i) has been assigned to cluster 7. Then, by the Cluster Assignment Step, c(i) = 7, and the cluster centroid of the cluster to which x(i) has been assigned is u(c(i)) = u(7).
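With the centroids stored in an array, looking up u(c(i)) is just a nested index. A small sketch (0-based indices in code, versus the 1-based clusters in the text):

```python
import numpy as np

centroids = np.array([[0.0, 0.0],    # u(1) in the text, index 0 in code
                      [5.0, 5.0]])   # u(2) in the text, index 1 in code
c = np.array([1, 0, 1])              # c(i): assigned centroid index per example

# u(c(i)) for every example i at once, via fancy indexing:
assigned_centroids = centroids[c]
```

Row i of `assigned_centroids` is the centroid of the cluster that example i belongs to, which is exactly the quantity the cost function needs.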
The cost function that K-Means optimizes is a function J of c(1) through c(m) and u(1) through u(K). These parameters are varied so that the cost function is minimized.
The cost function is also known as the Distortion cost function, and it is minimized by the iterative process of the Cluster Assignment Step and the Move Centroid Step:

J(c(1), ..., c(m), u(1), ..., u(K)) = (1/m) Σ (i = 1 to m) ||x(i) − u(c(i))||²
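The distortion is straightforward to compute once the assignments are known; a sketch, again assuming X holds the examples, c the assignments, and centroids the u(k)'s:

```python
import numpy as np

def distortion(X, c, centroids):
    # J = (1/m) * sum over i of ||x(i) - u(c(i))||^2
    diffs = X - centroids[c]          # x(i) - u(c(i)) for every example
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

Tracking this value after every iteration is a useful debugging check: it should never increase, since each of the two steps can only lower (or leave unchanged) the distortion.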
So, the 2 sets of variables are, in a sense, partitioned: J is first minimized with respect to the c’s (in the Cluster Assignment Step), then with respect to the u’s (in the Move Centroid Step), and the process repeats itself the specified number of times. At the end of the iterative process K-Means converges and a structure in the data is eventually found.
The K-Means algorithm scales really well to large datasets, generalizes to clusters of various shapes, and is guaranteed to converge after a finite number of iterations. However, it can fail to yield accurate results if outliers are present or the density of the data is uneven. Nevertheless, equipped with the quality of adapting easily to new examples, K-Means is still the most widely used Unsupervised Learning algorithm.
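In practice one rarely hand-rolls the loop; an illustrative sketch using scikit-learn’s `KMeans` on synthetic two-blob data (the blob parameters here are made up for the example) wraps the same iterative procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(8, 0.5, (50, 2))])

# n_init restarts the algorithm from several random initializations
# and keeps the run with the lowest distortion.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # the two converged centroids
print(km.inertia_)           # the distortion (summed, not averaged over m)
```

The multiple random restarts via `n_init` are the standard remedy for the local-optima issue mentioned above.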