• K均值算法   K均值算法是一种聚类算法，把样本分配到离它最近的类中心所属的类，类中心由属于这个类的所有样本确定。   k均值算法是一种无监督的聚类算法。算法将每个样本分配到离它最近的那个类中心所代表的...
K均值算法

K均值算法是一种聚类算法，把样本分配到离它最近的类中心所属的类，类中心由属于这个类的所有样本确定。

k均值算法是一种无监督的聚类算法。算法将每个样本分配到离它最近的那个类中心所代表的类，而类中心的确定又依赖于样本的分配方案。

在实现时，先随机初始化每个类的类中心，然后计算样本与每个类的中心的距离，将其分配到最近的那个类，然后根据这种分配方案重新计算每个类的中心。这也是一种分阶段优化的策略。

与k近邻算法一样，这里也依赖于样本之间的距离，因此需要定义距离的计算方式，最常用的是欧氏距离，也可以采用其他距离定义。算法在实现时要考虑下面几个问题：

1.类中心向量的初始化。一般采用随机初始化。最简单的是Forgy算法，它从样本集中随机选择k个样本作为初始类中心。第二种方案是随机划分，它将所有样本随机的分配给k个类中的一个，然后按照这种分配方案计算各个类的类中心向量。

2.参数k的设定。可以根据先验知识人工指定一个值，或者由算法自己确定。

3.迭代终止的判定规则。一般做法是计算本次迭代后的类中心和上一次迭代时的类中心之间的距离，如果小于指定阈值，则算法终止。


展开全文
• k均值：k均值算法
• K——均值法 一.算法学习： 1.前提： 模式特征矢量集为{x1,x2,…,xN}；类的数目K是事先取定的。 2.基本思想： 任意选取K个聚类中心，按最小距离原则将各模式分配到K类的某一类。不断计算聚类中心和调整各模式的类别...
K——均值法
一.算法学习：
1.前提：
模式特征矢量集为{x1,x2,…,xN}；类的数目K是事先取定的。
2.基本思想：
任意选取K个聚类中心，按最小距离原则将各模式分配到K类的某一类。不断计算聚类中心和调整各模式的类别，最终使各模式到其判属类别中心的距离平方之和最小。
Sj：第j个聚类集（域），Zj ；聚类中心，Nj： Sj中所含的样本个数

聚类中心的选择应使准则函数J极小，Sj类的聚类中心应选为该类样本的均值。
3.步骤：
(1)任选K个模式特征矢量作为初始聚类中心：	z1(1) ,z2(1) ,…zK(1)。括号内的序号表示迭代次数。
(2)将待分类的模式特征矢量集{x}中的模式逐个按最小距离原则分划给K类中的某一类。
(3)计算重新分类后的各聚类中心zj(k+1)，即求各聚类域中所包含样本的均值向量：
，以均值向量作新的聚类中心。可得新的准则函数：
(4)如果zj(k+1)=zj(k)(j=1,2,…K),则结束；否则，k=k+1,转(2)。
注意：多次运行K均值算法，例如50~1000次，每次随机选取不同的初始聚类中心。聚类结束后计算准则函数值，选取准则函数值最小的聚类结果为最后的结果。该方法一般适用于聚类数目小于10的情况。

二．例题：
X1(0,0) X2(3,8) X3(2,2) X4(1,1) X5(5,3)
X6(4,8) X7(6,3) X8(5,4) X9(6,4) X10(7,5)
2.1算法分析：
1）	第一个问题应该是选取一个合适的K值，然后随机选取K个点作为初始的聚类中心。
2）	然后就是计算距离，进行聚类（即把点集中每个点分到对应的聚类中）。
3）	计算新的聚类中心，均值法计算。
4）	判断上一轮的聚类中心和本轮的聚类中心是否相等。否则返回第二步。
2.2 代码：（python代码经planet B高亮处理）
1.	import random
2.	import numpy as np
3.	#数据导入工作
4.	data=[[0,0],[3,8],[2,2],[1,1],[5,3],[4,8],[6,3],[5,4],[6,4],[7,5]]
5.	l=len(data)
6.	print("数据长度：",l)
7.	k=int(input("请输入k值："))
8.	#随机选k个向量作为均值向量
9.	sj=random.sample(range(l),k)  #
10.	print("随机选取k个数:",sj)
11.	jzxl=[data[i] for i in sj]  #均值向量
12.	print("均值向量:",jzxl)
13.
14.	t=0
15.
16.	while True:
17.	    t=t+1
18.	    if t>100:
19.	        print("循环超过100次")
20.	        break
21.
22.	    a=[[] for i in range(k)]  # k个空列表来存各个样本与均值的距离
23.	    #计算距离
24.	    for j in range(l):
25.	        gds=1000 #固定数作为每轮比较值
26.	        julei=0
27.	        for q in range(k):
28.	            d=np.hypot(data[j][0]-jzxl[q][0],data[j][1]-jzxl[q][1])
29.	            if d<gds:
30.	                gds=d
31.	                julei=q
32.	        a[julei].append(data[j])
33.
34.
35.
36.	    #用均值法计算新的聚类中心
37.	    newjzxl=[[] for i in range(k)]
38.	    for i in range(k):
39.	        s1=np.sum(a[i][j][0] for j in range(len(a[i])))
40.	        z1=s1/len(a[i])
41.	        newjzxl[i].append(z1)
42.
43.	        s2=np.sum(a[i][j][1] for j in range(len(a[i])))
44.	        z2=s2/len(a[i])
45.	        newjzxl[i].append(z2)
46.	    if jzxl==newjzxl:  #如果前后均值向量相等，则退出循环
47.
48.	        break
49.	    else:
50.	        print("第",t,"次聚类结果为：")
51.	        for i in range(k):
52.	             print("聚类",i,": ",a[i])
53.	        jzxl=newjzxl
54.
55.	    #break
56.	print("共循环了  ",t,"  次，最终聚类结果为")
57.	for i in range(k):
58.	    print("聚类",i,": ",a[i])




展开全文
• k均值算法 二分k均值算法Have you ever seen a Caribbean reef? Well if you haven’t, prepare yourself. 您见过加勒比礁吗？ 好吧，如果没有，请做好准备。 Today, we will be answering a question that, at ...
k均值算法 二分k均值算法Have you ever seen a Caribbean reef? Well if you haven’t, prepare yourself. 您见过加勒比礁吗？ 好吧，如果没有，请做好准备。
Today, we will be answering a question that, at face value, appears quite simple: “What does a Caribbean reef look like?” However, this question can be decomposed into many complex layers. So to avoid ambiguity, let’s refine the question to: “What are the non-mobile components of a Caribbean reefs and how are they related?” 今天，我们将回答一个从表面上看很简单的问题：“加勒比海礁石看起来像什么？” 但是，这个问题可以分解为许多复杂的层。 因此，为避免歧义，让我们将问题细化为：“加勒比海珊瑚礁的非活动组成部分是什么，它们之间有何关系？”
That seems reasonable; we’ll have to look at fish another day. 这似乎是合理的； 我们要改天看看鱼。
Now we’re not going to roll out beautiful images of underwater cities teeming with diversity. Instead, we have bar charts. Without further ado, let’s dive in. 现在，我们不打算发布充满多样性的水下城市的美丽影像。 相反，我们有条形图。 事不宜迟，让我们开始吧。
什么是典型的珊瑚礁？ (What Makes up a Typical Reef?)
To start, we have developed a baseline graph (Figure 1) of the components of all Caribbean reefs. Here we have the median percent cover for nine substrate types. Now, if you haven’t conducted a scuba transect before, it may be helpful to break down the above sentence. First, percent cover is how coral reef composition is measured — in other words, from a birds-eye view what percent of sea floor is hard coral, sponge, rock, etc. Second, substrates types are broad categories of sea floor, such as silt or sand. If you’re curious about the sampling methods or specific substrate definitions, check this out. 首先，我们绘制了所有加勒比海珊瑚礁成分的基线图(图1)。 此处，我们提供了9种基材类型的中位覆盖率百分比。 现在，如果您以前没有进行过水肺横断面检查，则最好将上述句子分解。 首先，覆盖率是如何测量珊瑚礁成分的，换句话说，从鸟瞰角度看，硬质珊瑚，海绵，岩石等占海床的百分比。其次，底物类型是海床的大类，例如淤泥或沙子。 如果您想了解抽样方法或特异底物的定义，请检查该出来。
Ok, so in Figure 1 we’re looking at the median value for each of the nine substrate values. For example, in the Hard Coral column, we can see that hard coral’s median percent cover is roughly 17%. Good to know. 好的，因此在图1中，我们查看的是9个底物值中的每个的中值。 例如，在“ 硬珊瑚”列中，我们可以看到硬珊瑚的覆盖率中位数约为17％。 很高兴知道。
Diving deeper into the chart, it appears that most Caribbean reefs are primarily composed four substrate types: rock, hard coral, nutrient indicator algae (NI Algae), and sand. Together, these four categories account for 91% of the total median values. On the other hand, recently killed coral (RK Coral) and silt both have median values of 0. So, they’re relatively rare. 深入研究图表，似乎大多数加勒比海礁石主要由四种基质类型组成：岩石，硬珊瑚，营养指示藻类( NI Algae )和沙子。 这四个类别合起来占总中值的91％。 另一方面，最近被杀死的珊瑚( RK Coral )和淤泥的中位数均为0。因此，它们相对较少。
We have learned that Caribbean reefs are rocky and sandy. Lovely. 我们了解到加勒比礁是岩石和沙滩。 可爱。
But here’s an alarming analogy: the average number of children per US family is 1.93. If we take that number to be representative of the data, we might conclude that most families have 1.93 children, which I find hard to believe. Even worse, we have no understanding of the underlying distribution that led to an average of 1.93. There could be one family with 184 children and 9 families with one child. Instead, it would be useful to see if there are common counts for the number of kids per family. 但这是一个令人震惊的类比：每个美国家庭的平均孩子人数为1.93。 如果我们以该数字作为数据的代表，我们可以得出结论，大多数家庭有1.93个孩子，我很难相信。 更糟糕的是，我们不了解导致平均1.93的基本分布。 可能有一个家庭有184个孩子，有9个家庭有一个孩子。 取而代之的是，查看每个家庭的孩子数是否有共同计数是有用的。
K-均值演示 (K-Means Demo)
Applying this logic to reef composition, we will explore if there are groups coral reefs using the above substrate categories. This is where unsupervised classification comes into play. Unsupervised algorithms fit data where we don’t know the “correct” answer. And, one of the simplest methods of all is the k-means algorithm. 将这种逻辑应用于珊瑚礁组成，我们将使用上述基质类别探讨是否存在珊瑚礁群。 这是无监督分类起作用的地方。 无监督算法适合我们不知道“正确”答案的数据。 而且，最简单的方法之一是k-means算法。
Without getting too technical, k-means attempts to split data into k clusters. The algorithm does this by minimizing the distance from the center of the cluster (the cluster mean) to all points in that cluster. And because of this simple fitting criteria, it’s really easy to interpret. So let’s see an example… 不用太技术，k-means尝试将数据拆分为k个群集。 该算法通过最小化从群集中心(群集均值)到该群集中所有点的距离来实现此目的。 而且由于这种简单的拟合标准，它真的很容易解释。 因此，让我们看一个例子……
Reef Check.Reef Check 。 In Figure 2 we have created two clusters (k=2 in this case) using two substrate categories: hard coral and nutrient indicator algae. As you can see, there appears to be a clear divide between these two categories. But, let’s not get into interpretation quite yet. 在图2中，我们使用两个基质类别(硬珊瑚和营养指示藻)创建了两个群集(在这种情况下， k = 2 )。 如您所见，这两个类别之间似乎存在明显的鸿沟。 但是，让我们暂时不做解释。
Instead, let’s consider the case where we add another variable. Here, the k-means algorithm would categorize each point using three dimensions instead of two. But as you increase the number of dimensions, you lose the ability to visualize; it’s pretty hard to think in five or eight dimensions. However, we can still see where the cluster centers are numerically located in hyperspace. 相反，让我们考虑添加另一个变量的情况。 在这里，k-means算法将使用三个维度而不是两个维度对每个点进行分类。 但是随着尺寸的增加，您将失去可视化的能力。 很难从五个或八个维度来思考。 但是，我们仍然可以看到聚类中心在数字上位于超空间中的位置。
Now that we have a basic understanding of what k-means does, let’s move on to the interesting graphs. 现在，我们对k均值的功能有了基本的了解，让我们继续研究有趣的图。
前4种基板类型(k = 3) (Top 4 Substrate Types (k=3))
In Figure 3 (below) we have fit three clusters (k=3) using the four most most prevalent substrate types. Each bar represents a substrate category. The height of each bar represents the the difference between the cluster mean and the total mean for that given substrate. Blue bars correspond to a cluster mean greater than the entire category’s mean and conversely, red bars correspond to a cluster mean less than the entire category’s mean. 在下面的图3中，我们使用四种最普遍的底物类型拟合了三个簇( k = 3 )。 每个条形代表基材类别。 每个条形的高度代表该给定底物的簇均值与总均值之差。 蓝色条形对应的聚类平均值大于整个类别的平均值，红色条形对应的聚类平均值小于整个类别的平均值。
Reef Check.Reef Check 。 When classifying Caribbean reefs into three clusters there appear to be sensible groupings: sand-dominated, rock-dominated, and algae-dominated. Interestingly, hard coral showed relatively little change even though it was the second most abundant substrate category. Conversely, nutrient indicator algae, which is often found on degraded reefs, had extremely high signal relative to its abundance. 将加勒比海珊瑚礁分为三类时，似乎有一些合理的分类：以沙子为主，以岩石为主和以藻类为主。 有趣的是，即使硬质珊瑚是第二丰富的底物类别，其变化也相对较小。 相反，经常在退化的珊瑚礁上发现的营养指示剂藻类相对于其丰富度具有极高的信号。
We can also observe that sand-dominated reefs allowed for the highest quantity of hard coral at roughly 10 percentage points more than the total data average. Rock-dominated reefs were net positive but had little impact on hard corals. And finally, as most people would expect, the evil nutrient indicator algae appears to have a fairly strong negative impact on all other substrate types. 我们还可以观察到，以砂岩为主的礁石允许的硬珊瑚数量最多，比整个数据平均值高出大约10个百分点。 岩石为主的礁石为净阳性，但对硬珊瑚影响不大。 最后，正如大多数人所期望的那样，邪恶的营养指示剂藻类似乎对所有其他底物类型具有相当强烈的负面影响。
Ok, we’re starting to get somewhere. Now let’s increase the number of substrate types by including all categories that had a median value greater than zero: only silt and recently killed coral were not included. 好的，我们开始有所建树。 现在，通过包含中值大于零的所有类别来增加底物类型的数量：不包括淤泥和最近被杀死的珊瑚。
非零中值基板类型(k = 3) (Non-Zero-Median Substrate Types (k=3))
Reef Check.Reef Check 。 In Figure 4 it appears the categories we found above hold steady. Sand/rubble dominated reefs seem to support the most life with above-average values in hard coral, soft coral, and sponge. Rocky reefs also exhibit life-supporting ability, although less than its sandy counterpart. And finally, nutrient indicator algae reefs show below average percent cover in all other substrate values observed. 在图4中，我们上面找到的类别似乎保持稳定。 在硬珊瑚，软珊瑚和海绵中，以沙/卵石为主的礁石似乎能维持大多数生命，其价值均高于平均值。 礁石还具有生命维持能力，尽管比沙质礁石要弱一些。 最后，营养指示剂藻类礁石在所有其他底物值中均显示低于平均覆盖率。
Now you might be wondering what the deal is with NI Algae. Well, nutrient indicator algae are often found on degraded reefs because they thrive in waters with elevated nutrient levels, such as nitrogen and phosphorus; Reef Check added this category to monitor the infamous algal blooms. Conversely, these high levels of nutrients can be harmful to corals. Thus, we would expect to see an inverse relationship between nutrient indicator algae and the other living substrate types, namely sponges, soft corals, and hard corals. 现在您可能想知道与NI Algae达成的交易是什么。 好吧，营养指示剂藻类经常在退化的珊瑚礁上发现，因为它们在营养水平较高的水中繁殖，例如氮和磷。 Reef Check添加了此类别，以监视臭名昭著的藻华。 相反，这些高含量的养分可能对珊瑚有害。 因此，我们希望看到营养指示剂藻类与其他活的基质类型(即海绵，软珊瑚和硬珊瑚)之间存在反比关系。
This stuff is pretty cool. 这个东西很酷。
使用非零基材值进行拟合(k = 4) (Fitting Using the Non-Zero Substrate Values (k=4))
In our final chart, we will try increasing the number of clusters to four because who’s to say there are only three types of Caribbean reefs? Well, technically there are statistical methods to show reasonable values that k can take. In this case the elbow method was implemented and three to five clusters were deemed sensible. 在我们的最终图表中，我们将尝试将集群数增加到四个，因为谁能说只有三种类型的加勒比海珊瑚礁？ 嗯，从技术上讲，有统计方法可以显示k可以取的合理值。 在这种情况下，采用肘部方法，认为三到五个簇是明智的。
Reef Check.Reef Check 。 As shown shown in Figure 5 to the left, as expected, a fourth category has emerged. Boasting extremely high values of hard and soft corals, this coral-dominated reef appears to be the “healthiest” reefs of the four. 如预期的那样，如左图5所示，出现了第四类。 这种以珊瑚为主的珊瑚礁拥有极高的硬珊瑚和软珊瑚价值，似乎是这四种珊瑚中“最健康的”。
Now why did increasing the number of clusters suddenly create this magical healthy reef category? Well, with only three clusters, the high levels of hard and soft corals were lumped into the sand-dominated and rock-dominated classifications. By allowing for a fourth category, the data could be subset more cleanly. 现在，为什么增加簇的数量突然创建了这个神奇的健康珊瑚礁类别？ 好吧，只有三个集群，高水平的硬珊瑚和软珊瑚被归类为以沙子为主和以岩石为主的分类。 通过考虑第四类，可以更清晰地对数据进行子集化。
In a similar vein, why can’t we conclude that there are five types of reefs? To answer your outstanding question, k-means with k=5 was plotted, however the categories created were not intuitive. Moreover, because four central substrate categories compose 91% of the median total, limiting to four clusters is intuitive. 同样，为什么我们不能得出结论说有五种类型的珊瑚礁呢？ 为了回答您的悬而未决的问题，绘制了k = 5的 k均值，但是创建的类别不直观。 此外，由于四个中央底物类别构成中位数总数的91％，因此直观地限制为四个簇即可。
Ok final question, how can we tell if three or four clusters is better? Another outstanding question, but unfortunately there isn’t a clear answer. 好吧，最后一个问题，我们如何确定三个或四个集群更好？ 另一个悬而未决的问题，但不幸的是没有一个明确的答案。
From an ecological perspective, there is no reason why rock and sand-dominated reefs can’t support corals and sponges, which argues for k=3. It’s also simpler. However, by creating four clusters we can develop clear-cut classifications that appear to correspond to health, which argues for k=4. Those categories are: 从生态的角度来看，没有任何理由说明以岩石和沙子为主的珊瑚礁不能支撑珊瑚和海绵，这证明了k = 3 。 它也更简单。 但是，通过创建四个群集，我们可以开发出与健康相对应的清晰分类，这证明k = 4 。 这些类别是：
High health: coral-dominated 高健康：珊瑚为主 Medium health: sand/rubble-dominated, rock-dominated 中度健康：以沙子/碎石为主，以岩石为主 Low health: algae-dominated 低健康：藻类为主 As with many applied statistics problems, humans have to make judgement calls based on subject-matter knowledge. Here, there are good arguments for both k=3 and k=4. 与许多应用统计问题一样，人类必须根据主题知识做出判断。 在这里，对于k = 3和k = 4都有很好的论据。
结论 (Conclusion)
I’m glad you now understand why bar charts are superior to pretty pictures. Even though you have no idea what a Caribbean reef looks like, you have a better understanding of what makes up a Caribbean reef (which is pretty cool). 我很高兴您现在了解为什么条形图优于漂亮的图片。 即使您不知道加勒比礁是什么样子，您也可以更好地了解加勒比礁的构成(这很酷)。
What else can we conclude? 我们还能得出什么结论？
Caribbean reefs tend to be dominated by sand, rock, hard coral, and nutrient indicator algae. However, ratios differ greatly at the tails of the distributions. 加勒比礁往往以沙子，岩石，坚硬的珊瑚和营养指示剂藻类为主。 但是，比率在分布的尾部差别很大。 One of the most consistent reef classifications was algae-dominated reefs. Algal blooms tend to occur in areas with high levels of sunlight, nutrients, and CO2 (a term called eutrophication), so from an ecological standpoint, it makes sense that coral cover would have an inverse relationship with algae. That being said, further research is required, specifically species breakdown of the NI algae. 最一致的礁石分类之一是藻类为主的礁石。 藻华往往发生在阳光，营养和二氧化碳含量高的区域(富营养化)，因此从生态角度来看，珊瑚覆盖与藻类成反比是有意义的。 话虽如此，还需要进一步的研究，特别是NI藻类的种类分解。 All classifications that do not include nutrient indicator algae have the ability to support coral. That being said, sand-dominated reefs show a higher “life capacity” than rock-dominated reefs. 所有不包括营养指标藻类的分类都具有支持珊瑚的能力。 话虽如此，以砂为主的珊瑚礁比以岩石为主的珊瑚礁显示出更高的“生命能力”。 Got any other ideas? 还有其他想法吗？
资料来源 (Sources)
Algae can function as indicators of water pollution. (n.d.). Retrieved August 21, 2020, from http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/ 藻类可以作为水污染的指标。 (nd)。 检索于2020年8月21日， 网址为http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/ Barott, K. L., Rodriguez-Mueller, B., Youle, M., Marhaver, K. L., Vermeij, M. J., Smith, J. E., & Rohwer, F. L. (2011). Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proceedings of the Royal Society B: Biological Sciences, 279(1733), 1655–1664. doi:10.1098/rspb.2011.2155 KL的Barott，B。的Rodriguez-Mueller，M。的Youle，Marhaver的KL，Vermeij，MJ，Smith，JE和Rohwer的佛罗里达(2011)。 造礁珊瑚Montastraea ringis和底栖藻类之间的微生物到礁垢的相互作用。 皇家学会学报B：生物科学， 279 (1733)，1655–1664。 doi：10.1098 / rspb.2011.2155 Duffin, P., & 13, J. (2020, January 13). Average number of own children per family U.S. Retrieved August 20, 2020, from https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/ Duffin，P.，＆13，J.(2020年1月13日)。 美国每个家庭的平均独生子女数于2020年8月20日从https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/检索 The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. There were 1576 unique entries for the Caribbean ranging from 1997–05–24 to 2019–08–24. Date of the dive was not taken into account, however in future iterations it would be interesting to see how these cluster centers change over time. The only transformation to the traditional k-means algorithm was including weights that correspond to the median percent cover of each substrate category. 数据是由珊瑚礁非营利组织Reef Check收集的，该组织培训志愿潜水员收集海洋数据。 1997–05–24至2019–08–24期间，加勒比海地区共有1576个独特条目。 没有考虑潜水日期，但是在将来的迭代中，观察这些聚类中心如何随时间变化会很有趣。 对传统k均值算法的唯一转换是包括权重，该权重对应于每种基材类别的中位覆盖率百分比。
Here is the code. 这是代码 。
Note: These are my findings. If you would like to contact me, leave a message here. All criticisms are welcome. 注意：这些是我的发现。 如果您想与我联系，请在此处留言。 欢迎所有批评。 翻译自: https://medium.com/data-diving/classification-of-caribbean-coral-reefs-using-k-means-51a66997a989k均值算法 二分k均值算法
展开全文
• 这是先实现k均值算法，再在这个基础上实现约束种子k均值算法k均值算法有直接调用接口实现，有用代码一步一步实现，训练数据清晰，每一个函数都有解释，是一个学习k均值算法很好的资源。
• k-means 算法 数据科学访谈 (Data Science Interviews) KMeans is one of the most common and important clustering algorithms to know for a data scientist. It is, however, often the case that experienced ...
k均值算法 二分k均值算法 数据科学访谈 (Data Science Interviews)
KMeans is one of the most common and important clustering algorithms to know for a data scientist. It is, however, often the case that experienced data scientists do not have a good grasp of this algorithm. This makes KMeans an excellent topic for interviews, to get a good grasp of the understanding of one of the most foundational machine learning algorithm. 对于数据科学家而言，KMeans是最常见且最重要的聚类算法之一。 但是，通常情况下，经验丰富的数据科学家对这种算法不太了解。 这使KMeans成为面试的绝佳话题，可以很好地理解最基础的机器学习算法之一。
There are a lot of questions that can touched-on when discussing the topic: 讨论该主题时，有很多问题可以涉及：
Description of the Algorithm 算法说明 Big O Complexity & Optimization 大O复杂度和优化 Application of the algorithm 算法的应用 Comparison with other clustering algorithms 与其他聚类算法的比较 Advantages / Disadvantage of using K-Means 使用K均值的优点/缺点  算法说明 (Description of Algorithm)
Describing the inner working of the K-Means algorithm is typically the first step in an interview questions centered around clustering. It shows the interviewer whether you have grasped how the algorithm works. 描述K-Means算法的内部工作通常是围绕聚类的访谈问题的第一步。 它向面试官显示您是否已掌握算法的工作原理。
It might sound fine just to apply a KMeans().fit() and let the library handle all the algorithm work. Still, in case you need to debug some behavior or understand if using KMeans would be fit for purpose, it starts with having a sound understanding of how an algorithm works. 仅应用KMeans().fit()并让库处理所有算法工作，这听起来似乎不错。 尽管如此，如果您需要调试某些行为或了解使用KMeans是否适合其目的，则首先应充分了解算法的工作原理。
高级说明 (High-Level Description)
There are different aspects of K-means that are worth mentioning when describing the algorithm. The first one being that it is an unsupervised learning algorithm, aiming to group “records” based on their distances to a fixed number (i.e., k) of “centroids.” Centroids being defined as the means of the K-clusters. 描述算法时，K均值的不同方面值得一提。 第一个是它是一种无监督的学习算法，旨在根据记录与固定数量(即k)的“质心”之间的距离对“记录”进行分组。 质心定义为K-簇的均值。
内部运作 (Inner workings)
Besides the high-level description provided above, it is also essential to be able to walk an interviewer through the inner workings of the algorithm. That is from initialization, to the actual processing and the stop conditions. 除了上面提供的高级描述之外，还必须能够引导访问者了解算法的内部原理。 那就是从初始化到实际的处理以及停止条件。
Initialization: It is important to discuss that the initialization method determines the initial clusters’ means. It would be expected from this point of view, to at least mention the problem of initialization, how it can lead to different cluster being created, the impact on the time it takes to obtain the different clusters, etc.. One of the key initialization method to mention is the “Forgy” initialization method. 初始化：讨论初始化方法确定初始簇的均值很重要。 从这一角度来看，可以期望至少提及初始化问题，如何导致创建不同的集群，对获取不同集群所需时间的影响等。初始化的关键之一提及的方法是“ Forgy”初始化方法。
Processing: I would expect a discussion on how the algorithm traverses the points, and iteratively assigns them to the nearest cluster. Great candidates would be able to go beyond that description and into a discussion over KMeans, minimizing the within-cluster variance and discuss Lloyd’s algorithm. 处理：我希望能对算法如何遍历这些点并将其迭代地分配给最近的簇进行讨论。 优秀的候选人将能够超越该描述而进入有关KMeans的讨论，从而最大程度地降低集群内部差异并讨论Lloyd算法。
Stop condition: The stop conditions for the algorithm needs to be mentioned. The typical stop conditions for the algorithm are usually based on the following 停止条件：需要提及算法的停止条件。 该算法的典型停止条件通常基于以下条件
(stability) Centroids of new cluster do not change (稳定性)新集群的质心不变 (convergence) points stay in the same cluster (收敛)点保持在同一群集中 (cap) Maximum number of iterations has been reached (上限)已达到最大迭代次数 Stop conditions are quite important to the algorithm, and I would expect a candidate, to at least mention the stability or convergence and the cap conditions. Another key point to highlight going through these stop conditions is articulating the importance of having a cap implemented (see Big O complexity below). 停止条件对算法非常重要，我希望有一个候选人至少提及稳定性或收敛性和上限条件。 突出显示通过这些停止条件的另一个关键点是阐明实施上限的重要性(请参见下面的“大O”复杂性)。
大O复杂度 (Big O Complexity)
It is important for candidates to understand the complexity of the algorithm, both from a training and prediction standpoint, and how the different variables impact the performance of the algorithm. This is why questions around the complexity of the KMeans are often asked, when deep-diving into the algorithm: 对于候选人而言，从训练和预测的角度了解算法的复杂性以及不同的变量如何影响算法的性能非常重要。 这就是为什么在深入研究算法时经常会问有关KMeans复杂性的问题：
培训BigO (Training BigO)
From a training perspective, the complexity is (if using Lloyds’ algorithm): 从训练的角度来看，复杂度是(如果使用劳埃德算法) ：
BigO(KmeansTraining) = K *I * N * M BigO(KmeansTraining) = K *I * N * M  Where: 哪里：
K: Number of clusters K：簇数 I: The number of iterations I：迭代次数 N: The sample size N：样本量 M: The number of variables M：变量数 As it is possible to see, there can be a significant impact on capping the number of iterations. 可以看到，对限制迭代次数可能会产生重大影响。
预测BigO (Prediction BigO)
K-means predictions have a different complexity: K均值预测具有不同的复杂度：
BigO(KmeansPrediction) = K * N * M BigO(KmeansPrediction) = K * N * M  KMeans prediction, only needs to have computed for each record, the distance (which complexity is based on the number of variables) to each cluster, and assign it to the nearest one. KMeans预测只需为每条记录计算到每个聚类的距离(其复杂度基于变量的数量)，然后将其分配给最接近的一个。
扩展KMeans (Scaling KMeans)
During an interview, you might be asked if there are any ways to make KMeans perform faster on larger datasets. This should be a trigger to discuss mini-batch KMeans. 在采访中，可能会询问您是否有任何方法可以使KMeans在较大的数据集上更快地执行。 这应该是讨论迷你批处理KMeans的触发器。
Mini batch KMeans is an alternative to the traditional KMeans, that provides better performance for training on larger datasets. It leverages mini-batches of data, taken at random to update the clusters’ mean with a decreasing learning rate. For each data bach, the points are all first assigned to a cluster and then means are then re-calculated. The clusters’ centers are recalculated using gradient descent. The algorithm provides a faster convergence than the typical KMeans, but with a slightly different cluster output. 迷你批处理KMeans是传统KMeans的替代方法，可为较大数据集的训练提供更好的性能。 它利用随机获取的小批量数据，以降低的学习率来更新聚类的均值。 对于每个数据bach，首先将所有点都分配给一个聚类，然后重新计算均值。 使用梯度下降重新计算群集的中心。 该算法提供了比典型KMeans更快的收敛速度，但是群集输出略有不同。
应用K均值 (Applying K-means)
用例 (Use cases)
There are multiple use cases for leveraging the K-Means algorithm, from offering recommendations or offering some level of personalization on a website, to deep diving into potential cluster definitions from customer analysis and targeting. 有多种使用K-Means算法的用例，从在网站上提供建议或提供某种程度的个性化 ，到从客户分析和定位中深入研究潜在的集群定义。
Understanding what is expected from applying k-means also dictates how you should be applying it. Do you need to find the number of optimal number of K? or an arbitrary number given by the marketing department. Do you need to have interpretable variables, or is this something that would be better left for an algorithm to decide? 了解应用k均值的期望值还指示您应如何应用它。 您是否需要找到最佳数量的K？ 或市场部门提供的任意数字。 您是否需要具有可解释的变量，还是最好由算法决定？
It is important to understand how particular K-Means use cases can impact its’ implementations. Implementation specific questions, usually come up as follow-ups, such as: 了解特定的K-Means用例如何影响其实施非常重要。 实施方面的特定问题，通常是后续问题，例如：
Let say, the marketing department asked you to providse them with user segments for an upcoming marketing campaign. What features would you look to feed into your model and what transformations woud you apply to provide them with these segments? 假设营销部门要求您为他们提供即将进行的营销活动的用户群。 您希望将哪些功能引入模型中，并希望应用哪些转换为它们提供这些细分？ This type of followup question is very open-ended, can require further clarification, but does usually provide insights into whether or not the candidate understands how the results of the segmentation might be used. 这种类型的跟进问题是开放式的，可能需要进一步澄清，但是通常会提供有关候选人是否了解如何使用细分结果的见解。
求最佳K (Finding the optimal K)
Understanding how to determine the number of K to use for KMeans often comes up as a followup question in the application of the algorithm. 理解如何确定用于KMeans的K数通常是算法应用中的后续问题。
There are different techniques to identify the optimal number of clusters to use with KMeans. Three different methods are used the Elbow method, the Silhouette method, and Gap statistics. 有多种技术可以确定与KMeans一起使用的最佳群集数。 肘部，轮廓法和间隙统计使用了三种不同的方法。
The Elbow method: is all about finding the point of inflection on a graph of % of variance explained to the number of K. Elbow方法：都是关于在解释了K数的方差百分比图上找到拐点。
Silhouette method: The silhouette method, involves calculating for each point, a similarity/dissimilarity score between their assigned cluster, and the next best (i.e., nearest) cluster. 轮廓法：轮廓法涉及为每个点计算其分配的聚类和次佳(即最接近)的聚类之间的相似度/不相似度得分。
Gap statistics: The goal of the gap statistic is to compare the cluster assignments on the actual dataset against some randomly generated reference datasets. This comparison is done through the calculation of the intracluster variation, using the log of the sum of the pairwise distance between the clusters’ points. Large gap statistics indicates that the cluster obtained on observed data, are very different from those obtained from the randomly generated reference data. 差距统计：差距统计的目标是将实际数据集上的集群分配与一些随机生成的参考数据集进行比较。 通过使用群集点之间成对距离之和的对数，通过计算群集内变化来完成此比较。 大的间隙统计数据表明，根据观测数据获得的聚类与根据随机生成的参考数据获得的聚类有很大差异。
输入变量 (Input variables)
When applying KMeans, it is crucial to understand what kind of data can be fed to the algorithm. 应用KMeans时，至关重要的是要了解可以将哪种数据馈送到该算法。
For each user on our video streaming platform, you have been provided with their historical content views as well as their demographic data. How do you determine what to train the model on? 对于我们视频流平台上的每个用户，系统都向您提供了他们的历史内容视图以及人口统计数据。 您如何确定训练模型的依据？ It is generally an excellent way to breach into the two subtopics of variable normalization and on the number of variables. 通常，这是突破变量归一化和变量数量这两个子主题的绝佳方法。
Normalization of variables 变量归一化
In order to work correctly, KMeans typically needs to have some form of normalization done of the datasets. K-means is sensitive to both means and variance in the datasets. 为了正常工作，KMeans通常需要对数据集进行某种形式的标准化。 K均值对数据集中的均值和方差均敏感。
For numerical performing normalization using a StandardScaler is recommended, but depending on the specific cases, other techniques might be more suitable. 对于使用StandardScaler进行数值执行归一化的建议，但是根据具体情况，其他技术可能更合适。
For pure categorical data, one hot encoding would likely be preferred, but worth being careful with the number of variables it ends up producing, both from an efficiency (BigO) standpoint and for managing KMeans’ performance (see below: Number of variables). 对于纯类别数据，可能会首选一种热编码，但从效率(BigO)角度和管理KMeans的性能(请参阅下文：变量数 )的角度来看，值得谨慎对待最终产生的变量数 。
For mixed data types, it might be needed to pre-process the features beforehand. Techniques such as Principal Components Analysis (PCA) or Singular Value Decomposition (SVD) can, however, be used to transform the input data into a dataset that can be leveraged appropriately into KMeans. 对于混合数据类型，可能需要预先对功能进行预处理。 但是，可以使用诸如主成分分析(PCA)或奇异值分解(SVD)之类的技术将输入数据转换为可以适当利用到KMeans中的数据集。
Number of variables 变量数
The number of variables going into K-means has an impact on both the time/complexity it takes to train and apply the algorithm, but as well as an effect on how the algorithm behaves. 进入K均值的变量数量不仅影响训练和应用算法所需的时间/复杂度，还影响算法的行为方式。
This due to the curse of dimensionality: 这是由于维数的诅咒：
So as the dimensionality increases, more and more examples become nearest neighbors of xt, until the choice of nearest neighbor (and therefore of class) is effectively random.https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdfA large number of dimensions has a direct impact on distance-based computations, a key component of KMeans: 大量维度直接影响基于距离的计算，这是KMeans的关键组成部分：
The distances between a data point and its nearest and farthest neighbours can become equidistant in high dimensions, potentially compromising the accuracy of some distance-based analysis tools.https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238676/Dimensionality reductions methods such as PCA, or feature selection techniques are things to bring up when reaching this topic. 降维方法(例如PCA)或特征选择技术是达到此主题时需要提出的内容。
与其他算法的比较 (Comparison with other Algorithm)
Besides understanding the inner working of the KMeans algorithm, it is also important to know how it compares to other clustering algorithms. 除了了解KMeans算法的内部工作原理之外，了解与其他聚类算法的比较方式也很重要。
There is a wide range of other algorithms out there, hierarchical clustering, mean shift clustering, Gaussian mixture models (GMM), DBScan, Affinity propagation (AP), K-Medoids/ PAM, … 那里还有各种各样的其他算法，包括层次聚类，均值漂移聚类，高斯混合模型(GMM)，DBScan，亲和传播(AP)，K-Medoids / PAM，…
What other clustering methods do you know? 您还知道其他哪些聚类方法？
How does Algorithm X, compares to K-Means? 算法X与K均值相比如何？
Going through the list of algorithms, it is essential to at least know the different types of clustering methods: centroid/medoids (e.g., KMeans), hierarchical, density-based (e.g., MeanShift, DBSCAN). distribution-based (e.g., GMM) and Affinity propagation (Affinity Propagation)… 遍历算法列表，至少要了解不同类型的聚类方法至关重要：质心/类聚体(例如，KMeans)，分层的，基于密度的(例如，MeanShift，DBSCAN)。 基于分布的(例如GMM)和相似性传播(相似性传播)…
When doing these types of comparisons, it is important to list at least some K-Means alternatives, and showcasing some high-level knowledge of what the algorithm does and how it compares to K-Means. 在进行这些类型的比较时，重要的是至少列出一些K-Means备选方案，并展示有关该算法的功能以及与K-Means进行比较的一些高级知识。
You might be asked at this point to deep dive into one of the algorithms you previously mentioned, so be prepared to be able to explain how some of the other algorithm works, list their strengths and weakness compared to K-means and describe how the inner working of the algorithm differs from K-Means. 此时可能会要求您深入研究您先前提到的一种算法，因此准备好能够解释其他一些算法的工作原理，列出它们与K均值相比的优缺点，并描述内部该算法的工作方式不同于K-Means。
Going through any algorithms, it is important to know their advantage and disadvantage, it is not unsurprising that this is often asked during interviews. 遍历任何算法，重要的是要知道它们的优缺点，在面试中经常问到这一点并不奇怪。
Some of the key advantages of KMeans are: KMeans的一些主要优点是：
It is simple to implement 实施简单 Computational efficiency, both for training and prediction 训练和预测的计算效率 Guaranteed convergence 保证融合 While some of its disadvantages are: 虽然它的一些缺点是：
The number of clusters needs to be provided as an input variable. 群集的数量需要作为输入变量提供。 It is very dependent on the initialization process. 它非常依赖于初始化过程。 KMeans is good at clustering when dealing with spherical cluster shapes, but it performs poorly when dealing with more complicated shapes. KMeans在处理球形簇形状时擅长聚类，但是在处理更复杂的形状时性能较差。 Due to leveraging the Euclidian distance function, it is sensitive to outliers. 由于利用了欧几里得距离功能，因此对异常值很敏感。 Need pre-processing on mix data as it can’t take advantages of alternative distance function such as Gower’s distance 需要对混合数据进行预处理，因为它无法利用替代距离函数(例如高尔距离)的优势 More from me on Hacking Analytics: 我提供的有关Hacking Analytics的更多信息：
SQL interview Questions For Aspiring Data Scientist — The Histogram 面向有抱负的数据科学家SQL采访问题-直方图 Python Screening Interview questions for DataScientists DataScientists的Python筛选面试问题 ON Applying K-means Personalization to a website 关于将K-means个性化应用于网站 ON Coding K-Means in Vanilla Python 在香草Python中编码K均值 How to Learn Data science from scratch 如何从零开始学习数据科学 翻译自: https://medium.com/analytics-and-data/how-to-ace-the-k-means-algorithm-interview-questions-afe346f8fc09k均值算法 二分k均值算法
展开全文
• 通过比较目标函数、聚类原型模式P(0)的初始化方法、划分矩阵U和聚类原型p的更新方法等4个方面,得出k均值算法和硬C-均值算法的区别。
• K 均值算法算法就是一种解决聚类问题的算法，它包含两个步骤： 给聚类中心分配点：计算所有的训练样例，把他分配到距离某个聚类中心最短的的那聚类里。 移动聚类中心：新的聚类中心移动到这个聚类所有的点的...
• K均值算法是无监督学习，属于聚类算法 区别二：具体流程 K近邻算法是找与数据最近的K个点，如果多数属于某类，则归为该类，达到分类的效果 K均值算法是将所有数据归到距其最近的中心点，并不断迭代中心点以达到聚类...
• 竞争K均值算法中的权重
• K均值算法程序源代码，K均值算法是模式识别里比较基础的算法，需要掌握
• k均值算法c++语言实现代码
• K均值算法的简介
• matlab实现k均值算法，进行三维数据的模式分类
• 关于k均值算法c++代码 控制台程序对图像进行聚类
• k-means kmeans k均值 算法 用java实现的k均值算法 200行左右 可以通过文件读取空间二维点坐标，聚类结果将格式化输出到文本，利用excel就可以实现可视化
• 基于k均值算法的快速密度聚类策略
• K均值算法对一系列随机的点进行分类，并作图，画出其聚类中心，并用不同颜色标注不同类别的点。matla编写。
• k均值算法 简单的说，k均值算法是将一些坐标点，按距离划分，聚为k类（k为先设置的值）。例：对以下点的坐标进行聚类 坐标点的数量很多，可以通过k均值算法进行聚类。k均值算法的思路是，先初始化k个类心，将所有...

...