  • 2021-04-03 22:53:05

    Implementing the K-Means Clustering Algorithm in Python

    For a quick, accessible introduction to the K-Means clustering algorithm, see this post: k均值聚类算法考试例题_K-Means聚类算法 by weixin_39789792.
    Thanks to that blogger.

    This post is only meant as my own notes; if it infringes on anything, contact me and I will take it down.

    The code in detail

    Note:
    Do not choose centroid coordinates that are wildly off. For example, if one of the three centroids lies far away from every sample point, its list sse_k1 (or sse_k2, sse_k3) receives no data, so len(sse_kx) == 0 (x = 1, 2, 3) and the mean update fails with ZeroDivisionError: division by zero.

    import pylab as pl
    
    
    def square_Euclid(x, y):
        """
        Squared Euclidean distance between two points.
        For two points in the plane, (x1,y1) and (x2,y2), the Euclidean
        distance is sqrt((x1-x2)^2 + (y1-y2)^2); in 3-D space it is
        sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2). The square root is
        omitted here because we only compare distances.
        """
        return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    
    
    # Load the sample points
    num_x = []
    num_y = []
    fl = open('data01.txt')  # the point data is stored in data01.txt
    for line in fl.readlines():
        curLine = line.strip().split()
    
        num_x.append(float(curLine[0]))
        num_y.append(float(curLine[1]))
    
    # Initialize three centroids; clustering yields three groups
    k1 = [-1.7, 1]
    k2 = [-0.5, 0.5]
    k3 = [1, 2]
    
    # Define the three clusters
    sse_k1 = []
    sse_k2 = []
    sse_k3 = []
    
    n = len(num_x)
    while True:
        sse_k1 = []
        sse_k2 = []
        sse_k3 = []
        for i in range(n):
            square_E1 = square_Euclid(k1, [num_x[i], num_y[i]])
            square_E2 = square_Euclid(k2, [num_x[i], num_y[i]])
            square_E3 = square_Euclid(k3, [num_x[i], num_y[i]])
    
            num_min = min(square_E1, square_E2, square_E3)
    
            # assign the point to its nearest centroid
            if num_min == square_E1:
                sse_k1.append(i)
            elif num_min == square_E2:
                sse_k2.append(i)
            elif num_min == square_E3:
                sse_k3.append(i)
    
        # Average the coordinates to determine the new centroids (centroid update)
        k1_x = sum([num_x[i] for i in sse_k1]) / len(sse_k1)
        k1_y = sum([num_y[i] for i in sse_k1]) / len(sse_k1)
    
        k2_x = sum([num_x[i] for i in sse_k2]) / len(sse_k2)
        k2_y = sum([num_y[i] for i in sse_k2]) / len(sse_k2)
    
        k3_x = sum([num_x[i] for i in sse_k3]) / len(sse_k3)
        k3_y = sum([num_y[i] for i in sse_k3]) / len(sse_k3)
    
        # If any centroid moved, update the centroids; if none of the three
        # changed, the clustering has converged and the loop ends
        if k1 != [k1_x, k1_y] or k2 != [k2_x, k2_y] or k3 != [k3_x, k3_y]:
            k1 = [k1_x, k1_y]
            k2 = [k2_x, k2_y]
            k3 = [k3_x, k3_y]
        else:
            break
    
    # Collect the coordinates of each cluster's points
    kv1_x = [num_x[i] for i in sse_k1]
    kv1_y = [num_y[i] for i in sse_k1]
    
    kv2_x = [num_x[i] for i in sse_k2]
    kv2_y = [num_y[i] for i in sse_k2]
    
    kv3_x = [num_x[i] for i in sse_k3]
    kv3_y = [num_y[i] for i in sse_k3]
    
    pl.plot(kv1_x, kv1_y, '+')
    pl.plot(kv2_x, kv2_y, '.')
    pl.plot(kv3_x, kv3_y, '^')
    
    # Axis limits depend on the range of the sample data
    pl.xlim(-2, 2.5)
    pl.ylim(-1, 2.5)
    pl.show()
    
    

    Code adapted from: k均值聚类算法实例 by weixin_38553466
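
    To guard against the empty-cluster pitfall flagged in the note above: if a centroid's list comes back empty, the mean update divides by zero. A minimal sketch of a fix (safe_mean is a hypothetical helper, not part of the original code) keeps the previous centroid whenever its cluster is empty:

    def safe_mean(idx, values, old):
        # idx: sample indices assigned to this cluster; old: previous coordinate
        # fall back to the old coordinate when the cluster is empty
        return sum(values[i] for i in idx) / len(idx) if idx else old
    
    
    # e.g. replacing the k1 update inside the loop:
    k1_x = safe_mean(sse_k1, num_x, k1[0])
    k1_y = safe_mean(sse_k1, num_y, k1[1])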

    Results

    (Figure: K-Means clustering result)

    Sample data used

    Create a text file named data01.txt in the same directory as the .py file and paste in the following data:

    -0.113523119722435 0.305566317246824
    -0.0363280337139859 0.110677855003451
    0.113494507689856 0.285179031884109
    -0.00383816850385252 0.670778674827114
    -0.180363593046200 0.394837771823933
    0.295367728543231 -0.355182535548782
    -0.0296566442720039 0.228722511660635
    -0.0930361342677474 0.154377592930645
    -0.159633545380951 1.03286272700827
    -0.609370592744484 0.0100246598182464
    0.164875043625935 0.107920610145671
    -0.649661855983650 -0.0264148180075531
    -0.0853301136043781 0.194929464533097
    -0.0869727732803104 -0.166019322253363
    0.267258237150858 0.318664851557507
    -0.876324515282669 0.578412914115882
    0.290320777421500 0.0269704554131184
    -0.164202641138215 -0.0216061750617156
    -0.408886348765266 -0.178183406834480
    -0.00275690297052195 -0.149757266490323
    -0.230897603220972 0.202729565016547
    -0.289768125501838 0.299373894453753
    0.565273947293806 -0.112025265465832
    -0.259434375270518 -0.183038062076565
    -0.0622055869197436 0.0178584309105331
    -0.281488166956539 -0.282493439656289
    0.288003999490542 0.354832178282382
    -0.00387861254715821 0.245338598261617
    0.0230259610960932 0.304367839506965
    0.297069520513791 0.398694925851779
    0.213528795047459 -0.0341268311839215
    0.248545070529365 -0.182513241920946
    -0.674431824833610 0.166219624024427
    0.0695478578554150 0.364281641067673
    1.52144323033782 1.56356334395462
    1.54901744911605 1.44082824131763
    1.72628026225810 0.999267392962595
    1.34339405843162 1.54435051334828
    1.63076888391605 0.822969713727122
    1.24625402720513 1.50291563943267
    1.49966193305128 1.43962200220279
    0.806148334745612 1.59798616598320
    1.73765675194197 0.801038214866100
    0.688725193167526 1.18560461303177
    1.31503430771996 1.25566460922217
    1.14051881393761 1.28173391148891
    0.883497444350820 1.52712829138676
    1.35619761199096 1.47157896393621
    -1.41400896645106 1.03490557492282
    -1.46921827418174 0.691733912712829
    -1.06733046906236 0.945293131396786
    -0.789899047908273 1.04583303354796
    -0.922550939191143 1.39310184834662
    -0.918965347657051 1.44432139992464
    -1.03616345036068 1.00166612828372
    -1.07715160762591 1.51189230738663
    -1.01283275702248 1.46105578965393
    -1.48079534886488 1.21031313607727
    -0.986518252032434 0.949195019118798
    -1.62901492888985 1.53208781532487
    -1.05432664597088 1.20897843449092
    -1.51323198856773 0.929507861004623
    -1.55689740607725 1.32978015955565
    -1.39341270591838 1.41557221811715
    -1.66195228414799 0.787792125905413
    -1.24494832523794 1.84020927229746
    -0.898778729417616 0.570410077321060
    -1.32885894876685 0.732892764435160
    -1.09324537986321 1.63706883409855
    -1.19875924554585 1.35282905121539
    -0.866788557380931 1.11436578620945
    -1.30006378262166 1.25366700524127
    -1.15735442373393 1.48126320709162
    -0.469640188642725 0.975507100878317
    -0.887529056287694 1.54350983044641
    -1.54530712190787 1.47051092229069
    -0.895890659992745 1.23572220775434
    -1.54226615688700 1.31046627190501
    -1.24686714416393 1.05116769432966
    -1.18900045601094 1.35740905869805
    -1.65786095402519 1.03723338930851
    -1.37644334323617 1.08018136201292
    -1.00479602718570 0.921073237322932
    -0.958390570797860 1.56536899409517
    -0.761574879786032 1.24176101803965
    -1.56925161923031 1.04223861195863
    -0.979085655811513 1.46432198217887
    -1.14713536403328 1.08846006315455
    -0.853944089636229 1.39103904734476

    More related content
  • A simple K-means analysis of points in the plane, using Euclidean distance and displayed with pylab. Code excerpt: import pylab as pl  # calc Euclid square  def calc_e_squire(a, b): return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2  # init the ...
  • Clusters grayscale images with the K-means algorithm, storing and processing the data in four-dimensional arrays; slow to run
  • A Python implementation of k-means clustering

    2018-10-18 09:30:23
    A Python implementation of the k-means algorithm that displays the clustering result and the number of iterations; handy for beginners.
  • k-means clustering in Python. This post was originally published here. Clustering is the grouping of objects together so that objects belonging in the same group (cluster) are more similar to each other than those in other groups (clusters). In this intro cluster analysis tutorial, we’ll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real ...

    k-means clustering in Python

    This post was originally published here

    Clustering is the grouping of objects together so that objects belonging in the same group (cluster) are more similar to each other than those in other groups (clusters). In this intro cluster analysis tutorial, we’ll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real dataset.

    The Dataset

    For the clustering problem, we will use the famous Zachary’s Karate Club dataset. The story behind the data set is quite simple: There was a Karate Club that had an administrator “John A” and an instructor “Mr. Hi” (both pseudonyms). Then a conflict arose between them, causing the students (Nodes) to split into two groups. One that followed John and one that followed Mr. Hi.

    (Figure: visualization of the Karate Club clustering. Source: Wikipedia)

    Getting Started with Clustering in Python

    But enough with the introductory talk, let's get to the main reason you are here: the code itself. First of all, you need to install both the scikit-learn and networkx libraries to complete this tutorial. If you don't know how, the links above should help you. Also, feel free to follow along by grabbing the source code for this tutorial over on Github.

    Usually, the datasets that we want to examine are available in text form (JSON, Excel, simple txt file, etc.) but in our case, networkx provide it for us. Also, to compare our algorithms, we want the truth about the members (who followed whom) which unfortunately is not provided. But with these two lines of code, you will be able to load the data and store the truth (from now on we will refer it as ground truth):

    # Load and Store both data and groundtruth of Zachary's Karate Club
    G = nx.karate_club_graph()
    groundTruth = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]

    The final step of the data preprocessing, is to transform the graph into a matrix (desirable input for our algorithms). This is also quite simple:

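    (The code block for this step did not survive; the following is a minimal sketch, assuming — as the later snippets suggest — that edgeMat holds the graph's adjacency matrix.)

    import networkx as nx
    
    # Adjacency matrix of the club graph: one row/column per member
    edgeMat = nx.to_numpy_array(G)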

    Before we get going with the Clustering Techniques, I would like you to get a visualization on our data. So, let’s compile a simple function to do that:

    def drawCommunities(G, partition, pos):
        # G is graph in networkx form
        # Partition is a dict containing info on clusters
        # pos is based on the networkx spring layout (nx.spring_layout(G))
    
        # For separating communities colors
        dictList = defaultdict(list)
        nodelist = []
        for node, com in partition.items():
            dictList[com].append(node)
    
        # Get size of Communities
        size = len(set(partition.values()))
    
        # For loop to assign communities colors
        for i in range(size):
    
            amplifier = i % 3
            multi = (i / 3) * 0.3
    
            red = green = blue = 0
    
            if amplifier == 0:
                red = 0.1 + multi
            elif amplifier == 1:
                green = 0.1 + multi
            else:
                blue = 0.1 + multi
    
            # Draw Nodes
            nx.draw_networkx_nodes(G, pos,
                                   nodelist=dictList[i],
                                   node_color=[0.0 + red, 0.0 + green, 0.0 + blue],
                                   node_size=500,
                                   alpha=0.8)
    
        # Draw edges and final plot
        plt.title("Zachary's Karate Club")
        nx.draw_networkx_edges(G, pos, alpha=0.5)

    What that function does is simply extract the number of clusters in our result and then assign a different color to each of them (up to 10 colors is fine for now) before plotting them.

    (Figure: Zachary's Karate Club nodes colored by cluster)

    Clustering Algorithms

    Some clustering algorithms will cluster your data quite nicely and others will end up failing to do so. That is one of the main reasons why clustering is such a difficult problem. But don’t worry, we won’t let you drown in an ocean of choices. We’ll go through a few algorithms that are known to perform very well.

    K-Means Clustering

    Source: github.com/nitoyon/tech.nitoyon.com

    K-means is considered by many the gold standard when it comes to clustering due to its simplicity and performance, and it's the first one we'll try out. When you have no idea at all what algorithm to use, K-means is usually the first choice. Bear in mind that K-means might under-perform sometimes due to its concept: it assumes spherical clusters that are separable so that the mean value converges towards the cluster center. To simply construct and train a K-means model, use the following lines:

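    (The original code block is missing here; modeled on the agglomerative snippet below, a plausible sketch — with kClusters = 2 assumed for the two factions, and results assumed to collect each algorithm's labels — is:)

    from sklearn import cluster
    
    # K-Means Clustering Model (sketch)
    kClusters = 2  # assumed: one cluster per faction
    results = []   # labels from each algorithm, for the later comparison
    
    kmeans = cluster.KMeans(n_clusters=kClusters, n_init=10)
    kmeans.fit(edgeMat)
    
    # Transform our data to list form and store them in results list
    results.append(list(kmeans.labels_))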

    Agglomerative Clustering

    The main idea behind agglomerative clustering is that each node starts in its own cluster, and recursively merges with the pair of clusters that minimally increases a given linkage distance. The main advantage of agglomerative clustering (and hierarchical clustering in general) is that you don’t need to specify the number of clusters. That of course, comes with a price: performance. But, in scikit’s implementation, you can specify the number of clusters to assist the algorithm’s performance. To create and train an agglomerative model use the following code:

    # Agglomerative Clustering Model
    agglomerative = cluster.AgglomerativeClustering(n_clusters=kClusters, linkage="ward")
    agglomerative.fit(edgeMat)
    
    # Transform our data to list form and store them in results list
    results.append(list(agglomerative.labels_))

    Spectral

    The Spectral clustering technique applies clustering to a projection of the normalized Laplacian. When it comes to image clustering, spectral clustering works quite well. See the next few lines of Python for all the magic:

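    (This code block is also missing; a sketch in the same style, where affinity="precomputed" is an assumption that treats the adjacency matrix edgeMat as the affinity matrix:)

    # Spectral Clustering Model (sketch)
    spectral = cluster.SpectralClustering(n_clusters=kClusters, affinity="precomputed")
    spectral.fit(edgeMat)
    
    # Transform our data to list form and store them in results list
    results.append(list(spectral.labels_))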

    Affinity Propagation

    Well, this one is a bit different. Unlike the previous algorithms, AF does not require the number of clusters to be determined before running. AF performs really well on several computer vision and biology problems, such as clustering pictures of human faces and identifying regulated transcripts:

    # Affinity Propagation Clustering Model
    affinity = cluster.affinity_propagation(S=edgeMat, max_iter=200, damping=0.6)
    
    # Transform our data to list form and store them in results list
    results.append(list(affinity[1]))

    Metrics & Plotting

    Well, it is time to choose which algorithm is more suitable for our data. A simple visualization of the result might work on small datasets, but imagine a graph with one thousand, or even ten thousand, nodes. That would be slightly chaotic for the human eye. So, let me show how to calculate the Adjusted Rand Score (ARS) and the Normalized Mutual Information (NMI):

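    (The scoring snippet is missing as well; a sketch using scikit-learn's metrics, producing the nmiResults and arsResults lists that the plotting code below expects:)

    from sklearn import metrics
    
    # Score every stored clustering against the ground truth
    nmiResults = [metrics.normalized_mutual_info_score(groundTruth, labels)
                  for labels in results]
    arsResults = [metrics.adjusted_rand_score(groundTruth, labels)
                  for labels in results]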

    If you’re unfamiliar with these metrics, here’s a quick explanation:

    Normalized Mutual Information (NMI)

    Mutual Information of two random variables is a measure of the mutual dependence between the two variables. Normalized Mutual Information is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In other words, 0 means dissimilar and 1 means perfect match.

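    In symbols, one common normalization (scikit-learn's current default divides the mutual information by the arithmetic mean of the entropies) is:

    $$ \mathrm{NMI}(U, V) = \frac{I(U; V)}{\tfrac{1}{2}\left(H(U) + H(V)\right)} $$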

    Adjusted Rand Score (ARS)

    Adjusted Rand Score on the other hand, computes a similarity measure between two clusters by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusters. If that’s a little weird to think about, have in mind that, for now, 0 is the lowest similarity and 1 is the highest.

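    In symbols, it is the Rand Index (RI) adjusted for chance:

    $$ \mathrm{ARI} = \frac{RI - E[RI]}{\max(RI) - E[RI]} $$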

    So, to get a combination of these metrics (the NMI and ARS), we simply calculate the average value of their sum. And remember, the higher the number, the better the result.

    Below, I have plotted the score evaluation so we can get a better understanding of our results. We could plot them in many ways, as points, as a straight line, but I think a bar chart is the better choice for our case. To do so, just use the following code:

    # Code for plotting results
    
    # Average of NMI and ARS
    y = [sum(x) / 2 for x in zip(nmiResults, arsResults)]
    
    xlabels = ['Spectral', 'Agglomerative', 'Kmeans', 'Affinity Propagation']
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    # Set parameters for plotting
    ind = np.arange(len(y))
    width = 0.35
    
    # Create barchart and set the axis limits and titles
    ax.bar(ind, y, width, color='blue', error_kw=dict(elinewidth=2, ecolor='red'))
    ax.set_xlim(-width, len(ind) + width)
    ax.set_ylim(0, 2)
    ax.set_ylabel('Average Score (NMI, ARS)')
    ax.set_title('Score Evaluation')
    
    # Add the xlabels to the chart
    ax.set_xticks(ind + width / 2)
    xtickNames = ax.set_xticklabels(xlabels)
    plt.setp(xtickNames, fontsize=12)
    
    # Add the actual value on top of each chart
    for i, v in enumerate(y):
        ax.text(i, v, str(round(v, 2)), color='blue', fontweight='bold')
    
    # Show the final plot
    plt.show()

    As you can see in the chart below, K-means and Agglomerative clustering have the best results for our dataset (best possible outcome). That, of course, does not mean that Spectral and AF are low-performing algorithms, just that they did not fit our data.

    (Figure: clustering score evaluation)

    Well, that’s it for this one!

    Thanks for joining me in this clustering intro. I hope you found some value in seeing how we can easily manipulate a public dataset and apply several different clustering algorithms in Python. Let me know if you have any questions in the comments below, and feel free to attach a clustering project you’ve experimented with!

    Translated from: https://www.pybloggers.com/2017/02/k-means-clustering-algorithms-quick-intro-python/

  • K-Means is a classic distance-based clustering algorithm: k is the number of clusters, and "means" refers to the mean of the data objects within a cluster (that mean describes the cluster center), which is why K-Means is also called the k-means algorithm. K-Means is a partition-based clustering algorithm that uses ...
  • The idea behind K-Means is simple: given a sample set, partition it into K clusters according to the distances between samples, keeping points within a cluster as tightly packed as possible while keeping the distance between clusters as large as possible. Use cases: a general-purpose clustering method for evenly sized clusters when the number of clusters is small ...
  • Introduces basic k-means techniques for clustering in Python, analyzing the principles and implementation tricks of basic k-means in some detail with worked examples; a useful reference for those who need it
  • 2. Write a k-means clustering program, cluster the synthetic AriGen dataset and the Iris dataset, and compute the DB index. I. The dataset (150 rows). The data follows (if it fails to run, try adding a trailing newline) 5.1 3.5 1.4 0.2 1 4.9 3 1.4 0.2 1 4.7 3...

    Task:

            1. Datasets:

            (1) Synthetic dataset AriGen: generate a two-dimensional dataset of your own with 3 linearly separable classes, containing 100, 200, and 300 samples respectively;

            (2) The Iris dataset.


            2. Write a k-means clustering program, cluster the synthetic AriGen dataset and the Iris dataset, and compute the Davies-Bouldin (DB) index.
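
    For reference, the DB index the programs below compute (the common textbook form; avg(C) is the average distance between samples inside cluster C, d_cen is the distance between cluster mean vectors, and smaller values are better) is:

    $$ \mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\operatorname{avg}(C_i) + \operatorname{avg}(C_j)}{d_{\mathrm{cen}}(\mu_i, \mu_j)} $$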

    I. The dataset (150 rows)

    The data follows (if the program fails to run, try adding a trailing newline at the end of the file):

    5.1	3.5	1.4	0.2	1
    4.9	3	1.4	0.2	1
    4.7	3.2	1.3	0.2	1
    4.6	3.1	1.5	0.2	1
    5	3.6	1.4	0.2	1
    5.4	3.9	1.7	0.4	1
    4.6	3.4	1.4	0.3	1
    5	3.4	1.5	0.2	1
    4.4	2.9	1.4	0.2	1
    4.9	3.1	1.5	0.1	1
    5.4	3.7	1.5	0.2	1
    4.8	3.4	1.6	0.2	1
    4.8	3	1.4	0.1	1
    4.3	3	1.1	0.1	1
    5.8	4	1.2	0.2	1
    5.7	4.4	1.5	0.4	1
    5.4	3.9	1.3	0.4	1
    5.1	3.5	1.4	0.3	1
    5.7	3.8	1.7	0.3	1
    5.1	3.8	1.5	0.3	1
    5.4	3.4	1.7	0.2	1
    5.1	3.7	1.5	0.4	1
    4.6	3.6	1	0.2	1
    5.1	3.3	1.7	0.5	1
    4.8	3.4	1.9	0.2	1
    5	3	1.6	0.2	1
    5	3.4	1.6	0.4	1
    5.2	3.5	1.5	0.2	1
    5.2	3.4	1.4	0.2	1
    4.7	3.2	1.6	0.2	1
    4.8	3.1	1.6	0.2	1
    5.4	3.4	1.5	0.4	1
    5.2	4.1	1.5	0.1	1
    5.5	4.2	1.4	0.2	1
    4.9	3.1	1.5	0.1	1
    5	3.2	1.2	0.2	1
    5.5	3.5	1.3	0.2	1
    4.9	3.1	1.5	0.1	1
    4.4	3	1.3	0.2	1
    5.1	3.4	1.5	0.2	1
    5	3.5	1.3	0.3	1
    4.5	2.3	1.3	0.3	1
    4.4	3.2	1.3	0.2	1
    5	3.5	1.6	0.6	1
    5.1	3.8	1.9	0.4	1
    4.8	3	1.4	0.3	1
    5.1	3.8	1.6	0.2	1
    4.6	3.2	1.4	0.2	1
    5.3	3.7	1.5	0.2	1
    5	3.3	1.4	0.2	1
    7	3.2	4.7	1.4	2
    6.4	3.2	4.5	1.5	2
    6.9	3.1	4.9	1.5	2
    5.5	2.3	4	1.3	2
    6.5	2.8	4.6	1.5	2
    5.7	2.8	4.5	1.3	2
    6.3	3.3	4.7	1.6	2
    4.9	2.4	3.3	1	2
    6.6	2.9	4.6	1.3	2
    5.2	2.7	3.9	1.4	2
    5	2	3.5	1	2
    5.9	3	4.2	1.5	2
    6	2.2	4	1	2
    6.1	2.9	4.7	1.4	2
    5.6	2.9	3.6	1.3	2
    6.7	3.1	4.4	1.4	2
    5.6	3	4.5	1.5	2
    5.8	2.7	4.1	1	2
    6.2	2.2	4.5	1.5	2
    5.6	2.5	3.9	1.1	2
    5.9	3.2	4.8	1.8	2
    6.1	2.8	4	1.3	2
    6.3	2.5	4.9	1.5	2
    6.1	2.8	4.7	1.2	2
    6.4	2.9	4.3	1.3	2
    6.6	3	4.4	1.4	2
    6.8	2.8	4.8	1.4	2
    6.7	3	5	1.7	2
    6	2.9	4.5	1.5	2
    5.7	2.6	3.5	1	2
    5.5	2.4	3.8	1.1	2
    5.5	2.4	3.7	1	2
    5.8	2.7	3.9	1.2	2
    6	2.7	5.1	1.6	2
    5.4	3	4.5	1.5	2
    6	3.4	4.5	1.6	2
    6.7	3.1	4.7	1.5	2
    6.3	2.3	4.4	1.3	2
    5.6	3	4.1	1.3	2
    5.5	2.5	4	1.3	2
    5.5	2.6	4.4	1.2	2
    6.1	3	4.6	1.4	2
    5.8	2.6	4	1.2	2
    5	2.3	3.3	1	2
    5.6	2.7	4.2	1.3	2
    5.7	3	4.2	1.2	2
    5.7	2.9	4.2	1.3	2
    6.2	2.9	4.3	1.3	2
    5.1	2.5	3	1.1	2
    5.7	2.8	4.1	1.3	2
    6.3	3.3	6	2.5	3
    5.8	2.7	5.1	1.9	3
    7.1	3	5.9	2.1	3
    6.3	2.9	5.6	1.8	3
    6.5	3	5.8	2.2	3
    7.6	3	6.6	2.1	3
    4.9	2.5	4.5	1.7	3
    7.3	2.9	6.3	1.8	3
    6.7	2.5	5.8	1.8	3
    7.2	3.6	6.1	2.5	3
    6.5	3.2	5.1	2	3
    6.4	2.7	5.3	1.9	3
    6.8	3	5.5	2.1	3
    5.7	2.5	5	2	3
    5.8	2.8	5.1	2.4	3
    6.4	3.2	5.3	2.3	3
    6.5	3	5.5	1.8	3
    7.7	3.8	6.7	2.2	3
    7.7	2.6	6.9	2.3	3
    6	2.2	5	1.5	3
    6.9	3.2	5.7	2.3	3
    5.6	2.8	4.9	2	3
    7.7	2.8	6.7	2	3
    6.3	2.7	4.9	1.8	3
    6.7	3.3	5.7	2.1	3
    7.2	3.2	6	1.8	3
    6.2	2.8	4.8	1.8	3
    6.1	3	4.9	1.8	3
    6.4	2.8	5.6	2.1	3
    7.2	3	5.8	1.6	3
    7.4	2.8	6.1	1.9	3
    7.9	3.8	6.4	2	3
    6.4	2.8	5.6	2.2	3
    6.3	2.8	5.1	1.5	3
    6.1	2.6	5.6	1.4	3
    7.7	3	6.1	2.3	3
    6.3	3.4	5.6	2.4	3
    6.4	3.1	5.5	1.8	3
    6	3	4.8	1.8	3
    6.9	3.1	5.4	2.1	3
    6.7	3.1	5.6	2.4	3
    6.9	3.1	5.1	2.3	3
    5.8	2.7	5.1	1.9	3
    6.8	3.2	5.9	2.3	3
    6.7	3.3	5.7	2.5	3
    6.7	3	5.2	2.3	3
    6.3	2.5	5	1.9	3
    6.5	3	5.2	2	3
    6.2	3.4	5.4	2.3	3
    5.9	3	5.1	1.8	3

    II. The code:

    (1) Synthetic dataset AriGen:

    import random  # randomness
    import numpy as np
    import matplotlib.pyplot as plt
    
    
    # Distance between two samples
    def dist(xy, a, b):
        d_l = 0  # running distance value
        for i in range(len(xy) - 1):  # over every attribute
            d_l += (xy[i][a] - xy[i][b]) ** 2  # sum of squares
        d_l = d_l ** 0.5  # square root
        return d_l
    
    
    # Compute avg(C)
    def avg(xy, Ci):
        sum = 0
        for i in Ci:
            for j in Ci:
                sum += dist(xy, i, j)
        return sum * 2 / (len(Ci) * (len(Ci) - 1))
    
    
    # Compute the DB index
    def DB(xy, C, u):
        sum = 0
        for i in range(len(C)):  # over every cluster
            max_C = -float('inf')
            for j in range(len(C)):
                if i == j:
                    continue
                dcen = 0
                for k in range(len(xy) - 1):  # over every attribute
                    dcen += (u[i][k] - u[j][k]) ** 2  # sum of squares
                dcen = dcen ** 0.5  # square root
                max_ = (avg(xy, C[i]) + avg(xy, C[j])) / dcen
                if max_ > max_C:
                    max_C = max_
            sum += max_C
        return sum / len(C)
    
    
    # k-means clustering into k clusters
    def jvlei_k(xy, k):
        n = len(xy[0])  # number of samples
        u = []  # mean vector of each cluster
        C = []  # the cluster partition
        # Pick k distinct samples as initial centers
        i = 0
        j = 0
        u = random.sample(range(n), k)  # (if the data is already shuffled, the first k would do)
        for i in range(k):  # for every initial point
            l = u[i]  # stash the sample index
            u[i] = []
            print('initial point', i, end='\t: ')
            for j in range(len(xy) - 1):  # over every attribute
                u[i].append(xy[j][l])
            print(u[i], xy[j + 1][l])
        while 1:
            C = []  # the cluster partition
            for i in range(k):  # k clusters
                C.append([])  # initialize C
            for i in range(n):  # over all samples
                d = float('inf')  # start at infinity
                l = 0  # which cluster the sample belongs to
                for j in range(k):  # distance to every center, keep the minimum
                    d_ = 0  # running distance value
                    for ii in range(len(xy) - 1):  # over every attribute
                        d_ += (xy[ii][i] - u[j][ii]) ** 2  # sum of squares
                    d_ = d_ ** 0.5  # square root
                    if d_ < d:  # closer
                        d = d_  # update the minimum
                        l = j  # belongs to cluster j
                C[l].append(i)  # add the sample to cluster l
            u_ = []  # the new mean vectors
            for i in range(k):  # recompute every cluster mean
                u_.append([0] * (len(xy) - 1))
                for j in C[i]:  # over every sample in the cluster
                    for ii in range(len(xy) - 1):  # over every attribute
                        u_[i][ii] += xy[ii][j]  # sum
                for ii in range(len(xy) - 1):  # over every attribute
                    u_[i][ii] /= len(C[i])  # mean
            l = 0  # accumulated change
            for i in range(k):  # over every mean vector
                for j in range(len(u[0])):  # over every attribute
                    l += (u[i][j] - u_[i][j]) ** 2  # squared movement
            if l < 0.01:
                print('DB=', DB(xy, C, u))
                return C
            else:
                u = u_[:]
    
    
    # Compare a cluster with the labels; return [majority label, its percentage]
    def panduan(yi, Ci):
        biaoqian = {}  # label -> occurrence count
        for i in Ci:
            if yi[i] in biaoqian:  # seen before
                biaoqian[yi[i]] += 1  # increment
            else:
                biaoqian[yi[i]] = 1  # create
        max_ = -1  # the label
        max_t = -1  # its count
        for i in biaoqian:  # over all labels
            if biaoqian[i] > max_t:  # find the most frequent
                max_ = i
                max_t = biaoqian[i]
        return [max_, (max_t * 100 / len(Ci))]
    
    
    # Generate the dataset =============================================
    x = []  # x attribute of the dataset
    y = []  # y attribute of the dataset
    x.extend(np.random.normal(loc=40.0, scale=10, size=100))  # 100 normal x values: mean 40, std 10
    x.extend(np.random.normal(loc=60.0, scale=10, size=200))  # 200 normal x values: mean 60, std 10
    x.extend(np.random.normal(loc=100.0, scale=10, size=300))  # 300 normal x values: mean 100, std 10
    y.extend(np.random.normal(loc=50.0, scale=10, size=100))  # 100 normal y values: mean 50, std 10
    y.extend(np.random.normal(loc=100.0, scale=10, size=200))  # 200 normal y values: mean 100, std 10
    y.extend(np.random.normal(loc=40.0, scale=10, size=300))  # 300 normal y values: mean 40, std 10
    c = [1] * 100  # class 1 labels: x mean 40, y mean 50
    c.extend([2] * 200)  # class 2 labels: x mean 60, y mean 100
    c.extend([3] * 300)  # class 3 labels: x mean 100, y mean 40
    # Shuffle into the working set =======================================
    lt = list(range(600))  # an ordered index sequence
    random.shuffle(lt)  # shuffle it
    x1 = [[], [], []]  # initialize x1
    for i in lt:  # build the shuffled dataset
        x1[0].append(x[i])  # x attribute
        x1[1].append(y[i])  # y attribute
        x1[2].append(c[i])  # class label
    # Compute the cluster partition ==========================================
    k = 3  # three clusters
    C = jvlei_k(x1, k)
    # Compute the accuracy ===========================================
    for i in range(k):
        c = panduan(x1[2], C[i])
        print('cluster', i, ': majority label', c[0], 'covers', c[1], '%')
    # Plotting ===============================================
    
    plt.figure(1)  # first figure
    for i in range(len(x1[2])):  # "color" the dataset
        if x1[2][i] == 1:
            x1[2][i] = 'r'  # class 1: red
        else:
            if x1[2][i] == 2:
                x1[2][i] = 'y'  # class 2: yellow
            else:
                x1[2][i] = 'b'  # class 3: blue
    plt.scatter(x1[0], x1[1], c=x1[2], alpha=0.5)  # scatter plot of the dataset
    plt.xlabel('x')  # x-axis label
    plt.ylabel('y')  # y-axis label
    plt.title('Initial')  # title
    # The cluster partition: ==========================================
    plt.figure(2)  # second figure
    for i in range(k):  # "color" the clusters
        for j in C[i]:
            if i == 1:
                x1[2][j] = 'r'  # cluster 1: red
            else:
                if i == 2:
                    x1[2][j] = 'y'  # cluster 2: yellow
                else:
                    x1[2][j] = 'b'  # cluster 3: blue
    plt.scatter(x1[0], x1[1], c=x1[2], alpha=0.5)  # scatter plot of the clusters
    plt.xlabel('x')  # x-axis label
    plt.ylabel('y')  # y-axis label
    plt.title('End')  # title
    plt.show()  # display
    

    Example results: (figure omitted)

    (2) The Iris dataset:

    import random  # randomness
    import numpy as np
    import matplotlib.pyplot as plt
    
    
    # Distance between two samples
    def dist(xy, a, b):
        d_l = 0  # running distance value
        for i in range(len(xy) - 1):  # over every attribute
            d_l += (xy[i][a] - xy[i][b]) ** 2  # sum of squares
        d_l = d_l ** 0.5  # square root
        return d_l
    
    
    # Compute avg(C)
    def avg(xy, Ci):
        sum = 0
        for i in Ci:
            for j in Ci:
                sum += dist(xy, i, j)
        return sum * 2 / (len(Ci) * (len(Ci) - 1))
    
    
    # Compute the DB index
    def DB(xy, C, u):
        sum = 0
        for i in range(len(C)):  # over every cluster
            max_C = -float('inf')
            for j in range(len(C)):
                if i == j:
                    continue
                dcen = 0
                for k in range(len(xy) - 1):  # over every attribute
                    dcen += (u[i][k] - u[j][k]) ** 2  # sum of squares
                dcen = dcen ** 0.5  # square root
                max_ = (avg(xy, C[i]) + avg(xy, C[j])) / dcen
                if max_ > max_C:
                    max_C = max_
            sum += max_C
        return sum / len(C)
    
    
    # k-means clustering into k clusters
    def jvlei_k(xy, k):
        n = len(xy[0])  # number of samples
        u = []  # mean vector of each cluster
        C = []  # the cluster partition
        # Pick k distinct samples as initial centers
        i = 0
        j = 0
        u = random.sample(range(n), k)  # (if the data is already shuffled, the first k would do)
        for i in range(k):  # for every initial point
            l = u[i]  # stash the sample index
            u[i] = []
            print('initial point', i, end='\t: ')
            for j in range(len(xy) - 1):  # over every attribute
                u[i].append(xy[j][l])
            print(u[i], xy[j + 1][l])
        while 1:
            C = []  # the cluster partition
            for i in range(k):  # k clusters
                C.append([])  # initialize C
            for i in range(n):  # over all samples
                d = float('inf')  # start at infinity
                l = 0  # which cluster the sample belongs to
                for j in range(k):  # distance to every center, keep the minimum
                    d_ = 0  # running distance value
                    for ii in range(len(xy) - 1):  # over every attribute
                        d_ += (xy[ii][i] - u[j][ii]) ** 2  # sum of squares
                    d_ = d_ ** 0.5  # square root
                    if d_ < d:  # closer
                        d = d_  # update the minimum
                        l = j  # belongs to cluster j
                C[l].append(i)  # add the sample to cluster l
            u_ = []  # the new mean vectors
            for i in range(k):  # recompute every cluster mean
                u_.append([0] * (len(xy) - 1))
                for j in C[i]:  # over every sample in the cluster
                    for ii in range(len(xy) - 1):  # over every attribute
                        u_[i][ii] += xy[ii][j]  # sum
                for ii in range(len(xy) - 1):  # over every attribute
                    u_[i][ii] /= len(C[i])  # mean
            l = 0  # accumulated change
            for i in range(k):  # over every mean vector
                for j in range(len(u[0])):  # over every attribute
                    l += (u[i][j] - u_[i][j]) ** 2  # squared movement
            if l < 0.01:
                print('DB=', DB(xy, C, u))
                return C
            else:
                u = u_[:]
    
    
    # Compare a cluster with the labels; return [majority label, its percentage]
    def panduan(yi, Ci):
        biaoqian = {}  # label -> occurrence count
        for i in Ci:
            if yi[i] in biaoqian:  # seen before
                biaoqian[yi[i]] += 1  # increment
            else:
                biaoqian[yi[i]] = 1  # create
        max_ = -1  # the label
        max_t = -1  # its count
        for i in biaoqian:  # over all labels
            if biaoqian[i] > max_t:  # find the most frequent
                max_ = i
                max_t = biaoqian[i]
        return [max_, (max_t * 100 / len(Ci))]
    
    
    f = open('Iris.txt', 'r')  # read the data file
    x = [[], [], [], [], []]  # flower attributes (columns 0-3) plus the species label
    x1 = [[], [], [], [], []]
    while 1:
        yihang = f.readline()  # read one line
        if len(yihang) <= 1:  # stop at the end of the file
            break
        fenkai = yihang.split('\t')  # split on \t
        for i in range(5):  # the five fields
            x[i].append(eval(fenkai[i]))  # convert to numbers and append to x
    print('dataset ===============================================')
    print(len(x[0]))
    for i in range(len(x)):
        print(x[i])
    x1 = x[:]
    # Compute the cluster partition ==========================================
    k = 3  # three clusters
    C = jvlei_k(x1, k)
    # Compute the accuracy ===========================================
    for i in range(k):
        c = panduan(x1[4], C[i])
        print('cluster', i, ': majority label', c[0], 'covers', c[1], '%')
    # Plotting ===============================================
    # Choose which three of the four dimensions to plot
    a = 0
    b = 2
    c = 3
    for i in range(len(x1[4])):  # "color" the dataset
        if x1[4][i] == 1:
            x1[4][i] = 'r'  # class 1: red
        else:
            if x1[4][i] == 2:
                x1[4][i] = 'y'  # class 2: yellow
            else:
                x1[4][i] = 'b'  # class 3: blue
    fig = plt.figure()
    m = plt.axes(projection='3d')
    m.scatter3D(x1[a], x1[b], x1[c], c=x1[4], alpha=0.8)  # 3-D scatter plot
    plt.xlabel(a)  # x-axis label
    plt.ylabel(b)  # y-axis label
    plt.title('Initial')  # title
    # The cluster partition: ==========================================
    for i in range(k):  # "color" the clusters
        for j in C[i]:
            if i == 1:
                x1[4][j] = 'r'  # cluster 1: red
            else:
                if i == 2:
                    x1[4][j] = 'y'  # cluster 2: yellow
                else:
                    x1[4][j] = 'b'  # cluster 3: blue
    fig = plt.figure()
    m = plt.axes(projection='3d')
    m.scatter3D(x1[a], x1[b], x1[c], c=x1[4], alpha=0.8)  # 3-D scatter plot
    plt.xlabel(a)  # x-axis label
    plt.ylabel(b)  # y-axis label
    plt.title('End')  # title
    plt.show()  # display
    

    Example results: (figure omitted)

  • K-Means clustering: the k-means algorithm builds k partitions from the given data samples, each partition being one cluster. It is a classic distance-based clustering algorithm that uses distance as the similarity measure (the closer two samples are, the more similar ...

    K-Means Clustering

            The k-means algorithm builds k partitions from the given data samples; each partition is one cluster.

            It is a classic distance-based clustering algorithm: distance serves as the similarity measure (the closer two samples are, the more similar they are).

    Every data sample must belong to exactly one cluster.

    Samples in the same cluster are highly similar; samples in different clusters are much less similar.

    Cluster similarity is computed from the mean of the samples in each cluster.

    Note: because the first step of the algorithm picks k arbitrary objects at random as the initial cluster centers, each initially standing for one cluster, the choice of those k initial centers has a considerable influence on the final clustering.
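
    (A common mitigation, not used in the code below, which seeds from random data points, is k-means++ seeding. A minimal scikit-learn sketch, where X is a placeholder data array:)

    import numpy as np
    from sklearn.cluster import KMeans
    
    X = np.random.rand(100, 2)  # placeholder data
    # k-means++ (scikit-learn's default init) spreads the initial centers
    # out, which reduces the sensitivity to initialization noted above
    km = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
    print(km.cluster_centers_)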

    Clustering steps:
    1) Randomly pick k centroids (center points)
    2) Assign each point to its nearest centroid (points near the same centroid form one class)
    3) Update each centroid to the mean of its assigned points
    4) Repeat steps 2-3 until the centroids stop changing or a set number of iterations is reached.

    Python implementation

    import numpy as np
    import matplotlib.pyplot as plt
    # Distance functions from scipy; Euclidean distance by default
    from scipy.spatial.distance import cdist
    # Generate clustering data directly with sklearn
    from sklearn.datasets import make_blobs
    
    
    # -------------1. Load the data ---------
    x, y = make_blobs(n_samples=100, centers=6, random_state=1234, cluster_std=0.6)
    
    #plt.figure(figsize=(6, 6))
    #plt.scatter(x[:, 0], x[:, 1], c=y)
    #plt.show()
    
    # --------------2. Implement the algorithm --------------
    class K_Means(object):
        # Initialization: n_clusters (K), iteration limit max_iter, initial centroids
        def __init__(self, n_clusters=5, max_iter=300, centroids=[]):
            self.n_clusters = n_clusters
            self.max_iter = max_iter
            self.centroids = np.array(centroids, dtype=float)
    
        # Fit the model: the k-means loop over the raw data
        def fit(self, data):
            # If no initial centroids were given, pick random points from data
            if (self.centroids.shape == (0,)):
                # n_clusters random integers in [0, number of rows), used as row indices
                self.centroids = data[np.random.randint(0, data.shape[0], self.n_clusters), :]
    
            # Start iterating
            for i in range(self.max_iter):
                # 1. Distance matrix: 100x6 (n_samples x n_clusters)
                distances = cdist(data, self.centroids)
    
                # 2. Take the class of the nearest centroid as each point's class
                c_ind = np.argmin(distances, axis=1)
    
                # 3. Average each class's points to update the centroid coordinates
                for j in range(self.n_clusters):
                    # Skip classes that never appear in c_ind
                    if j in c_ind:
                        # Mean of all points assigned to class j becomes centroid j
                        self.centroids[j] = np.mean(data[c_ind == j], axis=0)
    
        # Prediction
        def predict(self, samples):
            # Same as above: distance matrix, then the nearest centroid's class
            distances = cdist(samples, self.centroids)
            c_ind = np.argmin(distances, axis=1)
    
            return c_ind
    
    
    # Scratch check of argmin and boolean indexing (not part of the model)
    dist = np.array([[121, 221, 32, 43],
                     [121, 1, 12, 23],
                     [65, 21, 2, 43],
                     [1, 221, 32, 43],
                     [21, 11, 22, 3], ])
    c_ind = np.argmin(dist, axis=1)
    print(c_ind)
    x_new = x[0:5]
    print(x_new)
    print(c_ind == 2)
    print(x_new[c_ind == 2])
    np.mean(x_new[c_ind == 2], axis=0)
    
    # --------------3. Test ------------
    # Helper that draws one subplot
    def plotKMeans(x, y, centroids, subplot, title):
        # Pick the subplot; 121 means the first plot of a 1-row, 2-column grid
        plt.subplot(subplot)
        plt.scatter(x[:, 0], x[:, 1], c='cyan')
        # Draw the centroids
        plt.scatter(centroids[:, 0], centroids[:, 1], c=np.array(range(5)), s=100)
        plt.title(title)
    
    kmeans = K_Means(max_iter=300, centroids=[[2, 1], [2, 2], [2, 3], [2, 4], [2, 5]])
    
    plt.figure(figsize=(16, 6))
    plotKMeans(x, y, kmeans.centroids, 121, 'start')
    
    # Run the clustering
    kmeans.fit(x)
    
    plotKMeans(x, y, kmeans.centroids, 122, 'k-means')
    
    # Predict the classes of new data points
    x_new = np.array([[0, 0], [10, 7]])
    y_pred = kmeans.predict(x_new)
    
    print(kmeans.centroids)
    print(y_pred)
    
    plt.scatter(x_new[:, 0], x_new[:, 1], s=100, c='black')
    plt.show()

  • Implementing k-means clustering in Python

    2021-05-14 21:13:50
    The idea behind K-Means is simple: given a sample set, partition it into K clusters according to the distances between the samples, keeping points within a cluster as tightly packed as possible while keeping the clusters far apart. How is this computed? Expressed in notation, suppose the cluster partition is (C1,C2,...Ck
  • Python biKmeans: the bisecting k-means algorithm
  • A Python implementation of k-means clustering, plus the relationship between clustering and the EM algorithm. References cited. First, an animated gif of k-means to give an intuitive feel for the algorithm, where N=200 means there are 200 samples and different colors mark different clusters (3 colors for 3 ...
  • Unsupervised learning datasets have no output label y; common unsupervised algorithms are clustering and dimensionality reduction. Overview: people can generalize and summarize, and so can machines. Clustering lets a machine group the samples in a dataset by the nature of their features, with no labels involved. Clustering and ...
  • A detailed explanation of how k-means works, using the linked worked example and calling the k-means methods of the scikit-learn machine-learning library directly: from sklearn.cluster import KMeans import numpy as np import matplotlib.pyplot as plt x = np.array...
  • The idea of k-means clustering: first pick k cluster centers at random, group each element of the set with its nearest center to get one clustering pass, then use each class's mean as the new centers and re-cluster, iterating n times to reach the final result. Step by step: 1. Initialize the cluster centers: first randomly ...
  • A Python implementation of k-means from 《机器学习实战》 (Machine Learning in Action). My recent project is on "circuit fault analysis based on data mining"; the senior students do most of it, while I study the algorithms it uses: bisecting k-means, nearest-neighbor classification, rule-based classification ...
  • Python code for clustering problems: the k-means method
  • 1. The k-means algorithm. K-means is actually a very simple algorithm, independent of any particular implementation: it measures the similarity between samples in some way and iteratively updates the cluster centers, so it belongs to unsupervised classification. When the centers stop moving, or move less than a threshold, the samples ...
  • K均值聚类聚类算法中十分经典的算法,本人采用人工生成数据集进行试验,数据集真实分类结果为4,代码首先对真实情况进行可视化。然后进行均值聚类,聚类结果与真实情况接近。结果图片放置在文件中,大家一起学习!...
  • K-means clustering with OpenCV (Python)

    2022-03-15 22:47:22
    Covers the basic steps of k-means clustering, the k-means module, and a simple example. When the value being predicted is discrete, the task is "classification"; when it is continuous, the task is "regression". Machine learning models can also take the data in the training set ...
  • =listU: for i in range(30): for j in range(3): dis.append(math.sqrt(pow(kdataList[i][0]-listU[j][0],2)+pow(kdataList[i][1]-listU[j][1],2)))  # distance between a sample point and a mean vector  minDis = dis[0] disIndex=0 for m...
  • Could someone knowledgeable explain: when implementing k-means in Python, what are the differences between an object-oriented and a procedural design, and their pros and cons, ideally discussed in terms of this concrete implementation?
  • How the K-Means clustering algorithm works """ K-means impl, take square for example @Author: JiananYuan @Date: 2021/12/14 """ import random import matplotlib.pyplot as plt import numpy as np def check_consistent(last_...
  • Since this is an unsupervised learning algorithm, we first scatter a pile of random points on a 2-D axis and pick two centroids at random; the goal of the algorithm is to split these points into two classes based on their own coordinate features, hence the choice of two centroids. When the points ...
  • Today's topic is the K-means clustering algorithm, but first you must understand the difference between clustering and classification; many analysts are sloppy about this in day-to-day work and conflate the two, yet they differ fundamentally. Classification mines patterns from specific data in order to make judgments, e.g., Gmail's spam ...
  • k-means clustering in Python. What is K means clustering? K means clustering is the most popular and widely used unsupervised learning model. It is also called clustering because it works by ...
