  • Building a hierarchical clustering dendrogram in MATLAB

    Building a hierarchical clustering dendrogram in MATLAB

    (1) Compute the distance between every pair of elements in the dataset, using the function pdist.
    Call formats: Y=pdist(X), Y=pdist(X,'metric'), Y=pdist(X,distfun), Y=pdist(X,'minkowski',p)

    Here X is an m-by-n matrix, and metric selects the distance measure:

    metric='euclidean': Euclidean distance (default);

    metric='seuclidean': standardized Euclidean distance;

    metric='mahalanobis': Mahalanobis distance.

    distfun is a user-defined distance function (passed as a function handle), and p is the exponent used in the Minkowski distance, with default value 2. Y is a vector of length m(m-1)/2 holding the pairwise distances, ordered (1,2), (1,3), ..., (m-1,m); Y is also called the dissimilarity vector, and squareform converts it into a full square matrix.
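    For comparison, SciPy's pdist and squareform follow the same pair ordering as their MATLAB counterparts; a minimal Python sketch on made-up toy data:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])  # m = 3 points
    Y = pdist(X)          # length m(m-1)/2, ordered (1,2), (1,3), (2,3)
    print(Y)              # [3. 4. 5.]
    print(squareform(Y))  # full symmetric 3x3 distance matrix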

    (2) Group the elements into a hierarchical cluster tree, using the function linkage.
    Call formats: Z=linkage(Y), Z=linkage(Y,'method')

    Here Y is the distance vector returned by pdist, Z is the resulting hierarchical cluster tree, and method selects the algorithm:

    method='single': shortest distance (default);

    'complete': furthest distance; 'median': median distance;

    'centroid': centroid distance; 'average': average distance;

    'ward': inner squared distance (Ward's minimum-variance method).

    (3) Decide how to cut the hierarchical cluster tree into distinct clusters, using the function cluster.
    Call formats: T=cluster(Z,'cutoff',c), T=cluster(Z,'maxclust',n)

    Here Z is the hierarchical cluster tree, an (m-1)-by-3 matrix; c is a threshold, and n is the maximum number of clusters.

    'maxclust' selects clustering by the number of groups, while 'cutoff' sets the critical height that determines where cluster cuts the tree.

    MATLAB source for clustering 20 five-dimensional vectors:

    % MATLAB script:
    X=[
       20,7,12,14,22;
       18,10,23,15,16;
       10,5,13,9,7;
       4,5,4,6,9;
       4,3,2,5,7;
       16,17,18,19,20;
       23,13,14,15,16;
       17,21,29,10,30;
       44,55,66,77,33;
       23,26,24,35,56;
       44,34,23,15,16;
       67,66,56,78,90;
       70,71,16,51,61;
       31,21,79,63,91;
       65,45,54,93,67;
       81,90,70,90,43;
       42,45,69,81,98;
       99,20,13,41,50;
       13,14,51,26,71;
       51,60,93,75,38;
      ];
    Y=pdist(X);                 % pairwise distances, length m(m-1)/2
    SF=squareform(Y);           % square-form distance matrix
    Z=linkage(Y,'average');     % cluster tree with average linkage
    dendrogram(Z);              % plot the hierarchical cluster tree
    T=cluster(Z,'maxclust',3)   % assign each sample to one of 3 clusters
    

    The run produces the dendrogram and the cluster assignment T:
    [Figure: dendrogram of the 20 samples]

  • Understanding the output of scipy.cluster.hierarchy.dendrogram

    I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works... I thought I knew how it worked and I was able to use the output to reconstruct the dendrogram but it seems as if I am not understanding it anymore or there is a bug in Python 3's version of this module.

    This answer, how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy, implies that the dendrogram output dictionary gives dict_keys(['icoord', 'ivl', 'color_list', 'leaves', 'dcoord']) w/ all of the same size so you can zip them and plt.plot them to reconstruct the dendrogram.

    Seems simple enough, and I did get it to work back when I used Python 2.7.11, but once I upgraded to Python 3.5.1 my old scripts weren't giving me the same results.

    I started reworking my clusters for a very simple repeatable example and think I may have found a bug in Python 3.5.1's version of SciPy version 0.17.1-np110py35_1. Going to use the Scikit-learn datasets b/c most people have that module from the conda distribution.

    Why aren't these lining up and how come I am unable to reconstruct the dendrogram in this way?

    # Init
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()

    # Load data
    from sklearn.datasets import load_diabetes

    # Clustering
    from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
    from scipy.spatial import distance
    from fastcluster import linkage  # You can use the SciPy one too

    %matplotlib inline

    # Dataset
    A_data = load_diabetes().data
    DF_diabetes = pd.DataFrame(A_data, columns=["attr_%d" % j for j in range(A_data.shape[1])])

    # Absolute value of correlation matrix, then subtract from 1 for dissimilarity
    DF_dism = 1 - np.abs(DF_diabetes.corr())

    # Compute average linkage
    A_dist = distance.squareform(DF_dism.values)
    Z = linkage(A_dist, method="average")

    # I modded the SO code from the above answer for the plot function
    def plot_tree(D_dendro, ax):
        # Set up plotting data
        leaves = D_dendro["ivl"]
        icoord = np.array(D_dendro['icoord'])
        dcoord = np.array(D_dendro['dcoord'])
        color_list = D_dendro["color_list"]
        # Plot colors
        for leaf, xs, ys, color in zip(leaves, icoord, dcoord, color_list):
            print(leaf, xs, ys, color, sep="\t")
            plt.plot(xs, ys, color)
        # Set min/max of plots
        xmin, xmax = icoord.min(), icoord.max()
        ymin, ymax = dcoord.min(), dcoord.max()
        plt.xlim(xmin - 10, xmax + 0.1*abs(xmax))
        plt.ylim(ymin, ymax + 0.1*abs(ymax))
        # Set up ticks
        ax.set_xticks(np.arange(5, len(leaves) * 10 + 5, 10))
        ax.set_xticklabels(leaves, fontsize=10, rotation=45)
        plt.show()

    fig, ax = plt.subplots()
    D1 = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, no_plot=True)
    plot_tree(D_dendro=D1, ax=ax)

    [Figure: the reconstructed dendrogram, with the printed link data below]

    attr_1 [ 15. 15. 25. 25.] [ 0. 0.10333704 0.10333704 0. ] g

    attr_4 [ 55. 55. 65. 65.] [ 0. 0.26150727 0.26150727 0. ] r

    attr_5 [ 45. 45. 60. 60.] [ 0. 0.4917828 0.4917828 0.26150727] r

    attr_2 [ 35. 35. 52.5 52.5] [ 0. 0.59107459 0.59107459 0.4917828 ] b

    attr_8 [ 20. 20. 43.75 43.75] [ 0.10333704 0.65064998 0.65064998 0.59107459] b

    attr_6 [ 85. 85. 95. 95.] [ 0. 0.60957062 0.60957062 0. ] b

    attr_7 [ 75. 75. 90. 90.] [ 0. 0.68142114 0.68142114 0.60957062] b

    attr_0 [ 31.875 31.875 82.5 82.5 ] [ 0.65064998 0.72066112 0.72066112 0.68142114] b

    attr_3 [ 5. 5. 57.1875 57.1875] [ 0. 0.80554653 0.80554653 0.72066112] b

    Here's one w/o the labels and just the icoord values for the x-axis

    [Figure: the same plot without labels, with raw icoord values on the x-axis]

    So the colors aren't mapping correctly. It says [ 15. 15. 25. 25.] for the icoord goes with attr_1, but based on the values it looks like it goes with attr_4. Also, it doesn't go all the way to the last leaf (attr_9), and that's because the length of icoord and dcoord is one less than the number of ivl labels.

    print([len(x) for x in [leaves, icoord, dcoord, color_list]])

    #[10, 9, 9, 9]

    Solution

    icoord, dcoord and color_list describe the links, not the leaves. icoord and dcoord give the coordinates of the "arches" (i.e. upside-down U or J shapes) for each link in a plot, and color_list is the color of those arches. In a full plot, the length of icoord, etc., will be one less than the length of ivl, as you have observed.

    Don't try to line up the ivl list with the icoord, dcoord and color_list lists. They are associated with different things.
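    As a minimal sketch (not part of the original answer), the reconstruction works once the arches and the leaf labels are handled separately: draw one arch per link from icoord/dcoord/color_list, then place the ivl labels on the ticks at 5, 15, 25, ... The toy data below is made up for illustration.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    X = np.random.RandomState(0).rand(10, 4)  # toy data: 10 observations
    Z = linkage(X, method="average")
    D = dendrogram(Z, no_plot=True)

    fig, ax = plt.subplots()
    # icoord/dcoord/color_list describe the n-1 links (arches), not the leaves
    for xs, ys, color in zip(D["icoord"], D["dcoord"], D["color_list"]):
        ax.plot(xs, ys, color)
    # the n leaf labels in D["ivl"] belong on the ticks at 5, 15, 25, ...
    ax.set_xticks(np.arange(5, 10 * len(D["ivl"]), 10))
    ax.set_xticklabels(D["ivl"], rotation=45)
    plt.show()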

  • Agglomerative Clustering and Dendrograms, Explained


    Agglomerative Clustering and Dendrograms, Explained

    Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.

    • Points in the same cluster are closer to each other.

    • Points in different clusters are far apart.

    [Figure (Image by Author): sample 2-dimensional dataset]

    In the above sample 2-dimensional dataset, it is visible that the data forms 3 clusters that are far apart, with points in the same cluster close to each other.

    The intuition behind Agglomerative Clustering:

    Agglomerative Clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy.

    Steps of Agglomerative Clustering:

    1. Initially, each data point is a cluster of its own.
    2. Take the two nearest clusters and join them to form one single cluster.
    3. Repeat step 2 recursively until you obtain the desired number of clusters (a naive sketch of these steps follows this list).
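    A naive, illustrative sketch of those three steps (single linkage, cubic time; the data is made up, and this is not a production implementation):

    import numpy as np

    def naive_agglomerative(X, n_clusters):
        # Step 1: every point starts as its own cluster
        clusters = [[i] for i in range(len(X))]
        while len(clusters) > n_clusters:
            # Step 2: find the two nearest clusters (closest pair of points)
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            # Step 3: merge them and repeat until n_clusters remain
            clusters[a] += clusters.pop(b)
        return clusters

    X = np.random.RandomState(0).rand(12, 2)  # toy 2-D data
    print(naive_agglomerative(X, 3))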
    [Figures: 1st image: every data point is a cluster of its own; 2nd image: the two nearest clusters (circled in black) join to form a single cluster]

    In the above sample dataset, the 2 clusters are well separated from each other, so we stopped after obtaining 2 clusters.

    [Figure (Image by Author): sample dataset separated into 2 clusters]

    How to join two clusters to form one cluster?

    To obtain the desired number of clusters, the number of clusters needs to be reduced from the initial n clusters (n equals the total number of data points). Two clusters are combined by computing the similarity between them.

    There are several methods for computing the similarity between two clusters (a sketch mapping them to SciPy's linkage methods follows this list):

    • Distance between the two closest points in the two clusters (single linkage).

    • Distance between the two farthest points in the two clusters (complete linkage).

    • The average distance between all pairs of points in the two clusters (average linkage).

    • Distance between the centroids of the two clusters (centroid linkage).
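    The four criteria above correspond, roughly, to the method argument of SciPy's linkage: 'single' (closest points), 'complete' (farthest points), 'average' (mean pairwise distance), and 'centroid' (centroid distance). A small sketch on made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.RandomState(1).rand(8, 2)  # toy data
    for method in ("single", "complete", "average", "centroid"):
        Z = linkage(X, method=method)  # each row of Z: [cluster a, cluster b, distance, size]
        print(method, "-> first merge at distance", round(Z[0, 2], 4))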

    There are pros and cons to choosing any of the above similarity metrics.

    Implementation of Agglomerative Clustering:

    (Code by Author)
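    A minimal sketch of such an implementation, assuming scikit-learn's AgglomerativeClustering on a made-up 2-D dataset (the dataset and parameters are illustrative, not the author's):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.random.RandomState(42).rand(50, 2)  # stand-in 2-D dataset
    model = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = model.fit_predict(X)              # cluster index for each point
    print(labels[:10])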

    How to obtain the optimal number of clusters?

    The implementation of the Agglomerative Clustering algorithm accepts the desired number of clusters. There are several ways to find the optimal number of clusters, such that the population is divided into k clusters in a way that:

    Points in the same cluster are closer to each other.

    Points in different clusters are far apart.

    By observing the dendrograms, one can find the desired number of clusters.

    Dendrograms are a diagrammatic representation of the hierarchical relationship between data points. A dendrogram illustrates the arrangement of the clusters produced by the analysis and is used to inspect the output of hierarchical (agglomerative) clustering.

    Implementation of Dendrograms:

    (Code by Author)
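    A minimal sketch of plotting a dendrogram with SciPy, for the same kind of made-up 2-D data as above:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    X = np.random.RandomState(42).rand(50, 2)  # stand-in 2-D dataset
    Z = linkage(X, method="ward")  # Ward linkage on Euclidean distances
    dendrogram(Z)
    plt.ylabel("Merge distance")
    plt.show()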

    Download the sample 2-dimensional dataset from here.

    [Figures: left: the sample dataset; right: the sample dataset visualized as 3 clusters]

    For the above sample dataset, the optimal number of clusters is observed to be 3. But for high-dimensional datasets, where visualizing the data is not possible, dendrograms play an important role in finding the optimal number of clusters.

    How to find the optimal number of clusters by observing the dendrogram:

    [Figure (Image by Author): dendrogram for the above sample dataset]

    From the dendrogram plot above, find the horizontal band of maximum height that is not crossed by any horizontal merge line of the dendrogram.

    [Figures: left: separating into 2 clusters; right: separating into 3 clusters]

    Cutting the dendrogram inside that band of maximum height gives the optimal number of clusters, which is 3 here, as observed in the right part of the image above. The band of maximum height is chosen because it represents the largest Euclidean distance between successive merges, i.e., the clearest separation between the resulting clusters.
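    In code, cutting the tree at a height inside that band can be done with SciPy's fcluster; the threshold below is illustrative, not taken from the article:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    X = np.random.RandomState(42).rand(50, 2)
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=1.0, criterion="distance")  # cut at height t
    print("clusters found:", labels.max())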

    Conclusion:

    In this article, we have discussed the intuition behind the agglomerative hierarchical clustering algorithm in depth. The algorithm has some disadvantages: it is not suitable for large datasets because of its large space and time complexity, and even reading the dendrogram to find the optimal number of clusters becomes very difficult for a large dataset.

    Thank You for Reading

    Translated from: https://towardsdatascience.com/agglomerative-clustering-and-dendrograms-explained-29fc12b85f23

  • Visualizing cluster dendrograms

    Visualizing cluster dendrograms

    When clustering panel data, how can the clustering result be inspected visually? A dendrogram based on hierarchical clustering meets this need.

    Data

    2019 provincial digital-economy indicator data (already standardized)
    [Figure: preview of the dataset]

    Code

    Implemented in PyCharm

    # Imports
    import matplotlib.pyplot as plt
    import pandas as pd
    import scipy.cluster.hierarchy as shc  # hierarchical clustering
    # Keep Chinese labels from rendering as boxes
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # Load the data ('列名1', '列名2' and the label column are placeholders)
    df = pd.read_csv('E:/文件目录/kmeans3.csv', encoding='gbk')
    print(df.head())
    # Plot
    plt.figure(figsize=(16, 10), dpi=100)
    plt.title('省际数字经济聚类树状图', fontsize=22)
    dend = shc.dendrogram(shc.linkage(df[['列名1', '列名2']], method='ward'),
                          labels=df.按什么聚类的列名.values, color_threshold=100)
    plt.xticks(fontsize=12)
    plt.savefig('树状图.png')  # save the figure
    plt.show()

    Results

    The clustering result is shown in the figure below:
    [Figure: dendrogram of the provinces]

    I'm a complete beginner, so please point out anything that's wrong.

  • Drawing a cluster dendrogram in Python

    2021-02-07 13:26:08
    Drawing a cluster dendrogram in Python: import pandas as pd / import plotly.figure_factory as ff / import chart_studio.plotly as py / import chart_studio / chart_studio.tools.set_credentials_file(username="用户名", api_key='...
  • A worked example: five objects A, B, C, D, E are clustered hierarchically, with the final merge steps labeled (1, 2, 3, 4). Principle: iterate over all objects, compute the distances between points with the chosen algorithm, and at each step merge the two nearest objects into one class, until...
  • This (very, very briefly) correlates the actual pairwise distances of the samples with the distances implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, which in our case is pretty close: 0....
  • 1) Data standardization: import scipy / import scipy.cluster.hierarchy as sch / from scipy.cluster.vq import vq, kmeans, whiten / import numpy as np / import matplotlib.pylab as plt ... hierarchical clustering # generate the point-to-point...
  • Clustering algorithms: hierarchical clustering

    2018-04-30 01:13:32
    1. Prototype-based and hierarchical clustering. Prototype-based clustering assumes the cluster structure can be captured by a set of prototypes: the prototypes are initialized first and then refined iteratively. Different prototype representations and different solution...
  • Agglomerative hierarchical clustering: methods, algorithm principles, inter-cluster distance measures (single link, complete link, group average, Ward), a Python implementation, and drawing the dendrogram, with some references. Related articles: Data Mining | [Association rules] using the apyori library...
  • Hierarchical clustering

    2017-07-19 21:21:49
    The mathematical structure of hierarchical clustering: given a matrix of the objects to be clustered, we can compute the corresponding proximity matrix, which is the foundation of hierarchical clustering methods; its elements can be similarities (similarity) or dissimilarities (dissimilarity) between objects. ...
  • Hierarchical clustering AHP

    2018-06-05 11:12:11
    The similarity or distance between network nodes is usually defined from the given network topology, and single-linkage or complete-linkage hierarchical clustering is then used to organize the network nodes into a dendrogram hierarchy.
  • Clustering algorithms (4): Hierarchical clustering

    2018-11-07 17:45:47
    Judging classes from the dendrogram. 1. Hierarchical clustering: principles and categories. 1) Hierarchical methods first compute the distances between samples, each time merging the closest points into one class; then the distances between classes are recomputed and the closest classes are merged into one...
  • Hierarchical clustering in Python

    2021-05-07 20:44:31
    import plotly / from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree / z = linkage(data, method='single', metric='correlation') # method: variable...dendrogram(z, labels=data.index) # cluster dendrogram # #
  • Image clustering: hierarchical clustering

    2016-07-15 11:16:56
    Something I have been working on recently is related to this; I had originally hoped... So it is natural, for unlabeled image data that still needs to be categorized, to try clustering first. The content below is translated from chapter 6 of Jan Erik Solem's Programming Computer Vision with Python
  • Hierarchical clustering and k-means

    2020-03-26 15:28:39
    Contents: hierarchical clustering, its workflow, its pros and cons; k-means clustering, its workflow, its pros and cons. Hierarchical clustering workflow: (1) compute the pairwise distances between samples; (2) merge the two closest classes into a new class; (3) recompute the distances between the new class and all other classes...
  • Drawing a cluster-analysis dendrogram in MATLAB

    2019-08-27 13:31:48
    Working environment (bold blue text marks points needing special attention). 1. Software: Windows 7 Ultimate SP1, MATLAB R2012b 32-bit. When doing clustering in MATLAB... calling dendrogram this way generates the tree automatically, showing a 30-node dendrogram by default, like the following. If...
  • Spectral clustering and hierarchical clustering: after interning at iFlytek for a month doing some speaker-clustering work, here is a summary of the spectral clustering and hierarchical clustering I mainly used. On the hierarchical side I mainly studied agglomerative hierarchical clustering and the BIRCH method; the main reference blogs include [ BIRCH...
  • [Machine Learning] Hierarchical clustering

    2018-11-18 18:20:56
    This hierarchy of clusters is represented as a tree (or dendrogram): the root of the tree gathers all the samples, and the leaves are the individual samples. This post briefly explains the principle of hierarchical clustering, focusing on implementing it with sklearn, scipy, seaborn, etc., and visualizing the results. Brief overview: I came across a detailed...
  • Programming Computer Vision with Python, image clustering: (1) k-means clustering, 1.1 the SciPy clustering package, 1.2 image clustering, visualizing images on principal components, pixel clustering; (2) hierarchical clustering; (3) spectral clustering ...
  • Hierarchical clustering. The concept: hierarchical clustering is a very intuitive algorithm; as the name suggests, it clusters layer by layer. Hierarchical methods first compute the distances between samples, each time merging the closest points into one class, and then recompute the distances between classes...
