
    这里,我们来解读一下scipy中给出的层次聚类scipy.cluster.hierarchy的示例:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
    from matplotlib import pyplot as plt
    X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
    Z = linkage(X, method='centroid')
    f = fcluster(Z,t=3,criterion='distance')
    fig = plt.figure(figsize=(5, 3))
    dn = dendrogram(Z)
    print('Z:\n', Z)
    print('f:\n', f)
    plt.show()
    

    可以得到输出结果:

    Z:
     [[ 2.          7.          0.          2.        ]
     [ 5.          6.          0.          2.        ]
     [ 1.          9.          1.          3.        ]
     [ 4.          8.          1.          3.        ]
     [ 0.         11.          1.66666667  4.        ]
     [ 3.         12.          3.25        5.        ]
     [10.         13.          7.26666667  8.        ]]
    f:
     [2 1 2 3 2 1 1 2]
    

    f给出了[2, 8, 0, 4, 1, 9, 9, 0]中每一个元素所属的类别,这里得到了3类。需要注意,criterion='distance'时t是距离阈值而不是类别数;如果想直接指定聚成5类,应使用fcluster(Z, t=5, criterion='maxclust')。
    最令人头大的是Z矩阵的含义,翻看了很多博客都没有讲清楚。这里,我来讲解一下:

    由于层次聚类每一次都会聚合两个类,那么如果有n个样本,那么最终会进行(n-1)次聚合,显然,Z矩阵有n-1行,这就意味着每一行表示了一次操作。那么接下来,我们从上到下解读。

    首先,Z矩阵的构成一定是一个(n-1)*4的矩阵。前两个元素是每一步合并的两个簇,第三个元素是这些集群之间的距离,第四个元素是合并后的新簇中元素个数。

    第一步:
    根据Z的第一行,索引2和7的两个簇将合并为一个新簇;新簇会获得一个新的索引 n+i(这里 n=8、i=0,所以新簇索引为8)。第三个数0表示索引2和7这两个簇之间的距离为0(二者的值都是0,这是显然的);最后一个数2表示合并后的新簇含有2个元素。
    (原文此处用若干示意图逐步展示了后续每一次合并;各步同理,都可以按上面的方式从Z的对应行读出。)
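    下面给一个小的辅助脚本(基于上文已有的 X 和 Z,仅作演示),把 Z 的每一行翻译成"哪两个簇合并、距离多少、新簇编号是多少",可以用来核对上面的逐步解读:

    n = len(X)          # 样本数,这里是 8
    for i, (a, b, dist, size) in enumerate(Z):
        # 索引小于 n 的是原始样本,否则指向之前第 (索引 - n) 行合并出来的新簇
        name = lambda idx: f'样本{int(idx)}' if idx < n else f'簇{int(idx)}'
        print(f'第{i+1}步: 合并 {name(a)} 和 {name(b)},距离 {dist:.4f},'
              f'新簇编号 {n + i},共 {int(size)} 个元素')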
    最后,我们来分析一下各个函数以及常用参数的设置:

    linkage函数
    1.第一个参数y为一个尺寸为(n,m)的二维观测矩阵:一共有n个样本,每个样本有m个维度(也可以传入由pdist得到的压缩距离向量)。
    2.参数method =
    ’single’:最近邻,取两簇间最近样本对的距离作为簇间距
    ’complete’:最远邻,取两簇间最远样本对的距离作为簇间距
    ’average’:平均距离,取两簇间所有样本对距离的平均值
    ’centroid’:质心距离,取两簇质心之间的欧氏距离
    ’ward’:离差平方和(Ward最小方差法)
    3.返回值:(n-1)*4的矩阵Z

    fcluster函数->从给定链接矩阵定义的层次聚类中形成平面聚类

    这个函数把树状的层次聚类"压平"成若干个平面簇,划分结果主要取决于阈值t(例如允许的最大簇间距离)。
    1.参数Z是linkage函数输出的连接矩阵。
    2.参数t(标量):形成平面簇时使用的阈值。
    3.参数criterion:
    ’inconsistent’:默认值。如果一个簇节点及其所有后代的不一致系数都小于等于t,那么它的所有叶子后代归入同一个平面簇;当没有非单例簇满足此条件时,每个样本自成一簇。
    ’distance’:保证每个平面簇内任意两个原始观测之间的 cophenetic 距离不超过t。
    4.输出是每一个样本所属的簇编号。

    dendrogram函数
    绘制层次聚类图
    (未完待续…后续完善dendrogram函数作图的细节与完善以及如何基于相关性来做聚类)

    作于:
    2021-6-8
    17:15


    来源:数据STUDIO


    聚类是将大量数据中具有"相似"特征的数据点或样本划分为同一个类别。聚类分析提供了样本集在非监督模式下的类别划分。

    基本思想

    • 物以类聚、人以群分

    常用于数据探索或挖掘前期

    • 没有先验经验做探索性分析

    • 样本量较大时做预处理

    解决问题

    • 数据集可以分几类

    • 每个类别有多少样本量

    • 不同类别中各个变量的强弱关系如何

    • 不同类型的典型特征是什么

    应用

    • 群类别间的差异性特征分析

    • 群类别内的关键特征提取

    • 图像压缩、分割、图像理解

    • 异常检测

    • 数据离散化

    缺点: 无法提供明确的行动指向; 数据异常对结果有影响。


    K-Means 聚类

    K-Means算法的思想简单,对于给定的样本集,按照样本之间的距离大小,将样本集划分为K个簇。让簇内的点尽量紧密的连在一起,而让簇间的距离尽量的大。

    均值聚类是一种矢量量化方法,起源于信号处理,是数据挖掘中流行的聚类分析方法。

    算法原理

    • 随机选取K个质心;

    • 开始循环,计算每个样本点到各个质心的距离,样本离哪个质心近就分给哪个质心,得到K个簇;

    • 对于每个簇,计算所有被分到该簇的样本点的均值,作为该簇新的质心;

    • 重复上述两步,直到所有簇不再发生变化(见下方的最小实现示意)。
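    按照上面的步骤,可以用 NumPy 写一个极简的 K-Means 迭代示意(仅用于说明原理,未处理空簇等情况,实际使用请用 sklearn.cluster.KMeans):

    import numpy as np

    def simple_kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # 随机选 k 个样本作初始质心
        for _ in range(n_iter):
            # 把每个样本分给距离最近的质心
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            # 用每个簇内样本的均值更新质心
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # 质心不再变化则停止
                break
            centers = new_centers
        return centers, labels

    # 用法示意:centers, labels = simple_kmeans(np.random.rand(100, 2), k=3)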

    衡量指标

    • 组内平方和:Total_Inertia

    • 轮廓系数:综合衡量簇内差异a与最近簇间差异b,s = (b - a) / max(a, b),取值范围[-1, 1],越大越好

    优化目标

    • 簇内差异小,簇间差异大;其中差异由样本点到其所在簇质心的距离来衡量

    应用

    • 客户分群、用户画像、精确营销、基于聚类的推荐系统

    K-Means算法的优点

    • k-means算法是解决聚类问题的一种经典算法,算法简单、快速 。

    • 算法尝试找出使平方误差函数值最小的k个划分。当簇是密集的、球状或团状的,且簇与簇之间区别明显时,聚类效果较好 。

    缺点

    • k-means方法只有在簇的平均值被定义的情况下才能使用,且对有些分类属性的数据不适合。

    • 要求用户必须事先给出要生成的簇的数目k。

    • 对初值敏感,对于不同的初始值,可能会导致不同的聚类结果。

    • 不适合于发现非凸面形状的簇,或者大小差别很大的簇。

    • 对于"噪声"和孤立点数据敏感,少量的该类数据能够对平均值产生极大影响。

    单支股票单个字段聚类

    仍然以股市数据为例,根据每支股票整个时间段内的股价特征,将相似的那些交易日打上标签,并通过可视化方式将整个时间段内的交易日开盘价与收盘价展现出来。

    数据准备

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import warnings
    warnings.filterwarnings("ignore")
    import yfinance as yf
    yf.pdr_override()
    
    symbol = 'TCEHY'
    start = '2020-01-01'
    end = '2021-01-01'
    
    dataset = yf.download(symbol,start,end)
    dataset.head()


    数据标准化

    X = dataset[['Open','High','Low','Close','Adj Close','Volume']]
    from sklearn.preprocessing import StandardScaler
    X = dataset.values[:,1:]
    X = np.nan_to_num(X)
    Clus_dataSet = StandardScaler().fit_transform(X)
    Clus_dataSet
    array([[-1.33493398, -1.31490333, -1.33543485, 
            -1.33612775, -0.95734284],
           [-1.19325204, -1.16643501, -1.15260357, 
            -1.15474442,  0.23740018],
           ...,
           [ 0.99796748,  1.03600566,  0.98270623,  
             0.98235044, -0.41634718],
           [ 1.0222281 ,  0.97701185,  0.99932706,  
             0.99888395, -0.63830968]])

    模型建立

    from sklearn.cluster import KMeans 
    # 设置簇中心个数
    clusterNum = 3
    k_means = KMeans(init = "k-means++", 
                     n_clusters = clusterNum,
                     n_init = 12)
    k_means.fit(X)
    labels = k_means.labels_
    print(labels)
    [1 0 1 0 0 1 1 0...1 1]

    设置价格标签

    dataset["Prices"] = labels
    dataset.head(5)


    将三个聚类中心聚合求均值

    dataset.groupby('Prices').mean()


    可视化

    以类别为颜色,开盘价为散点的面积绘制开盘价和收盘价的气泡图。

    area = np.pi * ( X[:, 1])**2  
    plt.figure(figsize=(10,6))
    plt.scatter(X[:, 0], X[:, 3], s=area, 
                c=labels.astype(np.float), 
                alpha=0.5)
    plt.xlabel('Open', fontsize=18)
    plt.ylabel('Close', fontsize=16)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlim([35,95])
    plt.ylim([30,100])
    plt.show()


    3D可视化聚类结果

    from mpl_toolkits.mplot3d import Axes3D 
    fig = plt.figure(1, figsize=(8, 6))
    plt.clf() # Clear figure
    # 设置3d画布
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla() # Clear axis
    ax.set_xlabel('High', fontsize=18)
    ax.set_ylabel('Open', fontsize=16)
    ax.set_zlabel('Close', fontsize=16)
    # 绘制散点图
    ax.scatter(X[:, 1], X[:, 0], X[:, 3], c= labels.astype(np.float))


    多支股票单个字段聚类

    数据获取

    从维基百科中获取股票符号、行业和子行业。

    # 美股
    wiki_table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies',header=0)[0]
    symbols = list(wiki_table['Ticker symbol'])
    # A股
    import urllib
    word = '深圳证券交易所主板上市公司列表'
    word = urllib.parse.quote(word)
    wiki_table = pd.read_html(f'https://zh.wikipedia.org/wiki/{word}',header=0)[0]
    symbols = list(wiki_table['公司代码'])

    或直接在深圳证券交易所下载A股列表。

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd 
    import baostock as bs
    # 从下载下来的A股列上获取上市公司名称及代码
    zero = '000000'
    A_table = pd.read_excel('./A股列表.xlsx')
    A_codes = A_table['A股代码'].map(lambda x: zero[0: 6 - len(str(x))] + str(x))[0: 200].values
    A_names = A_table['A股简称'][0: 200].values
    print(A_codes)


    ['000001' '000002' '000004' '000005' '000006' '000007' '000008' '000009'
     '000010' '000011' '000012' '000014' '000016' '000017' '000019' '000020'
      ...
     '000611' '000612' '000613' '000615' '000616' '000617' '000619' '000620'
     '000622' '000623' '000625' '000626' '000627' '000628' '000629' '000630']

    根据上面获得的股票代码下载相应日k线图。

    bs.login()
    dataset = pd.DataFrame()
    for num, A_code in enumerate(A_codes):
        print(A_code)
        result = bs.query_history_k_data(A_code, fields = 'date,close',
                                        start_date = '2020-01-01',
                                        end_date = '2021-01-01',
                                        frequency='d')
        df_result = result.get_data().rename(columns={'close':A_names[num]})
        
        if num == 0:
            dataset = df_result
        else:
            dataset = dataset.merge(df_result, on=['date'])
    bs.logout()
    dataset = dataset.set_index('date').applymap(lambda x: float(x))


    数据预处理

    import math
    # 计算一个理论一年的平均年收益率Returns和波动率Volatility
    returns = dataset.pct_change().mean() * 252
    returns = pd.DataFrame(returns)
    # print(returns)
    returns.columns = ['Returns']
    returns['Volatility'] = dataset.pct_change(
                        ).std() * math.sqrt(252)
    # print(returns['Volatility'])
    # 将数据格式化为numpy数组以提供给K-Means算法
    data = np.asarray(
              [np.asarray(returns['Returns']),
               np.asarray(returns['Volatility'])]
               ).T
    # 删除NaN值,将其替换为0
    cleaned_data = np.where(np.isnan(data), 0, data)
    X = cleaned_data

    建立聚类模型

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    
    # 在变量“n_clusters”中定义集群数量
    n_clusters = 12
    
    # 数据聚类
    kmeans = KMeans(n_clusters)
    kmeans.fit(X)
    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
        n_clusters=12, n_init=10, n_jobs=None, precompute_distances='auto',
        random_state=None, tol=0.0001, verbose=0)

    绘制学习曲线

    from sklearn.cluster import KMeans
    
    min_clusters = 1
    max_clusters = 20
    distortions = []
    for i in range(min_clusters, max_clusters+1):
        km = KMeans(n_clusters=i,
                    init='k-means++',
                    n_init=10,
                    max_iter=300,
                    random_state=0)
        km.fit(X)
        distortions.append(km.inertia_)
        
    # 绘图
    plt.figure(figsize=(14,6))
    plt.plot(range(min_clusters, max_clusters+1), distortions, marker='o')
    plt.xlabel("Number of clusters", fontsize=18)
    plt.ylabel("Distortion", fontsize=16)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.show()


    绘制轮廓系数

    wcss = []
    from sklearn.metrics import silhouette_score
    for k in range(2, 20):
        k_means = KMeans(n_clusters=k)
        k_means.fit(X)
        wcss.append(silhouette_score(X, k_means.labels_))
    fig = plt.figure(figsize=(15, 5))
    plt.plot(range(2, 20), wcss)
    plt.grid(True)
    plt.xticks(fontsize=15)
    plt.xlabel("Number of clusters", fontsize=18)
    plt.ylabel('Silhouette_score', fontsize=15)
    plt.title('Silhouette_score curve', fontsize=18)
    plt.show()


    简单判断一下,图中曲线的拐点大致出现在聚类中心个数为9处,因此n_clusters可以选择9(一般来说,轮廓系数越大,聚类效果越好)。
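    如果想程序化地选择轮廓系数最大的 k,可以基于上面已经算好的 wcss 列表(其中依次存放了 k=2..19 的 silhouette_score)做一个简单的选择,示意如下:

    import numpy as np
    ks = list(range(2, 20))            # 与上面循环中的 k 范围一致
    best_k = ks[int(np.argmax(wcss))]  # 轮廓系数越大越好
    print('轮廓系数最大的聚类数 k =', best_k)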

    scipy中的k-means

    from scipy.cluster.vq import kmeans, vq
    # 计算 K = 5 的K均值(5个簇)
    centroids,_ = kmeans(cleaned_data,5)
    # 将每个样本分配给一个簇
    idx,_ = vq(cleaned_data,centroids)
    data = cleaned_data

    绘制聚类散点图

    将每种簇按照不同的颜色区分绘制,同时绘制出簇中心。
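    原文此处没有给出作图代码,下面是一个基于上文变量 data、idx、centroids 的最小示意(坐标轴分别为收益率与波动率,颜色表示簇,红色叉号为簇中心):

    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.scatter(data[:, 0], data[:, 1], c=idx, cmap='viridis', alpha=0.6)
    plt.scatter(centroids[:, 0], centroids[:, 1], c='r', marker='x', s=120)
    plt.xlabel('Returns', fontsize=18)
    plt.ylabel('Volatility', fontsize=16)
    plt.show()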


    统计每个股票属于哪个簇

    details = [(name,cluster) for name, 
              cluster in zip(returns.index,idx)]
    labels =['A股简称', 'Cluster']
    df = pd.DataFrame.from_records(details, 
                                   columns=labels)
    df.head(n=10)

       A股简称  Cluster
    0  平安银行        3
    1  万 科A        1
    2  国华网安        3
    3  世纪星源        1
    4  深振业A        3
    5  全新好         2
    6  神州高铁        2
    7  中国宝安        3
    8  美丽生态        3
    9  深物业A        0

    ‍多支股票多个字段举例

    stocks_dict = dict(zip(A_names,A_codes))
    bs.login()
    dataset = []
    for names, A_code in stocks_dict.items():
        print(A_code)
        result = bs.query_history_k_data(A_code, fields = 'date,open,high,low,close,volume',
                                        start_date = '2020-01-01',
                                        end_date = '2021-01-01',
                                        frequency='d')
        df_result = result.get_data()
        dataset.append(df_result)
    bs.logout()
    
    # 获取开盘价
    open_price = np.array([p["open"] for p in dataset]).astype(np.float)
    
    # 获取收盘价
    close_price = np.array([p["close"] for p in dataset]).astype(np.float)
    # 计算变化率
    X = (close_price - open_price) / open_price

    建模

    from sklearn.cluster import KMeans
    # 定义聚类中心个数
    n_clusters = 12
    kmeans = KMeans(n_clusters)
    kmeans.fit(X)
    # 输出结果
    labels = kmeans.labels_
    for i in range(n_clusters):
        print('Cluster %i: %s' % ((i + 1), 
              ', '.join(A_names[labels == i])))


    使用管道链接归一化和聚类模型

    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import Normalizer
    
    normalizer = Normalizer()
    kmeans = KMeans(n_clusters=10, max_iter = 1000)
    # 制作一个管道链接归一化和kmeans
    pipeline = make_pipeline(normalizer, kmeans)
    pipeline.fit(X)
    labels = pipeline.predict(X)
    df = pd.DataFrame({'labels':labels,
                       'companies':A_names})
    print(df.sort_values('labels'))
    labels companies
    434       0      华铁股份
    472       0      协鑫能科
    419       0      首钢股份
    417       0      中通客车
    194       0      长安汽车
    ..      ...       ...
    266       9      鲁  泰A
    268       9      国元证券
    467       9      传化智联
    234       9      中山公用
    0         9      平安银行
    
    [500 rows x 2 columns]

    使用PCA降维

    如果用于聚类的数据维度很高,聚类分析通常会耗费大量计算时间。此时可以先用PCA降维,再进行聚类。

    from sklearn.preprocessing import Normalizer
    from sklearn.decomposition import PCA
    normalizer = Normalizer()
    new_X = normalizer.fit_transform(X)
    # 使用PCA降维
    reduced_data = PCA(n_components = 2).fit_transform(new_X)
    #对降维后的数据训练kmeans
    kmeans = KMeans(n_clusters =10)
    kmeans.fit(reduced_data)
    labels = kmeans.predict(reduced_data)
    # print(kmeans.inertia_)
    # 创建DataFrame
    df = pd.DataFrame({'labels':labels,
                       'companies':A_names})
    # 根据标签排序
    print(df.sort_values('labels'))
    3.2745576650179067
         labels companies
    339       0      *ST长动
    445       0      诚志股份
    37        0      德赛电池
    244       0      模塑科技
    41        0      深 赛 格
    ..      ...       ...
    275       9      南风化工
    108       9      国际医学
    22        9      深深房A
    444       9      九 芝 堂
    164       9      *ST金洲
    
    [500 rows x 2 columns]

    可视化簇及簇中心


    Mini-Batch K-Means聚类

    Mini Batch K-Means算法是K-Means算法的变种:它在每次迭代中只随机抽取一小批数据子集来更新质心,同时仍然试图优化同样的目标函数,从而大大减少计算时间。与标准K-Means相比,它的收敛时间明显缩短,而聚类结果通常只略差于标准算法。

    该算法的迭代步骤有两步:

    • 从数据集中随机抽取一些数据形成小批量,把他们分配给最近的质心

    • 更新质心

    与K-Means算法相比,数据的更新是在每一个小的样本集上。对于每一个小批量,通过计算平均值得到更新质心,并把小批量里的数据分配给该质心,随着迭代次数的增加,这些质心的变化是逐渐减小的,直到质心稳定或者达到指定的迭代次数,停止计算。
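    如果想更贴近"每次只用一小批样本更新质心"这一过程,sklearn 的 MiniBatchKMeans 也支持用 partial_fit 手动喂入小批量;下面是一个示意(其中 X 为假设的随机数据,仅用于演示):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))                 # 假设的示例数据
    mbk = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)

    for _ in range(20):                            # 迭代若干轮
        batch = X[rng.choice(len(X), size=100, replace=False)]  # 随机抽一个小批量
        mbk.partial_fit(batch)                     # 只用这一批样本更新质心

    print(mbk.cluster_centers_)
    print(mbk.predict(X)[:10])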

    单支股票多个字段

    import baostock as bs
    bs.login()
    result = bs.query_history_k_data('sh.601318', fields = 'date,open,high, low,close,volume',
                                        start_date = '2018-01-01',
                                        end_date = '2021-01-01',
                                        frequency='d')
    dataset = result.get_data().set_index('date').applymap(lambda x: float(x))
    bs.logout()
    
    dataset['Increase_Decrease'] = np.where(dataset['volume'].shift(-1) > dataset['volume'],1,0)
    dataset['Buy_Sell_on_Open'] = np.where(dataset['open'].shift(-1) > dataset['open'],1,0)
    dataset['Buy_Sell'] = np.where(dataset['close'].shift(-1) > dataset['close'],1,0)
    dataset['Returns'] = dataset['close'].pct_change()
    dataset = dataset.dropna()
    dataset.tail()


    模型建立

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import MiniBatchKMeans
    
    X = dataset.drop(['close', 'open'], axis=1).values
    Y = dataset['close'].values
    # 数据标准化
    scaler = StandardScaler()
    X_std = scaler.fit_transform(X)
    # 创建聚类对象
    clustering = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)
    # 训练模型
    model = clustering.fit(X_std)

    预测结果

    # 各簇的质心
    model.cluster_centers_
    # 训练样本的簇标签
    model.labels_
    # 对(标准化后的)数据进行预测,predict只接收特征矩阵
    model.predict(X_std)

    基于图的 AP 聚类

    Affinity Propagation Clustering(简称AP算法)特别适合高维、多类数据快速聚类,相比传统的聚类算法,该算法算是比较新的,从聚类性能和效率方面都有大幅度的提升。

    AP算法的基本思想:将全部样本看作网络的节点,然后通过网络中各条边的消息传递计算出各样本的聚类中心。聚类过程中,共有两种消息在各节点间传递,分别是吸引度( responsibility)和归属度(availability) 。

    AP算法通过迭代过程不断更新每一个点的吸引度和归属度值,直到产生m个高质量的Exemplar(类似于质心),同时将其余的数据点分配到相应的聚类中。

    AP算法流程:

    • 步骤1:算法初始,将吸引度矩阵R和归属度矩阵初始化为0矩阵;

    • 步骤2:更新吸引度矩阵

    • 步骤3:更新归属度矩阵

    • 步骤4:根据衰减系数λ(damping factor)对以上两个矩阵的更新结果进行加权衰减

    • 重复步骤2,3,4直至矩阵稳定或者达到最大迭代次数,算法结束。

    • 最终取吸引度与归属度之和最大的那些点k作为聚类中心(Exemplar),其余样本点分配给最近的聚类中心(sklearn的最小调用示意见下方)。
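    作为对照,下面给出 sklearn 中 AffinityPropagation 的一个最小调用示意(damping 即上面的衰减系数;数据为假设的两团随机样本,仅作演示):

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(8, 1, size=(50, 2))])

    ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
    print('聚类中心个数:', len(ap.cluster_centers_indices_))
    print('前10个样本的簇标签:', ap.labels_[:10])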

    AP聚类算法的特点:

    • 无需指定聚类“数量”参数。

    • 明确的质心(聚类中心点)。

    • 对距离矩阵的对称性没要求。

    • 初始值不敏感。

    • 算法复杂度较高,约为O(N²T)(N为样本数,T为迭代次数),而K-Means只有O(NKT)的复杂度。

    • 若以误差平方和来衡量算法间的优劣,AP聚类比其他方法的误差平方和都要低。

    AP算法相对K-Means鲁棒性强且准确度较高,但没有任何一个算法是完美的,AP聚类算法也不例外:

    • AP聚类应用中需要手动指定Preference和Damping factor,这其实是原有的聚类“数量”控制的变体。

    • 算法较慢。由于AP算法复杂度较高,运行时间相对K-Means长,这会使得尤其在海量数据下运行时耗费的时间很多。

    数据准备

    market_dates = np.vstack([dataset.index])
    open_price = np.array([p["open"] for p in dataset]).astype(np.float)
    high_price = np.array([p["high"] for p in dataset]).astype(np.float)
    low_price = np.array([p["low"] for p in dataset]).astype(np.float)
    close_price = np.array([p["close"] for p in dataset]).astype(np.float)
    volume_price = np.array([p["volume"] for p in dataset]).astype(np.float)

    数据预处理

    # 计算变化率
    X = (close_price - open_price) / open_price
    # 每日报价的变化量(收盘价-开盘价)携带了最多的信息
    variation = close_price - open_price
    from sklearn import cluster, covariance, manifold, preprocessing
    # 从相关性中学习图形结构
    edge_model = covariance.GraphicalLassoCV()
    # 标准化时间序列:使用相关性而不是协方差
    # 是更有效的结构恢复
    X = variation.copy().T
    
    # 在对输入数据进行归一化之后,经验协方差矩阵的特征值仍然跨越大约[0-8]的较大范围。
    # 使用sklearn.covariance.shrunk_covariance()函数缩小此范围可以使其在计算上更容易接受
    myScaler = preprocessing.StandardScaler()
    X = myScaler.fit_transform(X)
    emp_cov = covariance.empirical_covariance(X)
    shrunk_cov = covariance.shrunk_covariance(emp_cov, shrinkage=0.8)

    模型训练

    edge_model.fit(shrunk_cov)
    # 使用affinity propagation聚类
    _, labels = cluster.affinity_propagation(edge_model.covariance_)
    n_labels = labels.max()
    for i in range(n_labels + 1):
        print('Cluster %i: %s' % ((i + 1), ', '.join(A_names[labels == i])))


    市场结构可视化

    算法的基本思想是将样本数据看做网络的节点,根据节点之间的相互关系计算出每个节点作为聚类中心的合适程度,选择合适程度最高的几个数据节点作为聚类中心,并将其他节点分配给最合适的聚类中心。


    DBSCAN 聚类

    一种基于密度的带有噪声的空间聚类 。它将簇定义为密度相连的点的最大集合,能够把具有足够高密度的区域划分为簇,并可在噪声的空间数据集中发现任意形状的聚类。

    DBSCAN(Density-Based Spatial Clustering of Applications with Noise):寻找高密度区域中的核心样本,并由核心样本向外扩展出簇。适用于各簇密度相近的数据。

    DBSCAN算法将聚类视为由低密度区域分隔的高密度区域。由于这种相当通用的观点,DBSCAN发现的集群可以是任何形状,而k-means假设集群是凸形的。DBSCAN的核心组件是核心样本的概念,即位于高密度区域的样本。因此,一个集群是一组彼此接近的核心样本(通过一定的距离度量)和一组与核心样本相近的非核心样本(但它们本身不是核心样本)。

    from sklearn.cluster import DBSCAN
    import numpy as np

    X = np.array([[1, 2], [2, 2], [2, 3],
                  [8, 7], [8, 8], [25, 80]])
    clustering = DBSCAN(eps=3, min_samples=2).fit(X)
    print(clustering.labels_)
    # array([ 0,  0,  0,  1,  1, -1]),其中-1表示噪声点
    print(clustering)
    # DBSCAN(eps=3, min_samples=2)

    eps float, default=0.5 两个样本被视为彼此邻域内所允许的最大距离。注意它并不是簇内点间距离的上限;eps是DBSCAN中最重要的参数,需要针对具体数据集和距离函数来选择。

    min_samples int, default=5 被视为核心点的某一邻域内的样本数(或总权重)。这包括点本身。

    层次聚类

    层次聚类(Hierarchical Clustering)在数据挖掘和统计中,层次聚类是一种聚类分析方法,旨在建立一个层次的聚类。

    层次聚类(Hierarchical Clustering)通过计算不同类别数据点间的相似度来创建一棵有层次的嵌套聚类树。在聚类树中,不同类别的原始数据点是树的最低层,树的顶层是一个聚类的根节点。创建聚类树有自下而上合并和自上而下分裂两种方法,

    合并算法

    层次聚类的合并算法通过计算两类数据点间的相似性,对所有数据点中最为相似的两个数据点进行组合,并反复迭代这一过程。

    简单来说

    通过计算每一个类别的数据点与所有数据点之间的欧式距离来确定它们之间的相似性,距离越小,相似度越高 。并将距离最近的两个数据点或类别进行组合,生成聚类树。

    数据准备

    import baostock as bs
    bs.login()
    result = bs.query_history_k_data('sh.601318', fields = 'date,open,high, low,close,volume',
                                        start_date = '2017-01-01',
                                        end_date = '2021-01-01',
                                        frequency='d')
    dataset = result.get_data().set_index('date').applymap(lambda x: float(x))
    bs.logout()
    dataset = dataset.dropna()
    dataset = dataset.reset_index(drop=True)
    print("Shape of dataset after cleaning: ", dataset.size)
    dataset.head(5)
    Shape of dataset after cleaning:  4870


    数据预处理

    features = dataset[['open','high','low','close','volume']]
    # 标准化
    # 将每个数据缩放到0和1之间
    from sklearn.preprocessing import MinMaxScaler
    x = features.values #returns a numpy array
    min_max_scaler = MinMaxScaler()
    feature_mtx = min_max_scaler.fit_transform(x)
    feature_mtx [0:5]
    array([[0.00189166, 0.00996622, 0.00734394,  
            0.00807977, 0.04191282],
           [0.00791058, 0.00625   , 0.0106662 , 
            0.00756404, 0.02496157],
           [0.00859845, 0.00878378, 0.01363875, 
            0.00893932, 0.03806948],
           [0.00859845, 0.00793919, 0.00786851, 
            0.00395393, 0.06706422],
           [0.00412726, 0.00219595, 0.00646966, 
            0.00361011, 0.03184948]])

    scipy中的层次聚类

    聚类模型建立

    • criterion='distance'

    import scipy
    leng = feature_mtx.shape[0]
    D = np.zeros([leng,leng])
    for i in range(leng):
        for j in range(leng):
            # 计算两个一维数组之间的欧氏距离。
            # scipy.spatial.distance中包含各种距离的计算
            D[i,j] = scipy.spatial.distance.euclidean(feature_mtx[i], feature_mtx[j])
            
    from scipy.cluster import hierarchy 
    from scipy.cluster.hierarchy import fcluster
    
    Z = hierarchy.linkage(D, 'complete')
    max_d = 3
    clusters = fcluster(Z, max_d, criterion='distance')
    clusters
    array([43, 43, 43, 43, 43, 43, 43, 43, 43, 
           44, 43, 43, 43, 43, 43, 43, 43,
           43, 43, 45, 43, 43, 43, 43, 45, 43, 
           45, 43, 45, 45, 45, 45, 45, 43,
            ...
           30, 30, 30, 30, 32, 32, 30, 28, 28, 
           28, 28, 28, 38, 38, 36, 34, 33,
           33, 33, 33, 36, 36, 37, 37, 38, 37, 
           28, 37, 37, 37, 37, 28, 27, 27,
           30, 27, 28, 28, 28], dtype=int32)
    • criterion='maxclust'

    from scipy.cluster.hierarchy import fcluster
    k = 5
    clusters = fcluster(Z, k, criterion='maxclust')
    clusters
    array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
           4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
           ...
           2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2], dtype=int32)

    fcluster参数

    scipy.cluster.hierarchy.fcluster(Z,    
                        t, 
                        criterion='inconsistent', 
                        depth=2, 
                        R=None, 
                        monocrit=None)

    Z: ndarray

    根据给定的链接矩阵定义的层次聚类,形成平面聚类。

    t: scalar

    • 对于 "inconsistent", "distance "or "monocrit" 的标准,这是形成平面集群时要应用的阈值。

    • 对于"maxclust"或"maxclust_monocrit"标准,这将是请求的最大集群数量。

    criterion: str 可选参数
    用于形成扁平集群的标准。可以是以下任何值:

    • inconsistent:

    如果一个集群节点及其所有后代节点的值小于或等于t的值不一致,那么它的所有叶子后代都属于同一个平面集群。当没有非单例集群满足此条件时,每个节点都被分配到自己的集群中。(默认)

    • distance:

    形成平面簇,使每个平面簇内原始观测之间的 cophenetic 距离都不超过 t。

    • maxclust:

    找到一个最小的阈值 r,使得同一平面簇中任意两个原始观测之间的 cophenetic 距离都不超过 r,并且形成的平面簇不超过 t 个。

    • monocrit:

    当 monocrit[i] <= t 时,从索引为 i 的簇节点 c 形成一个平面簇。

    例如,以不一致矩阵 R 中计算出的最大平均距离(maxRstat)为单调准则、阈值取0.8:

    MR = maxRstat(Z, R, 3)
    fcluster(Z, t=0.8, criterion='monocrit', monocrit=MR)
    • maxclust_monocrit:

    当簇节点 c(索引为 i)及其以下所有簇索引 j 都满足 monocrit[j] <= r 时,从非单例簇节点 c 形成一个平面簇;r 取能使平面簇数量不超过 t 的最小值。monocrit 必须是单调的。

    例如,要在最大不一致值上取最小的阈值 r,使得形成的平面簇不超过 3 个,可以执行:

    MI = maxinconsts(Z, R)
    fcluster(Z, t=3, criterion='maxclust_monocrit', monocrit=MI)

    可视化层次聚类

    import pylab
    fig = pylab.figure(figsize=(18,50))
    def llf(id):
        return '[%s %s %s]' % (dataset['high'][id], dataset['low'][id], int(float(dataset['close'][id])) )
        
    dendro = hierarchy.dendrogram(Z,  leaf_label_func=llf, leaf_rotation=0, leaf_font_size =12, orientation = 'right')

    pylab 提供了比较强大的画图功能,平常使用最多的应该是画线了。

    hierarchy.dendrogram将分层聚类绘制为树状图。

    树状图通过在非单例群集及其子级之间绘制一条U-shaped链接来说明每个群集的组成方式。U-link的顶部指示群集合并。U-link的两条腿指示要合并的集群。U-link的两条腿的长度表示子群集之间的距离。它也是两个子类中原始观测值之间的距离。

    Z: ndarray

    链接矩阵编码分层聚类以呈现为树状图。看到linkage函数以获取有关格式的更多信息Z

    orientation:str, 可选参数
    树状图的绘制方向,可以是以下任意字符串:

    • 'top'

    在顶部绘制根,并绘制向下的后代链接。(默认)。

    • 'bottom'

    在底部绘制根,并绘制向上的后代链接。

    • 'left'

    在左边绘制根,在右边绘制后代链接。

    • 'right'

    在右边绘制根,在左边绘制后代链接。

    leaf_rotation:double, 可选参数
    指定旋转叶子标签的角度(以度为单位)。如果未指定,则旋转基于树状图中的节点数(默认为0)。

    leaf_font_size:int, 可选参数
    指定叶子标签的字体大小(以磅为单位)。未指定时,大小基于树状图中的节点数。

    leaf_label_func:lambda 或 function, 可选参数
    leaf_label_func是一个可调用对象:对于每个叶子,传入其簇索引k,函数应返回该叶子要显示的标签字符串。

    索引 k < n 对应原始观测,索引 k >= n 对应非单例簇。


    层次聚类热图

    热图的绘制非常简单,因为seaborn的工具包非常强大,我们使用clustermap函数即可。

    import seaborn as sns
    sns.clustermap(D,method ='ward',metric='euclidean')



    计算距离矩阵

    from scipy.spatial import distance_matrix 
    # 返回所有成对距离的矩阵。
    dist_matrix = distance_matrix(feature_mtx,feature_mtx) 
    print(dist_matrix)
    [[0.         0.01867314 0.01007541 ... 1.71146908 
      1.71172797 1.75931251]
     [0.01867314 0.         0.01376365 ... 1.70982933 
      1.71026084 1.75873221]
     ...
     [1.71172797 1.71026084 1.70562427 ... 0.02158105 
      0.         0.09823995]
     [1.75931251 1.75873221 1.75346407 ... 0.11352427 
      0.09823995 0.        ]]

    sklearn中的层次聚类

    from sklearn.cluster import AgglomerativeClustering 
    agglom = AgglomerativeClustering(n_clusters = 6, linkage = 'complete')
    agglom.fit(feature_mtx)
    agglom.labels_
    array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         ...
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2], dtype=int64)
    dataset['cluster_'] = agglom.labels_
    dataset.head()


    可视化层次聚类

    分别以股票的最高价和最低价为轴,以收盘价为圆圈的面积,以不同颜色区分不同簇,绘制聚类散点图。
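    原文此处未附作图代码,下面是与上述描述对应的一个示意(沿用上文的 dataset 与 cluster_ 列;面积的缩放系数是随意取的):

    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.scatter(dataset['high'], dataset['low'],
                s=dataset['close'] * 2,            # 收盘价决定圆圈面积(系数仅为示意)
                c=dataset['cluster_'], cmap='tab10', alpha=0.5)
    plt.xlabel('High', fontsize=18)
    plt.ylabel('Low', fontsize=16)
    plt.show()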


    每个集群中按交易量聚合并可视化

    agg_price = dataset.groupby(['cluster_','volume'])[['open','high','low','close']].mean()
    agg_price


    同样以股票的最高价和最低价为轴,以收盘价为圆圈的面积,以不同颜色区分不同簇,绘制聚类散点图。


    ---------End---------

  • python 聚类分析


    转自博客 https://blog.csdn.net/elaine_bao/article/details/50242867
    keams聚类:https://www.cnblogs.com/yjd_hycf_space/p/7094005.html(可以试试)
    scipy cluster库简介

    scipy.cluster是scipy下的一个做聚类的package, 共包含了两类聚类方法:
    1. 矢量量化(scipy.cluster.vq):支持vector quantization 和 k-means 聚类方法
    2. 层次聚类(scipy.cluster.hierarchy):支持hierarchical clustering 和 agglomerative clustering(凝聚聚类)

    聚类方法实现:k-means和hierarchical clustering.

    ###cluster.py
    #导入相应的包
    import scipy
    import scipy.cluster.hierarchy as sch
    from scipy.cluster.vq import vq,kmeans,whiten
    import numpy as np
    import matplotlib.pylab as plt
    
    
    #生成待聚类的数据点,这里生成了20个点,每个点4维:
    points=np.random.randn(20,4)  # 新版scipy已移除scipy.randn,改用numpy生成
    
    #1. 层次聚类
    #生成点与点之间的距离矩阵,这里用的欧氏距离:
    disMat = sch.distance.pdist(points,'euclidean') 
    #进行层次聚类:
    Z=sch.linkage(disMat,method='average') 
    #将层级聚类结果以树状图表示出来并保存为plot_dendrogram.png
    P=sch.dendrogram(Z)
    plt.savefig('plot_dendrogram.png')
    #根据linkage matrix Z得到聚类结果:
    cluster= sch.fcluster(Z, t=2, criterion='inconsistent')
    
    print "Original cluster by hierarchy clustering:\n",cluster
    
    #2. k-means聚类
    #将原始数据做归一化处理
    data=whiten(points)
    
    #使用kmeans函数进行聚类,输入第一维为数据,第二维为聚类个数k.
    #有些时候我们可能不知道最终究竟聚成多少类,一个办法是用层次聚类的结果进行初始化.当然也可以直接输入某个数值. 
    #k-means最后输出的结果其实是两维的,第一维是聚类中心,第二维是损失distortion,我们在这里只取第一维,所以最后有个[0]
    centroid=kmeans(data,max(cluster))[0]  
    
    #使用vq函数根据聚类中心对所有数据进行分类,vq的输出也是两维的,[0]表示的是所有数据的label
    label=vq(data,centroid)[0] 
    
    print "Final clustering by k-means:\n",label

    在Terminal中输入:python cluster.py
    输出:
    Original cluster by hierarchy clustering:
    [4 3 3 1 3 3 2 3 2 3 2 3 3 2 3 1 3 3 2 2]
    Final clustering by k-means:
    [1 2 1 3 1 2 0 2 0 0 0 2 1 0 1 3 2 2 0 0]
    数值是随机标的,不用看,只需要关注同类的是哪些.可以看出层次聚类的结果和k-means还是有区别的
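    如果想定量比较两种聚类结果的一致性,而不是肉眼比对标签,可以用调整兰德指数(ARI);下面的示意需要额外导入 scikit-learn,属于对原文的补充:

    from sklearn.metrics import adjusted_rand_score
    # cluster 为层次聚类的标签,label 为 k-means 的标签(沿用上文变量)
    print("Adjusted Rand Index:", adjusted_rand_score(cluster, label))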

    运行过程中出现了该问题SyntaxError: non-keyword arg after keyword arg

    Python中调用函数时,有时会报SyntaxError: non-keyword arg after keyword arg错误。
    这通常是因为函数中定义了部分参数的默认值,Python中*arg表示任意多个无名参数,类型为tuple(元组),**kwargs表示关键字参数,为dict(字典),因此没有默认值的参数,即*arg 要放在前面,**kwargs 要放在后面,出现这个错误后,可以在有默认值的参数前加上参数名即可。

    cluster= sch.fcluster(Z, t=2, 'inconsistent') 出现上述错误
    cluster= sch.fcluster(Z, t=2, criterion='inconsistent')则正常

    补充:一些函数的用法

    1.linkage(y, method=’single’, metric=’euclidean’)
    共包含3个参数:
    y是距离矩阵,由pdist得到;method是指计算类间距离的方法,比较常用的有3种:
    (1)single:最近邻,把类与类间距离最近的作为类间距
    (2)complete:最远邻,把类与类间距离最远的作为类间距
    (3)average:平均距离,类与类间所有pairs距离的平均

    其他的method还有如weighted,centroid等等,具体可以参考: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage

    2.fcluster(Z, t, criterion=’inconsistent’, depth=2, R=None, monocrit=None)
    第一个参数Z是linkage得到的矩阵,记录了层次聚类的层次信息; t是一个聚类的阈值-“The threshold to apply when forming flat clusters”,在实际中,感觉这个阈值的选取还是蛮重要的.另外,scipy提供了多种实施阈值的方法(criterion):

    其他的参数我用的是默认的,具体可以参考:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

    3.kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
    输入obs是数据矩阵,行代表数据数目,列代表特征维度; k_or_guess表示聚类数目;iter表示循环次数,最终返回损失最小的那一次的聚类中心;
    输出有两个,第一个是聚类中心(codebook),第二个是损失distortion,即聚类后各数据点到其聚类中心的距离的加和.

    参考页面:http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans
    

    4.vq(obs, code_book, check_finite=True)
    根据聚类中心将所有数据进行分类.obs为数据,code_book则是kmeans产生的聚类中心.
    输出同样有两个:第一个是各个数据属于哪一类的label,第二个和kmeans的第二个输出是一样的,都是distortion

    参考页面:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.vq.html#scipy.cluster.vq.vq
    

    【原文链接】https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/


    This is a tutorial on how to use scipy's hierarchical clustering.

    One of the benefits of hierarchical clustering is that you don't need to know the number of clusters k in advance. Sadly, there doesn't seem to be much documentation on how to actually use scipy's hierarchical clustering to make an informed decision and then retrieve the resulting clusters.

    In the following I'll explain:

    Naming conventions:

    Before we start, as i know that it's easy to get lost, some naming conventions:

    • X the samples (an n x m array), aka data points or "singleton clusters"
    • n number of samples
    • m number of features
    • Z cluster linkage array (contains the hierarchical clustering information)
    • k number of cluster

    Imports and Setup, Generating Sample Data, Perform the Hierarchical Clustering

    As the scipy linkage docs tell us, 'ward' (Ward's minimum-variance / sum-of-squared-deviations criterion) is one of the methods that can be used to calculate the distance between clusters. 'ward' causes linkage() to use the Ward variance minimization algorithm.

    I think it's a good default choice, but it never hurts to play around with some other common linkage methods like 'single', 'complete', 'average', ... and the different distance metrics like 'euclidean' (default), 'cityblock' aka Manhattan, 'hamming', 'cosine', ... if you have the feeling that your data should not just be clustered to minimize the overall intra cluster variance in euclidean space. For example, you should have such a weird feeling with long (binary) feature vectors (e.g., word vectors in text clustering).

    # needed imports
    from matplotlib import pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    import numpy as np
    # some setting for this notebook to actually show the graphs inline in the notebook, rather than in a new window.
    from IPython import get_ipython
    get_ipython().run_line_magic('matplotlib', 'inline')
    np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation
    
    # 生成两个cluster: a有100点, b有50:
    np.random.seed(4711)  # for repeatability of this tutorial
    a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
    b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
    X = np.concatenate((a, b),)
    print(X.shape)  # 150 样本 with 2维
    plt.scatter(X[:,0], X[:,1])
    plt.show()
    
    # generate the linkage matrix
    Z = linkage(X, 'ward')

    As you can see there's a lot of choice here and while python and scipy make it very easy to do the clustering, it's you who has to understand and make these choices. If i find the time, i might give some more practical advice about this, but for now i'd urge you to at least read up on the linked methods and metrics to make a somewhat informed choice. Another thing you can and should definitely do is check the Cophenetic Correlation Coefficient of your clustering with help of the cophenet() function. This (very very briefly) compares (correlates) the actual pairwise distances of all your samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, which in our case is pretty close: 0.98001483875742679

    from scipy.cluster.hierarchy import cophenet
    from scipy.spatial.distance import pdist
    c, coph_dists = cophenet(Z, pdist(X))
    print(c)

    No matter what method and metric you pick, the linkage() function will use that method and metric to calculate the distances between clusters (starting from your n individual samples (aka data points) as singleton clusters) and in each iteration will merge the two clusters which have the smallest distance according to the selected method and metric. It will return an array of length n - 1 giving you information about the n - 1 cluster merges needed to pairwise merge n clusters. Z[i] will tell us which clusters were merged in the i-th iteration, let's take a look at the first two points that were merged:

    print(Z[0])
    array([ 52.     ,  53.     ,   0.04151,   2.     ])

    We can see that each row of the resulting array has the format [idx1, idx2, dist, sample_count].

    In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151 between them. This created a cluster with a total of 2 samples.

    In the second iteration the algorithm decided to merge the clusters (original samples here as well) with indices 14 and 79, which had a distance of 0.05914. This again formed another cluster with a total of 2 samples.

    The indices of the clusters until now correspond to our samples. Remember that we had a total of 150 samples, so indices 0 to 149. Let's have a look at the first 20 iterations:

    print(Z[:20])
    array([[  52.     ,   53.     ,    0.04151,    2.     ],
           [  14.     ,   79.     ,    0.05914,    2.     ],
           [  33.     ,   68.     ,    0.07107,    2.     ],
           [  17.     ,   73.     ,    0.07137,    2.     ],
           [   1.     ,    8.     ,    0.07543,    2.     ],
           [  85.     ,   95.     ,    0.10928,    2.     ],
           [ 108.     ,  131.     ,    0.11007,    2.     ],
           [   9.     ,   66.     ,    0.11302,    2.     ],
           [  15.     ,   69.     ,    0.11429,    2.     ],
           [  63.     ,   98.     ,    0.1212 ,    2.     ],
           [ 107.     ,  115.     ,    0.12167,    2.     ],
           [  65.     ,   74.     ,    0.1249 ,    2.     ],
           [  58.     ,   61.     ,    0.14028,    2.     ],
           [  62.     ,  152.     ,    0.1726 ,    3.     ],
           [  41.     ,  158.     ,    0.1779 ,    3.     ],
           [  10.     ,   83.     ,    0.18635,    2.     ],
           [ 114.     ,  139.     ,    0.20419,    2.     ],
           [  39.     ,   88.     ,    0.20628,    2.     ],
           [  70.     ,   96.     ,    0.21931,    2.     ],
           [  46.     ,   50.     ,    0.22049,    2.     ]])

    We can observe that until iteration 13 the algorithm only directly merged original samples. We can also observe the monotonic increase of the distance.

    In iteration 13 the algorithm decided to merge cluster indices 62 with 152. If you paid attention the 152 should astonish you as we only have original sample indices 0 to 149 for our 150 samples. All indices idx >= len(X) actually refer to the cluster formed in Z[idx - len(X)].

    This means that while idx 149 corresponds to X[149], idx 150 corresponds to the cluster formed in Z[0]idx 151 to Z[1], 152 to Z[2], ...

    Hence, merge iteration 13 merged sample 62 with the cluster that was formed earlier in Z[152 - 150] = Z[2], i.e. samples 33 and 68.
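    Since this index bookkeeping is easy to get wrong, here is a small helper (not part of the original tutorial) that expands any cluster index appearing in Z back into the original sample indices it contains:

    def leaves_of(idx, Z, n):
        """Return the original sample indices contained in cluster `idx`."""
        idx = int(idx)
        if idx < n:                                   # an original sample ("singleton cluster")
            return [idx]
        left, right = Z[idx - n, 0], Z[idx - n, 1]    # the two children that were merged
        return leaves_of(left, Z, n) + leaves_of(right, Z, n)

    n = len(X)
    print(leaves_of(152, Z, n))   # -> [33, 68], the cluster formed in Z[2]
    print(leaves_of(62, Z, n))    # -> [62]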

    Let's check out the points coordinates to see if this makes sense:

    print(X[[33, 68, 62]])
    array([[ 9.83913, -0.4873 ],
           [ 9.89349, -0.44152],
           [ 9.97793, -0.56383]])

    Seems pretty close, but let's plot the points again and highlight them:

    idxs = [33, 68, 62]
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1])  # plot all points
    plt.scatter(X[idxs,0], X[idxs,1], c='r')  # plot interesting points in red again
    plt.show()

    We can see that the 3 red dots are pretty close to each other, which is a good thing.

    The same happened in iteration 14, where the algorithm merged index 41 with the cluster containing 15 and 69:

    idxs = [33, 68, 62]
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1])
    plt.scatter(X[idxs,0], X[idxs,1], c='r')
    idxs = [15, 69, 41]
    plt.scatter(X[idxs,0], X[idxs,1], c='y')
    plt.show()

    Showing that the 3 yellow dots are also quite close.

    And so on...

    We'll later come back to visualizing this, but now let's have a look at what's called a dendrogram of this hierarchical clustering first:

    Plotting a Dendrogram

    A dendrogram is a visualization in form of a tree showing the order and distances of merges during the hierarchical clustering.

    # calculate full dendrogram
    plt.figure(figsize=(25, 10))
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(
        Z,
        leaf_rotation=90.,  # rotates the x axis labels
        leaf_font_size=8.,  # font size for the x axis labels
    )
    plt.show()


    If this is the first time you see a dendrogram, it's probably quite confusing, so let's take this apart...

    • On the x axis you see labels. If you don't specify anything else they are the indices of your samples in X.
    • On the y axis you see the distances (of the 'ward' method in our case).

    Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster. You can find that other cluster by following the other vertical line down again. If you don't encounter another horizontal line, it was just merged with the other label you reach, otherwise it was merged into another cluster that was formed earlier.

    Summarizing:

    • horizontal lines are cluster merges
    • vertical lines tell you which clusters/labels were part of the merge forming that new cluster
    • heights of the horizontal lines tell you about the distance that needed to be "bridged" to form the new cluster

    You can also see that from distances > 25 up there's a huge jump of the distance to the final merge at a distance of approx. 180. Let's have a look at the distances of the last 4 merges:

    print(Z[-4:,2])
    array([  15.11533,   17.11527,   23.12199,  180.27043])

    Such distance jumps / gaps in the dendrogram are pretty interesting for us. They indicate that something is merged here that maybe shouldn't be merged. In other words: maybe the things that were merged here really don't belong to the same cluster, telling us that maybe there are just 2 clusters here.

    Looking at indices in the above dendrogram also shows us that the green cluster only has indices >= 100, while the red one only has such < 100. This is a good thing as it shows that the algorithm re-discovered the two classes in our toy example.

    In case you're wondering where the colors come from, you might want to have a look at the color_threshold argument of dendrogram(), which, as it was not specified, automatically picked a distance cut-off value of 70% of the final merge and then colored the first clusters below that cut-off.
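    For example (this call is not in the original post), you could make that default explicit, or pick your own threshold:

    # equivalent to the default colouring: cut at 70% of the largest merge distance
    dendrogram(Z, color_threshold=0.7 * max(Z[:, 2]))
    plt.show()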

    Dendrogram Truncation

    As you might have noticed, the above is pretty big for 150 samples already and you probably have way more in real scenarios, so let me spend a few seconds on highlighting some other features of the dendrogram() function:

    plt.title('Hierarchical Clustering Dendrogram (truncated)')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(
        Z,
        truncate_mode='lastp',  # show only the last p merged clusters
        p=12,  # show only the last p merged clusters
        show_leaf_counts=False,  # otherwise numbers in brackets are counts
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,  # to get a distribution impression in truncated branches
    )
    plt.show()

    The above shows a truncated dendrogram, which only shows the last p=12 out of our 149 merges.

    First thing you should notice are that most labels are missing. This is because except for X[40] all other samples were already merged into clusters before the last 12 merges.

    The parameter show_contracted allows us to draw black dots at the heights of those previous cluster merges, so we can still spot gaps even if we don't want to clutter the whole visualization. In our example we can see that the dots are all at pretty small distances when compared to the huge last merge at a distance of 180, telling us that we probably didn't miss much there.

    As it's kind of hard to keep track of the cluster sizes just by the dots, dendrogram() will by default also print the cluster sizes in brackets () if a cluster was truncated:

    plt.title('Hierarchical Clustering Dendrogram (truncated)')
    plt.xlabel('sample index or (cluster size)')
    plt.ylabel('distance')
    dendrogram(
        Z,
        truncate_mode='lastp',  # show only the last p merged clusters
        p=12,  # show only the last p merged clusters
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,  # to get a distribution impression in truncated branches
    )
    plt.show()

    Eye Candy

    Even though this already makes for quite a nice visualization, we can pimp it even more by also annotating the distances inside the dendrogram by using some of the useful return values dendrogram():

    def fancy_dendrogram(*args, **kwargs):
        max_d = kwargs.pop('max_d', None)
        if max_d and 'color_threshold' not in kwargs:
            kwargs['color_threshold'] = max_d
        annotate_above = kwargs.pop('annotate_above', 0)
    
        ddata = dendrogram(*args, **kwargs)
    
        if not kwargs.get('no_plot', False):
            plt.title('Hierarchical Clustering Dendrogram (truncated)')
            plt.xlabel('sample index or (cluster size)')
            plt.ylabel('distance')
            for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
                x = 0.5 * sum(i[1:3])
                y = d[1]
                if y > annotate_above:
                    plt.plot(x, y, 'o', c=c)
                    plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                                 textcoords='offset points',
                                 va='top', ha='center')
            if max_d:
                plt.axhline(y=max_d, c='k')
        return ddata
    
    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10,  # useful in small plots so annotations don't overlap
    )
    plt.show()

    Selecting a Distance Cut-Off aka Determining the Number of Clusters

    As explained above already, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If you have the chance to do this manually, i'd always opt for that, as it allows you to gain some insights into your data and to perform some sanity checks on the edge cases (situations at extreme parameter values). In our case i'd probably just say that our cut-off is 50, as the jump is pretty obvious. Let's visualize this in the dendrogram as a cut-off line:

    # set cut-off to 50
    max_d = 50  # max_d as in max_distance
    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10,
        max_d=max_d,  # plot a horizontal cut-off line
    )
    plt.show()

    As we can see, we ("surprisingly") have two clusters at this cut-off.

    In general, for a chosen cut-off value max_d you can always simply count the number of intersections of a horizontal line at that height with the vertical lines of the dendrogram to get the number of formed clusters. Say we choose a cut-off of max_d = 16, we'd get 4 final clusters (see the call below):
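    The corresponding call (it also appears in the full code listing at the end) is simply the same fancy_dendrogram invocation with max_d=16:

    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10,
        max_d=16,
    )
    plt.show()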


    Automated Cut-Off Selection (or why you shouldn't rely on this)

    Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which you want to automate this.

    The problem again is that there is no golden method to pick the number of clusters for all cases (which is why i think the investigative & backtesting (testing a predictive model using existing historic data) manual method is preferable). Wikipedia lists a couple of common methods. Reading this, you should realize how different the approaches and how vague their descriptions are.

    I honestly think it's a really bad idea to just use any of those methods, unless you know the data you're working on really really well.

    Inconsistency Method

    For example, let's have a look at the "inconsistency" method, which seems to be one of the defaults for the fcluster() function in scipy.

    The question driving the inconsistency method is "what makes a distance jump a jump?". It answers this by comparing each cluster merge's height h to the average avg and normalizing it by the standard deviation std formed over the depth previous levels of the tree:

    $$inconsistency = \frac{h - avg}{std}$$

    The following shows a matrix with the avg, std, count, inconsistency values for each of the last 10 merges of our hierarchical clustering, with depth = 5:

    from scipy.cluster.hierarchy import inconsistent
    depth = 5
    incons = inconsistent(Z, depth)
    print(incons[-10:])
    
    array([[  1.80875,   2.17062,  10.     ,   2.44277],
           [  2.31732,   2.19649,  16.     ,   2.52742],
           [  2.24512,   2.44225,   9.     ,   2.37659],
           [  2.30462,   2.44191,  21.     ,   2.63875],
           [  2.20673,   2.68378,  17.     ,   2.84582],
           [  1.95309,   2.581  ,  29.     ,   4.05821],
           [  3.46173,   3.53736,  28.     ,   3.29444],
           [  3.15857,   3.54836,  28.     ,   3.93328],
           [  4.9021 ,   5.10302,  28.     ,   3.57042],
           [ 12.122  ,  32.15468,  30.     ,   5.22936]])

    Now you might be tempted to say "yay, let's just pick 5" as a limit in the inconsistencies, but look at what happens if we set depth to 3 instead:

    depth = 3
    incons = inconsistent(Z, depth)
    print(incons[-10:])
    
    array([[  3.63778,   2.55561,   4.     ,   1.35908],
           [  3.89767,   2.57216,   7.     ,   1.54388],
           [  3.05886,   2.66707,   6.     ,   1.87115],
           [  4.92746,   2.7326 ,   7.     ,   1.39822],
           [  4.76943,   3.16277,   6.     ,   1.60456],
           [  5.27288,   3.56605,   7.     ,   2.00627],
           [  8.22057,   4.07583,   7.     ,   1.69162],
           [  7.83287,   4.46681,   7.     ,   2.07808],
           [ 11.38091,   6.2943 ,   7.     ,   1.86535],
           [ 37.25845,  63.31539,   7.     ,   2.25872]])
    Oops! This should make you realize that the inconsistency values heavily depend on the depth of the tree you calculate the averages over.

    Another problem in its calculation is that the previous d levels' heights aren't normally distributed, but expected to increase, so you can't really just treat the current level as an "outlier" (an observation far away from the others) of a normal distribution, as it's expected to be bigger.

    Elbow Method

    Another thing you might see out there is a variant of the "elbow method". It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the blue line graph below, which is the highest value of the green graph below):

    last = Z[-10:, 2]                      # distances of the last 10 merges
    last_rev = last[::-1]
    idxs = np.arange(1, len(last) + 1)
    plt.plot(idxs, last_rev)               # blue line: merge distances
    acceleration = np.diff(last, 2)        # 2nd derivative of the distances
    acceleration_rev = acceleration[::-1]
    plt.plot(idxs[:-2] + 1, acceleration_rev)
    plt.show()
    k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
    print("clusters:", k)

    clusters: 2

    While this works nicely in our simplistic example (the green line takes its maximum for k=2), it's pretty flawed as well.

    One issue of this method has to do with the way an "elbow" is defined: you need at least a point to the right and to the left, which implies that this method will never be able to tell you that all your data is in a single cluster.

    Another problem with this variant lies in the np.diff(Z[:, 2], 2) though. The order of the distances in Z[:, 2] doesn't properly reflect the order of merges within one branch of the tree. In other words: there is no guarantee that the distance in Z[i] belongs to the same branch as the one in Z[i+1]. By simply computing np.diff(Z[:, 2], 2) we assume that this doesn't matter and just compare jump distances of our merge tree that come from different branches.

    If you still don't want to believe this, let's just construct another simplistic example, but this time with very different variances in the different clusters:

    c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
    d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
    e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
    X2 = np.concatenate((X, c, d, e),)
    plt.scatter(X2[:,0], X2[:,1])
    plt.show()

    As you can see we have 5 clusters now, but they have increasing variances... let's have a look at the dendrogram again and how you can use it to spot the problem:

    Z2 = linkage(X2, 'ward')
    plt.figure(figsize=(10,10))
    fancy_dendrogram(
        Z2,
        truncate_mode='lastp',
        p=30,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=40,
        max_d=170,
    )
    plt.show()

    When looking at a dendrogram like this and trying to put a cut-off line somewhere, you should notice the very different distributions of merge distances below that cut-off line. Compare the distribution in the cyan cluster to the red, green or even the two blue clusters that have even been truncated away. In the cyan cluster below the cut-off we don't really have any discontinuity of merge distances up to very close to the cut-off line. The two blue clusters on the other hand are each merged below a distance of 25, and have a gap of > 155 to our cut-off line.

    The variant of the "elbow" method will incorrectly see the jump from 167 to 180 as minimal and tell us we have 4 clusters:

    last = Z2[-10:, 2]
    last_rev = last[::-1]
    idxs = np.arange(1, len(last) + 1)
    plt.plot(idxs, last_rev)
    acceleration = np.diff(last, 2)  # 2nd derivative of the distances
    acceleration_rev = acceleration[::-1]
    plt.plot(idxs[:-2] + 1, acceleration_rev)
    plt.show()
    k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
    print("clusters:", k)

    clusters: 4
    

    The same happens with the inconsistency metric:

    print(inconsistent(Z2, 5)[-10:])
    [[  13.99222   15.56656   30.         3.86585]
     [  16.73941   18.5639    30.         3.45983]
     [  19.05945   20.53211   31.         3.49953]
     [  19.25574   20.82658   29.         3.51907]
     [  21.36116   26.7766    30.         4.50256]
     [  36.58101   37.08602   31.         3.50761]
     [  12.122     32.15468   30.         5.22936]
     [  42.6137   111.38577   31.         5.13038]
     [  81.75199  208.31582   31.         5.30448]
     [ 147.25602  307.95701   31.         3.6215 ]]

    I hope you can now understand why i'm warning against blindly using any of those methods on a dataset you know nothing about. They can give you some indication, but you should always go back in and check if the results make sense, for example with a dendrogram which is a great tool for that (especially if you have higher dimensional data that you can't simply visualize anymore).

    Retrieve the Clusters

    Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.

    Knowing max_d:

    Let's say we determined the max distance with help of a dendrogram, then we can do the following to get the cluster id for each of our samples:

    from scipy.cluster.hierarchy import fcluster
    max_d = 50
    clusters = fcluster(Z, max_d, criterion='distance')
    print(clusters)
    
    array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

    Knowing k:

    Another way starting from the dendrogram is to say "i can see i have k=2" clusters. You can then use:

    k=2
    print(fcluster(Z, k, criterion='maxclust'))
    
    array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

    Using the Inconsistency Method (default):

    If you're really sure you want to use the inconsistency method to determine the number of clusters in your dataset, you can use the default criterion of fcluster() and hope you picked the correct values:

    from scipy.cluster.hierarchy import fcluster
    print(fcluster(Z, 8, depth=10))
    
    array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

    Visualizing Your Clusters

    If you're lucky enough and your data is very low dimensional, you can actually visualize the resulting clusters very easily:

    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism')  # plot points with cluster dependent colors
    plt.show()

    Further Reading:

    【完整代码】

    # needed imports
    from matplotlib import pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    import numpy as np
    # some setting for this notebook to actually show the graphs inline in the notebook, rather than in a new window.
    from IPython import get_ipython
    get_ipython().run_line_magic('matplotlib', 'inline')
    np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation
    
    # 生成两个cluster: a有100点, b有50:
    np.random.seed(4711)  # for repeatability of this tutorial
    a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
    b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
    X = np.concatenate((a, b),)
    print(X.shape)  # 150 samples with 2 dimensions
    plt.scatter(X[:,0], X[:,1])
    plt.show()
    
    # generate the linkage matrix
    Z = linkage(X, 'ward')
    
    from scipy.cluster.hierarchy import cophenet
    from scipy.spatial.distance import pdist
    c, coph_dists = cophenet(Z, pdist(X))
    print(c)
    
    print(Z[0])
    print(Z[:20])
    print(X[[33, 68, 62]])
    idxs = [33, 68, 62]
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1])  # plot all points
    plt.scatter(X[idxs,0], X[idxs,1], c='r')  # plot interesting points in red again
    plt.show()
    
    idxs = [33, 68, 62]
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1])
    plt.scatter(X[idxs,0], X[idxs,1], c='r')
    idxs = [15, 69, 41]
    plt.scatter(X[idxs,0], X[idxs,1], c='y')
    plt.show()
    
    # calculate full dendrogram
    plt.figure(figsize=(25, 10))
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(
        Z,
        leaf_rotation=90.,  # rotates the x axis labels
        leaf_font_size=8.,  # font size for the x axis labels
    )
    plt.show()
    
    print(Z[-4:,2])
    
    plt.title('Hierarchical Clustering Dendrogram (truncated)')
    plt.xlabel('sample index')
    plt.ylabel('distance')
    dendrogram(
        Z,
        truncate_mode='lastp',  # show only the last p merged clusters
        p=12,  # show only the last p merged clusters
        show_leaf_counts=False,  # otherwise numbers in brackets are counts
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,  # to get a distribution impression in truncated branches
    )
    plt.show()
    
    plt.title('Hierarchical Clustering Dendrogram (truncated)')
    plt.xlabel('sample index or (cluster size)')
    plt.ylabel('distance')
    dendrogram(
        Z,
        truncate_mode='lastp',  # show only the last p merged clusters
        p=12,  # show only the last p merged clusters
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,  # to get a distribution impression in truncated branches
    )
    plt.show()
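
    # NOTE: the truncated-dendrogram plots below use fancy_dendrogram, a helper
    # defined earlier in the tutorial that wraps scipy's dendrogram to annotate
    # merge distances and draw a horizontal cut-off line. A sketch of that helper,
    # following the original walkthrough, so this listing runs on its own:
    def fancy_dendrogram(*args, **kwargs):
        max_d = kwargs.pop('max_d', None)
        if max_d and 'color_threshold' not in kwargs:
            kwargs['color_threshold'] = max_d
        annotate_above = kwargs.pop('annotate_above', 0)

        ddata = dendrogram(*args, **kwargs)

        if not kwargs.get('no_plot', False):
            plt.title('Hierarchical Clustering Dendrogram (truncated)')
            plt.xlabel('sample index or (cluster size)')
            plt.ylabel('distance')
            # annotate the height (merge distance) of sufficiently high merges
            for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
                x = 0.5 * sum(i[1:3])
                y = d[1]
                if y > annotate_above:
                    plt.plot(x, y, 'o', c=c)
                    plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                                 textcoords='offset points',
                                 va='top', ha='center')
            if max_d:
                plt.axhline(y=max_d, c='k')  # horizontal cut-off line
        return ddata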
    
    # set cut-off to 50
    max_d = 50  # max_d as in max_distance
    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10,
        max_d=max_d,  # plot a horizontal cut-off line
    )
    plt.show()
    
    fancy_dendrogram(
        Z,
        truncate_mode='lastp',
        p=12,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=10,
        max_d=16,
    )
    plt.show()
    
    from scipy.cluster.hierarchy import inconsistent
    depth = 5
    incons = inconsistent(Z, depth)
    print(incons[-10:])
    
    depth = 3
    incons = inconsistent(Z, depth)
    print(incons[-10:])
    
    last = Z[-10:, 2]
    last_rev = last[::-1]
    idxs = np.arange(1, len(last) + 1)
    plt.plot(idxs, last_rev)
    
    acceleration = np.diff(last, 2)  # 2nd derivative of the distances
    acceleration_rev = acceleration[::-1]
    plt.plot(idxs[:-2] + 1, acceleration_rev)
    plt.show()
    k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
    print("clusters:", k)
    
    c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
    d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
    e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
    X2 = np.concatenate((X, c, d, e),)
    plt.scatter(X2[:,0], X2[:,1])
    plt.show()
    
    Z2 = linkage(X2, 'ward')
    plt.figure(figsize=(10,10))
    fancy_dendrogram(
        Z2,
        truncate_mode='lastp',
        p=30,
        leaf_rotation=90.,
        leaf_font_size=12.,
        show_contracted=True,
        annotate_above=40,
        max_d=170,
    )
    plt.show()
    
    last = Z2[-10:, 2]
    last_rev = last[::-1]
    idxs = np.arange(1, len(last) + 1)
    plt.plot(idxs, last_rev)
    acceleration = np.diff(last, 2)  # 2nd derivative of the distances
    acceleration_rev = acceleration[::-1]
    plt.plot(idxs[:-2] + 1, acceleration_rev)
    plt.show()
    k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
    print("clusters:", k)
    
    print(inconsistent(Z2, 5)[-10:])
    
    from scipy.cluster.hierarchy import fcluster
    max_d = 50
    clusters = fcluster(Z, max_d, criterion='distance')
    print(clusters)
    
    k=2
    print(fcluster(Z, k, criterion='maxclust'))
    
    from scipy.cluster.hierarchy import fcluster
    print(fcluster(Z, 8, depth=10))
    
    plt.figure(figsize=(10, 8))
    plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism')  # plot points with cluster dependent colors
    plt.show()
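
    # As a small extension beyond the original tutorial, the same cophenetic
    # correlation check can be used to compare linkage methods on this data
    # (higher = dendrogram distances closer to the original pairwise distances):
    for method in ('single', 'complete', 'average', 'ward'):
        c, _ = cophenet(linkage(X, method), pdist(X))
        print(method, round(c, 4))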