  • Cosine similarity in Python. What is cosine similarity? Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between the two vectors. Cosine similarity...

    Cosine similarity in Python

    What is cosine similarity?

    Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them.

    Cosine similarity is one of the most widely used and powerful similarity measures in data science. It is used in many applications, such as finding similar documents in NLP, information retrieval, finding similar DNA sequences in bioinformatics, detecting plagiarism, and many more.

    Cosine similarity is calculated as follows:

    [Figure: angle between two 2-D vectors A and B (image by author)]

    [Figure: calculation of the cosine of the angle between A and B (image by author); in symbols, \cos\theta = \frac{A \cdot B}{|A| \cdot |B|}]

    Why does the cosine of the angle between A and B give us the similarity?

    If you look at the cosine function, it is 1 at theta = 0 and -1 at theta = 180. That means cosine is highest for two overlapping vectors and lowest for two exactly opposite vectors, as the quick numeric check below confirms. You can treat 1 - cosine as a distance.

    [Figure: the cosine function (image by author)]
    [Figure: values of cosine at different angles (image by author)]
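
    Since the plots are not reproduced here, the key values are easy to verify directly (a small illustrative check, not part of the original article):

    import numpy as np

    # cosine at 0, 45, 90, 135 and 180 degrees
    print(np.round(np.cos(np.radians([0, 45, 90, 135, 180])), 3))
    # [ 1.     0.707  0.    -0.707 -1.   ]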

    How to calculate it in Python?


    The numerator of the formula is the dot product of the two vectors, and the denominator is the product of the L2 norms of the two vectors. The dot product of two vectors is the sum of the element-wise products of the vectors, and the L2 norm of a vector is the square root of the sum of the squares of its elements.

    We can either use the built-in functions in the NumPy library to calculate the dot product and L2 norms of the vectors and plug them into the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Consider two vectors A and B in 2-D; the following code calculates the cosine similarity:

    import numpy as np
    import matplotlib.pyplot as plt

    # consider two vectors A and B in 2-D
    A = np.array([7, 3])
    B = np.array([3, 7])

    # draw the two vectors as arrows from the origin
    ax = plt.axes()
    ax.arrow(0.0, 0.0, A[0], A[1], head_width=0.4, head_length=0.5)
    plt.annotate(f"A({A[0]},{A[1]})", xy=(A[0], A[1]), xytext=(A[0] + 0.5, A[1]))
    ax.arrow(0.0, 0.0, B[0], B[1], head_width=0.4, head_length=0.5)
    plt.annotate(f"B({B[0]},{B[1]})", xy=(B[0], B[1]), xytext=(B[0] + 0.5, B[1]))
    plt.xlim(0, 10)
    plt.ylim(0, 10)
    plt.show()
    plt.close()

    # cosine similarity between A and B
    cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
    print(f"Cosine Similarity between A and B: {cos_sim}")    # 42/58 ~= 0.7241
    print(f"Cosine Distance between A and B: {1 - cos_sim}")  # ~= 0.2759
    [Figure: code output (image by author)]
    # using sklearn to calculate cosine similarity
    from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

    cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
    print(f"Cosine Similarity between A and B: {cos_sim}")
    print(f"Cosine Distance between A and B: {1 - cos_sim}")
    [Figure: code output (image by author)]
    # using scipy; distance.cosine computes 1 - cosine similarity
    from scipy.spatial import distance

    distance.cosine(A, B)  # scipy expects 1-D vectors, so no reshape is needed
    [Figure: code output (image by author)]

    Proof of the formula


    The cosine similarity formula can be proved using the law of cosines.

    [Figure: law of cosines (image by author)]

    Consider two vectors A and B in two dimensions, such as:

    [Figure: two 2-D vectors (image by author)]

    Using Law of cosines,


    [Figure: cosine similarity derived via the law of cosines (image by author)]
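
    Since the image is not reproduced here, the derivation it walks through can be written out (a standard reconstruction): apply the law of cosines to the triangle whose sides are A, B, and A - B, then expand the squared norm algebraically.

    \|A-B\|^2 = \|A\|^2 + \|B\|^2 - 2\|A\|\|B\|\cos\theta \qquad \text{(law of cosines)}

    \|A-B\|^2 = (A-B)\cdot(A-B) = \|A\|^2 - 2\,A\cdot B + \|B\|^2 \qquad \text{(expanding the norm)}

    Equating the two right-hand sides and cancelling gives \cos\theta = \dfrac{A\cdot B}{\|A\|\,\|B\|}.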

    You can prove the same in three dimensions, or in any number of dimensions in general; the proof follows exactly the same steps as above.

    Summary


    We saw how cosine similarity works, how to use it, and why it works. I hope this article helped in understanding the whole concept behind this powerful metric.

    Translated from: https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db


  • :return: the cosine similarity of the two vectors """ dist1 = float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))) return dist1 v1,v2 = get_word_vector('string1','string2') a=cos_dist(v1,v2)...
  • # -*- coding:utf8 -*- from math import sqrt users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Va
    # -*- coding:utf8 -*-
    
    from math import sqrt


    users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
             "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
             "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
             "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
             "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
             "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
             "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
             "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
            }






    def manhattan(rating1, rating2):
        """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
           of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
        distance = 0
        commonRatings = False 
        for key in rating1:
            if key in rating2:
                distance += abs(rating1[key] - rating2[key])
                commonRatings = True
        if commonRatings:
            return distance
        else:
            return -1 #Indicates no ratings in common
    # Euclidean distance
    def euclidean(rating1, rating2):
        """Computes the Euclidean distance. Both rating1 and rating2 are dictionaries
            of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
        distance = 0
        commonRatings = False
        for key in rating1:
            if key in rating2:
                distance += (rating1[key] - rating2[key]) ** 2
                commonRatings = True
        if commonRatings:
            return sqrt(distance)  # square root of the sum of squared differences
        else:
            return -1  # indicates no ratings in common




    # Minkowski distance (r = 1 gives Manhattan, r = 2 gives Euclidean)
    def minkowski(rating1, rating2, r):
        distance = 0
        commonRatings = False
        for key in rating1:
            if key in rating2:
                distance += pow(abs(rating1[key] - rating2[key]), r)
                commonRatings = True
        if commonRatings:
            return pow(distance, 1 / r)
        else:
            return -1  # indicates no ratings in common
    def computeNearestNeighbor(username, users):
        """creates a sorted list of users based on their distance to username"""
        distances = []
        for user in users:
            if user != username:
                distance = minkowski(users[user], users[username],3)
                distances.append((distance, user))
        # sort based on distance -- closest first
        distances.sort()
        return distances


    def recommend(username, users):
        """Give list of recommendations"""
        # first find nearest neighbor
        nearest = computeNearestNeighbor(username, users)[0][1]
        print(nearest)


        recommendations = []
        # now find bands neighbor rated that user didn't
        neighborRatings = users[nearest]
        userRatings = users[username]
        for artist in neighborRatings:
            if not artist in userRatings:
                recommendations.append((artist, neighborRatings[artist]))
        # using the fn sorted for variety - sort is more efficient
        return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)
     
    # examples - uncomment to run


    #print( recommend('Hailey', users))
    def pearson(rating1,rating2):
        sum_xy=0
        sum_x=0
        sum_y=0
        sum_x2=0
        sum_y2=0
        n=0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x*y
                sum_x += x
                sum_y += y
                sum_x2 += x**2
                sum_y2 += y**2
        if n == 0:
            return 0  # no ratings in common
        denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator
    def cos_like(rating1, rating2):
        """Computes the cosine similarity between two rating dictionaries."""
        innerProd = 0
        vector_x = 0
        vector_y = 0
        for key in rating1:
            if key in rating2:
                x = rating1[key]
                y = rating2[key]
                innerProd += x * y
                vector_x += x ** 2
                vector_y += y ** 2
        if sqrt(vector_x) * sqrt(vector_y) == 0:
            return 0
        else:
            return innerProd / (sqrt(vector_x) * sqrt(vector_y))
    print(cos_like(users['Angelica'], users['Bill']))
    print(pearson(users['Angelica'], users['Bill']))
    for rec in recommend('Veronica', users):
        print(rec)
  • Introduction to cosine similarity: cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them. The two vectors can be pictured as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. Between the two segments...

    1. Introduction to cosine similarity

    Cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them. The two vectors can be pictured as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle: if the angle is 0 degrees, the directions are the same and the segments coincide; if the angle is 90 degrees, they form a right angle and the directions are completely dissimilar; if the angle is 180 degrees, the directions are exactly opposite. The size of the angle therefore tells us how similar the vectors are: the smaller the angle, the more similar they are.

    For n-dimensional vectors A = [A_1, A_2, ..., A_n] and B = [B_1, B_2, ..., B_n], the cosine of the angle θ between A and B is: \cos\theta = \frac{A \cdot B}{|A| \cdot |B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}

    The cosine ranges over [-1, 1]. The closer the value is to 1, the closer the directions of the two vectors; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.

    In practice, similarity is usually normalized to the [0, 1] interval, so the cosine similarity is often reported as cosine_similarity = 0.5 cosθ + 0.5, which maps a cosine of -1 to 0 and a cosine of 1 to 1.

    2. The difference between cosine similarity and Euclidean distance

    Euclidean distance measures the absolute distance between points in space and depends directly on the coordinates of those points, whereas cosine distance measures the angle between vectors, reflecting a difference in direction rather than in position.

    Cosine distance uses the cosine of the angle between two vectors to measure how much two individuals differ. Compared with Euclidean distance, it pays more attention to the difference in direction between the two vectors.

    Because they compute and measure different things, the two metrics suit different data-analysis models:

    1. Euclidean distance reflects the absolute difference in individual numeric features, so it is preferred where the analysis must capture differences in magnitude, such as measuring the similarity or difference of user value from user-behavior metrics.

    2. Cosine distance separates differences in direction and is insensitive to absolute magnitudes. It is better suited to measuring the similarity or difference of users' interests from their content ratings, and it also corrects for users possibly applying the rating scale inconsistently (precisely because it is insensitive to absolute values). A small numeric contrast is sketched below.
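
    As a quick illustration of the contrast (a minimal sketch, not part of the original text): two vectors that point the same way but differ in magnitude have cosine similarity 1 even though their Euclidean distance is large.

    import numpy as np

    a = np.array([1.0, 2.0])
    b = np.array([10.0, 20.0])  # same direction, ten times the magnitude

    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    euclidean = np.linalg.norm(a - b)

    print(cos_sim)    # 1.0 -- identical direction
    print(euclidean)  # ~20.12 -- far apart in absolute position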

    2.1 Adjusted cosine similarity:

    Precisely because cosine similarity is insensitive to magnitudes, the following situation can arise:

    Users rate content on a 5-point scale. Users X and Y rate two items (1, 2) and (4, 5) respectively. Cosine similarity gives 0.98, meaning the two users look extremely similar. But judging from the ratings, X does not seem to like the second item while Y clearly does; the insensitivity of cosine similarity to the actual values has distorted the result. Adjusted cosine similarity corrects this flaw by subtracting a mean from the value in every dimension. For example, if the mean rating of both X and Y is 3, the adjusted vectors are (-2, -1) and (1, 2), and recomputing the cosine similarity gives -0.8: a negative value and a sizeable difference, which clearly matches reality much better. A sketch of this computation follows.
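
    A minimal sketch of the adjusted computation (assuming, as in the text, that both users' mean rating is 3):

    import numpy as np

    def raw_cosine(u, v):
        # plain cosine similarity between two 1-D arrays
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    x = np.array([1.0, 2.0])
    y = np.array([4.0, 5.0])

    print(round(raw_cosine(x, y), 2))          # 0.98 -- looks very similar
    print(round(raw_cosine(x - 3, y - 3), 2))  # -0.8 -- adjusted: clearly different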

    3. Python implementations of cosine similarity

    Method 1:

    def cosine_similarity(x, y, dim=256):
        """Pure-Python cosine similarity over the first `dim` components, normalized to [0, 1]."""
        xx = 0.0
        yy = 0.0
        xy = 0.0
        for i in range(dim):
            xx += x[i] * x[i]
            yy += y[i] * y[i]
            xy += x[i] * y[i]
        xx_sqrt = xx ** 0.5
        yy_sqrt = yy ** 0.5
        cos = xy / (xx_sqrt * yy_sqrt) * 0.5 + 0.5  # 0.5*cos + 0.5 maps [-1, 1] onto [0, 1]
        return cos

    Method 2:

    import numpy as np

    def cosine_similarity(x, y):
        # dot product divided by the product of the L2 norms; returns the raw cosine in [-1, 1]
        num = x.dot(y.T)
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return num / denom

    Method 3:

    import numpy as np

    def cosine_similarity(x, y, norm=False):
        """Cosine similarity of two equal-length lists; optionally normalized to [0, 1]."""
        assert len(x) == len(y), "len(x) != len(y)"
        zero_list = [0] * len(x)
        if x == zero_list or y == zero_list:
            return float(1) if x == y else float(0)

        # per-component products: [x*y, x*x, y*y]
        res = np.array([[x[i] * y[i], x[i] * x[i], y[i] * y[i]] for i in range(len(x))])
        cos = sum(res[:, 0]) / (np.sqrt(sum(res[:, 1])) * np.sqrt(sum(res[:, 2])))

        return 0.5 * cos + 0.5 if norm else cos
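
    A quick check of Method 3 (the last definition above; all three methods share the same function name, so only the most recent definition is live if the snippets are pasted into one file) against the 2-D example A = (7, 3), B = (3, 7) used earlier on this page:

    print(cosine_similarity([7, 3], [3, 7]))             # ~0.7241, the raw cosine
    print(cosine_similarity([7, 3], [3, 7], norm=True))  # ~0.8621, mapped to [0, 1]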
  • I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine simila...

    I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

    The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine similarity.

    Is there any program that can do this? Or should I start writing this from scratch?

    Solution

    Check out the NLTK package: http://www.nltk.org. It has everything you need.

    For cosine similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

    For ngrams:

    from itertools import chain

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        A utility that produces a sequence of ngrams from a sequence of items.
        For example:

        >>> ngrams([1,2,3,4,5], 3)
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

        Use ingram for an iterator version of this function. Set pad_left
        or pad_right to true in order to get additional ngrams:

        >>> ngrams([1,2,3,4,5], 2, pad_right=True)
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

        @param sequence: the source data to be converted into ngrams
        @type sequence: C{sequence} or C{iterator}
        @param n: the degree of the ngrams
        @type n: C{int}
        @param pad_left: whether the ngrams should be left-padded
        @type pad_left: C{boolean}
        @param pad_right: whether the ngrams should be right-padded
        @type pad_right: C{boolean}
        @param pad_symbol: the symbol to use for padding (default is None)
        @type pad_symbol: C{any}
        @return: The ngrams
        @rtype: C{list} of C{tuple}s
        """
        if pad_left:
            sequence = chain((pad_symbol,) * (n - 1), sequence)
        if pad_right:
            sequence = chain(sequence, (pad_symbol,) * (n - 1))
        sequence = list(sequence)
        count = max(0, len(sequence) - n + 1)
        return [tuple(sequence[i:i + n]) for i in range(count)]

    For tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK, using FreqDist.

    If you like PyLucene, this shows how to compute tf-idf:

    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in xrange(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x ** 2 for x in rec.values()]) ** 0.5
                for k, v in rec.items():
                    rec[k] = v / denom
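
    If you would rather stay in pure Python, a minimal tf-idf plus cosine-similarity sketch might look like the following (an illustration, not from the original answer; NLTK's FreqDist is essentially a counter, so collections.Counter plays the same role here):

    import math
    from collections import Counter

    def tfidf_vectors(tokenized_docs):
        """Build one sparse tf-idf dict per document from lists of tokens."""
        n_docs = len(tokenized_docs)
        # document frequency: in how many documents each term appears
        df = Counter(term for doc in tokenized_docs for term in set(doc))
        vectors = []
        for doc in tokenized_docs:
            tf = Counter(doc)  # raw term frequency within this document
            vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse dict vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["dogs", "ran", "fast"]]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]))  # > 0: the two cat documents share terms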

  • Cosine similarity Python code: C holds the numerator's data and N holds the denominator's data. If the code below looks hard to follow, the Java version may be easier to understand. ## ItemCF - cosine algorithm import math def ItemSimilarity_cos(train...
  • Contents: NLP series 31, text-similarity algorithms, cosine similarity, Python implementation, summary. In natural language processing we often need to decide whether two things are similar. For example, for trending-topic recommendation on Weibo, we...
  • Spark cosine similarity: a script that takes a matrix as input and computes the cosine similarity of each vector in the matrix against the others. Example: *add test dataset (dataset.txt) into hadoop hdfs. An excerpt from the dataset: "16",45,12,7,2,2,2,2,4,7,7 "28",1,1,1...
  • Cosine similarity between two vectors. Prerequisite: defining a vector using a list; defining a vector using Numpy. Cosine similarity is a metric used to measure how similar ...
  • Computing cosine similarity in Python

    2020-08-09 10:51:39
    Cosine similarity is commonly used in applications such as text classification and image classification to compute the similarity between two texts or two images. This post mainly introduces computing the cosine similarity of two vectors in Python.
  • I need to compute the cosine similarity function across a very big set. This set represents users and each user as an array of object id. An example below:user_1 = [1,4,6,100,3,1]user_2 = [4,7,8,3,3,2...
  • Cosine similarity in Python

    2021-04-06 17:04:42
    Cosine similarity: the dot product of normalized embedding vectors is exactly the cosine of the angle between them >>> a = np.array([255,255,33,33,33,40]) >>> b = np.array([255,255,255,33,33,40]) >>> c = np.array([255,...
  • from sklearn.metrics.pairwise import cosine_similarity as cosine results = [] for i in range(Array1.shape[0]): results.append(numpy.max(cosine(Array1[None,i,:], Array2))) Solution: Iterating in Python ...
  • Cosine similarity computation for articles, implemented in Python 3
  • Computing cosine similarity in Python

    2020-12-21 10:27:51
    Cosine similarity is commonly used in applications such as text classification and image classification to compute the similarity between two texts or two images. Cosine similarity takes values between -1 and 1: the closer the cosine is to 1, the more similar the two vectors, reaching exactly 1 when they are identical; exactly opposite vectors give -1; orthogonal or ...
  • Here is an example of computing cosine similarity and the Pearson correlation coefficient in Python. It makes a good reference, and I hope it is helpful; come and have a look.
  • Computing cosine similarity: with the background above, we can turn every tokenized, stopword-filtered document into a document vector and compute the weight of each term. Since every document vector has the same dimensionality, we can compare the similarity of two documents by computing the ...
  • Background: when measuring similarity, the cosine of the angle between two vectors is often used. Cosine similarity takes values in [-1, 1]: the cosine reaches its maximum of 1 when the two vectors point the same way, its minimum of -1 when they point in exactly opposite directions, and 0 when the two directions are orthogonal...
  • Cosine similarity and Euclidean distance. Photo by Markus Winkler on Unsplash. This is a quick and straight to the point introduction to Euclidean distance and cosine similarity with a ...
  • This post first walks through the theory behind the vector space model (VSM) and cosine similarity, then explains it using the example from Ruan Yifeng's well-known article, and finally uses Python to compute the cosine similarity of the Baidu Baike and Hudong Baike infoboxes. Basic steps: 1. count the keywords of each of the two documents; 2. the two articles' ...
  • Say the input matrix is: A= [0 1 0 0 1 0 0 1 1 1 1 1 0 1 0] The sparse representation is: A = 0, 1 0, 4 1, 2 1, 3 1, 4 2, 0 2, 1 2, 3 In Python, it's straightforward to work with the matrix-input ...
  • import numpy as np x1=[1,1] x2=[2,2] x1_np = np.array(x1) x2_np = np.array(x2) # compute directly from the formula dist1 = np.sqrt(np.sum((x1_np-x2_np)**2)) # compute using the built-in norm function dist2 = np.linalg.norm(x1_np,x2_np) ...
  • Text classification with cosine similarity: in mathematics the cosine similarity formula is cos(a,b) = a·b/(|a|·|b|), while on text the cosine similarity is usually computed as: (number of terms texts a and b have in common) / (number of terms in text a + number of terms in text b) ...
  • Recommending similar texts based on TF-IDF and cosine similarity. Text-similarity algorithms are mainly applied in scenarios such as text clustering and similar-text recommendation. Design notes: use jieba for word segmentation with a custom dictionary; use TF-IDF to find each article's keywords; take several keywords from each article...
