
Cosine Similarity in Python

What is cosine similarity?
Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between the two vectors.
Cosine similarity is one of the most widely used and powerful similarity measures in data science. It appears in many applications, such as finding similar documents in NLP, information retrieval, finding DNA sequences similar to a given one in bioinformatics, plagiarism detection, and many more.
Cosine similarity is calculated as follows,
Angle between two 2-D vectors A and B (Image by author)

calculation of cosine of the angle between A and B
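In symbols, the calculation shown in the figures is the standard cosine formula:

```latex
\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```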

Why does the cosine of the angle between A and B give us the similarity?
If you look at the cosine function, it is 1 at theta = 0 and -1 at theta = 180, which means the cosine is highest for two overlapping vectors and lowest for two exactly opposite vectors. You can therefore use 1 - cosine as a distance.
the cosine function (Image by author)

values of cosine at different angles (Image by author)
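To make the plotted values concrete, here is a minimal check of the three extreme angles (the vectors are chosen purely for illustration):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 0.0])
print(cos_sim(a, np.array([5.0, 0.0])))   # overlapping (0 deg): 1.0
print(cos_sim(a, np.array([0.0, 3.0])))   # orthogonal (90 deg): 0.0
print(cos_sim(a, np.array([-2.0, 0.0])))  # opposite (180 deg): -1.0
```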

How to calculate it in Python?
The numerator of the formula is the dot product of the two vectors, and the denominator is the product of the L2 norms of the two vectors. The dot product of two vectors is the sum of their element-wise products, and the L2 norm of a vector is the square root of the sum of the squares of its elements.
We can either use the built-in functions in the NumPy library to calculate the dot product and L2 norms and plug them into the formula, or directly use cosine_similarity from sklearn.metrics.pairwise. Consider two vectors A and B in 2-D; the following code calculates the cosine similarity:
import numpy as np
import matplotlib.pyplot as plt

# consider two vectors A and B in 2-D
A = np.array([7, 3])
B = np.array([3, 7])

ax = plt.axes()
ax.arrow(0.0, 0.0, A[0], A[1], head_width=0.4, head_length=0.5)
plt.annotate(f"A({A[0]},{A[1]})", xy=(A[0], A[1]), xytext=(A[0] + 0.5, A[1]))
ax.arrow(0.0, 0.0, B[0], B[1], head_width=0.4, head_length=0.5)
plt.annotate(f"B({B[0]},{B[1]})", xy=(B[0], B[1]), xytext=(B[0] + 0.5, B[1]))
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.show()
plt.close()

# cosine similarity between A and B
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"Cosine Similarity between A and B: {cos_sim}")
print(f"Cosine Distance between A and B: {1 - cos_sim}")
Code output (Image by author)

# using sklearn to calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
print(f"Cosine Similarity between A and B: {cos_sim}")
print(f"Cosine Distance between A and B: {1 - cos_sim}")
Code output (Image by author)

# using scipy, which calculates 1 - cosine (note: it expects 1-D arrays, so no reshape)
from scipy.spatial import distance
distance.cosine(A, B)
Code output (Image by author)

Proof of the formula
The cosine similarity formula can be proved using the Law of Cosines.
Law of cosines (Image by author)

Consider two vectors A and B in 2 dimensions, say:
Two 2-D vectors (Image by author)

Using the Law of Cosines,
Cosine similarity using Law of cosines (Image by author)

You can prove the same in 3 dimensions, or in any number of dimensions in general; the steps are exactly the same as above.
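For reference, the algebra behind the figures for 2-D vectors A and B:

```latex
\|A-B\|^2 = \|A\|^2 + \|B\|^2 - 2\|A\|\|B\|\cos\theta \qquad \text{(Law of Cosines)}

\|A-B\|^2 = (A_1-B_1)^2 + (A_2-B_2)^2 = \|A\|^2 + \|B\|^2 - 2(A_1B_1 + A_2B_2)

\Rightarrow\ A \cdot B = \|A\|\,\|B\|\cos\theta
\ \Rightarrow\ \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}
```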
Summary
We saw how cosine similarity works, how to use it, and why it works. I hope this article helped in understanding the whole concept behind this powerful metric.

# -*- coding:utf8 -*-
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
         }

def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are dictionaries
    of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
            commonRatings = True
    if commonRatings:
        return distance
    else:
        return -1  # Indicates no ratings in common
# Euclidean distance
def euclidean(rating1, rating2):
    """Computes the Euclidean distance. Both rating1 and rating2 are dictionaries
    of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += (rating1[key] - rating2[key]) ** 2
            commonRatings = True
    if commonRatings:
        return sqrt(distance)  # take the root once, over the summed squares
    else:
        return -1

# Minkowski distance
def minkowski(rating1, rating2, r):
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += pow(abs(rating1[key] - rating2[key]), r)
            commonRatings = True
    if commonRatings:
        return pow(distance, 1 / r)
    else:
        return -1
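As a sanity check, Manhattan and Euclidean are just the r = 1 and r = 2 cases of the Minkowski distance. A self-contained sketch (the two abridged rating dicts below are made up for illustration):

```python
def minkowski(rating1, rating2, r):
    # Minkowski distance over the keys the two rating dicts share
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** r
            commonRatings = True
    return distance ** (1 / r) if commonRatings else -1

a = {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Phoenix": 5.0}
b = {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Phoenix": 2.0}
print(minkowski(a, b, 1))  # Manhattan: 1.5 + 1.5 + 3.0 = 6.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(1.5**2 + 1.5**2 + 3.0**2)
```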
def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            # manhattan distance used here; any of the metrics above would work
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    # sort based on distance -- closest first
    distances.sort()
    return distances

def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    print(nearest)

    userRatings = users[username]
    recommendations = []
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    for artist in neighborRatings:
        if not artist in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    # using the fn sorted for variety - sort is more efficient
    return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse=True)

# examples - uncomment to run

#print( recommend('Hailey', users))
def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += x ** 2
            sum_y2 += y ** 2
    denominator = sqrt(sum_x2 - (sum_x ** 2) / n) * sqrt(sum_y2 - (sum_y ** 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator
def cos_like(rating1, rating2):
    innerProd = 0
    vector_x = 0
    vector_y = 0
    for key in rating1:
        if key in rating2:
            x = rating1[key]
            y = rating2[key]
            innerProd += x * y
            vector_x += x ** 2
            vector_y += y ** 2
    if sqrt(vector_x) * sqrt(vector_y) == 0:
        return 0
    else:
        return innerProd / (sqrt(vector_x) * sqrt(vector_y))

print(cos_like(users['Angelica'], users['Bill']))
print(pearson(users['Angelica'], users['Bill']))
for item in recommend('Veronica', users):
    print(item)
1. Introduction to cosine similarity
Cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them. You can picture the two vectors as two line segments in space, both starting at the origin ([0, 0, ...]) and pointing in different directions. The segments form an angle: if the angle is 0 degrees, the directions coincide and the segments overlap; 90 degrees means a right angle, with completely dissimilar directions; 180 degrees means exactly opposite directions. The size of the angle therefore measures how similar the vectors are: the smaller the angle, the more similar.
For n-dimensional vectors A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn], the cosine of the angle θ between A and B is:
cos θ = (A1·B1 + A2·B2 + ... + An·Bn) / (√(A1² + A2² + ... + An²) · √(B1² + B2² + ... + Bn²))
The cosine lies in [-1, 1]: the closer to 1, the closer the two directions; the closer to -1, the more opposite; near 0, the two vectors are nearly orthogonal.
Similarity scores are usually normalized to the [0, 1] interval, so cosine similarity is often reported as cosine_similarity = 0.5·cosθ + 0.5.
2. Cosine similarity vs. Euclidean distance
Euclidean distance measures the absolute distance between points in space and depends directly on each point's coordinates; cosine distance measures the angle between vectors, so it reflects differences in direction rather than position.
Cosine distance uses the cosine of the angle between two vectors to measure how different two individuals are. Compared with Euclidean distance, it emphasizes differences in direction.
The two metrics compute and measure different things, so they suit different analysis models:
1. Euclidean distance reflects absolute differences in numeric features, so it suits analyses where differences must come from the magnitudes along each dimension, such as measuring user-value similarity from behavioral metrics.
2. Cosine distance separates individuals by direction and is insensitive to absolute magnitudes, so it suits tasks such as comparing users' interests from their content ratings; its insensitivity to magnitude also corrects for users who apply the rating scale differently.
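The contrast is easy to see with two vectors that point the same way but differ in magnitude (the numbers are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])  # same direction, 10x the magnitude

euclidean = np.linalg.norm(a - b)
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean)  # about 20.1 -- Euclidean sees them as far apart
print(cosine)     # 1.0 -- cosine sees them as identical in direction
```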
2.1 Adjusted cosine similarity:
Precisely because cosine similarity is insensitive to magnitudes, situations like the following arise:
Two users X and Y rate two items on a 5-point scale as (1, 2) and (4, 5). Cosine similarity gives 0.98, i.e. extremely similar. But judging from the ratings, X apparently dislikes the second item while Y rather likes it; the metric's insensitivity to magnitude has distorted the result. Adjusted cosine similarity corrects this by subtracting a mean from every dimension. For example, if X's and Y's mean ratings are both 3, the adjusted vectors are (-2, -1) and (1, 2), and the cosine similarity becomes -0.8: negative, and far from similar, which clearly matches reality better.
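The (1, 2) vs. (4, 5) example above can be checked directly (the shared mean of 3 is the one assumed in the text):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.array([1.0, 2.0])
Y = np.array([4.0, 5.0])
print(round(cos_sim(X, Y), 2))          # 0.98
# adjusted: subtract each user's mean rating (3, per the text)
print(round(cos_sim(X - 3, Y - 3), 2))  # -0.8
```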
3. Python implementations of cosine similarity
Method 1:
def cosine_similarity(x, y, dim=256):
    xx = 0.0
    yy = 0.0
    xy = 0.0
    for i in range(dim):
        xx += x[i] * x[i]
        yy += y[i] * y[i]
        xy += x[i] * y[i]
    xx_sqrt = xx ** 0.5
    yy_sqrt = yy ** 0.5
    cos = xy / (xx_sqrt * yy_sqrt) * 0.5 + 0.5
    return cos
Method 2:
import numpy as np

def cosine_similarity(x, y):
    num = x.dot(y.T)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return num / denom
Method 3:
def cosine_similarity(x, y, norm=False):
    assert len(x) == len(y), "len(x) != len(y)"
    zero_list = [0] * len(x)
    if x == zero_list or y == zero_list:
        return float(1) if x == y else float(0)

    res = np.array([[x[i] * y[i], x[i] * x[i], y[i] * y[i]] for i in range(len(x))])
    cos = sum(res[:, 0]) / (np.sqrt(sum(res[:, 1])) * np.sqrt(sum(res[:, 2])))

    return 0.5 * cos + 0.5 if norm else cos
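Note that the three methods do not all return the same range: Method 1 always applies the 0.5·cos + 0.5 normalization, Method 2 returns the raw cosine, and Method 3 normalizes only when norm=True. A quick check with made-up vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])

raw = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))  # what Method 2 returns
print(raw)              # 10/14, about 0.714
print(0.5 * raw + 0.5)  # what Methods 1 and 3 (norm=True) return
```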
I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.
The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine similarity.
Is there any program that can do this? Or should I start writing this from scratch?
Solution
Check out the NLTK package: http://www.nltk.org. It has everything you need.
For the cosine_similarity:
import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
For ngrams:
from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:
    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:
    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n - 1))
    sequence = list(sequence)
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i + n]) for i in range(count)]
for tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK, using FreqDist.
if you like pylucene, this will tell you how to compute tf.idf:
for i in xrange(docs):
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x ** 2 for x in rec.values()]) ** 0.5
            for k, v in rec.items():
                rec[k] = v / denom
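A rough, Lucene-free sketch of the same tf-idf plus cosine-normalization idea (NLTK's FreqDist is essentially a counter, so collections.Counter stands in here; the log-based tf and idf weights below are one common choice and an assumption, not Lucene's exact formulas):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["a", "cat"]]
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))  # document frequency per term

def tfidf(doc):
    tf = Counter(doc)
    # sublinear tf weight times log idf (a common, but not the only, choice)
    rec = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in tf.items()}
    # cosine normalization so every doc vector has unit length
    denom = math.sqrt(sum(v * v for v in rec.values())) or 1.0
    return {t: v / denom for t, v in rec.items()}

vec = tfidf(docs[1])
print(vec)
```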
