ML - Understanding Data with Statistics
Introduction
While working on machine learning projects, we often overlook the two most important parts: mathematics and data. This matters because ML is a data-driven approach, and our ML model will produce results only as good, or as bad, as the data we provide to it.
In the previous chapter we discussed how to load CSV data into an ML project, but it is good to understand the data before loading it. We can understand data in two ways: with statistics and with visualization.
In this chapter, with the help of the following Python recipes, we are going to understand ML data with statistics.
Looking at Raw Data
The very first recipe is looking at your raw data. This is important because the insight gained from inspecting raw data improves our chances of pre-processing and handling the data well for ML projects.
The following Python script uses the head() function of a Pandas DataFrame on the Pima Indians diabetes dataset to look at the first 50 rows and get a better understanding of it −
Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
print(data.head(50))


Output

preg   plas  pres    skin  test  mass   pedi    age      class
0      6      148     72     35   0     33.6    0.627    50    1
1      1       85     66     29   0     26.6    0.351    31    0
2      8      183     64      0   0     23.3    0.672    32    1
3      1       89     66     23  94     28.1    0.167    21    0
4      0      137     40     35  168    43.1    2.288    33    1
5      5      116     74      0   0     25.6    0.201    30    0
6      3       78     50     32   88    31.0    0.248    26    1
7     10      115      0      0   0     35.3    0.134    29    0
8      2      197     70     45  543    30.5    0.158    53    1
9      8      125     96      0   0     0.0     0.232    54    1
10     4      110     92      0   0     37.6    0.191    30    0
11    10      168     74      0   0     38.0    0.537    34    1
12    10      139     80      0   0     27.1    1.441    57    0
13     1      189     60     23  846    30.1    0.398    59    1
14     5      166     72     19  175    25.8    0.587    51    1
15     7      100      0      0   0     30.0    0.484    32    1
16     0      118     84     47  230    45.8    0.551    31    1
17     7      107     74      0   0     29.6    0.254    31    1
18     1      103     30     38  83     43.3    0.183    33    0
19     1      115     70     30  96     34.6    0.529    32    1
20     3      126     88     41  235    39.3    0.704    27    0
21     8       99     84      0   0     35.4    0.388    50    0
22     7      196     90      0   0     39.8    0.451    41    1
23     9      119     80     35   0     29.0    0.263    29    1
24    11      143     94     33  146    36.6    0.254    51    1
25    10      125     70     26  115    31.1    0.205    41    1
26     7      147     76      0   0     39.4    0.257    43    1
27     1       97     66     15  140    23.2    0.487    22    0
28    13      145     82     19  110    22.2    0.245    57    0
29     5      117     92      0   0     34.1    0.337    38    0
30     5      109     75     26   0     36.0    0.546    60    0
31     3      158     76     36  245    31.6    0.851    28    1
32     3       88     58     11   54    24.8    0.267    22    0
33     6       92     92      0   0     19.9    0.188    28    0
34    10      122     78     31   0     27.6    0.512    45    0
35     4      103     60     33  192    24.0    0.966    33    0
36    11      138     76      0   0     33.2    0.420    35    0
37     9      102     76     37   0     32.9    0.665    46    1
38     2       90     68     42   0     38.2    0.503    27    1
39     4      111     72     47  207    37.1    1.390    56    1
40     3      180     64     25   70    34.0    0.271    26    0
41     7      133     84      0   0     40.2    0.696    37    0
42     7      106     92     18   0     22.7    0.235    48    0
43     9      171    110     24  240    45.4    0.721    54    1
44     7      159     64      0   0     27.4    0.294    40    0
45     0      180     66     39   0     42.0    1.893    25    1
46     1      146     56      0   0     29.7    0.564    29    0
47     2       71     70     27   0     28.0    0.586    22    0
48     7      103     66     32   0     39.1    0.344    31    1
49     7      105      0      0   0     0.0     0.305    24    0


We can observe from the above output that the first column gives the row number, which can be very useful for referencing a specific observation.
Checking Dimensions of Data
It is always good practice to know how much data, in terms of rows and columns, we have for our ML project. The reasons are −
With too many rows and columns, it would take a long time to run the algorithm and train the model.
With too few rows and columns, we would not have enough data to train the model well.
The following Python script prints the shape property of a Pandas DataFrame. We will run it on the iris dataset to get its total number of rows and columns.
Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)


Output

(150, 4)


We can easily observe from the output that the iris dataset we are going to use has 150 rows and 4 columns.
Getting Each Attribute's Data Type
Knowing the data type of each attribute is another good practice, because we sometimes need to convert one data type to another; for example, strings may need converting to floating point or int to represent categorical or ordinal values. We can get an idea of an attribute's data type by looking at the raw data, but another way is to use the dtypes property of a Pandas DataFrame, which reports the data type of every attribute. The following Python script shows this −
Example

from pandas import read_csv
path = r"C:\iris.csv"
data = read_csv(path)
print(data.dtypes)


Output

sepal_length  float64
sepal_width   float64
petal_length  float64
petal_width   float64
dtype: object


From the above output, we can easily read off the data type of each attribute.
Statistical Summary of Data
We have discussed the recipe for getting the shape of the data, i.e. its number of rows and columns, but often we also need to review summary statistics. This can be done with the describe() function of a Pandas DataFrame, which provides the following 8 statistical properties of each and every data attribute −
Count
Mean
Standard deviation
Minimum value
Maximum value
25th percentile
Median, i.e. 50th percentile
75th percentile
Example

from pandas import read_csv
from pandas import set_option
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
set_option('display.precision', 2)
print(data.shape)
print(data.describe())


Output

(768, 9)
preg      plas       pres      skin      test        mass       pedi      age      class
count 768.00      768.00    768.00     768.00    768.00     768.00     768.00    768.00    768.00
mean    3.85      120.89     69.11      20.54     79.80      31.99       0.47     33.24      0.35
std     3.37       31.97     19.36      15.95    115.24       7.88       0.33     11.76      0.48
min     0.00        0.00      0.00       0.00      0.00       0.00       0.08     21.00      0.00
25%     1.00       99.00     62.00       0.00      0.00      27.30       0.24     24.00      0.00
50%     3.00      117.00     72.00      23.00     30.50      32.00       0.37     29.00      0.00
75%     6.00      140.25     80.00      32.00    127.25      36.60       0.63     41.00      1.00
max    17.00      199.00    122.00      99.00    846.00      67.10       2.42     81.00      1.00


From the above output, we can read the statistical summary of the Pima Indians diabetes dataset along with its shape.
Reviewing Class Distribution
Class distribution statistics are useful in classification problems, where we need to know the balance of class values. This is important because a highly imbalanced class distribution, i.e. one class having many more observations than another, may need special handling at the data preparation stage of an ML project. We can easily get the class distribution in Python with a Pandas DataFrame.
Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
count_class = data.groupby('class').size()
print(count_class)


Output

class
0  500
1  268
dtype: int64


From the above output, it is clear that the number of observations with class 0 is almost double the number of observations with class 1.
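An alternative sketch (my addition, not from the tutorial): Series.value_counts gives the same counts as groupby().size(), and normalize=True exposes the imbalance as proportions. The labels below simply mirror the counts in the output above.

```python
import pandas as pd

# a Series with the same class counts as the output above: 500 zeros, 268 ones
labels = pd.Series([0] * 500 + [1] * 268, name='class')

print(labels.value_counts())                         # counts per class
print(labels.value_counts(normalize=True).round(2))  # proportions: ~0.65 vs ~0.35
```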
Reviewing Correlation between Attributes
The relationship between two variables is called correlation. In statistics, the most common method for calculating correlation is Pearson's correlation coefficient. It can take the following values −
Coefficient value = 1 − full positive correlation between variables.
Coefficient value = -1 − full negative correlation between variables.
Coefficient value = 0 − no correlation at all between variables.
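A quick numeric illustration of the three cases, using numpy's corrcoef on toy vectors of my own (not from the dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises linearly with x: full positive correlation (coefficient 1)
print(np.corrcoef(x, 2 * x + 1)[0, 1])
# y falls linearly as x rises: full negative correlation (coefficient -1)
print(np.corrcoef(x, -3 * x + 7)[0, 1])
# y is symmetric around the middle of x: no linear correlation (coefficient 0)
print(np.corrcoef(x, np.array([1.0, 2.0, 3.0, 2.0, 1.0]))[0, 1])
```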
It is always worth reviewing the pairwise correlations of the attributes in a dataset before using it in an ML project, because some machine learning algorithms, such as linear regression and logistic regression, perform poorly in the presence of highly correlated attributes. In Python, we can easily calculate a correlation matrix of dataset attributes with the corr() function of a Pandas DataFrame.
Example

from pandas import read_csv
from pandas import set_option
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)
set_option('display.precision', 2)
correlations = data.corr(method='pearson')
print(correlations)


Output

preg     plas     pres     skin     test      mass     pedi       age      class
preg     1.00     0.13     0.14     -0.08     -0.07   0.02     -0.03       0.54   0.22
plas     0.13     1.00     0.15     0.06       0.33   0.22      0.14       0.26   0.47
pres     0.14     0.15     1.00     0.21       0.09   0.28      0.04       0.24   0.07
skin    -0.08     0.06     0.21     1.00       0.44   0.39      0.18      -0.11   0.07
test    -0.07     0.33     0.09     0.44       1.00   0.20      0.19      -0.04   0.13
mass     0.02     0.22     0.28     0.39       0.20   1.00      0.14       0.04   0.29
pedi    -0.03     0.14     0.04     0.18       0.19   0.14      1.00       0.03   0.17
age      0.54     0.26     0.24     -0.11     -0.04   0.04      0.03       1.00   0.24
class    0.22     0.47     0.07     0.07       0.13   0.29      0.17       0.24   1.00


The matrix in the above output gives the correlation between every pair of attributes in the dataset.
Reviewing Skew of Attribute Distribution
Skewness describes a distribution that is assumed to be Gaussian but appears distorted or shifted in one direction or another, to the left or to the right. Reviewing the skewness of attributes is an important task, for the following reasons −
Skew in the data needs correcting at the data preparation stage so that we can get more accuracy from our model.
Most ML algorithms assume that the data has a Gaussian distribution, i.e. normal, bell-curved data.
In Python, we can easily calculate the skew of each attribute with the skew() function of a Pandas DataFrame.
Example

from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
print(data.skew())


Output

preg   0.90
plas   0.17
pres  -1.84
skin   0.11
test   2.27
mass  -0.43
pedi   1.92
age    1.13
class  0.64
dtype: float64


From the above output we can see whether each attribute skews positive or negative; the closer the value is to zero, the less skew it shows.
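A minimal sketch of one common correction for positive skew (my own hypothetical values, in the spirit of the 'test' attribute above): a log1p transform pulls a strongly right-skewed attribute toward symmetry.

```python
import numpy as np
import pandas as pd

# a made-up right-skewed attribute: many small values, a few very large ones
values = pd.Series([0, 0, 0, 15, 23, 88, 94, 110, 168, 230, 543, 846], dtype=float)

print(values.skew())            # strongly positive skew
print(np.log1p(values).skew())  # much closer to zero after log1p
```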

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_statistics.htm

While stumbling along the Python learning curve, I recently built a movie recommendation system on the MovieLens (ml-100k) dataset. It uses the Pearson correlation coefficient to judge how similar the other users in the dataset are to the target user, takes the 50 most similar users, computes a weighted recommendation score from their ratings, and after sorting recommends the 10 highest-scoring movies. The implementation is as follows:
0. Preparation
First we need to understand the dataset's files; they are old enough that reading them directly is a little awkward, so a bit of research gives the following:
u.data: the full ratings file, 100,000 ratings of 1,682 movies by 943 users; ratings are 1-5 and every user has rated at least 20 movies
u.info: the number of users, items and total ratings
u.item: movie information, with fields separated by the '|' character
u.genre: movie genre information, genres numbered 0-18
u.user: basic user information: id, age, gender, occupation, zip code; ids match u.data
1. First, load the ratings file u.data and the movie-name file u.item.
def loadData():
    f = open('u.data')
    data = []
    for i in range(100000):
        h = f.readline().split()    # each line: user id, item id, rating, timestamp
        h = list(map(int, h))
        data.append(h[0:3])         # keep user id, item id, rating
    f.close()
    return data

def loadName():
    f = open('u.item', encoding='ISO-8859-1')
    name = []
    for i in range(1682):
        h = f.readline()
        k = ''
        m = 0
        for j in range(100):
            k += str(h[j])
            if str(h[j]) == '|':
                m += 1
            if m == 2:              # stop after the second '|': id and title read
                break
        name.append(k)
    f.close()
    return name

When loading the movie names I hit an encoding problem: gbk cannot read this file, so encoding='ISO-8859-1' has to be added.
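As an aside, the ratings file can also be loaded with pandas instead of manual line parsing. A sketch of my own, assuming u.data sits in the working directory (the column names are mine, not from the dataset's README):

```python
import os
import pandas as pd

def load_ratings(path='u.data'):
    # u.data is tab-separated: user id, item id, rating, timestamp
    cols = ['user_id', 'item_id', 'rating', 'timestamp']
    return pd.read_csv(path, sep='\t', names=cols)

if os.path.exists('u.data'):
    ratings = load_ratings()
    print(ratings.shape)   # (100000, 4) for the full file
```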
2. Assemble and process the data
Now that we understand the file layout, we can write a function that turns the ratings into a list with one row per user holding that user's rating for every movie, i.e. a 943x1682 matrix.
def manageDate(data):
    outdata = []
    for i in range(943):
        outdata.append([])
        for j in range(1682):
            outdata[i].append(0)
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]
    return outdata

3. Compute the correlation coefficients
This step uses three functions in two stages: first compute the mean of each vector (list), then the correlation coefficient.
def calcMean(x, y):
    sum_x = sum(x)
    sum_y = sum(y)
    n = len(x)
    x_mean = float(sum_x) / n
    y_mean = float(sum_y) / n
    return x_mean, y_mean

def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # means of vectors x and y
    n = len(x)
    sumTop = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
    for i in range(n):
        x_pow += math.pow(x[i] - x_mean, 2)
    for i in range(n):
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = math.sqrt(x_pow * y_pow)
    p = sumTop / sumBottom
    return p

def calcAttribute(dataSet, num):
    prr = []
    n, m = np.shape(dataSet)  # number of rows and columns in the dataset
    y = dataSet[num - 1]      # the target user's rating vector
    for j in range(n):        # Pearson coefficient of every user against the target
        x = dataSet[j]
        prr.append(calcPearson(x, y))
    return prr
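As a sanity check (my addition, not the author's), the hand-rolled formula above can be compared against numpy's built-in corrcoef on a pair of toy rating vectors; the two values should match:

```python
import numpy as np

def pearson(x, y):
    # the same computation as calcPearson above, condensed with numpy
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))

u = [5, 3, 0, 4, 0]   # made-up rating vectors for two users
v = [4, 0, 0, 5, 1]
print(pearson(u, v), np.corrcoef(u, v)[0, 1])  # the two values agree
```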

4. Choose the movies
As described at the start, we take the 50 most similar users, compute weighted recommendation scores from their ratings, and after sorting recommend the 10 highest-scoring movies.
def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    sim = []        # [user index, similarity to the target user]
    out_list = []
    movie_rank = []
    for i in range(1682):
        movie_rank.append([i, 0])
    k = 0
    for i in range(943):
        sim.append([i, prr[i]])
    # bubble-sort users by similarity, descending
    for i in range(943):
        for j in range(942 - i):
            if sim[j][1] < sim[j + 1][1]:
                sim[j], sim[j + 1] = sim[j + 1], sim[j]
    # weight the ratings of the 50 most similar users (index 0 is the target itself)
    for i in range(1, 51):
        for j in range(1682):
            movie_rank[j][1] += outdata[sim[i][0]][j] * sim[i][1] / 50
    # bubble-sort movies by weighted score, descending
    for i in range(1682):
        for j in range(1681 - i):
            if movie_rank[j][1] < movie_rank[j + 1][1]:
                movie_rank[j], movie_rank[j + 1] = movie_rank[j + 1], movie_rank[j]
    # pick the 10 best movies the target user has not rated yet
    for i in range(1682):
        if outdata[num - 1][movie_rank[i][0]] == 0:
            out_list.append(movie_rank[i])
            k += 1
            if k == 10:
                break
    return out_list

This returns the indices and scores of the recommended movies.
5. Output
This simply prints the movie titles and their recommendation scores.
def printMovie(out_list, name):
    print("base on the data we think you may like those movies:")
    for i in range(10):
        print(name[out_list[i][0]], " rank score:", out_list[i][1])

A screenshot of the run's output appears in the original post.
The complete code:
import numpy as np
import math

def loadData():
    f = open('u.data')
    data = []
    for i in range(100000):
        h = f.readline().split()
        h = list(map(int, h))
        data.append(h[0:3])
    f.close()
    return data

def loadName():
    f = open('u.item', encoding='ISO-8859-1')
    name = []
    for i in range(1682):
        h = f.readline()
        k = ''
        m = 0
        for j in range(100):
            k += str(h[j])
            if str(h[j]) == '|':
                m += 1
            if m == 2:
                break
        name.append(k)
    f.close()
    return name

def manageDate(data):
    outdata = []
    for i in range(943):
        outdata.append([])
        for j in range(1682):
            outdata[i].append(0)
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]
    return outdata

def calcMean(x, y):
    sum_x = sum(x)
    sum_y = sum(y)
    n = len(x)
    x_mean = float(sum_x) / n
    y_mean = float(sum_y) / n
    return x_mean, y_mean

def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)  # means of vectors x and y
    n = len(x)
    sumTop = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
    for i in range(n):
        x_pow += math.pow(x[i] - x_mean, 2)
    for i in range(n):
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = math.sqrt(x_pow * y_pow)
    p = sumTop / sumBottom
    return p

def calcAttribute(dataSet, num):
    prr = []
    n, m = np.shape(dataSet)  # number of rows and columns in the dataset
    y = dataSet[num - 1]      # the target user's rating vector
    for j in range(n):        # Pearson coefficient of every user against the target
        x = dataSet[j]
        prr.append(calcPearson(x, y))
    return prr

def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    sim = []        # [user index, similarity to the target user]
    out_list = []
    movie_rank = []
    for i in range(1682):
        movie_rank.append([i, 0])
    k = 0
    for i in range(943):
        sim.append([i, prr[i]])
    # bubble-sort users by similarity, descending
    for i in range(943):
        for j in range(942 - i):
            if sim[j][1] < sim[j + 1][1]:
                sim[j], sim[j + 1] = sim[j + 1], sim[j]
    # weight the ratings of the 50 most similar users (index 0 is the target itself)
    for i in range(1, 51):
        for j in range(1682):
            movie_rank[j][1] += outdata[sim[i][0]][j] * sim[i][1] / 50
    # bubble-sort movies by weighted score, descending
    for i in range(1682):
        for j in range(1681 - i):
            if movie_rank[j][1] < movie_rank[j + 1][1]:
                movie_rank[j], movie_rank[j + 1] = movie_rank[j + 1], movie_rank[j]
    # pick the 10 best movies the target user has not rated yet
    for i in range(1682):
        if outdata[num - 1][movie_rank[i][0]] == 0:
            out_list.append(movie_rank[i])
            k += 1
            if k == 10:
                break
    return out_list

def printMovie(out_list, name):
    print("base on the data we think you may like those movies:")
    for i in range(10):
        print(name[out_list[i][0]], " rank score:", out_list[i][1])

i_data = loadData()
name = loadName()
out_data = manageDate(i_data)
a = int(input("please input the id of user:"))
out_list = choseMovie(out_data, a)
printMovie(out_list, name)

Scraping ml-100k movie posters from IMDb
ml-100k: the dataset; only ./ml-100k/u.item is used. result: the movie posters, saved as <movie id>.jpg; u.item maps each id to its movie title.
Libraries used
import pandas as pd       # read the ml-100k files
from pyquery import PyQuery as pq    # scraping/parsing
import requests           # HTTP requests
import logging            # logging
import os                 # check whether files exist
import multiprocessing    # process pool to speed up scraping
import shutil             # copy a no_found cover for posters we fail to scrape

Step 1: read all the movie titles
Download ml-100k (if you cannot, the GitHub repo at the end has it).
def get_movie_names(item_path='./ml-100k/u.item'):
    """Get the movie titles from ./ml-100k/u.item
    Args:
        item_path: the ml-100k movie-title file
    Return:
        movies_data: [(movie id, title), ...] from ml-100k
    """
    movies_data = []
    # load u.item ('|'-separated, latin-1); this line was missing from the scraped post
    data = pd.read_csv(item_path, sep='|', header=None, encoding='ISO-8859-1')
    for idx, row in data.iterrows():
        movies_data.append((row[0], row[1]))
    print(f'get {len(movies_data)} movie name success')
    return movies_data

Step 2: get each movie's poster URL
First define a scraping helper:
def scrape_api(url):
    """Fetch a page"""
    logging.info(f'scraping {url}')
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response
        logging.error(f'scraping {url} status code error')
    except requests.RequestException:
        logging.error(f'scraping {url} error')
    return None
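The post later notes that a few covers fail because of network speed; a hedged variant of the helper (my own sketch, not the author's) retries transient failures with a timeout before giving up:

```python
import logging
import requests

def scrape_api_retry(url, retries=3, timeout=10):
    """Like scrape_api, but retries transient failures with a request timeout."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response
            logging.error(f'scraping {url} status code error (attempt {attempt})')
        except requests.RequestException:
            logging.error(f'scraping {url} error (attempt {attempt})')
    return None
```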

Search for the movie title with https://www.imdb.com/find?q=, follow the first result to the movie's detail page, and pull the poster URL from it.
def get_movie_png(movie_name):
    """Get the poster image URL for one movie"""
    # IMDb search
    search_url = f'https://www.imdb.com/find?q={movie_name}'
    response = scrape_api(search_url)
    if response is None:
        return None
    doc = pq(response.text)
    href = doc('.findList tr td a').attr('href')    # href of the <a> under <tr><td> in class 'findList'
    if href is None:
        return None
    # fetch the poster URL from the detail page
    detail_url = f'https://www.imdb.com/{href}'
    response = scrape_api(detail_url)
    if response is None:
        return None
    detail_doc = pq(response.text)
    jpg_url = detail_doc('.poster a img').attr('src')  # src of the <img> under <a> in class 'poster'
    return jpg_url

Step 3: save each poster to a local file
def save_pictures(url, movie_index, save_base_path='./result/'):
    """Save the image at url"""
    # make sure the output directory exists
    if not os.path.exists(save_base_path):
        os.mkdir(save_base_path)
    r = scrape_api(url)
    try:
        jpg = r.content
        open(f'{save_base_path}{movie_index}.jpg', 'wb').write(jpg)
        logging.info(f'saved {movie_index}.jpg')
    except IOError:
        logging.error(f'failed to save {movie_index}.jpg')

Step 4: use a process pool to speed up scraping, with error handling
For covers that fail to scrape, just run it a few more times or add them by hand (a small number fail because of network speed).
def main(movie_data, save_base_path='./result/'):
    # configure logging
    logging.basicConfig(level=logging.INFO)
    movie_index = movie_data[0]
    movie_name = movie_data[1]
    # skip images we have already scraped
    if os.path.exists(f'{save_base_path}{movie_index}.jpg'):
        return
    jpg_url = get_movie_png(movie_name)  # get the poster URL
    if jpg_url is None:     # failed; retry with the year stripped from the title
        jpg_url = get_movie_png(movie_name.split('(')[0])
    if jpg_url is None:     # still no poster URL
        logging.error(f'error to get {movie_name} pic_url')
    else:
        save_pictures(jpg_url, movie_index, save_base_path)  # save the image

if __name__ == '__main__':
    movies_data = get_movie_names()  # get all movie titles
    print(movies_data)
    pool = multiprocessing.Pool()    # create the process pool
    pool.map(main, movies_data)      # map movies across the pool
    pool.close()                     # stop accepting new tasks
    pool.join()                      # wait for the child processes to finish
    print('end')

Step 5: fill in the movies whose posters could not be scraped
def fill(movies_data, fill_jpg='./no_found.jpg', save_base_path='./result/'):
    """Substitute no_found.jpg for movies whose poster was not found"""
    for i in range(1, len(movies_data) + 1):
        poster_jpg = f'{save_base_path}{i}.jpg'
        if not os.path.exists(poster_jpg):
            shutil.copyfile(fill_jpg, poster_jpg)

The complete code:
import pandas as pd
from pyquery import PyQuery as pq
import requests
import logging
import os
import multiprocessing
import shutil

def scrape_api(url):
    """Fetch a page"""
    logging.info(f'scraping {url}')
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response
        logging.error(f'scraping {url} status code error')
    except requests.RequestException:
        logging.error(f'scraping {url} error')
    return None

def get_movie_names(item_path='./ml-100k/u.item'):
    """Get the movie titles from ./ml-100k/u.item
    Args:
        item_path: the ml-100k movie-title file
    Return:
        movies_data: [(movie id, title), ...] from ml-100k
    """
    movies_data = []
    # load u.item ('|'-separated, latin-1); this line was missing from the scraped post
    data = pd.read_csv(item_path, sep='|', header=None, encoding='ISO-8859-1')
    for idx, row in data.iterrows():
        movies_data.append((row[0], row[1]))
    print(f'get {len(movies_data)} movie name success')
    return movies_data

def get_movie_png(movie_name):
    """Get the poster image URL for one movie"""
    # IMDb search
    search_url = f'https://www.imdb.com/find?q={movie_name}'
    response = scrape_api(search_url)
    if response is None:
        return None
    doc = pq(response.text)
    href = doc('.findList tr td a').attr('href')    # href of the <a> under <tr><td> in class 'findList'
    if href is None:
        return None
    # fetch the poster URL from the detail page
    detail_url = f'https://www.imdb.com/{href}'
    response = scrape_api(detail_url)
    if response is None:
        return None
    detail_doc = pq(response.text)
    jpg_url = detail_doc('.poster a img').attr('src')  # src of the <img> under <a> in class 'poster'
    return jpg_url

def save_pictures(url, movie_index, save_base_path='./result/'):
    """Save the image at url"""
    # make sure the output directory exists
    if not os.path.exists(save_base_path):
        os.mkdir(save_base_path)
    r = scrape_api(url)
    try:
        jpg = r.content
        open(f'{save_base_path}{movie_index}.jpg', 'wb').write(jpg)
        logging.info(f'saved {movie_index}.jpg')
    except IOError:
        logging.error(f'failed to save {movie_index}.jpg')

def main(movie_data, save_base_path='./result/'):
    # configure logging
    logging.basicConfig(level=logging.INFO)
    movie_index = movie_data[0]
    movie_name = movie_data[1]
    # skip images we have already scraped
    if os.path.exists(f'{save_base_path}{movie_index}.jpg'):
        return
    jpg_url = get_movie_png(movie_name)  # get the poster URL
    if jpg_url is None:     # failed; retry with the year stripped from the title
        jpg_url = get_movie_png(movie_name.split('(')[0])
    if jpg_url is None:     # still no poster URL
        logging.error(f'error to get {movie_name} pic_url')
    else:
        save_pictures(jpg_url, movie_index, save_base_path)  # save the image

def fill(movies_data, fill_jpg='./no_found.jpg', save_base_path='./result/'):
    """Substitute no_found.jpg for movies whose poster was not found"""
    for i in range(1, len(movies_data) + 1):
        poster_jpg = f'{save_base_path}{i}.jpg'
        if not os.path.exists(poster_jpg):
            shutil.copyfile(fill_jpg, poster_jpg)

if __name__ == '__main__':
    movies_data = get_movie_names()  # get all movie titles
    print(movies_data)
    pool = multiprocessing.Pool()    # create the process pool
    pool.map(main, movies_data)      # map movies across the pool
    pool.close()                     # stop accepting new tasks
    pool.join()                      # wait for the child processes to finish
    print('end')
    # I managed to scrape 1551/1682 posters
    fill(movies_data)  # substitute one image (no_found.jpg) for posters that could not be scraped


GitHub: https://github.com/visionsss/imdb_poster (the result directory holds the scraped posters)