
           Recommender systems play an important role in our daily lives. Most people who have worked on recommendation-related engineering projects have probably read《推荐系统实战》(Practical Recommender Systems); I am one of its readers, and I found it a good introductory resource on the topic. The recommenders behind many shopping sites and large companies are complex and powerful, mostly built on deep learning. This article shares hands-on experience from recommender-system projects I actually built at my company, covering the techniques from two angles: machine learning and deep learning.

           The machine-learning part uses the surprise module to design and implement a book recommender and a movie recommender; the deep-learning part builds a music recommender on a neural-network model.

           The dataset used in this article can be downloaded here:

    https://download.csdn.net/download/together_cz/10916350

           For an introduction to the surprise module and worked examples, see:

    https://surprise.readthedocs.io/en/stable/getting_started.html

          (Home-page screenshot omitted.)

             To load your own dataset with surprise, first define a Reader that describes the data format. A minimal reader looks like this:

    # build the reader and load the ratings file
    reader=Reader(line_format=data_format,sep=sep)
    mydata=Dataset.load_from_file(data_path,reader=reader)
    

             (Book-recommender design diagram omitted.)

         After downloading the dataset from the link above, a quick look at book.csv shows the following sample rows:

    id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
    1,2767052,2767052,2792775,272,439023483,9.78043902348e+12,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m/2767052.jpg,https://images.gr-assets.com/books/1447303603s/2767052.jpg
    2,3,3,4640799,491,439554934,9.78043955493e+12,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m/3.jpg,https://images.gr-assets.com/books/1474154022s/3.jpg
    3,41865,41865,3212258,226,316015849,9.78031601584e+12,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m/41865.jpg,https://images.gr-assets.com/books/1361039443s/41865.jpg
    4,2657,2657,3275794,487,61120081,9.78006112008e+12,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m/2657.jpg,https://images.gr-assets.com/books/1361975680s/2657.jpg
    5,4671,4671,245494,1356,743273567,9.78074327356e+12,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m/4671.jpg,https://images.gr-assets.com/books/1490528560s/4671.jpg
    6,11870085,11870085,16827462,226,525478817,9.78052547881e+12,John Green,2012.0,The Fault in Our Stars,The Fault in Our Stars,eng,4.26,2346404,2478609,140739,47994,92723,327550,698471,1311871,https://images.gr-assets.com/books/1360206420m/11870085.jpg,https://images.gr-assets.com/books/1360206420s/11870085.jpg
    7,5907,5907,1540236,969,618260307,9.7806182603e+12,J.R.R. Tolkien,1937.0,The Hobbit or There and Back Again,The Hobbit,en-US,4.25,2071616,2196809,37653,46023,76784,288649,665635,1119718,https://images.gr-assets.com/books/1372847500m/5907.jpg,https://images.gr-assets.com/books/1372847500s/5907.jpg
    8,5107,5107,3036731,360,316769177,9.78031676917e+12,J.D. Salinger,1951.0,The Catcher in the Rye,The Catcher in the Rye,eng,3.79,2044241,2120637,44920,109383,185520,455042,661516,709176,https://images.gr-assets.com/books/1398034300m/5107.jpg,https://images.gr-assets.com/books/1398034300s/5107.jpg
    9,960,960,3338963,311,1416524797,9.78141652479e+12,Dan Brown,2000.0,Angels & Demons ,"Angels & Demons  (Robert Langdon, #1)",en-CA,3.85,2001311,2078754,25112,77841,145740,458429,716569,680175,https://images.gr-assets.com/books/1303390735m/960.jpg,https://images.gr-assets.com/books/1303390735s/960.jpg
    10,1885,1885,3060926,3455,679783261,9.78067978327e+12,Jane Austen,1813.0,Pride and Prejudice,Pride and Prejudice,eng,4.24,2035490,2191465,49152,54700,86485,284852,609755,1155673,https://images.gr-assets.com/books/1320399351m/1885.jpg,https://images.gr-assets.com/books/1320399351s/1885.jpg
    11,77203,77203,3295919,283,1594480001,9.78159448e+12,Khaled Hosseini,2003.0,The Kite Runner ,The Kite Runner,eng,4.26,1813044,1878095,59730,34288,59980,226062,628174,929591,https://images.gr-assets.com/books/1484565687m/77203.jpg,https://images.gr-assets.com/books/1484565687s/77203.jpg
    12,13335037,13335037,13155899,210,62024035,9.78006202404e+12,Veronica Roth,2011.0,Divergent,"Divergent (Divergent, #1)",eng,4.24,1903563,2216814,101023,36315,82870,310297,673028,1114304,https://images.gr-assets.com/books/1328559506m/13335037.jpg,https://images.gr-assets.com/books/1328559506s/13335037.jpg
    13,5470,5470,153313,995,451524934,9.78045152494e+12,"George Orwell, Erich Fromm, Celâl Üster",1949.0,Nineteen Eighty-Four,1984,eng,4.14,1956832,2053394,45518,41845,86425,324874,692021,908229,https://images.gr-assets.com/books/1348990566m/5470.jpg,https://images.gr-assets.com/books/1348990566s/5470.jpg
    14,7613,7613,2207778,896,452284244,9.78045228424e+12,George Orwell,1945.0,Animal Farm: A Fairy Story,Animal Farm,eng,3.87,1881700,1982987,35472,66854,135147,433432,698642,648912,https://images.gr-assets.com/books/1424037542m/7613.jpg,https://images.gr-assets.com/books/1424037542s/7613.jpg
    15,48855,48855,3532896,710,553296981,9.78055329698e+12,"Anne Frank, Eleanor Roosevelt, B.M. Mooyaart-Doubleday",1947.0,Het Achterhuis: Dagboekbrieven 14 juni 1942 - 1 augustus 1944,The Diary of a Young Girl,eng,4.1,1972666,2024493,20825,45225,91270,355756,656870,875372,https://images.gr-assets.com/books/1358276407m/48855.jpg,https://images.gr-assets.com/books/1358276407s/48855.jpg
    16,2429135,2429135,1708725,274,307269752,9.78030726975e+12,"Stieg Larsson, Reg Keeland",2005.0,Män som hatar kvinnor,"The Girl with the Dragon Tattoo (Millennium, #1)",eng,4.11,1808403,1929834,62543,54835,86051,285413,667485,836050,https://images.gr-assets.com/books/1327868566m/2429135.jpg,https://images.gr-assets.com/books/1327868566s/2429135.jpg
    17,6148028,6148028,6171458,201,439023491,9.7804390235e+12,Suzanne Collins,2009.0,Catching Fire,"Catching Fire (The Hunger Games, #2)",eng,4.3,1831039,1988079,88538,10492,48030,262010,687238,980309,https://images.gr-assets.com/books/1358273780m/6148028.jpg,https://images.gr-assets.com/books/1358273780s/6148028.jpg
    18,5,5,2402163,376,043965548X,9.78043965548e+12,"J.K. Rowling, Mary GrandPré, Rufus Beck",1999.0,Harry Potter and the Prisoner of Azkaban,"Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)",eng,4.53,1832823,1969375,36099,6716,20413,166129,509447,1266670,https://images.gr-assets.com/books/1499277281m/5.jpg,https://images.gr-assets.com/books/1499277281s/5.jpg
    19,34,34,3204327,566,618346252,9.78061834626e+12,J.R.R. Tolkien,1954.0, The Fellowship of the Ring,"The Fellowship of the Ring (The Lord of the Rings, #1)",eng,4.34,1766803,1832541,15333,38031,55862,202332,493922,1042394,https://images.gr-assets.com/books/1298411339m/34.jpg,https://images.gr-assets.com/books/1298411339s/34.jpg
    20,7260188,7260188,8812783,239,439023513,9.78043902351e+12,Suzanne Collins,2010.0,Mockingjay,"Mockingjay (The Hunger Games, #3)",eng,4.03,1719760,1870748,96274,30144,110498,373060,618271,738775,https://images.gr-assets.com/books/1358275419m/7260188.jpg,https://images.gr-assets.com/books/1358275419s/7260188.jpg
    21,2,2,2809203,307,439358078,9.78043935807e+12,"J.K. Rowling, Mary GrandPré",2003.0,Harry Potter and the Order of the Phoenix,"Harry Potter and the Order of the Phoenix (Harry Potter, #5)",eng,4.46,1735368,1840548,28685,9528,31577,180210,494427,1124806,https://images.gr-assets.com/books/1387141547m/2.jpg,https://images.gr-assets.com/books/1387141547s/2.jpg
    22,12232938,12232938,1145090,183,316166685,9.78031616668e+12,Alice Sebold,2002.0,The Lovely Bones,The Lovely Bones,eng,3.77,1605173,1661562,36642,62777,131188,404699,583575,479323,https://images.gr-assets.com/books/1457810586m/12232938.jpg,https://images.gr-assets.com/books/1457810586s/12232938.jpg
    23,15881,15881,6231171,398,439064864,9.78043906487e+12,"J.K. Rowling, Mary GrandPré",1998.0,Harry Potter and the Chamber of Secrets,"Harry Potter and the Chamber of Secrets (Harry Potter, #2)",eng,4.37,1779331,1906199,34172,8253,42251,242345,548266,1065084,https://images.gr-assets.com/books/1474169725m/15881.jpg,https://images.gr-assets.com/books/1474169725s/15881.jpg
    24,6,6,3046572,332,439139600,9.7804391396e+12,"J.K. Rowling, Mary GrandPré",2000.0,Harry Potter and the Goblet of Fire,"Harry Potter and the Goblet of Fire (Harry Potter, #4)",eng,4.53,1753043,1868642,31084,6676,20210,151785,494926,1195045,https://images.gr-assets.com/books/1361482611m/6.jpg,https://images.gr-assets.com/books/1361482611s/6.jpg
    25,136251,136251,2963218,263,545010225,9.78054501022e+12,"J.K. Rowling, Mary GrandPré",2007.0,Harry Potter and the Deathly Hallows,"Harry Potter and the Deathly Hallows (Harry Potter, #7)",eng,4.61,1746574,1847395,51942,9363,22245,113646,383914,1318227,https://images.gr-assets.com/books/1474171184m/136251.jpg,https://images.gr-assets.com/books/1474171184s/136251.jpg
    26,968,968,2982101,350,307277674,9.78030727767e+12,Dan Brown,2003.0,The Da Vinci Code,"The Da Vinci Code (Robert Langdon, #2)",eng,3.79,1447148,1557292,41560,71345,126493,340790,539277,479387,https://images.gr-assets.com/books/1303252999m/968.jpg,https://images.gr-assets.com/books/1303252999s/968.jpg
    27,1,1,41335427,275,439785960,9.78043978597e+12,"J.K. Rowling, Mary GrandPré",2005.0,Harry Potter and the Half-Blood Prince,"Harry Potter and the Half-Blood Prince (Harry Potter, #6)",eng,4.54,1678823,1785676,27520,7308,21516,136333,459028,1161491,https://images.gr-assets.com/books/1361039191m/1.jpg,https://images.gr-assets.com/books/1361039191s/1.jpg
    28,7624,7624,2766512,458,140283331,9.78014028333e+12,William Golding,1954.0,Lord of the Flies ,Lord of the Flies,eng,3.64,1605019,1671484,26886,92779,160295,425648,564916,427846,https://images.gr-assets.com/books/1327869409m/7624.jpg,https://images.gr-assets.com/books/1327869409s/7624.jpg
    29,18135,18135,3349450,1937,743477111,9.78074347712e+12,"William Shakespeare, Robert           Jackson",1595.0,An Excellent conceited Tragedie of Romeo and Juliet,Romeo and Juliet,eng,3.73,1628519,1672889,14778,57980,153179,452673,519822,489235,https://images.gr-assets.com/books/1327872146m/18135.jpg,https://images.gr-assets.com/books/1327872146s/18135.jpg
    30,8442457,19288043,13306276,196,297859382,9.78029785938e+12,Gillian Flynn,2012.0,Gone Girl,Gone Girl,eng,4.03,512475,1626519,121614,38874,80807,280331,616031,610476,https://images.gr-assets.com/books/1339602131m/8442457.jpg,https://images.gr-assets.com/books/1339602131s/8442457.jpg
    31,4667024,4667024,4717423,183,399155341,9.78039915534e+12,Kathryn Stockett,2009.0,The Help,The Help,eng,4.45,1531753,1603545,78204,10235,25117,134887,490754,942552,https://images.gr-assets.com/books/1346100365m/4667024.jpg,https://images.gr-assets.com/books/1346100365s/4667024.jpg
    32,890,890,40283,373,142000671,9.78014200067e+12,John Steinbeck,1937.0,Of Mice and Men ,Of Mice and Men,eng,3.84,1467496,1518741,24642,46630,110856,355169,532291,473795,https://images.gr-assets.com/books/1437235233m/890.jpg,https://images.gr-assets.com/books/1437235233s/890.jpg
    33,930,929,1558965,220,739326228,9.78073932622e+12,Arthur Golden,1997.0,Memoirs of a Geisha,Memoirs of a Geisha,eng,4.08,1300209,1418172,25605,23500,59033,258700,517157,559782,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,https://s.gr-assets.com/assets/nophoto/book/50x75-a91bf249278a81aabab721ef782c4a74.png
    34,10818853,10818853,15732562,169,1612130291,9.78161213029e+12,E.L. James,2011.0,Fifty Shades of Grey,"Fifty Shades of Grey (Fifty Shades, #1)",eng,3.67,1338493,1436818,75437,165455,152293,252185,294976,571909,https://images.gr-assets.com/books/1385207843m/10818853.jpg,https://images.gr-assets.com/books/1385207843s/10818853.jpg
    35,865,865,4835472,458,61122416,9.78006112242e+12,"Paulo Coelho, Alan R. Clarke",1988.0,O Alquimista,The Alchemist,eng,3.82,1299566,1403995,55781,74846,123614,289143,412180,504212,https://images.gr-assets.com/books/1483412266m/865.jpg,https://images.gr-assets.com/books/1483412266s/865.jpg
    36,3636,3636,2543234,192,385732554,9.78038573255e+12,Lois Lowry,1993.0,The Giver,"The Giver (The Giver, #1)",eng,4.12,1296825,1345445,54084,26497,59652,225326,448691,585279,https://images.gr-assets.com/books/1342493368m/3636.jpg,https://images.gr-assets.com/books/1342493368s/3636.jpg
    37,100915,100915,4790821,474,60764899,9.78006076489e+12,C.S. Lewis,1950.0,"The Lion, the Witch and the Wardrobe","The Lion, the Witch, and the Wardrobe (Chronicles of Narnia, #1)",eng,4.19,1531800,1584884,15186,19309,55542,262038,513366,734629,https://images.gr-assets.com/books/1353029077m/100915.jpg,https://images.gr-assets.com/books/1353029077s/100915.jpg
    38,14050,18619684,2153746,167,965818675,9.78096581867e+12,Audrey Niffenegger,2003.0,The Time Traveler's Wife,The Time Traveler's Wife,eng,3.95,746287,1308667,43382,44339,85429,257805,427210,493884,https://images.gr-assets.com/books/1437728815m/14050.jpg,https://images.gr-assets.com/books/1437728815s/14050.jpg
    39,13496,13496,1466917,101,553588486,9.78055358848e+12,George R.R. Martin,1996.0,A Game of Thrones,"A Game of Thrones (A Song of Ice and Fire, #1)",eng,4.45,1319204,1442220,46205,19988,28983,114092,404583,874574,https://images.gr-assets.com/books/1436732693m/13496.jpg,https://images.gr-assets.com/books/1436732693s/13496.jpg
    40,19501,19501,3352398,185,143038419,9.78014303841e+12,Elizabeth Gilbert,2006.0,"Eat, pray, love: one woman's search for everything across Italy, India and Indonesia","Eat, Pray, Love",eng,3.51,1181647,1206597,49714,100373,149549,310212,332191,314272,https://images.gr-assets.com/books/1503066414m/19501.jpg,https://images.gr-assets.com/books/1503066414s/19501.jpg
    41,28187,28187,3346751,159,786838655,9.78078683865e+12,Rick Riordan,2005.0,The Lightning Thief,"The Lightning Thief (Percy Jackson and the Olympians, #1)",eng,4.23,1366265,1411114,46006,18303,48294,219638,435514,689365,https://images.gr-assets.com/books/1400602609m/28187.jpg,https://images.gr-assets.com/books/1400602609s/28187.jpg
    42,1934,1934,3244642,1707,451529308,9.78045152930e+12,Louisa May Alcott,1868.0,Little Women,"Little Women (Little Women, #1)",en-US,4.04,1257121,1314293,17090,31645,70011,250794,426280,535563,https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png,https://s.gr-assets.com/assets/nophoto/book/50x75-a91bf249278a81aabab721ef782c4a74.png
    43,10210,10210,2977639,2568,142437204,9.78014243721e+12,"Charlotte Brontë, Michael Mason",1847.0,Jane Eyre,Jane Eyre,eng,4.1,1198557,1286135,31212,35132,64274,212294,400214,574221,https://images.gr-assets.com/books/1327867269m/10210.jpg,https://images.gr-assets.com/books/1327867269s/10210.jpg
    44,15931,15931,1498135,190,553816713,9.78055381672e+12,Nicholas Sparks,1996.0,The Notebook,"The Notebook (The Notebook, #1)",eng,4.06,1053403,1076749,17279,41395,63432,176469,298259,497194,https://images.gr-assets.com/books/1385738917m/15931.jpg,https://images.gr-assets.com/books/1385738917s/15931.jpg
    45,4214,4214,1392700,264,770430074,9.78077043008e+12,Yann Martel,2001.0,Life of Pi,Life of Pi,,3.88,1003228,1077431,42962,39768,74331,218702,384164,360466,https://images.gr-assets.com/books/1320562005m/4214.jpg,https://images.gr-assets.com/books/1320562005s/4214.jpg
    46,43641,43641,3441236,128,1565125606,9.78156512560e+12,Sara Gruen,2006.0,Water for Elephants,Water for Elephants,eng,4.07,1068146,1108839,55732,16705,49832,200154,417328,424820,https://images.gr-assets.com/books/1494428973m/43641.jpg,https://images.gr-assets.com/books/1494428973s/43641.jpg
    47,19063,19063,878368,251,375831002,9.780375831e+12,Markus Zusak,2005.0,The Book Thief,The Book Thief,eng,4.36,1159741,1287798,93611,17892,35360,135272,377218,722056,https://images.gr-assets.com/books/1390053681m/19063.jpg,https://images.gr-assets.com/books/1390053681s/19063.jpg
    48,4381,4381,1272463,507,307347974,9.78030734798e+12,Ray Bradbury,1953.0,Fahrenheit 451,Fahrenheit 451,spa,3.97,570498,1176240,30694,28366,64289,238242,426292,419051,https://images.gr-assets.com/books/1351643740m/4381.jpg,https://images.gr-assets.com/books/1351643740s/4381.jpg
    49,49041,49041,3203964,194,316160199,9.78031616019e+12,Stephenie Meyer,2006.0,"New Moon (Twilight, #2)","New Moon (Twilight, #2)",eng,3.52,1149630,1199000,44020,102837,160660,294207,290612,350684,https://images.gr-assets.com/books/1361039440m/49041.jpg,https://images.gr-assets.com/books/1361039440s/49041.jpg
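         As a quick sanity check on these columns, the average_rating field should equal the weighted mean of the ratings_1 through ratings_5 count columns. A standard-library sketch using the numbers from the first data row (The Hunger Games):

```python
# Star-rating counts copied from the first data row of book.csv.
counts = {1: 66715, 2: 127936, 3: 560092, 4: 1481305, 5: 2706317}
total = sum(counts.values())  # 4942365, the work_ratings_count column
avg = sum(star * n for star, n in counts.items()) / total
print(round(avg, 2))  # 4.34, the average_rating column
```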

              The overall design is simple, with nothing hard to grasp; here is the concrete implementation:

    def bookRecommendSystem(map_data='book.csv',train_data='rating.csv',data_format='book user rating',sep=',',flag='SVD',k=10):
        '''
        Book recommender system
        '''
        id_name_dic,name_id_dic=bookDataMapping(map_data)
        myModel,dataset=buildModel(data_path=train_data,data_format=data_format,sep=sep,flag=flag)
        print('==================model training finished========================')
        performance=evaluationModel(myModel,dataset)  # cross-validation helper (not shown in this excerpt)
        print('==================model performance===================')
        print(performance)
        current_playlist_id='1239'
        print('current user id: '+current_playlist_id)
        current_playlist_name=id_name_dic[current_playlist_id]
        print('current book name: '+current_playlist_name)
        playlist_inner_id=myModel.trainset.to_inner_uid(current_playlist_id)
        print('current inner user id: '+str(playlist_inner_id))
        # recommend based on the k most similar users
        playlist_neighbors=myModel.get_neighbors(playlist_inner_id,k=k)
        playlist_neighbors_id=(myModel.trainset.to_raw_uid(inner_id) for inner_id in playlist_neighbors)
        playlist_neighbors_name=(id_name_dic[playlist_id] for playlist_id in playlist_neighbors_id)
        print('the '+str(k)+' books closest to <'+current_playlist_name+'>:\n')
        for playlist_name in playlist_neighbors_name:
            print(playlist_name,name_id_dic[playlist_name])
    

          The function above implements the book recommender; the inline comments cover the details, so below I only explain the key helper functions.
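          To build intuition for what get_neighbors does for a user-based KNN model, here is a toy, standard-library-only illustration (an illustrative sketch, not surprise's actual implementation): rank the other users by cosine similarity over co-rated items and keep the k closest.

```python
import math

# Tiny hypothetical user -> {book: rating} table for illustration.
ratings = {
    'u1': {'b1': 5, 'b2': 3, 'b3': 4},
    'u2': {'b1': 4, 'b2': 3, 'b3': 5},
    'u3': {'b1': 1, 'b2': 5},
    'u4': {'b2': 3, 'b3': 4},
}

def cosine(a, b):
    # cosine similarity restricted to the items both users rated
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common))
           * math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den

def get_neighbors(user, k):
    # score every other user, highest similarity first
    scored = sorted(((cosine(ratings[user], ratings[o]), o)
                     for o in ratings if o != user), reverse=True)
    return [o for _, o in scored[:k]]

print(get_neighbors('u1', 2))  # ['u4', 'u2']
```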

           Model initialization module:

    def initModel(flag='NormalPredictor'):
        '''
        Select among several recommendation algorithms
        '''
        if flag=='NormalPredictor':  # random predictions from the rating distribution
            return NormalPredictor()
        elif flag=='BaselineOnly':  # baseline estimates only
            return BaselineOnly()
        elif flag=='KNNBasic':  # basic collaborative filtering
            return KNNBasic()
        elif flag=='KNNWithMeans':  # collaborative filtering using mean ratings
            return KNNWithMeans()
        elif flag=='KNNBaseline':  # collaborative filtering with a baseline
            return KNNBaseline()
        elif flag=='SVD':  # SVD matrix factorization
            return SVD()
        elif flag=='SVDpp':  # SVD++
            return SVDpp()
        elif flag=='NMF':  # non-negative matrix factorization
            return NMF()
        else:
            return SVD()
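
    The if/elif chain above can also be written as a dict-based dispatch table. A self-contained sketch, with empty placeholder classes standing in for the surprise algorithms so it runs without the library:

```python
# Placeholders for surprise's algorithm classes (illustration only).
class NormalPredictor: pass
class KNNBasic: pass
class SVD: pass

MODELS = {'NormalPredictor': NormalPredictor, 'KNNBasic': KNNBasic, 'SVD': SVD}

def init_model(flag='SVD'):
    # unknown flags fall back to SVD, matching the else branch above
    return MODELS.get(flag, SVD)()
```

`init_model('KNNBasic')` returns a `KNNBasic` instance, while `init_model('bogus')` falls back to `SVD`.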
    

          Model-building module:

    def buildModel(data_path='rating.csv',data_format='user item rating',sep=',',flag='KNNBasic'):
        '''
        Build and train the recommendation model
        '''
        # build the reader
        reader=Reader(line_format=data_format,sep=sep)
        mydata=Dataset.load_from_file(data_path,reader=reader)
        # build the full trainset (item similarities are computed over it)
        train_set=mydata.build_full_trainset()
        print('================model training================')
        model=initModel(flag=flag)
        model.fit(train_set)
        return model,mydata
    

        Id-name mapping module:

    def bookDataMapping(data_path='book.csv'):
        '''
        Build id->name and name->id dictionaries from the raw "id,...,title" rows
        '''
        csv_reader=csv.reader(open(data_path))
        id_name_dic,name_id_dic={},{}
        for row in csv_reader:
            id_name_dic[row[0]]=row[10]  # column 10 is the book title
            name_id_dic[row[10]]=row[0]
        return id_name_dic, name_id_dic
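
    A quick standard-library demonstration of the same mapping logic on an in-memory sample. Note one quirk of the function as written: it never skips the CSV header, so the header row is mapped too ('id' -> 'title'):

```python
import csv
import io

# First two lines of book.csv, truncated to the first 11 columns.
sample = io.StringIO(
    'id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,'
    'original_publication_year,original_title,title\n'
    '1,2767052,2767052,2792775,272,439023483,9780439023481,Suzanne Collins,'
    '2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)"\n'
)
id_name, name_id = {}, {}
for row in csv.reader(sample):
    id_name[row[0]] = row[10]   # column 10 holds the title
    name_id[row[10]] = row[0]
print(id_name['1'])   # The Hunger Games (The Hunger Games, #1)
print(id_name['id'])  # title  (the un-skipped header row)
```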
    

        A simple call looks like this:

    bookRecommendSystem(map_data='book.csv',train_data='RRR.csv',data_format='user item rating',sep=',',flag='KNNBasic',k=10)

         KNN is used by default, evaluated with 5-fold cross-validation; the detailed output is:

    ------------
    Fold 1
    Computing the msd similarity matrix...
    Done computing similarity matrix.
    RMSE: 0.9211
    MAE:  0.7108
    FCP:  0.7038
    ------------
    Fold 2
    Computing the msd similarity matrix...
    Done computing similarity matrix.
    RMSE: 0.9211
    MAE:  0.7093
    FCP:  0.6996
    ------------
    Fold 3
    Computing the msd similarity matrix...
    Done computing similarity matrix.
    RMSE: 0.9234
    MAE:  0.7133
    FCP:  0.7010
    ------------
    Fold 4
    Computing the msd similarity matrix...
    Done computing similarity matrix.
    RMSE: 0.9210
    MAE:  0.7119
    FCP:  0.7017
    ------------
    Fold 5
    Computing the msd similarity matrix...
    Done computing similarity matrix.
    RMSE: 0.9268
    MAE:  0.7167
    FCP:  0.6983
    ------------
    ------------
    Mean RMSE: 0.9227
    Mean MAE : 0.7124
    Mean FCP : 0.7009
    ------------
    ------------
    ==================model performance===================
    defaultdict(<type 'list'>, {u'fcp': [0.703847835793307, 0.6995619798679573, 0.7009530691688108, 0.7017142119961722, 0.6982634284783771], u'mae': [0.7107821167817494, 0.7093057204220446, 0.7132818148803571, 0.7119004793330316, 0.7167381500990199], u'rmse': [0.921100168545926, 0.9210542860057216, 0.9234120678271927, 0.9209873056186509, 0.9267740800608146]})
    current user id: 1239
    current book name: Chronicle of a Death Foretold
    current inner user id: 537
    the 10 books closest to <Chronicle of a Death Foretold>:
    ('Frostbite (Vampire Academy, #2)', '384')
    ('The Call of the Wild', '375')
    ('The Knife of Never Letting Go (Chaos Walking, #1)', '1050')
    ('The Neverending Story', '877')
    ('Lord of the Flies', '28')
    ('Olive Kitteridge', '930')
    ('Twenty Thousand Leagues Under the Sea', '699')
    ("1st to Die (Women's Murder Club, #1)", '336')
    ('The Big Short: Inside the Doomsday Machine', '985')
    ('The Black Echo (Harry Bosch, #1; Harry Bosch Universe, #1)', '902')
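
    The reported means can be reproduced directly from the per-fold numbers above:

```python
# Per-fold metrics copied from the cross-validation log above.
rmse = [0.9211, 0.9211, 0.9234, 0.9210, 0.9268]
mae = [0.7108, 0.7093, 0.7133, 0.7119, 0.7167]
fcp = [0.7038, 0.6996, 0.7010, 0.7017, 0.6983]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(rmse), 4), round(mean(mae), 4), round(mean(fcp), 4))
# 0.9227 0.7124 0.7009 -- matching the Mean RMSE / MAE / FCP lines
```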
    

        The above covers everything from parsing the raw data to the finished book recommender. Next I build a movie recommender on a dataset bundled with Surprise; technically it is almost identical to the book recommender, the only difference being the dataset.

         The dataset here is the built-in ml-100k set, which can also be found online. (Dataset screenshot omitted.)

       A sample of u.data:

    196	242	3	881250949
    186	302	3	891717742
    22	377	1	878887116
    244	51	2	880606923
    166	346	1	886397596
    298	474	4	884182806
    115	265	2	881171488
    253	465	5	891628467
    305	451	3	886324817
    6	86	3	883603013
    62	257	2	879372434
    286	1014	5	879781125
    200	222	5	876042340
    210	40	3	891035994
    224	29	3	888104457
    303	785	3	879485318
    122	387	5	879270459
    194	274	2	879539794
    291	1042	4	874834944
    234	1184	2	892079237
    119	392	4	886176814
    167	486	4	892738452
    299	144	4	877881320
    291	118	2	874833878
    308	1	4	887736532
    95	546	2	879196566
    38	95	5	892430094
    102	768	2	883748450
    63	277	4	875747401
    160	234	5	876861185
    50	246	3	877052329
    301	98	4	882075827
    225	193	4	879539727
    290	88	4	880731963
    97	194	3	884238860
    157	274	4	886890835
    181	1081	1	878962623
    278	603	5	891295330
    276	796	1	874791932
    7	32	4	891350932
    10	16	4	877888877
    284	304	4	885329322
    201	979	2	884114233
    276	564	3	874791805
    287	327	5	875333916
    246	201	5	884921594
    242	1137	5	879741196
    249	241	5	879641194
    99	4	5	886519097
    178	332	3	882823437
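    Each u.data line is a tab-separated user id, item id, rating, timestamp record; parsing it needs nothing beyond the standard library:

```python
# Two of the sample lines above, parsed into typed tuples.
lines = ['196\t242\t3\t881250949', '186\t302\t3\t891717742']
records = [tuple(int(x) for x in line.split('\t')) for line in lines]
print(records[0])  # (196, 242, 3, 881250949)
```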

        A sample of u.item:

    1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
    2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
    5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
    6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
    7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
    8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1|1|0|0|1|0|0|0|0|0|0|0|0|0|0
    9|Dead Man Walking (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Dead%20Man%20Walking%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
    10|Richard III (1995)|22-Jan-1996||http://us.imdb.com/M/title-exact?Richard%20III%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0
    11|Seven (Se7en) (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Se7en%20(1995)|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|1|0|0
    12|Usual Suspects, The (1995)|14-Aug-1995||http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995)|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|1|0|0
    13|Mighty Aphrodite (1995)|30-Oct-1995||http://us.imdb.com/M/title-exact?Mighty%20Aphrodite%20(1995)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
    14|Postino, Il (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Postino,%20Il%20(1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|1|0|0|0|0
    15|Mr. Holland's Opus (1995)|29-Jan-1996||http://us.imdb.com/M/title-exact?Mr.%20Holland's%20Opus%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
    16|French Twist (Gazon maudit) (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Gazon%20maudit%20(1995)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0|0|0|0
    17|From Dusk Till Dawn (1996)|05-Feb-1996||http://us.imdb.com/M/title-exact?From%20Dusk%20Till%20Dawn%20(1996)|0|1|0|0|0|1|1|0|0|0|0|1|0|0|0|0|1|0|0
    18|White Balloon, The (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Badkonake%20Sefid%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
    19|Antonia's Line (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Antonia%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
    20|Angels and Insects (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Angels%20and%20Insects%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|1|0|0|0|0

        The bundled README describes each file in the archive:

    Here are brief descriptions of the data.
    
    ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                    gunzip ml-data.tar.gz
                    tar xvf ml-data.tar
                    mku.sh
    
    u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
                  Each user has rated at least 20 movies.  Users and items are
                  numbered consecutively from 1.  The data is randomly
                  ordered. This is a tab separated list of 
    	         user id | item id | rating | timestamp. 
                  The time stamps are unix seconds since 1/1/1970 UTC   
    
    u.info     -- The number of users, items, and ratings in the u data set.
    
    u.item     -- Information about the items (movies); this is a tab separated
                  list of
                  movie id | movie title | release date | video release date |
                  IMDb URL | unknown | Action | Adventure | Animation |
                  Children's | Comedy | Crime | Documentary | Drama | Fantasy |
                  Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
                  Thriller | War | Western |
                  The last 19 fields are the genres, a 1 indicates the movie
                  is of that genre, a 0 indicates it is not; movies can be in
                  several genres at once.
                  The movie ids are the ones used in the u.data data set.
    
    u.genre    -- A list of the genres.
    
    u.user     -- Demographic information about the users; this is a tab
                  separated list of
                  user id | age | gender | occupation | zip code
                  The user ids are the ones used in the u.data data set.
    
    u.occupation -- A list of the occupations.
    
    u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
    u1.test       are 80%/20% splits of the u data into training and test data.
    u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
    u2.test       5 fold cross validation (where you repeat your experiment
    u3.base       with each training and test set and average the results).
    u3.test       These data sets can be generated from u.data by mku.sh.
    u4.base
    u4.test
    u5.base
    u5.test
    
    ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
    ua.test       split the u data into a training set and a test set with
    ub.base       exactly 10 ratings per user in the test set.  The sets
    ub.test       ua.test and ub.test are disjoint.  These data sets can
                  be generated from u.data by mku.sh.
    
    allbut.pl  -- The script that generates training and test sets where
                  all but n of a users ratings are in the training data.
    
    mku.sh     -- A shell script to generate all the u data sets from u.data.
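    One caveat: although the README calls u.item a tab-separated list, the file shown above actually uses | as the delimiter. With the genre order from the README, a line decodes like this (standard library only):

```python
# Genre names in the order listed in the README.
GENRES = ['unknown', 'Action', 'Adventure', 'Animation', "Children's",
          'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
          'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']

# The Toy Story line from the u.item sample above.
line = ('1|Toy Story (1995)|01-Jan-1995||'
        'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|'
        '0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0')
fields = line.split('|')
movie_id, title = fields[0], fields[1]
flags = fields[-19:]  # the last 19 fields are the genre flags
genres = [g for g, f in zip(GENRES, flags) if f == '1']
print(movie_id, title, genres)
```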

         Once familiar with the dataset, we can start modeling. First, build the id/name mapping dictionaries:

    def dataMapping(data='item.txt'):
        '''
        Build id->name and name->id mapping dictionaries
        '''
        id_name_dict,name_id_dict={},{}
        with open(data) as f:
            data_list=[one_line.strip().split('|') for one_line in f.readlines() if one_line]
        for one_list in data_list:
            id_name_dict[one_list[0]]=one_list[1]
            name_id_dict[one_list[1]]=one_list[0]
        return id_name_dict,name_id_dict
    

          The rest mirrors the book recommender, so here is the code without further commentary:

    def movieRecommendSystem():
        '''
        Movie recommender system
        '''
        # build the trainset and the KNN model
        movie_data=Dataset.load_builtin('ml-100k')
        trainset=movie_data.build_full_trainset()
        algo=KNNBasic()
        algo.fit(trainset)
        # build the id/name mappings
        id_name_dict,name_id_dict=dataMapping(data='item.txt')
        # recommend movies, starting from Army of Darkness (1993)
        raw_id=name_id_dict['Army of Darkness (1993)']  # raw item id
        inner_id=algo.trainset.to_inner_iid(raw_id)  # convert to the model's inner id
        neighbors=algo.get_neighbors(inner_id,10)  # the 10 nearest neighbours
        res_ids=[algo.trainset.to_raw_iid(_id) for _id in neighbors]  # inner ids back to raw movie ids
        movies=[id_name_dict[raw_id] for raw_id in res_ids]  # movie titles
        print('========================the 10 most similar movies:========================')
        for movie in movies:
            print(name_id_dict[movie],'==========>',movie)
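
    The to_inner_iid / to_raw_iid pair translates between raw ids from the data files and the dense integer indices the model uses internally. A minimal sketch of that translation (an illustrative assumption, not surprise's real internals):

```python
class TrainsetIds:
    """Maps raw item ids (strings from the data files) to dense inner ids."""
    def __init__(self, raw_ids):
        # assign each raw id a consecutive integer in insertion order
        self._raw2inner = {r: i for i, r in enumerate(raw_ids)}
        self._inner2raw = {i: r for r, i in self._raw2inner.items()}

    def to_inner_iid(self, raw_id):
        return self._raw2inner[raw_id]

    def to_raw_iid(self, inner_id):
        return self._inner2raw[inner_id]

ids = TrainsetIds(['242', '486', '88'])
print(ids.to_inner_iid('486'))  # 1
print(ids.to_raw_iid(1))        # 486
```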
    

      Running it on the actual data gives the following recommendations:

    Done computing similarity matrix.
    ========================the 10 most similar movies:========================
    242 ==========> Kolya (1996)
    486 ==========> Sabrina (1954)
    88 ==========> Sleepless in Seattle (1993)
    603 ==========> Rear Window (1954)
    20 ==========> Angels and Insects (1995)
    479 ==========> Vertigo (1958)
    1336 ==========> Kazaam (1996)
    673 ==========> Cape Fear (1962)
    568 ==========> Speed (1994)
    623 ==========> Angels in the Outfield (1994)
    

       This article uses KNN as the default recommendation model throughout, but Surprise provides many other models and tools; SVD matrix factorization is one of the most widely used. (A simple KNN-vs-SVD performance comparison chart is omitted.)

       That concludes the machine-learning part of this recommendation practice; feel free to experiment with the datasets and code above. Next up is a music recommender built on a deep-learning model.

         First, the dataset. It was collected from NetEase Cloud Music; because of copyright and the site's official terms, the crawler code cannot be made public. If you genuinely need the data, contact me for the experimental dataset, to be used for academic research only and nothing else. Below is an initial look at the data.
          A sample of the song id-name mapping dataset:

    4875306,逍遥叹-胡歌
    376417,一生有你-水木年华
    177575,让我一次爱个够-庾澄庆
    5255987,你若成风-许嵩
    165375,专属味道-汪苏泷
    63650,独家记忆-陈小春
    191819,当你孤单你会想起谁-张栋梁
    504826080,路过南京,路过你-江皓南
    408250378,写给我第一个喜欢的女孩的歌-西瓜Kune
    505451285,青春住了谁-杨丞琳
    31426805,Tell Her You Belong To Me-Beth Hart
    5094255,So Nice-Jim Tomlinson
    27008758,Angel-Randy Crawford
    427416048,Boom Boom Baby-Sean Hayes
    458496129,moonlight.-Sleep2.
    1308441,What a Difference the Day Made-Eddie Higgins
    3163956,Moon Song-Norah Jones
    2391318,Soledad-Concha Buika
    2639938,The Girl From Impanema-Gabriela Anders
    16952047,More-Matt Dusk
    536680802,江南雨巷-绯村柯北
    463425816,倾杯有酒-晃儿
    35956497,一衫轻纱-陈浩东
    34341487,寒江雪-Braska
    454717839,宿雨冷-老虎欧巴
    479177946,巷雨梨花-涵昱
    530986445,晴川雪-银临
    31062973,执伞-吾恩
    29984203,执伞待人归-而已
    30352430,旧诗行-只有影子
    461347998,Something Just Like This-The Chainsmokers
    411314681,This Is What You Came For -Calvin Harris
    460043372,It Ain't Me-Kygo
    515269424,Wolves-Selena Gomez
    422132237,Cold Water-Major Lazer
    521416693,So Far Away-Martin Garrix
    461518855,Stay-Zedd
    474581010,BOOM-Tiësto
    420922950,Let Me Love You-DJ Snake
    31370725,Say My Name (Kids Want Techno Remix)-Odesza
    28310930,涩-纣王老胡
    437755447,途-倪健
    490106148,山下-方拾贰
    30635613,秋酿-房东的猫
    417859220,皆非-马頔
    399340140,来信-陈鸿宇
    443967407,短叹-房东的猫
    436514312,成都-赵雷
    29572804,傲寒-马頔
    408814900,借我-谢春花
    29482203,風巻立つ-増田俊郎
    32743519,江上清风游-变奏的梦想
    32743521,明月逐人归-变奏的梦想
    31477886,人闲桂花静-F.Be.I
    785507,天照大御神-Musical Jarβ
    683826,春よ、来い-松任谷由実
    507152393,既听云深-秩厌
    428203067,行雲流水-流派未月亭
    509466710,饮酒赋诗-韦卓成
    31649696,千年の風-天地雅楽
    488953797,You Might Be (GoldFish Remix)-Autograf
    524149482,Sunset City-Andreas Phazer
    485612576,Creep-Gamper & Dadoni
    451703286,Burn (Gryffin Remix)-Gryffin
    32238090,Animals (Gryffin Remix)-Maroon 5
    464674974,Baby Boy (Famba Remix)-Famba
    34690580,How Deep Is Your Love (Liva K Remix)-Liva K
    407002710,Desire (Gryffin Remix)-Years & Years
    443292315,Deep Of The Night (Extended Mix)-Goldfish
    451701288,Am I Wrong (Gryffin Remix)-Gryffin
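As an illustration, a helper in the spirit of the `dataMapping` function used earlier (this exact implementation is my own sketch, not the article's) can turn such `id,name` lines into the two lookup dicts used for recommendations:

```python
def build_mappings(lines):
    """Build id->name and name->id dicts from 'id,name' lines."""
    id_name, name_id = {}, {}
    for line in lines:
        if not line.strip():
            continue
        # split only on the first comma: song names may themselves contain commas
        _id, name = line.strip().split(',', 1)
        id_name[_id] = name
        name_id[name] = _id
    return id_name, name_id

sample = ['4875306,逍遥叹-胡歌', '376417,一生有你-水木年华']
id_name, name_id = build_mappings(sample)
print(id_name['4875306'])   # -> 逍遥叹-胡歌
```

The `split(',', 1)` detail matters here: entries such as “路过南京,路过你-江皓南” contain a comma inside the name.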

A sample of the playlist id-name mapping data:

    2106881647,寒假都快结束了,暑假为什么不接上?
    2095806875,日系治愈男声,逐步陷入温柔的世界
    2089907261,『無前奏 | 女嗓』开口即跪 心醉神迷.
    2099302296,纯音乐|钢琴与旧书页,会跳舞的黑白键
    2099228567,Bedroom-pop丨天马行空的绮梦
    2092484970,日系治愈|请问您今天要来点少女心吗?
    2091038787,如果可以,我想和古人谈一场恋爱
    2093273437,人声后摇 I 怎奈何琼夜一嗓伶俜
    2092474396,「抒情摇滚」过去的时光与黯然的诗
    2097548733,放完寒假,还是要继续追逐梦想
    2095135687,2018年韩国平昌冬奥会花样滑冰比赛BGM
    2129450612,有没有一首歌会让你想起周华健
    2097045424,音室Vol.4丨细 数 一 些 旧 时 光
    2094937822,2018年平昌冬奥会花样滑冰音乐选曲爆点精选
    2088799380,日系温柔女声丨沐浴在歌声中的暖阳下
    2084738832,「赶走阴霾」来一首欢快的欧美小调
    2092015892,我们在苏打绿的小宇宙里再遇见
    2081400147,【妖说】沧海桑田 只待君归
    2081133164,古风词作‖他们只是喜欢用歌的方式来讲故事
    2083911279,【Kobalt推荐】清晨最温柔的旋律
    2087725103,【超级碗2018】贾老板的中场盛事
    2082564528,听什么歌都像在唱自己
    2088028082,『曲作精选』细数古风圈原创作曲人❷
    2075587022,助眠集 | 自然音,伴灵动乐符萦绕耳畔
    2086732756,「深度睡眠」音符伴你入梦 愿你一夜好眠
    2074273616,想要办一场古风婚礼,许一世天地作嫁
    2075961982,告白恋语|喜欢你,直至生命最后一刻
    2074681032,「古风精选」你眼中是江湖 我眼中是你
    2086647823,最强大脑第五季BGM
    2077299279,『热血街舞团』参演曲目及出场BGM
    2088338811,『 孝利家民宿2 』允儿篇
    2078747658,『这!就是街舞』最全BGM合集持更
    2076551016,那片蔚蓝的天空静止了。
    2074505134,我们在寒冷的冬天停止不了摇滚与热吻
    2076170419,▶ 2018年欧美流行新歌速递
    2076565475,偶像练习生  参赛曲目 ( 3.23 New )
    2073678124,和声优谈恋爱什么感觉丨日本男声优撩妹现场
    2071490150,百首日系治愈 呼唤你的心灵 呼喊你的名字
    2073263803,[萌系/俏皮小调]不期而至的心动❤
    2069140707,恋爱系Melody,不知觉已陷入暗恋之心
    2065854146,古典清香 I 我的茶馆里住着巴赫与肖邦
    2062601053,日系摇滚『东京日和° 一番街的清冽少年』
    2069998416,Moombahton:异域风情的Drop最强音
    2064112746,那些年我们听错过的歌词「古风版」
    2067901040,空灵澄澈|梦寻空中花园
    2061447491,综艺《这!就是街舞》BGM合辑
    2061240468,「恬静英文」愿你酣然入梦
    2069189356,『Trap』低频轰炸机 令人上瘾的黑暗氛围
    2065668633,「提神醒脑」学习工作健身游戏必备
    2065515420,综艺《偶像练习生》BGM合辑
    2062160307,好莱坞黄金时代|歌梦盛世
    2060642136,長眠這裡吧妳已经沒有活下去的理由.
    2055571883,小姐姐搭档电子乐,声控党新潮的标配~
    2058497430,只是想安静的 享受那份呼吸的感觉。o・°。
    2063413626,流行禁区 • 摄人心魄的性感律动
    2066200314,「无前奏」喜欢无需铺垫 一秒便沦陷
    2059430694,痛彻心扉地哭,然后刻骨铭心地记住
    2063777060,「前奏沦陷」●迷醉在Absolut伏特加中
    2057752377,「古风」歌暖如茶,满城花开
    2054127850,节奏向|原谅我这一生不羁放纵爱自由
    2062522327,2018年冬季新番音乐之旅
    2060692794,【Jazz Blues】爵士乐句演绎12小节布鲁斯
    2059412026,华语 | 80/90都听过的经典老歌【怀旧篇】
    2052793999,(旋律控)|可可布朗尼般甜蜜
    2065782856,〖偶像练习生〗丨参赛曲目合辑(持更…)
    2048970456,百首日系抒情,总有一首触动你的心
    2049903536,日系对唱丨聆听他们绽放在青春的美好
    2041615881,『曲作精选』细数古风圈原创作曲人❶
    2040074016,「女声控」音色沁人心 旋律美如画
    2044527707,降燥八音盒 你要来一杯柠檬薄荷苏打水吗?
    2050704516,2018全年抖腿指南,老铁你怕了吗?
    2047743292,♪V家歌姬唱英文的时候♪
    2047424322,粤语女声 I 故事太多 没人会听你诉说
    2042009605,你绝对不应该错过的100首英文歌
    2042006896,华语 | 听歌最怕应景,触景最怕生情
    2059465574,韩语 | 一听就会中毒的韩文歌
    2055505250,「情书予你」江湖太远,我就不去了。

A sample of the user rating data (user id, song id, score, timestamp):

    2148086011,32743519,100.0,1433001600000
    2148086011,32743521,95.0,1433001600000
    2148086011,31477886,95.0,1295712000000
    2148086011,785507,90.0,1293724800000
    2148086011,683826,100.0,1265126400000
    2148086011,507152393,80.0,1505725430579
    2148086011,428203067,95.0,1471017600000
    2148086011,509466710,70.0,1506495600000
    2148086011,31649696,90.0,1216310400000
    2139118830,488953797,100.0,1499356800007
    2139118830,524149482,100.0,1512662400007
    2139118830,485612576,100.0,1497974400007
    2139118830,451703286,100.0,1395763200007
    2139118830,32238090,100.0,1431878400007
    2139118830,464674974,95.0,1489334400007
    2139118830,34690580,100.0,1441987200007
    2139118830,407002710,100.0,1458319287803
    2139118830,443292315,95.0,1480003200007
    2139118830,451701288,100.0,1407772800007
    2139491961,287035,100.0,1172678400000
    2139491961,254485,100.0,965059200000
    2139491961,254432,100.0,1038672000000
    2139491961,224000,100.0,978278400000
    2139491961,210281,100.0,1295913600000
    2139491961,186010,100.0,1067616000000
    2139491961,188674,100.0,975600000000
    2139491961,375394,100.0,1130774400000
    2139491961,27747329,100.0,1379952000007
    2139491961,25641873,100.0,1104422400007
    2139305008,186345,100.0,962380800000
    2139305008,234841,100.0,892051200007
    2139305008,187564,100.0,938707200000
    2139305008,187600,100.0,938707200000
    2139305008,143474,100.0,539107200000
    2139305008,188222,100.0,691516800000
    2139305008,156427,100.0,1185897600000
    2139305008,153784,100.0,765129600000
    2139305008,5242750,100.0,1262275200000
    2139305008,194186,100.0,741456000000
    2144281377,108251,100.0,1291737600000
    2144281377,507815173,100.0,1505983872874
    2144281377,375100,100.0,1178812800000
    2144281377,385973,100.0,1122825600000
    2144281377,186021,100.0,1059580800000
    2144281377,29343809,100.0,1410278400007
    2144281377,82203,100.0,1230307200000
    2144281377,25699094,100.0,1199980800007
    2144281377,254045,100.0,1344556800000
    2144281377,287251,100.0,1128096000000
    2139324915,108390,100.0,1256832000000
    2139324915,375394,100.0,1130774400000
    2139324915,254574,100.0,941385600000
    2139324915,190072,100.0,975600000000
    2139324915,186560,100.0,891360000000
    2139324915,168089,100.0,1038672000000
    2139324915,110400,100.0,817747200000
    2139324915,287035,100.0,1172678400000
    2139324915,32507038,100.0,1433433600007
    2139324915,126946,100.0,1104508800004
    2139566312,2182015,95.0,1208822400000
    2139566312,26256399,85.0,1208822400000
    2139566312,526935207,80.0,1199808000007
    2139566312,526935208,80.0,1199808000007
    2139566312,527013149,75.0,1236787200007
    2139566312,527013150,80.0,1236787200007
    2139566312,526977909,65.0,1227542400007
    2139566312,526977910,55.0,1227542400007
    2139566312,527013280,75.0,1238428800007
    2139566312,25646006,80.0,1220918400000
    2140187381,486473539,100.0,1358438400007
    2140187381,426881088,100.0,589042800000
    2140187381,426881089,95.0,589042800000
    2140187381,426881090,90.0,589042800000
    2140187381,426881091,85.0,589042800000
    2140187381,426881092,85.0,589042800000
    2140187381,426881093,85.0,589042800000
    2140187381,426881094,80.0,589042800000
    2140187381,426881095,80.0,589042800000
    2140187381,426881096,75.0,589042800000
    2128755383,459089,100.0,1104508800000
    2128755383,23039253,100.0,1349049600000
    2128755383,459093,100.0,1104508800000
    2128755383,29744089,100.0,1416240000000
    2128755383,459097,100.0,1104508800000
    2128755383,21994019,95.0,1175443200000
    2128755383,21993923,95.0,1247500800000
    2128755383,459101,95.0,1104508800000
    2128755383,5044797,95.0,1194883200007
    2128755383,459105,90.0,1104508800000
    2131958935,29787426,100.0,1395014400000
    2131958935,16823382,100.0,1141660800000
    2131958935,427542109,100.0,1474848000000
    2131958935,33255655,100.0,1426348800007
    2131958935,20953761,100.0,992016000000
    2131958935,34376545,100.0,1453518281446
    2131958935,4175444,100.0,1275580800007
    2131958935,34229976,100.0,1380643200000
    2131958935,432464943,100.0,1475259769146
    2131958935,33211676,100.0,1416672000000
    2132468088,534065427,100.0,1516896000007
    2132468088,26209672,95.0,1364313600007
    2132468088,4936840,90.0,1274803200000
    2132468088,33367836,90.0,1421769600000
    2132468088,541480254,90.0,1509033600007
    2132468088,546730516,75.0,1414080000007
    2132468088,4920894,85.0,1324396800000
    2132468088,34928242,85.0,1443024000007
    2132468088,28636414,85.0,1364486400000
    2132468088,536622468,80.0,1498752000007
    2129467457,277382,100.0,1014912000000
    2129467457,108640,100.0,941385600000
    2129467457,277822,100.0,875635200000
    2129467457,277817,100.0,875635200000
    2129467457,277820,100.0,875635200000
    2129467457,277759,100.0,943977600000
    2129467457,277804,100.0,896630400000
    2129467457,277586,100.0,970329600000
    2129467457,276939,100.0,1214323200000
    2129467457,277836,100.0,820425600000
    2134421380,4877892,100.0,1007136000000
    2134421380,4877894,100.0,1007136000000
    2134421380,4873340,100.0,1208448000000
    2134421380,4873343,100.0,1208448000000
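Judging from the sample above, each row is `userId,songId,score,timestamp`, with the score on a 0-100 scale and the timestamp in milliseconds; a minimal parsing-and-scaling sketch (the field interpretation is inferred from the data, not an official schema):

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Parse one 'userId,songId,score,ts_ms' row into typed fields."""
    user_id, song_id, score, ts_ms = line.strip().split(',')
    return {
        'user_id': user_id,
        'song_id': song_id,
        'score': float(score) / 100.0,   # min-max scale 0-100 -> 0-1
        'time': datetime.fromtimestamp(int(ts_ms) / 1000.0, tz=timezone.utc),
    }

sample = '2148086011,32743519,100.0,1433001600000'
row = parse_rating_line(sample)
print(row['score'])   # -> 1.0
```

Scaling the score into [0, 1] (or binning it into discrete grades) is a typical preprocessing step before the ratings are fed to a neural model.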

The architecture of the deep-learning-based music recommendation system is shown below:

The diagram above outlines the workflow of this article's music recommender and should be fairly easy to follow. Next, let's look at the concrete implementation:

    def dataPre(one_line):
        '''
        Strip punctuation and other noise characters from one line
        '''
        sigmod_list=[',','。','(',')','-','——','\n','“','”','*','#','《','》','、','[',']','(',')','-',
                       '.','/','】','【','……','!','!',':',':','…','@','~@','~','「一」','「','」',
                    '?','"','?','~','_',' ',';','◆','①','②','③','④','⑤','⑥','⑦','⑧','⑨','⑩',
                    '⑾','⑿','⒀','⒁','⒂','&amp;quot;',' ','/','·','…','!!!','】','!',',',
                    '。','[',']','【','、','?','/^/^','/^','”',')','(','~','》','《','。。。',
                    '=','⑻','⑴','⑵','⑶','⑷','⑸','⑹','⑺','…','|']
        for one_sigmod in sigmod_list:
            one_line=one_line.replace(one_sigmod,'')
        return one_line
    
    
    def seg(one_content,stopwords=None):
        '''
        Tokenize and remove stopwords
        one_content: a single song-name string
        stopwords: optional stopword list
        '''
        stopwords=stopwords or []
        segs=jieba.cut(one_content,cut_all=False)
        # in Python 3 jieba already yields str, so no utf-8 encoding step is needed
        seg_set=set(segs)-set(stopwords)
        return list(seg_set)
    
    
    def word2vecModel(con_list,model_path='my.model'):
        '''
        class gensim.models.word2vec.Word2Vec(sentences=None,size=100,alpha=0.025,window=5, min_count=5,max_vocab_size=None, sample=0.001,seed=1, workers=3,min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1,hashfxn=<built-in function hash>,iter=5,null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)
        Key parameters:
        1. sentences: a list of tokenized sentences; for large corpora prefer BrownCorpus, Text8Corpus or LineSentence.
        2. sg: training algorithm; 0 (default) selects CBOW, sg=1 selects skip-gram.
        3. size: dimensionality of the output word vectors, default 100; larger sizes need more training data but give better results (tens to hundreds is a good range).
        4. window: context window size; e.g. 8 considers 8 words before and after the center word (the code actually samples a random window up to this size), default 5.
        '''
        # note: gensim>=4.0 renamed `size` to `vector_size` and `iter` to `epochs`
        model=word2vec.Word2Vec(con_list,sg=1,size=100,window=5,min_count=1,
                                negative=3,sample=0.001, hs=1,workers=4)
        model.save(model_path)
        return con_list,model
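For reference, the skip-gram objective selected by `sg=1` maximizes the average log-probability of the context words around each center word:

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j}\mid w_t\right),
\qquad
p\left(w_O \mid w_I\right) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W}\exp\left({v'_w}^{\top} v_{w_I}\right)}
```

where $T$ is the corpus length, $c$ the window size, and $v$, $v'$ the input and output vectors of a word. In practice the full softmax is approximated by hierarchical softmax (`hs=1`) or negative sampling (`negative=3`), as configured above.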
    
    
    def songName2Words(data='neteasy_song_id_to_name_data.csv',save_path='music/songName.txt'):
        '''
        Tokenize the song-name data into "id|#|token/token/..." lines
        '''
        with open(data) as f:
            data_list=[one.strip().split(',') for one in f.readlines() if one]
        data_list.pop(0)  # drop the header row
        res_list=[]
        for one in data_list:
            musicId,content=one[0],''.join(one[1:])
            tmp=content.split('-')
            name,author=tmp[0].replace(' ',''),''.join(tmp[1:]).replace(' ','')
            name2=dataPre(name)
            author2=dataPre(author)
            cut_list=seg(name2)
            cut_list.append(author2)
            one_line=musicId+'|#|'+'/'.join(cut_list).strip()
            res_list.append(one_line)
        with open(save_path,'w') as f:
            for one in res_list:
                f.write(one.strip()+'\n')
    
    
    def song2Vec(data='music/songName.txt',model_path='music/song2Vec.model'):
        '''
        Train a word2vec model on the tokenized song names
        '''
        with open(data) as f:
            data_list=[one.strip() for one in f.readlines() if one]
        data=[]
        for i in range(len(data_list)):
            musicId,content=data_list[i].split('|#|')
            con_list=content.split('/')
            data.append(con_list)
        # train the model
        word2vecModel(data,model_path=model_path)
    
    
    def mergeMovie(userVec='movie/userVec.json',songVec='movie/movieVec.json',save_path='movie/dataset.json'):
        '''
        Concatenate the movie-dataset vectors: user vector + movie vector + rating
        '''
        with open(userVec) as U:
            user_vector=json.load(U)
        with open(songVec) as S:
            song_vector=json.load(S)
        # load the rating data
        with open('movie/ratings.dat') as f:
            data_list=[one.strip().split('::') for one in f.readlines()[:50000] if one]
        vector=[]
        for i in range(len(data_list)):
            one_list=[]
            userId,movieId,rating,T=data_list[i]
            try:
                userV=user_vector[userId]
                songV=song_vector[movieId]
                one_list+=userV
                one_list+=songV
                one_list.append(int(rating))
                vector.append(one_list)
            except Exception as e:
                print('Exception: ',e)
        # json.dumps returns str, so the file must be opened in text mode
        with open(save_path,'w') as f:
            f.write(json.dumps(vector))

The code above uses the word2vec tool to generate vector representations of songs and users for the downstream model. Let's take a quick look at the feature vectors:

    F0,F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,F11,F12,F13,F14,F15,F16,F17,F18,F19,F20,F21,F22,F23,F24,F25,F26,F27,F28,F29,F30,F31,F32,F33,F34,F35,F36,F37,F38,F39,F40,F41,F42,F43,F44,F45,F46,F47,F48,F49,F50,F51,F52,F53,F54,F55,F56,F57,F58,F59,F60,F61,F62,F63,F64,F65,F66,F67,F68,F69,F70,F71,F72,F73,F74,F75,F76,F77,F78,F79,F80,F81,F82,F83,F84,F85,F86,F87,F88,F89,F90,F91,F92,F93,F94,F95,F96,F97,F98,F99,F100,F101,F102,F103,F104,F105,F106,F107,F108,F109,F110,F111,F112,F113,F114,F115,F116,F117,F118,F119,F120,F121,F122,F123,F124,F125,F126,F127,F128,F129,F130,F131,F132,F133,F134,F135,F136,F137,F138,F139,F140,F141,F142,F143,F144,F145,F146,F147,F148,F149,F150,F151,F152,F153,F154,F155,F156,F157,F158,F159,F160,F161,F162,F163,F164,F165,F166,F167,F168,F169,F170,F171,F172,F173,F174,F175,F176,F177,F178,F179,F180,F181,F182,F183,F184,F185,F186,F187,F188,F189,F190,F191,F192,F193,F194,F195,F196,F197,F198,F199,rank
    -0.00905437208712101,-0.0007396119763143361,0.008964900858700275,0.007147061172872782,-0.002938180463388562,-0.006451033987104893,0.0006823862786404788,0.008351589553058147,-0.004327930044382811,-0.00039664877112954855,0.014263618737459183,0.0002304373774677515,0.007386152166873217,0.012727610766887665,-0.004937054589390755,0.010642743669450283,-0.0061443764716386795,-0.0030952980741858482,-0.000960861740168184,0.012492218986153603,0.004629271570593119,0.004056858364492655,-0.008649991825222969,-0.00019251192861702293,-0.0025250220205634832,0.000613546057138592,0.0020741717889904976,-0.0004906861577183008,-0.013737712986767292,-0.0005791311850771308,0.01548623014241457,0.007403503637760878,0.00670241005718708,0.00947426725178957,-0.003177322680130601,0.0032764971256256104,0.0006224086391739547,0.00381011632271111,0.004719317425042391,0.005109986290335655,0.0019342167070135474,0.004826852586120367,-0.0021173858549445868,-0.0022332491353154182,-0.0004247582401148975,-0.003798522986471653,0.00038840010529384017,0.0022172797471284866,-0.006280323024839163,-0.007924836128950119,0.00046939635649323463,0.007669319864362478,0.003541012294590473,0.006963915191590786,0.0031984655652195215,0.0007785331108607352,-0.006734368856996298,-0.00701282499358058,0.0065978821367025375,-0.0014829643769189715,0.007206542417407036,-0.002689479850232601,-0.007491654716432095,-0.007554797455668449,0.000426261976826936,-0.003307378152385354,-0.0013022092171013355,-0.002813913393765688,-0.005616216920316219,-0.005037547554820776,-0.005481714382767677,-0.007400141097605228,-0.006884533911943436,0.002671564696356654,-0.002913718344643712,-5.24939205206465e-05,0.007651910185813904,0.0009839265840128064,-0.0070876977406442165,-0.0006983017083257437,0.005889476742595434,-0.0011757116299122572,0.004702093079686165,-0.0011430471204221249,-0.004178674891591072,0.004804267082363367,-0.0023697437718510628,0.003897198010236025,-0.0142428083345294,-0.0080240648239851,-0.0027684057131409645,-0.006377904210239649,0.009644911624491215,0.013665839098393917,0.009657599963247776,0.012993580661714077,-0.004962602164596319,-0.01250487845391035,-0.0013530227588489652,0.001891392981633544,0.008389173075556755,-0.0021944893524050713,0.005254175513982773,-0.0010480922646820545,0.002954872790724039,0.005937131121754646,-0.012680958956480026,-0.0008859079098328948,0.006242068950086832,0.0013998642098158598,-0.0046497974544763565,0.004769282415509224,0.0024638278409838676,0.007718070410192013,-0.0030265755485743284,0.0020163003355264664,-0.0029373448342084885,0.0026505468413233757,0.0036762775853276253,-0.001057810615748167,0.010048923082649708,0.0023740148171782494,0.0021975813433527946,-0.00345548614859581,0.0007476559840142727,0.003124898299574852,0.005844129715114832,0.0013669220497831702,-0.00033092708326876163,-0.011868903413414955,-0.0028881989419460297,0.007996010594069958,-0.005054941400885582,-0.006188235245645046,-9.678240166977048e-05,0.004327509552240372,0.018809569999575615,0.002290343400090933,-0.0009504191693849862,-0.006280957255512476,-0.010345006361603737,-0.015455245971679688,0.0051696039736270905,0.011396088637411594,-0.014725587330758572,-0.0024055838584899902,0.0010706465691328049,-0.00768823828548193,-0.002703784964978695,-0.0015396936796605587,0.005851196125149727,0.0019892766140401363,0.0023974033538252115,0.005399889778345823,-0.012511718086898327,0.0017787865363061428,0.011473720893263817,0.003320086747407913,-0.005707764998078346,0.0031185653060674667,0.007588379550725222,0.009407034143805504,-0.0020049791783094406,-0.011117514222860336,0.008930917829275131,0.007140926085412502,0.00844403076916933,0.00846918299794197,-0.006229204125702381,-0.0028119836933910847,0.0020610331557691097,-0.006508949678391218,0.0027775117196142673,-0.0016185733256861567,0.005066308658570051,0.0014820161741226912,-0.013164183124899864,-0.01331096887588501,0.0032651261426508427,0.005103863310068846,-0.006894501857459545,0.004967343993484974,-0.010701723396778107,-0.005730992183089256,-0.0036964653991162777,0.012735491618514061,0.0025388621725142,0.003371666884049773,-0.015071623958647251,-0.010324200615286827,-0.0067957928404212,0.002820124151185155,-0.00553283654153347,0.0034825624898076057,0.007049893960356712,-0.0062140803784132,0.005579421296715736,-0.013508349657058716,0.0011655841954052448,-0.0015429649502038956,90.0
    -0.00905437208712101,-0.0007396119763143361,0.008964900858700275,0.007147061172872782,-0.002938180463388562,-0.006451033987104893,0.0006823862786404788,0.008351589553058147,-0.004327930044382811,-0.00039664877112954855,0.014263618737459183,0.0002304373774677515,0.007386152166873217,0.012727610766887665,-0.004937054589390755,0.010642743669450283,-0.0061443764716386795,-0.0030952980741858482,-0.000960861740168184,0.012492218986153603,0.004629271570593119,0.004056858364492655,-0.008649991825222969,-0.00019251192861702293,-0.0025250220205634832,0.000613546057138592,0.0020741717889904976,-0.0004906861577183008,-0.013737712986767292,-0.0005791311850771308,0.01548623014241457,0.007403503637760878,0.00670241005718708,0.00947426725178957,-0.003177322680130601,0.0032764971256256104,0.0006224086391739547,0.00381011632271111,0.004719317425042391,0.005109986290335655,0.0019342167070135474,0.004826852586120367,-0.0021173858549445868,-0.0022332491353154182,-0.0004247582401148975,-0.003798522986471653,0.00038840010529384017,0.0022172797471284866,-0.006280323024839163,-0.007924836128950119,0.00046939635649323463,0.007669319864362478,0.003541012294590473,0.006963915191590786,0.0031984655652195215,0.0007785331108607352,-0.006734368856996298,-0.00701282499358058,0.0065978821367025375,-0.0014829643769189715,0.007206542417407036,-0.002689479850232601,-0.007491654716432095,-0.007554797455668449,0.000426261976826936,-0.003307378152385354,-0.0013022092171013355,-0.002813913393765688,-0.005616216920316219,-0.005037547554820776,-0.005481714382767677,-0.007400141097605228,-0.006884533911943436,0.002671564696356654,-0.002913718344643712,-5.24939205206465e-05,0.007651910185813904,0.0009839265840128064,-0.0070876977406442165,-0.0006983017083257437,0.005889476742595434,-0.0011757116299122572,0.004702093079686165,-0.0011430471204221249,-0.004178674891591072,0.004804267082363367,-0.0023697437718510628,0.003897198010236025,-0.0142428083345294,-0.0080240648239851,-0.0027684057131409645,-0.006377904210239649,0.009644911624491215,0.013665839098393917,0.009657599963247776,0.012993580661714077,-0.004962602164596319,-0.01250487845391035,-0.0013530227588489652,0.001891392981633544,-0.022592157125473022,0.03406761959195137,-0.011294135823845863,-0.03557229042053223,0.006008323282003403,0.000521798268891871,0.04826105386018753,-0.012627901509404182,0.0129550751298666,-0.01717522367835045,0.013481661677360535,0.013365393504500389,-0.036065325140953064,-0.0078732343390584,-0.01379341073334217,0.015099458396434784,-0.00837993435561657,-0.018873106688261032,-0.0025675371289253235,-0.022314341738820076,-0.01362605020403862,-0.012553971260786057,0.014811022207140923,-0.019477257505059242,-0.007187745068222284,-0.0029132734052836895,-0.0005756211467087269,-0.018443219363689423,0.009171165525913239,0.008502017706632614,0.010730217210948467,-0.024739541113376617,0.023262105882167816,0.024543697014451027,-0.010074739344418049,0.023082096129655838,-0.05538586899638176,0.013144426979124546,-0.0034713721834123135,0.017734598368406296,0.002445317804813385,0.06830388307571411,-0.04603378847241402,-0.017839286476373672,0.02907683700323105,0.057052962481975555,-0.006418359465897083,0.04278741776943207,-0.018171297386288643,-0.039012014865875244,0.0031849518418312073,-0.048897162079811096,0.03169402852654457,-0.0037552379071712494,-0.0599692165851593,0.028303690254688263,-0.0270870141685009,-0.005421224981546402,0.02729225344955921,0.0005637332797050476,0.0309995636343956,0.0162825807929039,-0.01295829564332962,0.06263034790754318,0.014025387354195118,0.027204040437936783,-0.050342194736003876,-0.036244578659534454,-0.015098728239536285,0.005250332877039909,0.015661250799894333,0.004092647694051266,-0.001389537937939167,0.00746307335793972,-0.027304377406835556,-0.025669891387224197,0.050152361392974854,0.026778768748044968,-0.04022226482629776,0.03715263307094574,-0.007856795564293861,-0.007773062214255333,0.06832816451787949,0.02114926651120186,0.016750771552324295,-0.013472440652549267,0.010191161185503006,-0.014516017399728298,-0.0029776711016893387,0.0033395399805158377,-0.00017385557293891907,-0.00641307607293129,0.0037895026616752148,-0.033165350556373596,-0.05743199586868286,0.029474055394530296,-0.0629713237285614,0.058291152119636536,-0.0018130820244550705,-0.014646890573203564,95.0
    -0.00905437208712101,-0.0007396119763143361,0.008964900858700275,0.007147061172872782,-0.002938180463388562,-0.006451033987104893,0.0006823862786404788,0.008351589553058147,-0.004327930044382811,-0.00039664877112954855,0.014263618737459183,0.0002304373774677515,0.007386152166873217,0.012727610766887665,-0.004937054589390755,0.010642743669450283,-0.0061443764716386795,-0.0030952980741858482,-0.000960861740168184,0.012492218986153603,0.004629271570593119,0.004056858364492655,-0.008649991825222969,-0.00019251192861702293,-0.0025250220205634832,0.000613546057138592,0.0020741717889904976,-0.0004906861577183008,-0.013737712986767292,-0.0005791311850771308,0.01548623014241457,0.007403503637760878,0.00670241005718708,0.00947426725178957,-0.003177322680130601,0.0032764971256256104,0.0006224086391739547,0.00381011632271111,0.004719317425042391,0.005109986290335655,0.0019342167070135474,0.004826852586120367,-0.0021173858549445868,-0.0022332491353154182,-0.0004247582401148975,-0.003798522986471653,0.00038840010529384017,0.0022172797471284866,-0.006280323024839163,-0.007924836128950119,0.00046939635649323463,0.007669319864362478,0.003541012294590473,0.006963915191590786,0.0031984655652195215,0.0007785331108607352,-0.006734368856996298,-0.00701282499358058,0.0065978821367025375,-0.0014829643769189715,0.007206542417407036,-0.002689479850232601,-0.007491654716432095,-0.007554797455668449,0.000426261976826936,-0.003307378152385354,-0.0013022092171013355,-0.002813913393765688,-0.005616216920316219,-0.005037547554820776,-0.005481714382767677,-0.007400141097605228,-0.006884533911943436,0.002671564696356654,-0.002913718344643712,-5.24939205206465e-05,0.007651910185813904,0.0009839265840128064,-0.0070876977406442165,-0.0006983017083257437,0.005889476742595434,-0.0011757116299122572,0.004702093079686165,-0.0011430471204221249,-0.004178674891591072,0.004804267082363367,-0.0023697437718510628,0.003897198010236025,-0.0142428083345294,-0.0080240648239851,-0.0027684057131409645,-0.006377904210239649,0.009644911624491215,0.013665839098393917,0.009657599963247776,0.012993580661714077,-0.004962602164596319,-0.01250487845391035,-0.0013530227588489652,0.001891392981633544,0.019901975989341736,-0.015747029334306717,0.0033654612489044666,-0.03632471710443497,0.09456537663936615,0.0670727863907814,0.1920902580022812,0.04572567716240883,0.04062191769480705,0.1240314394235611,0.016196109354496002,0.07990884780883789,-0.09689442068338394,-0.06618727743625641,-0.041430652141571045,-0.0523345023393631,0.03341924399137497,-0.11789211630821228,-0.0317903570830822,-0.052808865904808044,-0.007092255167663097,0.02028701640665531,0.1428583711385727,-0.027414098381996155,-0.08579147607088089,0.00399013003334403,0.0071861702017486095,-0.0332927480340004,0.004831282421946526,0.030836742371320724,-0.10060083121061325,-0.022930579259991646,-0.04553624615073204,0.010508889332413673,-0.010346035473048687,0.023842399939894676,-0.0030776262283325195,-0.07267031073570251,-0.08600760996341705,0.07088109850883484,0.039998531341552734,0.0498877577483654,-0.13703669607639313,0.0275780837982893,0.118706613779068,0.09653228521347046,0.04980337619781494,0.09088883548974991,-0.0017204303294420242,-0.03217220678925514,0.005900798365473747,-0.09877994656562805,0.10114317387342453,0.004953338764607906,-0.033393122255802155,0.027972711250185966,-0.22143733501434326,-0.046712614595890045,-0.03027639538049698,0.03798357769846916,0.13159975409507751,-0.013990684412419796,0.003518206998705864,0.035273727029561996,0.10022589564323425,-0.008735567331314087,-0.04597346484661102,-0.04725199192762375,-0.04423960670828819,0.08597318828105927,0.020118530839681625,-0.02840239368379116,-0.037039615213871,0.026149779558181763,-0.07027353346347809,-0.030569007620215416,0.14087867736816406,0.04950963705778122,-0.08218803256750107,0.03112017922103405,0.10208556801080704,0.024767188355326653,0.1622713804244995,0.01481152419000864,0.017544297501444817,-0.14008687436580658,-0.0005397782661020756,-0.12663060426712036,0.06454144418239594,-0.02462538704276085,-0.08439943194389343,0.027440207079052925,-0.015044237487018108,-0.1196710616350174,-0.11459940671920776,0.0013232259079813957,-0.20382101833820343,0.07837190479040146,-0.01083550974726677,-0.1029856726527214,100.0
    -0.00905437208712101,-0.0007396119763143361,0.008964900858700275,0.007147061172872782,-0.002938180463388562,-0.006451033987104893,0.0006823862786404788,0.008351589553058147,-0.004327930044382811,-0.00039664877112954855,0.014263618737459183,0.0002304373774677515,0.007386152166873217,0.012727610766887665,-0.004937054589390755,0.010642743669450283,-0.0061443764716386795,-0.0030952980741858482,-0.000960861740168184,0.012492218986153603,0.004629271570593119,0.004056858364492655,-0.008649991825222969,-0.00019251192861702293,-0.0025250220205634832,0.000613546057138592,0.0020741717889904976,-0.0004906861577183008,-0.013737712986767292,-0.0005791311850771308,0.01548623014241457,0.007403503637760878,0.00670241005718708,0.00947426725178957,-0.003177322680130601,0.0032764971256256104,0.0006224086391739547,0.00381011632271111,0.004719317425042391,0.005109986290335655,0.0019342167070135474,0.004826852586120367,-0.0021173858549445868,-0.0022332491353154182,-0.0004247582401148975,-0.003798522986471653,0.00038840010529384017,0.0022172797471284866,-0.006280323024839163,-0.007924836128950119,0.00046939635649323463,0.007669319864362478,0.003541012294590473,0.006963915191590786,0.0031984655652195215,0.0007785331108607352,-0.006734368856996298,-0.00701282499358058,0.0065978821367025375,-0.0014829643769189715,0.007206542417407036,-0.002689479850232601,-0.007491654716432095,-0.007554797455668449,0.000426261976826936,-0.003307378152385354,-0.0013022092171013355,-0.002813913393765688,-0.005616216920316219,-0.005037547554820776,-0.005481714382767677,-0.007400141097605228,-0.006884533911943436,0.002671564696356654,-0.002913718344643712,-5.24939205206465e-05,0.007651910185813904,0.0009839265840128064,-0.0070876977406442165,-0.0006983017083257437,0.005889476742595434,-0.0011757116299122572,0.004702093079686165,-0.0011430471204221249,-0.004178674891591072,0.004804267082363367,-0.0023697437718510628,0.003897198010236025,-0.0142428083345294,-0.0080240648239851,-0.0027684057131409645,-0.006377904210239649,0.009644911624491215,0.013665839098393917,0.009657599963247776,0.012993580661714077,-0.004962602164596319,-0.01250487845391035,-0.0013530227588489652,0.001891392981633544,9.006005711853504e-05,-0.008629919961094856,0.0019712536595761776,0.00932253710925579,-0.007398911751806736,-0.0032335149589926004,-0.03696579858660698,0.0013150651939213276,0.0009624955127947032,-0.016642197966575623,0.0036938730627298355,0.00475682970136404,0.017125168815255165,0.011642811819911003,0.008950425311923027,0.0212385356426239,-0.01236299704760313,0.02017541043460369,0.010390358977019787,0.005180324427783489,0.007398080080747604,-0.008122666738927364,-0.01915871724486351,-0.00018959050066769123,0.02241162396967411,0.0023561869747936726,0.003728562267497182,0.014529380947351456,-0.008861050941050053,-0.011183235794305801,0.00481022521853447,0.008094079792499542,-0.0034939858596771955,-0.008968185633420944,0.004446524661034346,-0.0033132312819361687,0.02285945415496826,0.0042738099582493305,0.001622360316105187,-0.016874084249138832,-0.00798086542636156,-0.0230935737490654,0.03532770276069641,0.0018782116239890456,-0.029727887362241745,-0.02030695416033268,0.0055946167558431625,-0.021974366158246994,0.0017065443098545074,0.0015403588768094778,0.00689676171168685,0.023769419640302658,-0.014577401801943779,0.011449558660387993,-0.0006733378395438194,-0.011922353878617287,0.037568483501672745,-0.0007055480964481831,0.0006643265369348228,-0.011070968583226204,-0.02785041183233261,0.0014375851023942232,0.010307244956493378,-0.02321864664554596,0.0014364710077643394,-0.005117666907608509,0.01468372531235218,0.023289699107408524,0.009948080405592918,-0.011513765901327133,-0.005623922683298588,-0.004363678395748138,-0.0059468913823366165,-0.006565356161445379,0.01921733468770981,0.0013734159292653203,-0.02277904562652111,-0.014114737510681152,0.02229023352265358,-0.01910337060689926,-0.022069711238145828,0.0024455441161990166,-0.03800266236066818,-0.010611163452267647,-0.0024507753551006317,0.014019387774169445,-0.01211622916162014,0.020271161571145058,-0.013096032664179802,-0.00019119825446978211,-0.00020770687842741609,-0.0034759303089231253,-0.0021791167091578245,0.03311455622315407,0.024117659777402878,-0.009091646410524845,0.04075819253921509,-0.035506218671798706,0.014370344579219818,0.02386917918920517,95.0

Next, we binned the users' rating data into discrete levels and normalized it; the result looks like this:

    F0,F1,…,F199,rank
    -0.00905437208712101,-0.0007396119763143361,…,-0.0015429649502038956,4
    -0.00905437208712101,-0.0007396119763143361,…,-0.014646890573203564,4
    -0.00905437208712101,-0.0007396119763143361,…,-0.1029856726527214,5
    -0.00905437208712101,-0.0007396119763143361,…,0.02386917918920517,4
    (the middle feature columns F2–F198 are omitted here for readability)
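The binning and normalization step mentioned above is not shown in the original code. Below is a minimal sketch of one way to do it; the raw values and the quantile-based bins are made-up stand-ins for illustration, not the project's actual preprocessing:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical raw listening counts for five users (stand-in data)
raw = np.array([3.0, 27.0, 120.0, 560.0, 95.0])

# bin the raw values into five discrete rating levels by quantile
bins = np.quantile(raw, [0.2, 0.4, 0.6, 0.8])
levels = np.digitize(raw, bins) + 1          # levels in 1..5

# rescale the binned ratings into [0, 1], matching the scaler used in deepModel
scaler = MinMaxScaler()
normalized = scaler.fit_transform(levels.reshape(-1, 1).astype(float)).ravel()

print(levels)        # [1 2 4 5 3]
print(normalized)    # [0.   0.25 0.75 1.   0.5 ]
```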

Before building the deep model, we ran a few quick experiments with classical machine-learning models, starting with some simple analysis and visualization of the features obtained above:

The performance comparison results are as follows:
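The comparison figure itself is not reproduced here. As a rough sketch of this kind of baseline comparison, the synthetic data, the two models and the MAE metric below are my assumptions, not the project's actual experiment:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the 200-dimensional user/song feature vectors
X, y = make_regression(n_samples=500, n_features=200, noise=10.0, random_state=0)

# compare a linear baseline against a tree ensemble with 5-fold cross-validation
for name, model in [('Ridge', Ridge()),
                    ('RandomForest', RandomForestRegressor(n_estimators=50, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    print('%s MAE: %.2f' % (name, -scores.mean()))
```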

The feature count is large, but analysis showed that some of the features are highly redundant, so we performed an initial round of feature selection and then repeated the steps above; the resulting charts are shown below:
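One common recipe for this kind of redundancy-driven feature selection is to drop one member of every highly correlated column pair; the 0.95 threshold and the toy columns below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the F0..F199 columns; F2 is a near-copy of F0
rng = np.random.default_rng(0)
df = pd.DataFrame({'F0': rng.normal(size=100), 'F1': rng.normal(size=100)})
df['F2'] = df['F0'] * 0.98 + rng.normal(scale=0.01, size=100)

# keep only the upper triangle of the absolute correlation matrix,
# then drop every column correlated above 0.95 with an earlier one
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)                 # ['F2']
print(list(reduced.columns))   # ['F0', 'F1']
```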

Next comes the deep-learning model itself. Many model families could be used here; for project reasons I am only publishing a DNN-based baseline, which interested readers can take further. The concrete implementation is as follows:

    import os
    from scipy.stats import pearsonr, spearmanr, kendalltau
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.optimizers import Adam
    from keras.callbacks import EarlyStopping, ModelCheckpoint
    from keras.utils import plot_model

    def deepModel(data='dataset.json',saveDir='music/DNN/'):
        '''
        Deep-learning network model.
        getVector, plot_both_loss_acc_pic and calPerformance are project helper
        functions defined elsewhere.
        '''
        if not os.path.exists(saveDir):
            os.makedirs(saveDir)
        scaler,X_train,X_test,y_train,y_test=getVector(data=data)
        model=Sequential()
        model.add(Dense(1024,input_dim=X_train.shape[1]))
        model.add(Dropout(0.3))
        model.add(Dense(1024,activation='linear'))
        model.add(Dropout(0.3))
        model.add(Dense(1024,activation='sigmoid'))
        model.add(Dropout(0.3))
        model.add(Dense(1,activation='tanh'))  # alternatives tried: softmax, relu, tanh
        optimizer=Adam(lr=0.002,beta_1=0.9,beta_2=0.999,epsilon=1e-08)
        model.compile(loss='mae',optimizer=optimizer)
        early_stopping=EarlyStopping(monitor='val_loss',patience=20)
        checkpointer=ModelCheckpoint(filepath=saveDir+'checkpointer.hdf5',verbose=1,save_best_only=True)
        history=model.fit(X_train,y_train,batch_size=128,epochs=50,validation_split=0.3,verbose=1,shuffle=True,
                              callbacks=[checkpointer,early_stopping])  #validation_data=(X_validation,y_validation)
        lossdata,vallossdata=history.history['loss'],history.history['val_loss']
        plot_both_loss_acc_pic(lossdata,vallossdata,picpath=saveDir+'both_loss_epoch.png')
        y_predict=model.predict(X_test)
        # map the scaled predictions back to the original rating range
        y_predict_list=scaler.inverse_transform(y_predict.reshape(-1,1))
        y_true_list=scaler.inverse_transform(y_test.reshape(-1,1))
        # evaluate and visualize the fit
        y_predict_list=[int(one[0]) for one in y_predict_list.tolist()]
        y_true_list=[int(one[0]) for one in y_true_list.tolist()]
        res_list=calPerformance(y_true_list,y_predict_list)
        P=pearsonr(y_true_list,y_predict_list)[0]
        S=spearmanr(y_true_list,y_predict_list)[0]
        K=kendalltau(y_true_list,y_predict_list)[0]
        print('pearsonr: ',P)
        print('spearmanr: ',S)
        print('kendalltau: ',K)
        model.save(saveDir+'DL.model')
        plot_model(model,to_file=saveDir+'model_structure.png', show_shapes=True)
        print('-------------------------model_summary---------------------------------')
        model.summary()  # summary() prints directly and returns None

Once the model is built and the feature data generated, training can begin. A screenshot of the training process:

With the offline model in hand, we can run recommendation analysis on top of it. The concrete code for recommending to a single specified user:

    import json
    import numpy as np
    from keras.models import load_model

    def singleUserRecommend(userId='2230728513',model_path='results/music/DL/DL.model'):
        '''
        Given a user id, print the recommended songs.
        '''
        # load the song-id -> song-name mapping
        with open('data/music/id_song.json') as f:
            song_dict=json.load(f)
        # load the raw playlist recommendation data
        with open('data/music/neteasy_playlist_recommend_data.csv') as f:
            data_list=[one.strip().split(',') for one in f.readlines() if one]
        user_song={}
        user,song=[],[]
        for i in range(len(data_list)):
            # unpack into local names so the userId argument is not overwritten
            uid,songId,rating,T=data_list[i]
            user.append(uid)
            song.append(songId)
            if uid in user_song:
                user_song[uid].append(songId)
            else:
                user_song[uid]=[songId]
        user=list(set(user))
        song=list(set(song))
        with open('data/music/user2Vec.json') as U:
            user_vector=json.load(U)
        with open('data/music/song2Vec.json') as S:
            song_vector=json.load(S)
        model=load_model(model_path)
        try:
            one_song_list=user_song[userId]
            # the songs this user has not listened to yet
            no_listen_list=[one for one in song if one not in one_song_list]
            one_no_dict={}
            for one_no in no_listen_list:
                try:
                    # concatenate the user vector and the candidate song vector
                    one=user_vector[userId]+song_vector[one_no]
                    X=np.array([one])
                    # predicted rating for this (user, song) pair
                    score=model.predict(X)
                    one_no_dict[one_no]=score.tolist()[0]
                except Exception as e:
                    print('Exception1: ',e)
            # sort candidates by predicted score and keep the top 10
            one_no_sorted=sorted(one_no_dict.items(),key=lambda e:e[1],reverse=True)
            recommend_id_list=[one[0] for one in one_no_sorted][:10]
            for oneId in recommend_id_list:
                print('songId: ',oneId)
                print('songName: ',song_dict[oneId])
        except Exception as e:
            print('Exception2: ',e)

The output looks like this:

Judging by the results, this user seems rather fond of Japanese songs; we have accidentally let slip one of the user's little secrets.

That wraps up the recommendation-system practice shared in this post. If you are interested in this area, you are welcome to get in touch so we can learn from each other and improve together. I hope this article helps you; best wishes for your work and your studies!

Machine Learning — Recommendation Systems: (1) Recommendation-System Principles, (2) A Restaurant-Dish Recommendation System, (3) A Music Recommendation System

(1) SVD Principles

Humans can abstract and extract the important features of things; singular value decomposition (Singular Value Decomposition, SVD) is precisely how a machine can do the same. With SVD, a much smaller data set can represent the original one, removing noise and redundant information in the process.

One of the earliest applications of SVD was information retrieval, where the SVD-based approach is known as Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA).

Another application of SVD is recommender systems. A simple recommender computes similarities between items or users; more advanced methods first use SVD to build a topic space from the data and then compute similarities within that space.

Matrix factorization
In many cases, a small portion of the data carries most of the information in the data set, while the rest is either noise or irrelevant. Matrix factorization rewrites the original matrix in a new, easier-to-handle form: a product of two or more matrices.

Different factorization techniques have different properties; some suit one application better, others another. The most common matrix-factorization technique is SVD, which decomposes the original data matrix $Data$ into three matrices $U$, $\Sigma$ and $V^{T}$. If $Data$ has m rows and n columns, then:
$Data_{m\times n}=U_{m\times m}\Sigma _{m\times n}V_{n\times n}^{T}$

This decomposition produces a matrix $\Sigma$ whose only nonzero entries lie on the diagonal, and by convention those diagonal entries are sorted in descending order. They are called singular values and correspond to the singular values of the original data matrix $Data$. Singular values are related to eigenvalues: the singular values here are the square roots of the eigenvalues of $Data \cdot Data^{T}$.
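This relationship is easy to check numerically; here it is verified on the same [[1, 1], [7, 7]] matrix that is decomposed in the next section:

```python
import numpy as np

A = np.array([[1.0, 1.0], [7.0, 7.0]])

# singular values of A
_, sigma, _ = np.linalg.svd(A)

# eigenvalues of A @ A.T, whose square roots should equal the singular values
eigvals = np.linalg.eigvalsh(A @ A.T)                  # ascending order
sqrt_eig = np.sqrt(np.clip(eigvals, 0, None))[::-1]    # descending; clip tiny negatives

print(sigma)       # roughly [10, 0]
print(sqrt_eig)
assert np.allclose(sigma, sqrt_eig, atol=1e-6)
```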

$\Sigma$ contains only these diagonal entries, sorted from largest to smallest. In science and engineering one repeatedly observes the same fact: beyond some number r of singular values, the remaining ones can be set to zero. This means the data set has only r important features; everything else is noise or redundancy.

Implementing SVD in Python
NumPy has a linear-algebra toolbox called linalg, which computes the SVD of a matrix as follows:

    from numpy import *
    U,Sigma,VT=linalg.svd([[1,1],[7,7]])
    U
    

    array([[-0.14142136, -0.98994949],
    [-0.98994949, 0.14142136]])

    Sigma
    

    array([1.00000000e+01, 2.82797782e-16])

    VT
    

    array([[-0.70710678, -0.70710678],
    [ 0.70710678, -0.70710678]])

Next, run the decomposition on a larger data set:

    def loadExData() :
        return [[1, 1, 1, 0, 0],
                [2, 2, 2, 0, 0],
                [1, 1, 1, 0, 0],
                [5, 5, 5, 0, 0],
                [1, 1, 0, 2, 2],
                [0, 0, 0, 3, 3],
                [0, 0, 0, 1, 1]]
    import svdRec
    Data=svdRec.loadExData()
    U,Sigma,VT=linalg.svd(Data)
    Sigma
    

    array([9.72140007e+00, 5.29397912e+00, 6.84226362e-01, 4.11502614e-16,
    1.36030206e-16])

The first three values are much larger than the rest (the last two may differ slightly from machine to machine, but the order of magnitude is the same), so we can drop the last two. The original data set can then be approximated as:
$Data_{m\times n}\approx U_{m\times 3}\Sigma _{3\times 3}V_{3\times n}^{T}$

To reconstruct an approximation of the original matrix, first build a 3x3 matrix Sig3:

    Sig3=mat([[Sigma[0], 0, 0],[0, Sigma[1], 0],[0, 0, Sigma[2]]])
    

Since Sig3 is only a 3x3 matrix, we only need the first 3 columns of U and the first 3 rows of VT. In Python:

    U[:,:3]*Sig3*VT[:3,:]
    

    matrix([[ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
    7.75989921e-16, 7.71587483e-16],
    [ 2.00000000e+00, 2.00000000e+00, 2.00000000e+00,
    3.00514919e-16, 2.77832253e-16],
    [ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
    2.18975112e-16, 2.07633779e-16],
    [ 5.00000000e+00, 5.00000000e+00, 5.00000000e+00,
    3.00675663e-17, -1.28697294e-17],
    [ 1.00000000e+00, 1.00000000e+00, -5.48397422e-16,
    2.00000000e+00, 2.00000000e+00],
    [ 3.21319929e-16, 4.43562065e-16, -3.48967188e-16,
    3.00000000e+00, 3.00000000e+00],
    [ 9.71445147e-17, 1.45716772e-16, -1.52655666e-16,
    1.00000000e+00, 1.00000000e+00]])

(2) A Restaurant-Dish Recommendation System

There are many ways to implement recommendation; here we use collaborative filtering, which makes recommendations by comparing one user's data against other users' data.

Once we know the similarity between two users or two items, we can use the existing data to predict unknown user preferences.

Let's compute the similarity between pulled pork and roast beef, starting with Euclidean distance:

    $\sqrt{(4-4)^{2}+(3-3)^{2}+(2-1)^{2}}=1$

while the Euclidean distance between pulled pork and eel rice is:

    $\sqrt{(4-2)^{2}+(3-5)^{2}+(2-2)^{2}}=2.83$

In this data, the distance between pulled pork and roast beef is smaller than that between pulled pork and eel rice, so pulled pork is more similar to roast beef. We would like the similarity value to lie between 0 and 1 and to grow as items become more alike, so we use
similarity = 1/(1 + distance). When the distance is 0 the similarity is 1.0, and when the distance is very large the similarity approaches 0.

The second distance measure is the Pearson correlation coefficient, which we used earlier when assessing the accuracy of regression equations; it measures the similarity between two vectors. Its advantage over Euclidean distance is that it is insensitive to the magnitude of a user's ratings: if one manic user rates everything 5 and a depressive user rates everything 1, the Pearson correlation considers the two vectors equal. In NumPy the Pearson correlation is computed by corrcoef(), which we will use shortly. Its value ranges from -1 to +1, so 0.5 + 0.5*corrcoef() rescales it to the 0-1 range.

Another common measure is cosine similarity, the cosine of the angle between two vectors: if the angle is 90 degrees the similarity is 0; if the vectors point in the same direction it is 1.0. Like the Pearson correlation, cosine similarity ranges from -1 to +1, so it too must be rescaled to the 0-1 range. For two vectors $A$ and $B$ it is defined as:
    $\cos\theta =\frac{A\cdot B}{\|A\|\,\|B\|}$

where $\|A\|$ and $\|B\|$ denote the 2-norms of vectors A and B. Any vector norm could be used, but when no order is specified the 2-norm is assumed. The 2-norm of the vector [4, 2, 2] is:

    $\sqrt{4^{2}+2^{2}+2^{2}}$

Now write these similarity measures as Python functions:

    from numpy import *
    from numpy import linalg as la
    # inA and inB are both column vectors
    def ecludSim(inA, inB) :
        return 1.0/(1.0 + la.norm(inA - inB))
    
    def pearsSim(inA, inB) :
        # with fewer than three points the two vectors are perfectly correlated, so return 1.0
        if len(inA) < 3 : return 1.0
        return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1]
    
    def cosSim(inA, inB) :
        num = float(inA.T*inB)
        denom = la.norm(inA)*la.norm(inB)
        return 0.5+0.5*(num/denom)
    
    import ml.svdRec as svdRec
    from numpy import *
    myMat = mat(svdRec.loadExData())
    # Euclidean-distance similarity
    svdRec.ecludSim(myMat[:,0], myMat[:,4])
    0.13367660240019172
    svdRec.ecludSim(myMat[:,0], myMat[:,0])
    1.0
    
    # cosine similarity
    svdRec.cosSim(myMat[:,0], myMat[:,4])
    0.54724555912615336
    svdRec.cosSim(myMat[:,0], myMat[:,0])
    0.99999999999999989
    
    # Pearson correlation coefficient
    svdRec.pearsSim(myMat[:,0], myMat[:,4])
    0.23768619407595826
    svdRec.pearsSim(myMat[:,0], myMat[:,0])
    1.0
    

The computation above measured the distance between two restaurant dishes, which is called item-based similarity. Measuring distances between users is instead called user-based similarity: comparing rows compares users, comparing columns compares items. Which to use depends on the number of users versus items. Item-based computation time grows with the number of items, and user-based with the number of users, so when there are many users we usually prefer the item-based approach.
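To make the row/column distinction concrete, here is the cosine measure (restated for plain NumPy arrays rather than matrix columns) applied both ways to a small made-up rating matrix:

```python
import numpy as np
from numpy import linalg as la

def cosSim(inA, inB):
    # cosine similarity rescaled to [0, 1], as defined earlier
    num = float(np.dot(inA, inB))
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

data = np.array([[1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [1, 1, 0, 2, 2],
                 [0, 0, 0, 3, 3]])

# item-based: compare columns (items 0 and 3)
item_sim = cosSim(data[:, 0], data[:, 3])
# user-based: compare rows (users 0 and 1)
user_sim = cosSim(data[0, :], data[1, :])
print('item 0 vs item 3:', item_sim)
print('user 0 vs user 1:', user_sim)
```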

The recommendation engine works as follows: given a user, the system returns the N best dish recommendations for that user. To achieve this we need to:

1. Find the dishes the user has not rated, i.e. the 0 entries in the user-item matrix;
2. For every unrated item, predict a possible rating score, that is, the score we think the user would give the item (this is the whole point of the similarity computation);
3. Sort the predicted scores in descending order and return the top N items.

The item-similarity-based recommendation engine:

    # Estimate a user's rating of an item under a given similarity measure.
    # Arguments: data matrix, user index, similarity measure, item index;
    # as in figures 1 and 2, rows correspond to users and columns to items.
    def standEst(dataMat, user, simMeas, item) :
        # number of items in the data set
        n = shape(dataMat)[1]
        # initialise the two accumulators used for the estimated score
        simTotal = 0.0; ratSimTotal = 0.0
        # iterate over every item in the user's row
        for j in range(n) :
            userRating = dataMat[user,j]
            # a rating of 0 means the user has not rated this item; skip it
            if userRating == 0 : continue
            # find the users who rated both items; overLap holds the indices of
            # elements rated in both columns
            overLap = nonzero(logical_and(dataMat[:, item].A>0, dataMat[:, j].A>0))[0]
            # no overlapping elements means similarity 0; move to the next item
            if len(overLap) == 0 : similarity = 0
            # otherwise compute the similarity on the overlapping ratings
            else : similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
            # print('the %d and %d similarity is : %f' % (item, j, similarity))
            # accumulate the similarity
            simTotal += similarity
            ratSimTotal += similarity * userRating
        if simTotal == 0 : return 0
        # normalise the similarity-weighted ratings by the similarity total,
        # which keeps the score in the 0-5 range used to rank the predictions
        else : return ratSimTotal/simTotal
    
    # Recommendation engine: calls standEst() and produces the top-N results.
    # simMeas: similarity measure
    # estMethod: estimation method
    def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst) :
        # build the list of items this user has not rated yet
        unratedItems = nonzero(dataMat[user, :].A==0)[1]
        # if everything is rated there is nothing to recommend;
        # otherwise loop over all unrated items
        if len(unratedItems) == 0 : return 'you rated everything'
        itemScores = []
        for item in unratedItems :
            # predict a score for each unrated item by calling standEst()
            estimatedScore = estMethod(dataMat, user, simMeas, item)
            # store (item index, estimated score) in the itemScores list
            itemScores.append((item, estimatedScore))
        # return the N highest-scoring unrated items
        return  sorted(itemScores, key=lambda jj : jj[1], reverse=True)[:N] 
    
    import ml.svdRec as svdRec
    from numpy import *
    # load the original matrix
    myMat=mat(svdRec.loadExData())
    # the matrix is great for demonstrating SVD but not very interesting by
    # itself, so change a few of its values
    myMat[0,1]=myMat[0,0]=myMat[1,0]=myMat[2,0]=4
    myMat[3,3]=2
    # the resulting matrix
    myMat
    matrix([[4, 4, 1, 0, 0],
            [4, 2, 2, 0, 0],
            [4, 1, 1, 0, 0],
            [5, 5, 5, 2, 0],
            [1, 1, 0, 2, 2],
            [0, 0, 0, 3, 3],
            [0, 0, 0, 1, 1]])
    # try the default recommendation
    svdRec.recommend(myMat,2)
    # user 2's predicted rating is 2.5 for item 4 and about 1.97 for item 3
    [(4, 2.5), (3, 1.9703483892927431)]
    # recommend with the other similarity measures
    svdRec.recommend(myMat,2,simMeas=svdRec.ecludSim)
    [(4, 2.5), (3, 1.9866572968729499)]
    svdRec.recommend(myMat,2,simMeas=svdRec.pearsSim)
    [(4, 2.5), (3, 2.0)]
    

Improving the recommendations with SVD
Real data sets are much sparser than the myMat matrix used to demonstrate recommend().

    def loadExData2():
        return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
               [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
               [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
               [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
               [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
               [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
               [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
               [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
               [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
               [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
               [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]
    

    Next, compute the SVD of this matrix to see how many latent dimensions it actually needs.

    from numpy import linalg as la
    from numpy import *
    U,Sigma,VT=la.svd(mat(svdRec.loadExData2()))
    Sigma
    

    array([ 15.77075346, 11.40670395, 11.03044558, 4.84639758,
    3.09292055, 2.58097379, 1.00413543, 0.72817072,
    0.43800353, 0.22082113, 0.07367823])

    Then check how many singular values it takes to reach 90% of the total energy.

    # Square the values in Sigma
    Sig2=Sigma**2
    # The total energy
    sum(Sig2)
    541.99999999999955
    # 90% of the total energy
    sum(Sig2)*0.9
    487.79999999999961
    # The energy contained in the first two singular values
    sum(Sig2[:2])
    378.8295595113579
    # The first two fall short of 90%, so include the third singular value as well
    sum(Sig2[:3])
    500.50028912757926
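    This threshold search can be wrapped in a small helper. A minimal sketch using the Sigma values printed above; `energy_rank` is a hypothetical name, not part of svdRec:

```python
import numpy as np

def energy_rank(sigma, threshold=0.9):
    """Return the smallest k whose leading singular values capture
    at least `threshold` of the total energy (sum of squares)."""
    energy = sigma ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

sigma = np.array([15.77075346, 11.40670395, 11.03044558, 4.84639758,
                  3.09292055, 2.58097379, 1.00413543, 0.72817072,
                  0.43800353, 0.22082113, 0.07367823])
print(energy_rank(sigma))  # 3 singular values reach 90% of the energy
```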
    

    That exceeds 90% of the total energy, which is good enough: we can reduce the 11-dimensional matrix to a 3-dimensional one. Next we build a similarity function over the reduced space. SVD maps all the items into a low-dimensional space, where the same similarity measures as before can be used for recommendation. The function svdEst() below is the analogue of standEst().

    # SVD-based rating estimation.
    # In recommend(), svdEst is passed in place of standEst(); it builds a rating
    # estimate for the given user and item. It is very similar to standEst(),
    # except that it runs an SVD on the data set (third line of the function).
    # After the decomposition, only the singular values accounting for 90% of the
    # energy are kept; they come back as a NumPy array.
    def svdEst(dataMat, user, simMeas, item) :
        n = shape(dataMat)[1]
        simTotal = 0.0; ratSimTotal = 0.0
        U,Sigma,VT = la.svd(dataMat)
        # Build a diagonal matrix from the leading singular values
        Sig4 = mat(eye(4)*Sigma[:4])
        # Use the U matrix to project the items into the low-dimensional space
        xformedItems = dataMat.T * U[:, :4] * Sig4.I
        # For the given user, loop over every element of the user's row, just as
        # the for loop in standEst() does. The difference is that the similarity is
        # computed in the low-dimensional space; the similarity measure is still
        # passed in as a parameter.
        for j in range(n) :
            userRating = dataMat[user,j]
            if userRating == 0 or j == item : continue
            similarity = simMeas(xformedItems[item, :].T, xformedItems[j, :].T)
            # print the progress of the similarity computation
            print('the %d and %d similarity is : %f' % (item, j, similarity))
            # sum the similarities
            simTotal += similarity
            # sum similarity * rating
            ratSimTotal += similarity * userRating
        if simTotal == 0 : return 0
        else : return ratSimTotal/simTotal
    
    myMat=mat(svdRec.loadExData2())
    myMat
    

    matrix([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])

    svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
    

    the 0 and 3 similarity is : 0.490950
    the 0 and 5 similarity is : 0.484274
    the 0 and 10 similarity is : 0.512755
    the 1 and 3 similarity is : 0.491294
    the 1 and 5 similarity is : 0.481516
    the 1 and 10 similarity is : 0.509709
    the 2 and 3 similarity is : 0.491573
    the 2 and 5 similarity is : 0.482346
    the 2 and 10 similarity is : 0.510584
    the 4 and 3 similarity is : 0.450495
    the 4 and 5 similarity is : 0.506795
    the 4 and 10 similarity is : 0.512896
    the 6 and 3 similarity is : 0.743699
    the 6 and 5 similarity is : 0.468366
    the 6 and 10 similarity is : 0.439465
    the 7 and 3 similarity is : 0.482175
    the 7 and 5 similarity is : 0.494716
    the 7 and 10 similarity is : 0.524970
    the 8 and 3 similarity is : 0.491307
    the 8 and 5 similarity is : 0.491228
    the 8 and 10 similarity is : 0.520290
    the 9 and 3 similarity is : 0.522379
    the 9 and 5 similarity is : 0.496130
    the 9 and 10 similarity is : 0.493617
    [(4, 3.3447149384692283), (7, 3.3294020724526967), (9, 3.328100876390069)]

    Try another similarity measure:

    svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst, simMeas=svdRec.pearsSim)
    

    the 0 and 3 similarity is : 0.341942
    the 0 and 5 similarity is : 0.124132
    the 0 and 10 similarity is : 0.116698
    the 1 and 3 similarity is : 0.345560
    the 1 and 5 similarity is : 0.126456
    the 1 and 10 similarity is : 0.118892
    the 2 and 3 similarity is : 0.345149
    the 2 and 5 similarity is : 0.126190
    the 2 and 10 similarity is : 0.118640
    the 4 and 3 similarity is : 0.450126
    the 4 and 5 similarity is : 0.528504
    the 4 and 10 similarity is : 0.544647
    the 6 and 3 similarity is : 0.923822
    the 6 and 5 similarity is : 0.724840
    the 6 and 10 similarity is : 0.710896
    the 7 and 3 similarity is : 0.319482
    the 7 and 5 similarity is : 0.118324
    the 7 and 10 similarity is : 0.113370
    the 8 and 3 similarity is : 0.334910
    the 8 and 5 similarity is : 0.119673
    the 8 and 10 similarity is : 0.112497
    the 9 and 3 similarity is : 0.566918
    the 9 and 5 similarity is : 0.590049
    the 9 and 10 similarity is : 0.602380
    [(4, 3.3469521867021732), (9, 3.3353796573274699), (6, 3.307193027813037)]

    (3) A music recommendation system

    First clean the music data set and extract features, then make recommendations based on matrix factorization.

    1. Processing the music data
      Read the music data set, compute summary statistics, and pick the informative fields as our features.
    2. Item-similarity-based recommendation
      Choose a similarity measure and compute recommendations from it.
    3. SVD matrix-factorization-based recommendation
      Use matrix factorization to obtain recommendations quickly and efficiently.
    import pandas as pd
    import numpy as np
    import time
    import sqlite3
    
    data_home = './'
    

    Part of our data lives in a database file, so we use the sqlite3 package to read it; adjust the data path to your own setup. Let's first look at what the data contains. read_csv has many parameters for different formats, such as the separator and the column names:

    Reading the data
    We only need the user, the song, and the play count.

    triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt', 
                                  sep='\t', header=None, 
                                  names=['user','song','play_count'])
    

    The data set is fairly large:

    triplet_dataset.shape
    

    (48373586, 3)

    Memory usage and column dtypes:

    triplet_dataset.info()
    

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 48373586 entries, 0 to 48373585
    Data columns (total 3 columns):
    user object
    song object
    play_count int64
    dtypes: int64(1), object(2)
    memory usage: 1.1+ GB

    To understand the data in more detail, print its info() to see the column types and the total memory footprint. With very large data you may hit out-of-memory errors during processing; the simplest fix is to shrink the dtypes, for example replacing float64 with float32, which greatly reduces memory usage.
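    A minimal sketch of that downcasting tip; the frame and column here are made up, not the 48M-row triplet data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for a large integer column.
df = pd.DataFrame({'play_count': np.arange(1000, dtype=np.int64)})

# Downcasting 64-bit columns to the smallest sufficient dtype shrinks memory.
before = df['play_count'].memory_usage(deep=True)
df['play_count'] = pd.to_numeric(df['play_count'], downcast='integer')
after = df['play_count'].memory_usage(deep=True)
print(before, after)
```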

    The raw data:

    triplet_dataset.head(n=10)
    

    Total play count per user

    The data holds a user id, a song id, and the number of times the user played that song. From this base we can derive various per-user and per-song statistics, for example each user's total play count:

    output_dict = {}
    with open(data_home+'train_triplets.txt') as f:
        for line_number, line in enumerate(f):
            # the current user
            user = line.split('\t')[0]
            # this line's play count
            play_count = int(line.split('\t')[2])
            # if the user is already in the dict, add the running total to the current count
            if user in output_dict:
                play_count +=output_dict[user]
            output_dict.update({user:play_count})
    # user -> total play count
    output_list = [{'user':k,'play_count':v} for k,v in output_dict.items()]
    # convert to a DataFrame
    play_count_df = pd.DataFrame(output_list)
    # sort by total play count
    play_count_df = play_count_df.sort_values(by = 'play_count', ascending = False)
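    The same aggregation can be done with a single pandas groupby instead of the manual dictionary; a sketch on a made-up miniature of the triplet frame:

```python
import pandas as pd

# Hypothetical miniature of the user/song/play_count triplets.
triplets = pd.DataFrame({
    'user': ['u1', 'u2', 'u1', 'u3', 'u2'],
    'song': ['s1', 's1', 's2', 's3', 's2'],
    'play_count': [3, 1, 5, 2, 4],
})

# One groupby replaces the whole accumulation loop.
play_count_df = (triplets.groupby('user', as_index=False)['play_count']
                 .sum()
                 .sort_values(by='play_count', ascending=False))
print(play_count_df)
```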
    

    We build a dictionary to accumulate each user's total plays, which requires a full pass over the data set. With data this large every step can take a long time, and a later slip might force you to start over, so it pays to save intermediate results. Since the result is already a DataFrame, a single to_csv() call does the job.

    play_count_df.to_csv(path_or_buf='user_playcount_df.csv', index = False)
    

    Total play count per song

    #the same approach as above
    output_dict = {}
    with open(data_home+'train_triplets.txt') as f:
        for line_number, line in enumerate(f):
            # the current song
            song = line.split('\t')[1]
            # this line's play count
            play_count = int(line.split('\t')[2])
            # accumulate each song's total play count
            if song in output_dict:
                play_count +=output_dict[song]
            output_dict.update({song:play_count})
    output_list = [{'song':k,'play_count':v} for k,v in output_dict.items()]
    #convert to a DataFrame
    song_count_df = pd.DataFrame(output_list)
    song_count_df = song_count_df.sort_values(by = 'play_count', ascending = False)
    
    song_count_df.to_csv(path_or_buf='song_playcount_df.csv', index = False)
    

    A look at the current rankings:

    play_count_df = pd.read_csv(filepath_or_buffer='user_playcount_df.csv')
    play_count_df.head(n =10)
    
    song_count_df = pd.read_csv(filepath_or_buffer='song_playcount_df.csv')
    song_count_df.head(10)
    

    The most popular song has 726,885 plays. As we just saw, the data set is huge; to keep run times reasonable and the matrix less sparse, we truncate it by play count. Some registered users probably had a quick look and never came back; they will not help the model, they only make the matrix sparser. The same goes for songs nobody listens to. Since users and songs are already sorted by play count, we keep the top 100,000 users and the top 30,000 songs; a sensible cut-off can also be chosen by checking what share of total plays the selection covers.

    Take this subset (already sorted by size, so these should be the most important records) as our working data.

    #share of total plays covered by the top 100,000 users
    total_play_count = sum(song_count_df.play_count)
    print ((float(play_count_df.head(n=100000).play_count.sum())/total_play_count)*100)
    play_count_subset = play_count_df.head(n=100000)
    

    40.8807280500655

    (float(song_count_df.head(n=30000).play_count.sum())/total_play_count)*100
    

    78.39315366645269

    song_count_subset = song_count_df.head(n=30000)
    

    The top 30,000 songs account for 78.39% of all plays. Now that we have these 100,000 loyal users and 30,000 popular songs, we filter the original data set, dropping every record that involves other users or songs.

    Keep the 100,000 users and 30,000 songs

    user_subset = list(play_count_subset.user)
    song_subset = list(song_count_subset.song)
    

    Filter out the other users' data

    #reload the original data set
    triplet_dataset = pd.read_csv(filepath_or_buffer=data_home+'train_triplets.txt',sep='\t', 
                                  header=None, names=['user','song','play_count'])
    #keep only rows for the 100,000 selected users
    triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset) ]
    del(triplet_dataset)
    #keep only rows for the 30,000 selected songs
    triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
    del(triplet_dataset_sub)
    del(triplet_dataset_sub)
    
    triplet_dataset_sub_song.to_csv(path_or_buf=data_home+'triplet_dataset_sub_song.csv', index=False)
    

    Our current data volume:

    triplet_dataset_sub_song.shape
    

    (10774558, 3)

    The sample count is now less than a quarter of the original, but everything we dropped was sparse data that would not help the model. Cleaning and preprocessing the data is well worth it: it not only speeds up computation but also affects the final results.

    triplet_dataset_sub_song.head(n=10)
    

    Adding song metadata
    So far we only have play counts, which is very little to work with. Each song normally has details such as the artist, release date, and topic, stored here in a database file; we read them with the sqlite3 package:

    conn = sqlite3.connect(data_home+'track_metadata.db')
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    cur.fetchall()
    

    [('songs',)]

    track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
    track_metadata_df_sub = track_metadata_df[track_metadata_df.song_id.isin(song_subset)]
    
    track_metadata_df_sub.to_csv(path_or_buf=data_home+'track_metadata_df_sub.csv', index=False)
    
    track_metadata_df_sub.shape
    

    (30447, 14)

    The data we now have:

    triplet_dataset_sub_song = pd.read_csv(filepath_or_buffer=data_home+'triplet_dataset_sub_song.csv',encoding = "ISO-8859-1")
    track_metadata_df_sub = pd.read_csv(filepath_or_buffer=data_home+'track_metadata_df_sub.csv',encoding = "ISO-8859-1")
    
    triplet_dataset_sub_song.head()
    
    track_metadata_df_sub.head()
    

    Cleaning the data set
    Drop useless and duplicated fields; cleaning is an important step

    # drop fields we don't need
    del(track_metadata_df_sub['track_id'])
    del(track_metadata_df_sub['artist_mbid'])
    # drop duplicates
    track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
    # merge the song metadata with our play data
    triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub, how='left', left_on='song', right_on='song_id')
    # columns can be renamed as you like
    triplet_dataset_sub_song_merged.rename(columns={'play_count':'listen_count'},inplace=True)
    
    # drop the remaining unneeded fields
    del(triplet_dataset_sub_song_merged['song_id'])
    del(triplet_dataset_sub_song_merged['artist_id'])
    del(triplet_dataset_sub_song_merged['duration'])
    del(triplet_dataset_sub_song_merged['artist_familiarity'])
    del(triplet_dataset_sub_song_merged['artist_hotttnesss'])
    del(triplet_dataset_sub_song_merged['track_7digitalid'])
    del(triplet_dataset_sub_song_merged['shs_perf'])
    del(triplet_dataset_sub_song_merged['shs_work'])
    

    The data is processed; let's see what it looks like

    triplet_dataset_sub_song_merged.head(n=10)
    

    The data is much tidier now: besides each user's play count for a song we also have the song's title, album, artist, and release year. So far we have only surveyed what the fields mean without analysing their content. When a brand-new user arrives we have no idea what to recommend, and that is where leaderboards help: we can compute the most popular songs and artists:

    The most popular songs

    import matplotlib.pyplot as plt; plt.rcdefaults()
    import numpy as np
    import matplotlib.pyplot as plt
    #total play count grouped by song title
    popular_songs = triplet_dataset_sub_song_merged[['title','listen_count']].groupby('title').sum().reset_index()
    #sort the result
    popular_songs_top_20 = popular_songs.sort_values('listen_count', ascending=False).head(n=20)
    
    #convert to lists for plotting
    objects = (list(popular_songs_top_20['title']))
    #bar positions
    y_pos = np.arange(len(objects))
    #corresponding values
    performance = list(popular_songs_top_20['listen_count'])
    #plot
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects, rotation='vertical')
    plt.ylabel('Item count')
    plt.title('Most popular songs')
     
    plt.show()
    

    The most popular releases

    #total play count grouped by release (album)
    popular_release = triplet_dataset_sub_song_merged[['release','listen_count']].groupby('release').sum().reset_index()
    #sort
    popular_release_top_20 = popular_release.sort_values('listen_count', ascending=False).head(n=20)
    
    objects = (list(popular_release_top_20['release']))
    y_pos = np.arange(len(objects))
    performance = list(popular_release_top_20['listen_count'])
    #plot 
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects, rotation='vertical')
    plt.ylabel('Item count')
    plt.title('Most popular Release')
     
    plt.show()
    

    The most popular artists

    #total play count grouped by artist
    popular_artist = triplet_dataset_sub_song_merged[['artist_name','listen_count']].groupby('artist_name').sum().reset_index()
    #sort
    popular_artist_top_20 = popular_artist.sort_values('listen_count', ascending=False).head(n=20)
    
    objects = (list(popular_artist_top_20['artist_name']))
    y_pos = np.arange(len(objects))
    performance = list(popular_artist_top_20['listen_count'])
    #plot 
    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects, rotation='vertical')
    plt.ylabel('Item count')
    plt.title('Most popular Artists')
     
    plt.show()
    

    Distribution of songs played per user

    user_song_count_distribution = triplet_dataset_sub_song_merged[['user','title']].groupby('user').count().reset_index().sort_values(
    by='title',ascending = False)
    user_song_count_distribution.title.describe()
    

    count 99996.000000
    mean 107.749890
    std 79.742561
    min 1.000000
    25% 53.000000
    50% 89.000000
    75% 141.000000
    max 1189.000000
    Name: title, dtype: float64

    x = user_song_count_distribution.title
    n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
    plt.xlabel('Play Counts')
    plt.ylabel('Num of Users')
    plt.title(r'$\mathrm{Histogram\ of\ User\ Play\ Count\ Distribution}\ $')
    plt.grid(True)
    plt.show()
    

    Most users have played roughly 100 songs. With the data processing and exploration done, the next step is building a program that actually makes recommendations.

    Building the recommender

    import Recommenders as Recommenders
    from sklearn.model_selection import train_test_split
    

    The simplest form of recommendation is a leaderboard. Here we write a function that takes the raw data, the user column name, and the field to rank by (song title, artist, or release, whichever leaderboard we want):

    triplet_dataset_sub_song_merged_set = triplet_dataset_sub_song_merged
    train_data, test_data = train_test_split(triplet_dataset_sub_song_merged_set, test_size = 0.40, random_state=0)
    
    train_data.head()
    
    def create_popularity_recommendation(train_data, user_id, item_id):
        #count plays per value of the chosen field (title, release, or artist)
        train_data_grouped = train_data.groupby([item_id]).agg({user_id: 'count'}).reset_index()
        #call the result a score for readability
        train_data_grouped.rename(columns = {user_id: 'score'},inplace=True)
        
        #a leaderboard needs sorting
        train_data_sort = train_data_grouped.sort_values(['score', item_id], ascending = [0,1])
        
        #add a rank column that encodes recommendation priority
        train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
            
        #return the requested number of recommendations
        popularity_recommendations = train_data_sort.head(20)
        return popularity_recommendations
        return popularity_recommendations
    
    recommendations = create_popularity_recommendation(triplet_dataset_sub_song_merged,'user','title')
    

    Get the recommendations

    recommendations
    

    This returns a top-20 song leaderboard. The score here is just a simple play count; a real design could combine more signals, such as the song's release year or the artist's popularity.

    Recommendation based on song similarity

    Next we compute similarities to recommend songs; to speed up the code, the experiment runs on a subset of the data.

    song_count_subset = song_count_df.head(n=5000)
    user_subset = list(play_count_subset.user)
    song_subset = list(song_count_subset.song)
    triplet_dataset_sub_song_merged_sub = triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.song.isin(song_subset)]
    
    triplet_dataset_sub_song_merged_sub.head()
    

    Compute similarities and produce recommendations

    import Recommenders as Recommenders
    train_data, test_data = train_test_split(triplet_dataset_sub_song_merged_sub, test_size = 0.30, random_state=0)
    is_model = Recommenders.item_similarity_recommender_py()
    is_model.create(train_data, 'user', 'title')
    user_id = list(train_data.user)[7]
    user_items = is_model.get_user_items(user_id)
    

    To recommend for a particular user we first need the songs they have listened to, then compare those against every song in the data set and recommend the most similar ones. How? Say the current user has listened to 66 songs and the data set holds 4879; we build a [66, 4879] matrix whose entries are the similarity between each song the user has heard and each song in the data set. Here we use the Jaccard coefficient: entry [i, j] compares the set of people who listened to the user's i-th song (say 3000 listeners) with the set who listened to song j (say 5000 listeners):
    $$Jaccard(i, j) = \frac{|listeners(i) \cap listeners(j)|}{|listeners(i) \cup listeners(j)|}$$
    If two songs are truly similar their audiences should overlap, so intersection over union is large; unrelated songs give a small value. The code fills in every entry of the [66, 4879] matrix. One more thing matters for the final recommendation: each candidate song j must be compared against all 66 songs the user has heard, and those 66 Jaccard values are averaged to give the song's recommendation score.
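    The coefficient itself is a one-liner over listener sets; a minimal sketch with made-up listener ids:

```python
def jaccard(listeners_i, listeners_j):
    """Jaccard similarity between the listener sets of two songs."""
    a, b = set(listeners_i), set(listeners_j)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# Toy listener sets for two hypothetical songs.
sim = jaccard(['u1', 'u2', 'u3'], ['u2', 'u3', 'u4', 'u5'])
print(sim)  # 2 shared listeners / 5 total listeners = 0.4
```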

    #run the recommendation
    is_model.recommend(user_id)
    

    No. of unique songs for the user: 66
    no. of unique songs in the training set: 4879
    Non zero values in cooccurence_matrix :290327

    Recommendation based on matrix factorization (SVD)

    triplet_dataset_sub_song_merged_sum_df = triplet_dataset_sub_song_merged[['user','listen_count']].groupby('user').sum().reset_index()
    triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count':'total_listen_count'},inplace=True)
    triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged,triplet_dataset_sub_song_merged_sum_df)
    triplet_dataset_sub_song_merged.head()
    


    triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count']/triplet_dataset_sub_song_merged['total_listen_count']
    
    triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.user =='d6589314c0a9bcbca4fee0c93b14bc402363afea'][['user','song','listen_count','fractional_play_count']].head()
    
    from scipy.sparse import coo_matrix
    
    small_set = triplet_dataset_sub_song_merged
    user_codes = small_set.user.drop_duplicates().reset_index()
    song_codes = small_set.song.drop_duplicates().reset_index()
    user_codes.rename(columns={'index':'user_index'}, inplace=True)
    song_codes.rename(columns={'index':'song_index'}, inplace=True)
    song_codes['so_index_value'] = list(song_codes.index)
    user_codes['us_index_value'] = list(user_codes.index)
    small_set = pd.merge(small_set,song_codes,how='left')
    small_set = pd.merge(small_set,user_codes,how='left')
    mat_candidate = small_set[['us_index_value','so_index_value','fractional_play_count']]
    data_array = mat_candidate.fractional_play_count.values
    row_array = mat_candidate.us_index_value.values
    col_array = mat_candidate.so_index_value.values
    
    data_sparse = coo_matrix((data_array, (row_array, col_array)),dtype=float)
    
    data_sparse
    

    <99996x30000 sparse matrix of type ‘<class ‘numpy.float64’>’
    with 10774558 stored elements in COOrdinate format>

    The code above first groups by user to compute each user's total play count, then divides each song's plays by that total to get a per-song score; the new fractional_play_count column is the user's rating of each song. With ratings in hand we can build the matrix. One small detail needs handling first: the raw user IDs and song IDs are both long strings, which are awkward to work with, so we re-index them.

    user_codes[user_codes.user =='2a2f776cbac6df64d6cb505e7e834e01684673b6']
    

    Factorizing the matrix with SVD
    With the matrix built, we run the SVD factorization. This needs some extra tooling, and scipy is a great helper here, with an SVD routine already packaged.

    import math as mt
    from scipy.sparse.linalg import * #used for matrix multiplication
    from scipy.sparse.linalg import svds
    from scipy.sparse import csc_matrix
    
    def compute_svd(urm, K):
        U, s, Vt = svds(urm, K)
    
        dim = (len(s), len(s))
        S = np.zeros(dim, dtype=np.float32)
        for i in range(0, len(s)):
            S[i,i] = mt.sqrt(s[i])
    
        U = csc_matrix(U, dtype=np.float32)
        S = csc_matrix(S, dtype=np.float32)
        Vt = csc_matrix(Vt, dtype=np.float32)
        
        return U, S, Vt
    
    def compute_estimated_matrix(urm, U, S, Vt, uTest, K, test):
        rightTerm = S*Vt 
        max_recommendation = 250
        estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
        recomendRatings = np.zeros(shape=(MAX_UID,max_recommendation ), dtype=np.float16)
        for userTest in uTest:
            prod = U[userTest, :]*rightTerm
            estimatedRatings[userTest, :] = prod.todense()
            recomendRatings[userTest, :] = (-estimatedRatings[userTest, :]).argsort()[:max_recommendation]
        return recomendRatings
    

    Running the SVD requires an extra parameter K: how many leading singular values to keep in the approximation, i.e. the size of the S matrix. A larger K is slower overall but closer to the exact result, so the value is a trade-off we have to weigh ourselves.

    K=50
    urm = data_sparse
    MAX_PID = urm.shape[1]
    MAX_UID = urm.shape[0]
    
    U, S, Vt = compute_svd(urm, K)
    

    Here we set K to 50; PID is the number of songs selected earlier and UID the number of users.

    Next, pick some test users:

    uTest = [4,5,6,7,8,873,23]

    Any users will do; these are user index numbers. For each of them we estimate how much they would like each of the 30,000 candidate songs, i.e. what their rating of each song should be. The SVD has already produced the small factor matrices, so we just multiply them back together:

    uTest = [4,5,6,7,8,873,23]
    
    uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)
    
    for user in uTest:
        print("Recommendation for user with user id {}".format(user))
        rank_value = 1
        for i in uTest_recommended_items[user,0:10]:
            song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
            print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0],list(song_details['artist_name'])[0]))
            rank_value+=1
    

    Recommendation for user with user id 4
    The number 1 recommended song is Fireflies BY Charttraxx Karaoke
    The number 2 recommended song is Hey_ Soul Sister BY Train
    The number 3 recommended song is OMG BY Usher featuring will.i.am
    The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
    The number 5 recommended song is Vanilla Twilight BY Owl City
    The number 6 recommended song is Crumpshit BY Philippe Rochard
    The number 7 recommended song is Billionaire [feat. Bruno Mars] (Explicit Album Version) BY Travie McCoy
    The number 8 recommended song is Love Story BY Taylor Swift
    The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
    The number 10 recommended song is Use Somebody BY Kings Of Leon
    Recommendation for user with user id 5
    The number 1 recommended song is Sehr kosmisch BY Harmonia
    The number 2 recommended song is Ain’t Misbehavin BY Sam Cooke
    The number 3 recommended song is Dog Days Are Over (Radio Edit) BY Florence + The Machine
    The number 4 recommended song is Revelry BY Kings Of Leon
    The number 5 recommended song is Undo BY Björk
    The number 6 recommended song is Cosmic Love BY Florence + The Machine
    The number 7 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
    The number 8 recommended song is You’ve Got The Love BY Florence + The Machine
    The number 9 recommended song is Bring Me To Life BY Evanescence
    The number 10 recommended song is Tighten Up BY The Black Keys
    Recommendation for user with user id 6
    The number 1 recommended song is Crumpshit BY Philippe Rochard
    The number 2 recommended song is Marry Me BY Train
    The number 3 recommended song is Hey_ Soul Sister BY Train
    The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
    The number 5 recommended song is One On One BY the bird and the bee
    The number 6 recommended song is I Never Told You BY Colbie Caillat
    The number 7 recommended song is Canada BY Five Iron Frenzy
    The number 8 recommended song is Fireflies BY Charttraxx Karaoke
    The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
    The number 10 recommended song is Bring Me To Life BY Evanescence
    Recommendation for user with user id 7
    The number 1 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
    The number 2 recommended song is The City Is At War (Album Version) BY Cobra Starship
    The number 3 recommended song is Dead Souls BY Nine Inch Nails
    The number 4 recommended song is Una Confusion BY LU
    The number 5 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
    The number 6 recommended song is Climbing Up The Walls BY Radiohead
    The number 7 recommended song is Tighten Up BY The Black Keys
    The number 8 recommended song is Tive Sim BY Cartola
    The number 9 recommended song is West One (Shine On Me) BY The Ruts
    The number 10 recommended song is Cosmic Love BY Florence + The Machine
    Recommendation for user with user id 8
    The number 1 recommended song is Undo BY Björk
    The number 2 recommended song is Canada BY Five Iron Frenzy
    The number 3 recommended song is Better To Reign In Hell BY Cradle Of Filth
    The number 4 recommended song is Unite (2009 Digital Remaster) BY Beastie Boys
    The number 5 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
    The number 6 recommended song is Rockin’ Around The Christmas Tree BY Brenda Lee
    The number 7 recommended song is Devil’s Slide BY Joe Satriani
    The number 8 recommended song is Revelry BY Kings Of Leon
    The number 9 recommended song is 16 Candles BY The Crests
    The number 10 recommended song is Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon
    Recommendation for user with user id 873
    The number 1 recommended song is The Scientist BY Coldplay
    The number 2 recommended song is Yellow BY Coldplay
    The number 3 recommended song is Clocks BY Coldplay
    The number 4 recommended song is Fix You BY Coldplay
    The number 5 recommended song is In My Place BY Coldplay
    The number 6 recommended song is Shiver BY Coldplay
    The number 7 recommended song is Speed Of Sound BY Coldplay
    The number 8 recommended song is Creep (Explicit) BY Radiohead
    The number 9 recommended song is Sparks BY Coldplay
    The number 10 recommended song is Use Somebody BY Kings Of Leon
    Recommendation for user with user id 23
    The number 1 recommended song is Garden Of Eden BY Guns N’ Roses
    The number 2 recommended song is Don’t Speak BY John Dahlbäck
    The number 3 recommended song is Master Of Puppets BY Metallica
    The number 4 recommended song is TULENLIEKKI BY M.A. Numminen
    The number 5 recommended song is Bring Me To Life BY Evanescence
    The number 6 recommended song is Kryptonite BY 3 Doors Down
    The number 7 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
    The number 8 recommended song is Night Village BY Deep Forest
    The number 9 recommended song is Better To Reign In Hell BY Cradle Of Filth
    The number 10 recommended song is Xanadu BY Olivia Newton-John;Electric Light Orchestra

    Each test user now has a ranked list of recommendations, sorted by estimated score.

    We chose a music data set for this personalized-recommendation task, preprocessed and merged the data, and completed the task with two methods. The similarity approach picks, for each song a user has heard, the most similar candidate songs; its drawback is the compute cost, since every user requires a fresh pass to produce recommendations. The SVD approach builds the rating matrix, factorizes it, then for a chosen user reconstructs the estimated ratings over all songs and returns them sorted.

    uTest = [27513]
    uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)
    
    for user in uTest:
        print("Recommendation for user with user id {}".format(user))
        rank_value = 1
        for i in uTest_recommended_items[user,0:10]:
            song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
            print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0],list(song_details['artist_name'])[0]))
            rank_value+=1
    

    Recommendation for user with user id 27513
    The number 1 recommended song is Master Of Puppets BY Metallica
    The number 2 recommended song is Garden Of Eden BY Guns N’ Roses
    The number 3 recommended song is Bring Me To Life BY Evanescence
    The number 4 recommended song is Kryptonite BY 3 Doors Down
    The number 5 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
    The number 6 recommended song is Night Village BY Deep Forest
    The number 7 recommended song is Savior BY Rise Against
    The number 8 recommended song is Good Things BY Rich Boy / Polow Da Don / Keri Hilson
    The number 9 recommended song is Bleed It Out [Live At Milton Keynes] BY Linkin Park
    The number 10 recommended song is Uprising BY Muse


    This article is reposted from Zhou Xulong's cnblogs post: A First Look at Machine Learning: Recommender System Basics

    1. What on earth is a recommender system?

    Wikipedia explains it like this: a recommender system is an application of information filtering; it recommends information or items the user is likely to enjoy (e.g. movies, TV shows, music, books, news, pictures, web pages).

      The basic workflow of a recommender system:

      Step 1. Collect the user's historical behavior data

      Step 2. Preprocess it into a user-rating matrix

      Step 3. Apply recommendation techniques (mainly algorithms) from machine learning to generate personalized recommendations

      PS: some systems also collect the user's feedback on the recommendations and adjust the strategy in real time, producing results that better match the user's needs.  

      Some everyday recommender scenarios:

      A shopaholic's view => Oh my, these are all things I like. To buy or not to buy, that is the question!

      A music lover's view => Hmm, this song is great, I like it; that one too! Keep feeding me playlists I love. Saved.

      A social butterfly's view => The trending topics are all ones I follow; time to jump into the thread!

      What is a recommender system good for:

      (1) Helping users find what they want => the long tail

      When we open Taobao, the dazzling promotions often leave us with no idea what to buy.

      Economics has a famous idea called the "long tail", shown in the figure below:

    The long-tail curve

      On the internet this means the hottest sliver of resources gets almost all the attention, while a very large remainder is rarely touched. That both wastes resources and leaves users with niche tastes unable to find content they care about.

      So the most important job of a recommender system is to activate content that is genuinely useful to users yet never gets noticed.

      (2) Reducing information overload

      In the internet era the volume of information has exploded; if everything were put on a site's front page, users could not read it and the information would be used very inefficiently.

      So we need recommender systems to filter out low-value information for users.

      (3) Raising a site's click-through and conversion rates

      A good recommender brings users back more often and reliably surfaces the products they want to buy or the content they want to read.

      (4) Understanding users better so as to offer tailored services

      Each time the system successfully recommends something a user likes, our picture of that user's interests gets sharper. Once we can profile every user precisely, we can tailor a range of services so that users with all kinds of needs are satisfied on our platform.

    A user-profile analysis matrix

    2. Common recommendation algorithms

    2.1 A first look at recommendation algorithms

      What exactly is a recommendation algorithm? We can simplify it to a function that takes several inputs and returns a value, as sketched below:

    f(x) = y ?

      As the sketch suggests, the inputs are the attributes and features of users and items (age, gender, region, item category, publication time, and so on); after processing, the algorithm returns a list of items ranked by the user's predicted preference.

    2.2 Common algorithms

      Common recommendation algorithms fall roughly into these families:

    • popularity-based algorithms
    • collaborative filtering
    • content-based algorithms
    • model-based algorithms
    • hybrid algorithms

      Let's look at each in turn:

      (1) Popularity-based algorithms

      Popularity-based recommendation is crude but simple, like the trending lists on news sites or Weibo: rank items by some heat metric such as PV, UV, daily average PV, or share rate, and recommend the top of the list.

      The upside is simplicity, and it works for freshly registered users. The downside is obvious: it cannot personalize.

    PS: this can be refined, for example with popularity ranking within user segments: push the trending sports stories to sports fans and the hot political pieces to users who love discussing politics.

      (2) Collaborative filtering

      This may be the recommendation algorithm we know best; think of the textbook Walmart case, diapers and beer...

      Collaborative filtering (CF) is used on many e-commerce sites; its two main flavors are user-based CF and item-based CF.

      User-based collaborative filtering works as follows:

      1. Analyse each user's evaluations of items (from browsing history, purchase records, etc.);

      2. From those evaluations, compute the similarity between every pair of users;

      3. Select the N users most similar to the current user;

      4. Recommend the items those N users rated highest that the current user has not yet seen.

      The whole procedure is shown in the figure; for the principles and an implementation see 《基于用户的协同过滤推荐算法原理与实现》.
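      The four steps can be sketched in a few lines of NumPy. This is a toy illustration on a made-up 3-user rating matrix, not a production implementation:

```python
import numpy as np

# Rows = users, columns = items; values are ratings (0 = unrated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom else 0.0

def user_based_scores(R, user, k=1):
    # Step 2: similarity of `user` to every other user (self excluded).
    sims = np.array([cosine(R[user], R[other]) if other != user else -1.0
                     for other in range(R.shape[0])])
    # Step 3: keep the k most similar users.
    neighbors = np.argsort(sims)[::-1][:k]
    # Step 4: average the neighbors' ratings, then drop already-rated items.
    scores = R[neighbors].mean(axis=0)
    scores[R[user] > 0] = 0
    return scores

scores = user_based_scores(R, user=0)
print(scores)  # only item 2 (unrated by user 0) keeps a score
```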

      

      Item-based collaborative filtering works as follows:

      1. Analyse each user's browsing records for items;

      2. From those records, compute the similarity between every pair of items;

      3. For the items the current user rates highly, find the N most similar items;

      4. Recommend those N items to the user. 

      The whole procedure is shown in the figure:
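      The item-based variant only changes which vectors we compare: columns instead of rows. A toy sketch on the same kind of made-up matrix:

```python
import numpy as np

# Rows = users, columns = items; values are ratings (0 = unrated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return u @ v / denom if denom else 0.0

# Step 2: item-item similarity, comparing column vectors.
n_items = R.shape[1]
item_sim = np.array([[cosine(R[:, i], R[:, j]) for j in range(n_items)]
                     for i in range(n_items)])

# Steps 3-4: for user 0, score each unrated item by its weighted
# similarity to the items that user has already rated.
user = 0
rated = np.nonzero(R[user])[0]
scores = {j: item_sim[j, rated] @ R[user, rated] / item_sim[j, rated].sum()
          for j in range(n_items) if R[user, j] == 0}
print(scores)
```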

      

      User-based or item-based, the key is building the association matrix. Cosine similarity or the Jaccard formula is used to compute user-user and item-item similarities, where values closer to 1 mean more similar. Finally, take the N items (N>=2) most similar to user A or item A, drop the ones they have already rated, and what remains is the recommendation result.

      Collaborative filtering still has some problems, though:

      1. It depends on accurate user ratings;

      2. During the computation, already-hot items have a higher chance of being recommended;

      3. The cold-start problem: when a new user or new item enters the system, there is nothing to base a recommendation on;

      4. In systems whose items are short-lived (news, ads), the fast turnover means many items never receive ratings, leaving the rating matrix sparse, which hurts recommendations for that content.

      For problem 4, the sparse matrix, we can factor the n*m matrix into an n*k matrix times a k*m matrix (matrix factorization). Here k can represent latent links between user traits, interests, and item attributes; the factorization surfaces these latent user-item associations and fills in the matrix's missing entries.
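      A minimal sketch of that idea: a truncated SVD gives exactly such a low-rank factorization, and the reconstruction fills a missing entry with a value consistent with the dominant pattern. The tiny matrix here is made up:

```python
import numpy as np

# A tiny rating matrix with one missing entry (0 = unknown).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [5.0, 0.0, 1.0],   # user 2's rating for item 1 is missing
])

# Rank-1 factorization R ~ U_k * diag(s_k) * Vt_k (here k = 1).
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 1
R_hat = U[:, :k] * s[:k] @ Vt[:k, :]

# The reconstruction suggests a plausible value for the missing cell.
print(round(R_hat[2, 1], 2))
```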

      (3) Content-based algorithms

      Collaborative filtering looks great and, with refinements, its drawbacks can be overcome. But here's the catch: suppose I am a devoted reader of The Lord of the Rings and bought The Two Towers, and the store just stocked the third volume, The Return of the King. I will obviously be interested, yet neither user ratings nor a title search handle this well, and that is where content-based recommendation comes in.

      For example, take one user and one news article in the system. By analysing the user's behavior and the article's text, we extract several keywords:

      Using these keywords as attributes, we decompose both the user and the article into vectors:

      Computing the distance between those vectors then gives the user-article similarity. The method is simple, but consider recommending news to a fan of the Premier League: if an article carries the keywords sports, soccer, and Premier League at once, matching the first two is clearly less precise than matching "Premier League" directly. How can the system express this kind of keyword "importance"? By introducing word weights. Computed over a large corpus, each keyword in an article gets a weight, and factoring that weight into the similarity computation yields more accurate results.

    sim(user, item) = text_similarity(user, item) * word_weight
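      A minimal sketch of weighted keyword matching. The keyword sets, document frequencies, and IDF-style weight are all made up for illustration; rarer, more specific words get larger weights:

```python
import math

# Hypothetical keyword profiles: user interests vs. article keywords.
user_keywords = {'sports', 'soccer', 'premier-league'}
article_keywords = {'sports', 'soccer', 'premier-league', 'transfer'}

# Hypothetical corpus document frequencies for the weight.
doc_freq = {'sports': 900, 'soccer': 400, 'premier-league': 50, 'transfer': 200}
n_docs = 1000

def idf(word):
    # Inverse-document-frequency weight: rarer words weigh more.
    return math.log(n_docs / doc_freq[word])

# Weighted overlap: each shared keyword contributes its weight, so matching
# 'premier-league' counts for more than matching the generic 'sports'.
shared = user_keywords & article_keywords
score = sum(idf(w) for w in shared)
print({w: round(idf(w), 2) for w in sorted(shared)}, round(score, 2))
```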

      But then another problem: if the user's interest is "soccer" while the article's keywords are "Bundesliga" and "Premier League", the text-matching approach above cannot connect them.

      Here topic clustering helps:

      With a tool like word2vec, keywords can be clustered and the text vectorized by topic. For example, Bundesliga, Premier League, and La Liga can be clustered under the topic "soccer", and LV and Gucci under "luxury goods"; similarity between content and user is then computed at the topic level.

      In summary, content-based recommendation handles the cold-start problem well and is not trapped by popularity, since it matches on content directly rather than on browsing history. It has drawbacks too, such as over-specialisation: it keeps recommending items closely related to what the user already consumes, losing diversity in the recommendations.

      (4)基于模型的算法

      基于模型的方法有很多,用到的诸如机器学习的方法也可以很深,这里只看看一个比较简单的方法——Logistics回归预测。

      举个例子,通过分析系统中用户的行为和购买记录等数据,可以得到如下表:

      

      Each row of the table is an item; x1~xn are the feature attributes that affect user behavior, such as the user's age group, gender and region, or the item's price and category; y is the user's preference for the item, which can come from purchase records, views, favorites, and so on. With a large amount of such data, we can fit a regression function and compute the coefficients of x1~xn, which are the weights of the feature attributes: the larger a weight, the more that attribute matters to the user's choice of items.

      While fitting the function, we may notice that a single attribute may not be strongly associated with the outcome on its own. For instance, age by itself is not strongly associated with buying skincare products, nor is gender by itself, but considered together they become strongly associated with the purchase: say (purely as an illustration) female users aged 20 to 30 are more inclined to buy skincare products. This is called a crossed attribute. Through repeated testing and experience, we can adjust the combinations of feature attributes to fit the most accurate regression function. The resulting attribute weights are shown below:
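      A small self-contained sketch of this idea: plain-Python logistic regression trained by stochastic gradient descent on made-up data where neither raw attribute alone predicts the purchase, but their cross does:

```python
import math

# Made-up samples: (in_20s_30s, is_female, bought_skincare).
data = [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 0)] * 25

def features(age, female):
    # Bias, the two raw attributes, and the crossed attribute age*female.
    return [1.0, age, female, age * female]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w = [0.0, 0.0, 0.0, 0.0]
lr = 0.5
for _ in range(500):  # plain stochastic gradient descent
    for age, female, y in data:
        x = features(age, female)
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j in range(4):
            w[j] += lr * (y - p) * x[j]

p_both = sigmoid(sum(wi * xi for wi, xi in zip(w, features(1, 1))))
p_age_only = sigmoid(sum(wi * xi for wi, xi in zip(w, features(1, 0))))
```

After training, the model predicts a purchase only when both attributes hold, and the crossed feature carries the dominant positive weight, which is exactly the "crossed attribute" effect the paragraph describes.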

      

      Being fast and accurate, model-based algorithms suit businesses with high real-time requirements such as news and advertising. To make such an algorithm perform better, however, requires repeated manual combination and selection of attributes, i.e. what is usually called feature engineering. And because news is time-sensitive, the system also has to retrain and redeploy the online model repeatedly to adapt to change.

      (5) Hybrid algorithms

      In real applications, few systems rely purely on a single algorithm for recommendation. Large sites such as Netflix blend dozens of algorithms in their recommender. So we too can combine the results of different algorithms with weights, or apply different algorithms at different stages of the computation, to better fit our own business.
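      A weighted blend of several recommenders can be as simple as the sketch below; the per-algorithm scores and the weights are invented (in practice weights would be tuned offline or by A/B tests):

```python
# Hypothetical scores from three recommenders for the same candidate items.
cf_scores = {"item_a": 0.9, "item_b": 0.4, "item_c": 0.1}
content_scores = {"item_a": 0.2, "item_b": 0.8, "item_c": 0.3}
popularity = {"item_a": 0.5, "item_b": 0.5, "item_c": 0.9}

# Business-chosen blend weights for the three signals.
w_cf, w_content, w_pop = 0.5, 0.3, 0.2

def blended(item):
    return (w_cf * cf_scores[item]
            + w_content * content_scores[item]
            + w_pop * popularity[item])

ranking = sorted(cf_scores, key=blended, reverse=True)
```

The final ranking trades off all three signals at once, so an item that only one algorithm loves no longer dominates the list.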

    3. Technical implementation of recommender systems

    3.1 Building it yourself on open-source technology

      The most popular approach at present is Spark Streaming + Spark MLlib; the architecture is shown below:

      

      For details, see 《推荐系统架构及流程说明》.

    3.2 Building on a cloud-service platform

      Alibaba Cloud, Tencent Cloud, and established overseas providers such as Azure now offer machine-learning platforms as public cloud services. Companies without in-house machine-learning development capability can build their business on these services without having to dig into the algorithm layer.

      

      Alibaba Cloud in particular provides complete video tutorials and documentation; see its 《机器学习PAI快速入门与业务实战》.

    4. Follow-up learning plan

      Next I will work through a reference book my manager recommended, 《推荐系统实践》, then move on to Microsoft's ML.NET, and finally try writing some machine-learning demos. For an introduction and tutorials on ML.NET, see 《ML.NET 机器学习教程》. I will write up summaries along the way and share them too.

      

    References

    (1) AnnieJ, 《推荐系统介绍》

    (2) 豆腐脑D, 《推荐系统从入门到继续》

    (3) Microsoft, 《ML.NET 机器学习教程》

    (4) Bean.Hsiang, 《ML.NET系列文章》

    (5) 阿里云, 《机器学习PAI快速入门与业务实战》

    (6) 交大全栈工程师, 《程序员必学的5类系统推荐算法总结》



    I have been interested in recommendation for a long time, especially since I started using NetEase Cloud Music (not an ad): the songs it recommends fit my taste very well (I listen widely, not just to a few favorites), which only added to my interest. The material I had read on recommendation was scattered, though, so this post tidies it up a bit.

    Background

    A recommender system is a tool that helps users quickly discover useful information: by analyzing a user's historical behavior, studying their preferences, and modeling their interests, it proactively recommends information that will interest them. In essence, a recommender system solves the problem of a user acquiring additional information. Amid massive redundant information (information overload), users easily lose sight of their goal; the recommender actively filters information, combining the underlying data with algorithmic models to help users reach their target, achieving intelligent recommendation.


    Similarity

    In a recommender system, the usual techniques for producing recommendations involve computing user-user similarity, item-item similarity, and the relevance between a user and an item.

    Similarity is computed from the distance between vectors: the smaller the distance, the greater the similarity. For example, in the two-dimensional user-item preference matrix, one user's preferences over all items form a vector that can be used to compute similarity between users, i.e. the distance between two vectors; likewise, all users' preferences for one item form a vector representing that item, which can be used to compute item-item similarity.

    In other words, we recommend the right items to different users by computing similarity, and that similarity can be computed in the following ways:


    Pearson correlation coefficient (Pearson Correlation Coefficient)

    Generally used to measure the correlation between two variables, it takes values in [-1, 1]: a value greater than 0 means the two variables are positively correlated, less than 0 negatively correlated, and 0 uncorrelated. In recommender systems it is commonly used to compute user-user similarity, with the formula:

    $$\mathrm{sim}(u,v)=\frac{\sum_{i \in I}\left(r_{u,i}-\bar{r}_{u}\right)\left(r_{v,i}-\bar{r}_{v}\right)}{\sqrt{\sum_{i \in I}\left(r_{u,i}-\bar{r}_{u}\right)^{2}}\sqrt{\sum_{i \in I}\left(r_{v,i}-\bar{r}_{v}\right)^{2}}}$$

    where I is the set of items both users u and v have rated; r_{u,i} is user u's rating of item i, and \bar{r}_u is the mean of all of u's ratings; r_{v,i} is user v's rating of item i, and \bar{r}_v is the mean of all of v's ratings.

    From this formula we can see that the coefficient is the covariance of the two variables divided by the product of their standard deviations (a good moment to review what covariance and standard deviation are).
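    A direct pure-Python transcription of the formula, applied to two invented rating vectors over the same five co-rated items:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

u = [4, 5, 3, 2, 4]  # made-up ratings by user u
v = [5, 5, 2, 1, 4]  # made-up ratings by user v
r = pearson(u, v)    # close to 1: the users rate in the same pattern
```

The result is positive and near 1 because the two users rank the items the same way, even though their absolute scores differ.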


    Euclidean similarity

    This uses the distance between two points in Euclidean space. Take users u and v as two vectors x and y in n-dimensional space, where x_i is user u's preference for item i and y_i is user v's. Their Euclidean distance is (compare the two-point distance formula from school):

    $$d(x,y)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}}$$

    The corresponding Euclidean similarity is usually obtained with the conversion:

    $$\mathrm{sim}(x,y)=\frac{1}{1+d(x,y)}$$

    so the smaller the distance, the greater the similarity.
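    A quick sketch of the distance and the 1/(1+d) conversion above, on invented two-dimensional preference vectors:

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two preference vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def euclidean_similarity(x, y):
    # Maps distance in [0, inf) to similarity in (0, 1].
    return 1 / (1 + euclidean_distance(x, y))

d = euclidean_distance([1, 2], [4, 6])    # classic 3-4-5 triangle: 5.0
s = euclidean_similarity([1, 2], [4, 6])  # 1 / (1 + 5)
```

Identical vectors get distance 0 and therefore similarity exactly 1, matching the "smaller distance, greater similarity" rule.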


    Cosine similarity (Cosine Similarity)

    First, why does this measure similarity at all? Let vectors a and b represent two users' preferences in the two-dimensional case. From what we learned in school, the dot product of two vectors equals the magnitude of b times the projection of a onto b; when the angle between them is small enough, a scaled by the cosine nearly coincides with b, which indicates that the two users' preferences are highly similar.

    Cosine similarity computes the cosine of the angle between two vectors. It is widely used for computing similarity between documents; compared with Euclidean distance, it focuses on the difference in direction between the two vectors. Its formula is:

    $$\cos\theta=\frac{a\cdot b}{\|a\|\,\|b\|}=\frac{\sum_{i=1}^{n}a_{i}b_{i}}{\sqrt{\sum_{i=1}^{n}a_{i}^{2}}\,\sqrt{\sum_{i=1}^{n}b_{i}^{2}}}$$

    For example, the similarity of a(1,0,0,1) and b(0,1,0,1) is:

    $$\cos\theta=\frac{0+0+0+1}{\sqrt{2}\times\sqrt{2}}=0.5$$

    Cosine similarity distinguishes mainly by direction and is insensitive to the absolute magnitudes, so it cannot measure value differences on each dimension, and in some cases this makes it unable to distinguish users' ratings.

    For example, users rate content on a 5-point scale, 1 the worst and 5 the best. Users A and B rate the same two items (1, 2) and (4, 5) respectively; cosine similarity gives 0.98.

    That similarity is high, yet in fact user A dislikes both items while user B likes them, so the result is misleading. Adjusted cosine similarity subtracts the mean from every dimension before applying cosine: taking both A's and B's rating mean as 3, the adjusted vectors are (-2, -1) and (1, 2), giving a similarity of -0.8. The value is negative and clearly different, which fits the facts much better.
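    The numbers in this example can be checked directly with a few lines of Python:

```python
from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

a = [1, 2]  # user A: low ratings, dislikes both items
b = [4, 5]  # user B: high ratings, likes both items

raw = cosine(a, b)  # misleadingly close to 1

# Adjusted cosine: subtract the mean rating (3 in the example above)
# from every dimension before computing the cosine.
mid = 3
adj = cosine([v - mid for v in a], [v - mid for v in b])
```

The raw cosine comes out near 0.98 while the adjusted version comes out at -0.8, reproducing the contrast described above.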


    Tanimoto coefficient

    Also known as the Jaccard coefficient, this is an extension of cosine similarity and is likewise often used for computing similarity between documents. (It is also one of the external metrics we analyzed earlier for clustering.)

    The computation is as follows (example from Douban):
    A = [1, 2, 3, 4], B = [1, 2, 5], C = A & B = [1, 2]
    T = Nc / (Na + Nb - Nc) = len(c) / (len(a) + len(b) - len(c)) = 2 / (4 + 3 - 2) = 0.4
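    The same computation packaged as a small function:

```python
def tanimoto(a, b):
    """Tanimoto/Jaccard coefficient of two item collections."""
    sa, sb = set(a), set(b)
    shared = len(sa & sb)  # Nc: items in both collections
    return shared / (len(sa) + len(sb) - shared)

t = tanimoto([1, 2, 3, 4], [1, 2, 5])  # 2 / (4 + 3 - 2) = 0.4
```

Identical collections give exactly 1, and disjoint collections give 0, so the coefficient behaves like a set-overlap similarity.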

