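Goal: feature dimensionality reduction with principal component analysis (PCA)
Method:
Join tables: link user_id ----> aisle
Cross-tabulation: count, for each user, how many items were bought in each aisle (fine-grained product category)
Dimensionality reduction: principal component analysis (PCA)
Data source: https://www.kaggle.com/c/instacart-market-basket-analysis/data
·order_products__prior.csv: the products contained in each prior order
。Fields: order_id,product_id,add_to_cart_order,reordered
。Meaning: order id, product id, position in which the product was added to the cart, whether the user had ordered this product before (1 = yes, 0 = no)
·products.csv: product information
。Fields: product_id,product_name,aisle_id,department_id
。Meaning: product id, product name, aisle (fine-grained category) id, department (top-level category) id
·orders.csv: the orders placed by each user
。Fields: order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
。Meaning: order id, user id, evaluation set the order belongs to (prior, train or test), sequence number of the order for that user, day of the week, hour of the day the order was placed, days since the user's previous order (NaN for a user's first order)
·aisles.csv: the aisle (fine-grained category) each product belongs to
。Fields: aisle_id,aisle
。Meaning: aisle id, aisle name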
import numpy as np
import pandas as pd
from IPython.display import display  # built in to Jupyter; imported so the code also runs as a script

# Load the four CSV files (adjust the paths to your local copy of the dataset).
aisles = pd.read_csv(r"E:\instacart-market-basket-analysis\aisles.csv",sep=",",encoding="utf-8")
orders = pd.read_csv(r"E:\instacart-market-basket-analysis\orders.csv",sep=",",encoding="utf-8")
products = pd.read_csv(r"E:\instacart-market-basket-analysis\products.csv",sep=",",encoding="utf-8")
order_products_prior = pd.read_csv(r"E:\instacart-market-basket-analysis\order_products__prior.csv",sep=",",encoding="utf-8")

# Preview the first three rows of each table.
display(aisles.head(3))
display(orders.head(3))
display(products.head(3))
display(order_products_prior.head(3))
|   | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |

|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |

|   | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |

|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
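The order-products file is by far the largest of the four (the joined table further down has 32,434,489 rows), so it may help to load only the columns the analysis needs. The following is a minimal sketch, not part of the original notebook: it reads a tiny in-memory stand-in for the CSV and assumes that the ids fit in 32-bit integers, which the previews suggest but do not prove. The names `csv_text` and `slim` are hypothetical.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for order_products__prior.csv
# (the three rows shown in the preview).
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
"""

# Read only the two join keys and store them as 32-bit integers
# (assumption: every id in the real file fits in int32).
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "product_id"],
    dtype={"order_id": "int32", "product_id": "int32"},
)

print(slim.dtypes)
print(slim.shape)
```

With the real file, the same `usecols` and `dtype` arguments would be passed together with the file path instead of the `io.StringIO` object.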
# Join the four tables so that every purchased item carries its user_id and aisle name.
data01 = pd.merge(orders,order_products_prior,how='inner',on="order_id")
data02 = pd.merge(data01,products,on="product_id")
data03 = pd.merge(data02,aisles,on="aisle_id")
# Show the shape and the last 10000 rows (the notebook truncates the middle of the display).
display(data03.shape,data03.tail(10000))
(32434489, 14)
|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | aisle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32424489 | 2542240 | 75675 | prior | 12 | 5 | 12 | 5.0 | 44471 | 7 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424490 | 3260483 | 75675 | prior | 16 | 0 | 9 | 14.0 | 44471 | 21 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424491 | 2196407 | 75675 | prior | 30 | 0 | 11 | 12.0 | 44471 | 9 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424492 | 532672 | 75675 | prior | 38 | 5 | 13 | 7.0 | 44471 | 20 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424493 | 1705047 | 75675 | prior | 39 | 5 | 13 | 0.0 | 44471 | 20 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424494 | 998672 | 75675 | prior | 48 | 5 | 14 | 11.0 | 44471 | 13 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424495 | 2149746 | 75675 | prior | 49 | 6 | 9 | 8.0 | 44471 | 6 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424496 | 483804 | 75804 | prior | 12 | 6 | 15 | 4.0 | 44471 | 19 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424497 | 1783191 | 76027 | prior | 6 | 4 | 16 | 13.0 | 44471 | 13 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424498 | 3074202 | 76027 | prior | 7 | 2 | 15 | 5.0 | 44471 | 8 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424499 | 431155 | 76081 | prior | 8 | 0 | 14 | 16.0 | 44471 | 8 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424500 | 2879529 | 76238 | prior | 36 | 6 | 10 | 6.0 | 44471 | 25 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424501 | 1652877 | 76238 | prior | 39 | 5 | 10 | 6.0 | 44471 | 10 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424502 | 737972 | 76466 | prior | 20 | 0 | 10 | 7.0 | 44471 | 7 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424503 | 3154632 | 76556 | prior | 80 | 3 | 18 | 2.0 | 44471 | 7 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424504 | 1776861 | 76576 | prior | 7 | 0 | 15 | 7.0 | 44471 | 2 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424505 | 2695824 | 76726 | prior | 4 | 0 | 11 | 28.0 | 44471 | 26 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424506 | 3176388 | 76823 | prior | 1 | 6 | 12 | NaN | 44471 | 19 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424507 | 1441764 | 76866 | prior | 13 | 0 | 16 | 25.0 | 44471 | 7 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424508 | 2888446 | 76868 | prior | 17 | 5 | 10 | 16.0 | 44471 | 19 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424509 | 2670733 | 77148 | prior | 19 | 1 | 9 | 12.0 | 44471 | 24 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424510 | 2328300 | 77187 | prior | 1 | 1 | 9 | NaN | 44471 | 1 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424511 | 1923581 | 77229 | prior | 21 | 3 | 11 | 17.0 | 44471 | 2 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424512 | 2042750 | 77229 | prior | 24 | 0 | 14 | 12.0 | 44471 | 6 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424513 | 2685754 | 77238 | prior | 2 | 0 | 9 | 6.0 | 44471 | 5 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424514 | 1401197 | 77265 | prior | 6 | 1 | 5 | 9.0 | 44471 | 8 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424515 | 2917195 | 77265 | prior | 10 | 4 | 20 | 5.0 | 44471 | 4 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424516 | 1321674 | 77265 | prior | 31 | 0 | 10 | 11.0 | 44471 | 2 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424517 | 1268589 | 77265 | prior | 37 | 1 | 18 | 29.0 | 44471 | 7 | 1 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| 32424518 | 3044303 | 77280 | prior | 23 | 4 | 23 | 1.0 | 44471 | 3 | 0 | Free & Clear Unscented Baby Wipes | 82 | 18 | baby accessories |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32434459 | 814403 | 161964 | prior | 10 | 6 | 12 | 5.0 | 26478 | 20 | 0 | Frozen Apple Juice | 113 | 1 | frozen juice |
| 32434460 | 503516 | 175436 | prior | 4 | 5 | 16 | 13.0 | 26478 | 18 | 0 | Frozen Apple Juice | 113 | 1 | frozen juice |
| 32434461 | 385156 | 183189 | prior | 4 | 1 | 23 | 22.0 | 26478 | 2 | 0 | Frozen Apple Juice | 113 | 1 | frozen juice |
| 32434462 | 471382 | 85005 | prior | 7 | 5 | 0 | 13.0 | 24344 | 1 | 0 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434463 | 1833016 | 92263 | prior | 5 | 2 | 13 | 8.0 | 24344 | 2 | 0 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434464 | 2624885 | 136840 | prior | 2 | 6 | 10 | 4.0 | 24344 | 11 | 0 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434465 | 1604793 | 136840 | prior | 6 | 5 | 10 | 3.0 | 24344 | 17 | 1 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434466 | 3154099 | 136840 | prior | 16 | 2 | 16 | 3.0 | 24344 | 4 | 1 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434467 | 3135581 | 151840 | prior | 70 | 0 | 9 | 1.0 | 24344 | 6 | 0 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434468 | 3297537 | 181495 | prior | 2 | 1 | 14 | 15.0 | 24344 | 9 | 0 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434469 | 823196 | 181495 | prior | 3 | 1 | 14 | 0.0 | 24344 | 1 | 1 | Frozen Concentrate Non-Alcoholic Pina Colada | 113 | 1 | frozen juice |
| 32434470 | 2471510 | 107801 | prior | 8 | 6 | 15 | 4.0 | 5500 | 19 | 0 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434471 | 2181814 | 135090 | prior | 5 | 3 | 14 | 10.0 | 5500 | 3 | 0 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434472 | 962734 | 167413 | prior | 1 | 1 | 12 | NaN | 5500 | 9 | 0 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434473 | 2928960 | 167413 | prior | 4 | 0 | 12 | 10.0 | 5500 | 3 | 1 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434474 | 1393242 | 167413 | prior | 5 | 0 | 12 | 7.0 | 5500 | 21 | 1 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434475 | 2601337 | 181750 | prior | 13 | 0 | 20 | 30.0 | 5500 | 2 | 0 | Blended Juice Beverage, Mango Orange | 113 | 1 | frozen juice |
| 32434476 | 2125702 | 109046 | prior | 3 | 3 | 16 | 8.0 | 2642 | 3 | 0 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434477 | 2849065 | 138824 | prior | 1 | 6 | 13 | NaN | 2642 | 20 | 0 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434478 | 2634996 | 138824 | prior | 6 | 0 | 16 | 28.0 | 2642 | 15 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434479 | 1857751 | 181888 | prior | 2 | 0 | 7 | 10.0 | 2642 | 5 | 0 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434480 | 2131276 | 181888 | prior | 7 | 1 | 11 | 8.0 | 2642 | 6 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434481 | 1466142 | 181888 | prior | 9 | 3 | 14 | 16.0 | 2642 | 4 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434482 | 1022794 | 204495 | prior | 48 | 0 | 9 | 5.0 | 2642 | 9 | 0 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434483 | 3249444 | 204495 | prior | 50 | 6 | 14 | 4.0 | 2642 | 8 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434484 | 2231925 | 204495 | prior | 51 | 1 | 15 | 9.0 | 2642 | 8 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434485 | 327001 | 204495 | prior | 53 | 2 | 8 | 7.0 | 2642 | 1 | 1 | Frozen Concentrated Orange Juice With Added Ca... | 113 | 1 | frozen juice |
| 32434486 | 1997103 | 110030 | prior | 4 | 2 | 16 | 5.0 | 24189 | 8 | 0 | Tropical Fruit Smoothie Tasty American Favorites | 113 | 1 | frozen juice |
| 32434487 | 1362143 | 113181 | prior | 33 | 3 | 17 | 5.0 | 24189 | 12 | 0 | Tropical Fruit Smoothie Tasty American Favorites | 113 | 1 | frozen juice |
| 32434488 | 777464 | 179210 | prior | 7 | 5 | 15 | 20.0 | 24189 | 16 | 0 | Tropical Fruit Smoothie Tasty American Favorites | 113 | 1 | frozen juice |

10000 rows × 14 columns
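Because each join adds columns from a lookup table, it can be worth checking that the lookup keys are unique, so that no purchase row is silently duplicated. The sketch below illustrates this on toy data with the `validate` argument of `DataFrame.merge`; it is an illustration under assumed toy values, not the notebook's own code, and all `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the four tables (values are made up for illustration).
toy_orders = pd.DataFrame({"order_id": [10, 11], "user_id": [1, 2]})
toy_order_products = pd.DataFrame(
    {"order_id": [10, 10, 11], "product_id": [100, 101, 100]}
)
toy_products = pd.DataFrame({"product_id": [100, 101], "aisle_id": [5, 6]})
toy_aisles = pd.DataFrame({"aisle_id": [5, 6], "aisle": ["yogurt", "tea"]})

# validate="many_to_one" raises MergeError if a key is duplicated
# in the right-hand (lookup) table.
toy_merged = (
    toy_order_products
    .merge(toy_orders, on="order_id", validate="many_to_one")
    .merge(toy_products, on="product_id", validate="many_to_one")
    .merge(toy_aisles, on="aisle_id", validate="many_to_one")
)

# Only these two columns are needed for the user-by-aisle cross-tabulation.
toy_user_aisle = toy_merged[["user_id", "aisle"]]
print(toy_user_aisle)

# A lookup table with a duplicated key is rejected instead of inflating the row count.
bad_products = pd.DataFrame({"product_id": [100, 100], "aisle_id": [5, 6]})
try:
    toy_order_products.merge(bad_products, on="product_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Keeping only `user_id` and `aisle` before the cross-tabulation also avoids carrying the other twelve columns through the remaining steps.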
# Cross-tabulate: one row per user, one column per aisle, each cell counts the items bought.
data04 = pd.crosstab(data03["user_id"],data03["aisle"])
display(data04.shape,data04.head(10))
(206209, 134)
| user_id (rows) / aisle (columns) | air fresheners candles | asian foods | baby accessories | baby bath body care | baby food formula | bakery desserts | baking ingredients | baking supplies decor | beauty | beers coolers | ... | spreads | tea | tofu meat alternatives | tortillas flat bread | trail mix snack mix | trash bags liners | vitamins supplements | water seltzer sparkling water | white wines | yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 42 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 8 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 6 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 19 |
| 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |

10 rows × 134 columns
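Each cell of the cross-tabulation is simply the number of joined rows for that (user_id, aisle) pair, with 0 where the user never bought from the aisle. The toy sketch below, with made-up values and hypothetical names `toy`, `ct` and `gb`, shows that `pd.crosstab` yields the same counts as grouping, counting and unstacking.

```python
import pandas as pd

# Toy purchase rows: one row per item bought (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["yogurt", "yogurt", "tea", "tea", "spreads"],
})

# Cross-tabulation: rows are users, columns are aisles, cells are row counts.
ct = pd.crosstab(toy["user_id"], toy["aisle"])

# The same counts built by hand: count rows per pair, pivot aisles into columns,
# and fill the missing pairs with 0.
gb = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(ct)
print((ct.to_numpy() == gb.to_numpy()).all())
```

User 1 bought two yogurt items and one tea item, so the `yogurt` cell for user 1 is 2 and the `spreads` cell is 0.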
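from sklearn.decomposition import PCA

# A float n_components keeps as many principal components as are needed
# to explain that share of the total variance (here 90%).
data = data04
transfer = PCA(n_components=0.9)
xi = transfer.fit_transform(data)
print(xi.shape,transfer.explained_variance_ratio_)

# Name the components F1, F2, ... and wrap them in a DataFrame.
# (A new name is used so the merged table data02 is not overwritten.)
Fi = ["F" + str(i) for i in range(1,xi.shape[1]+1)]
data05 = pd.DataFrame(xi,columns=Fi)
display(data05.head(3))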
(206209, 27) [0.48237998 0.09585824 0.05185877 0.03590181 0.0293466 0.02393094
0.01899492 0.0183208 0.01487788 0.0134451 0.01121877 0.01102918
0.01052171 0.00980307 0.00832174 0.00726185 0.00712991 0.00683061
0.00640343 0.00580483 0.00534075 0.00487297 0.00477908 0.00462158
0.00444346 0.00413755 0.00408034]
|   | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | ... | F18 | F19 | F20 | F21 | F22 | F23 | F24 | F25 | F26 | F27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -24.215659 | 2.429427 | -2.466370 | -0.145686 | 0.269042 | -1.432932 | 2.140677 | -2.738031 | -2.714316 | -1.743135 | ... | -3.225987 | -4.580076 | 0.777403 | -3.699129 | 1.907214 | 2.995386 | 0.772923 | 0.686800 | 1.694394 | -2.343230 |
| 1 | 6.463208 | 36.751116 | 8.382553 | 15.097530 | -6.920938 | -0.978375 | 6.011567 | 3.787725 | -8.180749 | -9.040861 | ... | -0.737606 | -0.737402 | 0.740042 | -0.091338 | 5.151285 | -4.584815 | -3.237894 | 4.121213 | 2.446897 | -4.283485 |
| 2 | -7.990302 | 2.404383 | -11.030064 | 0.672230 | -0.442368 | -2.823272 | -6.284140 | 6.512509 | -2.148634 | -1.585257 | ... | 5.434733 | -3.604842 | 4.282794 | -0.445834 | 3.039337 | -1.469566 | -2.946656 | 1.775345 | -0.444194 | 0.786666 |

3 rows × 27 columns
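The 27 components follow from the 0.9 threshold: the printed explained-variance ratios are added up in order until their running total reaches 90%. The sketch below re-checks this with the ratios copied from the output above. The helper `components_for_variance` is a hypothetical illustration of the selection rule, not scikit-learn's own implementation.

```python
import numpy as np

# Explained-variance ratios copied from the printed PCA output.
ratios = np.array([
    0.48237998, 0.09585824, 0.05185877, 0.03590181, 0.0293466, 0.02393094,
    0.01899492, 0.0183208, 0.01487788, 0.0134451, 0.01121877, 0.01102918,
    0.01052171, 0.00980307, 0.00832174, 0.00726185, 0.00712991, 0.00683061,
    0.00640343, 0.00580483, 0.00534075, 0.00487297, 0.00477908, 0.00462158,
    0.00444346, 0.00413755, 0.00408034,
])


def components_for_variance(ratios, threshold):
    """Smallest k such that the first k ratios sum to at least `threshold`."""
    cumulative = np.cumsum(ratios)
    # Index of the first running total that reaches the threshold, plus one.
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # If the threshold is never reached, fall back to all available components.
    return min(k, len(ratios))


cumulative = np.cumsum(ratios)
print(components_for_variance(ratios, 0.9))  # 27
print(round(float(cumulative[-1]), 4))       # 0.9015
```

The first component alone accounts for about 48% of the variance, and 26 components reach only about 89.7%, which is why the 27th is still included.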