1章 ウェブの注文数を分析する10本
ここでは、ある企業のECサイトでの商品の注文数の推移を分析していきます。
データの属性を理解し、分析をするためにデータを加工した後、
データの可視化を行うことで問題を発見していくプロセスを学びます。
1:データを読み込んでみよう
Q. pandasのインポート
import pandas as pd
Q. customer_master.csvをcustomer_masterに格納し冒頭五行を表示
customer_master = pd.read_csv("customer_master.csv")
customer_master.head(5)
|
customer_id |
customer_name |
registration_date |
customer_name_kana |
email |
gender |
age |
birth |
pref |
0 |
IK152942 |
平田 裕次郎 |
2019-01-01 00:25:33 |
ひらた ゆうじろう |
hirata_yuujirou@example.com |
M |
29 |
1990/6/10 |
石川県 |
1 |
TS808488 |
田村 詩織 |
2019-01-01 01:13:45 |
たむら しおり |
tamura_shiori@example.com |
F |
33 |
1986/5/20 |
東京都 |
2 |
AS834628 |
久野 由樹 |
2019-01-01 02:00:14 |
ひさの ゆき |
hisano_yuki@example.com |
F |
63 |
1956/1/2 |
茨城県 |
3 |
AS345469 |
鶴岡 薫 |
2019-01-01 04:48:22 |
つるおか かおる |
tsuruoka_kaoru@example.com |
M |
74 |
1945/3/25 |
東京都 |
4 |
GD892565 |
大内 高史 |
2019-01-01 04:54:51 |
おおうち たかし |
oouchi_takashi@example.com |
M |
54 |
1965/8/5 |
千葉県 |
Q. item_master.csvをitem_masterに格納し冒頭五行を表示
item_master = pd.read_csv("item_master.csv")
item_master.head(5)
|
item_id |
item_name |
item_price |
0 |
S001 |
PC-A |
50000 |
1 |
S002 |
PC-B |
85000 |
2 |
S003 |
PC-C |
120000 |
3 |
S004 |
PC-D |
180000 |
4 |
S005 |
PC-E |
210000 |
transaction_1csvをtransaction_1に格納し冒頭五行を表示
transaction_1 = pd.read_csv("transaction_1.csv")
transaction_1.head()
|
transaction_id |
price |
payment_date |
customer_id |
0 |
T0000000113 |
210000 |
2019-02-01 01:36:57 |
PL563502 |
1 |
T0000000114 |
50000 |
2019-02-01 01:37:23 |
HD678019 |
2 |
T0000000115 |
120000 |
2019-02-01 02:34:19 |
HD298120 |
3 |
T0000000116 |
210000 |
2019-02-01 02:47:23 |
IK452215 |
4 |
T0000000117 |
170000 |
2019-02-01 04:33:46 |
PL542865 |
transaction_detail_1.csvをtransaction_detail_1格納し冒頭五行を表示
transaction_detail_1 = pd.read_csv("transaction_detail_1.csv")
transaction_detail_1.head(5)
|
detail_id |
transaction_id |
item_id |
quantity |
0 |
0 |
T0000000113 |
S005 |
1 |
1 |
1 |
T0000000114 |
S001 |
1 |
2 |
2 |
T0000000115 |
S003 |
1 |
3 |
3 |
T0000000116 |
S005 |
1 |
4 |
4 |
T0000000117 |
S002 |
2 |
2:データを結合(ユニオン)してみよう
transaction_2.csvをtransaction_2格納し冒頭五行を表示
transaction_2 = pd.read_csv("transaction_2.csv")
Q. transaction_1と2を結合しtransactionに格納し冒頭五行を表示
transaction = pd.concat([transaction_1,transaction_2],ignore_index=True)
transaction.head(5)
|
transaction_id |
price |
payment_date |
customer_id |
0 |
T0000000113 |
210000 |
2019-02-01 01:36:57 |
PL563502 |
1 |
T0000000114 |
50000 |
2019-02-01 01:37:23 |
HD678019 |
2 |
T0000000115 |
120000 |
2019-02-01 02:34:19 |
HD298120 |
3 |
T0000000116 |
210000 |
2019-02-01 02:47:23 |
IK452215 |
4 |
T0000000117 |
170000 |
2019-02-01 04:33:46 |
PL542865 |
Q. transaction_1と2とtransactionの長さを表示
print(len(transaction_1))
print(len(transaction_2))
print(len(transaction_1)+len(transaction_2))
print(len(transaction))
5000
1786
6786
6786
transaction_detail_2.csvをtransaction_detail_2格納し冒頭五行を表示
transaction_detail_2 = pd.read_csv("transaction_detail_2.csv")
transaction_detail_2.head(5)
|
detail_id |
transaction_id |
item_id |
quantity |
0 |
5000 |
T0000004870 |
S002 |
3 |
1 |
5001 |
T0000004871 |
S003 |
1 |
2 |
5002 |
T0000004872 |
S001 |
2 |
3 |
5003 |
T0000004873 |
S004 |
1 |
4 |
5004 |
T0000004874 |
S003 |
2 |
Q. transaction_detail_1と2を結合しtransaction_detailに格納し冒頭五行を表示
transaction_detail = pd.concat([transaction_detail_1,transaction_detail_2],ignore_index=True)
transaction_detail.head(5)
|
detail_id |
transaction_id |
item_id |
quantity |
0 |
0 |
T0000000113 |
S005 |
1 |
1 |
1 |
T0000000114 |
S001 |
1 |
2 |
2 |
T0000000115 |
S003 |
1 |
3 |
3 |
T0000000116 |
S005 |
1 |
4 |
4 |
T0000000117 |
S002 |
2 |
Q. transaction_detail_1と2の合計がtransaction_detailの総数と合致することをprint無しで一行で表示
len(transaction_detail_1)+len(transaction_detail_2)==len(transaction_detail)
True
3:売上データ同士を結合(ジョイン)してみよう
Q. transaction_detailとtransactionを横結合しjoin_dataに格納
- transaction_detailは全列
- transaction:"transaction_id","payment_date","customer_id"
- transaction_idで結合
- left

'マージ|結合
join_data = pd.merge(transaction_detail,transaction[["transaction_id","payment_date","customer_id"]], on="transaction_id", how="left")
join_data.head()
|
detail_id |
transaction_id |
item_id |
quantity |
payment_date |
customer_id |
0 |
0 |
T0000000113 |
S005 |
1 |
2019-02-01 01:36:57 |
PL563502 |
1 |
1 |
T0000000114 |
S001 |
1 |
2019-02-01 01:37:23 |
HD678019 |
2 |
2 |
T0000000115 |
S003 |
1 |
2019-02-01 02:34:19 |
HD298120 |
3 |
3 |
T0000000116 |
S005 |
1 |
2019-02-01 02:47:23 |
IK452215 |
4 |
4 |
T0000000117 |
S002 |
2 |
2019-02-01 04:33:46 |
PL542865 |
4:マスタデータを結合(ジョイン)してみよう
Q. join_dataとcustomer_masterを横結合しjoin_dataに格納
- join_dataは全列
- customer_master:"customer_id"
- customer_idで結合
- left
#横結合|マージ
join_data=pd.merge(join_data,customer_master,on="customer_id",how="left")
join_data.head(5)
|
detail_id |
transaction_id |
item_id |
quantity |
payment_date |
customer_id |
customer_name_x |
registration_date_x |
customer_name_kana_x |
email_x |
... |
price |
payment_month |
customer_name_y |
registration_date_y |
customer_name_kana_y |
email_y |
gender_y |
age_y |
birth_y |
pref_y |
0 |
0 |
T0000000113 |
S005 |
1 |
2019-02-01 01:36:57 |
PL563502 |
井本 芳正 |
2019-01-07 14:34:35 |
いもと よしまさ |
imoto_yoshimasa@example.com |
... |
210000 |
201902 |
井本 芳正 |
2019-01-07 14:34:35 |
いもと よしまさ |
imoto_yoshimasa@example.com |
M |
30 |
1989/7/15 |
熊本県 |
1 |
1 |
T0000000114 |
S001 |
1 |
2019-02-01 01:37:23 |
HD678019 |
三船 六郎 |
2019-01-27 18:00:11 |
みふね ろくろう |
mifune_rokurou@example.com |
... |
50000 |
201902 |
三船 六郎 |
2019-01-27 18:00:11 |
みふね ろくろう |
mifune_rokurou@example.com |
M |
73 |
1945/11/29 |
京都府 |
2 |
2 |
T0000000115 |
S003 |
1 |
2019-02-01 02:34:19 |
HD298120 |
山根 小雁 |
2019-01-11 08:16:02 |
やまね こがん |
yamane_kogan@example.com |
... |
120000 |
201902 |
山根 小雁 |
2019-01-11 08:16:02 |
やまね こがん |
yamane_kogan@example.com |
M |
42 |
1977/5/17 |
茨城県 |
3 |
3 |
T0000000116 |
S005 |
1 |
2019-02-01 02:47:23 |
IK452215 |
池田 菜摘 |
2019-01-10 05:07:38 |
いけだ なつみ |
ikeda_natsumi@example.com |
... |
210000 |
201902 |
池田 菜摘 |
2019-01-10 05:07:38 |
いけだ なつみ |
ikeda_natsumi@example.com |
F |
47 |
1972/3/17 |
兵庫県 |
4 |
4 |
T0000000117 |
S002 |
2 |
2019-02-01 04:33:46 |
PL542865 |
栗田 憲一 |
2019-01-25 06:46:05 |
くりた けんいち |
kurita_kenichi@example.com |
... |
170000 |
201902 |
栗田 憲一 |
2019-01-25 06:46:05 |
くりた けんいち |
kurita_kenichi@example.com |
M |
74 |
1944/12/17 |
長崎県 |
5 rows × 26 columns
join_data=pd.merge(join_data,item_master,on="item_id",how="left")
join_data.head(5)
|
detail_id |
transaction_id |
item_id |
quantity |
payment_date |
customer_id |
customer_name |
registration_date |
customer_name_kana |
email |
gender |
age |
birth |
pref |
item_name |
item_price |
0 |
0 |
T0000000113 |
S005 |
1 |
2019-02-01 01:36:57 |
PL563502 |
井本 芳正 |
2019-01-07 14:34:35 |
いもと よしまさ |
imoto_yoshimasa@example.com |
M |
30 |
1989/7/15 |
熊本県 |
PC-E |
210000 |
1 |
1 |
T0000000114 |
S001 |
1 |
2019-02-01 01:37:23 |
HD678019 |
三船 六郎 |
2019-01-27 18:00:11 |
みふね ろくろう |
mifune_rokurou@example.com |
M |
73 |
1945/11/29 |
京都府 |
PC-A |
50000 |
2 |
2 |
T0000000115 |
S003 |
1 |
2019-02-01 02:34:19 |
HD298120 |
山根 小雁 |
2019-01-11 08:16:02 |
やまね こがん |
yamane_kogan@example.com |
M |
42 |
1977/5/17 |
茨城県 |
PC-C |
120000 |
3 |
3 |
T0000000116 |
S005 |
1 |
2019-02-01 02:47:23 |
IK452215 |
池田 菜摘 |
2019-01-10 05:07:38 |
いけだ なつみ |
ikeda_natsumi@example.com |
F |
47 |
1972/3/17 |
兵庫県 |
PC-E |
210000 |
4 |
4 |
T0000000117 |
S002 |
2 |
2019-02-01 04:33:46 |
PL542865 |
栗田 憲一 |
2019-01-25 06:46:05 |
くりた けんいち |
kurita_kenichi@example.com |
M |
74 |
1944/12/17 |
長崎県 |
PC-B |
85000 |
5:必要なデータ列を作ろう
Q. 列を追加(quantity*item_price -> price列)
#列追加
join_data["price"]=join_data["quantity"]*join_data["item_price"]
join_data[["quantity","item_price","price"]].head(5)
|
quantity |
item_price |
price |
0 |
1 |
210000 |
210000 |
1 |
1 |
50000 |
50000 |
2 |
1 |
120000 |
120000 |
3 |
1 |
210000 |
210000 |
4 |
2 |
85000 |
170000 |
join_data.head(1)
|
detail_id |
transaction_id |
item_id |
quantity |
payment_date |
customer_id |
customer_name |
registration_date |
customer_name_kana |
email |
gender |
age |
birth |
pref |
item_name |
item_price |
price |
0 |
0 |
T0000000113 |
S005 |
1 |
2019-02-01 01:36:57 |
PL563502 |
井本 芳正 |
2019-01-07 14:34:35 |
いもと よしまさ |
imoto_yoshimasa@example.com |
M |
30 |
1989/7/15 |
熊本県 |
PC-E |
210000 |
210000 |
6:データ検算をしよう
Q. join_dataとtransactionのpriceの合計をそれぞれ表示
print(join_data["price"].sum())
print(transaction["price"].sum())
join_data["price"].sum()==transaction["price"].sum()
971135000
971135000
True
7:各種統計量を把握しよう
Q.欠損値の合計を表示
#欠損値
join_data.isnull().sum()
detail_id 0
transaction_id 0
item_id 0
quantity 0
payment_date 0
customer_id 0
customer_name 0
registration_date 0
customer_name_kana 0
email 0
gender 0
age 0
birth 0
pref 0
item_name 0
item_price 0
price 0
dtype: int64
各種統計量 (std=標準偏差、四分位数(シブンイスウ))

join_data.describe()
|
detail_id |
quantity |
age |
item_price |
price |
count |
7144.000000 |
7144.000000 |
7144.000000 |
7144.000000 |
7144.000000 |
mean |
3571.500000 |
1.199888 |
50.265677 |
121698.628219 |
135937.150056 |
std |
2062.439494 |
0.513647 |
17.190314 |
64571.311830 |
68511.453297 |
min |
0.000000 |
1.000000 |
20.000000 |
50000.000000 |
50000.000000 |
25% |
1785.750000 |
1.000000 |
36.000000 |
50000.000000 |
85000.000000 |
50% |
3571.500000 |
1.000000 |
50.000000 |
102500.000000 |
120000.000000 |
75% |
5357.250000 |
1.000000 |
65.000000 |
187500.000000 |
210000.000000 |
max |
7143.000000 |
4.000000 |
80.000000 |
210000.000000 |
420000.000000 |
payment_dateカラムの値で集計範囲を確認
print(join_data["payment_date"].min()) #最小
print(join_data["payment_date"].max()) #最大
2019-02-01 01:36:57
2019-07-31 23:41:38
payment_dateカラムの値で集計範囲を確認
8:月別でデータを集計してみよう
join_dataの各カラムのデータ型を確認
join_data.dtypes #データ型の確認
detail_id int64
transaction_id object
item_id object
quantity int64
payment_date object
customer_id object
customer_name object
registration_date object
customer_name_kana object
email object
gender object
age int64
birth object
pref object
item_name object
item_price int64
price int64
dtype: object
Q.payment_dateをdatetime型に変換
#datetime型に変換
join_data["payment_date"] = pd.to_datetime(join_data["payment_date"])
Q.payment_dateを年/月で文字列に変換し、payment_monthへ格納
join_data["payment_month"]=join_data["payment_date"].dt.strftime("%Y%m")
#strftime(): 日付、時間から文字列への変換 (dtはdatetime)
#※strptime(): 文字列から日付、時間への変換
payment_dateとpayment_monthの五行目を表示payment_dateを年/月で文字列に変換し、payment_monthへ格納
#年/月で文字列に変換
join_data[["payment_date","payment_month"]].head(5)
|
payment_date |
payment_month |
0 |
2019-02-01 01:36:57 |
201902 |
1 |
2019-02-01 01:37:23 |
201902 |
2 |
2019-02-01 02:34:19 |
201902 |
3 |
2019-02-01 02:47:23 |
201902 |
4 |
2019-02-01 04:33:46 |
201902 |
Q.payment_monthでグループ化し、合計を表示

#payment_monthでグループ化し、合計を表示
join_data.groupby("payment_month").sum()
|
detail_id |
quantity |
age |
item_price |
price |
payment_month |
|
|
|
|
|
201902 |
676866 |
1403 |
59279 |
142805000 |
160185000 |
201903 |
2071474 |
1427 |
58996 |
142980000 |
160370000 |
201904 |
3476816 |
1421 |
59246 |
143670000 |
160510000 |
201905 |
4812795 |
1390 |
58195 |
139655000 |
155420000 |
201906 |
6369999 |
1446 |
61070 |
147090000 |
164030000 |
201907 |
8106846 |
1485 |
62312 |
153215000 |
170620000 |
上記グループ化でpriceのみを表示
#上記グループ化でpriceのみを表示
join_data.groupby("payment_month").sum()["price"]
payment_month
201902 160185000
201903 160370000
201904 160510000
201905 155420000
201906 164030000
201907 170620000
Name: price, dtype: int64
9:月別、商品別でデータを集計してみよう
月別payment_month、商品別item_nameでデータをグループ化、priceとquantityの合計
#月別payment_month、商品別item_nameでデータをグループ化、priceとquantityの合計
join_data.groupby(["payment_month", "item_name"]).sum()[["price","quantity"]]
|
|
price |
quantity |
payment_month |
item_name |
|
|
201902 |
PC-A |
24150000 |
483 |
PC-B |
25245000 |
297 |
PC-C |
19800000 |
165 |
PC-D |
31140000 |
173 |
PC-E |
59850000 |
285 |
201903 |
PC-A |
26000000 |
520 |
PC-B |
25500000 |
300 |
PC-C |
19080000 |
159 |
PC-D |
25740000 |
143 |
PC-E |
64050000 |
305 |
201904 |
PC-A |
25900000 |
518 |
PC-B |
23460000 |
276 |
PC-C |
21960000 |
183 |
PC-D |
24300000 |
135 |
PC-E |
64890000 |
309 |
201905 |
PC-A |
24850000 |
497 |
PC-B |
25330000 |
298 |
PC-C |
20520000 |
171 |
PC-D |
25920000 |
144 |
PC-E |
58800000 |
280 |
201906 |
PC-A |
26000000 |
520 |
PC-B |
23970000 |
282 |
PC-C |
21840000 |
182 |
PC-D |
28800000 |
160 |
PC-E |
63420000 |
302 |
201907 |
PC-A |
25250000 |
505 |
PC-B |
28220000 |
332 |
PC-C |
19440000 |
162 |
PC-D |
26100000 |
145 |
PC-E |
71610000 |
341 |
#上記グループ化でpriceのみを表示
join_data.groupby("payment_month").sum()["price"]
payment_month
201902 160185000
201903 160370000
201904 160510000
201905 155420000
201906 164030000
201907 170620000
Name: price, dtype: int64
Qピボットテーブル

#ピボットテーブル
pd.pivot_table(join_data, index='item_name', columns='payment_month', values=['price', 'quantity'], aggfunc='sum')
|
price |
quantity |
payment_month |
201902 |
201903 |
201904 |
201905 |
201906 |
201907 |
201902 |
201903 |
201904 |
201905 |
201906 |
201907 |
item_name |
|
|
|
|
|
|
|
|
|
|
|
|
PC-A |
24150000 |
26000000 |
25900000 |
24850000 |
26000000 |
25250000 |
483 |
520 |
518 |
497 |
520 |
505 |
PC-B |
25245000 |
25500000 |
23460000 |
25330000 |
23970000 |
28220000 |
297 |
300 |
276 |
298 |
282 |
332 |
PC-C |
19800000 |
19080000 |
21960000 |
20520000 |
21840000 |
19440000 |
165 |
159 |
183 |
171 |
182 |
162 |
PC-D |
31140000 |
25740000 |
24300000 |
25920000 |
28800000 |
26100000 |
173 |
143 |
135 |
144 |
160 |
145 |
PC-E |
59850000 |
64050000 |
64890000 |
58800000 |
63420000 |
71610000 |
285 |
305 |
309 |
280 |
302 |
341 |
10:商品別の売上推移を可視化してみよう
Q.ピボットテーブル2 (graph_dataにピボットを格納>冒頭を表示

#ピボットテーブル
graph_data = pd.pivot_table(join_data, index='payment_month', columns='item_name', values='price', aggfunc='sum')
graph_data.head()
item_name |
PC-A |
PC-B |
PC-C |
PC-D |
PC-E |
payment_month |
|
|
|
|
|
201902 |
24150000 |
25245000 |
19800000 |
31140000 |
59850000 |
201903 |
26000000 |
25500000 |
19080000 |
25740000 |
64050000 |
201904 |
25900000 |
23460000 |
21960000 |
24300000 |
64890000 |
201905 |
24850000 |
25330000 |
20520000 |
25920000 |
58800000 |
201906 |
26000000 |
23970000 |
21840000 |
28800000 |
63420000 |
Q. graph_dataをグラフで表示
- payment_monthをデータグレーム型のindexとして横軸に

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(list(graph_data.index),graph_data["PC-A"], label='PC-A')
plt.plot(list(graph_data.index), graph_data["PC-B"], label='PC-B')
plt.plot(list(graph_data.index), graph_data["PC-C"], label='PC-C')
plt.plot(list(graph_data.index), graph_data["PC-D"], label='PC-D')
plt.plot(list(graph_data.index), graph_data["PC-E"], label='PC-E')
plt.legend() #凡例の表示
