はじめに
deepblueインターン生の中山です。
【集計、ソート、縦持ち・横持ち変換】PythonとRのデータフレーム操作比較 vol.3の続きです。
列分割・NA操作・サンプリングについて、tidyverse
を用いたRとPythonの比較をしたいと思います。
連載
- 下記5記事のまとめ:PythonのpandasとRのdplyrを用いたデータフレーム操作比較
- 【読み込み、要約統計量、行・列選択、代入】PythonとRのデータフレーム操作比較 vol.1
- 【列追加、重複削除、列名変更、結合、マージ、集約】PythonとRのデータフレーム操作比較 vol.2
- 【集計、ソート、縦持ち・横持ち変換】PythonとRのデータフレーム操作比較 vol.3
- 【列分割・NA操作・サンプリング】PythonとRのデータフレーム操作比較 vol.4
- 【train/test分割】PythonとRのデータフレーム操作比較 vol.5
利用したデータ
使ったデータは、MineThatData E-Mail Analytics And Data Mining Challenge datasetです。
前回のblogとデータは同じです。
データ読み込み
Python
import pandas as pd
PATH = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv";
df = pd.read_csv(PATH)
R
library(tidyverse)
PATH <- "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
df <- read.csv(PATH)
列の分割
正規表現を用いて1つの列を複数の列に分割するコード。
history_segment
とsegment
の列をそれぞれ分割しました。
hitory_segment
にはエスケープの問題がありますが、その説明は割愛します。
Python
df.hoge.str.split().tolist(), columns=["col1", "col2"])
df1 = pd.DataFrame(df.history_segment.str.split("\) ").tolist(),
columns=["hs1", "hs2"])
df2 = pd.DataFrame(df.segment.str.split().tolist(),
columns=["seg1", "seg2"])
df_new = pd.concat([df, df1, df2], axis=1)
df_new = df_new.drop(["history_segment", "segment"], axis=1)
# df.head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone Womens E-Mail 0 0 0.0
# 1 6 3) $200 - $350 329.08 1 1 Rural 1 Web No E-Mail 0 0 0.0
# 2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web Womens E-Mail 0 0 0.0
# 3 9 5) $500 - $750 675.83 1 0 Rural 1 Web Mens E-Mail 0 0 0.0
# 4 2 1) $0 - $100 45.34 1 0 Urban 0 Web Womens E-Mail 0 0 0.0
# df_new.head()
# recency history mens womens zip_code newbie channel visit conversion spend hs1 hs2 seg1 seg2
# 0 10 142.44 1 0 Surburban 0 Phone 0 0 0.0 2 $100 - $200 Womens E-Mail
# 1 6 329.08 1 1 Rural 1 Web 0 0 0.0 3 $200 - $350 No E-Mail
# 2 7 180.65 0 1 Surburban 1 Web 0 0 0.0 2 $100 - $200 Womens E-Mail
# 3 9 675.83 1 0 Rural 1 Web 0 0 0.0 5 $500 - $750 Mens E-Mail
# 4 2 45.34 1 0 Urban 0 Web 0 0 0.0 1 $0 - $100 Womens E-Mail
R
separate(hoge, c("col1", "col2"), " ")
df_new <- df %>%
separate(history_segment, c("hs1", "hs2"), "[)] ") %>%
separate(segment, c("seg1", "seg2"), " ")
# df %>% head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 1 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone Womens E-Mail 0 0 0
# 2 6 3) $200 - $350 329.08 1 1 Rural 1 Web No E-Mail 0 0 0
# 3 7 2) $100 - $200 180.65 0 1 Surburban 1 Web Womens E-Mail 0 0 0
# 4 9 5) $500 - $750 675.83 1 0 Rural 1 Web Mens E-Mail 0 0 0
# 5 2 1) $0 - $100 45.34 1 0 Urban 0 Web Womens E-Mail 0 0 0
# 6 6 2) $100 - $200 134.83 0 1 Surburban 0 Phone Womens E-Mail 1 0 0
# df_new %>% head()
# recency hs1 hs2 history mens womens zip_code newbie channel seg1 seg2 visit conversion spend
# 1 10 2 $100 - $200 142.44 1 0 Surburban 0 Phone Womens E-Mail 0 0 0
# 2 6 3 $200 - $350 329.08 1 1 Rural 1 Web No E-Mail 0 0 0
# 3 7 2 $100 - $200 180.65 0 1 Surburban 1 Web Womens E-Mail 0 0 0
# 4 9 5 $500 - $750 675.83 1 0 Rural 1 Web Mens E-Mail 0 0 0
# 5 2 1 $0 - $100 45.34 1 0 Urban 0 Web Womens E-Mail 0 0 0
# 6 6 2 $100 - $200 134.83 0 1 Surburban 0 Phone Womens E-Mail 1 0 0
NAの削除
データフレームの中にNA
がある時に行削除をするコード。
コードの前半では、上記で説明した列の分割を利用して敢えてNA
を作っています。
これは、history_segment
の列に7) $1,000 +
が1308個含まれているためです。
行数が64000
から、62692
になっていることを確認できると思います。
Python
dropna()
df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df, df1], axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.dropna()
# df_na.iloc[67:72,]
# recency history mens womens zip_code newbie channel segment visit conversion spend hs1 hs2 hs3 hs4
# 67 11 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0.0 1) $0 - $100
# 68 9 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0.0 2) $100 - $200
# 69 10 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0.0 7) $1,000 + None
# 70 2 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0.0 3) $200 - $350
# 71 10 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0.0 1) $0 - $100
#
# print(len(df_na))
# 62692
# df_new.iloc[67:72,]
# recency history mens womens zip_code newbie channel segment visit conversion spend hs1 hs2 hs3 hs4
# 67 11 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0.0 1) $0 - $100
# 68 9 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0.0 2) $100 - $200
# 70 2 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0.0 3) $200 - $350
# 71 10 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0.0 1) $0 - $100
# 72 2 428.74 1 0 Rural 0 Phone Mens E-Mail 0 0 0.0 4) $350 - $500
#
# print(len(df_new))
# 64000
R
drop_na()
df_na <- df %>%
separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
drop_na()
# df_na[68:72,]
# recency hs1 hs2 hs3 hs4 history mens womens zip_code newbie channel segment visit conversion spend
# 68 11 1) $0 - $100 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0
# 69 9 2) $100 - $200 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0
# 70 10 7) $1,000 + <NA> 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0
# 71 2 3) $200 - $350 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0
# 72 10 1) $0 - $100 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0
#
# nrow(df_na)
# 64000
# df_new[68:72,]
# recency hs1 hs2 hs3 hs4 history mens womens zip_code newbie channel segment visit conversion spend
# 68 11 1) $0 - $100 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0
# 69 9 2) $100 - $200 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0
# 71 2 3) $200 - $350 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0
# 72 10 1) $0 - $100 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0
# 73 2 4) $350 - $500 428.74 1 0 Rural 0 Phone Mens E-Mail 0 0 0
#
# nrow(df_new)
# 62692
NAの置換
NA
の削除をするコード。
NA
を作り出すまでは、上記「NAの削除」と同様です。
「NAの削除」では、NA
の削除を行っていますが、似たようなコードでNA
を埋めることもできます。
Python
fillna(value={"hoge": "fuga"})
df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df,df1],axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.fillna(value={"hs4":"$1000"})
# display(df_na.iloc[67:72,])
# recency history mens womens zip_code newbie channel segment visit conversion spend hs1 hs2 hs3 hs4
# 67 11 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0.0 1) $0 - $100
# 68 9 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0.0 2) $100 - $200
# 69 10 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0.0 7) $1,000 + None
# 70 2 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0.0 3) $200 - $350
# 71 10 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0.0 1) $0 - $100
#
# print(len(df))
# 64000
# display(df_new.iloc[67:72,])
# recency history mens womens zip_code newbie channel segment visit conversion spend hs1 hs2 hs3 hs4
# 67 11 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0.0 1) $0 - $100
# 68 9 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0.0 2) $100 - $200
# 69 10 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0.0 7) $1,000 + $1000
# 70 2 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0.0 3) $200 - $350
# 71 10 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0.0 1) $0 - $100
#
# print(len(df_new))
# 64000
R
replace_na(list(hoge = "fuga"))
df_na <- df %>%
separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
replace_na(list(hs4="$1000"))
# df_na[68:72,]
# recency hs1 hs2 hs3 hs4 history mens womens zip_code newbie channel segment visit conversion spend
# 68 11 1) $0 - $100 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0
# 69 9 2) $100 - $200 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0
# 70 10 7) $1,000 + <NA> 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0
# 71 2 3) $200 - $350 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0
# 72 10 1) $0 - $100 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0
#
# nrow(df_na)
# 64000
# df_new[68:72,]
# recency hs1 hs2 hs3 hs4 history mens womens zip_code newbie channel segment visit conversion spend
# 68 11 1) $0 - $100 53.81 1 0 Urban 1 Phone No E-Mail 0 0 0
# 69 9 2) $100 - $200 154.15 0 1 Surburban 0 Web Mens E-Mail 0 0 0
# 70 10 7) $1,000 + $1000 1009.44 1 0 Surburban 1 Multichannel Womens E-Mail 0 0 0
# 71 2 3) $200 - $350 278.80 1 0 Rural 0 Web Mens E-Mail 0 0 0
# 72 10 1) $0 - $100 31.12 1 0 Urban 0 Phone Womens E-Mail 0 0 0
#
# nrow(df_new)
# 64000
サンプリング
行のサンプリングをするコード。
データフレームの中からいくつかの行を抽出します。
やり方は主に2種類あり、
・抽出する個数を選択する
・抽出する割合を選択する
のどちらかです。
Python
sample(a)
sample(frac = a)
df1 = df.sample(64) # 個数
df2 = df.sample(frac=0.001) # 割合
# df1.head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 57905 4 6) $750 - $1,000 799.47 0 1 Urban 1 Web Womens E-Mail 0 0 0.0
# 51278 7 1) $0 - $100 83.27 1 0 Urban 0 Phone No E-Mail 0 0 0.0
# 12856 10 4) $350 - $500 362.40 1 1 Urban 0 Web No E-Mail 0 0 0.0
# 2306 11 2) $100 - $200 146.22 0 1 Surburban 1 Phone Womens E-Mail 0 0 0.0
# 6540 10 1) $0 - $100 93.61 1 0 Urban 1 Phone Womens E-Mail 0 0 0.0
#
# df1.shape
# (64, 12)
# df2.head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 27973 5 1) $0 - $100 37.82 0 1 Urban 1 Web Mens E-Mail 0 0 0.0
# 31988 12 2) $100 - $200 129.14 0 1 Urban 0 Web Womens E-Mail 0 0 0.0
# 12988 2 2) $100 - $200 138.71 0 1 Surburban 0 Web No E-Mail 1 0 0.0
# 21122 4 3) $200 - $350 345.04 1 0 Surburban 0 Phone Mens E-Mail 0 0 0.0
# 17095 2 1) $0 - $100 39.85 1 0 Rural 0 Phone Womens E-Mail 0 0 0.0
#
# df2.shape
# (64, 12)
R
sample_n(a)
sample_frac(a)
df1 <- df %>%
sample_n(64) # 個数
df2 <- df %>%
sample_frac(0.001) # 割合
# df1 %>% head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 1 10 1) $0 - $100 50.93 1 0 Surburban 1 Phone Mens E-Mail 0 0 0
# 2 4 2) $100 - $200 124.94 0 1 Rural 0 Web Mens E-Mail 0 0 0
# 3 10 5) $500 - $750 686.45 1 1 Surburban 1 Multichannel Mens E-Mail 0 0 0
# 4 3 7) $1,000 + 1829.32 1 1 Rural 1 Multichannel Mens E-Mail 0 0 0
# 5 1 3) $200 - $350 255.71 1 0 Surburban 0 Multichannel Womens E-Mail 0 0 0
# 6 5 1) $0 - $100 29.99 1 0 Surburban 1 Web Mens E-Mail 1 0 0
#
# dim(df1)
# 64 12
# df2 %>% head()
# recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
# 1 6 1) $0 - $100 29.99 1 0 Surburban 1 Web No E-Mail 0 0 0
# 2 12 2) $100 - $200 125.92 0 1 Urban 1 Web Womens E-Mail 0 0 0
# 3 8 2) $100 - $200 157.14 0 1 Rural 0 Phone Mens E-Mail 0 0 0
# 4 11 3) $200 - $350 284.96 0 1 Surburban 0 Multichannel Mens E-Mail 1 0 0
# 5 2 4) $350 - $500 412.44 1 0 Rural 0 Web No E-Mail 0 0 0
# 6 7 1) $0 - $100 49.97 1 0 Urban 0 Web Womens E-Mail 0 0 0
#
# dim(df2)
# 64 12