R - Let's try cluster analysis

Let's run a cluster analysis ourselves. Paste the code below into [[RStudio]] step by step and run it. (See also [[R - Rプログラムをコピペして動かそう|R - Copy, paste, and run R programs]].)

## Preparing and preprocessing the data

```r
# First, delete every object in the workspace.
# Anything already there will be removed, so save what you need before running this.
rm(list = ls())

# Height and weight data for 30 people, with more variety
d <- data.frame(
  height = c(
    155, 158, 157, 154, 156,
    160, 157, 159, 158, 156,
    182, 185, 183, 186, 188,
    184, 187, 183, 185, 189,
    168, 172, 170, 173, 169,
    145, 195, 175, 162, 178
  ),
  weight = c(
    50, 52, 51, 49, 53,
    78, 75, 80, 77, 76,
    65, 68, 67, 64, 66,
    95, 98, 93, 97, 100,
    68, 70, 67, 71, 69,
    45, 85, 62, 90, 58
  )
)

# Check the data
head(d)
```

This gives a data frame with two columns, `height` (cm) and `weight` (kg).

Next, standardize the data. `scale` performs standardization, which converts a variable to a distribution with mean 0 and standard deviation 1. You do not need to understand the details at this point.

```r
# Standardize the data and add the results as new variables
d$std_height <- scale(d$height)
d$std_weight <- scale(d$weight)

head(d)
```

The data now looks like this, with the standardized values next to the original height and weight.

```r
> head(d)
  height weight std_height std_weight
1    155     50 -1.1624686 -1.3547854
2    158     52 -0.9440507 -1.2275755
3    157     51 -1.0168566 -1.2911805
4    154     49 -1.2352745 -1.4183904
5    156     53 -1.0896626 -1.1639706
6    160     78 -0.7984388  0.4261532
```

## Cluster analysis with k-means

We run the cluster analysis with the k-means method, using the standardized variables. The number of clusters (3) and `nstart = 25` are chosen arbitrarily here.

```r
# Cluster analysis with k-means
km <- kmeans(d[, c("std_height", "std_weight")], centers = 3, nstart = 25)

# Add the cluster assignment to the data
d$cluster <- as.factor(km$cluster)
```

## Visualizing the clusters

Plot the k-means result as a scatter plot.

```r
# Visualize the clusters
library(ggplot2)

p <- ggplot(d, aes(x = height, y = weight, color = cluster)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Height (cm)", y = "Weight (kg)", color = "Cluster") +
  theme_minimal()

print(p)
```

![[Pasted image 20250501215758.png|400]]

## Interpreting the clusters

The analysis split the 30 people into three groups. The following code checks the size of each cluster and its mean height and weight.

```r
# Size of each cluster (number of people)
table(d$cluster)

# Mean height and weight of each cluster
# Cluster 1 means
cluster1 <- d[d$cluster == 1, ]
mean(cluster1$height)
mean(cluster1$weight)

# Cluster 2 means
cluster2 <- d[d$cluster == 2, ]
mean(cluster2$height)
mean(cluster2$weight)

# Cluster 3 means
cluster3 <- d[d$cluster == 3, ]
mean(cluster3$height)
mean(cluster3$weight)
```
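The results are as follows.

```r
> # Size of each cluster (number of people)
> table(d$cluster)

 1  2  3 
 6 18  6 
> 
> # Mean height and weight of each cluster
> # Cluster 1 means
> cluster1 <- d[d$cluster == 1, ]
> mean(cluster1$height)
[1] 187.1667
> mean(cluster1$weight)
[1] 94.66667
> 
> # Cluster 2 means
> cluster2 <- d[d$cluster == 2, ]
> mean(cluster2$height)
[1] 171.1667
> mean(cluster2$weight)
[1] 70.61111
> 
> # Cluster 3 means
> cluster3 <- d[d$cluster == 3, ]
> mean(cluster3$height)
[1] 154.1667
> mean(cluster3$weight)
[1] 50
```

The k-means analysis divided the 30 people into three groups. Cluster numbers are assigned arbitrarily, so the labels 1 to 3 may come out in a different order when you run the code yourself. The clusters have the following characteristics:

1. **Tall, heavy group**: mean height about 187 cm, mean weight about 95 kg. People with a large build overall (6 people)
2. **Medium height, medium weight group**: mean height about 171 cm, mean weight about 71 kg. People with an average build (18 people)
3. **Short, light group**: mean height about 154 cm, mean weight 50 kg. People with a small build overall (6 people)

In this way, the combination of height and weight separated the data into three distinct groups. The medium-build group is the largest, and the tall, heavy group and the short, light group have the same number of people, 6 each.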
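
Because k-means starts from randomly chosen centers, it can help to fix the random seed so that a run can be repeated, and to build the per-cluster summary in one step. The following is a minimal sketch of that idea, not part of the original procedure: the seed value `42` is arbitrary, and the names `km_fixed` and `cluster_summary` are hypothetical names introduced here for illustration.

```r
# Rebuild the same 30-person data set so this block runs on its own.
d <- data.frame(
  height = c(155, 158, 157, 154, 156, 160, 157, 159, 158, 156,
             182, 185, 183, 186, 188, 184, 187, 183, 185, 189,
             168, 172, 170, 173, 169, 145, 195, 175, 162, 178),
  weight = c(50, 52, 51, 49, 53, 78, 75, 80, 77, 76,
             65, 68, 67, 64, 66, 95, 98, 93, 97, 100,
             68, 70, 67, 71, 69, 45, 85, 62, 90, 58)
)

# Standardize both variables (mean 0, standard deviation 1).
# scale() returns a one-column matrix, so as.numeric() flattens it to a vector.
d$std_height <- as.numeric(scale(d$height))
d$std_weight <- as.numeric(scale(d$weight))

# Assumption: 42 is an arbitrary seed; any fixed value makes the run repeatable.
set.seed(42)
km_fixed <- kmeans(d[, c("std_height", "std_weight")], centers = 3, nstart = 25)
d$cluster <- as.factor(km_fixed$cluster)

# Mean height and weight per cluster in a single call.
cluster_summary <- aggregate(cbind(height, weight) ~ cluster, data = d, FUN = mean)

# Add the number of people in each cluster (same cluster order as above).
cluster_summary$n <- as.vector(table(d$cluster))

print(cluster_summary)
```

Each row of `cluster_summary` holds one cluster's mean height, mean weight, and size, so it can be compared directly with the three groups described above. The cluster numbers themselves may still be assigned in a different order.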