[R] PCA (Principal Component Analysis) 1

eunki 2021. 7. 8. 19:23

PCA: Principal Component Analysis

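A technique that extracts the components of a dataset one at a time, ordered by how much of the variance each one explains

- The axis that explains the most variance is principal component 1 (PC1)

- Among the axes orthogonal to PC1, the one that explains the most remaining variance is principal component 2 (PC2)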
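
The two properties above can be checked numerically. A minimal sketch on the built-in iris data, assuming standardized variables; the random direction `v` is a hypothetical comparison vector that is not part of PCA itself:

```r
# Standardize the four numeric columns (mean 0, variance 1 per column)
X <- scale(iris[, 1:4])
pca <- prcomp(X)                     # X is already centered and scaled

pc1 <- pca$rotation[, 1]
pc2 <- pca$rotation[, 2]

# PC1 and PC2 are orthogonal: their dot product is numerically zero
dot12 <- sum(pc1 * pc2)

# Variance of the data projected onto PC1 vs. an arbitrary unit direction
set.seed(1)
v <- rnorm(4)
v <- v / sqrt(sum(v^2))              # hypothetical random unit vector
var_pc1  <- var(as.vector(X %*% pc1))
var_rand <- var(as.vector(X %*% v))

print(c(dot12 = dot12, var_pc1 = var_pc1, var_rand = var_rand))
```

No unit direction can give a larger projected variance than PC1, so `var_pc1` should never be smaller than `var_rand`, whichever seed is used.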

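How to choose the number of principal components

1. Use 2 or 3 when the goal is visualization

2. Keep the components whose eigenvalue (component variance) is greater than 1 (the Kaiser criterion; it assumes standardized variables, each with variance 1)

3. Use the elbow point of the scree plot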
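
The quantities behind rules 2 and 3 can be tabulated directly. A rough sketch, assuming the iris data and standardized variables; the names `eigenvalues`, `prop_var`, `cum_var`, and `n_kaiser` are only illustrative:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                      # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)  # share of total variance
cum_var     <- cumsum(prop_var)                # cumulative share

# Rule 2: number of components with eigenvalue > 1
n_kaiser <- sum(eigenvalues > 1)

# Table to read alongside the scree plot when looking for an elbow (rule 3)
print(data.frame(PC = seq_along(eigenvalues), eigenvalues, prop_var, cum_var))
print(n_kaiser)
```

The three rules do not have to agree. When they differ, the choice depends on whether the goal is visualization or retaining variance.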

Inspect the data

head(iris)

Check for missing values

colSums(is.na(iris))

Descriptive statistics and distribution of each variable

summary(iris)

boxplot(iris[,1:4])

Apply the PCA function
center = TRUE, scale. = TRUE : standardize each variable to mean 0 and variance 1

iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
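
Because the variables are standardized, this call should give the same result as an eigen decomposition of the correlation matrix. A small sketch to check that, assuming the iris data; `eig`, `same_values`, and `same_vectors` are illustrative names:

```r
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Eigen decomposition of the correlation matrix of the same columns
eig <- eigen(cor(iris[, 1:4]))

# Component variances should match the eigenvalues
same_values <- all.equal(iris.pca$sdev^2, eig$values)

# Eigenvectors are only defined up to sign, so compare absolute values
same_vectors <- all.equal(abs(unname(iris.pca$rotation)), abs(eig$vectors))

print(same_values)
print(same_vectors)
```

With scale. = FALSE the comparison would be against cov() instead of cor(), and variables with larger units would dominate the first components.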

PCA summary
Standard deviation squared = variance = eigenvalue
Proportion of Variance : share of the total variance explained by each component

summary(iris.pca)

Eigenvector (loadings) of each principal component

iris.pca$rotation

Scores of each observation on each principal component

head(iris.pca$x, 10)
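
The scores are the standardized data projected onto the eigenvectors. A minimal sketch that recomputes them by hand, assuming the same prcomp call as above; `Z` and `scores_manual` are illustrative names:

```r
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Standardize the data the same way prcomp does (mean 0, sd 1 per column)
Z <- scale(iris[, 1:4])

# Project onto the eigenvectors: one column of scores per component
scores_manual <- Z %*% iris.pca$rotation

# Should match iris.pca$x up to floating-point error
same_scores <- all.equal(unname(scores_manual), unname(iris.pca$x))
print(same_scores)
```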

Check the scree plot (variance of each component)
type = 'l' : line graph

plot(iris.pca, type = 'l', main = 'Scree Plot')

Reduce to 2 dimensions

head(iris.pca$x[,1:2], 10)
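
To see how much information survives the reduction, the two retained components can be mapped back to the original (standardized) variables and compared with the data. A sketch under the same assumptions as above; `k`, `Z_hat`, and `retained` are illustrative names:

```r
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
k <- 2

# Rebuild the standardized data from the first k components only
scores_k <- iris.pca$x[, 1:k]
Z_hat <- scores_k %*% t(iris.pca$rotation[, 1:k])

# Share of total variance kept, measured from the reconstruction error
Z <- scale(iris[, 1:4])
retained <- 1 - sum((Z - Z_hat)^2) / sum(Z^2)

# The same share, computed from the component variances
retained_from_sdev <- sum(iris.pca$sdev[1:k]^2) / sum(iris.pca$sdev^2)

print(c(retained = retained, retained_from_sdev = retained_from_sdev))
```

Both numbers should agree with the Cumulative Proportion reported for PC2 by summary(iris.pca).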

Visualize the data reduced to 2 dimensions

install.packages("ggfortify")  # only needed once
library(ggfortify)

autoplot(iris.pca, data = iris, colour = 'Species')
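
If installing ggfortify is not an option, a similar picture can be drawn with base graphics. A rough sketch, not a reproduction of the autoplot output: the axis values may differ, since ggfortify can rescale the scores, and `scores` and `pct` are illustrative names:

```r
iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# First two score columns plus the species label for coloring
scores <- as.data.frame(iris.pca$x[, 1:2])
scores$Species <- iris$Species

# Percentage of variance explained, used in the axis labels
pct <- round(100 * iris.pca$sdev^2 / sum(iris.pca$sdev^2), 1)

plot(scores$PC1, scores$PC2,
     col = as.integer(scores$Species), pch = 19,
     xlab = paste0("PC1 (", pct[1], "%)"),
     ylab = paste0("PC2 (", pct[2], "%)"),
     main = "iris: first two principal components")
legend("topright", legend = levels(scores$Species),
       col = seq_along(levels(scores$Species)), pch = 19)
```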
