[R] Decision Tree & Random Forest

라일락 꽃이 피는 날

eunki 2021. 7. 2. 18:50

Loading the data

rawdata <- read.csv("wine.csv", header = TRUE)
rawdata$Class <- as.factor(rawdata$Class)  # treat the target as a categorical label
str(rawdata)

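Splitting into training and test sets (7:3)

analdata <- rawdata

set.seed(2020)  # fix the seed so the split is reproducible
# Randomly pick 70% of the row indices for training
datatotal <- sort(sample(nrow(analdata), nrow(analdata) * 0.7))
train <- analdata[datatotal, ]
test <- analdata[-datatotal, ]

# Columns 1-13 are the features, column 14 is Class
train_x <- train[, 1:13]
train_y <- train[, 14]

test_x <- test[, 1:13]
test_y <- test[, 14]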
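
The split above samples rows without regard to class, so the class proportions in the training and test sets can drift apart on a small data set. As a minimal sketch of one alternative (not what this post does), a stratified 7:3 split can be written in base R. The built-in iris data stands in for wine.csv here, and the names df, train_s, and test_s are made up for illustration.

```r
# Stand-in data: the built-in iris set replaces wine.csv (an assumption).
df <- iris
names(df)[5] <- "Class"

set.seed(2020)
# Group the row indices by class, then sample 70% within each class.
idx_by_class <- split(seq_len(nrow(df)), df$Class)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), round(length(i) * 0.7))]
}), use.names = FALSE))

train_s <- df[train_idx, ]
test_s  <- df[-train_idx, ]

# Class proportions should now match between the two sets.
prop.table(table(train_s$Class))
prop.table(table(test_s$Class))
```

The caret package loaded later in this post also provides createDataPartition() for the same purpose.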

Decision Tree - Installing the package

install.packages("tree") 
library(tree)

Building a basic tree

treeRaw <- tree(Class ~ ., data = train) 
plot(treeRaw) 
text(treeRaw)

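Cross-validation (CV)

FUN: the pruning function to use
prune.misclass: prunes on the misclassification criterion
The lower the misclassification rate, the higher the accuracy.

The larger the tree grows, the more features it uses, so the model becomes more complex.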

cv_tree <- cv.tree(treeRaw, FUN = prune.misclass)
plot(cv_tree)

→ Optimal decision tree size = 4
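
The optimal size is read off the plot above, but it can also be pulled out of the cross-validation result directly. The sketch below assumes only that the object has size and dev components, which is what cv.tree() returns. best_size is a hypothetical helper, and cv_demo is a hand-made stand-in with illustrative numbers, not the wine results.

```r
# Hypothetical helper: smallest tree size that reaches the minimum CV error.
# `cv` needs `size` and `dev` components, as in the object cv.tree() returns.
best_size <- function(cv) {
  min(cv$size[cv$dev == min(cv$dev)])
}

# Hand-made stand-in for a cv.tree() result (illustrative numbers only).
cv_demo <- list(size = c(7, 5, 4, 2, 1), dev = c(12, 10, 10, 25, 60))
best_size(cv_demo)  # 4: sizes 5 and 4 tie on error, so the smaller one wins
```

With the object from this post, the call would be best_size(cv_tree). Preferring the smaller tree on ties is a design choice that favors the simpler model.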

Pruning

best: the optimal size found through cross-validation

pretty = 0: split labels are printed in full, without abbreviating factor level names

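# prune.tree() prunes by deviance by default; prune.misclass(treeRaw, best = 4)
# would prune on the misclassification criterion used in cv.tree() above.
prune_tree <- prune.tree(treeRaw, best = 4)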
plot(prune_tree)
text(prune_tree, pretty = 0)

Prediction

library(caret)  # confusionMatrix() comes from caret

pred <- predict(prune_tree, test, type = 'class')
confusionMatrix(pred, test$Class)

→ Accuracy : 0.8519, Kappa : 0.7692
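
To make the two figures above concrete, accuracy and kappa can be computed by hand from a confusion table. This is a minimal base R sketch on made-up labels; truth_demo and pred_demo are illustrative vectors, not the wine predictions.

```r
# Made-up labels for illustration only (not the wine results).
truth_demo <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), levels = 1:3)
pred_demo  <- factor(c(1, 1, 2, 2, 2, 2, 3, 3, 3, 1), levels = 1:3)

cm <- table(Prediction = pred_demo, Reference = truth_demo)

# Accuracy: share of cases on the diagonal of the confusion table.
acc <- sum(diag(cm)) / sum(cm)

# Kappa: agreement beyond what the row and column totals give by chance.
p_chance <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa <- (acc - p_chance) / (1 - p_chance)

c(Accuracy = acc, Kappa = kappa)
```

Kappa is lower than accuracy because it discounts the agreement that would be expected from the class frequencies alone.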

Random Forest

library(caret)

ctrl <- trainControl(method = "repeatedcv", repeats = 5) 
rfFit <- train(Class~., 
               data = train, 
               method = "rf",  # Random Forest 
               trControl = ctrl, 
               preProcess = c("center", "scale"), 
               metric = "Accuracy") 

rfFit

→ Accuracy is highest when mtry = 2.

plot(rfFit)
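
In the fit above, caret chooses which mtry values to try. As a sketch of how the candidates could be set explicitly, the tuneGrid argument of train() takes a data frame of values to compare. The built-in iris data stands in for the wine training set, and ctrl_demo, grid_demo, and rf_demo are illustrative names; the smaller number and repeats values are only there to keep the example fast.

```r
library(caret)

# Stand-in data: built-in iris replaces the wine training set (an assumption).
set.seed(2020)
ctrl_demo <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
grid_demo <- expand.grid(mtry = 1:4)  # candidate mtry values to compare

rf_demo <- train(Species ~ .,
                 data = iris,
                 method = "rf",
                 trControl = ctrl_demo,
                 tuneGrid = grid_demo,
                 metric = "Accuracy")

rf_demo$bestTune  # the mtry value with the highest cross-validated accuracy
rf_demo$results   # accuracy and kappa for every candidate
```

Like the original call, this needs the randomForest package installed, since method = "rf" relies on it.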

Prediction

pred_test <- predict(rfFit, newdata = test) 
confusionMatrix(pred_test, test$Class)

→ Accuracy : 0.9815, Kappa : 0.9706

Variable importance

importance_rf <- varImp(rfFit, scale = FALSE)  # unscaled importance scores
importance_rf

plot(importance_rf)
