[R] Logistic Regression

eunki 2021. 6. 30. 19:44

Logistic Regression

1. Boosted Logistic Regression
  method = 'LogitBoost'
2. Logistic Model Trees
  method = 'LMT'
3. Penalized Logistic Regression
  method = 'plr'
4. Regularized Logistic Regression
  method = 'regLogistic'
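
Each `method` string above names a model that caret can tune. As a small sketch, `modelLookup()` from caret lists the tuning parameters caret exposes for a given method; the guard is only there so the snippet does nothing when caret is not installed.

```r
# List the tuning parameters caret exposes for one of the methods above.
# Runs only if the caret package is available.
if (requireNamespace("caret", quietly = TRUE)) {
  params <- caret::modelLookup("LogitBoost")
  print(params[, c("model", "parameter", "label")])
}
```

The same call with 'LMT', 'plr', or 'regLogistic' shows which parameters would be tuned for the other models.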

Load the data

rawdata <- read.csv("heart.csv", header = TRUE)
str(rawdata)

Convert the target class to a factor

rawdata$target <- as.factor(rawdata$target) 
unique(rawdata$target)

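Standardize the continuous predictors

# scale() returns a one-column matrix; as.numeric() keeps each column a plain numeric vector
rawdata$age <- as.numeric(scale(rawdata$age))
rawdata$trestbps <- as.numeric(scale(rawdata$trestbps))
rawdata$chol <- as.numeric(scale(rawdata$chol))
rawdata$thalach <- as.numeric(scale(rawdata$thalach))
rawdata$oldpeak <- as.numeric(scale(rawdata$oldpeak))
rawdata$slope <- as.numeric(scale(rawdata$slope))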
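
For reference, `scale()` subtracts the column mean and divides by the sample standard deviation, and it returns a one-column matrix rather than a plain vector. A minimal sketch with a made-up vector `x` (hypothetical values, not taken from heart.csv):

```r
# Toy vector standing in for one continuous predictor (hypothetical values)
x <- c(120, 130, 140, 150, 160)

z_scale  <- scale(x)                 # one-column matrix with centering/scaling attributes
z_manual <- (x - mean(x)) / sd(x)    # the same z-scores computed by hand

is.matrix(z_scale)                           # TRUE: the result is a matrix
all.equal(as.numeric(z_scale), z_manual)     # TRUE: the values agree
```

Wrapping the result in `as.numeric()` is one way to keep the data frame column an ordinary numeric vector.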

Convert the categorical predictors to factors

newdata <- rawdata 
factorVar <- c("sex", "cp", "fbs", "restecg", "exang", "ca", "thal") 
newdata[, factorVar] = lapply(newdata[, factorVar], factor)
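
The conversion matters because a model formula expands a factor into dummy columns, whereas a numeric code would be treated as a magnitude. A small sketch on a made-up data frame `toy` (hypothetical values and a reduced column set):

```r
# Toy data frame with integer-coded categories (hypothetical values)
toy <- data.frame(sex = c(0, 1, 1, 0),
                  cp  = c(0, 2, 3, 2),
                  age = c(63, 37, 41, 56))
cols <- c("sex", "cp")

toy[cols] <- lapply(toy[cols], factor)   # convert only the listed columns

sapply(toy, class)                       # sex and cp are factors, age stays numeric
levels(toy$cp)                           # "0" "2" "3"
colnames(model.matrix(~ cp, toy))        # "(Intercept)" "cp2" "cp3"
```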

Train-test split (7:3)

set.seed(2020)  # fix the seed so the split is reproducible

datatotal <- sort(sample(nrow(newdata), nrow(newdata)*.7))

train <- newdata[datatotal,]
test <- newdata[-datatotal,]

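# Select by name rather than by position: the data has 13 predictors plus target,
# so fixed positions such as 1:12 and 13 would not line up with the target column
train_x <- train[, setdiff(names(train), "target")]
train_y <- train$target

test_x <- test[, setdiff(names(test), "target")]
test_y <- test$target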
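
A quick sanity check for this kind of index-based split is that the two sets have the expected sizes and share no rows. A sketch on a made-up 100-row data frame `toy` (hypothetical data; `floor()` is added here only to make the sample size an explicit integer):

```r
set.seed(2020)
toy <- data.frame(id = 1:100, x = rnorm(100))   # hypothetical data

idx <- sort(sample(nrow(toy), floor(nrow(toy) * 0.7)))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

nrow(toy_train)                                  # 70
nrow(toy_test)                                   # 30
length(intersect(toy_train$id, toy_test$id))     # 0: no row appears in both sets
```

This split is purely random. caret also provides `createDataPartition()`, which samples within each class and could be used instead if the class balance should be preserved.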

LogitBoost

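library(caret)  # trainControl(), train(), confusionMatrix(), varImp(); LogitBoost also needs the caTools package

# Repeated k-fold cross-validation, repeated 5 times (number of folds left at its default)
ctrl <- trainControl(method = "repeatedcv", repeats = 5)

logitFit <- train(target ~ .,
                  data = train,
                  method = "LogitBoost",  # choose the logistic model you want here
                  trControl = ctrl,
                  metric = "Accuracy")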

logitFit

→ Accuracy is highest at nIter = 21.
→ In other words, the model is most accurate after 21 boosting iterations.
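
The `repeatedcv` control used above leaves `number` at caret's default of 10 folds, so with `repeats = 5` each candidate `nIter` is scored on 50 hold-out sets and the accuracies are averaged. The base-R sketch below shows how such fold assignments could be generated; it is illustrative only (not caret's internal code), and the sizes `n`, `k`, and `repeats` are hypothetical.

```r
set.seed(2020)
n <- 20; k <- 10; repeats <- 5   # hypothetical sizes

# For each repeat, shuffle the fold labels 1..k across the n rows
folds <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each (repeat, fold) pair defines one hold-out set of row indices
holdouts <- unlist(
  lapply(folds, function(f) lapply(seq_len(k), function(j) which(f == j))),
  recursive = FALSE
)

length(holdouts)   # 50 = k * repeats
```

On the fitted object, `logitFit$bestTune` holds the selected value and `logitFit$results` the resampled accuracy for every candidate.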

plot(logitFit)

Prediction

pred_test <- predict(logitFit, newdata = test)
confusionMatrix(pred_test, test$target)

→ Accuracy : 0.7582, Kappa : 0.5197
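
Both numbers can be computed by hand from a confusion table: accuracy is the share of cases on the diagonal, and Kappa rescales it against the agreement expected by chance. The sketch below uses a made-up 2x2 table `cm`, not the results of this post.

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(40, 10, 5, 45), nrow = 2,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                        # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)              # Cohen's kappa

po          # 0.85
kappa_hat   # 0.7
```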

Variable importance

importance_logit <- varImp(logitFit, scale = FALSE)
importance_logit

plot(importance_logit)

→ The "cp" variable has the highest importance.
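
One model-independent way to rank predictors for a two-class target is the area under the ROC curve of each predictor taken on its own; caret's `filterVarImp()` computes a score of this kind. Whether the ranking above was produced that way is not stated here, so the base-R sketch below is only an illustration, with a hypothetical helper `auc()` and made-up data `toy`.

```r
# Hypothetical helper: AUC via the Mann-Whitney rank formula, i.e. the
# probability that a random positive case outranks a random negative one
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up data: x1 separates the classes perfectly, x2 is noisier
toy <- data.frame(y  = c(0, 0, 0, 1, 1, 1),
                  x1 = c(1, 2, 3, 4, 5, 6),
                  x2 = c(3, 1, 4, 2, 6, 5))

scores <- sapply(toy[c("x1", "x2")], auc, label = toy$y)
sort(scores, decreasing = TRUE)   # x1 ranks above x2
```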
