Regression Analysis

eunki 2021. 5. 27. 13:17

1. Data preprocessing

1) Changing data types

df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(str)

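# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
preprocessed_df = df[['Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']].copy()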
preprocessed_df.head()

2) one-hot encoding

def make_list(x1, x2):
    type_list = []
    type_list.append(x1)
    if pd.notna(x2):  # 'Type 2' is missing for single-type rows
        type_list.append(x2)
    return type_list
    
preprocessed_df['Type'] = preprocessed_df.apply(lambda x: make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()

del preprocessed_df['Type 1']
del preprocessed_df['Type 2']

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
preprocessed_df = preprocessed_df.join(pd.DataFrame(mlb.fit_transform(preprocessed_df.pop('Type')), columns=mlb.classes_))
preprocessed_df.head()

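# One-hot encode the remaining string column ('Generation').
# The later steps keep using preprocessed_df, so encoded_df is only inspected here.
encoded_df = pd.get_dummies(preprocessed_df)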
encoded_df.head()
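To make the multi-label step easier to follow, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is made up for illustration and is not part of the dataset). It shows how MultiLabelBinarizer turns a list of types into one 0/1 column per distinct type.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy rows: each entry has one or two types.
toy = pd.DataFrame({"Type": [["Grass", "Poison"], ["Fire"], ["Water"]]})

mlb = MultiLabelBinarizer()
# fit_transform returns one 0/1 column per distinct label, in sorted label order.
encoded = pd.DataFrame(mlb.fit_transform(toy["Type"]), columns=mlb.classes_)
print(encoded)
```

A row with two types gets two 1s, which is why this is used here instead of a plain one-hot encoding of 'Type 1' and 'Type 2' separately.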

3) Feature standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scale_columns = ['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])
preprocessed_df.head()
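As a rough illustration of what the scaler computes, the sketch below standardizes a hypothetical array by hand (the values are invented). StandardScaler subtracts the column mean and divides by the population standard deviation (ddof=0), which is also NumPy's default for `std()`.

```python
import numpy as np

# Hypothetical HP values, for illustration only.
hp = np.array([45.0, 60.0, 80.0, 39.0])

# z-score: subtract the mean, divide by the population standard deviation.
z = (hp - hp.mean()) / hp.std()
print(z)
```

After this transformation the column has mean 0 and standard deviation 1, so features measured on different scales become comparable.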

4) Splitting the dataset

from sklearn.model_selection import train_test_split

x = preprocessed_df.loc[:, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

x_train.shape  # (600, 26)
x_test.shape  # (200, 26)
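The split above does not take the label into account, so with few positive rows the share of positives can differ between train and test. One optional variation, which this post does not use, is the `stratify` argument of train_test_split. The sketch below uses hypothetical toy labels (`x_toy`, `y_toy` are made-up names and data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives out of 100 rows.
y_toy = np.array([1] * 10 + [0] * 90)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio roughly equal in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=33, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```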

2. Training the regression model

1) Training the model

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

2) Evaluating the model

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy_score(y_test, y_pred)  # 0.955
precision_score(y_test, y_pred)  # 0.6153846153846154
recall_score(y_test, y_pred)  # 0.6666666666666666
f1_score(y_test, y_pred)  # 0.64

from sklearn.metrics import confusion_matrix

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
confmat  # [[183   5]
         #  [  4   8]]
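The four scores can be recomputed directly from the confusion matrix, which makes it easier to see why accuracy is high while precision and recall are not. This is a small sketch using the counts reported above. In scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 183, 5
fn, tp = 4, 8

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because 188 of the 200 test rows are negative, the true negatives dominate the accuracy, while only 8 of the 12 positives are found.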

3. Adjusting for class imbalance

1) 1:1 sampling

positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=33).index.tolist()
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=33).index.tolist()
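The same idea can be tried on a small hypothetical frame (the `toy` data below is invented). Instead of hard-coding the sample size, this sketch derives it from the number of positive rows, assuming the positive class is the smaller one.

```python
import pandas as pd

# Hypothetical toy data: 3 positive rows and 7 negative rows.
toy = pd.DataFrame({"Speed": range(10), "Legendary": [1, 1, 1] + [0] * 7})

# Sample as many rows from each class as there are positives (assumed minority class).
n = int((toy["Legendary"] == 1).sum())
pos_idx = toy[toy["Legendary"] == 1].sample(n, random_state=33).index.tolist()
neg_idx = toy[toy["Legendary"] == 0].sample(n, random_state=33).index.tolist()

balanced = toy.loc[pos_idx + neg_idx]
print(balanced["Legendary"].value_counts())
```

Undersampling discards most of the negative rows, so the resulting model is trained and evaluated on far less data than before.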

2) Splitting the dataset

random_idx = positive_random_idx + negative_random_idx

x = preprocessed_df.loc[random_idx, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary'][random_idx]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

x_train.shape  # (97, 26)
x_test.shape  # (33, 26)

3) Retraining the model

lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

4) Re-evaluating the model

accuracy_score(y_test, y_pred)  # 0.9090909090909091
precision_score(y_test, y_pred)  # 0.8461538461538461
recall_score(y_test, y_pred)  # 0.9166666666666666
f1_score(y_test, y_pred)  # 0.8799999999999999

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
confmat  # [[19  2]
         #  [ 1 11]]

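4. Interpreting the training results

1) Computing the R2 score and RMSE

# The figures in this section appear to come from a linear regression with target CMEDV,
# not from the Legendary classifier above. lr is assumed to be that fitted regressor,
# for which score() returns R2.
lr.score(x_train, y_train)  # 0.7490284664199387
lr.score(x_test, y_test)  # 0.700934213532155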

from sklearn.metrics import mean_squared_error
from math import sqrt

y_predictions = lr.predict(x_train)
sqrt(mean_squared_error(y_train, y_predictions))  # 4.672162734008587

y_predictions = lr.predict(x_test)
sqrt(mean_squared_error(y_test, y_predictions))  # 4.61495178491331
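For reference, both quantities can be computed by hand. The sketch below uses hypothetical arrays (`y_true` and `y_hat` are invented values, not the data from this post) and follows the usual definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the units of the target, while R2 is unitless, so the two are read together rather than compared with each other.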

2) Testing feature significance (statsmodels)

import statsmodels.api as sm

x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()

- The R-squared is 0.749, which is fairly high.

- At the 0.05 significance level, the significant variables are CRIM, ZN, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO, B, and LSTAT.

- INDUS and AGE are not significant. In other words, their effect on CMEDV cannot be considered statistically significant.

- Regression equation: CMEDV = 22.4800 - 0.9555*CRIM + 1.1869*ZN + ... + 0.7977*B - 4.1738*LSTAT

3) Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
vif["feature"] = x_train.columns
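To show what the VIF measures, here is a minimal NumPy sketch on synthetic data (the helper `vif` and the columns `a`, `b`, `c` are hypothetical, not the statsmodels implementation). It regresses each column on the others plus an intercept and returns 1 / (1 - R2), so a column that is almost a copy of another gets a large value.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic features: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print(vifs)
```

A common rule of thumb treats values above about 10 as a sign of strong multicollinearity. In this sketch, `a` and `c` exceed that while `b` stays close to 1.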
