[Oracle] Machine Learning - US Medical Expenses Data

eunki 2022. 6. 12. 20:33

1. Create the table

drop table insurance;

create table insurance
( age       number(3),
  sex        varchar2(10),
  bmi        number(10,2),
  children  number(2),
  smoker    varchar2(10),
  region    varchar2(20), 
  expenses  number(10,2) );

-- 1338 rows expected once the data has been loaded into the table
select count(*) from insurance;

Add an id column that numbers the rows sequentially, starting from 1

create table insurance2
as
select rownum as id, i.*
    from insurance i;

drop table insurance;

rename insurance2 to insurance;

2. Split the data into training data and test data

drop table insurance_training; 

create table insurance_training
as
   select *
     from insurance
     where id < 1114;

drop table insurance_test;

create table insurance_test
as
   select *
     from insurance
     where id >= 1114;

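Training data: the data the model learns from (the textbook it studies)

Test data: the data used to evaluate the trained model (the exam questions)

Of the 1338 rows in total, the 1113 rows with id < 1114 (about 83%) form the training data, and the remaining 225 rows (about 17%) form the test data.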

-- 1113     
select count(*) from insurance_training;  
   
-- 225
select count(*) from insurance_test;
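
The split sizes can also be checked in a single query. This is a minimal sketch that assumes both tables were created exactly as shown; the column aliases (train_rows, test_rows, train_pct) are illustrative names, not part of the original script.

```sql
-- Row counts of both tables and the share of rows used for training
select train_rows,
       test_rows,
       round(100 * train_rows / (train_rows + test_rows), 1) as train_pct
  from ( select (select count(*) from insurance_training) as train_rows,
                (select count(*) from insurance_test)     as test_rows
           from dual );
```

With 1113 training rows and 225 test rows, train_pct comes out at 83.2.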

3. Create the settings table for the machine learning model

drop table settings_reg2;

create table settings_reg2
as
select *
  from table (dbms_data_mining.get_default_settings)
  where setting_name like '%GLM%';

The settings table stores the configuration used to train the model, such as which algorithm to use.

begin
-- Set the training algorithm to the generalized linear model (GLM), which performs the regression.
insert into settings_reg2
 values (dbms_data_mining.algo_name,'ALGO_GENERALIZED_LINEAR_MODEL');
 
-- Turn on automatic data preparation so the input data is prepared for the algorithm automatically.
insert into settings_reg2
 values (dbms_data_mining.prep_auto, 'ON');

commit;
end;
/
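
Before building the model, it can help to confirm what the settings table now contains. The query below is a small sketch that only reads the table created above; it should list the two inserted rows (ALGO_NAME and PREP_AUTO, the values of the two package constants) alongside the GLM defaults copied in by the create table statement.

```sql
-- List every setting that will be passed to create_model
select setting_name, setting_value
  from settings_reg2
 order by setting_name;
```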

4. Create the machine learning model

begin
  dbms_data_mining.drop_model('MD_REG_MODEL2');
end;
/
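
On a first run the model does not exist yet, and drop_model raises an error in that case. One way to make the step repeatable is to drop the model only when it is present. This is a sketch under the assumption that the model belongs to the current schema, so that it appears in user_mining_models; v_count is a hypothetical variable name.

```sql
declare
  v_count number;  -- hypothetical helper variable
begin
  -- Check whether the model already exists in the current schema
  select count(*)
    into v_count
    from user_mining_models
   where model_name = 'MD_REG_MODEL2';

  -- Drop it only if it is there, so the block succeeds on a first run too
  if v_count > 0 then
    dbms_data_mining.drop_model('MD_REG_MODEL2');
  end if;
end;
/
```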

begin 
   dbms_data_mining.create_model(
      model_name            => 'MD_REG_MODEL2', -- model name
      mining_function       => dbms_data_mining.regression, -- mining function (regression)
      data_table_name       => 'INSURANCE_TRAINING', -- training data table
      case_id_column_name   => 'ID', -- column that identifies each row (case)
      target_column_name    => 'EXPENSES', -- medical expenses (the value to predict)
      settings_table_name   => 'SETTINGS_REG2'); -- settings table
end;
/

5. Check the created machine learning model

select model_name,
          algorithm,
          mining_function
  from all_mining_models
  where model_name = 'MD_REG_MODEL2';

6. Check the model's settings

select setting_name, setting_value
  from all_mining_model_settings
  where model_name = 'MD_REG_MODEL2';

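The confidence level is left at its default value of 0.95.

The confidence level is the level used for the confidence intervals of the estimated coefficients; a level of 0.95 corresponds to a significance level of 1 - 0.95 = 0.05.

It is fixed up front so that there is a clear threshold for deciding which columns of the training table affect medical expenses.

To see which columns affect medical expenses, look at each column's p-value:

a p-value below the significance level of 0.05 means the column can be treated as statistically significant.

Null hypothesis: age has no effect on medical expenses.

Alternative hypothesis: age has an effect on medical expenses.

If the p-value for age is below the significance level of 0.05, age can be said to have an effect on medical expenses.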

7. Check the regression coefficients of the model

select attribute_name, attribute_value, round(coefficient)
  from table (dbms_data_mining.get_model_details_glm ('MD_REG_MODEL2'));

ATTRIBUTE_NAME	ATTRIBUTE_VALUE	ROUND(COEFFICIENT)
AGE		259
BMI		341
CHILDREN		389
REGION	northeast	1108
REGION	northwest	631
REGION	southwest	-156
SEX	female	173
SMOKER	yes	23689

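[ Interpreting the results ]

Each coefficient is the estimated change in annual medical expenses with the other columns held constant.

Each additional year of age adds $259 to annual medical expenses on average.
Each one-point increase in BMI (body mass index) adds $341 to annual medical expenses.
Each additional child (CHILDREN, the number of dependents) adds $389 to annual medical expenses.
People living in the northeast pay $1108 more per year than people living in the southeast.
People living in the northwest pay $631 more per year than people living in the southeast.
People living in the southwest pay $156 less per year than people living in the southeast.
Women pay $173 more per year than men.
Smokers pay $23689 more per year than non-smokers.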
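
The reading of the SMOKER coefficient can be checked against the model itself by scoring two made-up people who differ only in smoking status. This is a sketch: the attribute values (age 40, BMI 30, no children, male, southeast) are hypothetical and chosen only for illustration, and the alias smoker_effect is an assumed name.

```sql
-- Predicted expenses for a smoker minus the same person as a non-smoker
select round(
         prediction(MD_REG_MODEL2 using 40 as age, 'male' as sex, 30 as bmi,
                    0 as children, 'yes' as smoker, 'southeast' as region)
       - prediction(MD_REG_MODEL2 using 40 as age, 'male' as sex, 30 as bmi,
                    0 as children, 'no' as smoker, 'southeast' as region)
       ) as smoker_effect
  from dual;
```

Because the model is linear in its inputs, the difference should be close to the SMOKER coefficient listed above, whatever values are chosen for the other attributes.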

8. Check the predicted values

select id, age, sex, expenses, 
          round(prediction (MD_REG_MODEL2 using *),2) model_predict_response
  from insurance_test t;

expenses : actual medical expenses (the correct answer)

round(prediction (MD_REG_MODEL2 using *),2) : predicted medical expenses (the model's prediction)

select corr(expenses, model_predict_response)
    from ( select id, age, sex, expenses, 
                      round(prediction (MD_REG_MODEL2 using *),2) model_predict_response
              from insurance_test t );

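corr : Oracle function that returns the Pearson correlation coefficient, used here to measure how strongly the actual and predicted medical expenses in the test data are correlated

Correlation coefficient = 0.86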
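
Correlation shows whether predictions move together with the actual values, but not how large the errors are in dollars. As a complement, the sketch below computes two common error measures on the test data, the root mean squared error (RMSE) and the mean absolute error (MAE). These metrics are an addition for illustration and are not part of the original analysis; the aliases pred, rmse and mae are assumed names.

```sql
-- Typical size of the prediction error on the test data, in dollars
select round(sqrt(avg(power(expenses - pred, 2))), 2) as rmse,
       round(avg(abs(expenses - pred)), 2)            as mae
  from ( select expenses,
                prediction(MD_REG_MODEL2 using *) as pred
           from insurance_test );
```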

9. Check the coefficient of determination (R-squared)

select global_detail_name, round(global_detail_value,3)
  from
  table(dbms_data_mining.get_model_details_global(model_name =>'MD_REG_MODEL2'))
  where  global_detail_name in ('R_SQ','ADJUSTED_R_SQUARE');

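Coefficient of determination: how well the regression model explains the data it was trained on (its explanatory power)

(It ranges from 0 to 1; the closer it is to 1, the greater the explanatory power.)

R_SQ	0.75	coefficient of determination
ADJUSTED_R_SQUARE	0.749	adjusted coefficient of determination

Coefficient of determination = 0.75

As a rough rule of thumb, a model is usually called good only when its coefficient of determination exceeds 0.90, although what counts as acceptable depends on the problem.

Getting there requires preparing data that is better suited for learning.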
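
The R_SQ value above describes the fit on the training data. The same statistic can be computed on the test data as 1 - SSE / SST, where SSE is the sum of squared prediction errors and SST is the total sum of squares of the actual values. The query below is a sketch of that calculation, not something the original post ran; pred and test_r_sq are assumed alias names.

```sql
-- R-squared on the test data: 1 - SSE / SST
-- SST is written as count(*) * var_pop(expenses), i.e. sum((y - avg(y))^2)
select round(1 - sum(power(expenses - pred, 2))
                 / (count(*) * var_pop(expenses)), 3) as test_r_sq
  from ( select expenses,
                prediction(MD_REG_MODEL2 using *) as pred
           from insurance_test );
```

A test-data value well below the training-data value would suggest the model fits the training rows better than it generalizes.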

10. Hypothesis testing

select attribute_name, attribute_value, round(coefficient), p_value
  from table (dbms_data_mining.get_model_details_glm ('MD_REG_MODEL2'));

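Null hypothesis: age does not affect medical expenses.

Alternative hypothesis: age affects medical expenses.

The p_value for age is below 0.05, so there is sufficient evidence to reject the null hypothesis.

Therefore age can be considered to affect medical expenses.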
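
The same decision rule can be applied to every attribute at once by letting the query flag which p-values fall below 0.05. This is a minimal sketch built on the query above; the alias at_5_percent and its labels are illustrative.

```sql
-- Flag each coefficient as significant or not at the 0.05 level
select attribute_name,
       attribute_value,
       round(coefficient) as coefficient,
       round(p_value, 4)  as p_value,
       case when p_value < 0.05 then 'significant'
            else 'not significant'
       end as at_5_percent
  from table (dbms_data_mining.get_model_details_glm ('MD_REG_MODEL2'))
 order by p_value;
```

The row with an empty attribute_name is the intercept.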

11. Add a column that is 1 if bmi is 30 or higher and 0 otherwise

alter table insurance
    add bmi30 number(10);

merge into insurance i
using ( select id, case when bmi >= 30 then 1
                   else 0 end as r1
            from insurance ) s
on (i.id = s.id)
when matched then
update set i.bmi30 = s.r1;

commit;

12. Add a column that is 1 for smokers whose bmi30 is 1 and 0 otherwise

alter table insurance
    add smoker_bmi number(10);

merge into insurance i
using ( select id, case when smoker = 'yes' and bmi30 = 1 then 1
                    else 0 end as r1 
            from insurance ) s
on (i.id = s.id)
when matched then
update set i.smoker_bmi = s.r1;

commit;
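
A quick way to confirm that both flags were filled in as intended is to group by them. The sketch below also shows the average expenses per group, which gives a first impression of why the combined flag might help; the aliases n and avg_expenses are assumed names.

```sql
-- One row per smoker / bmi30 combination, with its size and average expenses
select smoker,
       bmi30,
       smoker_bmi,
       count(*)             as n,
       round(avg(expenses)) as avg_expenses
  from insurance
 group by smoker, bmi30, smoker_bmi
 order by smoker, bmi30;
```

smoker_bmi should be 1 only in the row where smoker is 'yes' and bmi30 is 1.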

13. Retrain the model and check the results
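
The post does not show the statements used for retraining. The sketch below is one plausible way to do it, assuming the same id cut-off, model name and settings table as before. The training and test tables are rebuilt first, because they were copied from insurance before bmi30 and smoker_bmi were added.

```sql
-- Rebuild the split so that both tables contain the new columns
drop table insurance_training;
create table insurance_training as
   select * from insurance where id < 1114;

drop table insurance_test;
create table insurance_test as
   select * from insurance where id >= 1114;

-- Re-create the model on the new training data
begin
  dbms_data_mining.drop_model('MD_REG_MODEL2');
end;
/

begin
  dbms_data_mining.create_model(
     model_name            => 'MD_REG_MODEL2',
     mining_function       => dbms_data_mining.regression,
     data_table_name       => 'INSURANCE_TRAINING',
     case_id_column_name   => 'ID',
     target_column_name    => 'EXPENSES',
     settings_table_name   => 'SETTINGS_REG2');
end;
/

-- Coefficient of determination on the training data
select global_detail_name, round(global_detail_value,3)
  from table(dbms_data_mining.get_model_details_global(model_name =>'MD_REG_MODEL2'))
 where global_detail_name in ('R_SQ','ADJUSTED_R_SQUARE');

-- Correlation between actual and predicted expenses on the test data
select round(corr(expenses, model_predict_response), 2)
  from ( select expenses,
                prediction (MD_REG_MODEL2 using *) model_predict_response
           from insurance_test );
```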

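After the training and test tables are rebuilt from the updated insurance table (so that they include bmi30 and smoker_bmi) and the model is re-created, the correlation coefficient rises from 0.86 to 0.93 and the coefficient of determination rises from 0.75 to 0.86.

Null hypothesis: smoking by obese people does not affect medical expenses.

Alternative hypothesis: smoking by obese people affects medical expenses.

The p-value of smoker_bmi, which indicates whether a person is both obese and a smoker, is reported as 0, so there is sufficient evidence to reject the null hypothesis.

Therefore smoking by obese people can be considered to affect medical expenses.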
