[Big Data Analysis Engineer Practical Exam] Association Rule Analysis (Association Rule)

eunki 2022. 6. 20. 00:55

Association Rule Analysis (Association Rule, Apriori Algorithm)

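Association rule analysis is a technique for discovering relationships of the form "if X, then Y" in large volumes of transaction data.

It is an algorithm that generates a set of rules describing which item sets frequently occur together.

It is also commonly called Market Basket Analysis.

To run association rule analysis, the data must be in transaction format: one record per transaction, listing the items it contains.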

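Support : the proportion of all transactions that contain both itemset X and itemset Y

number of transactions containing both X and Y / total number of transactions = n(X∩Y) / N

Confidence : among the transactions that contain itemset X, the proportion that also contain itemset Y

number of transactions containing both X and Y / number of transactions containing X = n(X∩Y) / n(X)

Lift : confidence / P(Y), where P(Y) = n(Y) / N

A lift greater than 1 means X and Y appear together more often than they would if they were independent, so the rule is useful. A lift of 1 means X and Y are independent, and a lift below 1 means they appear together less often than expected.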
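The three measures above are easy to check by hand on a tiny example. The sketch below uses five made-up transactions (they are not from Market_Basket.csv), and the helper names `count` and `rule_metrics` are hypothetical, chosen only for this illustration.

```python
# Toy transactions invented for illustration (not from Market_Basket.csv).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
    {"butter"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(x, y, transactions):
    """Support, confidence, and lift for the rule X -> Y."""
    n = len(transactions)
    n_xy = count(x | y, transactions)            # n(X∩Y)
    support = n_xy / n                           # n(X∩Y) / N
    confidence = n_xy / count(x, transactions)   # n(X∩Y) / n(X)
    lift = confidence / (count(y, transactions) / n)  # confidence / P(Y)
    return support, confidence, lift


support, confidence, lift = rule_metrics({"milk"}, {"bread"}, transactions)
print(support, confidence, lift)
```

For the rule milk -> bread, two of the five baskets contain both items, so support is 0.4, confidence is 2/3, and lift is about 1.11, slightly above the independence value of 1.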

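[Key hyperparameters]

- min_support : minimum support; only results at or above this threshold are returned

- min_confidence : minimum confidence; only results at or above this threshold are returned

- min_lift : minimum lift; only results at or above this threshold are returned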
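To see how the three thresholds work together, here is a minimal sketch that filters a handful of candidate rules. The rules and their metric values are invented, and `filter_rules` is a hypothetical helper, not part of any library. A rule is kept only if it clears all three minimums.

```python
# Hypothetical candidate rules with made-up metric values, for illustration only.
candidates = [
    {"rule": "A -> B", "support": 0.020, "confidence": 0.35, "lift": 1.8},
    {"rule": "A -> C", "support": 0.010, "confidence": 0.50, "lift": 2.1},  # support too low
    {"rule": "B -> C", "support": 0.030, "confidence": 0.15, "lift": 1.2},  # confidence too low
    {"rule": "C -> D", "support": 0.040, "confidence": 0.25, "lift": 0.9},  # lift too low
]


def filter_rules(rules, min_support, min_confidence, min_lift):
    """Keep only the rules that meet every threshold."""
    return [
        r for r in rules
        if r["support"] >= min_support
        and r["confidence"] >= min_confidence
        and r["lift"] >= min_lift
    ]


kept = filter_rules(candidates, min_support=0.015, min_confidence=0.2, min_lift=1)
print([r["rule"] for r in kept])
```

Only "A -> B" survives here. Raising any threshold shrinks the result set, and lowering it grows the set.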

1. Preparing the data

import numpy as np 
import pandas as pd 

data = pd.read_csv('Market_Basket.csv', header = None)
data.head()

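# Each CSV row is one basket; shorter rows are padded with NaN on the right.
transactions = []
n_items = data.notnull().sum(axis=1)  # number of items in each row

for i in range(data.shape[0]):
    # Keep the first n_items[i] cells of row i as strings
    transactions.append([str(data[j][i]) for j in range(n_items[i])])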

To convert the data into transaction form, create an empty list named transactions and loop once per case (row).

Each iteration appends the items from that row to transactions.
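The same conversion can be sketched with only the standard library, which makes it clearer what "transaction format" means. The in-memory text below is a stand-in for a headerless basket file and its contents are invented. This is an alternative to the pandas loop above, not the method used in this post.

```python
import csv
import io

# Stand-in for a headerless basket file; the contents are invented for illustration.
raw = io.StringIO(
    "milk,bread,butter\n"
    "bread,,\n"
    "milk,eggs,\n"
)

# Keep every non-empty cell of each row as one transaction.
transactions = [
    [cell.strip() for cell in row if cell.strip()]
    for row in csv.reader(raw)
]
print(transactions)
```

Each inner list is one basket with the empty padding cells dropped, which is the shape the model step expects.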

2. Applying the model

from apyori import apriori

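# Note: as far as I can tell, apyori reads min_support, min_confidence, min_lift
# and max_length; min_length appears to be ignored, so single-item sets can still
# show up in the results.
rules = apriori(transactions, min_support = 0.015, min_confidence = 0.2,  
                min_lift = 1, min_length = 1)
results = list(rules)  # apriori returns a generator, so materialize it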

df=pd.DataFrame(results)
df
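Each result row bundles an itemset with one or more directed rules, so it can help to flatten them into one row per rule. The sketch below assumes the records expose `items`, `support`, and `ordered_statistics`, with `items_base`, `items_add`, `confidence`, and `lift` on each statistic. That matches apyori's record layout as I understand it, but it is worth confirming by inspecting `results[0]`. The namedtuples are stand-ins so the sketch runs without apyori, and `flatten_rules` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-ins that mirror the assumed field names of apyori's result records,
# so this runs without apyori installed. With the real library, pass
# list(apriori(...)) to flatten_rules instead.
OrderedStatistic = namedtuple(
    "OrderedStatistic", "items_base items_add confidence lift"
)
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")


def flatten_rules(records):
    """Return one dict per directed rule (antecedent -> consequent)."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": sorted(stat.items_base),
                "consequent": sorted(stat.items_add),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows


# Made-up record for illustration.
sample = [
    RelationRecord(
        items=frozenset({"milk", "bread"}),
        support=0.4,
        ordered_statistics=[
            OrderedStatistic(frozenset({"milk"}), frozenset({"bread"}), 2 / 3, 10 / 9),
        ],
    )
]
rows = flatten_rules(sample)
print(rows)
```

The flattened rows can be passed to `pd.DataFrame(rows)` and sorted by lift or confidence.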

3. Visualizing associated items

ar=(df.iloc[1:74]['items'])
ar

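Of the 78 rules, only the rows at positions 1 through 73 are kept and drawn as a graph (iloc[1:74] excludes the end position, so the slice holds 73 rows).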
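Slicing by position works only if every selected row holds exactly two items, because the edge list built next needs a FROM and a TO for each row. A minimal sketch of selecting by itemset size instead is shown here. The itemsets are made up, and this is an alternative to the positional slice, not what the post does.

```python
# Made-up itemsets standing in for the 'items' column of the results.
itemsets = [
    frozenset({"mineral water"}),          # single item: cannot form an edge
    frozenset({"eggs", "mineral water"}),
    frozenset({"milk", "spaghetti"}),
    frozenset({"a", "b", "c"}),            # three items: not a single edge
]

# Keep only two-item sets and turn each into a sorted (FROM, TO) pair.
edges = [tuple(sorted(s)) for s in itemsets if len(s) == 2]
print(edges)
```

The resulting list of pairs can be passed to `pd.DataFrame(edges, columns=['FROM', 'TO'])` in the same way as `list(ar)`.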

import matplotlib.pyplot as plt
from matplotlib import font_manager
import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout

df = pd.DataFrame(list(ar), columns=['FROM', 'TO'])
G = nx.from_pandas_edgelist(df, source = 'FROM', target = 'TO')

# Korean font setup (Windows path; adjust for other systems)
ko_font_location = "C:/Windows/Fonts/malgun.ttf"
ko_font_name = font_manager.FontProperties(fname=ko_font_location).get_name()

# Visualize the item associations
plt.figure(figsize=(10,10)) 
nx.draw_kamada_kawai(G)
pos=nx.kamada_kawai_layout(G)

nx.draw_networkx_labels(G, pos, font_family=ko_font_name, font_size=10, font_color='black')
nx.draw_networkx_nodes(G, pos, node_color='orange', node_size=2000, alpha=1)

plt.show()
