Posts in the 'Data Analysis/Practice' category

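HDBSCAN algorithm

[HDBSCAN characteristics]
- Combines DBSCAN with hierarchical clustering
- Well suited to data with different sizes, densities, noise, and arbitrary shapes
- Produces a clustering that reflects hierarchical structure

1. Generate data with varied distributions and sizes

from sklearn.datasets import make_blobs, make_moons
# make_moons: generates moon-shaped clusters
# make_blobs: generates round clusters
# the centers option sets the cluster centers
# the cluster_std option sets the spread of each cluster
moons, _ = make_moons(n_samples=100, noise=0.05)
blobs1, _ = make_blobs(n_samples=50,..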

Data Analysis/Practice 2022. 1. 20. 16:56

The Iris Dataset (3) - DBSCAN

DBSCAN algorithm

[DBSCAN advantages]
- Unlike K-means, the number of clusters (k) does not have to be specified up front
- Clusters are formed by density, so it also works on distributions with non-spherical geometric shapes
- Can identify outliers

1. Generate non-spherical data

import pandas as pd
from sklearn.datasets import make_moons
# generate non-spherical data to run DBSCAN on
moon_data, moon_labels = make_moons(n_samples=400, noise=0.1, random_state=42)
moon_data[:5]
# convert the array data to a DataFrame
moon_data_df = pd.DataFrame(moon_data, ..

Data Analysis/Practice 2022. 1. 19. 21:04
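
To make the three DBSCAN advantages listed above concrete, the following is a from-scratch sketch of the density idea in plain Python. It is not scikit-learn's `DBSCAN` class and not the code from the post; the function names, the toy points, and the `eps` / `min_samples` values are all assumptions chosen for illustration.

```python
from math import dist

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it is in no dense region."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Toy data (made up): two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
labels = dbscan(points, eps=1.0, min_samples=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

No cluster count is passed in: the two groups emerge from `eps` and `min_samples` alone, and the isolated point is reported as an outlier with label -1.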

The Iris Dataset (2) - Agglomerative

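Clustering: Agglomerative algorithm (hierarchical clustering)

[Hierarchical clustering advantages]
- Can be used without fixing the number of clusters (k) in advance
- Does not start from random points, so it gives the same result on every run
- The dendrogram shows the overall cluster structure (nested clusters)

[Hierarchical clustering disadvantages]
- Computationally expensive, so inefficient on large datasets

1. Train the Agglomerative model

[Notes on AgglomerativeClustering parameters]
- linkage options: {'ward', 'complete', 'average', 'single'}
- If linkage="ward", affinity must be "euclidean" (the affinity parameter was renamed to metric in scikit-learn 1.2)
- d..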

Data Analysis/Practice 2022. 1. 19. 21:03
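
The points above about determinism, nested clusters, and cost can be seen in a small from-scratch sketch. This is a naive single-linkage implementation in plain Python, written for illustration only; it is not scikit-learn's `AgglomerativeClustering`, and the function names and toy points are assumptions.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Distance between clusters a and b: the closest pair of members."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (cluster_a, cluster_b, distance): the data behind a dendrogram
    while len(clusters) > n_clusters:
        # Scan every pair of clusters; this is what makes large data expensive.
        d, x, y = min(
            (single_linkage(clusters[x], clusters[y], points), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Toy data (made up): two nearby pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (10.0, 10.0)]
clusters, merges = agglomerative(points, n_clusters=2)

for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
print(sorted(sorted(c) for c in clusters))  # → [[0, 1, 2, 3], [4]]
```

There is no random initialization anywhere, so repeated runs give identical merges, and the `merges` list records the nested structure that a dendrogram would draw.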

The Iris Dataset (1) - K-Means

Data fields
- sepal length (cm): length of the sepal
- sepal width (cm): width of the sepal
- petal length (cm): length of the petal
- petal width (cm): width of the petal
- target: species (0: Setosa, 1: Versicolor, 2: Virginica)

Preparing the dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the iris dataset
iris = load_iris()
# convert the array data to a DataFrame
iris_df = pd.DataFrame(data=iris..

Data Analysis/Practice 2022. 1. 18. 20:16
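
The K-Means excerpt above stops during dataset preparation. As a rough sketch of where such a walkthrough could go next, the block below builds the DataFrame and fits K-Means with three clusters. The column handling, `n_init`, and `random_state` are assumptions for illustration and are not taken from the post.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the iris data and wrap it in a DataFrame with named feature columns.
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

# Unlike DBSCAN, K-Means needs the number of clusters up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
iris_df["cluster"] = kmeans.fit_predict(iris_df[iris.feature_names])

# Cross-tabulate true species against assigned cluster ids.
# Cluster ids are arbitrary, so compare the grouping, not the numbers.
print(pd.crosstab(iris_df["target"], iris_df["cluster"]))
```

The cross-tabulation is only a sanity check: K-Means never sees the `target` column, so the cluster ids will not generally line up with the species codes.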

COVID-19 data from John Hopkins University

https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university
COVID-19 data from John Hopkins University: Updated daily at 6am UTC in both raw and convenient form

Data fields
- Country/Region: country
- Province/State: province or state
- Lat: latitude of the region
- Long: longitude of the region
- Date columns: number of confirmed cases/deaths on each date

Preparing the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_case = pd..

Data Analysis/Practice 2022. 1. 6. 06:02

Video Game Sales with Ratings

https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
Video Game Sales with Ratings: Video game sales from Vgchartz and corresponding ratings from Metacritic

Data fields
- Name: name of the game
- Platform: console the game runs on
- Year_of_Release: year of release
- Genre: genre of the game
- Publisher: publisher of the game
- NA_Sales: North American sales (millions)
- EU_Sales: European Union sales (millions)
- JP_Sales: Japanese sales (millions)
- Other_Sales: other sales (Africa, Japan..

Data Analysis/Practice 2022. 1. 6. 04:00

World Happiness Report up to 2020

https://www.kaggle.com/mathurinache/world-happiness-report
World Happiness Report up to 2020: Bliss scored agreeing to financial, social, etc.

Data fields
- Country: country
- Region: region the country belongs to
- Happiness Rank: rank by happiness index
- Happiness Score: happiness index score
- GDP per capita: GDP per capita
- Healthy Life Expectancy: healthy life expectancy
- Social support: social support
- Freedom to make life choices: freedom to make life choices
- Generosity: generosity
- Corruption Perceptio..

Data Analysis/Practice 2022. 1. 5. 22:08

New York City Airbnb

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
New York City Airbnb Open Data: Airbnb listings and metrics in NYC, NY, USA (2019)

Data fields
- id: listing ID
- name: listing name (title)
- host_id: host ID
- host_name: name of the host
- neighbourhood_group: group of neighbourhoods the room is in
- neighbourhood: neighbourhood the room is in
- latitude: latitude of the room
- longitude: longitude of the room
- room_type: type of room
- price: price ($)
- minimum_nights: minimum number of nights
- number_of_re..

Data Analysis/Practice 2022. 1. 4. 05:03

라일락 꽃이 피는 날

List: Data Analysis/Practice (14)
