[TIL] Data Analysis Devcourse Day 64 - Data Mining / Data Warehouse / Time Series Analysis


상급닌자연습생 2024. 5. 19. 23:55

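Data Mining

: The process of exploring the relationships, patterns, and rules that exist within large volumes of varied data, and extracting useful knowledge from them

Why data mining matters

Better decision-making
Greater efficiency
Customer understanding
Market trend forecasting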

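The data mining process

Step 1. Data collection and integration

: Collecting the various data needed to solve the target problem

Data integration

Data of the same kind needs to be brought into a consistent format
e.g. stripping the DOM structure left over from crawling
e.g. resizing image data

Data quality management

Validation and cleansing (fix or remove errors and duplicates)
Completeness checks (find missing data, then handle, remove, or impute it)
Monitoring (monitor quality continuously and version each update)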
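The quality checks above can be sketched with pandas. This is a minimal illustration on a made-up table (the `item` and `price` columns are hypothetical), and median imputation is only one of several possible ways to handle missing values.

```python
import pandas as pd

# Hypothetical raw records: one exact duplicate row and one missing price.
raw = pd.DataFrame({
    "item": ["apple", "apple", "pear", "plum"],
    "price": [1.0, 1.0, None, 3.0],
})

# Validation and cleansing: drop exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Completeness check: count missing values per column.
missing_per_column = deduped.isna().sum()

# One possible handling strategy: impute missing prices with the median.
cleaned = deduped.fillna({"price": deduped["price"].median()})

print(missing_per_column)
print(cleaned)
```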

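Step 2. Data preprocessing

: The initial stage of shaping the data to fit the model and analysis method

Removing noise and errors

Identify outliers caused by noise (e.g. with the IQR rule or the output of an outlier-detection algorithm)
Abnormal conditions during collection can also produce erroneous records
Remove or correct the outliers and erroneous records that were identified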
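As an illustration of the IQR rule mentioned above, here is a minimal sketch. The function name `iqr_outlier_mask`, the conventional multiplier `k=1.5`, and the sample readings are assumptions made for the example, not part of the course material.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical sensor readings; 95 stands in for a collection glitch.
readings = [10, 12, 11, 13, 12, 11, 95]
mask = iqr_outlier_mask(readings)

# Here the flagged values are removed; correcting them is the other option.
cleaned = [r for r, is_outlier in zip(readings, mask) if not is_outlier]
print(cleaned)
```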

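Data normalization

The process of putting data on a common scale
Applied across different datasets so that they are comparable
Applied within a single dataset as well, for consistency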
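Two common ways to put columns on a common scale are min-max scaling and z-score standardization. The sketch below uses made-up height and income values, and it assumes the column is not constant (otherwise the denominators would be zero).

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly to the range [0, 1]. Assumes max > min."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Shift to mean 0 and scale to standard deviation 1. Assumes std > 0."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical columns on very different scales.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20_000, 40_000, 60_000, 80_000, 100_000]

print(min_max_scale(heights_cm))
print(min_max_scale(incomes))
print(z_score(incomes))
```

After scaling, both columns occupy the same range, so neither dominates a distance-based method simply because of its units.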

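Step 3. Applying data mining techniques

: Applying an analysis methodology suited to the collected data so that meaningful patterns, relationships, and insights can be derived

Check precedents that analyzed similar data
Commonly used techniques: classification, clustering, prediction, surfacing latent meaning

Step 4. Analyzing the data mining results

: Drawing insights from the data and using them in processes such as decision-making

If there is a model evaluation step,
- stay skeptical about whether the evaluation metric actually helps the decision
- check that the data used for evaluation is meaningful
If there is no evaluation step and human intuition or judgment is required,
- check that the source data is free of biases such as idiosyncrasies of the sample
- check that the intuition itself carries no risk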

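[Problem] Put yourself in the position of a data manager at a large company.

There are many departments, and some of them generate data
Any department may need access to other departments' data
Building a separate data access interface for each such case → considerable cost, and a tangled "spider web" of connections

↓

[Solution] Place a 'data warehouse' between the departments that produce data and the departments that consume it, and control the flow of data there;

the data moving around inside the company can then be managed efficiently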
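A rough count shows why the spider web gets expensive. The sketch assumes the worst case in which every consuming department needs data from every producing department; the department counts are made up.

```python
def point_to_point_interfaces(producers, consumers):
    """One dedicated interface per (producer, consumer) pair."""
    return producers * consumers

def hub_interfaces(producers, consumers):
    """Each department connects once to the central warehouse."""
    return producers + consumers

# Hypothetical company: 5 departments produce data, 8 departments consume it.
direct = point_to_point_interfaces(5, 8)
via_warehouse = hub_interfaces(5, 8)
print(direct, via_warehouse)
```

The direct approach grows with the product of the two counts, while the warehouse approach grows with their sum.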

↓

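Data Warehouse

: A warehouse where data is gathered

Stores all the data an organization has collected, including historical data
With advances in technology and changing business needs, it has grown to process and store both structured and unstructured data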

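Data warehouse (DW) vs. database (DB)

	Data warehouse (DW)	Database (DB)
Purpose	- Integrating, analyzing, and reporting on large-scale data - Includes all historical data	- Real-time data processing and transaction management - Stores and manages the current data needed by day-to-day operations and applications - Aims for fast reads and writes
Creation/management	- Built by periodically gathering data from DBs	- Created and managed where the data is consumed or produced
Users with access	- Restricted to specific groups of users within the organization	- Many users can insert and update concurrently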

Structure of a data warehouse

1. ETL

E : Extract data from the source systems,
T : Transform it into the shape in which it will be stored, and
L : Load it into the central data store of the data warehouse
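The three ETL steps can be sketched end to end with the standard library. An in-memory SQLite database stands in for the central store here, and the `orders` table and its rows are hypothetical; a real warehouse would use a dedicated system.

```python
import sqlite3

# Extract: hypothetical rows as they might arrive from a source system (all strings).
source_rows = [
    {"order_id": "1", "amount": "12.50", "ordered_at": "2024-05-01"},
    {"order_id": "2", "amount": "7.25", "ordered_at": "2024-05-02"},
]

def transform(row):
    """Cast string fields to the types used by the warehouse table."""
    return (int(row["order_id"]), float(row["amount"]), row["ordered_at"])

# Load: insert the transformed rows into the stand-in central store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, ordered_at TEXT)"
)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [transform(r) for r in source_rows],
)
warehouse.commit()

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```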

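2. Central data store

The store where ETL-processed data accumulates

3. Metadata

Additional information produced as data accumulates
e.g. where the source data lives, and the size and layout of the central data store

4. Access

Supports users' interaction with the data store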

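Data Mart

: A small set of data that the data warehouse provides on request, for people who want to look at the historical data in the warehouse and draw insights from it on a regular basis

Similar to moving goods from a warehouse to a store (mart) for consumers
Unlike the DB a department uses day to day, it includes historical data and is meant for analysis and reporting
Read-only: users read from it and do not write
Not mandatory, but used to gain insights through business analysis within the organization
Department-centric and subject-centric
- A data mart is designed around a specific department or subject
- It is not always available → it is built only when a department requests one for a given subject
Highly focused data
- Contains only the relevant data that the user group needs
Efficient operation and user-friendliness
- The needed data can be queried and analyzed with simple queries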
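One lightweight way to picture a data mart is a department-specific, read-only view over the warehouse. The sketch below uses an in-memory SQLite database with a hypothetical `sales` table; real data marts are often separate stores, so treat the view as an analogy rather than the standard implementation.

```python
import sqlite3

# Stand-in warehouse holding historical data for every department.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (dept TEXT, year INTEGER, revenue REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("marketing", 2022, 100.0),
        ("marketing", 2023, 120.0),
        ("logistics", 2022, 80.0),
        ("logistics", 2023, 90.0),
    ],
)

# A "mart" for the marketing department: only its rows, exposed as a view to read from.
warehouse.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT year, revenue FROM sales WHERE dept = 'marketing'"
)

# The department can now answer its questions with a simple query.
mart_rows = warehouse.execute(
    "SELECT year, revenue FROM marketing_mart ORDER BY year"
).fetchall()
print(mart_rows)
```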

Data mining practice

Dataset

Global demographic trend data (WPP2022 Demographic Indicators)

🔗 Practice link : https://www.kaggle.com/datasets/abmsayem/wpp2022-demographic-indicators

Source: United Nations, Department of Economic and Social Affairs, Population Division

67 columns in total

Step 1. Load the data

import pandas as pd

file_path = 'WPP2022_Demographic_Indicators.csv'
data = pd.read_csv(file_path)

data.head()

# The years start at 1950;
# data after 2020 will not be used
data['Time']

Step 2. Select the target data

# The analysis is based on TPopulation1Jan
data = data[data['TPopulation1Jan'].notnull()]

# Check the kinds of location types
set(data['LocTypeName'].tolist())

all_countries = set(data[data['LocTypeName']=='Country/Area']['Location'].tolist())

# Keep only country-level rows
country_data = data[data['LocTypeName']=='Country/Area'].copy()

Step 3. Add latitude and longitude

[Method 1] Look coordinates up through an API

Sending too many requests can get your IP banned (rate-limited)
It takes a very long time

# Method 1) Looking coordinates up through an API is slow and may hit rate limits!

from geopy.geocoders import Nominatim
import time

# Create a geolocator object to use geopy's Nominatim.
# Nominatim's usage policy requires a user_agent that identifies your own application;
# replace this sample value with one of your own.
geolocator = Nominatim(user_agent="geoapiExercises")

# Dictionary that stores latitude and longitude per country
location_coordinates = {}

# Set of locations already looked up, to avoid duplicate requests
queried_locations = set()

# Look up latitude and longitude for each country
for location in all_countries:
    # Skip locations that were already looked up
    if location in queried_locations:
        continue

    try:
        # Get latitude and longitude for the location
        loc = geolocator.geocode(location)
        if loc:
            location_coordinates[location] = (loc.latitude, loc.longitude)
            queried_locations.add(location)
    except Exception as e:
        print(f"Error occurred for location: {location}, error: {e}")

    # Wait between requests to stay within the service's rate limit
    time.sleep(1)

location_coordinates
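Because the API route is slow and rate-limited, it helps to save the results once and reuse them. The helper below is a hypothetical sketch (`load_or_build_coordinates` is not part of geopy): it calls a geocoding function only for names missing from a JSON cache file. A fake geocoder with illustrative coordinates stands in for the real service so the example runs offline.

```python
import json
import os
import tempfile

def load_or_build_coordinates(names, geocode, cache_path):
    """Return {name: [lat, lon]}, calling geocode only for names not yet cached."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)
    for name in names:
        if name not in cache:
            result = geocode(name)  # expected to return (lat, lon) or None
            if result is not None:
                cache[name] = list(result)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False)
    return cache

# Stand-in for a real geocoder; the coordinates are rough, illustrative values.
calls = []
def fake_geocode(name):
    calls.append(name)
    return {"France": (46.0, 2.0), "Japan": (36.0, 138.0)}.get(name)

cache_path = os.path.join(tempfile.mkdtemp(), "location_coordinates.json")
first = load_or_build_coordinates(["France", "Japan"], fake_geocode, cache_path)
second = load_or_build_coordinates(["France", "Japan"], fake_geocode, cache_path)
print(second, calls)
```

The second call reads everything from the file, so the geocoder is not contacted again.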

[Method 2] Load latitude/longitude data prepared in advance

# Method 2) Load latitude/longitude data prepared in advance

# The downloaded file must be uploaded first!
import json

with open('location_coordinates.json', 'r') as f:
    location_coordinates = json.load(f)

# Add latitude and longitude to the DataFrame
country_data['latitude'] = country_data['Location'].apply(lambda x: location_coordinates[x][0]
                                                          if x in location_coordinates else None)
country_data['longitude'] = country_data['Location'].apply(lambda x: location_coordinates[x][1]
                                                           if x in location_coordinates else None)

Step 4. Show population figures on a map (circles)

1) Use the added latitude/longitude + create the map

country_data_2020 = country_data[(country_data['Time'] == 2020) &
                                    country_data['TPopulation1Jan'].notnull() &
                                    ## conditions on the newly added columns
                                    country_data['latitude'].notnull() & ## latitude
                                    country_data['longitude'].notnull()] ## longitude

`folium.Map` : displays an interactive world map.

# Create a basic empty map
import folium
m = folium.Map(location=[20, 0], zoom_start=2) # location : initial center of the map (latitude, longitude)
                                               # zoom_start : initial zoom level
m

2) Draw circles on the map

m = folium.Map(location=[20, 0], zoom_start=2)

for idx, row in country_data_2020.iterrows():
    # Make the circle size proportional to population
    radius = row['TPopulation1July'] # unit: thousands of people (folium.Circle reads radius in meters)
    # Text shown when a circle is clicked (country & population)
    popup_message = f"{row['Location']}: {row['TPopulation1July']}"

    # Add the circle to the map
    folium.Circle(
        location=[row['latitude'], row['longitude']],
        radius=radius,
        color='blue',
        fill=True,
        fill_color='blue',
        popup=popup_message  # show the text in a popup
    ).add_to(m)

# Show the map
m

3) Use the square root to keep circle sizes reasonable

import numpy as np

m = folium.Map(location=[20, 0], zoom_start=2)

# Show population as circles on the map (square-root scale)
for idx, row in country_data_2020.iterrows():
    # Scale the circle radius by the square root of the population
    radius = np.sqrt(row['TPopulation1July']) * 1000

    popup_message = f"{row['Location']}: {row['TPopulation1July']}"

    folium.Circle(
        location=[row['latitude'], row['longitude']],
        radius=radius,
        color='blue',
        fill=True,
        fill_color='blue',
        popup=popup_message
    ).add_to(m)

m
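A short calculation shows what the square-root scaling buys. With the radius proportional to the square root of the population, the circle's area (not its radius) becomes proportional to the population, which keeps large countries from swamping the map. The helper name and the two population values below are made up for illustration.

```python
import math

def sqrt_radius(population_thousands, scale=1000):
    """Radius in meters under the square-root scaling used above."""
    return math.sqrt(population_thousands) * scale

# Hypothetical populations, in thousands of people.
small, large = 10_000, 1_000_000

ratio_linear = large / small                            # radius ratio if radius = population
ratio_sqrt = sqrt_radius(large) / sqrt_radius(small)    # radius ratio with sqrt scaling
area_ratio = ratio_sqrt ** 2                            # area grows with the radius squared

print(ratio_linear, ratio_sqrt, area_ratio)
```

A 100-fold difference in population becomes a 10-fold difference in radius and a 100-fold difference in area.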

Step 5. Show population figures on a map (heatmap)

import folium
from folium.plugins import HeatMap

# Create the base map
m = folium.Map(location=[20, 0], zoom_start=2)

# Prepare the heatmap data (latitude, longitude, and population as the weight)
heat_data = [[row['latitude'], row['longitude'], row['TPopulation1July']] for idx, row in country_data_2020.iterrows()]

# Add the heatmap
HeatMap(heat_data,
        min_opacity=0.5,
        radius=30,
        blur=25).add_to(m)

m

import folium
from folium.plugins import HeatMap

m = folium.Map(location=[20, 0], zoom_start=2)

heat_data = [[row['latitude'], row['longitude'], row['TPopulation1July']] for idx, row in country_data_2020.iterrows()]

# Define a gradient to control the heatmap colors
gradient = {0.1: 'blue',
            0.5: 'green',
            1.0: 'red'}

HeatMap(heat_data,
        min_opacity=0.5,
        radius=30,
        blur=25,
        gradient=gradient).add_to(m)

m

Step 6. Population figures over time (ipywidgets)

Widgets can return a value as the result of user interaction
As the user moves the slider, the data for the selected year is loaded and drawn on the map

import ipywidgets as widgets
from IPython.display import display, clear_output

# Create an integer slider (IntSlider) widget for choosing the year
year_slider = widgets.IntSlider(
    value=2020, # initial value
    min=1950, # minimum
    max=2022, # maximum
    step=1,
    description='Year:', # we are browsing by year

    # Do not fire updates while the value is still changing; update only once it settles.
    # The value is updated when the mouse button is released.
    # With True, the value keeps updating while the slider is being dragged.
    continuous_update=False
)

# Output widget that holds the map
map_output = widgets.Output()

data_clean = country_data.dropna(subset=['latitude', 'longitude'])

# Function that redraws the map
def update_map(year):
    with map_output:
        # Clear the output area and show a new map
        clear_output(wait=True)
        # Create the map
        m = folium.Map(location=[20, 0], zoom_start=2)
        # Filter the data for the selected year
        data_year = data_clean[data_clean['Time'] == year]

        # Add the data to the map
        for idx, row in data_year.iterrows():
            radius = np.sqrt(row['TPopulation1July']) * 1000
            folium.Circle(
                location=[row['latitude'], row['longitude']],
                radius=radius,
                color='blue',
                fill=True,
                fill_color='blue',
                popup=f"{row['Location']}: {row['TPopulation1July']}"
            ).add_to(m)

        # Show the map
        display(m)

# on_year_change : calls update_map, which redraws the map
# It passes along the new value (new) from the change event
def on_year_change(change):
    update_map(change['new'])

# Observe changes on year_slider (specifically, changes to the value attribute)
# and run on_year_change whenever one occurs
year_slider.observe(on_year_change, names='value')

# Show the slider widget and the initial map
display(year_slider)
update_map(2020)
display(map_output)

Step 7. Population change by year (bar plot)

A chart that shows differences in quantity intuitively
The bar plot is drawn inside a widget
Uses the `seaborn` package : adds improvements on top of basic `Matplotlib` functionality, with an easier API and many styles and themes

import seaborn as sns
import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Set up the widgets
year_slider = widgets.IntSlider(
    value=2020,
    min=1950,
    max=2022,
    step=1,
    description='Year:',
    continuous_update=False
)

data_clean = country_data.dropna(subset=['latitude', 'longitude'])

# Output widget that holds the bar chart
output = widgets.Output()

# Function that draws the bar chart
def plot_top10_population(year):
    with output:
        clear_output(wait=True)
        # Filter the data for the selected year
        data_year = data_clean[data_clean['Time'] == year]

        # Take the 10 countries with the largest population
        top10_data = data_year.nlargest(10, 'TPopulation1July')

        # Draw the bar chart
        plt.figure(figsize=(10, 6))
        sns.barplot(x='TPopulation1July', y='Location', data=top10_data)
        plt.title(f'Top 10 Countries by Population in {year}')
        plt.xlabel('Population (in thousands)')
        plt.ylabel('Country')
        plt.show()

# Update the bar chart whenever the slider value changes
def on_year_change(change):
    plot_top10_population(change['new'])

year_slider.observe(on_year_change, names='value')

# Show the slider widget and the initial bar chart
display(year_slider)
plot_top10_population(2020)
display(output)

Let's give each country its own color so the changes are easier to follow.

import seaborn as sns
import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

year_slider = widgets.IntSlider(
    value=2020,
    min=1950,
    max=2022,
    step=1,
    description='Year:',
    continuous_update=False
)

data_clean = country_data.dropna(subset=['latitude', 'longitude'])

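# Extract the unique countries in the whole dataset
unique_countries = data_clean['Location'].unique()

# Build a color mapping for the unique countries
# "husl" yields as many distinct hues as requested; "tab20" has only 20 colors and would repeat
colors = sns.color_palette("husl", len(unique_countries))
country_colors = dict(zip(unique_countries, colors))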

output = widgets.Output()

def plot_top10_population(year):
    with output:
        clear_output(wait=True)
        data_year = data_clean[data_clean['Time'] == year]
        top10_data = data_year.nlargest(10, 'TPopulation1July')

        # Assign colors to the top 10 countries
        top10_colors = {country: country_colors[country] for country in top10_data['Location']}

        # Draw the bar chart
        plt.figure(figsize=(10, 6))
        sns.barplot(x='TPopulation1July', y='Location', data=top10_data, hue='Location', legend=False,
                    palette=[top10_colors[country] for country in top10_data['Location']])
        plt.title(f'Top 10 Countries by Population in {year}')
        plt.xlabel('Population (in thousands)')
        plt.ylabel('Country')
        plt.show()


def on_year_change(change):
    plot_top10_population(change['new'])

year_slider.observe(on_year_change, names='value')

display(year_slider)
plot_top10_population(2020)
display(output)

Traffic signal time series practice

Dataset

Traffic data collected in the Twin Cities metro area of Minnesota, USA

🔗 Practice link : https://www.kaggle.com/datasets/boltzmannbrain/nab

Numenta Anomaly Benchmark (NAB): dataset and scoring for detecting anomalies in streaming data

Only the data in the realTraffic folder is used here

Step 1. Load the data

`display()` : renders several objects as DataFrame tables within a single cell.

import pandas as pd
from IPython.display import display

# Load the data
TravelTime_387 = pd.read_csv("TravelTime_387.csv")
TravelTime_451 = pd.read_csv("TravelTime_451.csv")
occupancy_6005 = pd.read_csv("occupancy_6005.csv")
occupancy_t4013 = pd.read_csv("occupancy_t4013.csv")
speed_6005 = pd.read_csv("speed_6005.csv")
speed_7578 = pd.read_csv("speed_7578.csv")
speed_t4013 = pd.read_csv("speed_t4013.csv")

print('TravelTime_387')
display(TravelTime_387.head())
print('\nTravelTime_451')
display(TravelTime_451.head())
print('\noccupancy_6005')
display(occupancy_6005.head())
print('\noccupancy_t4013')
display(occupancy_t4013.head())
print('\nspeed_6005')
display(speed_6005.head())
print('\nspeed_7578')
display(speed_7578.head())
print('\nspeed_t4013')
display(speed_t4013.head())

► Even sensors in the same area can start recording at different timestamps.

For easier time arithmetic, let's convert the timestamp column to the `datetime` type.

# Convert the timestamp column to datetime
print('Before: ' ,occupancy_6005['timestamp'].dtype) # object : can be read as str
display(occupancy_6005.head())

occupancy_6005['timestamp'] = pd.to_datetime(occupancy_6005['timestamp'])

print('\nAfter: ' ,occupancy_6005['timestamp'].dtype) # datetime64[ns] : datetime type
display(occupancy_6005.head())

Step 2. Visualizing time series data (matplotlib)

First, convert the timestamp columns of the remaining DataFrames to datetime as well.

# Convert to datetime
TravelTime_387['timestamp'] = pd.to_datetime(TravelTime_387['timestamp'])
TravelTime_451['timestamp'] = pd.to_datetime(TravelTime_451['timestamp'])
occupancy_6005['timestamp'] = pd.to_datetime(occupancy_6005['timestamp'])
occupancy_t4013['timestamp'] = pd.to_datetime(occupancy_t4013['timestamp'])
speed_6005['timestamp'] = pd.to_datetime(speed_6005['timestamp'])
speed_7578['timestamp'] = pd.to_datetime(speed_7578['timestamp'])
speed_t4013['timestamp'] = pd.to_datetime(speed_t4013['timestamp'])

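`subplot` : lets you visualize several different datasets at once
Limitation of `matplotlib` : when the series spans a long time, its characteristics are hard to make out by eye in a static plot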

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 12))

plt.subplot(2, 2, 1)
plt.plot(occupancy_6005['timestamp'], occupancy_6005['value'], label='Occupancy 6005', color='blue')
plt.title('Time vs Occupancy (6005)')
plt.xlabel('Timestamp')
plt.ylabel('Occupancy')
plt.xticks(rotation=45)
plt.legend()

plt.subplot(2, 2, 2)
plt.plot(occupancy_t4013['timestamp'], occupancy_t4013['value'], label='Occupancy t4013', color='orange')
plt.title('Time vs Occupancy (t4013)')
plt.xlabel('Timestamp')
plt.ylabel('Occupancy')
plt.xticks(rotation=45)
plt.legend()


plt.subplot(2, 2, 3)
plt.plot(speed_6005['timestamp'], speed_6005['value'], label='Speed 6005', color='green')
plt.title('Time vs Speed (6005)')
plt.xlabel('Timestamp')
plt.ylabel('Speed')
plt.xticks(rotation=45)
plt.legend()

plt.subplot(2, 2, 4)
plt.plot(speed_t4013['timestamp'], speed_t4013['value'], label='Speed t4013', color='magenta')
plt.title('Time vs Speed (t4013)')
plt.xlabel('Timestamp')
plt.ylabel('Speed')
plt.xticks(rotation=45)
plt.legend()

plt.tight_layout()
plt.show()

Step 3. Visualizing time series data (plotly)

`plotly` : a graphing library available for Python, R, JS, and more

Provides features that make time series easy to explore
- zooming in/out and panning
- exploring relationships between multiple variables
- hovering to inspect values

import plotly.express as px

# Visualize with Plotly
fig = px.line(occupancy_6005, x='timestamp', y='value', title='Time vs Occupancy 6005')
fig.update_xaxes(rangeslider_visible=True) # add a range slider at the bottom for choosing the time window to view
fig.show()

`graph_objects` module

`add_trace()` : adds a trace (one plotted data series) to the figure

`yaxis`: assigns traces to different y-axis scales
- `y1` : draw the trace against the left y-axis
- `y2` : draw the trace against the right y-axis

import plotly.graph_objects as go
fig = go.Figure() # empty figure to which traces and layout are added

# Add the Occupancy 6005 data
# add_trace : adds one data series to the figure
fig.add_trace(go.Scatter(x=occupancy_6005['timestamp'], y=occupancy_6005['value'],
                         mode='lines', name='occupancy_6005',
                         yaxis='y1')) # draw against the left y-axis (y1 : left)

# Add the Speed 6005 data
fig.add_trace(go.Scatter(x=speed_6005['timestamp'], y=speed_6005['value'],
                         mode='lines', name='speed_6005',
                         yaxis='y2')) # draw against the right y-axis (y2 : right)

# Configure the layout
fig.update_layout(title='Occupancy and Speed Data in 6005 sensor',
                  xaxis_title='Timestamp',
                  legend_title='Data Type',
                  yaxis=dict(
                       title='Occupancy',
                  ),
                  yaxis2=dict(
                      title='Speed',
                      anchor="x", # share the x-axis used by y1
                      overlaying="y", # draw on top of the y1 plot
                      side="right", # show the y2 axis on the right
                  ),
                  xaxis_rangeslider_visible=True)

fig.show()

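Let's merge the two different datasets.

# View heterogeneous data on the same time domain
# Combine the two datasets
# pd.merge defaults to an inner join, so only timestamps present in both datasets are kept
combined_data = pd.merge(occupancy_6005, speed_6005,
                         on='timestamp', suffixes=('_occupancy', '_speed'))

# Filter speed_6005 to a 20-minute window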
filtered_speed_6005 = speed_6005[
    (speed_6005['timestamp'] >= '2015-09-01 13:40:00') &
    (speed_6005['timestamp'] < '2015-09-01 14:00:00')
]

print('occupancy_6005')
display(occupancy_6005.head())
print('\nspeed_6005')
display(speed_6005.head())
print('\nfiltered_speed_6005')
display(filtered_speed_6005.head())
print('\ncombined_data')
display(combined_data.head())

The merge keeps only the timestamps at which both datasets have data.
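The effect of the join type is easiest to see on a tiny example. The two frames below are made up (they only mimic the `timestamp` and `value` columns of the sensor files) and are merged once with the default inner join and once with `how="outer"` for comparison.

```python
import pandas as pd

# Hypothetical readings from two sensors whose time ranges only partly overlap.
occupancy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-09-01 13:45", "2015-09-01 13:50", "2015-09-01 13:55"]
    ),
    "value": [3.0, 4.0, 5.0],
})
speed = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-09-01 13:50", "2015-09-01 13:55", "2015-09-01 14:00"]
    ),
    "value": [60.0, 58.0, 61.0],
})

# Inner join (the default): only timestamps present in both frames survive.
inner = pd.merge(occupancy, speed, on="timestamp",
                 suffixes=("_occupancy", "_speed"))

# Outer join: every timestamp survives, with NaN where one side has no reading.
outer = pd.merge(occupancy, speed, on="timestamp", how="outer",
                 suffixes=("_occupancy", "_speed"))

print(inner)
print(outer)
```

If the unmatched timestamps matter for the analysis, an outer join followed by explicit missing-value handling is the alternative to consider.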

Step 4. Statistics visualization using periodicity (bar plot)

Because this data describes traffic flow, it has a cycle that repeats every day.

Extract the hour from the timestamps of the original data
Use the extracted hour to group together the records that share the same hour
Finally, average within each hour (0 to 23) to produce the daily-cycle data

# Add an hour column for grouping
occupancy_6005['hour'] = occupancy_6005['timestamp'].dt.hour
occupancy_6005.head()

# Compute the mean for each hour of the day
average_occupancy_by_hour = occupancy_6005.groupby('hour')['value'].mean().reset_index()

# Visualize with Plotly
fig = px.bar(average_occupancy_by_hour, x='hour', y='value',
             title='Average Occupancy by Hour of Day (6005)')
fig.update_layout(xaxis_title='Hour of Day', yaxis_title='Average Occupancy')
fig.show()

import plotly.graph_objects as go

# Compare the daily cycle of two different datasets
# using plotly bar plots

# Group by hour of day and compute the means
occupancy_6005['hour'] = occupancy_6005['timestamp'].dt.hour
speed_6005['hour'] = speed_6005['timestamp'].dt.hour

average_occupancy_by_hour = occupancy_6005.groupby('hour')['value'].mean().reset_index()
average_speed_by_hour = speed_6005.groupby('hour')['value'].mean().reset_index()

# Interactive visualization with Plotly
fig = go.Figure()

# Add the occupancy bars
fig.add_trace(go.Bar(x=average_occupancy_by_hour['hour'],
                     y=average_occupancy_by_hour['value'],
                     name='Average Occupancy'))

# Add the speed bars
fig.add_trace(go.Bar(x=average_speed_by_hour['hour'],
                     y=average_speed_by_hour['value'],
                     name='Average Speed'))

# Configure the layout
fig.update_layout(title='Average Occupancy and Speed by Hour of Day (6005)',
                  xaxis_title='Hour of Day',
                  yaxis_title='Average Value',
                  barmode='group')

fig.show()

Step 5. Statistics visualization using periodicity (heatmap)

Let's look at both the daily cycle (by hour) and the weekly cycle (by day of week).

# Derive the two kinds of cycle information from occupancy_6005
occupancy_6005['day_of_week'] = occupancy_6005['timestamp'].dt.dayofweek # 0 : Monday ~ 6 : Sunday
occupancy_6005['hour'] = occupancy_6005['timestamp'].dt.hour

# Mean occupancy per day of week and hour
occupancy_heatmap_data = occupancy_6005.groupby(['day_of_week', 'hour'])['value'].mean().unstack()
occupancy_heatmap_data

# Build a heatmap from the day_of_week and hour values

import numpy as np
import seaborn as sns

# Draw the heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(occupancy_heatmap_data, cmap="YlGnBu", annot=False)
plt.title('Heatmap of occupancy_6005')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
# Heatmap cells are centered at i + 0.5, so place the labels there
plt.yticks(np.arange(7) + 0.5, ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=0)
plt.show()

# Build a heatmap for speed_6005 as well
# and compare occupancy with speed

# Arrange the speed data by day of week and hour
speed_6005['day_of_week'] = speed_6005['timestamp'].dt.dayofweek
speed_6005['hour'] = speed_6005['timestamp'].dt.hour

# Mean speed per day of week and hour
speed_heatmap_data = speed_6005.groupby(['day_of_week', 'hour'])['value'].mean().unstack()

# Draw the heatmaps
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.heatmap(occupancy_heatmap_data, cmap="YlGnBu", annot=False)
plt.title('Heatmap of occupancy_6005')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.yticks(np.arange(7) + 0.5, ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=0)

plt.subplot(1, 2, 2)
sns.heatmap(speed_heatmap_data, cmap="YlGnBu", annot=False)
plt.title('Heatmap of speed_6005')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.yticks(np.arange(7) + 0.5, ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=0)

plt.tight_layout()
plt.show()

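Step 6. Seasonality analysis

Seasonal period

: A recurring time interval such as a week, month, quarter, season, or year

Components of time series data

1. Trend

: The general tendency seen across the data as a whole

Take a time window and use the mean value over that window
Computed continuously along the time axis by sliding the window (windowing)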
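The windowed mean can be sketched with a pandas rolling average. The series below is synthetic (a straight line plus a made-up repeating pattern of length 3). When the window equals the length of the repeating pattern, the pattern averages out and only the underlying line remains.

```python
import pandas as pd

period = 3
seasonal_pattern = [1.0, 0.0, -1.0]   # hypothetical repeating pattern, sums to 0

# Synthetic series: linear trend t plus the repeating pattern.
values = [t + seasonal_pattern[t % period] for t in range(9)]
series = pd.Series(values)

# Centered moving average: each point is the mean of itself and its neighbors.
trend = series.rolling(window=period, center=True).mean()

print(series.tolist())
print(trend.tolist())
```

The first and last points are NaN because a full centered window does not fit at the edges.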

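2. Seasonal

: The periodic component present in the data

Variation in the data caused by some cycle
Uses the mean value at each position within the user-supplied period

3. Residual

: The remaining variation that Trend and Seasonal cannot account for

Residual = original data - (Trend + Seasonal)
If the values are too large:
- the user-supplied period may be badly off
- the data may be too complex for a seasonal decomposition
- the data may be heavily affected by noise or outliers

► Additive seasonal decomposition
Original data = Trend + Seasonal + Residual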
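To make the additive model concrete, the sketch below decomposes a synthetic series by hand in the spirit of classical decomposition: a centered moving average for the trend, per-position means for the seasonal part, and whatever is left as the residual. The data and the period of 4 are made up, and this is not necessarily identical to what a library routine does internally.

```python
import pandas as pd

period = 4
pattern = [2.0, -1.0, 0.5, -1.5]   # hypothetical seasonal pattern, sums to 0

# Synthetic, noise-free series: linear trend plus the repeating pattern.
observed = pd.Series([0.5 * t + pattern[t % period] for t in range(24)])

# Trend: the period is even, so average two adjacent period-length windows
# and shift the result back so that it is centered on each time point.
trend = observed.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonal: mean of the detrended values at each position within the period.
detrended = observed - trend
position = observed.index.to_numpy() % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=observed.index)

# Residual: what neither the trend nor the seasonal component explains.
residual = observed - trend - seasonal

print(seasonal_means.tolist())
print(residual.dropna().abs().max())
```

Because the synthetic series contains no noise, the residual is essentially zero wherever the trend is defined, and the recovered seasonal means match the pattern that was put in.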

※ For seasonal decomposition, the datetime values must be set as the index

# Preprocessing
# The package used here expects the timestamps as the index
print('Original data')
display(occupancy_6005.head())

occupancy_6005.set_index('timestamp', inplace=True)
print('Data with timestamp as index')
display(occupancy_6005.head())

※ Seasonal decomposition cannot be run if the data has missing values.

# Resample to daily means and handle missing values
# Collect each day's records and use their mean
# If a day has no data and becomes NaN, fill it by forward fill
# ffill : forward fill. Replaces NaN with the most recent valid value before it
# (.ffill() replaces fillna(method='ffill'), which newer pandas versions deprecate)

occupancy_6005_resample = occupancy_6005['value'].resample('D').mean().ffill()

df = pd.DataFrame({'A': [1, 2, None, 4, 5, None, 7]})

# Forward-fill the missing values
fill_df = df.ffill()

print('Test data')
display(df)
print('\nfill NaN')
display(fill_df)

📌 window : a specific interval

`period` : sets the length of the seasonal cycle, which also determines the window used for the trend

# Import the required library
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
result = seasonal_decompose(occupancy_6005_resample, model='additive', period=4) # period : user-specified cycle, here 4 days

# Plot the decomposition
fig_decompose = result.plot()
fig_decompose.set_size_inches(14, 10)
plt.show()

► If a candidate period leaves residuals that are small and show no remaining pattern, that period is a good candidate for the cycle that describes the data.

# Resample to hourly means and handle missing values
occupancy_6005_resample_hourly = occupancy_6005['value'].resample('H').mean().ffill()

# Decompose the time series (assuming a cycle measured in hours)
result_hourly = seasonal_decompose(occupancy_6005_resample_hourly, model='additive', period=24) # use 24 hours as the period

# Plot the decomposition
fig_decompose_hourly = result_hourly.plot()
fig_decompose_hourly.set_size_inches(14, 10)
plt.show()

Step 7. Visualizing the seasonal decomposition (plotly)

# Import the required library
from plotly.subplots import make_subplots

# Interactive visualization with Plotly: one subplot per component
fig = make_subplots(rows=4, cols=1, subplot_titles=('Original', 'Trend', 'Seasonality', 'Residual'))

# Add the original data subplot
fig.add_trace(go.Scatter(x=result_hourly.observed.index,
                         y=result_hourly.observed,
                         mode='lines', name='Original'), row=1, col=1)

# Add the trend component subplot
fig.add_trace(go.Scatter(x=result_hourly.trend.index,
                         y=result_hourly.trend,
                         mode='lines', name='Trend'), row=2, col=1)

# Add the seasonal component subplot
fig.add_trace(go.Scatter(x=result_hourly.seasonal.index,
                         y=result_hourly.seasonal,
                         mode='lines', name='Seasonality'), row=3, col=1)

# Add the residual component subplot
fig.add_trace(go.Scatter(x=result_hourly.resid.index,
                         y=result_hourly.resid,
                         mode='lines', name='Residual'), row=4, col=1)

# Update the layout
fig.update_layout(height=600,
                  width=800,
                  title_text="Seasonal Decompose using Plotly one by one")

# Show the figure
fig.show()

# Import the required library
import plotly.graph_objects as go

# Interactive visualization with Plotly: all components in one figure
fig = go.Figure()

# Add the original data
fig.add_trace(go.Scatter(x=result_hourly.observed.index,
                         y=result_hourly.observed,
                         mode='lines', name='Original'))

# Add the trend component
fig.add_trace(go.Scatter(x=result_hourly.trend.index,
                         y=result_hourly.trend,
                         mode='lines', name='Trend'))

# Add the seasonal component
fig.add_trace(go.Scatter(x=result_hourly.seasonal.index,
                         y=result_hourly.seasonal,
                         mode='lines', name='Seasonality'))

# Add the residual component
fig.add_trace(go.Scatter(x=result_hourly.resid.index,
                         y=result_hourly.resid,
                         mode='lines', name='Residual'))

# Update the layout
fig.update_layout(height=600,
                  width=800,
                  title_text="Seasonal Decompose using Plotly in one palette",
                  xaxis_rangeslider_visible=True)

# Show the figure
fig.show()