[Python] Learning Machine Learning with a US Economic Dataset <Linear Regression>

Data

I used the "economics" dataset from R's ggplot2 library.

https://ggplot2.tidyverse.org/reference/economics.html

US economic time series — economics

This dataset was produced from US economic time series data available from https://fred.stlouisfed.org/. economics is in "wide" format, economics_long is in "long" format.


Arrays

import numpy as np
array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[1], [2], [3]])
print(array1)
print(array2)
print(array1[[0,1]])

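👉 Array : a NumPy ndarray, a container that can be 1-D, 2-D, or higher. You can iterate over it, but strictly speaking it is an iterable, not an iterator.

👉 array1 : one outer bracket holding two inner brackets. Each inner bracket is one row.

➖ The last line passes a list of row indices inside the indexing brackets, which is why there are two pairs of brackets. array1[[0,1]] returns rows 0 and 1. A single row only needs array1[0].

👉 array2 : each of the 3 rows holds a single number, so there is one column and the shape is (3, 1). (array1 has 2 rows and 3 columns.)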
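To make the row and column picture concrete, here is a minimal sketch that checks the shapes of the two arrays above and compares single-row indexing with the list-style indexing from the last line. The names `first_row` and `both_rows` are only for illustration.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6]])
array2 = np.array([[1], [2], [3]])

# shape is (rows, columns)
print(array1.shape)  # (2, 3)
print(array2.shape)  # (3, 1)

# A single integer index returns one row as a 1-D array
first_row = array1[0]
print(first_row.shape)  # (3,)

# A list of indices returns the selected rows as a 2-D array
both_rows = array1[[0, 1]]
print(both_rows.shape)  # (2, 3)
```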

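Splitting the Data and Training

# df is assumed to be the economics data already loaded as a pandas DataFrame
x = df.loc[:,'pce'].values.reshape(-1,1)      # feature: pce
y = df.loc[:,'psavert'].values.reshape(-1,1)  # target: psavert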

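👉 loc selects the column / the values attribute converts it to a NumPy array (scikit-learn expects the features as a 2-D array-like)

👉 reshape : -1 tells NumPy to work out the number of rows automatically, 1 is the number of columns

➖ values on its own returns a 1-D array, so add reshape to turn it into a single column!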
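Here is a small sketch of what reshape(-1, 1) changes. `toy_df` is a hypothetical stand-in for the economics data, and its numbers are made up.

```python
import pandas as pd

# Hypothetical stand-in for the economics data (values are made up)
toy_df = pd.DataFrame({
    "pce": [500.0, 510.0, 520.0, 530.0],
    "psavert": [12.0, 11.5, 11.8, 11.0],
})

flat = toy_df.loc[:, "pce"].values  # 1-D array
column = flat.reshape(-1, 1)        # 2-D array with a single column

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)
```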

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

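👉 train_test_split : splits the data into a training set and a test set

➖ It returns each piece as an array, so they can go straight into a model.

➖ test_size : the share of the data held out for testing (0.2 means 20%)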
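A quick sketch of the split on made-up data, to show the sizes that test_size = 0.2 produces. The random_state argument is not used in the post. I add it here only so the shuffle is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, one feature, y is always 2 * x
x_demo = np.arange(10).reshape(-1, 1)
y_demo = (2 * np.arange(10)).reshape(-1, 1)

# random_state fixes the shuffle so the split is reproducible
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

print(xd_train.shape, xd_test.shape)  # (8, 1) (2, 1)
```

The rows of x and y are shuffled together, so each x value keeps its matching y value after the split.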

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
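reg.fit(x_train, y_train)  # fit on the training split only, so the test split stays unseen

👉 Create the linear regression model, then feed it the training arrays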

y_pred = reg.predict(x_train)
y_pred[0]

👉 Feed x (the independent variable) into the model and you get back y_pred, the array of values the model predicts, rather than the original y.
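The fit-then-predict flow can be sketched on made-up points that lie exactly on y = 3x + 1, where the predictions should reproduce y. The names `x_toy`, `y_toy`, and `toy_reg` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 3x + 1
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 3 * x_toy + 1

toy_reg = LinearRegression()
toy_reg.fit(x_toy, y_toy)

# Predictions come back with the same shape as the y passed to fit
toy_pred = toy_reg.predict(x_toy)
print(toy_pred.shape)  # (4, 1)
```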

Visualizing the Linear Regression Model

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,2,figsize = (11,3))
fig.suptitle('economics overview')
ax[0].scatter(x_train,y_train, s = 5, alpha = 0.3, c = 'blue', label = 'train')
ax[0].scatter(x_test,y_test, s = 5, alpha = 0.3, c = 'red', label = 'test')
ax[0].plot(x_train,y_pred, c = 'green')
ax[0].set_xlabel('pce(billions)')
ax[0].set_ylabel('saving rate')
ax[0].legend()

ax[1].plot(df.index, df['unemploy'], c = 'red', lw = 0.7, label = 'unemploy')
ax[1].plot(df.index, df['pce'], c = 'blue', lw = 0.7, label = 'pce')
ax[1].set_xticks([])
ax[1].set_xlabel('time from 1967-07-01 to 2015-04-01')
ax[1].legend()

plt.show()

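👉 subplots creates several plots in one figure

⛔ Caution : when subplots has only one row, ax is a 1-D array, so index it as ax[n] rather than ax[n,n].

Left : relationship between pce (personal consumption expenditures) and psavert (personal savings rate)
- Spending and the savings rate move in opposite directions, as you would expect, and the downward-sloping regression line agrees.
Right : pce and unemploy (number of unemployed) over time
- Spending can only rise as the US economy grows. Inflation plays a part too.
- Unemployment is countercyclical: it rises in downturns and falls in expansions.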
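A short sketch of the indexing difference. The Agg backend is my own addition so the snippet runs without opening a window. It is not part of the post's setup.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig_row, ax_row = plt.subplots(1, 2)    # one row: 1-D array of Axes
fig_grid, ax_grid = plt.subplots(2, 2)  # grid: 2-D array of Axes

print(ax_row.shape)   # (2,)
print(ax_grid.shape)  # (2, 2)

ax_row[0].set_title("left")             # one index is enough
ax_grid[1, 0].set_title("bottom-left")  # row index, then column index

plt.close("all")
```

If you would rather always index with two numbers, plt.subplots also accepts squeeze=False, which keeps ax 2-D even when there is only one row.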

Evaluation

print(reg.coef_)
print(reg.intercept_)

👉 reg.coef_ : slope / reg.intercept_ : intercept of the fitted line
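To see how the two numbers relate to predict, this sketch rebuilds the predictions by hand as slope * x + intercept. The data points are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points scattered around a straight line
x_line = np.array([[1.0], [2.0], [3.0], [4.0]])
y_line = np.array([[2.1], [3.9], [6.2], [7.8]])

line_reg = LinearRegression().fit(x_line, y_line)

# With a 2-D y, coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = line_reg.coef_[0, 0]
intercept = line_reg.intercept_[0]

# The fitted line written out by hand
manual = slope * x_line + intercept
print(np.allclose(manual, line_reg.predict(x_line)))  # True
```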

reg.score(x_test,y_test)

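👉 reg.score : the coefficient of determination (R²) on the data you pass in. The best possible value is 1.0, and it can go negative when the model does worse than always predicting the mean.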
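As a rough check of what score returns, this sketch computes R² by hand as 1 - SS_res / SS_tot on made-up data and compares it with reg.score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with a roughly linear trend
x_r2 = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_r2 = np.array([[1.2], [1.9], [3.2], [3.8], [5.1]])

r2_reg = LinearRegression().fit(x_r2, y_r2)
pred = r2_reg.predict(x_r2)

ss_res = np.sum((y_r2 - pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_r2 - y_r2.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_reg.score(x_r2, y_r2)))  # True
```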
