Linear Regression Example-5

Linear Regression Multi Variable: A House Price Estimation Example

The previous example, Example-1, covered Linear Regression with a single variable. This example uses several variables.

The value of a house or an apartment does not depend on floor area alone. It also depends on the number of bedrooms, the age of the building, and similar factors. This example takes those factors into account and estimates the house price using regression with multiple variables.

House prices tend to behave as follows: for the same floor area, a newer house can fetch a better price, and a larger floor area or more bedrooms means a higher price. Because the price rises or falls steadily with these factors, Linear Regression is a reasonable method to use.

Floor area, number of bedrooms, and age are called independent variables in mathematical terms and features in machine learning terms. Anything chosen as a feature should be an independent variable.
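The multi-variable model described above predicts the price as an intercept plus a weighted sum of the features. The following is a minimal sketch of that calculation; the intercept and coefficient values are made up for illustration and are not fitted to the dataset.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted values).
intercept = 50_000.0
coefficients = np.array([150.0, 10_000.0, -500.0])  # per sqft, per bedroom, per year of age

# One house: 1,180 sqft of floor area, 3 bedrooms, 60 years old.
features = np.array([1180.0, 3.0, 60.0])

# Linear regression prediction: intercept plus the weighted sum of the features.
predicted_price = intercept + features @ coefficients
print(predicted_price)
```

Training a model means finding the intercept and coefficients that make such predictions match the known prices as closely as possible.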

Importing the required libraries

The matplotlib and seaborn libraries are imported for data visualization, along with the Python packages scipy, NumPy, and pandas.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

import warnings
warnings.filterwarnings('ignore')

Loading the Data

The data is read from the kc_house_data.csv file with pd.read_csv.

dataset = pd.read_csv('kc_house_data.csv')
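Right after loading, it can help to confirm that no values are missing and that the columns have the expected types. The sketch below is one possible check, not part of the original notebook; it reads three rows copied from the dataset preview from an in-memory string so that it runs without the CSV file, and the names `csv_text` and `sample` are illustrative.

```python
import io
import pandas as pd

# Three rows copied from the dataset preview, with only a few of the columns,
# stand in for kc_house_data.csv so the sketch runs on its own.
csv_text = """id,date,price,bedrooms,sqft_living
7129300520,20141013T000000,221900.0,3,1180
6414100192,20141209T000000,538000.0,3,2570
5631500400,20150225T000000,180000.0,2,770
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column and show the inferred column types.
print(sample.isnull().sum())
print(sample.dtypes)
missing_total = int(sample.isnull().sum().sum())
```

On the real file, the same two calls on `dataset` would show whether any column needs cleaning before training.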

Inspect the loaded data.

dataset.shape

(21613, 21)

dataset.head()

dataset.describe() shows summary statistics for the numeric columns.

dataset.describe()

dataset.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

Plotting for data visualization

dataset.plot draws the plots. The first plot shows the number of bedrooms (Bedrooms) against the price (Price).

dataset.plot(x='bedrooms', y='price', style='o')
plt.title('Bedrooms vs Price')
plt.xlabel('Bedrooms Available')
plt.ylabel('Total Price')
plt.show()

The next plot shows SquareFeet-Living against the price.

dataset.plot(x='sqft_living', y='price', style='o')
plt.title('SquareFeet-Living vs Price')
plt.xlabel('SquareFeet-Living')
plt.ylabel('Total Price')
plt.show()

With sqft_living as x and price as y, sns.jointplot draws a scatter plot with a fitted regression line, and pearsonr computes the Pearson correlation coefficient (printed as Coefficient) and its p-value.

sns.jointplot(x='sqft_living', y='price', data=dataset, kind='reg')
x = dataset['sqft_living']
y = dataset['price']
r,p = pearsonr(x,y)
print('Coefficient: ', r)
print('p-value: ',p)

Coefficient:  0.7020350546118
p-value:  0.0

With sqft_lot as x and price as y, sns.jointplot again draws a scatter plot with a fitted regression line, and pearsonr computes the Pearson correlation coefficient and its p-value.

sns.jointplot(x='sqft_lot', y='price', data=dataset, kind='reg')
x = dataset['sqft_lot']
y = dataset['price']
r,p = pearsonr(x,y)
print('Coefficient: ', r)
print('p-value: ',p)

Coefficient:  0.08966086058710011
p-value:  7.972504510326147e-40
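The correlation check above can be repeated for every column at once with `DataFrame.corr`. The following is a minimal sketch on synthetic stand-in data, not the real dataset; the frame `demo` and its values are made up so that price depends strongly on living area and not on lot size.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in data (not the real dataset): price depends strongly on
# living area and not at all on lot size.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=200)
sqft_lot = rng.uniform(1000, 20000, size=200)
price = 200 * sqft_living + rng.normal(0, 50_000, size=200)
demo = pd.DataFrame({'sqft_living': sqft_living, 'sqft_lot': sqft_lot, 'price': price})

# Pearson correlation of every column with price, sorted from strongest to weakest.
correlations = demo.corr()['price'].drop('price').sort_values(ascending=False)
print(correlations)

# The same coefficient from scipy for one column, together with its p-value.
r, p = pearsonr(demo['sqft_living'], demo['price'])
print('r =', r, 'p-value =', p)
```

A coefficient near 1 or -1 indicates a strong linear relationship, and one near 0 a weak one. This matches the pattern in the real data, where sqft_living (about 0.70) is far more strongly related to price than sqft_lot (about 0.09).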

The next plot shows floors against the price (Price).

dataset.plot(x='floors', y='price', style='o')
plt.title('Floors vs Price')
plt.xlabel('Floors')
plt.ylabel('Total Price')
plt.show()

The next plot shows waterfront against the price (Price).

dataset.plot(x='waterfront', y='price', style='o')
plt.title('Waterfront vs Price')
plt.xlabel('Waterfront')
plt.ylabel('Total Price')
plt.show()

Preparing the Data (Train-Test Split)

dataset.columns lists the names of the columns in the dataset.

dataset.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

As noted earlier, floor area, number of bedrooms, age, and similar factors are independent variables in mathematical terms and features in machine learning terms, and anything chosen as a feature should be an independent variable. The independent variables are assigned to x; the code below selects 13 numeric columns for this. price is y.

x = dataset[['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement','sqft_living15', 'sqft_lot15']]
y = dataset['price']

train_test_split is imported from sklearn.model_selection. test_size=0.3 means that 30% of the data is set aside for testing and only the remaining 70% is used for training.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
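The effect of test_size and random_state is easier to see on a tiny array. The following is a minimal sketch with ten made-up rows; `X_demo` and `y_demo` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic rows with two feature columns (made-up values).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 holds out 30% of the rows; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# Running the split again with the same random_state gives the same test rows.
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

Fixing random_state keeps the evaluation numbers reproducible between runs, which matters when comparing different feature choices.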

Training

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
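After fitting, the learned intercept and per-feature weights are available as `intercept_` and `coef_`. The following is a minimal sketch on synthetic data generated from a known rule, so the recovered values can be checked; the data and the name `model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: y = 10 + 2*x1 - 3*x2 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 10 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

# intercept_ is the constant term; coef_ holds one weight per feature column.
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
```

For the house data, pairing `regressor.coef_` with the column names of x would show how much the predicted price changes per unit of each feature.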

Making the Prediction

y_pred = regressor.predict(X_test)

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head(10)
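The Actual/Predicted comparison can be extended with the error of each prediction. The following is a minimal sketch that uses the first three rows of this example's comparison output as input; the column names `Residual` and `AbsPctError` are illustrative.

```python
import pandas as pd

# First three rows of the Actual/Predicted output from this example.
comparison = pd.DataFrame(
    {'Actual': [297000.0, 1578000.0, 562100.0],
     'Predicted': [363757.7, 1417807.0, 378953.2]},
    index=[17384, 722, 2680],
)

# Residual = actual price minus predicted price; positive means the model under-predicted.
comparison['Residual'] = comparison['Actual'] - comparison['Predicted']
# Absolute error as a percentage of the actual price.
comparison['AbsPctError'] = 100 * comparison['Residual'].abs() / comparison['Actual']
print(comparison.round(1))
```

Looking at individual residuals shows which houses the model misses most, which summary metrics alone do not reveal.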

Evaluating the Algorithm

from sklearn import metrics

print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, y_pred))

print('Mean Squared Error: ', metrics.mean_squared_error(y_test, y_pred))

print('Root Mean Squared Error: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error:  149696.50669647913
Mean Squared Error:  55600589600.93881
Root Mean Squared Error:  235797.77268019054
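The three metrics above are simple averages of the prediction errors. The following is a minimal sketch that computes them by hand with NumPy on made-up numbers small enough to verify manually; it illustrates the definitions and does not reproduce the results above.

```python
import numpy as np

# Made-up actual and predicted prices, small enough to check by hand.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat            # -10, 10, -30
mae = np.mean(np.abs(errors))      # average size of the error
mse = np.mean(errors ** 2)         # average squared error
rmse = np.sqrt(mse)                # square root puts the error back in price units
print(mae, mse, rmse)
```

Because RMSE is in the same units as the price, it can be compared directly with typical prices: the RMSE of about 235,798 reported above is large relative to the mean price of about 540,088 in the describe() output.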

Reference : https://github.com/harshitahluwalia7895/Liner-Regression

Output of dataset.head() (some columns elided):

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
0	7129300520	20141013T000000	221900.0	3	1.00	1180	5650	1.0	...	7	1180	0	1955	0	98178	47.5112	-122.257	1340	5650
1	6414100192	20141209T000000	538000.0	3	2.25	2570	7242	2.0	...	7	2170	400	1951	1991	98125	47.7210	-122.319	1690	7639
2	5631500400	20150225T000000	180000.0	2	1.00	770	10000	1.0	...	6	770	0	1933	0	98028	47.7379	-122.233	2720	8062
3	2487200875	20141209T000000	604000.0	4	3.00	1960	5000	1.0	...	7	1050	910	1965	0	98136	47.5208	-122.393	1360	5000
4	1954400510	20150218T000000	510000.0	3	2.00	1680	8080	1.0	...	8	1680	0	1987	0	98074	47.6168	-122.045	1800	7503

Output of dataset.describe():

	id	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
count	2.161300e+04	2.161300e+04	21613.000000	21613.000000	21613.000000	2.161300e+04	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000	21613.000000
mean	4.580302e+09	5.400881e+05	3.370842	2.114757	2079.899736	1.510697e+04	1.494309	0.007542	0.234303	3.409430	7.656873	1788.390691	291.509045	1971.005136	84.402258	98077.939805	47.560053	-122.213896	1986.552492	12768.455652
std	2.876566e+09	3.671272e+05	0.930062	0.770163	918.440897	4.142051e+04	0.539989	0.086517	0.766318	0.650743	1.175459	828.090978	442.575043	29.373411	401.679240	53.505026	0.138564	0.140828	685.391304	27304.179631
min	1.000102e+06	7.500000e+04	0.000000	0.000000	290.000000	5.200000e+02	1.000000	0.000000	0.000000	1.000000	1.000000	290.000000	0.000000	1900.000000	0.000000	98001.000000	47.155900	-122.519000	399.000000	651.000000
25%	2.123049e+09	3.219500e+05	3.000000	1.750000	1427.000000	5.040000e+03	1.000000	0.000000	0.000000	3.000000	7.000000	1190.000000	0.000000	1951.000000	0.000000	98033.000000	47.471000	-122.328000	1490.000000	5100.000000
50%	3.904930e+09	4.500000e+05	3.000000	2.250000	1910.000000	7.618000e+03	1.500000	0.000000	0.000000	3.000000	7.000000	1560.000000	0.000000	1975.000000	0.000000	98065.000000	47.571800	-122.230000	1840.000000	7620.000000
75%	7.308900e+09	6.450000e+05	4.000000	2.500000	2550.000000	1.068800e+04	2.000000	0.000000	0.000000	4.000000	8.000000	2210.000000	560.000000	1997.000000	0.000000	98118.000000	47.678000	-122.125000	2360.000000	10083.000000
max	9.900000e+09	7.700000e+06	33.000000	8.000000	13540.000000	1.651359e+06	3.500000	1.000000	4.000000	5.000000	13.000000	9410.000000	4820.000000	2015.000000	2015.000000	98199.000000	47.777600	-121.315000	6210.000000	871200.000000

Output of df.head(10):

	Actual	Predicted
17384	297000.0	3.637577e+05
722	1578000.0	1.417807e+06
2680	562100.0	3.789532e+05
18754	631500.0	4.710917e+05
14554	780000.0	9.186489e+05
16227	485000.0	3.544231e+05
6631	340000.0	3.919590e+05
19813	335606.0	6.257597e+05
3367	425000.0	6.791005e+05
21372	490000.0	1.251377e+06