This walkthrough explains the code for classifying a voice as male or female with an SVM, using a data file called voice.csv that contains measurements of recorded male and female voices. The classification is based on the acoustic properties of the voice and speech. The dataset contains 3,168 recorded voice samples.
Voice data file, voice.csv (1.02 MB):
http://www.acmv.org/MachineLearning/svm/example1/voice.csv
https://www.kaggle.com/primaryobjects/voicegender
The following acoustic properties are measured for each voice sample:
meanfreq: mean frequency (in kHz)
sd: standard deviation of frequency
median: median frequency (in kHz)
Q25: first quartile (in kHz)
Q75: third quartile (in kHz)
IQR: interquartile range (in kHz)
skew: skewness (see note in specprop description)
kurt: kurtosis (see note in specprop description)
sp.ent: spectral entropy
sfm: spectral flatness
mode: mode frequency
centroid: frequency centroid (see specprop)
peakf: peak frequency (frequency with highest energy)
meanfun: average of fundamental frequency measured across acoustic signal
minfun: minimum fundamental frequency measured across acoustic signal
maxfun: maximum fundamental frequency measured across acoustic signal
meandom: average of dominant frequency measured across acoustic signal
mindom: minimum of dominant frequency measured across acoustic signal
maxdom: maximum of dominant frequency measured across acoustic signal
dfrange: range of dominant frequency measured across acoustic signal
modindx: modulation index, calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range (a small sketch of this calculation follows the list below)
label: male or female
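As referenced above, here is a minimal sketch of how modindx could be computed from a sequence of fundamental-frequency measurements. The measurement values are made up for illustration, and the exact procedure used to build voice.csv may differ.
import numpy as np
# Hypothetical fundamental-frequency measurements (in kHz) along one recording
fund_freqs = np.array([0.12, 0.15, 0.11, 0.18, 0.14])
# Accumulated absolute difference between adjacent measurements,
# divided by the frequency range (max - min)
modindx = np.abs(np.diff(fund_freqs)).sum() / (fund_freqs.max() - fund_freqs.min())
print(modindx)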
The Support Vector Machine from sklearn is demonstrated with three kernels: linear, gaussian (RBF), and polynomial. To obtain the best-performing model, the notebook searches for suitable values of three parameters: C, gamma, and degree.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('voice.csv')   # load the voice dataset
df.head()                       # preview the first five rows
df.corr(numeric_only=True)      # pairwise correlations between the numeric features
df.isnull().sum()               # count missing values in each column
# Print the number of rows and columns of the DataFrame
print('Shape of DataFrame (rows, columns):', df.shape)
The dataset contains 3,168 samples and 21 columns (20 acoustic features plus the label column).
Check how many male and how many female voices are in the label column of the dataset.
# Counting Number of Male and Number of Female in Data
print("Total number of labels: {}".format(df.shape[0]))
print("Number of male: {}".format(df[df.label == 'male'].shape[0]))
print("Number of female: {}".format(df[df.label == 'female'].shape[0]))
The dataset contains an equal number of male and female labels.
The features are X, and the labels are y.
# printing keys or features of Voice DataFrame
print('Features of DataFrame: \n', df.keys())
X=df.iloc[:, :-1]
X.head() # look at the first five rows of the feature matrix
from sklearn.preprocessing import LabelEncoder
y=df.iloc[:,-1]
# Encode label category
# male -> 1
# female -> 0
gender_encoder = LabelEncoder()
y = gender_encoder.fit_transform(y)
y
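A quick check (not in the original notebook) to confirm the mapping: LabelEncoder assigns codes in sorted class order, so 'female' becomes 0 and 'male' becomes 1.
# Confirm the label mapping produced by LabelEncoder
print(gender_encoder.classes_)                       # ['female' 'male']
print(gender_encoder.transform(['female', 'male']))  # [0 1]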
The features are plotted in groups; the function below draws 10 of the 20 features of the voice DataFrame.
# Function to plot 10 features out of 20 of the voice DataFrame
def Plotting_Features(Fun, f):
    i = 0  # initial index into the features
    j = 0  # initial index into the colors
    color = ['r', 'g', 'b', 'y', 'c', 'darkblue', 'lightgreen',
             'purple', 'k', 'orange', 'olive']  # colors for the plots
    # Number of rows of subplots
    nrows = 5
    # Create the figure and axes to plot on
    fig, axes = plt.subplots(nrows, 2)
    # Set the figure size
    fig.set_figheight(20)
    fig.set_figwidth(20)
    for row in axes:
        plot1 = Fun[f[i]]
        plot2 = Fun[f[i+3]]
        col = [color[j], color[j+1]]
        label = [f[i], f[i+3]]
        plot(row, plot1, plot2, col, label)
        i = i + 4
        j = j + 2
    plt.show()

def plot(axrow, plot1, plot2, col, label):
    axrow[0].plot(plot1, label=label[0], color=col[0])
    axrow[0].legend()
    axrow[1].plot(plot2, label=label[1], color=col[1])
    axrow[1].legend()
# Setting Male Acoustic Parameters
Male = df[df['label']=='male']
Male = Male.drop(['label'],axis=1)
features = Male.keys()
Plotting_Features(Male,features)
# Setting female Acoustic Parameters
Female = df[df['label']=='female']
Female = Female.drop(['label'],axis=1)
features = Female.keys()
Plotting_Features(Female,features)
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is useful to standardize attributes for a model. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.
# Standardize the features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
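As an optional sanity check (not in the original notebook), StandardScaler applies the z-score formula z = (x - mean) / std to each column, so after transformation every feature should have mean close to 0 and standard deviation close to 1. This assumes it is run right after the scaling cell above.
# Each standardized feature should now have mean ~ 0 and standard deviation ~ 1
print(np.allclose(X.mean(axis=0), 0))
print(np.allclose(X.std(axis=0), 1))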
To encourage good generalisation, the dataset is split into a training set (X_train, y_train) and a testing set (X_test, y_test). test_size=0.2 means the testing set is 20% of the whole dataset.
# splitting Data into training and testing Data using cross_validation.train_test_split
# Train - Test data ratio of 80%-20%
# Random State to Randomize data = 1
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Below: train - test data ratio of 75%-25%
#X_train,X_test,Y_train,Y_test = train_test_split(X, y,test_size = 0.25, random_state=123)
Here the SVM is run with only the default hyperparameters, changing nothing. SVC is imported from sklearn.svm, and metrics is imported from sklearn.
# importing Support Vector Machine Algorithm for Prediction
from sklearn.svm import SVC
from sklearn import metrics
svc=SVC() # nothing is passed inside () because only the default hyperparameters are used
# The line below would instead create a classifier with C = 200 and gamma = 0.1:
# svm = SVC(C = 200, gamma = 0.1)
svc.fit(X_train,y_train) # train (fit) the SVC model
y_pred=svc.predict(X_test) # predict on X_test with the trained SVC model
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred)) # look at the accuracy score
# The accuracy score can also be computed as shown below.
"""
from sklearn.metrics import accuracy_score
#Accuracy = accuracy_score(y_pred,y_test)
Accuracy = accuracy_score(y_test,y_pred)
print('Accuracy Score:\n',Accuracy)
"""
The cross-validation scores of the three kernels 'linear', 'rbf', and 'poly' are computed in a for loop and compared.
# importing cross_val_score to calculate score
# from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
# Defining three different kernels
kernels = ['linear','rbf','poly']
score = []
for i in kernels:
    svc = SVC(kernel=i)
    Accuracy = cross_val_score(svc, X_train, y_train, cv=15, scoring='accuracy')
    score.append(Accuracy.mean())
for i in range(len(kernels)):
    print(kernels[i], ':', score[i])
# Cross-validated accuracy for C values from 1 to 10 (default RBF kernel)
score = []
for i in range(10):
    # clf = svm.SVC(C = i+1)
    svc = SVC(C=i+1)
    Accuracy = cross_val_score(svc, X_train, y_train, cv=15, scoring='accuracy')
    score.append(Accuracy.mean())
for i in range(10):
    print('C =', i+1, ': Score =', score[i])
# Cross-validated accuracy for a range of gamma values (default RBF kernel)
score = []
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1.0, 100.0, 1000.0]
for i in gamma_values:
    svc = SVC(gamma=i)
    Accuracy = cross_val_score(svc, X_train, y_train, cv=15, scoring='accuracy')
    score.append(Accuracy.mean())
for i in range(len(gamma_values)):
    print('gamma:', gamma_values[i], ': Score:', score[i])
svc=SVC(kernel='linear') # use the linear kernel
# Train the classifier, i.e. fit the constructed model
svc.fit(X_train,y_train)
The output above shows the parameter values of the fitted SVM model.
# predicting our test data
y_pred=svc.predict(X_test)
y_pred # look at the predictions made by the trained SVM model
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
From the results above we can conclude that SVC uses the RBF kernel by default when no kernel argument is given.
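A quick way to confirm this (not shown in the original notebook) is to inspect the default parameters of a fresh SVC instance:
# The default value of the kernel parameter is 'rbf'
print(SVC().get_params()['kernel'])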
svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
The polynomial kernel performs poorly. One possible reason is that it is overfitting the training dataset; a quick check of this is sketched below.
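As mentioned, one way to check the overfitting hypothesis (not done in the original notebook) is to compare the training accuracy against the test accuracy for the polynomial kernel; a training score much higher than the test score points to overfitting.
# Compare training vs. test accuracy for the polynomial kernel
svc = SVC(kernel='poly')
svc.fit(X_train, y_train)
print('Train accuracy:', svc.score(X_train, y_train))
print('Test accuracy :', svc.score(X_test, y_test))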
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='linear')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
We can see above that the accuracy score is different for every fold. This shows that the accuracy score depends on how the dataset gets split.
print(scores.mean())
In K-fold cross validation we generally take the mean of all the scores.
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='rbf')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
print(scores.mean())
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='poly')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
print(scores.mean())
K-fold cross validation has now been performed, and the score obtained in each iteration can be seen above. The train_test_split method used earlier splits the dataset into training and testing sets in a random manner.
K-fold cross validation splits the dataset into 10 equal parts, so the whole dataset is covered; that is why we get 10 different accuracy scores. A small sketch of such a split follows below.
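The sketch below (illustrative only, not part of the original notebook) uses sklearn's KFold to show how a 10-fold split partitions the sample indices; cross_val_score performs equivalent splitting internally (for classifiers it actually uses stratified folds).
from sklearn.model_selection import KFold
# Show the sizes of the 10 train/test partitions
kf = KFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print('Fold', fold, '- train size:', len(train_idx), ', test size:', len(test_idx))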
The C parameter tells the SVM optimization how strongly to penalize misclassified training examples.
With a large C value, the SVM optimization algorithm chooses a hyperplane with a narrower margin so that the training points are classified correctly; this can lead to overfitting.
With a small C value, the SVM optimization algorithm chooses a hyperplane with a wider margin; because the margin is wide, a few points may end up misclassified, and this can lead to underfitting.
The C value that generalises best should be chosen, as illustrated in the sketch below.
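As a rough illustration of this trade-off (a sketch, not part of the original notebook; it assumes the X_train/y_train split from above), a smaller C gives a softer, wider margin and typically keeps more support vectors, while a larger C gives a harder, narrower margin with fewer support vectors.
# Smaller C -> wider margin, usually more support vectors;
# larger C -> narrower margin, usually fewer support vectors
for c in [0.01, 1, 100]:
    svc = SVC(kernel='linear', C=c)
    svc.fit(X_train, y_train)
    print('C =', c, '-> number of support vectors:', svc.n_support_.sum())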
C_range=list(range(1,26))
acc_score=[]
for c in C_range:
    svc = SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
C_values=list(range(1,26))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.rcParams['figure.figsize'] = [10, 5]
plt.rcParams['figure.dpi'] = 100
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0,27,2))
plt.xlabel('Value of C for SVC')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
From the plot above we can see that the accuracy stays close to 97% for C between 1 and 6, then drops to around 96.8% and remains roughly constant.
Let us look in more detail at exactly which value of C gives us a good accuracy score.
C_range=list(np.arange(0.1,6,0.1))
acc_score=[]
for c in C_range:
    svc = SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
C_values=list(np.arange(0.1,6,0.1))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0.0,6,0.3))
plt.xlabel('Value of C for SVC ')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
Accuracy score is highest for C=0.1.
Technically, the gamma parameter is the inverse of the standard deviation of the RBF kernel (Gaussian function), which is used as a similarity measure between two points. Intuitively, a small gamma value defines a Gaussian function with a large variance; in this case, two points can be considered similar even if they are far from each other. On the other hand, a large gamma value defines a Gaussian function with a small variance, and in this case two points are considered similar only if they are close to each other.
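To make this intuition concrete, the small sketch below (not in the original notebook) evaluates the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2) for the same pair of points with a small, a medium, and a large gamma; the similarity stays near 1 for small gamma and drops towards 0 for large gamma.
import numpy as np
# RBF (Gaussian) kernel similarity between two fixed points for different gamma values
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
sq_dist = np.sum((x1 - x2) ** 2)   # squared Euclidean distance (8.0 here)
for gamma in [0.01, 1.0, 100.0]:
    print('gamma =', gamma, '-> similarity =', np.exp(-gamma * sq_dist))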
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.xticks(np.arange(0.0001,100,5))
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
The kernel does not perform well for gamma=10 and gamma=100. At gamma=1 the accuracy score drops slightly. Let us examine the gamma range from 0.0001 to 0.1.
gamma_range=[0.0001,0.001,0.01,0.1]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
gamma_range=[0.0001,0.001,0.01,0.1]
# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
The score increases steadily, reaches its peak at gamma=0.01, and then decreases up to gamma=1. Thus gamma should be around 0.01.
gamma_range=[0.01,0.02,0.03,0.04,0.05]
acc_score=[]
for g in gamma_range:
    svc = SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
gamma_range=[0.01,0.02,0.03,0.04,0.05]
# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
We can see a steady decrease in the accuracy score as the gamma value increases. Thus gamma=0.01 is the best value.
degree=[2,3,4,5,6]
acc_score=[]
for d in degree:
    svc = SVC(kernel='poly', degree=d)
    scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
degree=[2,3,4,5,6]
# plot the polynomial degree for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(degree,acc_score,color='r')
plt.xlabel('degrees for SVC ')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
The score is highest for the third-degree polynomial, and the accuracy score then drops as the degree of the polynomial increases. Increasing the polynomial degree increases the complexity of the model and thus causes overfitting.
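A short sketch (not in the original notebook) that makes this visible by comparing the training accuracy with the cross-validated accuracy as the polynomial degree grows; a widening gap between the two is a sign of increasing overfitting. It assumes X_train/y_train, X, y, and cross_val_score from the cells above.
# Training accuracy vs. cross-validated accuracy for increasing polynomial degree
for d in [2, 3, 4, 5, 6]:
    svc = SVC(kernel='poly', degree=d)
    svc.fit(X_train, y_train)
    train_acc = svc.score(X_train, y_train)
    cv_acc = cross_val_score(svc, X, y, cv=10, scoring='accuracy').mean()
    print('degree =', d, '| train acc =', round(train_acc, 4), '| cv acc =', round(cv_acc, 4))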
# Final model: linear kernel with the tuned C = 0.1, evaluated on the held-out test set
from sklearn.svm import SVC
svc= SVC(kernel='linear',C=0.1)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
svc=SVC(kernel='linear',C=0.1)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
Taking the mean of all the scores
print(scores.mean())
The accuracy is slightly better without K-fold cross validation, but such a model may fail to generalise to unseen data. Hence it is advisable to perform K-fold cross validation, where all of the data is covered, so that the model predicts unseen data well.
# Final model: RBF kernel with the tuned gamma = 0.01, evaluated on the held-out test set
from sklearn.svm import SVC
svc= SVC(kernel='rbf',gamma=0.01)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
metrics.accuracy_score(y_test,y_predict)
svc=SVC(kernel='rbf',gamma=0.01) # rbf kernel so that the tuned gamma actually takes effect
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())
# Final model: polynomial kernel with the tuned degree = 3, evaluated on the held-out test set
from sklearn.svm import SVC
svc= SVC(kernel='poly',degree=3)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)
svc=SVC(kernel='poly',degree=3)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())
The grid search technique is used to find the best parameters.
from sklearn.svm import SVC
svm_model= SVC()
# Separate parameter grids for each kernel (a list of dicts), so each kernel
# is searched with its own relevant parameters
tuned_parameters = [
    {'kernel': ['linear'], 'C': np.arange(0.1, 1, 0.1)},
    {'kernel': ['rbf'], 'C': np.arange(0.1, 1, 0.1),
     'gamma': [0.01, 0.02, 0.03, 0.04, 0.05]},
    {'kernel': ['poly'], 'C': np.arange(0.1, 1, 0.1),
     'degree': [2, 3, 4], 'gamma': [0.01, 0.02, 0.03, 0.04, 0.05]},
]
#from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV
model_svm = GridSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy')
model_svm.fit(X_train, y_train)
print(model_svm.best_score_)
print(model_svm.cv)
print(model_svm.best_params_)
y_pred= model_svm.predict(X_test)
print(metrics.accuracy_score(y_pred,y_test))