Simple Logistic Regression Model

In this post we’ll see how to build a simple Logistic Regression model.

The Problem

We want to predict whether a user is likely to purchase an item, based on a few attributes of that user. Can Gender, Age and Salary help in predicting this? We’ll see that below.

Our task here is to build a logistic regression model that predicts purchase behaviour.

Import libraries and read the data.

import numpy as np
import pandas as pd

data = pd.read_csv('./data/User_Data.csv')
data.head(10)

As we can see in the above output, ‘Purchased’ is the Y (target) variable, and Gender, Age and Salary are the X variables (features).

Separate the features from the target variable.

We’ll split the data into two parts: the features (x) and the target (y). A supervised Machine Learning model such as logistic regression needs both.

# Split data into features (x) and target variable (y)
x = data.iloc[:, 1:4]
y = data.iloc[:, 4]
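
To make the positional slicing concrete, here is a minimal sketch on a tiny made-up table. The column names and values are assumptions for illustration only; the post itself only tells us that Gender, Age, Salary and Purchased are present.

```python
import pandas as pd

# Hypothetical stand-in for User_Data.csv; column names and values are assumed.
toy = pd.DataFrame({
    "User ID": [101, 102, 103],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 47],
    "EstimatedSalary": [19000, 20000, 43000],
    "Purchased": [0, 0, 1],
})

# Columns at positions 1, 2 and 3 are the features (the stop index 4 is excluded).
toy_x = toy.iloc[:, 1:4]
# The column at position 4 is the target.
toy_y = toy.iloc[:, 4]

print(list(toy_x.columns))  # ['Gender', 'Age', 'EstimatedSalary']
print(toy_y.name)           # Purchased
```

Selecting by position is compact, but it silently picks the wrong columns if the file layout changes, so selecting by column name can be a safer choice.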

Convert categorical values into one-hot vectors.

Gender is stored as a string, so it can’t be given to the model directly. We need to convert this variable into columns of zeros and ones.

# Convert the categorical variable (Gender) into two indicator columns
x = pd.get_dummies(x)
x.head()
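
As a rough illustration of what get_dummies does, the sketch below runs it on a small made-up frame (the values are assumptions). It also shows the optional drop_first argument, which is not used in this post but keeps a single indicator column, since Gender_Female and Gender_Male carry the same information.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_x = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 47],
})

# String/categorical columns are expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy_x)
print(list(encoded.columns))  # ['Age', 'Gender_Female', 'Gender_Male']

# drop_first=True drops the first category and keeps one indicator column.
compact = pd.get_dummies(toy_x, drop_first=True)
print(list(compact.columns))  # ['Age', 'Gender_Male']
```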

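Standard Scaler

Each variable is on a different scale: Salary is in the thousands and Age is in the tens. Logistic regression will still run on unscaled data, but features on very different scales can slow down the solver and make the default regularization treat them unevenly, so we bring all variables to a common scale.

We use scikit-learn’s StandardScaler class to do this job for us. It rescales each column to zero mean and unit variance. We finally print the first 5 rows to see the output.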

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x = scaler.fit_transform(x)
x[:5]
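
The transformation itself is simple: for each column, subtract the column mean and divide by the column standard deviation. The sketch below checks this by hand on a few made-up Age and Salary values (assumed, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up Age and Salary values, for illustration only.
toy = np.array([[19.0, 19000.0],
                [35.0, 20000.0],
                [47.0, 43000.0]])

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: (value - column mean) / column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), like np.std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))           # True
print(np.allclose(scaled.mean(axis=0), 0))   # True
print(np.allclose(scaled.std(axis=0), 1))    # True
```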

Train Test Split

We need to split our data into training & test data sets. Training data is used for training the model. Test data is used to check if the model is able to predict output on a new dataset, which it has not seen during training.

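We use scikit-learn’s train_test_split function for this. We need to specify what fraction of the data is kept aside as the test data set. In this example we keep 20% of the data as the test dataset. Since no random_state is passed, the rows are shuffled differently on every run, so the exact numbers reported later can vary slightly.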

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
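
A possible variation, not what this post does: the sketch below fixes random_state so the split is repeatable, uses stratify to keep the class ratio the same in both parts, and fits the scaler on the training rows only so that no statistics from the test rows leak into training. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed): 100 rows, 2 features, binary target.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))
target = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the 60/40 ratio in both parts.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Fit the scaler on training rows only, then reuse its statistics on test rows.
scaler = StandardScaler().fit(f_train)
f_train_scaled = scaler.transform(f_train)
f_test_scaled = scaler.transform(f_test)

print(f_train.shape, f_test.shape)  # (80, 2) (20, 2)
print(int(t_test.sum()))            # 8
```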

The Model

Once all the preprocessing steps are complete, it is time to build the Logistic Regression model.

We’ll first import this class from the scikit-learn package. We create an instance of this class and call it LR.

We then call the .fit() method of this instance, which trains the model on the training data set.

from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(x_train,y_train)
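
For intuition about what .fit() estimates: a binary logistic regression learns one weight per feature plus an intercept, and turns the weighted sum into a probability with the sigmoid function. The sketch below, on synthetic data with assumed names, recomputes the probabilities by hand from the fitted coef_ and intercept_ and compares them with predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed): the class depends on the sum of two features plus noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
target = (features[:, 0] + features[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(features, target)

# Linear score w.x + b, computed from the fitted parameters.
scores = features @ model.coef_.ravel() + model.intercept_[0]
# The sigmoid maps each score to a probability of class 1.
probs = 1.0 / (1.0 + np.exp(-scores))

print(model.coef_.shape)  # (1, 2)
print(np.allclose(probs, model.predict_proba(features)[:, 1]))  # True
```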

Prediction on Test set

Now our model has learnt from the training dataset. We need to check whether it works as expected on the test data set as well.

We use the .predict() method of the LogisticRegression class to predict the output for the test data. Note that we have the actual output for the test data as well.

y_pred will hold the predicted values and y_test holds the actual values.

We’ll use these two sets of values to calculate accuracy.

y_pred = LR.predict(x_test)
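
For a binary problem, .predict() amounts to thresholding the predicted probability of class 1 at 0.5. The sketch below illustrates this on synthetic data (the data and names are assumptions); working with predict_proba directly is useful when a different threshold is wanted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed), for illustration only.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
target = (features[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(features, target)

labels = model.predict(features)
prob_class1 = model.predict_proba(features)[:, 1]

# predict() returns 1 where the class-1 probability exceeds 0.5, else 0.
print(np.array_equal(labels, (prob_class1 > 0.5).astype(int)))
```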

Accuracy

We’ll first calculate the Confusion Matrix to see how often our model predicts the output correctly.

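from sklearn.metrics import confusion_matrix,f1_score,accuracy_score

# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm
#output :
#array([[52,  9],
#       [ 4, 15]])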

The diagonal entries are the correct predictions: True Negatives (52) and True Positives (15). The remaining 13 test rows were misclassified.
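
scikit-learn’s confusion_matrix takes the actual labels first and the predictions second; rows then correspond to actual classes and columns to predicted classes. The small sketch below uses made-up labels (assumed, with 1 meaning purchased) to show the layout and what happens when the two arguments are swapped.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = purchased, 0 = not purchased.
actual    = [0, 0, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm_toy = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy.tolist())   # [[2, 2], [1, 1]]
print(tn, fp, fn, tp)    # 2 2 1 1

# Swapping the arguments transposes the matrix, which swaps the
# false-positive and false-negative counts.
swapped = confusion_matrix(predicted, actual)
print(swapped.tolist())  # [[2, 1], [2, 1]]
```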

We can now calculate accuracy score for this output.

accuracy_score(y_test, y_pred)

#output : 0.8375

So, our model reaches about 84% accuracy (83.75%) on the test set.
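
The accuracy can be reproduced by hand from the confusion matrix counts, and the same counts give precision, recall and F1, which are often more informative than accuracy when the classes are imbalanced. The sketch below assumes that label 1 (purchased) is the positive class, which gives 9 false positives (predicted purchase, no actual purchase) and 4 false negatives.

```python
# Counts taken from the confusion matrix in this post, assuming that
# label 1 (purchased) is the positive class.
tn, fp, fn, tp = 52, 9, 4, 15

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # share of all predictions that are correct
precision = tp / (tp + fp)              # share of predicted purchases that were real
recall = tp / (tp + fn)                 # share of real purchases that were found
f1 = 2 * precision * recall / (precision + recall)

print(total)                # 80
print(accuracy)             # 0.8375
print(round(precision, 3))  # 0.625
print(round(recall, 3))     # 0.789
print(round(f1, 3))         # 0.698
```

The gap between accuracy and F1 here suggests the model is noticeably better at recognising non-purchasers than purchasers.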

Conclusion

We have built a Logistic Regression model for the dataset we have chosen.

We converted the string variable Gender into a one-hot vector. We then brought all the variables to a common scale. Finally, we built the model, which gives about 84% accuracy on the test dataset.