# How to Implement Logistic Regression from Scratch in Python?

The best way to understand any computer algorithm is to build it from scratch on your own.

This article is all about decoding the Logistic Regression algorithm using Gradient Descent. We’ll first build the model from scratch using Python, then test it on the Breast Cancer dataset. Finally, we’ll compare the performance of our model against the actual implementation from scikit-learn.

## Algorithm

1. Assign random weights to the weight matrix.
2. Multiply the weight matrix with the input values.
3. Use the sigmoid function to squash the values between 0 and 1.
4. Compare the predicted output with the actual output.
5. Backpropagate and update the weight matrix.
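The five steps above can be sketched in a few lines of NumPy before we build them out properly. The data here is a tiny made-up example, not the dataset used later:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))             # 4 samples, 3 features (toy data)
y = np.array([0, 1, 1, 0])         # actual outputs

weights = rng.random(3)            # 1. assign random weights
z = X @ weights                    # 2. multiply weights with inputs
y_hat = 1 / (1 + np.exp(-z))       # 3. squash values between 0 and 1
error = y_hat - y                  # 4. compare predicted vs actual
weights -= 0.1 * X.T @ error / 4   # 5. update the weights
```

Each of these steps is unpacked in the sections that follow.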

Let’s calculate the value z, which is a linear combination of the features (x1, x2, …, xn) and the weights (w1, w2, …, wn):

z = w1·x1 + w2·x2 + … + wn·xn = w·x

In python code, we can write this as:

```python
z = np.dot(X, weights)
```

This output will be fed into the sigmoid function, σ(z) = 1 / (1 + e^(−z)), which maps any real value into the range (0, 1).

The code for the same is given below:

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
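A quick sanity check of this function, with inputs chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # exactly 0.5: no evidence either way
print(sigmoid(10))    # very close to 1
print(sigmoid(-10))   # very close to 0
```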

### Matrix Visualization

We can visualise the steps so far as one big matrix multiplication followed by the sigmoid. For this post, we’ll use the Breast Cancer Wisconsin dataset, which has 569 samples and 30 features.

In each epoch, the entire dataset is multiplied with the weight matrix and passed through the sigmoid function to get the predicted values. This output is compared with the actual output and the error is calculated. This error is used to update the weights before the next epoch.

Note that in practice we won’t use the entire dataset for training; we’ll split it into a training and a test set.
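A common way to do this split is with scikit-learn's `train_test_split`. The 80/20 ratio and `random_state` below are conventional choices for illustration, not something the model requires:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape)  # (455, 30)
print(X_test.shape)   # (114, 30)
```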

## Loss Function

The simple function described above will be of no use if the model doesn’t learn on each iteration.

We need to evaluate the result with the cross-entropy loss function:

- if y = 1: loss = −log(ŷ)
- if y = 0: loss = −log(1 − ŷ)

The natural logarithm works in our favour here, because it penalizes heavily when the prediction is far from the true value. For example, if the model predicts ŷ = 0.16 while the true y = 1, the loss −log(0.16) ≈ 1.83 is high; a prediction close to 1 would give a loss close to 0, and vice versa.
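We can check this numerically. The 0.16 comes from the example above; 0.95 is an arbitrary "good" prediction added for contrast:

```python
import math

# true label y = 1, so the loss is simply -log(yhat)
bad_prediction = -math.log(0.16)   # prediction far from the true value
good_prediction = -math.log(0.95)  # prediction close to the true value

print(round(bad_prediction, 2))    # 1.83
print(round(good_prediction, 2))   # 0.05
```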

We can combine the two cases into a single equation, averaged over all m samples:

L = −(1/m) · Σᵢ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]

Here the yᵢ and (1 − yᵢ) factors take care of which term survives, depending on the actual value of y: if y = 0, only the second term is evaluated, and if y = 1, only the first term is.

Let’s see how this will look in actual code.

```python
def loss_func(X, y, weights):
    m = len(X)
    yhat = sigmoid(np.dot(X, weights))
    predict_1 = y * np.log(yhat)
    predict_0 = (1 - y) * np.log(1 - yhat)
    return -np.sum(predict_1 + predict_0) / m
```

So far we have seen the part of the algorithm that produces the output ŷ (yhat). As said earlier, we need to make the algorithm learn from the actual output. To achieve that, we calculate the derivative of the loss function and update the weights after every training epoch.

We have already defined our loss function above. All we need is the derivative of the loss function w.r.t. the weights. By the chain rule, this can be calculated as:

dL/dw = dL/dŷ · dŷ/dz · dz/dw

We will calculate each term individually and then multiply them together.

### Derivative of the first term in the chain (dL/dŷ)

We know that the derivative of the natural log(x) is 1/x. To calculate the derivative of the loss function, we differentiate its two parts separately:

d/dŷ [ −y·log(ŷ) ] = −y/ŷ

d/dŷ [ −(1 − y)·log(1 − ŷ) ] = (1 − y)/(1 − ŷ)

We can combine both terms and simplify:

dL/dŷ = −y/ŷ + (1 − y)/(1 − ŷ) = (ŷ − y) / (ŷ(1 − ŷ))

### Derivative of the second term in the chain (dŷ/dz)

The equation for ŷ is:

ŷ = σ(z) = 1 / (1 + e^(−z))

and the derivative of the sigmoid has the well-known form:

dŷ/dz = ŷ(1 − ŷ)

The third term is simply dz/dw = x, since z = w·x.

### Combining all of the derivatives we get:

dL/dw = (ŷ − y)/(ŷ(1 − ŷ)) · ŷ(1 − ŷ) · x = x(ŷ − y)

We get a nice, simple derivative for the weights, which we can use to update them on every epoch.

```python
epochs = 100
learning_rate = 0.1

for _ in range(epochs):
    y_hat = sigmoid(np.dot(X, weights))
    weights -= learning_rate * np.dot(X.T, y_hat - y) / len(X)
```

The learning rate is usually a small number that controls how fast or slow we move towards the minimum.
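A tiny illustration of the effect of the learning rate, using toy data and arbitrarily chosen rates:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = (X[:, 0] > 0.5).astype(float)  # a simple, learnable target

def train(lr, epochs=100):
    w = np.zeros(3)
    for _ in range(epochs):
        y_hat = sigmoid(X @ w)
        w -= lr * X.T @ (y_hat - y) / len(X)
    # final cross-entropy loss
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(train(0.01))  # tiny steps: loss barely moves in 100 epochs
print(train(0.5))   # larger steps: loss drops noticeably faster
```

Too large a rate can also overshoot the minimum, so the value is usually tuned by experiment.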

## Complete Code

```python
import numpy as np

class LR:

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def loss_func(self, X, y, weights):
        m = len(X)
        yhat = self.sigmoid(np.dot(X, weights))
        predict_1 = y * np.log(yhat)
        predict_0 = (1 - y) * np.log(1 - yhat)
        return -np.sum(predict_1 + predict_0) / m

    def fit(self, X, y, epochs=25, learning_rate=0.05):
        loss = []
        weights = np.random.rand(X.shape[1])
        N = len(X)

        for _ in range(epochs):
            z = np.dot(X, weights)
            y_hat = self.sigmoid(z)
            weights -= learning_rate * (X.T @ (y_hat - y)) / N
            loss.append(self.loss_func(X, y, weights))
        self.weights = weights
        self.loss = loss

    def predict(self, X):
        z = np.dot(X, self.weights)
        # Binary result
        return [1 if i > 0.5 else 0 for i in self.sigmoid(z)]
```

Note that an extra method (add_intercept) is included in the class, which prepends a column of ones to the training data. This adds a bias (intercept) term to the model, so the decision boundary is not forced to pass through the origin.
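On a small made-up matrix, the intercept column looks like this:

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# prepend a column of ones, exactly as add_intercept does
intercept = np.ones((X.shape[0], 1))
X_with_bias = np.concatenate((intercept, X), axis=1)

print(X_with_bias)
# [[1. 2. 3.]
#  [1. 4. 5.]]
```

The weight learned for that first column acts as the intercept term.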

## Model Predictions & Evaluation

```python
logreg = LR()
%time logreg.fit(X_train, y_train, epochs=200, learning_rate=0.7)
y_pred = logreg.predict(X_test)

# CPU times: user 119 ms, sys: 7.31 ms, total: 126 ms
# Wall time: 34.9 ms
```

The model took 34.9 ms (wall time) to train.

```python
(y_pred == y_test).mean()
# 0.9734
```

The predicted values, compared against the test set, show about 97% accuracy.

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))
print('-' * 60)
print('Confusion Matrix\n')
print(confusion_matrix(y_test, y_pred))
```

## Logistic Regression from sklearn

We’ll import the LogisticRegression class from scikit-learn and use the same dataset to predict the values.

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
%time log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# CPU times: user 2.82 ms, sys: 1.42 ms, total: 4.24 ms
# Wall time: 2.4 ms
```

The sklearn model took only 2.4 ms to give similar results.

```python
(y_pred == y_test).mean()
# 0.9787

print(classification_report(y_test, y_pred))
print('-' * 60)
print('Confusion Matrix\n')
print(confusion_matrix(y_test, y_pred))
```

## Conclusion

We can see that our model is on par with the standard model from scikit-learn in accuracy. However, our model takes much longer to train (34.9 ms vs 2.4 ms), largely because scikit-learn uses highly optimized solvers under the hood. Another difference is that we have not added regularization, which scikit-learn applies by default and which helps prevent overfitting.

Further steps here could be the addition of L2 regularization and multiclass classification.
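As a sketch of the first of those steps: L2 regularization only changes the weight update, adding the gradient of a penalty term (reg/2)·||w||² to the gradient of the loss. The `reg_strength` value is an arbitrary choice for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_l2(X, y, epochs=100, learning_rate=0.1, reg_strength=0.01):
    """Gradient descent with an L2 penalty on the weights."""
    N = len(X)
    weights = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(X @ weights)
        # cross-entropy gradient plus gradient of (reg/2)*||w||^2
        grad = X.T @ (y_hat - y) / N + reg_strength * weights
        weights -= learning_rate * grad
    return weights
```

With `reg_strength = 0` this reduces to the update used in fit() above; larger values pull the weights towards zero.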
