Comparing Custom Logistic Regression Model on Datasets

In my previous post I explained how to build a Logistic Regression model from scratch. That model is not perfect when compared with the actual model from scikit-learn, so I wanted to check its performance on different datasets.

This is a relatively simple post that summarizes the datasets used to compare the two models (scikit-learn’s LogisticRegression and the custom model built earlier) and the final output of each model.
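The comparison can be sketched as a small helper that fits every model on the same train/test split and reports an F1 score for each. This is only an illustration: the helper name `compare_models`, the 75/25 split, and the weighted F1 average are assumptions, and synthetic data stands in for the real datasets. Any custom model exposing `fit` and `predict` could be added to the dictionary next to scikit-learn’s.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def compare_models(models, X, y, test_size=0.25, seed=0):
    """Fit each model on the same split and return {name: weighted F1}.

    `models` maps a label to any object with fit(X, y) and predict(X).
    The helper name and defaults are hypothetical, not from the original post.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        scores[name] = f1_score(y_test, predictions, average="weighted")
    return scores


# Synthetic data stands in for the datasets described in this post.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo_scores = compare_models(
    {"scikit-learn": LogisticRegression(max_iter=1000)}, X_demo, y_demo
)
print(demo_scores)
```

Using one split for all models keeps the comparison fair, since every model sees exactly the same training and test rows.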

Here’s the list of datasets that were used for comparison:

1. Iris Dataset

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

Attribute Information:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: Iris Setosa, Iris Versicolour, Iris Virginica
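As a quick check of the description above, the same dataset ships with scikit-learn and can be inspected directly. This is a minimal sketch, assuming the bundled `load_iris` copy is the one used for the comparison (the post does not say where the data was loaded from).

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the Iris dataset is used.
iris = load_iris()
X, y = iris.data, iris.target

# Count how many instances belong to each of the three classes.
class_counts = np.bincount(y)

print(iris.feature_names)   # the four measurements listed above
print(X.shape)              # (number of instances, number of features)
print(dict(zip(iris.target_names, class_counts)))
```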

Performance of Scikit Learn Model

Performance of Custom Model

2. Titanic dataset

Task: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. Build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Performance of Scikit Learn Model

Performance of Custom Model

3. Companies Dataset

This dataset has Gender, Age and Salary information. The task is to build a model that predicts whether a user has made a purchase.

Performance of Scikit Learn Model

Performance of Custom Model

4. Candidates Dataset

This is a students dataset with features such as GMAT score, GPA and work experience. The task is to build a model that predicts whether a candidate will be admitted to college.

Performance of Scikit Learn Model

Performance of Custom Model

5. Bank Marketing Dataset

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).

URL: https://archive.ics.uci.edu/ml/datasets/bank+marketing#

Performance of Scikit Learn Model

Performance of Custom Model

Summary Table

Conclusion

We have seen how the custom model performs across different datasets. On the Iris and Bank Marketing datasets the F1 scores of the two models are similar, whereas on the other datasets the difference is huge.

Things to note here:

  1. The custom model is not very accurate. It has no regularization, so there’s a chance that it is overfitting the training data.
  2. The training data was fed to the model with minimal preprocessing and was not scaled. Unscaled features affect the gradient calculation, because features with large ranges dominate the updates.
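To make the first point concrete, here is one way an L2 penalty could be added to the gradient of a from-scratch binary logistic regression. It is a sketch under assumptions: the function names, the `l2_strength` parameter, and the choice of L2 (rather than L1) are mine, and the custom model’s actual code may be organized differently. For reference, scikit-learn’s LogisticRegression applies an L2 penalty by default, which may explain part of the gap.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def regularized_gradient(weights, bias, X, y, l2_strength):
    """Gradient of the mean log-loss plus (l2_strength / 2) * ||weights||^2.

    Hypothetical helper: the bias is left unpenalized, a common convention.
    """
    errors = sigmoid(X @ weights + bias) - y           # prediction minus label
    grad_w = X.T @ errors / X.shape[0] + l2_strength * weights
    grad_b = errors.mean()
    return grad_w, grad_b


# Tiny made-up example, only to show the effect of the penalty term.
X_toy = np.array([[1.0, 200.0], [2.0, 150.0], [3.0, 400.0]])
y_toy = np.array([0.0, 0.0, 1.0])
w_toy = np.array([0.5, -0.01])

grad_plain, _ = regularized_gradient(w_toy, 0.0, X_toy, y_toy, l2_strength=0.0)
grad_l2, _ = regularized_gradient(w_toy, 0.0, X_toy, y_toy, l2_strength=1.0)
print(grad_plain, grad_l2)
```

The penalty simply adds `l2_strength * weights` to the weight gradient, pulling large weights back toward zero on every update.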

So, here are the next steps:

  1. Scale the features and run the models on these datasets again to see how the performance changes, especially for the custom model.
  2. Investigate each dataset where the difference in performance between the standard model and the custom model is huge.
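The first step could look like the sketch below, which standardizes features with scikit-learn’s `StandardScaler` before training. The choice of standardization (zero mean, unit variance) over other scaling methods, the Iris data, and the 75/25 split are assumptions for illustration. The scaler is fitted on the training split only, so no information from the test rows leaks into training.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Learn each feature's mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))
```

The scaled arrays can then be passed to both the scikit-learn model and the custom model in place of the raw features.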