The main purpose of this post is to explore the different ways in which Logistic Regression can be applied to a dataset, and in doing so understand how the model actually works. The idea is not to solve the problem itself. This post doesn’t focus on getting the best score by comparing different models; instead, it assumes that Logistic Regression is the only model available.
This is part of a series of posts for learning and sharing the details of Logistic Regression. If you’re new to this, kindly refer to my earlier posts on the same topic:
- How to Implement Logistic Regression from scratch in Python?
- Comparing Custom Logistic Regression Model on Datasets
- Datasets for practicing Logistic Regression
- Metrics for measuring your Classification Model
For this post I’m using the Heart Disease prediction dataset from Kaggle. The dataset has demographic features, behavioural features (current smoker, cigarettes per day) and medical history features, and our task is to predict whether a person has a 10-year risk of coronary heart disease. The output is binary: 1 means “Yes” and 0 means “No”.
The entire process can be summed up in the following steps:
- Build a Simple model
- Eliminate Features
- Handle Imbalanced Data
- Select K Best Features
- Scale the Features
- Scale & Handle Imbalance
- Different Solvers
1. Build a Simple model
First, we’ll import the libraries and load the dataset.
Then we’ll create a small helper function that runs the model and calculates its performance metrics. We’ll use this function throughout the post.
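The original helper isn’t reproduced here, so the following is only a minimal sketch of what such a function could look like. The name `run_model`, its signature and the `max_iter=1000` default are my assumptions; the metrics are the ones quoted in this post (accuracy, F1, sensitivity, specificity).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score


def run_model(X_train, X_test, y_train, y_test, **lr_kwargs):
    """Hypothetical helper: fit LogisticRegression, return test-set metrics."""
    # Assumed default; a higher max_iter avoids convergence warnings on unscaled data.
    lr_kwargs.setdefault("max_iter", 1000)
    model = LogisticRegression(**lr_kwargs)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # recall on class 1
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # recall on class 0
    }
```

Returning a dictionary makes it easy to compare the runs from the different steps side by side.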
Check for null values and drop the affected rows, as they make up only a small portion of the dataset.
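As a rough illustration of the null check, here is a sketch on a tiny made-up frame. The column names and values are placeholders I chose for the example, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; names and values are illustrative only.
df = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "cigsPerDay": [0.0, np.nan, 20.0, 30.0, 23.0],
    "glucose": [77.0, 76.0, np.nan, 103.0, 85.0],
    "TenYearCHD": [0, 0, 0, 1, 0],
})

null_counts = df.isnull().sum()               # missing values per column
null_share = df.isnull().any(axis=1).mean()   # fraction of rows with any null
print(null_counts)
print(f"rows with at least one null: {null_share:.0%}")

# Drop every row that contains a null and renumber the index.
df_clean = df.dropna().reset_index(drop=True)
print(df_clean.shape)
```

Looking at the share of affected rows first is what justifies dropping them instead of imputing.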
The distribution of the dependent (target) variable shows that the dataset is imbalanced.
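One simple way to look at that distribution is `value_counts`. The labels below are made up for the example (an 85/15 split), and the series name `TenYearCHD` is an assumption about the target column.

```python
import pandas as pd

# Made-up labels with an 85/15 split, standing in for the real target column.
y = pd.Series([0] * 85 + [1] * 15, name="TenYearCHD")

counts = y.value_counts()                 # absolute count per class
shares = y.value_counts(normalize=True)   # relative frequency per class
print(counts)
print(shares)
```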
Let’s now create the train and test datasets and run the Logistic Regression model.
The model’s performance is poor in terms of F1 score, which is 0.05, even though the accuracy is 85%.
In this case the model is classifying almost everything as negative, which is why we get a specificity of 99% and a sensitivity of only 2.7%.
If you’re not sure about the metrics for measuring classification models, refer to my earlier post on this.
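A minimal sketch of this step is shown below. It runs on synthetic data from `make_classification` as a stand-in for the heart disease dataset, and the 80/20 split, `stratify=y` and `random_state=42` are my choices, not necessarily the ones used for the numbers in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 15 features, roughly 15% positives.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
print("accuracy:", accuracy)
print("f1:", f1)
```

On imbalanced data the two numbers can be far apart, which is why the post tracks F1 instead of accuracy alone.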
2. Eliminate the Features
Using scikit-learn’s RFECV class, we’ll eliminate unimportant features and create a new dataset.
We’ll now run the model and measure its performance.
We can see that the F1 score has improved to 12% and sensitivity (recall) has improved from 3% to 7%.
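Here is a sketch of how `RFECV` can be used for this, again on synthetic stand-in data. The `cv=5`, `step=1` and `scoring="f1"` settings are assumptions for the example; the original settings aren’t shown.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a few informative and redundant features.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           n_redundant=2, weights=[0.85, 0.15], random_state=42)

# Recursively drop the weakest feature and keep the subset with the best CV score.
selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=5, scoring="f1")
selector.fit(X, y)

X_reduced = selector.transform(X)   # keeps only the selected columns
print("features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

With a pandas DataFrame, `X.columns[selector.support_]` would give the names of the features that survived.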
2.1 Handle Imbalanced Data
We’ll randomly select rows from the majority class (‘zero’) and merge them with all the rows of the minority class (‘one’) so that both classes are equally represented.
We’ll run the model again.
We see that the F1 score has improved from 12% to 34%.
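This kind of random undersampling can be sketched with plain pandas. The frame below is made up, and the target column name `TenYearCHD` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up frame: 170 negatives and 30 positives.
rng = np.random.RandomState(42)
df = pd.DataFrame({
    "age": rng.randint(30, 70, size=200),
    "TenYearCHD": [0] * 170 + [1] * 30,
})

positives = df[df["TenYearCHD"] == 1]
negatives = df[df["TenYearCHD"] == 0]

# Draw as many negatives as there are positives, without replacement.
negatives_sampled = negatives.sample(n=len(positives), random_state=42)

# Merge both classes and shuffle the rows.
balanced = (pd.concat([positives, negatives_sampled])
            .sample(frac=1, random_state=42)
            .reset_index(drop=True))
print(balanced["TenYearCHD"].value_counts())
```

Undersampling throws away majority-class rows, so the balanced dataset is much smaller than the original one.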
3. Select K best Features
Using another scikit-learn class, SelectKBest, we’ll select the k best features and run the model.
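A minimal sketch with `SelectKBest` follows. The score function (`f_classif`) and `k=6` are assumptions for the example, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)

# Score each feature against the target on its own and keep the top k.
selector = SelectKBest(score_func=f_classif, k=6)
X_best = selector.fit_transform(X, y)

print("kept column indices:", selector.get_support(indices=True))
print("new shape:", X_best.shape)
```

Unlike RFECV, this scores every feature independently and never refits the model, so it is faster but ignores interactions between features.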
The F1 score is now 9%, which is better than the 5% of the simple model but lower than the 12% we got earlier after removing the unwanted features.
3.1 Scale the features
Scaling didn’t improve on the score we got earlier.
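For reference, a scaling step could look like the sketch below. I’m assuming `StandardScaler` here; the scaler actually used isn’t named in this post, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split,
# so no information from the test data leaks into the preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # close to 0 for every feature
print(X_train_scaled.std(axis=0).round(3))   # close to 1 for every feature
```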
3.2 Scale & Handle Imbalance
We see that there’s a significant increase in the F1 score, which is now 42%.
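One possible way to combine the two steps is sketched below: undersample the majority class, then scale, then fit. The order of the steps, the choice to balance only the training split and the use of `StandardScaler` are my assumptions, and the data is synthetic, so the printed score will not match the 42% above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 15% positives.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Undersample the majority class in the training split only.
rng = np.random.RandomState(42)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = rng.choice(np.flatnonzero(y_train == 0), size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_bal, y_bal = X_train[keep], y_train[keep]

# Scale using statistics from the balanced training data, then fit and evaluate.
scaler = StandardScaler().fit(X_bal)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_bal), y_bal)
y_pred = model.predict(scaler.transform(X_test))
print("f1:", f1_score(y_test, y_pred, zero_division=0))
```

Leaving the test split untouched keeps the reported score representative of the original, imbalanced class ratio.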
3.3 Try Different Solvers
The LogisticRegression class lets us choose between different solvers for fitting the model. We’ll try each of them in turn and measure the performance.
The liblinear solver has the highest F1 score, 42%, of all the solvers for this dataset.
We’ve seen how the Logistic Regression model behaves at various stages of data processing. After scaling the features and handling the imbalanced data, we achieved the maximum F1 score of 42%.
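A loop over solvers could be sketched like this. The list of five solvers, the scaling pipeline and `max_iter=5000` are my choices for the example, and the data is synthetic, so the ranking it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

solvers = ["liblinear", "lbfgs", "newton-cg", "sag", "saga"]
scores = {}
for solver in solvers:
    # Scaling inside the pipeline helps sag and saga converge.
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(solver=solver, max_iter=5000))
    clf.fit(X_train, y_train)
    scores[solver] = f1_score(y_test, clf.predict(X_test), zero_division=0)

for solver, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{solver:10s} f1={score:.3f}")
```

All of these solvers minimise the same regularised objective when they converge, so large differences between them usually point to convergence or scaling issues.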
Note that we have not used the Area Under the Curve (AUC) metric to measure the performance in this post.
The entire code for this is available here