Solving a fruit classification problem in Python

In this blog post we'll walk through a simple classification task on a fruit dataset.

The dataset contains the fruit name as the target variable and mass, width, height, and color score as features. It is a small dataset with fewer than 100 training examples. You can download the dataset from here.

Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data file

data = pd.read_table("./data/fruit_data_with_colors.txt")
data.head()
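
Before going further, a quick sanity check on the size of the data and the classes involved can't hurt (the exact values depend on your copy of the file):

# Quick look at the shape and the target classes
print(data.shape)
print(data['fruit_name'].unique())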

Plot the fruit names

To understand the distribution of fruit names, let's plot the count of each category using the seaborn library.

sns.countplot(x='fruit_name',data=data)
plt.show()

It looks like the fruit classes are fairly evenly distributed, except mandarin, which has noticeably fewer samples.
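
If you want exact counts rather than a plot, pandas can print them directly:

# Per-class counts, sorted in descending order
print(data['fruit_name'].value_counts())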

Plot the distribution of features

data.drop('fruit_label',axis=1).plot(kind='box',subplots=True,layout=(2,2),figsize=(8,8))
plt.show()

We can see a few outliers in the mass, height, and width features. These might affect the model's accuracy. For now, let's not remove the outliers.
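
If you do want to flag those outliers later, the usual 1.5×IQR rule is a reasonable starting point; here's a minimal sketch (the 1.5 multiplier is a common convention, not something derived from this dataset):

# Count values outside 1.5*IQR for each numeric feature
numeric = data[['mass','width','height','color_score']]
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5*iqr) | (numeric > q3 + 1.5*iqr)
print(outliers.sum())  # flagged values per feature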

Build the Model

Let's first separate the features from the target variable: the X variable will hold all the features and the y variable will hold the target.

feature_names = ['mass', 'width', 'height', 'color_score']
X = data[feature_names]
y = data['fruit_label']

Import the libraries

#Preprocessing

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#Modelling

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#Accuracy
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Train test split & Scaling

We'll split the dataset into training and test sets using sklearn's train_test_split. We then apply MinMaxScaler to bring all the features onto a similar scale. (Note that without a fixed random_state, the exact split, and hence the accuracy numbers below, will vary from run to run.)

x_train,x_test,y_train,y_test = train_test_split(X,y, test_size=0.2)

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
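
MinMaxScaler rescales each feature as x' = (x − min) / (max − min), using the minimum and maximum learned from the training set. A quick check confirms the training features now lie in [0, 1] (test values can fall slightly outside, since the scaler never saw them):

# Scaled training features lie in [0, 1]; test features may spill slightly outside
print(x_train.min(axis=0), x_train.max(axis=0))
print(x_test.min(axis=0), x_test.max(axis=0))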

Let's build different models

lr = LogisticRegression()
lr.fit(x_train,y_train)

print(f'Accuracy on training set {lr.score(x_train,y_train)}')
print(f'Accuracy on test set {lr.score(x_test,y_test)}')

Accuracy on training set 0.70

Accuracy on test set 0.66

dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

print(f'Accuracy on training set {dtc.score(x_train,y_train)}')
print(f'Accuracy on test set {dtc.score(x_test,y_test)}')

Accuracy on training set 1.0

Accuracy on test set 0.833
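
A perfect score on the training set is a classic sign of overfitting: an unconstrained decision tree can simply memorise the training data. One way to rein it in is to limit the tree's depth (a sketch; max_depth=3 is an arbitrary choice, not from the original run):

# A shallower tree trades training accuracy for better generalisation
dtc_pruned = DecisionTreeClassifier(max_depth=3)
dtc_pruned.fit(x_train,y_train)
print(f'Accuracy on test set {dtc_pruned.score(x_test,y_test)}')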

knc = KNeighborsClassifier()
knc.fit(x_train,y_train)

print(f'Accuracy on training set {knc.score(x_train,y_train)}')
print(f'Accuracy on test set {knc.score(x_test,y_test)}')

Accuracy on training set 0.9574468085106383

Accuracy on test set 0.9166666666666666

Across all the models, the K-nearest neighbours model gives us the best performance. Hence, we'll use that model for the rest of the evaluation.
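
Note that KNeighborsClassifier defaults to n_neighbors=5; with a dataset this small, it's worth trying a few values of k before settling (a quick sketch, not part of the original run):

# Compare test accuracy for a few values of k
for k in [1,3,5,7,9]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train,y_train)
    print(k, model.score(x_test,y_test))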

Confusion matrix

pred = knc.predict(x_test)
confusion_matrix(y_test,pred)

#Output (rows = true labels, columns = predicted labels)
array([[4, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 4, 0],
       [0, 0, 0, 2]])

It looks like the model predicts most of the classes accurately, with only one misclassified sample out of the twelve in the test set.
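
Since seaborn is already imported, the matrix is easier to read as an annotated heatmap (a small sketch; the tick labels are taken from the sorted class labels in y):

# Plot the confusion matrix as a heatmap
cm = confusion_matrix(y_test,pred)
labels = sorted(y.unique())
sns.heatmap(cm,annot=True,xticklabels=labels,yticklabels=labels,cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()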

print(classification_report(y_test,pred))

The model achieves an F1 score of 0.89 or above for every fruit class.
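
Finally, here's how you'd use the trained model on a new fruit. The measurements below are made up purely for illustration:

# Predict the label of a hypothetical new fruit
new_fruit = pd.DataFrame([[150, 7.0, 7.5, 0.70]], columns=feature_names)  # made-up values
new_fruit_scaled = scaler.transform(new_fruit)  # reuse the scaler fitted on training data
print(knc.predict(new_fruit_scaled))            # predicted fruit_label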

Conclusion

We trained models using different algorithms, and K-nearest neighbours gave us the best score on both the training and test datasets. We then used this model to compute the confusion matrix and the classification report, obtaining an F1 score of 0.89 or above for all the classes.