ROOKIE DATA SCIENTIST GUIDE

A comprehensive walkthrough for the Big Mart Sales prediction project using Machine Learning. A perfect start to becoming a Data Scientist.

11 MINUTE READ

‘Machine learning’ is one of the hottest topics in tech right now. Are you confused about where to start? Maybe you have finished some online courses but still don’t feel confident enough to develop your own project. Online courses that walk you through a guided project are a great start, but they are still nowhere close to building your own machine learning models from scratch.

Contrary to popular belief, machine learning is not all about making a model that gives you predictions. This flow chart will help you understand the principal components of machine learning:

(Flow chart of the machine learning workflow. Created by Suraj Sarangi)

The chart shows the steps to making a working model.

  • The first step is preprocessing. In my experience this is the most time-consuming step. It needs careful observation of the datasets we are provided with and analysis of the relations between the attributes and the target. It also involves imputation of missing values, as NaN values are not very friendly to work with. The last part involves dropping features that don’t seem very important.

  • The second step is feature engineering. This is the most important step, as it can make bad datasets work really well. It involves scaling features whose ranges are very wide (a toy sketch follows after this list).

Some attributes may be too unwieldy to work with, so we create derived variables to make them computationally more efficient.
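As a tiny illustration of what feature scaling might look like, here is a minimal sketch using scikit-learn’s StandardScaler on made-up numbers; this exact step is not part of the Big Mart pipeline below, the column names are only borrowed for flavour.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Item_Weight': [9.3, 5.9, 17.5],
                   'Item_MRP': [249.8, 48.3, 141.6]})
scaler = StandardScaler()
# fit_transform centres each column to mean 0 and scales it to unit variance
df[['Item_Weight', 'Item_MRP']] = scaler.fit_transform(df[['Item_Weight', 'Item_MRP']])
print(df)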

Then comes building a model for predictions. There are linear models like Linear Regression, Lasso, Ridge, etc. There are also tree-based models like Decision Tree, Random Forest, etc. And there is always a neural network somewhere around the corner.

After the model is built, its hyperparameters are carefully tuned based on training-set and test-set accuracy, and the model can then be ensembled with other algorithms. There is one final step, not shown in the chart: deployment of the model, which requires building an app and hosting the model and its data in the cloud.
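To give a flavour of what hyperparameter tuning looks like in code, here is a minimal sketch using scikit-learn’s GridSearchCV on dummy data; the estimator and parameter grid are illustrative assumptions, not the settings used later in this project.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))                                            # dummy features
y_demo = X_demo @ np.array([3.0, 0.0, 1.0, 0.0, 2.0]) + rng.normal(size=100)  # dummy target

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}   # candidate regularisation strengths
search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)
print(search.best_params_)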

PREPROCESSING

The first step involves getting data. You can download a dataset from the internet or, if you have the resources, create your own. This data usually comes with errors, missing values and unwanted attributes. While the rules of English put great emphasis on punctuation marks and spaces, our machine simply doesn’t care: cleaning the data, such as removing white space and unnecessary punctuation, is generally the first step when building models for text classification. Image classification tasks usually come with fairly robust datasets and often need little preprocessing of this kind.
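As a minimal sketch of what such cleaning could look like for a text task (standard library only; this is an aside, not part of the Big Mart pipeline):

import re
import string

def clean_text(s):
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    s = re.sub(r'\s+', ' ', s).strip()                          # collapse runs of whitespace
    return s

print(clean_text('  Hello,   World!!  '))   # -> 'hello world'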

Let’s work on a business problem: predicting sales for Big Mart. Such a model can help the business understand which products will be in demand and which ones it should consider getting rid of. Let’s load the datasets and the libraries required for processing the data.

import pandas as pd
url1 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Train.csv'
train = pd.read_csv(url1)
url2 = 'https://raw.githubusercontent.com/SurajSarangi/Big-Mart-Sales-Prediction/master/Test.csv'
test = pd.read_csv(url2)
train.head()
Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales
FDA15 | 9.30 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380
DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228
FDN15 | 17.50 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700
FDX07 | 19.20 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800
NCD19 | 8.93 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052

Two important functions which help in getting some information about the dataset are:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8523 non-null object
Item_Weight                  7060 non-null float64
Item_Fat_Content             8523 non-null object
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Outlet_Size                  6113 non-null object
Outlet_Location_Type         8523 non-null object
Outlet_Type                  8523 non-null object
Item_Outlet_Sales            8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB

train.describe()
       Item_Weight  Item_Visibility     Item_MRP  Outlet_Establishment_Year  Item_Outlet_Sales
count  7060.000000      8523.000000  8523.000000                8523.000000        8523.000000
mean     12.857645         0.066132   140.992782                1997.831867        2181.288914
std       4.643456         0.051598    62.275067                   8.371760        1706.499616
min       4.555000         0.000000    31.290000                1985.000000          33.290000
25%       8.773750         0.026989    93.826500                1987.000000         834.247400
50%      12.600000         0.053931   143.012800                1999.000000        1794.331000
75%      16.850000         0.094585   185.643700                2004.000000        3101.296400
max      21.350000         0.328391   266.888400                2009.000000       13086.964800

Visualizing the data with different plots can be really helpful for understanding how attributes relate to the target. Distplots, countplots, histograms and scatterplots are some of the major visualization methods. Analyzing correlation is an important part as well, since it helps determine the most important features in the data. We use the seaborn and matplotlib libraries for our visualization:

import matplotlib.pyplot as plt
import seaborn as sb
plt.style.use('seaborn')
plt.figure(figsize=(12,5))
sb.distplot(train.Item_Outlet_Sales,bins=20,color='red')
plt.xlabel('Item Outlet Sales',fontsize=15)
plt.ylabel('No. of Sales',fontsize=15)
plt.title('Histogram of Target',fontweight='bold',fontsize=17);

(Figure: Histogram of Target. Created by Suraj Sarangi)

I love graphs; playing with colors and plots is a must if you’re new to this. You might be wondering about the semicolon (;) at the end. “It’s Python, it’s a sin to use a semicolon!”

Well, when working with plots, the semicolon suppresses the useless object output. Try it for yourself and see.


UNIVARIATE ANALYSIS

We split the numerical and categorical data into different dataframes, as the analysis for each of them is different.

import numpy as np
num=train.select_dtypes(include=np.number)
cate=train.select_dtypes(exclude=np.number)

NUMERIC DATA

Numeric data is analysed using correlation. Correlation measures how sensitive the values of one column are to changes in the values of another column. We can use corr() to get the correlation matrix:

co=num.corr()
co
                           Item_Weight  Item_Visibility  Item_MRP  Outlet_Establishment_Year  Item_Outlet_Sales
Item_Weight                   1.000000        -0.014048  0.027141                  -0.011588           0.014123
Item_Visibility              -0.014048         1.000000 -0.001315                  -0.074834          -0.128625
Item_MRP                      0.027141        -0.001315  1.000000                   0.005020           0.567574
Outlet_Establishment_Year    -0.011588        -0.074834  0.005020                   1.000000          -0.049135
Item_Outlet_Sales             0.014123        -0.128625  0.567574                  -0.049135           1.000000

Now, if you found that matrix boring, I wouldn’t say you’re entirely wrong. Let’s see the same matrix in a much better way.

plt.figure(figsize=(8,6))
sb.heatmap(co,square=True,annot=True)
ax=plt.gca()
bo,to=ax.get_ylim()
ax.set_ylim(bo+0.5,to-0.5)   # workaround for matplotlib versions that crop the top and bottom heatmap rows
plt.title('Correlation Heatmap',fontweight='bold',fontsize=17);

(Figure: Correlation Heatmap. Created by Suraj Sarangi)

Now let’s look at the most important correlations with the target:

co['Item_Outlet_Sales'].sort_values(ascending=False)

Item_Outlet_Sales            1.000000
Item_MRP                     0.567574
Item_Weight                  0.014123
Outlet_Establishment_Year   -0.049135
Item_Visibility             -0.128625
Name: Item_Outlet_Sales, dtype: float64

As we can see, the sales column correlates most strongly with MRP (we disregard the 1.000000, which is just the column’s correlation with itself and carries no information).


CATEGORICAL DATA

For Categorical Data, we plot Countplots.

plt.figure(figsize=(15,5))
cols=list(cate.columns)
cols.remove('Item_Identifier')       # we don’t need the identifiers
cols.remove('Outlet_Identifier')
sb.countplot(cate[cols[0]]);

(Figure: countplot of Item_Fat_Content. Created by Suraj Sarangi)

We can see irregular labels such as LF and reg, which are just different spellings of Low Fat and Regular. We’ll fix these later.

plt.figure(figsize=(15,5))
sb.countplot(cate[cols[1]])
plt.title(cols[1],fontweight='bold',fontsize=15)
plt.xticks(rotation=90,fontsize=15)
plt.xlabel(cols[1],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Count',fontsize=15);

(Figure: countplot of Item_Type. Created by Suraj Sarangi)

There are a lot of categories in Item_Type. Computation will be really wasteful if we keep so many of them, so we’ll reduce them later during feature engineering.

Likewise, we can plot the graphs for all the other categorical columns, as sketched below.
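A minimal sketch of one way to loop over the remaining categorical columns, reusing the cols list and the cate dataframe defined above:

for c in cols[2:]:            # the remaining categorical columns: Outlet_Size, Outlet_Location_Type, Outlet_Type
    plt.figure(figsize=(15,5))
    sb.countplot(cate[c])
    plt.title(c, fontweight='bold', fontsize=15)
    plt.xticks(rotation=90)
    plt.show()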


BIVARIATE ANALYSIS

We make another variable holding all the columns of the training set:

cols2=list(train.columns)

This analysis is again divided into numeric, categorical and mixed.


NUMERIC

vis_pt=train.pivot_table(index=cols2[4], values=cols2[3], aggfunc=np.median)
vis_pt.plot(kind='bar',color='darkorchid',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[4],fontsize=15)
plt.ylabel(cols2[3],fontsize=15)
plt.title('Item Visibility and Item Type',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();

(Figure: Item Visibility and Item Type. Created by Suraj Sarangi)

As we can see, there are a lot of zeros in this feature; this needs to be handled when we impute missing values.

year_pt=train.pivot_table(index=cols2[-5], values=cols2[-1], aggfunc=np.median)
year_pt.plot(kind='bar',color='k',figsize=(15,5),alpha=0.6)
plt.xlabel(cols2[-5],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Establishment Year and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();

(Figure: Establishment Year and Item Outlet Sales. Created by Suraj Sarangi)

There is an unusual drop in the year 1998. This makes the trend very irregular, so we will need some feature engineering for this particular attribute.


CATEGORICAL

We can use pivot tables to plot categorical data against the target:

type_pt=train.pivot_table(index=cols[1], values=cols2[-1], aggfunc=np.median)
type_pt.plot(kind='bar',color='deeppink',figsize=(15,5),alpha=0.8)
plt.xlabel(cols[1],fontsize=15)
plt.ylabel(cols2[-1],fontsize=15)
plt.title('Item type and Item Outlet Sales',fontweight='bold',fontsize=17)
plt.xticks(rotation=90,fontsize=13)
plt.yticks(fontsize=13)
plt.show();

(Figure: Item Type and Item Outlet Sales. Created by Suraj Sarangi)

We can plot all the remaining pairs in the same way and check for irregularities; a sketch follows.
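For instance, a minimal sketch looping the remaining categorical columns against the target (column names as listed in the dataset above):

for c in ['Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']:
    pt_c = train.pivot_table(index=c, values='Item_Outlet_Sales', aggfunc=np.median)
    pt_c.plot(kind='bar', figsize=(15,5), alpha=0.6)
    plt.title(c + ' and Item Outlet Sales', fontweight='bold', fontsize=17)
    plt.xticks(rotation=90)
    plt.show()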

Now that we have visualized the data and found the problems, we move to Feature Engineering.


IMPUTATION OF MISSING VALUES

Models really dislike gaps in the form of NaN values, yet real data is rarely free of them. These gaps are usually filled by taking the average of the column, or simply by placing a zero. After analysing the correlation matrix, the best features are generally shortlisted while the less important ones are dropped.

From info() we found that ‘Item_Weight’ and ‘Outlet_Size’ had null values.

mr=np.mean(train['Item_Weight'])
train['Item_Weight'].fillna(mr,inplace=True)
mr2=np.mean(test['Item_Weight'])
test['Item_Weight'].fillna(mr2,inplace=True)

This imputes the NaN values in the weight column of train and test with the mean of the column. The mean is generally a preferred choice for numeric data.

kp=(train.mode(axis=0))['Outlet_Size'].iloc[0]
train['Outlet_Size'].fillna(value=kp,inplace=True)
kp2=(test.mode(axis=0))['Outlet_Size'].iloc[0]
test['Outlet_Size'].fillna(kp2,inplace=True)

This imputes the NaN values in Outlet_Size with the mode of the column in the train and test sets. The mode is generally a preferred choice for categorical data.
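As an aside, the same two imputations can be expressed with scikit-learn’s SimpleImputer; here is a sketch of the equivalent code (not run here, since the columns above are already filled). Note that this version fits the imputer on the training data and reuses it for the test set, which is the usual scikit-learn convention and differs slightly from the per-set fill above.

from sklearn.impute import SimpleImputer

num_imp = SimpleImputer(strategy='mean')                  # mean for the numeric column
train[['Item_Weight']] = num_imp.fit_transform(train[['Item_Weight']])
test[['Item_Weight']] = num_imp.transform(test[['Item_Weight']])

cat_imp = SimpleImputer(strategy='most_frequent')         # mode for the categorical column
train[['Outlet_Size']] = cat_imp.fit_transform(train[['Outlet_Size']])
test[['Outlet_Size']] = cat_imp.transform(test[['Outlet_Size']])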


FEATURE ENGINEERING

RESURRECTING FAT CONTENT

train.replace("LF","Low Fat",inplace=True)
train.replace("low fat","Low Fat",inplace=True)
train.replace("reg","Regular",inplace=True)

test.replace("LF","Low Fat",inplace=True)
test.replace("low fat","Low Fat",inplace=True)
test.replace("reg","Regular",inplace=True)

This replaces the irregular labels in the fat content column of train and test. The new visualization looks like this:

(Figure: countplot of Item_Fat_Content after cleaning the labels. Created by Suraj Sarangi)

Some items had zero visibility, which makes no sense, yet the same items had non-zero visibility in other outlets. So we replace the zeros.

pt=train.pivot_table(values='Item_Visibility',index='Item_Identifier')
zero_rows = train['Item_Visibility']==0
# replace each zero with that item's average visibility across all outlets
train.loc[zero_rows,'Item_Visibility'] = train.loc[zero_rows,'Item_Identifier'].map(pt['Item_Visibility'])

The new visualization:

(Figure: Item_Visibility after replacing the zeros. Created by Suraj Sarangi)

We do the same for the test dataframe as well; a minimal sketch follows.
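A minimal sketch of the test-side version, mirroring the code above (here the lookup table is built from the test set itself; reusing the training pivot would be an equally reasonable choice):

pt_test = test.pivot_table(values='Item_Visibility', index='Item_Identifier')
zero_rows_test = test['Item_Visibility']==0
test.loc[zero_rows_test,'Item_Visibility'] = test.loc[zero_rows_test,'Item_Identifier'].map(pt_test['Item_Visibility'])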

We change the Establishment Year to the number of years operated. This data is from 2013.

train['Years_operated']=2013-train['Outlet_Establishment_Year']
test['Years_operated']=2013-test['Outlet_Establishment_Year']

It looks like this:

(Figure: the new Years_operated feature. Created by Suraj Sarangi)

We saw that Item_Type has a lot of categories. Taking a closer look at the dataset, the Item_Identifier column has values beginning with FD, DR or NC. We can use this to create a new item type: FD can be Food, DR can be Drinks, and NC can be Non-Consumable.

di={'FD':'Food','DR':'Drinks','NC':'Non-Consumable'}
l=[]
for i in train['Item_Identifier']:
    l.append(di[i[:2]])
train['Item_Type2']=l

New visualization:

(Figure: countplot of Item_Type2. Created by Suraj Sarangi)

We do the same for the test dataframe (a sketch appears just before the Encoding section). Also, since a Non-Consumable item can’t meaningfully have a fat content, we alter the Item_Fat_Content column for those rows.

train.loc[train['Item_Type2']=='Non-Consumable','Item_Fat_Content']='Non-Consumable'
test.loc[test['Item_Type2']=='Non-Consumable','Item_Fat_Content']='Non-Consumable'

Item_Fat_Content now looks like this:

(Figure: countplot of Item_Fat_Content with the Non-Consumable category. Created by Suraj Sarangi)
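One more step before encoding: as mentioned earlier, the same Item_Type2 derivation is applied to the test dataframe. A minimal sketch, written as a compact equivalent of the loop used for train:

test['Item_Type2'] = test['Item_Identifier'].str[:2].map(di)    # same FD/DR/NC mapping via the di dictionary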


ENCODING

Since we are using the scikit-learn library for our models, and its estimators cannot work directly with string-valued categorical variables, we need encoding to turn these categorical columns into numeric attributes. We use One Hot Encoding (funny name, eh?) for our attributes: it creates one column per category and assigns 1 to the rows that have that value and 0 to those that don’t.
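A toy example of what one-hot encoding does, using pandas get_dummies on made-up values:

demo = pd.DataFrame({'Outlet_Size': ['Medium', 'High', 'Medium', 'Small']})
print(pd.get_dummies(demo, columns=['Outlet_Size']))
# Produces one column per category (Outlet_Size_High, Outlet_Size_Medium, Outlet_Size_Small)
# with a 1 in the column matching each row's value and 0 elsewhere.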

cols.append('Item_Type2')    # add the new feature we created
cols.remove('Item_Type')     # remove the old Item_Type

from sklearn.preprocessing import LabelEncoder
# LabelEncoder turns each category string into an integer code
# (fitting it separately on train and test assumes both contain the same categories)
for i in cols:
    train[i]=LabelEncoder().fit_transform(train[i])

for i in cols:
    test[i]=LabelEncoder().fit_transform(test[i])

# get_dummies then one-hot encodes the coded columns
train=pd.get_dummies(train,columns=cols)
test=pd.get_dummies(test,columns=cols)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 25 columns):
Item_Identifier              8523 non-null object
Item_Weight                  8523 non-null float64
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Item_Outlet_Sales            8523 non-null float64
Years_operated               8523 non-null int64
Item_Fat_Content_0           8523 non-null uint8
Item_Fat_Content_1           8523 non-null uint8
Item_Fat_Content_2           8523 non-null uint8
Outlet_Size_0                8523 non-null uint8
Outlet_Size_1                8523 non-null uint8
Outlet_Size_2                8523 non-null uint8
Outlet_Location_Type_0       8523 non-null uint8
Outlet_Location_Type_1       8523 non-null uint8
Outlet_Location_Type_2       8523 non-null uint8
Outlet_Type_0                8523 non-null uint8
Outlet_Type_1                8523 non-null uint8
Outlet_Type_2                8523 non-null uint8
Outlet_Type_3                8523 non-null uint8
Item_Type2_0                 8523 non-null uint8
Item_Type2_1                 8523 non-null uint8
Item_Type2_2                 8523 non-null uint8
dtypes: float64(4), int64(2), object(3), uint8(16)
memory usage: 732.6+ KB

As we can see, dummy variables have been generated according to the number of distinct labels in each categorical attribute. For example, Item_Fat_Content had ‘Low Fat’, ‘Regular’ and ‘Non-Consumable’, so we get three dummy columns for it.

The next step is to remove the now-redundant attributes from our dataframe, namely Item_Type and Outlet_Establishment_Year.

train.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
test.drop(columns=['Item_Type','Outlet_Establishment_Year'],inplace=True)
train.head()

Now we’ll prepare our training and test sets:

cols3=list(train.columns)
j=cols3.pop(5) #to get outlet_sales at last
cols3.append(j)
cols3.remove('Item_Identifier')
cols3.remove('Outlet_Identifier')
X_train=train[cols3[:-1]]
X_test=test[cols3[:-1]]
y_train=train[cols3[-1]]

After so much preprocessing and feature engineering, we have finally reached my favourite part: model selection. Since sales is continuous data, we need to use regression here.

from sklearn.linear_model import Lasso
las=Lasso().fit(X_train, y_train)
test2=pd.read_csv(url2)    # reload the raw test set for easy viewing
test2["Item_Outlet_Sales"]=las.predict(X_test)     # a new column is made for the predictions
test2[["Item_Identifier","Outlet_Identifier","Item_Outlet_Sales"]].to_csv("Predictions.csv",index=False)

This will make a csv file containing the identifiers and the sales for each of them. It’ll look like this:

      Item_Identifier  Outlet_Identifier  Item_Outlet_Sales
0     FDW58            OUT049             1767.496700
1     FDW14            OUT017             1523.687095
2     NCN55            OUT010             1907.343992
3     FDQ58            OUT017             2539.158000
4     FDY38            OUT027             5183.936817
...   ...              ...                ...
5676  FDB58            OUT046             2360.714311
5677  FDD47            OUT018             2464.120755
5678  NCO17            OUT045             1914.865997
5679  FDJ26            OUT017             3501.090376
5680  FDU37            OUT045             1369.458339
5681 rows × 3 columns
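Before trusting these numbers too much, it is worth sanity-checking the model on the training data. A quick, optional sketch using 5-fold cross-validation with the X_train and y_train built above:

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(Lasso(), X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores)                 # convert negative MSE back to RMSE per fold
print(rmse.mean(), rmse.std())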

This completes the first three steps of our flow chart. The next part is hyperparameter tuning, which is especially important when using neural networks to make predictions.

Check out my repository on GitHub to find the project in much more detail. You’ll find many more graphs and a lot more colors, if that’s what you’re looking for. github.com/SurajSarangi/Big-Mart-Sales-Prediction

Please give it a star if you happen to be mesmerized by the colours and graphs. Feel free to check out the other machine learning projects as well.


ABOUT THE AUTHOR

Suraj Sarangi is an undergrad pursuing a BTech degree. Python and deep learning are his weapons of choice. He is well versed in English and likes to debate. Apparently, he loves football and coffee more than anything in his life.



REVIEWS

If you find it interesting, we would really like to hear from you.

Ping us on Instagram: @the.blur.code

If you want articles on any topic, DM us on Insta.

Thanks for reading! Happy coding.
