Logistic Regression Algorithm Analysis with Python

Hi, everyone. I am Orhan Yagizer. In this article, I will work with the logistic regression algorithm in Python. Let’s get started.

Firstly, what is a logistic regression algorithm?

Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
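The logistic (sigmoid) function that gives the model its name is what turns raw scores into probabilities: it squashes any real number into the (0, 1) range. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5 -- the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0
```

Values above 0.5 are typically assigned to the positive class, values below to the negative class.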

As I mentioned above, logistic regression appears everywhere in our lives, which is why it’s important to learn and understand it.

What are the differences between linear regression and logistic regression?

Sometimes these two algorithms can be confused with each other.

Linear regression is used to predict a continuous dependent variable from a given set of independent variables. It is used for solving regression problems: in linear regression, we predict the value of a continuous variable.

On the other hand, logistic regression is used to predict a categorical dependent variable from a given set of independent variables. Logistic regression is used for solving classification problems: in logistic regression, we predict the values of categorical variables.
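To make the distinction concrete, here is a sketch on hypothetical toy data: the same one-feature input fitted with both models, where the linear model returns a number and the logistic model returns a class label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])  # continuous target
y_cat = np.array([0, 0, 0, 1, 1, 1])               # categorical target

lin = LinearRegression().fit(X, y_cont)
log = LogisticRegression().fit(X, y_cat)

print(lin.predict([[3.5]]))  # a continuous value
print(log.predict([[3.5]]))  # a class label (0 or 1)
```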

Logistic Regression Analysis with Python

Now it’s time to analyze it in Python. I will mostly use scikit-learn, and I will work with the Titanic data set from Kaggle, a very famous ML data set. You can download the data set from here.

Firstly, we will import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let’s start by reading in the Titanic data set file into a pandas dataframe. My file’s name is “titanic_train.csv”. Then check the dataframe’s head.

train = pd.read_csv('titanic_train.csv')
train.head()

Let’s begin some exploratory data analysis. We’ll start by checking out missing data. We can use seaborn to create a simple heatmap to see where we are missing data.

plt.figure(figsize=(10,6))
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="Greens")

Roughly one-fifth of the Age data is missing. Also, look at the Cabin column: we are just missing too much of that data. We’ll fill in the Age column later and drop Cabin entirely.
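The heatmap gives a quick visual, but pandas can quantify the missingness exactly. A sketch on a small hypothetical stand-in frame (on the real data you would call the same methods on `train`):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the Titanic frame (the real one comes from titanic_train.csv)
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Survived': [0, 1, 1, 0, 0],
})

missing = df.isnull().sum()          # missing count per column
print(missing)
print(df['Age'].isnull().mean())     # 0.4 -> 40% of Age is missing here
```

On the article’s data, `train.isnull().sum()` gives the per-column counts directly.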

Let’s continue by visualizing some more of the data.

sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='pastel')

According to the chart above, most people couldn’t survive. Now let’s look at survivors by sex.

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

As you can see, most of the men didn’t survive the sinking of the Titanic. Roughly 230 women and 110 men survived.

Let’s look at survivors by class column.

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='viridis')

As you can see, most first-class passengers survived, whereas most third-class passengers died.

Let’s look at our age column.

plt.figure(figsize=(10,7))
sns.histplot(train["Age"].dropna(),bins=30)

As you can see, most passengers were between 20 and 30 years old.

Now it’s time for some data cleaning. Instead of just dropping the rows with missing ages, I want to fill them in. We will write a function for this, filling each missing age with the average age of that passenger’s class.

plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='viridis')
The boxplot above shows the typical age in each passenger class; we’ll use those class-wise averages in our imputation function.
def trans_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

Now, apply the function.

train['Age'] = train[['Age','Pclass']].apply(trans_age,axis=1)
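The values 37, 29, and 24 were read off the boxplot by eye. A more data-driven alternative is to let pandas compute a per-class statistic and fill with `groupby` plus `transform`; a sketch on a hypothetical mini-frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age':    [40.0, np.nan, 30.0, 28.0, np.nan, 24.0],
})

# Fill each missing Age with the median Age of that passenger class
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))
print(df['Age'].tolist())  # [40.0, 40.0, 30.0, 28.0, 24.0, 24.0]
```

The same one-liner works on the article’s `train` frame and avoids hard-coding the class averages.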

Also, we should drop the cabin column due to missing data.

train.drop('Cabin',axis=1,inplace=True)
train.head()

Now check the heatmap again.

plt.figure(figsize=(10,6))
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="Greens")

As you can see, we have hardly any missing values left.

Before the modelling process, we should convert the categorical data to dummy variables using pandas’ get_dummies. Our categorical columns are Sex and Embarked. After creating the dummies, we can drop the original categorical columns.

sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

Now we should concatenate our dataframes. Then we’ll check the train’s head.

train = pd.concat([train,sex,embark],axis=1)
train.head()
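For reference, `pd.get_dummies` can also do the encode-and-drop in a single call when you pass `columns=`; a sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
    'Fare': [7.25, 71.28],
})

# One call encodes the listed columns and drops the originals;
# drop_first=True removes the redundant (perfectly collinear) dummy
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print(encoded.columns.tolist())
```

This replaces the separate get_dummies, drop, and concat steps with one call, producing the same kind of frame.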

Our data is ready for modelling. Now we can split it into a training set and a test set. Our target column is Survived; that’s the column we will try to predict.

I’ll use scikit-learn, so you should import it.

from sklearn.model_selection import train_test_split
X = train.drop('Survived',axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)

Then, we should train the model and predict with the logistic regression algorithm. For this, we should import LogisticRegression from scikit-learn’s linear_model.

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train,y_train)

We’ve trained and fitted our model. Now we can make predictions.

predictions = logmodel.predict(X_test)
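A nice property of logistic regression is interpretability: each fitted coefficient shifts the log-odds of the positive class. A sketch on synthetic data (the feature names and data-generating process here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'fare': rng.normal(30, 10, 200),
    'is_male': rng.integers(0, 2, 200),
})
# Simulate survival that depends negatively on is_male
logits = 0.05 * X['fare'] - 2.0 * X['is_male'] - 1.0
y = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)  # the is_male coefficient comes out negative: lower odds
```

On the Titanic model, `pd.Series(logmodel.coef_[0], index=X.columns)` gives the same kind of readout.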

Let’s move on to evaluating our model. We can check precision, recall, and F1-score using a classification report and a confusion matrix. For this, we should import metrics from scikit-learn.

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,predictions))
print("\n")
print(classification_report(y_test,predictions))

Here is the result of our model. It’s not bad; the predictions are fine. But real-world data is rarely this easy to model: it usually needs much more EDA and feature engineering.
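To make the report concrete: precision and recall both come straight from the confusion matrix. A hand-worked sketch on hypothetical labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# ravel() unpacks the 2x2 matrix as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
print(tp / (tp + fp))  # precision = 0.75
print(tp / (tp + fn))  # recall = 0.75
```

These hand-computed ratios match `precision_score` and `recall_score`, which is exactly what the classification report summarizes per class.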

Today, we analyzed the logistic regression algorithm with the Titanic data set in Python.

I hope you enjoyed my article and that it will be useful for you. Thanks for reading!

Orhan Yağızer Çınar

Linkedin
