
Implementing Logistic Regression from Scratch

Let's dive into logistic regression, a fundamental classification algorithm. We'll implement it from scratch, break down the code step by step, and demonstrate its application on a popular dataset.



What is Logistic Regression?

Logistic Regression is a linear model used for classification. It applies a logistic (sigmoid) function to the linear combination of input features to predict a probability between 0 and 1.

Sigmoid Function

The sigmoid function is defined as follows:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Here \( z = w^T x + b \) is the linear combination of the inputs, where \( w \) is the vector of model weights and \( b \) is the bias.

Sigmoid Curve
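
To reproduce a curve like the one above, here is a minimal matplotlib sketch:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
plt.plot(z, 1 / (1 + np.exp(-z)))
plt.axhline(0.5, linestyle="--", color="gray")  # the 0.5 decision threshold
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.grid()
plt.show()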

Decision Boundary

The model predicts a class based on a threshold, typically 0.5:

\[ \hat{y} = \begin{cases} 1 & \text{if } \sigma(z) \ge 0.5 \\ 0 & \text{if } \sigma(z) < 0.5 \end{cases} \]

Since \( \sigma(z) \ge 0.5 \) exactly when \( z \ge 0 \), the decision boundary is the hyperplane \( w^T x + b = 0 \).
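
As a tiny numeric illustration of the threshold:

import numpy as np

z = np.array([-2.0, 0.0, 3.0])
probs = 1 / (1 + np.exp(-z))
print(probs)                       # [0.119 0.5 0.953] (rounded)
print((probs >= 0.5).astype(int))  # [0 1 1]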

Implementation of Logistic Regression

1. Importing the Libraries

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

2. Preparing the Dataset

data = datasets.load_breast_cancer()
X = data.data
y = data.target

# Normalize the features
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Add intercept term
X = np.c_[np.ones((X.shape[0], 1)), X]

We load the Breast Cancer dataset, normalize the features so they are all on a similar scale, and add an intercept (bias) term as a column of ones. Strictly speaking this column is redundant here, because the training function below also learns an explicit bias b, but it is harmless and we keep it so the code matches the common textbook setup.
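
As a quick sanity check on the prepared arrays (expected values in the comments):

print(X.shape)         # (569, 31): 30 normalized features plus the intercept column
print(np.bincount(y))  # [212 357]: 212 malignant (0) and 357 benign (1) samples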

3. Building and Training of the Model

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

The code above implements the sigmoid function defined earlier.
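
One caveat: np.exp(-z) overflows for large negative z, which triggers a NumPy RuntimeWarning (the result still rounds correctly to 0, so training usually proceeds fine). If you want to avoid the warning, here is a numerically stable variant, offered as an optional refinement and not used in the rest of the tutorial:

def stable_sigmoid(z):
    # Split on the sign of z so np.exp never receives a large positive argument
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])       # safe: z < 0 here, so exp(z) < 1
    out[~pos] = ez / (1 + ez)  # algebraically equal to 1 / (1 + exp(-z))
    return out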

def logistic_regression(X, y, alpha=0.01, epochs=100):
    m = len(y)                          # number of training samples
    w, b = np.zeros(X.shape[1]), 0      # initialize weights and bias to zero
    for _ in range(epochs):
        z = np.dot(X, w) + b            # linear combination for every sample
        predictions = sigmoid(z)        # predicted probabilities
        w -= alpha * np.dot(X.T, (predictions - y)) / m  # gradient step for weights
        b -= alpha * np.mean(predictions - y)            # gradient step for bias
    return w, b

The learning algorithm is gradient descent: in each epoch we compute predictions for every sample, then nudge \( w \) and \( b \) against the gradient of the loss, scaled by the learning rate alpha.
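
For reference, these updates perform gradient descent on the average binary cross-entropy loss, whose gradients take a conveniently simple form:

\[ J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right], \quad \hat{y}_i = \sigma(w^T x_i + b) \]

\[ \frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y), \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \]

These are exactly the expressions behind the w and b updates in the code.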

4. Making Predictions and Evaluating the Model

def predict(X, w, b):
    # Threshold the predicted probabilities at 0.5 to get class labels
    return (sigmoid(np.dot(X, w) + b) >= 0.5).astype(int)

w, b = logistic_regression(X, y)
y_pred = predict(X, w, b)
accuracy = np.mean(y_pred == y)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 94.55%
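
Note that this accuracy is measured on the same data the model was trained on, so it is an optimistic estimate. For a fairer evaluation you could hold out a test set; here is a minimal sketch using scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

w_tr, b_tr = logistic_regression(X_train, y_train)
test_accuracy = np.mean(predict(X_test, w_tr, b_tr) == y_test)
print(f"Test accuracy: {test_accuracy * 100:.2f}%")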

Visualizing the Results

We plot one feature (the first column after the intercept, i.e. mean radius) against both the true labels and the model's predicted probabilities:

plt.scatter(X[:, 1], y, zorder=2)                           # true labels (0 or 1)
plt.scatter(X[:, 1], sigmoid(np.dot(X, w) + b), zorder=2)   # predicted probabilities
plt.grid()
plt.show()

Logistic Output

Implementing Logistic Regression using Scikit-Learn

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X[:, 1:], y)  # drop our manual ones column; scikit-learn fits its own intercept
sklearn_accuracy = model.score(X[:, 1:], y)
print(f"Sklearn Accuracy: {sklearn_accuracy * 100:.2f}%")

Sklearn Accuracy: 98.77%
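
Scikit-Learn scores higher largely because of optimization: our scratch model ran only 100 epochs of plain gradient descent, while LogisticRegression's default lbfgs solver converges much further (raising epochs and alpha in logistic_regression narrows the gap). If you are curious how often the two models actually agree, a quick check using the variables defined above:

# Fraction of samples where the scratch and scikit-learn models predict the same class
agreement = np.mean(model.predict(X[:, 1:]) == y_pred)
print(f"Prediction agreement: {agreement * 100:.2f}%")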

Conclusion

Logistic Regression is a simple yet powerful algorithm for binary classification tasks. Implementing it from scratch helps in understanding the core concepts of model building, optimization, and evaluation, and building it with nothing more than NumPy gives a much clearer picture of how logistic regression works under the hood.
