Classifiers/Multilayer Perceptrons
Prerequisites
- Deep Learning Setup: Set up your workspace and download Python libraries
Learning Objectives
- Classifiers
- Data Preparation and Exploration
- Build the Model
- Loss Functions
- Optimizers
- Activation Functions
- Train and Evaluate the Model
- The Logit, the Probability and the Label
- Evaluation Metrics
Overview of Classification
Classifiers
In the previous section, we predicted numeric values using a regression model. However, deep learning is not limited to predicting numbers; it can predict classes as well. So how do we do this?
Data Preparation and Exploration
We are going to start by scoping a problem for our model to solve. In this case we are going to use glioblastoma gene expression data, where we will use a gene's expression (the level of RNA made from that gene) to predict a phenotype. To start, we will try to predict smoking status. First, let's load the data and take a peek!
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
gbm = pd.read_csv("../data/gbm_data.csv",on_bad_lines='skip')
gbm.shape
output
(87, 133)
Great! We can see that this data frame contains 87 rows and 133 columns. Let's take a peek at the column that contains smoking history:
gbm.loc[:,"SMOKING_HISTORY"]
output
0 Current reformed smoker within past 15 years
1 Current reformed smoker within past 15 years
2 NaN
...
86 Lifelong non-smoker: Less than 100 cigarettes ...
Name: SMOKING_HISTORY, Length: 87, dtype: object
This doesn't look particularly clean, so we will make a new column with the smoking status. We will do this by saying that if we see "non-smoker" in the text, then we classify the sample as coming from a non-smoker; otherwise, we classify it as coming from a smoker. But before this we should remove the NA rows, since we don't have any smoking information for them!
gbm = gbm[gbm['SMOKING_HISTORY'].notna()]
gbm['smoking_status'] = np.where(gbm['SMOKING_HISTORY'].str.contains('non-smoker'), 'non-smoker', 'smoker')
gbm.loc[:,"smoking_status"]
output
0 smoker
1 smoker
3 smoker
...
84 smoker
85 non-smoker
86 non-smoker
Name: smoking_status, Length: 75, dtype: object
Now before going forward, it is critical to take a look at our data. Are there any obvious patterns right off the bat? Let's visualize the first few variables using a pairplot:
# Create a pair plot
fig = px.scatter_matrix(gbm,
                        dimensions=['LINC00159','EFTUD1P1','C20orf202','KRT17P8','RPL7L1P9'],
                        color="smoking_status",
                        title="Pair Plot of Selected Variables")
fig.update_layout(
    height=700,
    width=700,
    template='plotly_white'
)
fig.show()
Pairplot of Selected Features
We can see that a few of our variables can somewhat stratify smokers and non-smokers. Now that we have the smoking status variable in our data frame and a few features to use, we can start getting our data ready for our model! However, to give the smoking status to a machine learning model, we need to encode it as a number (here we will use 0 and 1).
gbm_filt = gbm.loc[:,['LINC00159','EFTUD1P1','C20orf202','KRT17P8','RPL7L1P9','smoking_status']]
gbm_filt['smoking_status'] = gbm_filt['smoking_status'].astype('category').cat.codes
gbm_filt['smoking_status']
output
0 1
1 1
..
85 0
86 0
Name: smoking_status, Length: 75, dtype: int8
Now let's split our data into our features (things used to predict) and outcome (the thing to be predicted):
# Split the data into features and outcome variable
X = gbm_filt.drop('smoking_status', axis=1).values
y = gbm_filt['smoking_status'].values
When dealing with variables we often need to make sure they are on the same scale. That way, one variable doesn't have way more pull than another just because one is an order of magnitude larger.
# Normalize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
Now we are going to split our data into training and test sets. Then we will convert our data into tensors so Pytorch can use the data:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=81)
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
We will now convert our training and test data into TensorDataset objects, and then manage those objects with DataLoader, which serves the data in batches (and can parallelize data loading):
# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
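Before moving on, it can help to pull a single batch out of the train_loader and check its shape. This snippet is just an illustrative sanity check, not part of the original pipeline:
# Peek at one batch from the training DataLoader
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape)  # e.g. torch.Size([32, 5]) -> 32 samples, 5 gene features
print(y_batch.shape)  # e.g. torch.Size([32])    -> 32 labels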
Build the Model
With our data prepared, we can make our model! But we have decisions to make. Our model is going to be a little more complex than our previous simple linear model. We are going to make what is called a multilayer perceptron. It sounds complicated, but it is simply a neural network where all nodes in one layer are connected to all nodes in the next layer:
Multilayer Perceptron Architecture
This is a great starting point with neural networks, as we are not doing anything fancy with how the nodes connect: each node is simply connected to all nodes in the next layer. But we still need to make decisions about our loss function, our optimizer, and now, since we are connecting multiple layers, our activation function.
Loss Functions
As a review, loss functions tell us how far off our predicted value is from the true value, and there are a few different ways to define loss:
Loss Function | Task | Equation | Description | PyTorch Function |
---|---|---|---|---|
Binary Cross-Entropy Loss | Binary classification | \(\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]\) | \(y\): true label; \(\hat{y}\): predicted probability; \(N\): number of samples | torch.nn.BCELoss or torch.nn.BCEWithLogitsLoss |
Cross-Entropy Loss (Multi-Class) | Multi-class classification | \(\text{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)\) | \(C\): number of classes; \(y_i\): one-hot encoded true label; \(\hat{y}_i\): predicted probability for class \(i\) | torch.nn.CrossEntropyLoss |
Mean Absolute Error (MAE) / L1 Loss | Regression | \(\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert\) | \(y\): true value; \(\hat{y}\): predicted value | torch.nn.L1Loss |
Mean Squared Error (MSE) / L2 Loss | Regression | \(\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\) | \(y\): true value; \(\hat{y}\): predicted value | torch.nn.MSELoss |
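To make the binary cross-entropy row concrete, here is a small sketch (with made-up logits and labels, purely for illustration) showing that torch.nn.BCEWithLogitsLoss applied to raw model outputs matches computing BCE by hand on the sigmoid-transformed probabilities:
import torch
import torch.nn as nn

# Made-up logits (raw model outputs) and true labels
logits = torch.tensor([0.8, -1.2, 2.0])
labels = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogitsLoss applies the sigmoid internally
loss_fn = nn.BCEWithLogitsLoss()
print(loss_fn(logits, labels))

# The same loss computed by hand from the BCE formula above
probs = torch.sigmoid(logits)
bce = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()
print(bce)  # matches up to floating point error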
Now you may notice that when using the cross-entropy loss, labels are one-hot encoded. What does that even mean? Well, all this means is that for a column with two or more unique values, each unique value becomes its own new column; if the original column had that value, you put a 1, and if not, you put a 0. Take a look at this visualization:
One-Hot Encoding
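If you would rather see this in code than in a picture, pandas can one-hot encode a column for you (the toy column below is made up just for illustration):
import pandas as pd

# A made-up categorical column
df = pd.DataFrame({'smoking_status': ['smoker', 'non-smoker', 'smoker']})

# Each unique value becomes its own 0/1 column
one_hot = pd.get_dummies(df['smoking_status'])
print(one_hot)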
Now how about optimizers?
Optimizers
Optimizers tweak our model parameters to minimize the loss:
Optimizer | Task | Equation / Update Rule | Description | PyTorch Function |
---|---|---|---|---|
Stochastic Gradient Descent (SGD) | General optimization | \(\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)\) | \(\theta_t\): parameters at step \(t\); \(\eta\): learning rate; \(\nabla_\theta J(\theta_t)\): gradient of the loss function. The basic optimization algorithm. | torch.optim.SGD |
Adam | Adaptive momentum and learning rate | \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta_t)\); \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta_t))^2\); \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\); \(\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\); \(\theta_{t+1} = \theta_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\) | \(\theta_t\): parameters at step \(t\); \(m_t\): first moment (mean) estimate; \(v_t\): second moment (variance) estimate; \(\hat{m}_t\), \(\hat{v}_t\): bias-corrected moment estimates; \(\eta\): learning rate; \(\beta_1\), \(\beta_2\): exponential decay rates; \(\epsilon\): small constant for numerical stability | torch.optim.Adam |
Eww, that is a lot of math! Feel free to click the example below to see how a single gradient descent step is calculated using SGD and an MSE loss function:
Gradient Descent Example
When we train our model, we need to know how changing the parameters—like weights and biases—affects the output. This is where partial derivatives come in. Think of them as tiny nudges in different directions that tell us how much each part of our function changes when we adjust one variable while keeping the others fixed. We’re going to look at how to find these nudges step by step, using a simple linear regression model:
\(J(w, b) = \frac{1}{2} (y - \hat{y})^2\)
where \(\hat{y}\) is our prediction:
\(\hat{y} = w \cdot x + b\)
Step 1: Partial Derivative with Respect to \(w\)
First up, we need to figure out how changing \(w\) affects our loss \(J\). Mathematically, that looks like this:
\(\frac{\partial J}{\partial w} = \frac{\partial}{\partial w} \left[ \frac{1}{2} (y - \hat{y})^2 \right]\)
Let’s plug in our prediction:
\(\frac{\partial J}{\partial w} = \frac{\partial}{\partial w} \left[ \frac{1}{2} (y - (w \cdot x + b))^2 \right]\)
We apply the chain rule here:
- The outer part, \(\frac{1}{2}(y - \hat{y})^2\), differentiated with respect to \(\hat{y}\) gives \((y - \hat{y}) \cdot (-1)\): the \(\frac{1}{2}\) cancels the exponent, and the \(-1\) comes from differentiating \(y - \hat{y}\).
- The derivative of the inner part, \(\hat{y} = w \cdot x + b\), with respect to \(w\) is just \(x\).
Putting it together:
\(\frac{\partial J}{\partial w} = -(y - \hat{y}) \cdot x\)
Plug in the Values:
With our data point \((x, y) = (2, 3)\) and initial parameters \(w_0 = 1.5\), \(b_0 = 0.1\):
\(\hat{y} = 1.5 \cdot 2 + 0.1 = 3.1\)
Gradient with respect to \(w\):
\(\frac{\partial J}{\partial w} = -(3 - 3.1) \cdot 2 = 0.2\)
Step 2: Partial Derivative with Respect to \(b\)
Now let’s see what happens when we change \(b\):
\(\frac{\partial J}{\partial b} = -(y - \hat{y})\)
Plug in the values:
\(\frac{\partial J}{\partial b} = -(3 - 3.1) = 0.1\)
Step 3: Combine into a Gradient
We store these partial derivatives in a gradient:
\(\nabla J(w, b) = \begin{bmatrix} \frac{\partial J}{\partial w} \\ \frac{\partial J}{\partial b} \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.1 \end{bmatrix}\)
Step 4: Update the Parameters
Finally, we use the gradient descent update rule (with \(\eta\) being our learning rate - here it is 0.01):
\(\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)\)
For \(w\) and \(b\), this means:
- Update \(w\):
\(w_{1} = 1.5 - 0.01 \cdot 0.2 = 1.498\)
- Update \(b\):
\(b_{1} = 0.1 - 0.01 \cdot 0.1 = 0.099\)
The Result
After one step, our new parameters are:
- Updated \(w\): 1.498
- Updated \(b\): 0.099
And that update to our model parameters is one step towards improving our model!
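If you would like to double-check the arithmetic above, here is a small sketch repeating the same single SGD step in plain Python (using the data point, starting parameters, and learning rate from the example):
# One SGD step on a single data point, matching the worked example
x, y = 2.0, 3.0   # data point
w, b = 1.5, 0.1   # initial parameters
lr = 0.01         # learning rate (eta)

y_hat = w * x + b           # prediction: 3.1
grad_w = -(y - y_hat) * x   # dJ/dw, approximately 0.2
grad_b = -(y - y_hat)       # dJ/db, approximately 0.1

w = w - lr * grad_w
b = b - lr * grad_b
print(w, b)  # approximately 1.498 and 0.099 (up to floating point error)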
Activation Functions
Once we have a way to update our model parameters and minimize our loss, we need to decide how each layer transforms its output before passing it on. Depending on our goal we will use different activation functions:
Activation Function | Equation | Description | Typical Use | PyTorch Function |
---|---|---|---|---|
ReLU (Rectified Linear Unit) | \(\text{ReLU}(x) = \max(0, x)\) | Outputs the input directly if positive, otherwise outputs zero; helps mitigate the vanishing gradient problem | Hidden layers of deep neural networks | torch.nn.ReLU() or torch.relu() |
Sigmoid | \(\sigma(x) = \frac{1}{1 + e^{-x}}\) | Squashes input to a range between 0 and 1; useful for probabilities | Output layers in binary classification | torch.nn.Sigmoid() or torch.sigmoid() |
Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | Squashes input to a range between -1 and 1; centered around zero, leading to better convergence than Sigmoid | Hidden layers in neural networks | torch.nn.Tanh() or torch.tanh() |
Softmax | \(\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\) | Converts an input vector \(\vec{z}\) into a probability distribution over \(K\) classes; \(e^{z_i}\): standard exponential of each input element; \(K\): number of classes in the multi-class classifier; ensures that outputs sum to 1 | Output layer for multi-class classification | torch.nn.Softmax() or torch.softmax() |
Leaky ReLU | \(\text{Leaky ReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{if } x \leq 0 \end{cases}\) | Similar to ReLU but with a small slope \(\alpha\) for negative \(x\); prevents the dying ReLU problem | Hidden layers in deep networks | torch.nn.LeakyReLU() |
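To get a feel for these, here is a quick sketch applying a few of the activation functions from the table to a made-up tensor of values:
import torch

# Made-up values, just to see what each activation does
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

print(torch.relu(x))            # negatives become 0, positives pass through
print(torch.sigmoid(x))         # squashed between 0 and 1
print(torch.tanh(x))            # squashed between -1 and 1
print(torch.softmax(x, dim=0))  # values sum to 1, like a probability distribution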
Let's create a model that takes in our 5 genes, passes them through a hidden layer with 64 nodes, uses a ReLU activation function, and returns a single output value:
class SSModel(nn.Module):
    '''
    Create a model to predict smoking status,
    which is one column: 0 for non-smoker, 1 for smoker
    '''
    def __init__(self, input_size, num_nodes=64):
        super().__init__()
        self.fc1 = nn.Linear(input_size, num_nodes)  # take input_size features and make a hidden layer with num_nodes (64) nodes
        self.relu = nn.ReLU()                        # use ReLU as the activation function
        self.fc2 = nn.Linear(num_nodes, 1)           # we are predicting values in one output column

    def forward(self, x):
        out = self.fc1(x)     # hidden layer
        out = self.relu(out)  # non-linear activation
        out = self.fc2(out)   # output layer (raw logit)
        return out

# Initialize the model, specifying the number of features going in
input_size = X_train.shape[1]
model = SSModel(input_size)
The power of an activation function is that it can introduce non-linearity into the model. This allows for the learning of complex patterns - patterns that a linear model might miss.
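One way to convince yourself of this: without an activation function in between, stacking two linear layers collapses into a single linear layer, so nothing is gained. Here is a small sketch (with made-up sizes) demonstrating that collapse:
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 5)  # 4 samples, 5 features (made up)

# Two linear layers with no activation in between
fc1 = nn.Linear(5, 64)
fc2 = nn.Linear(64, 1)
stacked = fc2(fc1(x))

# The same mapping written as a single linear layer:
# W = W2 @ W1 and b = W2 @ b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True: no extra expressive power without ReLU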
Train and Evaluate the Model
Now let's specify our optimizer (to get to our optimum weights) and our loss function (to specify how far away our model is from the truth):
# Specify the loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
With all the pieces in place, we can train and evaluate our model!
# Training loop
num_epochs = 1000
epoch_vals = []
loss_vals = []
acc_vals = []
pre_vals = []
recall_vals = []
f1_vals = []

for epoch in range(num_epochs):
    # Training the model
    model.train()       # set model in training mode
    running_loss = 0.0  # to accumulate loss for the epoch

    for X_batch, y_batch in train_loader:
        #X_batch, y_batch = X_batch.to(device), y_batch.to(device) # move to device
        # Forward pass
        outputs = model(X_batch)              # run model on the batch
        y_long = y_batch.view(-1, 1).float()  # reshape y for compatibility
        loss = criterion(outputs, y_long)     # calculate loss

        # Backward pass and optimization
        optimizer.zero_grad()  # clear gradients
        loss.backward()        # backpropagation
        optimizer.step()       # gradient descent

        running_loss += loss.item()  # accumulate batch loss

    # Append the average loss for the epoch
    loss_vals.append(running_loss / len(train_loader))

    # Evaluation
    model.eval()  # set model in evaluation mode
    with torch.inference_mode():
        all_preds = []   # list for predicted labels
        all_labels = []  # list for true labels

        # Model predictions on test data
        for X_batch, y_batch in test_loader:
            #X_batch, y_batch = X_batch.to(device), y_batch.to(device) # move to device
            outputs = model(X_batch)                   # get model outputs
            probabilities = torch.sigmoid(outputs)     # get probabilities from logits
            predicted = (probabilities > 0.5).float()  # convert probabilities to labels
            all_preds.extend(predicted.cpu().numpy())  # append predictions
            all_labels.extend(y_batch.cpu().numpy())   # append true labels

    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision = precision_score(all_labels, all_preds, average='binary', zero_division=0)
    recall = recall_score(all_labels, all_preds, average='binary')
    f1 = f1_score(all_labels, all_preds, average='binary')

    # Append metrics to lists
    epoch_vals.append(epoch)
    acc_vals.append(accuracy)
    pre_vals.append(precision)
    recall_vals.append(recall)
    f1_vals.append(f1)
Whoa, that is a lot! Let's break it down piece by piece:
num_epochs = 1000
epoch_vals = []
loss_vals = []
acc_vals = []
pre_vals = []
recall_vals = []
f1_vals = []

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
- Begins the training loop that runs for each epoch.
- Sets the model to training mode and initializes a variable to accumulate the total loss for the current epoch.
for X_batch, y_batch in train_loader:
    outputs = model(X_batch)
    y_long = y_batch.view(-1, 1).float()
    loss = criterion(outputs, y_long)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
- Iterates over batches in the training set.
- Runs a forward pass to get predictions, calculates the loss, clears the previous gradients, performs backpropagation to calculate new gradients, and updates the model parameters using the optimizer.
- Accumulates the batch loss into running_loss.
loss_vals.append(running_loss / len(train_loader))
- Calculates and appends the average loss for the epoch to loss_vals.
model.eval()
with torch.inference_mode():
    all_preds = []
    all_labels = []
- Switches the model to evaluation mode, disabling certain training-specific behaviors and ensuring no gradients are computed for memory and computation efficiency.
- Creates lists to store predictions and true labels for evaluation.
for X_batch, y_batch in test_loader:
    outputs = model(X_batch)
    probabilities = torch.sigmoid(outputs)
    predicted = (probabilities > 0.5).float()
    all_preds.extend(predicted.cpu().numpy())
    all_labels.extend(y_batch.cpu().numpy())
- Iterates over batches in the test set, computes the model outputs, applies the sigmoid function to convert logits to probabilities, and thresholds the probabilities to generate binary labels.
- Appends the predictions and true labels to their respective lists for later metric calculations.
The Logit, the Probability and the Label
We are going to pause here, before explaining the rest of the code, to talk a bit about logits, probabilities, and labels. We are predicting smoker or non-smoker, but the output we get from our model is not a 1 or a 0. To get that label, we first take the raw output of our model (the logit) and convert it to a probability using the sigmoid activation function, which squeezes the value between 0 and 1. Then we set a threshold, here 0.5: if the value is above that threshold we label it 1, and if it is below we label it 0. And that is how we get labels from our model output!
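Here is a tiny sketch of that chain using made-up logits:
import torch

logits = torch.tensor([-1.5, 0.3, 2.0])  # raw model outputs (made up)
probabilities = torch.sigmoid(logits)    # squeezed between 0 and 1
labels = (probabilities > 0.5).float()   # thresholded at 0.5

print(probabilities)  # approximately tensor([0.1824, 0.5744, 0.8808])
print(labels)         # tensor([0., 1., 1.])
Now, back to the rest of the code: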
accuracy = accuracy_score(all_labels, all_preds)
precision = precision_score(all_labels, all_preds, average='binary', zero_division=0)
recall = recall_score(all_labels, all_preds, average='binary')
f1 = f1_score(all_labels, all_preds, average='binary')
epoch_vals.append(epoch)
acc_vals.append(accuracy)
pre_vals.append(precision)
recall_vals.append(recall)
f1_vals.append(f1)
Evaluation Metrics
You'll notice that at the end of our evaluation loop we calculate some metrics to see how well our model is doing:
Metric | Formula | Explanation |
---|---|---|
Accuracy | \(\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}\) | Measures the ratio of correctly predicted labels to the total number of labels. It gives an overall indication of model performance. |
Precision | \(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\) | Measures how many of the predicted positive labels are actually positive. It indicates the model’s ability to avoid false positives. |
Recall | \(\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\) | Measures how many actual positive labels were correctly predicted. It shows the model’s ability to capture all positive cases. |
F1-score | \(\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) | The harmonic mean of precision and recall. It balances the trade-off between precision and recall and is useful when both are important. |
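As a quick check of these formulas, here is a sketch computing the metrics by hand and with scikit-learn on a tiny made-up set of predictions:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up true labels and predictions, just for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# By hand: TP = 3, FP = 1, FN = 1, TN = 1
# Accuracy  = 4/6 ~ 0.67, Precision = 3/4 = 0.75
# Recall    = 3/4 = 0.75, F1 = 2 * (0.75 * 0.75) / (0.75 + 0.75) = 0.75
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))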
Now let's visualize how these change over different numbers of epochs:
import plotly.graph_objects as go
# Create a figure
fig = go.Figure()
# Add traces for each metric
# add in loss
fig.add_trace(go.Scatter(x=list(range(1, len(epoch_vals) + 1)), y=loss_vals, mode='lines+markers', name='Loss'))
# add in accuracy
fig.add_trace(go.Scatter(x=list(range(1, len(epoch_vals) + 1)), y=acc_vals, mode='lines+markers', name='Accuracy'))
# add in precision
fig.add_trace(go.Scatter(x=list(range(1, len(epoch_vals) + 1)), y=pre_vals, mode='lines+markers', name='Precision'))
# add in recall
fig.add_trace(go.Scatter(x=list(range(1, len(epoch_vals) + 1)), y=recall_vals, mode='lines+markers', name='Recall'))
# add in F1 score
fig.add_trace(go.Scatter(x=list(range(1, len(epoch_vals) + 1)), y=f1_vals, mode='lines+markers', name='F1 Score'))
# Update layout
fig.update_layout(
    title='Test Metrics Over Epochs',
    xaxis_title='Epoch',
    yaxis_title='Value',
    legend_title='Metric',
    template='plotly_white'
)
# Show the plot
fig.show()
Classifier Performance Metrics
Great! Here we see that after around epoch 700 these metrics plateau. However, that is not to say that with more epochs we couldn't get better performance!
Key Points
- Deep learning models can be used for both regression (predicting numbers) and classification (predicting classes).
- Data preparation usually involves cleaning, creating relevant columns, visualizing data, and normalizing features to ensure consistency in the model.
- The data is usually split into training and test sets.
- We can convert this data into PyTorch tensors and TensorDataset objects for efficient batch processing.
- Multilayer perceptrons are neural networks where all nodes in one layer are connected to all nodes in the next layer.
- The training loop includes forward passes, loss calculation, backpropagation, and parameter updates.
- We can evaluate our model using metrics like accuracy, precision, recall, and F1-score to gauge model performance.