Autoencoders
Prerequisites
- Deep Learning Setup: Set up your workspace and download the Python libraries
Learning Objectives
- Autoencoder Background
- Building an Autoencoder
- Training and Evaluating an Autoencoder
- Variational Autoencoder (VAE) Background
- Building a VAE
- Training and Evaluating a VAE
Autoencoders
Background
Neural networks can be great at learning patterns in data. But the trade-off is that the model can be too good, meaning it essentially memorizes the training data and fails to generalize to new data - in other words, the model is overfit:
Model Overfitting
So enter autoencoders! Autoencoders take input from a higher-dimensional space and encode it in a lower-dimensional latent space, then decode the output of the latent space to reconstruct the data:
Autoencoders
So, why bother - how does this help prevent overfitting? Well, by encoding and decoding the input data we tend to "denoise" it, allowing the model to learn general patterns without memorizing the training data. To assess an autoencoder we use the loss function that makes the most sense for our problem. In this tutorial we are going to be predicting smoking status again (a binary variable), so we need the binary cross-entropy loss function:
\(L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]\)
Where:
- \(L_{BCE}\): loss after encoding and decoding
- \(N\): number of observations
- \(y_i\): true value
- \(\hat{y}_i\): predicted value after encoding and decoding
With a lower \(L_{BCE}\), the model is doing a better job of reconstructing the original data!
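To see the formula in action, here is a minimal sketch (with made-up labels and predicted probabilities) that computes BCE directly from the formula above and checks it against PyTorch's built-in nn.BCELoss:
import torch
import torch.nn as nn
# made-up true labels and predicted probabilities
y_true = torch.tensor([1., 0., 1., 1.])
y_hat = torch.tensor([0.9, 0.2, 0.7, 0.6])
# BCE computed directly from the formula above
manual_bce = -(y_true * torch.log(y_hat) + (1 - y_true) * torch.log(1 - y_hat)).mean()
# PyTorch's built-in implementation gives the same value
builtin_bce = nn.BCELoss()(y_hat, y_true)
print(manual_bce.item(), builtin_bce.item())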
Building an Autoencoder
Let's make ourselves an autoencoder then! We will pull in our glioblastoma data from the classifier example:
# pandas and numpy for data manipulation and plotly for plotting
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
# sklearn for model metrics, normalization and data splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# torch for tensor operations and neural network functions
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
# read in data and filter out na values
gbm = pd.read_csv("../data/gbm_data.csv",on_bad_lines='skip')
gbm = gbm[gbm['SMOKING_HISTORY'].notna()]
# make a new column for smoking status
gbm['smoking_status'] = np.where(gbm['SMOKING_HISTORY'].str.contains('non-smoker'), 'non-smoker', 'smoker')
# filter our data for features of interest
gbm_filt = gbm.loc[:,['LINC00159','EFTUD1P1','C20orf202','KRT17P8','RPL7L1P9','smoking_status']]
gbm_filt['smoking_status'] = gbm_filt['smoking_status'].astype('category').cat.codes
Now we will split our data into training and test datasets and normalize:
# Split the data into features and outcome variable
X = gbm_filt.drop('smoking_status', axis=1).values
y = gbm_filt['smoking_status'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=81)
# Normalize the features, fitting the scaler on the training data only
# so that information from the test set does not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
Great, let's make that autoencoder now!
class AutoEncoder(nn.Module):
def __init__(self, input_dim, latent_dim):
super(AutoEncoder, self).__init__()
# encoder
self.encoder = nn.Sequential(
nn.Linear(input_dim, 64),
nn.ReLU(),
nn.Linear(64, latent_dim)
)
# decoder
self.decoder = nn.Sequential(
nn.Linear(latent_dim, 64),
nn.ReLU(),
nn.Linear(64, 1),
nn.Sigmoid()
)
# define the flow through the nodes
def forward(self, x):
latent = self.encoder(x)
reconstructed = self.decoder(latent)
return reconstructed
Here we:
- Create an autoencoder where the encoder takes our initial inputs
- Send them through a hidden layer with 64 nodes and a ReLU activation function
- Squeeze the input down to some latent dimension
- Next, in the decoder, we take the input from the latent space and send it through another layer with the original hidden dimension of 64
- Then it outputs probabilities for binary classification!
- Finally, we define how data will move through our model - it is encoded by our encoder and then decoded by our decoder.
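Before training, a quick shape check can confirm the data flow. This is a minimal sketch with dummy data (4 samples with 5 features, matching our filtered gene set) and a demo model:
# demo model and dummy data for a quick shape check
dummy = torch.randn(4, 5)
ae_demo = AutoEncoder(input_dim=5, latent_dim=2)
print(ae_demo.encoder(dummy).shape)  # torch.Size([4, 2]) - the latent representation
print(ae_demo(dummy).shape)          # torch.Size([4, 1]) - one probability per sample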
Train the Autoencoder
Let's train our autoencoder and see how well it does!
# get input dimension
input_dim = X_train_tensor.shape[1]
model = AutoEncoder(input_dim=input_dim, latent_dim=2)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# convert target tensors to float32
y_train_tensor = y_train_tensor.float()
y_test_tensor = y_test_tensor.float()
# set a list for our training and test loss values
ae_train_loss_vals = []
ae_test_loss_vals = []
# training loop
for epoch in range(300):
model.train()
# what is our model output and loss
output = model(X_train_tensor)
loss = criterion(output.squeeze(), y_train_tensor)
# clear our gradients, backpropagate and
# perform gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
# loss for each training epoch
ae_train_loss_vals.append(loss.item())
# evaluate on test data
model.eval()
with torch.inference_mode():
recon_test = model(X_test_tensor)
test_loss = criterion(recon_test.squeeze(), y_test_tensor)
ae_test_loss_vals.append(test_loss.item())
You'll see our familiar flow, where we:
- call our model with our number of input features and the number of nodes in the latent space
- define our loss function and optimizer
- convert our outcome variables to float values
- loop through our epochs in training mode
- plug and chug our model
- get our loss
- clear our gradients
- backpropagate to compute our gradients
- perform a gradient descent step to update our model weights
Now let's see how our loss changes as the number of epochs increases!
# create an empty plot
fig = go.Figure()
# add training and test loss values
fig.add_trace(go.Scatter(x=list(range(300)), y=ae_train_loss_vals, mode='lines', name='Training Loss'))
fig.add_trace(go.Scatter(x=list(range(300)), y=ae_test_loss_vals, mode='lines', name='Test Loss'))
# update labels & layout
fig.update_layout(
title='Autoencoder Loss vs. Epoch',
xaxis_title='Epoch',
yaxis_title='Loss',
template = 'plotly_white'
)
# show the plot
fig.show()
Great! We see that the loss decreases as the number of epochs increases, and that it begins to level off by the later epochs.
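Since our decoder outputs the probability of being a smoker, we can also score the model's predictions with the sklearn metrics we imported earlier. Here is a minimal sketch, thresholding the probabilities at 0.5 (an assumed cutoff, not a tuned one):
# get predicted probabilities for the test set and threshold at 0.5
model.eval()
with torch.inference_mode():
    test_probs = model(X_test_tensor).squeeze()
    preds = (test_probs >= 0.5).long().numpy()
# score against the true labels
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))
print("f1:       ", f1_score(y_test, preds))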
Variational Autoencoders (VAE)
Background
A recent improvement on the autoencoder is the variational autoencoder (VAE). A normal autoencoder returns a reconstructed value in an effort to avoid overfitting the model. A VAE instead turns the latent space into a distribution rather than a single vector, returning two things from the latent space:
- Mean (\(\mu\)): the central point of the distribution.
- Log variance (\(\log(\sigma^2)\)): describes how spread out or “wide” the distribution should be around the mean.
Now, to get data points, the model essentially samples from a normal distribution defined by the mean and variance output by the VAE, producing a new data point \(z\):
\(z = \mu + \sigma \odot \epsilon\)
Where:
- \(\mu\) is the mean
- \(\sigma\) is the standard deviation (which you get by converting the log variance with \(\exp(0.5 \cdot \log(\sigma^2))\))
- \(\epsilon\) is the sample from \(\mathcal{N}(0,1)\) (a normal distribution with a mean of 0 and a standard deviation of 1)
- \(\odot\) is the element wise product
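Here is a minimal sketch of this sampling step in isolation, with made-up values for \(\mu\) and \(\log(\sigma^2)\):
mu = torch.tensor([0.5, -1.0])     # made-up mean from the encoder
logvar = torch.tensor([0.1, 0.4])  # made-up log variance from the encoder
sigma = torch.exp(0.5 * logvar)    # convert log variance to standard deviation
eps = torch.randn_like(sigma)      # sample epsilon from N(0, 1)
z = mu + sigma * eps               # element-wise product: z = mu + sigma ⊙ eps
print(z)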
What is cool about this is that we can use variational autoencoders to model the underlying distribution. However, we have an issue: we sampled \(\epsilon\) from a standard normal distribution \(\mathcal{N}(0,1)\), while our latent space has a distribution of \(\mathcal{N}(\mu,\sigma^2)\). To make sure we can keep sampling this way, we need some way of penalizing the model if the latent distribution strays too far from \(\mathcal{N}(0,1)\). We do this using Kullback-Leibler (KL) divergence, which for a Gaussian latent space reduces to:
\(L_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)\)
How did we get to this?
What is KL divergence? The KL divergence between two distributions \(q(z|x)\) and \(p(z)\) is:
\(L_{KL} = D_{KL}(q(z|x) \parallel p(z)) = \int q(z|x) \log \frac{q(z|x)}{p(z)} \, dz\)
For a VAE, \(q(z|x)\) has a mean \(\mu\) and a variance \(\sigma^2\) (encoded for each input by the encoder network).
Plugging in the standard normal distribution: \(p(z)\) is a standard normal distribution \(\mathcal{N}(0, 1)\), so it has the following formula:
\(p(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}}\)
\(q(z|x)\) is a normal distribution with mean \(\mu\) and variance \(\sigma^2\) for each input sample: \(q(z|x) = \mathcal{N}(\mu, \sigma^2)\).
Solving the KL divergence integral: substituting the expressions for \(q(z|x)\) and \(p(z)\) into the KL divergence formula, we get:
\(L_{KL} = \int \mathcal{N}(\mu, \sigma^2) \log \frac{\mathcal{N}(\mu, \sigma^2)}{\mathcal{N}(0, 1)} \, dz\)
Solving this integral gives us our KL divergence:
\(L_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)\)
We can then add this loss to our reconstruction loss to get the loss of our VAE:
\(L_{VAE} = L_{BCE} + \beta L_{KL}\)
Where \(\beta\) is a term controlling how strongly we penalize the latent space for straying from \(\mathcal{N}(0,1)\). Now, enough math - let's make a VAE!
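Before we do, here is a minimal sketch of the closed-form KL term (as a small helper function), confirming that the penalty is zero when the latent distribution is exactly \(\mathcal{N}(0,1)\) and grows as it strays:
def kl_divergence(mu, logvar):
    # closed-form KL between N(mu, sigma^2) and N(0, 1), summed over latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
# mu = 0 and log variance = 0 (sigma = 1) means the latent distribution is N(0, 1)
print(kl_divergence(torch.zeros(2), torch.zeros(2)))  # tensor(0.)
# straying from N(0, 1) produces a positive penalty
print(kl_divergence(torch.tensor([2.0, -2.0]), torch.tensor([1.0, 1.0])))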
Building a VAE
class VAEClassifier(nn.Module):
def __init__(self, input_dim, latent_dim):
super(VAEClassifier, self).__init__()
# encoder: compresses input to a smaller space
self.encoder = nn.Sequential(
nn.Linear(input_dim, 64),
nn.ReLU()
)
# latent space parameters: calculates the mean (mu) and log variance (logvar)
self.fc_mu = nn.Linear(64, latent_dim)
self.fc_logvar = nn.Linear(64, latent_dim)
# Classifier: Outputs a probability for each binary label (0 or 1)
# uses a ReLU activation function and
# sigmoid function to get class probabilities
self.classifier = nn.Sequential(
nn.Linear(latent_dim, 64),
nn.ReLU(),
nn.Linear(64, 1),
nn.Sigmoid()
)
# Reparameterization: Samples from the latent distribution
# using mean (mu) and standard deviation derived from log variance (logvar)
def reparameterize(self, mu, logvar):
# get standard deviation
std = torch.exp(0.5 * logvar)
# sample from a normal distribution
eps = torch.randn_like(std)
# output a sampled point
return mu + eps * std
# Forward pass: Defines data flow through the network
def forward(self, x):
# encode the input
encoded = self.encoder(x)
# calculate the mean and log variance of the latent space
mu = self.fc_mu(encoded)
logvar = self.fc_logvar(encoded)
# sample points from latent space
z = self.reparameterize(mu, logvar)
# classify samples and return the output, mean and log variance
output = self.classifier(z)
return output, mu, logvar
- Encoder: Compresses the input into latent space.
- Latent Space Parameters: Calculates the mean (mu) and log variance (logvar) for the latent space.
- Classifier: Uses a linear layer, ReLU, and sigmoid to produce a probability, predicting the binary class (e.g., 0 or 1).
- Reparameterization: Samples from the latent distribution, turning logvar into a standard deviation and adding a random value drawn from the normal distribution \(\mathcal{N}(0,1)\), scaled by that standard deviation.
- Forward Pass: Passes data through the encoder, computes mean and log variance, and then reparameterizes to get a sample point.
- Sampling: Combines mean with a standard deviation-scaled data point to produce z, a point in the latent space.
- Binary Classification: Uses z to predict a binary outcome, outputting a probability with the sigmoid activation.
- Outputs: Returns the probability, along with mean and log variance (which we will need to calculate the KL divergence!).
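As with the plain autoencoder, a minimal sanity check with a demo model and dummy data shows the three outputs of the forward pass:
# demo model and dummy data: the forward pass returns three tensors
vae_demo = VAEClassifier(input_dim=5, latent_dim=2)
out, mu, logvar = vae_demo(torch.randn(4, 5))
# one probability per sample, plus a mean and log variance per latent dimension
print(out.shape, mu.shape, logvar.shape)  # [4, 1], [4, 2], [4, 2]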
Train a VAE
Let's get to training our new VAE!
# call our model and set the number of dimensions for the input and
# latent space
vae = VAEClassifier(input_dim=X_train_tensor.shape[1], latent_dim=2)
optimizer = torch.optim.Adam(vae.parameters(), lr=0.001)
# define our VAE loss using the reconstruction loss,
# the mean and the log variance
def vae_loss(recon_x, x, mu, logvar):
recon_loss = nn.BCELoss()(recon_x.squeeze(), x)
kl_divergence = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return recon_loss + kl_divergence
vae_train_loss_vals = []
vae_test_loss_vals = []
# loop through and train the VAE
for epoch in range(300):
vae.train()
# grab our reconstructed output, mean and log variance
recon, mu, logvar = vae(X_train_tensor)
# calculate our loss
loss = vae_loss(recon, y_train_tensor, mu, logvar)
# clear our gradients, backpropagate and
# perform gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
# append to list
vae_train_loss_vals.append(loss.item())
# evaluate on test data
vae.eval()
with torch.inference_mode():
recon_test, mu_test, logvar_test = vae(X_test_tensor)
test_loss = vae_loss(recon_test, y_test_tensor, mu_test, logvar_test)
vae_test_loss_vals.append(test_loss.item())
Here we are:
- Defining our loss function and optimizer.
- Defining a function to calculate our VAE loss using our reconstructed loss value, the mean and log variance.
- Setting a training loop and calculating our loss.
- Clearing our gradients, backpropagating and running gradient descent.
- Adding that loss to our list of losses.
- Evaluating our model on test data and adding the test loss to our list.
Let's see how our VAE did with our test and training data!
# create an empty plot
fig = go.Figure()
# add training and test loss values
fig.add_trace(go.Scatter(x=list(range(300)), y=vae_train_loss_vals, mode='lines', name='Training Loss'))
fig.add_trace(go.Scatter(x=list(range(300)), y=vae_test_loss_vals, mode='lines', name='Test Loss'))
# update labels & layout
fig.update_layout(
title='VAE Loss vs. Epoch',
xaxis_title='Epoch',
yaxis_title='Loss',
template = 'plotly_white'
)
# show the plot
fig.show()
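A nice by-product of the VAE is the latent space itself. Since we chose a 2-dimensional latent space, we can plot the latent means for the test set directly; a minimal sketch, coloring points by true smoking status:
# extract the latent means for the test samples
vae.eval()
with torch.inference_mode():
    _, mu_test, _ = vae(X_test_tensor)
    mu_test = mu_test.numpy()
# scatter the 2-D latent space, colored by true smoking status
fig = px.scatter(
    x=mu_test[:, 0], y=mu_test[:, 1],
    color=y_test.astype(str),
    labels={'x': 'Latent dimension 1', 'y': 'Latent dimension 2', 'color': 'Smoking status'},
    title='VAE Latent Space',
    template='plotly_white'
)
fig.show()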
Key Points
- Neural networks can learn patterns but risk overfitting.
- Autoencoders solve this by compressing (encoding) input data to a smaller, lower-dimensional latent space, and then reconstructing (decoding) it back.
- Variational Autoencoders (VAEs) improve on autoencoders by modeling the latent space as a distribution (mean and log variance), better matching the underlying distribution of the data.
- The KL divergence term in the VAE loss penalizes deviations from a standard normal distribution, ensuring the latent space is smooth and structured.