Autoencoders
Part I – Theoretical Foundation
1. Deep Learning Basics
Deep learning is a subfield of machine learning that uses neural networks with many layers to learn features from complex data such as images or text. These networks learn by adjusting their weights to minimize a loss function using a method called backpropagation.
2. What Is an Autoencoder?
An autoencoder is a neural network used to compress and then reconstruct input data. It works without any labels (unsupervised learning).
Encoder: Compresses the input into a smaller code or latent vector.
Decoder: Reconstructs the original input from this code.
The goal is to recreate the input as accurately as possible. If done well, the code (latent vector) contains the most important information from the input.
Autoencoders are useful for:
Reducing dimensions
Extracting features
Removing noise
Detecting anomalies
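The encoder/decoder split above can be sketched as a minimal fully connected autoencoder in PyTorch. The layer sizes here (784 inputs for flattened 28x28 MNIST images, a 32-dimensional code) are illustrative choices, not prescribed by the assignment:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: 784 -> 32 -> 784."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)        # compress to the latent code
        return self.decoder(z), z  # reconstruct, and return the code too

model = Autoencoder()
x = torch.rand(16, 784)                          # dummy batch of flattened images
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)          # reconstruction loss to minimize
```

Training then consists of repeatedly computing this reconstruction loss on batches of data and backpropagating through both the decoder and the encoder.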
3. Variants of Autoencoders
A. Sparse Autoencoder (SAE)
This variant adds a sparsity constraint so that only a few hidden neurons are active for any given input.
Most hidden units remain inactive (near zero).
Achieved using a penalty (like L1 loss or KL-divergence) in the loss function.
Helps the model learn clean, sharp, and meaningful features.
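The sparsity penalty can be sketched by adding an L1 term on the hidden activations to the reconstruction loss. The weight `lam` is a hypothetical value you would tune; the KL-divergence variant mentioned above would replace the L1 term:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(8, 784)                    # dummy batch
z = encoder(x)
recon = decoder(z)

recon_loss = nn.functional.mse_loss(recon, x)
sparsity_penalty = z.abs().mean()         # L1 term pushes activations toward zero
lam = 1e-3                                # sparsity weight (hypothetical value)
loss = recon_loss + lam * sparsity_penalty
```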
B. Contractive Autoencoder (CAE)
This version discourages the encoder from changing its output too much when the input is slightly modified.
It adds a penalty based on how sensitive the encoder is to small input changes.
The penalty is the squared Frobenius norm of the Jacobian of the encoder outputs with respect to the inputs.
Helps create a smooth, noise-resistant latent space.
Makes the model more stable and generalizable.
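For a single sigmoid encoder layer, the Jacobian penalty has a well-known closed form, since dh_j/dx_i = h_j(1 - h_j) * W_ji. A sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(64, 784) * 0.01)  # encoder weights
b = nn.Parameter(torch.zeros(64))

x = torch.rand(8, 784)                         # dummy batch
h = torch.sigmoid(x @ W.t() + b)               # encoder output, shape (8, 64)

# Squared Frobenius norm of the Jacobian for a sigmoid layer:
# ||J||_F^2 = sum_j (h_j(1-h_j))^2 * sum_i W_ji^2
dh = (h * (1 - h)) ** 2                        # (8, 64)
w_sq = (W ** 2).sum(dim=1)                     # (64,)
contractive_penalty = (dh * w_sq).sum(dim=1).mean()  # averaged over the batch
```

This penalty is added to the reconstruction loss with a small weight, just like the sparsity term in the SAE.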
4. U-Net Autoencoder (Without Skip Connections)
The U-Net is commonly used in image tasks like segmentation. In this assignment, we use a modified version:
Encoder: Series of convolutional layers that shrink the image while learning features.
Decoder: Upsamples the compressed code back into the original image size.
No skip connections: This ensures the model must actually compress the information instead of copying it across layers.
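A minimal convolutional encoder/decoder without skip connections might look like this for 28x28 MNIST-sized images (channel counts and depth are illustrative, not the assignment's exact architecture):

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """U-Net-style encoder/decoder with the skip connections removed."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 7 -> 14
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),  # 14 -> 28
        )

    def forward(self, x):
        # No skips: the decoder sees only the compressed bottleneck
        return self.decoder(self.encoder(x))

x = torch.rand(4, 1, 28, 28)
out = ConvAE()(x)
```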
Part II – Assignment Explanation
Task 1: Sparse vs. Contractive Autoencoders (on MNIST)
a) Visualizing Embeddings with t-SNE
You will:
Feed MNIST test images through each encoder (SAE and CAE).
Reduce the resulting latent vectors to 2D using t-SNE.
Color points by digit labels to see how well they cluster.
Why: This shows how well the models separate different digit types in latent space.
What to expect:
CAE usually creates smoother, tighter clusters.
SAE often has sharper but more scattered groupings.
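The t-SNE step can be sketched with scikit-learn. Here random vectors stand in for the encoder outputs; in the assignment you would replace them with the actual SAE/CAE latent codes for the MNIST test set:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for encoder outputs: 200 latent vectors of dimension 32
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 32))
labels = rng.integers(0, 10, size=200)   # stand-ins for digit labels

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(latents)
# emb has shape (200, 2); scatter-plot it colored by `labels`
```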
b) Interpolation Study
You will:
Pick 20 image pairs.
Create blended images between them in pixel space.
Compare the encodings of these true blends with simple linear interpolations in latent space.
Decode both to images and compute two things:
PSNR (peak signal-to-noise ratio; measures image similarity)
Latent L2 distance (measures code similarity)
Why: This tests whether a straight path in code space leads to meaningful image transitions.
What to expect: CAE’s codes often create smoother and more accurate interpolated images because the regularization helps shape the latent space better.
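The two metrics can be sketched in NumPy. The `psnr` helper assumes pixel values in [0, max_val], and the latent vectors here are random stand-ins for real encodings:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

# Linear interpolation in latent space (random stand-ins for real codes)
rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=32), rng.normal(size=32)
alpha = 0.5
z_mix = (1 - alpha) * z_a + alpha * z_b
latent_l2 = np.linalg.norm(z_mix - z_a)    # L2 distance between codes

# Sanity check: a uniform pixel error of 0.1 gives about 20 dB
a, b = np.zeros((4, 4)), np.full((4, 4), 0.1)
score = psnr(a, b)
```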
c) Classification Using Latent Embeddings
You will:
Use the encoded vectors (from SAE and CAE) as features.
Train a classifier like SVM or logistic regression to predict the digit.
Why: This checks how useful and separable the latent features are for downstream tasks, even though the autoencoders weren’t trained with labels.
What to expect:
SAE often gives higher classification accuracy due to its neuron selectivity.
CAE is generally more robust, especially when data is noisy.
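The downstream-classification step can be sketched with scikit-learn. The random features below stand in for real encoder outputs, so accuracy stays near chance; real SAE/CAE latents should do far better:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in latent features; in the assignment these come from the trained encoders
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
y = rng.integers(0, 10, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)   # test accuracy of the downstream classifier
```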
Note: You must not use skip connections in U-Net to prevent cheating (i.e., directly copying input features).
Task 2: Variational Autoencoder (VAE) on Frey Face Dataset
You will train a VAE with a 20-dimensional latent space.
What to Do:
After training, sample random latent codes and generate new faces.
Then, fix all but one latent dimension and vary that one to see how the face changes.
Why: Unlike traditional AEs, VAEs create a structured and continuous latent space (usually assumed to follow a standard normal distribution).
Sampling shows the model’s ability to generate realistic new faces.
Varying one dimension helps us interpret what each latent factor controls (like smile, head tilt, or lighting).
How it works:
VAE adds a KL-divergence term to the loss to keep latent codes close to a standard normal distribution.
The reparameterization trick allows gradients to flow through the random sampling process during training.
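Both mechanisms can be sketched in a few lines of PyTorch (the 20-dimensional latent matches the assignment; the batch size is illustrative):

```python
import torch

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
# so gradients flow through mu and log_var rather than the random draw
mu = torch.zeros(8, 20, requires_grad=True)
log_var = torch.zeros(8, 20, requires_grad=True)
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
# summed over latent dimensions, averaged over the batch
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
```

With mu = 0 and log_var = 0 the posterior equals the prior, so the KL term is exactly zero; during training the reconstruction loss pulls the codes away from the prior while the KL term pulls them back.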
Expected results:
Sampling should generate a variety of realistic faces.
Changing one latent value at a time should show gradual transformations, revealing which features are controlled by which code dimension.