📠Machine Learning
ML is used where human expertise does not exist
Course Content
Introduction: Definitions, Datasets for Machine Learning, Different Paradigms of Machine Learning, Data Normalization, Hypothesis Evaluation, VC-Dimensions and Distribution, Bias-Variance Tradeoff, Linear Regression, Classification (5-6 Lectures)
• Bayes Decision Theory: Bayes decision rule, Minimum error rate classification, Normal density and discriminant functions Parameter Estimation: Maximum Likelihood and Bayesian Parameter Estimation (3-4 Lectures)
• Discriminative Methods: SVM, Distance-based methods, Linear Discriminant Functions, Decision Tree, Random Decision Forest and Boosting (4 Lectures)
• Dimensionality Reduction: PCA, LDA, ICA, SFFS, SBFS (2-3 Lectures)
• Clustering: k-means clustering, Gaussian Mixture Modeling, EM-algorithm (3 Lectures)
• Kernels and Neural Networks, Kernel Tricks, SVMs (primal and dual forms), K-SVR, K-PCA (2 Lectures)
• Artificial Neural Networks: MLP, Backprop, and RBF-Net (3 Lectures)
• Foundations of Deep Learning: CNN, Autoencoders (2-3 lectures)
• Time series analysis
Exams
50% internal
22.5% (7.5% each) for 3 quizzes
12.5% for Assignment 1 (groups of 2)
15% for Assignment 1 (groups of 3)
50% Main
Lecture 1: (11/01/2025)
Category of Data Set: <explanation needed>
What is ML
Learning is any process by which a system improves performance from experience – Herbert Simon
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. —Tom Mitchell, 1997
E, T, P examples
Checkers
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
Handwriting recognition
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
Autonomous driving
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver.
Email spam filtering
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels

Task T
Classifications of data
Ranking
Recommendation
Clustering
Density estimation

When do we use Machine Learning ?
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)
• Learning isn't always useful

Sample Applications of ML
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging software
Medical imaging
Lecture 2: (12/01/2025)
Supervised Learning and Unsupervised Learning
Supervised learning:
When we have training data and the desired output (labels)
Example: email spam filtering, or classifying dog/cat from animal images
Binary vs Multi-class classification
Binary => true/ false
Multi-class => Multiple options
Unsupervised learning
When we have training data only
There is no labelled data
Finding patterns (clusters) in the given data
Example: astronomical data, market segmentation
Clustering:
Finding patterns in the data, where the number of clusters is supplied by the user, is unsupervised learning
Reinforcement Learning
Learning with rewards: the agent gets a reward for correct actions and a penalty for wrong ones
Example: ChatGPT and self-driving cars
ML System Classification
Batch vs Online Learning
Batch => learning offline on the full dataset at once
Online Learning => the model keeps learning incrementally from new data (e.g., ChatGPT)
Instance Base vs Model Base
Challenges of Machine learning
Insufficient data
Non-representative training data
Poor-quality data
Irrelevant features
Performance Measure
It is also called the cost.
Root Mean Square error
RMSE = sqrt [(Σ(Pi – Oi)²) / n]
Mean Square error
MSE = (Σ(Pi – Oi)²) / n
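Both measures can be computed directly with NumPy; the numbers below are made up for illustration:

```python
import numpy as np

# Predicted (Pi) and observed (Oi) values -- illustrative numbers only
predicted = np.array([2.5, 0.0, 2.1, 7.8])
observed = np.array([3.0, -0.5, 2.0, 8.0])

mse = np.mean((predicted - observed) ** 2)  # MSE = Σ(Pi - Oi)² / n
rmse = np.sqrt(mse)                         # RMSE = sqrt(MSE)
print(mse, rmse)
```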
Explore Data
Understand the data like
missing data
data range, count
Visually inspect the data using histograms
Look for patterns and outliers
Library => import matplotlib.pyplot as plt
Duplicate Data => Remove it
Segregate Data
Get Unique Identifier
Find Correlations => Standard Correlation Coefficient (Pearson's r)
Set aside 20% of the data (the proportion depends on dataset size) for testing
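The correlation step above can be sketched with pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

# Tiny hypothetical housing-style dataset
df = pd.DataFrame({
    "size_sqft": [500, 800, 1000, 1500, 2000],
    "price": [100, 155, 210, 310, 400],
    "age_years": [30, 25, 20, 10, 5],
})

# Standard correlation coefficient (Pearson's r) between every pair of features
corr = df.corr()
print(corr["price"].sort_values(ascending=False))
```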

Lecture 3: (18/01/2025)
Data Preparation:
80% of data analysis is spent on the process of cleaning and preparing data.
Imputation: replacing null or blank values with zero, the mean, or the median, so that rows are not dropped just because a single value is missing.
Good Imputation: ?? Homework
Data Cleaning:
Capping: clip outliers to a threshold value
Encoding

Converting text to numbers, i.e., mapping text into numerical values, e.g., with an ordinal encoder
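A minimal sketch of ordinal encoding with scikit-learn's OrdinalEncoder; the category values are hypothetical:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column, e.g. proximity to the ocean
categories = [["INLAND"], ["NEAR OCEAN"], ["INLAND"], ["ISLAND"]]

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(categories)
print(encoder.categories_)  # learned mapping: category -> integer index
print(encoded.ravel())      # -> [0. 2. 0. 1.] (alphabetical category order)
```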

Features Scaling and transformation
ML algorithms don't perform well when input numerical attributes have very different scales. Many algorithms behave best when features are roughly Gaussian, or scaled so the minimum is 0.
Feature Scaling: Adjusting the range of features (e.g., normalization or standardization) to ensure all features contribute equally to the model, preventing dominance by features with larger magnitudes.
Feature Transformation: Modifying features (e.g., log, square root, or polynomial transformations) to make data more suitable for modeling, often improving linearity or addressing skewness.
Multimodal Distribution
Hyperparameter tuning: grid search method
Feature importance: drop features with near-zero importance
Evaluate on Test
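The grid search mentioned above can be sketched with scikit-learn's GridSearchCV; the estimator, dataset, and parameter grid here are illustrative choices, not the ones used in class:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic regression data for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Try every combination of hyperparameters with 3-fold cross-validation
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```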
Linear Regression:
for reference ML ppt CS229
Mean square error problem (Cost function)
Iteration to find theta
Gradient Descent: Mountain Example

Lecture 4: (19/01/2025)
Linear Regression
Models the relationship between a dependent variable Y and one or more independent variables X using a linear equation:
Y = β_0 + β_1 X + ϵ
where β_0 is the intercept, β_1 is the slope, and ϵ is the error term.
Example: Predicting house prices based on square footage.
If Y= 50000 + 200X,
then a 1000 sq. ft. house costs 50000+200(1000) = 250000.
Hypothesis function
In Linear Regression, the hypothesis function h(X) represents the predicted output as a linear combination of input features:
h(X) = θ_0 + θ_1 X
where θ_0 (intercept) and θ_1 (slope) are learned parameters.
why we calculate the hypothesis?
"To estimate the relationship between input 𝑋 and output 𝑌, allowing us to make predictions for new data."
Example: If h(X) = 50 + 10X, for X = 5, the predicted value is h(5) = 50 + 10(5) = 100.
Hypothesis function for Multiple Linear Regression
Predictions depend on multiple input features:
h(X) = θ_0 + θ_1 X_1 + θ_2 X_2 + … + θ_n X_n
Each X_i represents an independent variable, and θ_i are the learned coefficients.
Example: Predicting house price based on size (X_1) and number of rooms (X_2):
h(X)=50000+200X_1+10000X_2
For X_1= 1000 sq. ft, X_2 = 3 rooms, the price is ₹2,80,000.
Calculation Of θ ⇒ Cost Function
The values of θ_0,θ_1,… are found using Gradient Descent or the Normal Equation.
1. Gradient Descent Algorithm
Minimizes the cost function:
J(θ) = (1 / 2m) Σ (h(Xi) - Yi)²
m = Total number of training examples.
Xi = Input features of the ith training example.
Yi = Actual output (target value) of the ith training example.
h(Xi) = Predicted output using the hypothesis function.
Update Rule:
θ_j := θ_j - α (1/m) Σ (h(Xi) - Yi) Xi_j  (Xi_j is the j-th feature of the i-th example)
where α is the learning rate.
Example: For data points (1,2), (2,2.8), (3,3.6), running gradient descent iteratively updates θ_0 and θ_1 to best fit h(X).
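The iterative update above can be sketched in plain NumPy on those same three data points; the learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Data points from the example: (1, 2), (2, 2.8), (3, 3.6)
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 2.8, 3.6])
m = len(X)

theta0, theta1 = 0.0, 0.0  # initial parameters
alpha = 0.1                # learning rate

for _ in range(5000):
    h = theta0 + theta1 * X                        # hypothesis h(X)
    theta0 -= alpha * (1 / m) * np.sum(h - Y)      # gradient step for θ_0
    theta1 -= alpha * (1 / m) * np.sum((h - Y) * X)  # gradient step for θ_1

print(theta0, theta1)  # converges to the exact fit Y = 1.2 + 0.8X
```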
2. Normal Equation (Direct Method)
Used when data is small, as it’s computationally expensive for large datasets.
Solves for θ without iteration:
θ = (Xᵀ X)⁻¹ Xᵀ Y
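The same three points from the gradient descent example, solved directly with the normal equation (a NumPy sketch):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 2.8, 3.6])

# Add a bias column of ones so theta[0] plays the role of the intercept
Xb = np.c_[np.ones(len(X)), X]

# θ = (XᵀX)⁻¹ Xᵀ Y : direct solution, no iteration
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ Y
print(theta)  # -> [1.2, 0.8]
```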
Role of the hypothesis function
The hypothesis function serves as a mathematical model that maps inputs X to outputs Y, whether continuous (regression) or discrete (classification)
Regression: h(X) outputs continuous values, meaning the predictions can take any real number. Example: Predicting house prices, where h(X) = 50000 + 200X can output any value like ₹2,50,000 or ₹2,50,500. Regression predicts quantities.
Classification: h(X) outputs discrete values, meaning predictions belong to predefined categories. Example: Spam detection, where h(X) predicts either Spam (1) or Not Spam (0) based on email features.
Classification predicts labels.
Least Squares Optimization Problem
The Least Squares Optimization Problem finds the best-fit line by minimizing the sum of squared errors between predicted and actual values:
J(θ) = Σ (Yi - h(Xi))²
where h(Xi) is the predicted value for the i-th example.
Methods:
Gradient Descent iteratively updates θ to minimize J(θ).
Normal Equation directly computes.
Example: Fitting a line to points (1,2), (2,2.8), (3,3.6) by minimizing the squared differences between actual Y and predicted h(X).
Pitfalls of Least Squares Optimization:
Sensitive to Outliers: Large errors get squared, making the model biased toward extreme values.
Overfitting in High Dimensions: Too many features (X) can lead to poor generalization.
Multicollinearity: Highly correlated features cause unstable parameter estimates.
Non-Linearity: Least squares assumes a linear relationship, failing for complex patterns.
Heteroscedasticity: Unequal variance in errors violates model assumptions.
Example: If one house in a dataset has an extreme price (₹1 crore while others are ₹10-20 lakhs), the least squares model will be skewed.
Learning Rate:
Learning rate hyperparameter.
The learning rate (α) controls how much Gradient Descent updates model parameters in each step:
θ := θ - α ∇J(θ)
Effects:
Too high (α≫1) → Divergence (jumps over the minimum).
Too low (α≪1) → Slow convergence.
Example: If α= 0.01, the model learns steadily, but if α= 10, it may overshoot and fail to minimize the cost function.
When do we stop?
After a fixed number of iterations
Or when the improvement falls below a threshold
Numerical on MSE
Feature Scaling
improves Gradient Descent convergence by normalizing feature values. Two common methods:
Min-Max Scaling: x' = (x - min) / (max - min)
Scales values between 0 and 1.
Standardization (Z-score): z = (x - μ) / σ
Centers mean at 0 with unit variance.
Example: If house sizes range from 500 to 5000 sq. ft, without scaling, Gradient Descent takes longer to converge. Normalizing makes updates uniform, speeding up learning.
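Both scaling methods can be sketched in a few lines of NumPy; the house sizes below are made up:

```python
import numpy as np

sizes = np.array([500.0, 1200.0, 3000.0, 5000.0])  # house sizes in sq. ft

# Min-Max scaling: maps values into [0, 1]
minmax = (sizes - sizes.min()) / (sizes.max() - sizes.min())

# Standardization (Z-score): mean 0, unit variance
zscore = (sizes - sizes.mean()) / sizes.std()

print(minmax)
print(zscore)
```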

Batch Gradient Descent vs Stochastic GD
Batch Gradient Descent computes gradients using the entire dataset, making it slow for large datasets but stable. Batch GD is like a linear search: all the data is fed to the machine, so each update takes more time.
Stochastic Gradient Descent (SGD) updates parameters using one random instance at a time, making it faster but noisy. SGD is like a random search: it picks a random point for each update, so it never stops exactly at the minimum but only gets close to it, because the training instance changes on every step.
SGD does not converge exactly but oscillates near the minimum, helping escape local minima.
Example: In house price prediction, Batch GD updates after processing all houses, while SGD updates after each house, making it faster but less stable.

Mini-Batch Gradient Descent
A mix of both: instead of a single random instance, it picks small random subsets (mini-batches) and performs a batch update on each, getting very close to the global minimum.
Normal Equation Derivation
Not useful for large datasets, because we must invert a matrix, which is costly.
For small datasets, solving the normal equation directly is faster than iterating gradient descent.
It works best only up to moderate dataset sizes (around 70k examples).
If the inverse does not exist, clean the data or use another approach (e.g., the pseudo-inverse).
Polynomial Regression
Polynomial Regression extends Linear Regression by adding polynomial terms to capture non-linear relationships:
h(X) = θ_0 + θ_1 X + θ_2 X² + … + θ_d X^d
Example: Predicting salary based on experience, where a simple linear model fails. If
h(X) = 5000 + 2000X + 300X²
for X = 5 years, the predicted salary is 5000 + 2000(5) + 300(25) = ₹22,500.
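A scikit-learn sketch of the salary example: PolynomialFeatures adds the squared term, and the training data is generated from the quadratic above, so the fit is exact:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical salary-vs-experience data following the quadratic trend
X = np.arange(1, 8, dtype=float).reshape(-1, 1)        # years of experience
y = 5000 + 2000 * X.ravel() + 300 * X.ravel() ** 2     # salary

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                         # adds the X² column

model = LinearRegression().fit(X_poly, y)
pred = model.predict(poly.transform([[5.0]]))
print(pred)  # ≈ 5000 + 2000·5 + 300·25 = 22500
```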
Learning Curves
A plot of training and validation errors vs. training size, showing model performance.
Underfitting
Occurs when the model is too simple (high bias), leading to high training and validation errors. Example: Linear regression on a curved dataset results in poor predictions.
Overfitting
Occurs when the model is too complex (high variance), fitting noise instead of patterns. Example: A high-degree polynomial perfectly fits training data but performs poorly on new data.
Lecture 5: (25/01/2025)
Practical Session only
Lecture 6: (01/02/2025)
Regularised Linear Models: tackle overfitting
Lasso Regression
Elastic Net
Error Bias & Variance Tradeoff & Irreducible
Bias => the model does not fit the data well, i.e., it underfits
Variance => a small change in the data changes the result a lot, i.e., it overfits
Irreducible => noise in the data that no model can fit; clean the data and remove outliers
Lecture 7: (02/02/2025)

Lecture 8: (08/02/2025)
https://chatgpt.com/share/67ac361b-1e38-800c-8d24-9e3991a11f25
Lecture 9: (09/02/2025)
https://chatgpt.com/share/67af6047-4a34-8006-a25b-168265542c77
Lecture 10: (15/02/2025)
Random Forest
Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. It uses bagging and feature randomness for robustness.
Bagging (Bootstrap Aggregating)
in Random Forest improves stability and accuracy by training each decision tree on a different random subset of the dataset with replacement. This reduces variance and prevents overfitting.
Example: In a customer churn prediction model, each tree is trained on a different bootstrapped sample, and the final decision is made by averaging (regression) or voting (classification).
Example 2 ⇒
Bagging in Random Forest can be understood using an example of classifying apples and oranges. Suppose we have a dataset of fruits with features like color, weight, and texture.
Each decision tree in the Random Forest is trained on a random subset of this dataset (with replacement). Some trees may focus more on color, while others on weight. When classifying a new fruit, the final decision is made by majority voting.
Like:
Tree 1: Says "Apple" based on red color
Tree 2: Says "Orange" based on texture
Tree 3: Says "Apple" based on weight
Final prediction: "Apple" (majority vote).
Feature Importance in Random Forest
measures how much each feature contributes to the model's decision-making. It helps in feature selection by identifying the most influential features.
Example: In a fruit classification model, color might be the most important feature, followed by texture and weight.
Formula:
FI_j = (1/N) Σ_i I_split,j(i)
where:
FI_j = feature importance of feature j
N = number of trees
I_split,j(i) = importance of feature j in tree i
Code to get Feature Importance:
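A minimal scikit-learn sketch, using the built-in iris dataset as a stand-in since the fruit data from the example is not available:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in dataset for the fruit example
data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data.data, data.target)

# feature_importances_ sums to 1 across all features
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```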
Lecture 11: (15/02/2025)(Evening class)
Boosting
Boosting is an ensemble technique that combines weak learners sequentially, where each model corrects the errors of the previous one, improving overall accuracy. It reduces bias and variance.
Example: In spam detection, boosting refines misclassified emails by focusing more on difficult examples in each iteration.
Example: Financial Fraud Detection (Using Bagging + Boosting Together)
How?
Bagging inside Boosting: Use Random Forest (bagging) as the base estimator in AdaBoost/XGBoost to make boosting more robust.
Boosting inside Bagging: Train multiple boosted models (e.g., Gradient Boosted Trees) and aggregate their predictions like bagging.
Step 1: Bagging (Random Forest) for Robust Feature Selection
A Random Forest model is trained using multiple decision trees on different subsets of transaction data.
Each tree gives independent predictions, and majority voting ensures stable, less overfitting-prone results.
Example:
Tree 1: Says "Fraud" based on transaction amount.
Tree 2: Says "Not Fraud" based on merchant type.
Tree 3: Says "Fraud" based on location difference.
Final Bagging Prediction: "Fraud" (majority vote).
Step 2: Boosting (XGBoost) for Enhanced Accuracy
The output from Random Forest is then fed into an XGBoost model, which corrects misclassifications.
The model assigns higher weights to misclassified transactions and improves fraud detection.
Example:
If Bagging misclassified a fraud case due to a rare merchant, Boosting will refine it using new weighted trees.
Final Outcome
By combining Bagging (for robustness) and Boosting (for accuracy improvement), the system detects fraud more reliably, reducing false positives and catching hard-to-detect fraudulent transactions.
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal hyperplane to separate classes with maximum margin. It works well for both linear and non-linear classification using kernels.
Example: In spam detection, SVM separates spam and non-spam emails based on word frequency patterns.
Hard Margin vs. Soft Margin in SVM
Hard Margin SVM:
Used when data is linearly separable with no misclassification.
Example: Perfectly separating red and blue balls in a 2D plane without overlap.
Soft Margin SVM:
Allows some misclassification for better generalization in non-linearly separable data.
Example 1: Classifying emails as spam or non-spam, where some emails might be misclassified due to ambiguous words.
Example 2: Separating dog and cat images where some breeds (e.g., Pomeranian vs. Persian cat) have similar features.
Regularization Hyperparameter (C)
The Regularization Hyperparameter (C) in SVM controls the trade-off between maximizing the margin and minimizing misclassification.
High C (low regularization) → focuses more on classifying all points correctly, leading to overfitting.
Low C (high regularization) → allows some misclassification, leading to better generalization.
Example 1: In spam detection, a high C might overfit to specific spam words, while a low C generalizes better. Example 2: In image classification, a low C prevents overfitting to noise in training images.
Non-Linear SVM
When data is not linearly separable, SVM uses kernel tricks to map it into a higher-dimensional space where a hyperplane can separate the classes.
Example:
Classifying red and blue points that form concentric circles. A linear SVM fails, but using a Radial Basis Function (RBF) kernel, we transform data into a higher dimension where a clear separation is possible.
Graph (Visualization of Non-Linear SVM)

The plot shows how SVM with an RBF kernel separates non-linearly distributed data (moons dataset). The decision boundary curves around the data, demonstrating how kernel tricks enable SVM to handle complex patterns.
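The moons setup described above can be reproduced with a short scikit-learn sketch; the hyperparameters are illustrative defaults:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Moons dataset: two interleaving half-circles, not linearly separable
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# RBF kernel maps the data so a separating surface can be found
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")
rbf_svm.fit(X, y)
print(rbf_svm.score(X, y))  # training accuracy, typically near 1.0 here
```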
Lecture 12: (22/02/2025)
Kernel Function in SVM
A kernel function transforms non-linearly separable data into a higher-dimensional space, making it linearly separable.
Common Kernel Types:
Linear Kernel – Used when data is already linearly separable.
Example: Separating spam vs. non-spam emails based on word frequency.
Polynomial Kernel – Maps data into polynomial space for curved decision boundaries.
Example: Classifying different species of flowers with overlapping petal lengths.
RBF (Gaussian) Kernel – Maps data to an infinite-dimensional space, capturing complex patterns.
Example: Detecting fraudulent transactions with non-linear relationships.
Sigmoid Kernel – Similar to a neural network activation function.
Example: Handwriting recognition where patterns need non-linear separation.
Lecture 13: (23/02/2025)
Analyzing Covariance Matrix in ML
What is a Covariance Matrix?
A covariance matrix is a square matrix that captures the relationships between multiple variables in a dataset. Each element C(i, j) represents the covariance between variables X_i and X_j:
C(i, j) = E[(X_i - μ_i)(X_j - μ_j)]
Positive covariance → Variables increase together.
Negative covariance → One variable increases while the other decreases.
Zero covariance → No linear relationship.
Why is it Important in ML?
Feature Relationship: Helps understand how features interact.
Dimensionality Reduction: Used in PCA (Principal Component Analysis) to find uncorrelated axes.
Multicollinearity Detection: Identifies redundant features in regression models.
Example with Visualization
Consider a dataset with two features, Height (cm) and Weight (kg).

Interpretation: If the covariance matrix has a high positive value, Height and Weight are strongly correlated.
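A NumPy sketch of the Height/Weight example; the measurements below are invented so that the two features are perfectly correlated:

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) measurements
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 58.0, 66.0, 74.0, 82.0])

# np.cov expects rows = variables, columns = observations
cov = np.cov(np.vstack([height, weight]))
print(cov)  # off-diagonal entry is the (positive) height-weight covariance
```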
Graphical Representation
Heatmap of the Covariance Matrix
This helps visualize how different features are related in high-dimensional datasets.

Relevance to PCA (Dimensionality Reduction)
PCA relies on the eigenvectors and eigenvalues of the covariance matrix to transform correlated variables into uncorrelated principal components.
This transforms the dataset into new axes where features are uncorrelated, making ML models more efficient.
Conclusion
The covariance matrix is a fundamental tool in ML for understanding feature relationships, reducing dimensions, and improving model efficiency.
Lecture 14: (01/03/2025)
Class Cancelled
Lecture 15: (02/03/2025)
Class Recording
Unsupervised Learning
Unsupervised learning finds hidden patterns or structures in data without labeled outputs. It is widely used in clustering, anomaly detection, and dimensionality reduction.
K-means Clustering: A Simple Yet Powerful Algorithm
K-Means is an unsupervised learning algorithm that groups data into K clusters by minimizing intra-cluster distance. It follows an iterative process of centroid initialization, point assignment, centroid update, and convergence.
How K-Means Works:
Select the number of clusters (K).
Randomly initialize K centroids.
Assign data points to the nearest centroid.
Recalculate centroids based on cluster means.
Repeat until convergence.
Finding Optimal K (Elbow Method):
Run K-Means for different K values.
Calculate total variation within clusters.
Plot results and find the "elbow point" where adding clusters no longer reduces variation significantly.
Applications & Considerations:
Works for 1D, 2D, and multi-dimensional data.
Used in customer segmentation, image compression, and heatmaps.
Running multiple times helps counter randomness in centroid initialization.
Limitations of K-Means Clustering
Sensitivity to Initialization – The algorithm's final clustering results can vary due to different initial centroid placements, leading to inconsistent outcomes.
Fixed Number of Clusters (K) – K-means requires specifying the number of clusters in advance, which can be challenging without prior knowledge of the data structure.
Struggles with Non-Spherical Clusters – It assumes clusters are spherical and evenly sized, making it ineffective for complex, irregularly shaped clusters.
Sensitivity to Outliers – Outliers can distort centroid positions, leading to inaccurate cluster assignments and affecting overall performance.
Cost function
The cost function for K-Means Clustering is the Sum of Squared Errors (SSE), also known as Inertia. It measures the compactness of clusters by calculating the squared distance between each data point and its assigned centroid:
J = Σ_k Σ_{x in cluster k} ||x - μ_k||²
where μ_k is the centroid of cluster k. The objective of K-Means is to minimize this cost function to achieve the best clustering.
Pseudocode

initialize K centroids randomly
repeat until assignments stop changing:
  assign each data point to its nearest centroid
  recompute each centroid as the mean of its assigned points
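The algorithm can be sketched with scikit-learn's KMeans on synthetic 2-D blobs; all data here is generated for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of 2-D points
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])

# n_init=10 reruns with different random centroids to counter bad initialization
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # ≈ [0, 0] and [5, 5], in some order
print(km.inertia_)          # the SSE cost function being minimized
```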
Mini-batch
Elbow method
Lecture 16: (08/03/2025)
Class Recording
Silhouette Coefficient
The Silhouette Coefficient (or Silhouette Score) is a metric used to evaluate the quality of clustering in unsupervised learning. It measures how similar a data point is to its own cluster compared to other clusters: for a point, s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster. The score ranges from -1 to 1, where:
1 → The data point is well clustered.
0 → The data point is on the border between clusters.
-1 → The data point is likely misclassified.
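A minimal sketch of computing the score with scikit-learn, on two well-separated synthetic clusters (the data is generated for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated 2-D clusters
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(40, 2)),
    rng.normal([4, 4], 0.3, size=(40, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(score)  # close to 1 for well-separated clusters
```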

Bayesian Decision Theory Risk function
Bayesian Decision Theory provides a probabilistic approach to decision-making under uncertainty. The Risk Function quantifies the expected loss when making decisions based on uncertain information: the conditional risk of taking action α_i given observation x is R(α_i | x) = Σ_j λ(α_i | ω_j) P(ω_j | x), where λ(α_i | ω_j) is the loss incurred by taking action α_i when the true state is ω_j.
Lecture 17: (09/03/2025)
Class Recording
Principal Component Analysis (PCA)
Local Linear Embedding
Eigenvalue
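A minimal PCA sketch with scikit-learn, on synthetic 2-D data stretched along one direction so the first principal component (the largest eigenvalue direction of the covariance matrix) dominates:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Second feature is almost a multiple of the first, plus a little noise
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component carries most variance
```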
Lecture 17: (16/03/2025)
Class Recording
Lecture : (22/03/2025)
Class Recording
Lecture : (23/03/2025)
Class Recording
Neural Networks
Lecture : (29/03/2025)
Class Recording

Neural Networks
Input Encoding: The original photo is processed into a latent feature space using a CNN.
Style Conditioning: A text encoder converts the prompt "Ghibli style" into a style embedding.
Latent Fusion: Cross-attention fuses the photo’s content with the Ghibli style.
Diffusion Refinement: An iterative diffusion model denoises the fused latent space to align it with the desired style.
Decoding: A decoder converts the refined latent representation back into the final stylized image.

Lecture : (30/03/2025)
Class Recording

Lecture : (05/04/2025)
Class Recording
CNN
Convolutional Neural Networks are a specialized kind of neural network designed for processing structured grid data like images. They are particularly effective in visual recognition tasks.

Lecture : (06/04/2025)
Class Recording
CNN
Lecture : (12/04/2025)
Class Recording
Autoencoders
Autoencoders are neural networks designed to learn efficient representations (encodings) of data, typically for dimensionality reduction, denoising, or generative tasks. They work by trying to reconstruct their inputs.

RNN (Recurrent Neural Network)
RNNs are neural networks designed for sequential data, where the current output depends not only on the current input but also on previous inputs. They are widely used in tasks involving time series, language, and sequences.

Lecture : (13/04/2025)
QUIZ