# Machine Learning

ML comes in where human expertise does not exist.

<details>

<summary>Course Content</summary>

**Introduction**: Definitions, Datasets for Machine Learning, Different Paradigms of Machine Learning, Data Normalization, Hypothesis Evaluation, VC-Dimensions and Distribution, Bias-Variance Tradeoff, Linear Regression, Classification (5-6 Lectures)&#x20;

• Bayes Decision Theory: Bayes decision rule, Minimum error rate classification, Normal density and discriminant functions Parameter Estimation: Maximum Likelihood and Bayesian Parameter Estimation (3-4 Lectures)&#x20;

• Discriminative Methods: SVM, Distance-based methods, Linear Discriminant Functions, Decision Tree, Random Decision Forest and Boosting (4 Lectures)&#x20;

• Dimensionality Reduction: PCA, LDA, ICA, SFFS, SBFS (2-3 Lectures)&#x20;

• Clustering: k-means clustering, Gaussian Mixture Modeling, EM-algorithm (3 Lectures)&#x20;

• Kernels and Neural Networks, Kernel Tricks, SVMs (primal and dual forms), K-SVR, K-PCA (2 Lectures)&#x20;

• Artificial Neural Networks: MLP, Backprop, and RBF-Net (3 Lectures)&#x20;

• Foundations of Deep Learning: CNN, Autoencoders (2-3 lectures)&#x20;

• Time series analysis

</details>

<details>

<summary>Exams</summary>

* 50% internal
  * 22.5% (7.5% each) for 3 quizzes
  * 12.5% for Assignment 1 (group of 2)
  * 15% for Assignment 2 (group of 3)
* 50% Main exam

</details>

<details>

<summary>Material </summary>

* [**Class Recordings**](https://general-smile-94b.notion.site/ML-Class-Recording-1990dfee4e4380fd8ce0cf27e0531a74)
* [**Class Material**](https://github.com/manvendrapratapsinghdev/IITJMaterial/tree/main/T1/ML)
* Python Library
  * <https://scikit-learn.org/stable/>
* Videos:
  * [Cost function](https://www.youtube.com/watch?v=7uwa9aPbBRU\&list=PLTDARY42LDV7WGmlzZtY-w9pemyPrKNUZ\&index=1)

</details>

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FtB0ia95RitYKgw7PQ5GD%2FHands%20On%20ML.pdf?alt=media&token=b05e2440-61f1-4a30-9cd1-27745db339b1>" %}

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FRjrWS7KT70X6nEfnQS2s%2FData%20Analysis%20Cheat%20Sheet.pdf?alt=media&token=9f510753-272e-4ceb-9773-23c8fdfaed46>" %}

## [*Gist of complete course*](https://app.napkin.ai/page/CgoiCHByb2Qtb25lEiwKBFBhZ2UaJDExNjg2YTlkLTQwYTYtNDdmMy1hNDBlLTg4YzFlZTIyYWQ4Mg?s=1)&#x20;

## Lecture 1: *(<mark style="color:orange;">11/01/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/ZK98Z22v_ogK2QceGzu7tGf7v4yJHuVMpP1bgfdbROVE4cukCMnySDoO0b0ed6xOUF3fEEDnx7a-ht2F.wTjDIhS1ev3j25M7)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2Fae3PIjgIKS0aiWldp833%2FLecture%201.pdf?alt=media&token=fed422cd-dbe4-4fbe-9b70-05ac0eb33664>" %}

Category of Data Set: \<Explanation needed>

### What is ML

* Learning is any process by which a system improves performance from experience – Herbert Simon
* A computer program is said to learn from experience E with respect to some task T and some performance measure P, if **its performance on T, as measured by P, improves with experience E**. —Tom Mitchell, 1997

### E, T, P examples

* **Checkers**
  * **T**: Playing checkers
  * **P**: Percentage of games won against an arbitrary opponent
  * **E**: Playing practice games against itself
* **Handwriting recognition**
  * **T**: Recognizing hand-written words
  * **P**: Percentage of words correctly classified
  * **E**: Database of human-labeled images of handwritten words
* **Autonomous driving**
  * **T**: Driving on four-lane highways using vision sensors
  * **P**: Average distance traveled before a human-judged error
  * **E**: A sequence of images and steering commands recorded while observing a human driver
* **Email spam filtering**
  * **T**: Categorize email messages as spam or legitimate
  * **P**: Percentage of email messages correctly classified
  * **E**: Database of emails, some with human-given labels

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FT4q6sJsKLqj3T6UIlt88%2FScreenshot%202025-01-14%20at%2010.02.38%E2%80%AFAM.png?alt=media&#x26;token=c76aa29a-077d-43c9-b558-2c6f172e4502" alt=""><figcaption></figcaption></figure>

### Task T

* Classifications of data
* Ranking
* Recommendation&#x20;
* Clustering&#x20;
* Density estimation

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F69U3DX2u7fjQgJDOSyjr%2F1.png?alt=media&#x26;token=714a7d80-8253-48d8-a5cb-f6982dd29a3c" alt=""><figcaption></figcaption></figure>

### When do we use Machine Learning?

• Human expertise does not exist (navigating on Mars)&#x20;

• Humans can’t explain their expertise (speech recognition)&#x20;

• Models must be customized (personalized medicine)&#x20;

• Models are based on huge amounts of data (genomics)&#x20;

• Learning isn't always useful

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F9noA61h1fs5Q5NvNXhJn%2FScreenshot%202025-01-14%20at%2010.11.15%E2%80%AFAM.png?alt=media&#x26;token=f5d4ec32-f5bc-4ccb-879c-2f9ad25d1419" alt=""><figcaption></figcaption></figure>

### Sample Applications of ML

* Web search &#x20;
* Computational biology
* Finance
* E-commerce
* Space exploration
* Robotics
* Information extraction
* Social networks
* Debugging software
* Medical imaging

## Lecture 2: *(<mark style="color:orange;">12/01/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/TKTf0vJ54QxTT-b3DnDoPNni6nQTQGrtENY1CTs6LouRCLvaLOcZbRY5N_DN-EIuowLzZ6L9NqXMasH0.NOXIG_SsVPYX_sO_)

### **Supervised Learning and Unsupervised Learning**

### **Supervised learning**:

When we have training data together with the desired output (labels).

Example: email spam detection, or identifying dogs/cats in animal images.

#### **Binary vs Multi-class classification**

Binary => two classes (e.g., true/false)

Multi-class => more than two classes

### Unsupervised learning

When we have training data only.

There is no labelled data.

The goal is to find patterns (clusters) in the given data.

Example: astronomical data, market segmentation.

### Clustering:

Finding patterns.\
The number of clusters is supplied by the user; clustering is an unsupervised learning task.

### Reinforcement Learning

Learning with rewards:\
the agent gets a reward for a right action and a penalty for a wrong one.

Example: ChatGPT and self-driving cars.

### ML System Classification

Batch vs Online Learning

**Batch** => learning from offline (static) data

**Online Learning** => learning incrementally from streaming data (e.g., ChatGPT)

Instance-Based vs Model-Based

### Challenges of Machine Learning

* Insufficient data
* Non-representative training data
* Poor-quality data
* Irrelevant features

### Performance Measure

It is also called the cost.

Root Mean Square Error:

$$
RMSE = \sqrt{\frac{1}{n}\sum\_{i=1}^{n}(P\_i - O\_i)^2}
$$

Mean Square Error:

$$
MSE = \frac{1}{n}\sum\_{i=1}^{n}(P\_i - O\_i)^2
$$
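
A minimal NumPy sketch of both measures, assuming `predicted` and `observed` are arrays of the same length (the numbers are illustrative):

```python
import numpy as np

# Hypothetical predictions and observed target values
predicted = np.array([2.1, 2.9, 3.7])
observed = np.array([2.0, 2.8, 3.6])

mse = np.mean((predicted - observed) ** 2)  # Mean Square Error
rmse = np.sqrt(mse)                         # Root Mean Square Error

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```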

### Explore Data

1. **Understand the data**

* missing data
* data range, count

2. **Visually inspect the data using histograms**

* look for patterns and outliers
* Library => import matplotlib.pyplot as plt

3. **Duplicate data** => remove it
4. **Segregate data**
5. **Get a unique identifier**
6. **Find correlations** => Standard Correlation Coefficient (Pearson’s r)

Set aside 20% of the data as a test set, so that evaluation depends less on the particular training data.
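
A rough sketch of these exploration steps with pandas and scikit-learn; the file name and column contents are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("housing.csv")      # hypothetical dataset

print(df.info())                     # missing values, counts, dtypes
print(df.describe())                 # ranges, mean, std

df.hist(bins=50, figsize=(12, 8))    # look for patterns and outliers
plt.show()

df = df.drop_duplicates()            # remove duplicate rows

print(df.corr(numeric_only=True))    # Pearson's r between numeric features

# Set aside 20% of the data for testing
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
```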

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F93wWL8eF7WVieyVCjm6U%2FScreenshot%202025-01-14%20at%2010.21.34%E2%80%AFAM.png?alt=media&#x26;token=147452e6-21ed-4829-b89f-413d59eff3d9" alt=""><figcaption></figcaption></figure>

## Lecture 3: (*<mark style="color:orange;">18/01/2025</mark>*)

[**Class Recording**](https://futurense.zoom.us/rec/play/ojgH9HzpeGnwfdn1GdKUsZ-0BUivPtLq2B4ddL-fEj2zB1ryFeWaQPENxvxafDvXeg2NQrx0EKe3tZX_.OReCu7j0uGAxI8pj)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FJ5QgboVyKyC9VNnCEUqX%2FLecture_2_DataPreProcessing.pdf?alt=media&token=49dcbedf-73dd-4074-a1cb-aa7a2a343fd4>" %}

### Data Preparation:

80% of data analysis time is spent on cleaning and preparing data.

Imputation: replacing null or blank values with zero, the mean, or the median, so that the whole record is not discarded; the missing value is mapped to a substitute value instead.

Good Imputation: ?? Homework

### Data Cleaning:&#x20;

Capping: limiting or removing outliers

Encoding

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FHpAH8YgpcHjHjG88GdiI%2Fdata%20cleaning.png?alt=media&#x26;token=38ed39b0-11c1-4b1b-8626-4492730b42b3" alt="" width="563"><figcaption></figcaption></figure>

Converting text to numbers, i.e., mapping text into numerical values (e.g., ordinal\_encoder).
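
A small sketch of imputation and ordinal encoding with scikit-learn; the columns and values are made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "rooms": [3, None, 4, 5],   # numeric column with a missing value
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})

# Imputation: fill missing numeric values with the median instead of dropping rows
imputer = SimpleImputer(strategy="median")
df[["rooms"]] = imputer.fit_transform(df[["rooms"]])

# Encoding: map text categories to numbers
encoder = OrdinalEncoder()
df[["ocean_proximity"]] = encoder.fit_transform(df[["ocean_proximity"]])

print(df)
```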

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FeBaNPVEcQQQcZnTH0Y5j%2Fdata%20cleaning%20dig.png?alt=media&#x26;token=71df9e35-0f81-44b7-9d69-1796c2eb787f" alt="" width="375"><figcaption></figcaption></figure>

### Features Scaling and transformation&#x20;

ML algorithms don’t perform well when the numerical input attributes have very different scales.\
Features with a roughly Gaussian distribution (e.g., centered near 0 after standardization) work best for ML.

1. Feature Scaling: Adjusting the range of features (e.g., normalization or standardization) to ensure all features contribute equally to the model, preventing dominance by features with larger magnitudes.
2. Feature Transformation: Modifying features (e.g., log, square root, or polynomial transformations) to make data more suitable for modeling, often improving linearity or addressing skewness.

### Multimodal Distribution

Hyperparameter tuning: grid search method

Feature importance: drop features with near-zero importance

Evaluate on the test set

Linear Regression:

{% hint style="info" %}
for reference ML ppt CS229
{% endhint %}

### Mean square error problem ([Cost function](https://m-tech-in-artificial-intelligenc.gitbook.io/manvendrapratapsinghdev/trimester-1/broken-reference))

Iterate to find θ.

**Gradient Descent**: <mark style="color:red;">Mountain Example</mark>&#x20;

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FSj770lClP2Xk4YrjgmgX%2FScreenshot%202025-01-19%20at%208.10.02%E2%80%AFAM.png?alt=media&#x26;token=71707b80-5a91-490a-a8fb-4e9132ec2812" alt="" width="563"><figcaption></figcaption></figure>

## Lecture 4: *(<mark style="color:orange;">19/01/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/A4bn_Ki2KcmMhjFBi5-HAfteRI0xOwVHqG1Ft6PSRB-Psmlum_-ERDujYOlX92-6xCn0ytXkTNqxR78v.YypNZ7tV2gWM0lwj)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FVJ7drU1466AfYeL3HEm1%2FLecture_4_LinearRegressin.pdf?alt=media&token=fb3fffe0-6320-4f54-87eb-4f7b4ca5e192>" %}

### <mark style="color:purple;">**Linear Regression**</mark>&#x20;

models the relationship between a dependent variable Y and one or more independent variables X using a linear equation:

$$
Y= \beta\_0 + \beta\_1 X + \epsilon
$$

<mark style="background-color:yellow;">where β\_0 is the intercept, β\_1 is the slope, and epsilon(ϵ) is the error term.</mark>

#### <mark style="color:green;">**Example**</mark>**:** Predicting house prices based on square footage.&#x20;

If Y= 50000 + 200X,&#x20;

then a 1000 sq. ft. house costs 50000+200(1000) = 250000.

### <mark style="color:purple;">**Hypothesis function**</mark>&#x20;

In Linear Regression, the hypothesis function represents the predicted output as a linear combination of input features:

$$
h(X) = \theta\_0 + \theta\_1 X\_1
$$

<mark style="background-color:yellow;">where θ\_0(intercept) and θ\_1 (slope) are learned parameters</mark>.

#### <mark style="color:orange;">why we calculate the hypothesis?</mark>

"To estimate the relationship between input 𝑋 and output 𝑌, allowing us to make predictions for new data."

<mark style="color:green;">**Example**</mark>**:** If h(X)= 50 + 10X, for X=5, the predicted value is h(5)=50+10(5)= 50 + 10(5) = 100.

#### <mark style="color:purple;">**Hypothesis function for Multiple Linear Regression**</mark>&#x20;

where predictions depend on multiple input features:

$$
h(X)= \theta\_0 + \theta\_1 X\_1 + \theta\_2 X\_2 + ... + \theta\_n X\_n
$$

<mark style="background-color:yellow;">Each X\_i represents an independent variable, and θ\_i are the learned coefficients.</mark>

<mark style="color:green;">**Example**</mark>**:** Predicting house price based on size (X\_1) and number of rooms (X\_2):

h(X)=50000+200X\_1+10000X\_2

For X\_1= 1000 sq. ft, X\_2 = 3 rooms, the price is **₹2,80,000**.
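
A hedged scikit-learn sketch of fitting such a model; the tiny dataset below is invented to mirror the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (sq. ft), number of rooms -- invented sample data
X = np.array([[800, 2], [1000, 3], [1200, 3], [1500, 4]])
y = np.array([230000, 280000, 320000, 390000])

model = LinearRegression()
model.fit(X, y)

print("Intercept (theta_0):", model.intercept_)
print("Coefficients (theta_1, theta_2):", model.coef_)
print("Price for 1000 sq. ft, 3 rooms:", model.predict([[1000, 3]])[0])
```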

### <mark style="color:purple;">Calculation Of θ ⇒</mark> <mark style="color:red;">Cost Function</mark>

The values of θ\_0, θ\_1, … are found using **Gradient Descent** or the **Normal Equation**.

#### **1. Gradient Descent Algorithm**

Minimizes the cost function:

$$
J(θ)= \frac{1}{2m} \sum\_{i=1}^{m} (h(X\_i) - Y\_i)^2
$$

* <mark style="background-color:yellow;">**m**</mark> <mark style="background-color:yellow;"></mark><mark style="background-color:yellow;">= Total number of training examples.</mark>
* <mark style="background-color:yellow;">**Xi​**</mark> <mark style="background-color:yellow;"></mark><mark style="background-color:yellow;">= Input features of the ith training example.</mark>
* <mark style="background-color:yellow;">**Yi​**</mark> <mark style="background-color:yellow;"></mark><mark style="background-color:yellow;">= Actual output (target value) of the ith training example.</mark>
* <mark style="background-color:yellow;">**h(Xi​)**</mark> <mark style="background-color:yellow;"></mark><mark style="background-color:yellow;">= Predicted output using the hypothesis function.</mark>

**Update Rule:**

$$
\theta\_j := \theta\_j - \alpha \frac{\partial J}{\partial \theta\_j}
$$

<mark style="background-color:yellow;">where α is the learning rate.</mark>

<mark style="color:green;">**Example**</mark>**:** For data points (1,2), (2,2.8), (3,3.6), running gradient descent iteratively updates θ\_0 and θ\_1 to best fit h(X).

#### **2. Normal Equation (Direct Method)**

Used when data is small, as it’s computationally expensive for large datasets.

Solves for θ without iteration:

$$
θ=(X^TX)^{-1}X^TY
$$

### <mark style="color:orange;">Role of the hypothesis function</mark>

The hypothesis function serves as a **mathematical model** that maps inputs X to outputs Y, whether continuous (regression) or discrete (classification)

In **Regression**, h(X) outputs continuous values, meaning the predictions can take any real number. <mark style="color:green;">**Example**</mark>**:** Predicting house prices: h(X) = 50000 + 200X can output any value like ₹2,50,000 or ₹2,50,500.\ <mark style="background-color:orange;">Predict quantities</mark>

In **Classification**, h(X) outputs discrete values, meaning predictions belong to predefined categories. **Example:** Spam detection: h(X) predicts either **Spam (1)** or **Not Spam (0)** based on email features.

<mark style="background-color:orange;">Predict label</mark>

### <mark style="color:purple;">Least Squares Optimization Problem</mark>

The **Least Squares Optimization Problem** finds the best-fit line by minimizing the sum of squared errors between predicted and actual values:

$$
J(θ)= \sum\_{i=1}^{m} (Y\_i - h(X\_i))^2
$$

where

$$
h(X)= \theta\_0 + \theta\_1 X.
$$

**Methods:**

1. **Gradient Descent** iteratively updates θ to minimize J(θ).
2. **Normal Equation** directly computes θ without iteration.

<mark style="color:green;">**Example**</mark>**:** Fitting a line to points (1,2),(2,2.8),(3,3.6)(1,2), (2,2.8), (3,3.6) by minimizing the squared differences between actual Y and predicted h(X).

### **Pitfalls of Least Squares Optimization:**

1. **Sensitive to Outliers:** Large errors get squared, making the model biased toward extreme values.
2. **Overfitting in High Dimensions:** Too many features (X) can lead to poor generalization.
3. **Multicollinearity:** Highly correlated features cause unstable parameter estimates.
4. **Non-Linearity:** Least squares assumes a linear relationship, failing for complex patterns.
5. **Heteroscedasticity:** Unequal variance in errors violates model assumptions.

<mark style="color:green;">**Example**</mark>**:** If one house in a dataset has an extreme price (₹1 crore while others are ₹10-20 lakhs), the least squares model will be skewed.

### <mark style="color:purple;">**Learning Rate**</mark><mark style="color:purple;">:</mark>&#x20;

The **learning rate** (α) is a hyperparameter that controls how much Gradient Descent updates the model parameters in each step:

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FFx49R8AfhD381nRmqfzM%2FScreenshot%202025-01-19%20at%208.17.30%E2%80%AFAM.png?alt=media&#x26;token=4aa8037c-48cb-4e92-bac4-8ec5b2a957b8" alt="" width="375"><figcaption></figcaption></figure>

$$
\theta\_j := \theta\_j - \alpha \frac{\partial J}{\partial \theta\_j}
$$

#### **Effects:**

* **Too high (α≫1)** → Divergence (jumps over the minimum).
* **Too low (α≪1)** → Slow convergence.

<mark style="color:green;">**Example**</mark>**:** If α= 0.01, the model learns steadily, but if α= 10, it may overshoot and fail to minimize the cost function.

* When do we stop?
  * after a fixed number of iterations
  * when the improvement falls below a threshold

`Numerical on MSE`

### <mark style="color:purple;">**Feature Scaling**</mark>&#x20;

improves Gradient Descent convergence by normalizing feature values. Two common methods:

**`Min-Max Scaling:`**

$$
X′= \frac{X - X\_{\min}}{X\_{\max} - X\_{\min}}
$$

<mark style="background-color:yellow;">Scales values between 0 and 1.</mark>

**`Standardization (Z-score):`**

$$
X′= \frac{X - \mu}{\sigma}
$$

<mark style="background-color:yellow;">Centers mean at 0 with unit variance.</mark>

**Example:** If house sizes range from 500 to 5000 sq. ft, without scaling, Gradient Descent takes longer to converge. Normalizing makes updates uniform, speeding up learning.
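
A short scikit-learn sketch of both methods; the house sizes are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

sizes = np.array([[500.0], [1200.0], [3000.0], [5000.0]])  # house sizes in sq. ft

minmax = MinMaxScaler().fit_transform(sizes)    # values scaled to [0, 1]
zscore = StandardScaler().fit_transform(sizes)  # mean 0, unit variance

print("Min-Max:\n", minmax.ravel())
print("Z-score:\n", zscore.ravel())
```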

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FkdRtOZEugM6jaylD2gfS%2FScreenshot%202025-01-19%20at%208.19.58%E2%80%AFAM.png?alt=media&#x26;token=e4fb2f0d-4881-4a0e-8e34-8fae917d43ea" alt="" width="375"><figcaption></figcaption></figure>

### <mark style="color:purple;">Batch Gradient Decent vs Stochastic GD</mark>

* **Batch Gradient Descent** computes gradients using the entire dataset, making it slow for large datasets but stable. **Batch** is like a linear search: all the data is given to the machine at every step, so the time per update grows with the dataset size.
* **Stochastic Gradient Descent (SGD)** updates parameters using one random instance at a time, making it faster but noisy. **Stochastic** is like a random search: it picks a random point and takes a gradient step; it never settles exactly at the minimum but ends up close to it, because the training instance changes on every update.

**SGD does not converge exactly** but oscillates near the minimum, helping escape local minima.

<mark style="color:green;">**Example**</mark>**:** In house price prediction, Batch GD updates after processing all houses, while SGD updates after each house, making it faster but less stable.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FZTQfnynfgGYAn2tagJU9%2FScreenshot%202025-01-19%20at%208.25.45%E2%80%AFAM.png?alt=media&#x26;token=44a0ea5e-749e-4ce3-9b70-aa9381c4117d" alt="" width="563"><figcaption></figcaption></figure>

### <mark style="color:purple;">Mini Batch Gradient Decent</mark>

Mini-batch GD is a mix of both: instead of a single random instance, it picks small random subsets (mini-batches) and performs a batch update on each, so it converges very close to the global minimum.
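
A rough sketch of mini-batch-style updates with scikit-learn's `SGDRegressor` and `partial_fit`; the synthetic data, batch size, and epoch count are arbitrary:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, size=1000)  # noisy line y = 5 + 3x

sgd = SGDRegressor(learning_rate="constant", eta0=0.01)

batch_size = 32
for _ in range(50):                     # epochs
    idx = rng.permutation(len(X))       # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        sgd.partial_fit(X[batch], y[batch])   # one update per mini-batch

print(sgd.intercept_, sgd.coef_)  # should end up near 5 and 3
```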

### Normal Equation Derivation

* Not useful for large datasets, because it requires inverting a matrix, which is costly.
* It computes θ directly, with no iterative gradient steps.
* In practice it is suitable only for moderately sized datasets (up to roughly 70k rows).
* If the inverse does not exist, clean the data or use another approach (e.g., the pseudo-inverse, as in the sketch below).
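
A NumPy sketch of the Normal Equation on the three example points, using the pseudo-inverse so the computation still works when the plain inverse does not exist:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column of 1s for the intercept
Y = np.array([2.0, 2.8, 3.6])

# theta = (X^T X)^{-1} X^T Y, computed via the pseudo-inverse for numerical safety
theta = np.linalg.pinv(X.T @ X) @ X.T @ Y
print(theta)  # approximately [1.2, 0.8]
```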

### <mark style="color:purple;">Polynomial Regression</mark>

**Polynomial Regression** extends Linear Regression by adding polynomial terms to capture non-linear relationships:

$$
h(X) = \theta\_0 + \theta\_1 X + \theta\_2 X^2 + ... + \theta\_n X^n
$$

**Example:** Predicting salary based on experience, where a simple linear model fails. If

h(X) = 5000 + 2000X + 300X^2

for X = 5 years, the predicted salary is **₹22,500**.
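
A small scikit-learn sketch of fitting this kind of curve with `PolynomialFeatures`; the experience values are invented and the targets follow the example formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [5], [8]])              # years of experience (invented)
y = 5000 + 2000 * X.ravel() + 300 * X.ravel() ** 2   # salaries following the example curve

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[5]]))  # roughly 22,500 for 5 years of experience
```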

### <mark style="color:purple;">**Learning Curves**</mark>

A plot of training and validation errors vs. training size, showing model performance.

#### <mark style="color:blue;">**Underfitting**</mark>

Occurs when the model is too simple (high bias), leading to high training and validation errors.\ <mark style="color:green;">**Example**</mark>**:** Linear regression on a curved dataset results in poor predictions.

#### <mark style="color:blue;">**Overfitting**</mark>

Occurs when the model is too complex (high variance), fitting noise instead of patterns.\ <mark style="color:green;">**Example**</mark>**:** A high-degree polynomial perfectly fits training data but performs poorly on new data.
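
One way to draw such a plot is scikit-learn's `learning_curve`; the model and data below are placeholders (a plain linear model on slightly curved data, so some underfitting is visible):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = 2 + 0.8 * X.ravel() ** 2 + rng.normal(0, 0.3, size=200)  # curved data

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring="neg_mean_squared_error",
)

plt.plot(sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation error")
plt.xlabel("Training set size"); plt.ylabel("MSE"); plt.legend(); plt.show()
```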

## Lecture 5: *(<mark style="color:orange;">25/01/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/wClklIgXDOZnTiJHfeRfvt-3ksHIskHZb9lK4zItUpPHl9E8PWEYjtLGZhSEDGSbb0cqQFmY3FsJss0Q.IQz19p--Vw6m3CTW)

### <mark style="color:purple;">Practical Session only</mark>

## Lecture 6: *(<mark style="color:orange;">01/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/j5WSMXnsi7dDCuWCqmxB9o2JIJPmfF8wMmw-wDEKxFnjHwrzqMrpAgavUxUWEIJLgUhJ-SbNfFIauXTg.SLcZTN3I487thmxw)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FXhnd9wZYxWR0oqEyPqg1%2FLecture_6_LogisticRegression.pdf?alt=media&token=7e5c3296-df00-442b-ac73-ad71aecfc1b5>" %}

### Regularised Linear Models - tackle overfitting

Lasso Regression

Elastic Net

Error: Bias & Variance Tradeoff & Irreducible error

**Bias** => the model does not fit the data well, i.e., it underfits

**Variance** => a small change in the data changes the model a lot, i.e., it overfits

**Irreducible** => noise in the data that no model can fit; clean the data and remove outliers to reduce it
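
A brief sketch of Lasso and Elastic Net in scikit-learn; the data and alpha values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)                     # L1 penalty: pushes some coefficients to 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```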

## Lecture 7: *(<mark style="color:orange;">02/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/6SA3as3yDl6U5dU_YYqJjZOfolIcXAus-vwXJkRfkwovGxcyJMqaR5JWk5wvfDXEsU1wyx2PHhDCkryX.idOUmn4znoWW8nj5)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FGNR7JzhRhBWJqYAuB8bU%2FLecture_7_Classification.pdf?alt=media&token=8b7e5a6e-2084-4311-b3b9-04a648176baa>" %}

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F5lGUfrScv2C0MvVayJLn%2Fregression-vs-classification_simple-comparison-image_v3.png?alt=media&#x26;token=6d65f7e7-0ed5-4f69-8431-173db9aaeeed" alt=""><figcaption></figcaption></figure>

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FLjnWJJLhDao4BBS1LCWo%2Ftypes-of-regression.jpg?alt=media&#x26;token=ab2aa900-64a6-4db3-9e97-9559173820ae" alt="" width="563"><figcaption></figcaption></figure>

## Lecture 8: *(<mark style="color:orange;">08/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/6ojdaYMA9jcG2augzQbFX-0byogaiFCdkCQVoNA14JQkEFM3TLV-WWShZKt4RO5Sapov4el91WpzQF-C.LwnO69GmhpCjsPDc)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FWFXDDAGiq4IGYtRElV6u%2FLecture_8_DecisionTree.pdf?alt=media&token=4e01d49d-9f07-4ae5-be6e-6b9896e29ab9>" %}

[**https://chatgpt.com/share/67ac361b-1e38-800c-8d24-9e3991a11f25**](https://chatgpt.com/share/67ac361b-1e38-800c-8d24-9e3991a11f25)

## Lecture 9: *(<mark style="color:orange;">09/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/C59FJ9vdLJ5HWwoBqMbG1sBeCtblObDVSJx1viu2a8iVNOkaTVDKsrW6r6CSXrKJUHgJMIhYNsNF4_z4.bSh7D9e9tunM16I0)

[**Doubt session  Recording**](https://futurense.zoom.us/rec/play/YNrPUbqtYKkTEkHUTp99wYBFgrUwHqpFITE61ZKNmxuAo59XGLKHXrWzC9u6Je_Mci2RMAmZ56DNGnYM.YggLz4fIBwO-Ov0q)

\
[**https://chatgpt.com/share/67af6047-4a34-8006-a25b-168265542c77**](https://chatgpt.com/share/67af6047-4a34-8006-a25b-168265542c77)

## Lecture 10: *(<mark style="color:orange;">15/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/pHUB84cOxmdk4_vmRvOmpKtnNA-BMun-FTF_A34RsnEtYmt8AnscZOhMJqyQQWPrHjzn6OryMxVP1IaN.Qe4elChmrF_GNd30)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FDPQgFiVS73aLj240DvI7%2FLecture_10_RandomForest.pdf?alt=media&token=9d084151-e451-4ee9-b6ed-89fa8af6c150>" %}

### <mark style="color:purple;">Random Forest</mark>

**Random Forest** is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. It uses **bagging** and **feature randomness** for robustness.

### <mark style="color:purple;">**Bagging (Bootstrap Aggregating)**</mark>&#x20;

in Random Forest improves stability and accuracy by training each decision tree on a different **random subset** of the dataset with replacement. This reduces variance and prevents overfitting.

<mark style="color:green;">**Example**</mark>**:** In a customer churn prediction model, each tree is trained on a different bootstrapped sample, and the final decision is made by averaging (regression) or voting (classification).

<mark style="color:green;">**Example 2**</mark>**&#x20;⇒**&#x20;

**Bagging in Random Forest** can be understood using an example of classifying apples and oranges. Suppose we have a dataset of fruits with features like **color, weight, and texture**.

Each decision tree in the Random Forest is trained on a **random subset** of this dataset (with replacement). Some trees may focus more on **color**, while others on **weight**. When classifying a new fruit, the final decision is made by majority voting.

<mark style="color:green;">**Like**</mark>**:**

* **Tree 1:** Says "Apple" based on red color
* **Tree 2:** Says "Orange" based on texture
* **Tree 3:** Says "Apple" based on weight

Final prediction: **"Apple" (majority vote).**

### <mark style="color:purple;">**Feature Importance in Random Forest**</mark>&#x20;

measures how much each feature contributes to the model's decision-making. It helps in feature selection by identifying the most influential features.

<mark style="color:green;">**Example**</mark>**:** In a fruit classification model, **color** might be the most important feature, followed by **texture** and **weight**.

**Formula:**

$$
FI\_j = \frac{1}{N} \sum\_{i=1}^{N} I\_{split, j}^{(i)}
$$

<mark style="background-color:yellow;">where</mark>:

* FI\_j = Feature importance of feature j
* N = Number of trees
* I\_{split, j}^{(i)} = Importance of feature j in tree i

**Code to get Feature Importance:**
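
A possible sketch with scikit-learn's `RandomForestClassifier` on the same made-up fruit data (feature names and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["color", "weight", "texture"]
X = np.array([[0.9, 150, 0.2], [0.8, 160, 0.3], [0.3, 120, 0.8],
              [0.2, 130, 0.9], [0.85, 155, 0.25], [0.25, 125, 0.85]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = apple, 0 = orange

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ averages each feature's split importance across all trees
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```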

## Lecture 11: *(<mark style="color:orange;">15/02/2025</mark>`)(Evening class)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/jpi0HLOQad6EtlbhH98nh5gXAttQPSZkZjt2L8E8OXGgS5cupI-wcANYhw2slC3t44gWj96y9kuB8zA.BE8sibTfcLs92g-g)

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FfuRpyJ5DDelfyrcZ2Yui%2FLecture_11_SVM.pdf?alt=media&token=abf09cd0-840c-4011-8423-8e3f803bd218>" %}

### <mark style="color:purple;">Boosting</mark>

**Boosting** is an ensemble technique that combines weak learners sequentially, where each model corrects the errors of the previous one, improving overall accuracy. It reduces bias and variance.

<mark style="color:green;">**Example**</mark>**:** In spam detection, boosting refines misclassified emails by focusing more on difficult examples in each iteration.

### <mark style="color:purple;">**Example: Financial Fraud Detection (Using Bagging + Boosting Together)**</mark>

#### <mark style="color:orange;">**How?**</mark>

1. **Bagging inside Boosting:** Use **Random Forest** (bagging) as the base estimator in **AdaBoost/XGBoost** to make boosting more robust.
2. **Boosting inside Bagging:** Train multiple boosted models (e.g., **Gradient Boosted Trees**) and aggregate their predictions like bagging.

<mark style="color:orange;">**Step 1: Bagging (Random Forest) for Robust Feature Selection**</mark>

* A **Random Forest** model is trained using multiple decision trees on different subsets of transaction data.
* Each tree gives independent predictions, and majority voting ensures stable, less overfitting-prone results.
* <mark style="color:green;">**Example**</mark>**:**
  * Tree 1: Says "Fraud" based on transaction amount.
  * Tree 2: Says "Not Fraud" based on merchant type.
  * Tree 3: Says "Fraud" based on location difference.
  * **Final Bagging Prediction:** "Fraud" (majority vote).

<mark style="color:orange;">**Step 2: Boosting (XGBoost) for Enhanced Accuracy**</mark>

* The output from Random Forest is then fed into an **XGBoost model**, which corrects misclassifications.
* The model assigns **higher weights** to misclassified transactions and improves fraud detection.
* **Example:**
  * If Bagging misclassified a fraud case due to a rare merchant, Boosting will refine it using new weighted trees.

<mark style="color:orange;">**Final Outcome**</mark>

By combining **Bagging (for robustness)** and **Boosting (for accuracy improvement)**, the system detects fraud more reliably, reducing false positives and catching hard-to-detect fraudulent transactions.&#x20;

### <mark style="color:purple;">**Support Vector Machine (SVM)**</mark>

Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal **hyperplane** to separate classes with maximum margin. It works well for both linear and non-linear classification using **kernels**.

<mark style="color:green;">**Example**</mark>**:** In spam detection, SVM separates spam and non-spam emails based on word frequency patterns.

### <mark style="color:purple;">**Hard Margin vs. Soft Margin in SVM**</mark>

1. **Hard Margin SVM:**
   * Used when data is **linearly separable** with no misclassification.
   * <mark style="color:green;">**Example**</mark>**:** Perfectly separating red and blue balls in a 2D plane without overlap.
2. **Soft Margin SVM:**
   * Allows some misclassification for better generalization in **non-linearly separable** data.
   * <mark style="color:green;">**Example 1**</mark>**:** Classifying emails as spam or non-spam, where some emails might be misclassified due to ambiguous words.
   * <mark style="color:green;">**Example 2**</mark>**:** Separating dog and cat images where some breeds (e.g., Pomeranian vs. Persian cat) have similar features.

### <mark style="color:purple;">**Regularization Hyperparameter (CC)**</mark>

The **Regularization Hyperparameter (CC)** in SVM controls the trade-off between **maximizing margin** and **minimizing misclassification**.

* **High CC (low regularization)** → Focuses more on classifying all points correctly, leading to **overfitting**.
* **Low CC (high regularization)** → Allows some misclassification, leading to **better generalization**.

<mark style="color:green;">**Example 1**</mark>**:** In spam detection, a high CC might overfit to specific spam words, while a low CC generalizes better.\ <mark style="color:green;">**Example 2**</mark>**:** In image classification, a low CC prevents overfitting to noise in training images.

### <mark style="color:purple;">**Non-Linear SVM**</mark>

When data is **not linearly separable**, SVM uses **kernel tricks** to map it into a higher-dimensional space where a hyperplane can separate the classes.

<mark style="color:green;">**Example**</mark>**:**

Classifying red and blue points that form concentric circles. A linear SVM fails, but using a **Radial Basis Function (RBF) kernel**, we transform data into a higher dimension where a clear separation is possible.

<mark style="color:orange;">**Graph (Visualization of Non-Linear SVM)**</mark>

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FqNVU2MPdvt4hjnybDeey%2FScreenshot%202025-02-15%20at%208.43.58%20PM.png?alt=media&#x26;token=91bf814d-fc88-47a2-85c7-f0ddedcc1bc7" alt="" width="563"><figcaption></figcaption></figure>

The plot shows how **SVM with an RBF kernel** separates non-linearly distributed data (moons dataset). The **decision boundary** curves around the data, demonstrating how kernel tricks enable SVM to handle complex patterns.&#x20;
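
A bare sketch of the same idea with scikit-learn's `make_moons` and an RBF-kernel `SVC`; the gamma and C values are arbitrary:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# The RBF kernel implicitly maps the moons into a space where they become separable
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=2, C=1))
clf.fit(X, y)

print("Training accuracy:", clf.score(X, y))
```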

## Lecture 12: *(<mark style="color:orange;">22/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/6XyB9M5pTZe61hYGYHrv0r7G-9U8c8DToUMSJwV7ysDCn5sp-waQtR2XYtoVrKV9TjrEC_-MAKkrmQZr.s8MOBFD_jFVruf_u)

### <mark style="color:purple;">**Kernel Function in SVM**</mark>

A **kernel function** transforms non-linearly separable data into a **higher-dimensional space**, making it linearly separable.

#### <mark style="color:orange;">**Common Kernel Types**</mark>**:**

1. **Linear Kernel** – Used when data is already linearly separable.
   * **Example:** Separating spam vs. non-spam emails based on word frequency.
2. **Polynomial Kernel** – Maps data into polynomial space for curved decision boundaries.
   * **Example:** Classifying different species of flowers with overlapping petal lengths.
3. **RBF (Gaussian) Kernel** – Maps data to an infinite-dimensional space, capturing complex patterns.
   * **Example:** Detecting fraudulent transactions with non-linear relationships.
4. **Sigmoid Kernel** – Similar to a neural network activation function.
   * **Example:** Handwriting recognition where patterns need non-linear separation.

[<mark style="color:blue;">**SVM CODE**</mark>](https://github.com/manvendrapratapsinghdev/IITJMaterial/blob/main/T1/ML/Code/SVM.ipynb)

## Lecture 13: *(<mark style="color:orange;">23/02/2025</mark>`)`*

[**Class Recording**](https://futurense.zoom.us/rec/play/SS27MBKAcWTrbLhhyLWhVuyKlSPh85SqPOtzpnVnMKiOjaeVRNWjMeb2ezOw7eMZ2kLaPPoJoOZ5egjJ.EoZ5fGxoSTFmk2Yx)

#### **Analyzing Covariance Matrix in ML**

**What is a Covariance Matrix?**

A **covariance matrix** is a square matrix that captures the relationships between multiple variables in a dataset. Each element C(i, j) represents the covariance between variable X\_i and X\_j:

$$
C(i,j)= \frac{1}{n} \sum\_{k=1}^{n} (X\_{ki} - \bar{X\_i})(X\_{kj} - \bar{X\_j})
$$

* **Positive covariance** → Variables increase together.
* **Negative covariance** → One variable increases while the other decreases.
* **Zero covariance** → No linear relationship.

***

**Why is it Important in ML?**

1. **Feature Relationship**: Helps understand how features interact.
2. **Dimensionality Reduction**: Used in **PCA (Principal Component Analysis)** to find uncorrelated axes.
3. **Multicollinearity Detection**: Identifies redundant features in regression models.

***

**Example with Visualization**

Consider a dataset with two features, **Height (cm)** and **Weight (kg)**.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FcCtdzIYWyu1OuVIUFO5V%2F121.png?alt=media&#x26;token=c0953f88-2803-4d12-8fcf-0a54ea36a163" alt="" width="563"><figcaption></figcaption></figure>

**Interpretation**: If the covariance matrix has a **high positive value**, Height and Weight are strongly correlated.

***

**Graphical Representation**

* **Heatmap of the Covariance Matrix**

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative heights (cm) and weights (kg); rows are variables, columns are samples
data = np.array([[170.0, 165.0, 180.0, 175.0, 160.0],
                 [ 68.0,  60.0,  80.0,  72.0,  55.0]])
cov_matrix = np.cov(data)

sns.heatmap(cov_matrix, annot=True, cmap="coolwarm")
plt.title("Covariance Matrix Heatmap")
plt.show()
```

This helps visualize how different features are related in high-dimensional datasets.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F2cuTbFd4ZGPi8R4Zgb6U%2FScreenshot%202025-02-23%20at%201.06.00%E2%80%AFPM.png?alt=media&#x26;token=aa28d999-2abe-4846-92cd-d6bb37314fd9" alt="" width="563"><figcaption></figcaption></figure>

***

#### **Relevance to PCA (Dimensionality Reduction)**

PCA relies on the **eigenvectors** and **eigenvalues** of the covariance matrix to transform correlated variables into **uncorrelated principal components**.

```python
from sklearn.decomposition import PCA

# `data` is the 2 x n heights/weights array from the covariance example above
pca = PCA(n_components=2)
pca.fit(data.T)  # PCA expects samples as rows, so the 2 x n array is transposed

print("Principal Components:\n", pca.components_)  # exact numbers depend on the data
```

This transforms the dataset into new axes where features are **uncorrelated**, making ML models more efficient.

```
Principal Components:
 [[ 0.77334214  0.63398891]
 [-0.63398891  0.77334214]]
```

***

**Conclusion**

The covariance matrix is a fundamental tool in ML for **understanding feature relationships**, **reducing dimensions**, and **improving model efficiency.**

## Lecture 14: *(<mark style="color:orange;">01/03/2025</mark>`)`*

<mark style="color:red;">**Class Cancelled**</mark>&#x20;

## Lecture 15: *(<mark style="color:orange;">02/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FZvrhu6eCeXtJ0SNHxCjU%2FLecture_KMEans.pdf?alt=media&token=bae304d5-a9ce-40cd-bc51-62135e0a30f8>" %}

## <mark style="color:red;background-color:yellow;">Unsupervised Learning</mark>

Unsupervised learning finds hidden patterns or structures in data without labeled outputs. It is widely used in clustering, anomaly detection, and dimensionality reduction.

### <mark style="color:purple;">K-means Clustering: A Simple Yet Powerful Algorithm</mark>

K-Means is an unsupervised learning algorithm that groups data into K clusters by minimizing intra-cluster distance. It follows an iterative process of centroid initialization, point assignment, centroid update, and convergence.

* <mark style="color:blue;">How K-Means Works:</mark>
  * Select the number of clusters (K).
  * Randomly initialize K centroids.
  * Assign data points to the nearest centroid.
  * Recalculate centroids based on cluster means.
  * Repeat until convergence.
* <mark style="color:blue;">Finding Optimal K (Elbow Method):</mark>
  * Run K-Means for different K values.
  * Calculate total variation within clusters.
  * Plot results and find the "elbow point" where adding clusters no longer reduces variation significantly.
* <mark style="color:blue;">Applications & Considerations</mark>:
  * Works for **1D, 2D, and multi-dimensional** data.
  * Used in **customer segmentation, image compression, and heatmaps**.
  * Running multiple times helps counter randomness in centroid initialization.

#### <mark style="color:blue;">Limitations of K-Means Clustering</mark>

* **Sensitivity to Initialization** – The algorithm's final clustering results can vary due to different initial centroid placements, leading to inconsistent outcomes.
* **Fixed Number of Clusters (K)** – K-means requires specifying the number of clusters in advance, which can be challenging without prior knowledge of the data structure.
* **Struggles with Non-Spherical Clusters** – It assumes clusters are spherical and evenly sized, making it ineffective for complex, irregularly shaped clusters.
* **Sensitivity to Outliers** – Outliers can distort centroid positions, leading to inaccurate cluster assignments and affecting overall performance.

#### <mark style="color:blue;">Cost function</mark>

The cost function for **K-Means Clustering** is the **Sum of Squared Errors (SSE)**, also known as **Inertia**. It measures the compactness of clusters by calculating the squared distance between each data point and its assigned centroid.

$$
J= \sum\_{i=1}^{K} \sum\_{x \in C\_i} || x - \mu\_i ||^2
$$

```
Where:
K = Number of clusters
x = Data point
μ_i = Centroid of cluster C_i
|| x - μ_i ||^2 = Squared Euclidean distance between the point and the centroid
```

The objective of **K-Means** is to **minimize** this cost function to achieve the best clustering.

#### <mark style="color:blue;">Pseudocode</mark>

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FEIVUs1eReVHv6hnEjQ7D%2Fflow_digaram_ML_k_mean.jpeg?alt=media&#x26;token=05b694a5-a11f-4b05-834b-3add180c418a" alt="" width="326"><figcaption></figcaption></figure>

```
Initialize K centroids randomly  
Repeat until convergence:  
  Assign each data point to the nearest centroid  
  Update centroids by computing the mean of assigned points  
  Check for convergence (centroids no longer change)

```

### WORKING......

### Mini-Batch K-Means and the Elbow Method
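
A sketch of K-Means, Mini-Batch K-Means, and the elbow method with scikit-learn; the blob dataset, K range, and batch size are arbitrary choices:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# Elbow method: plot inertia (SSE) for a range of K values
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("K"); plt.ylabel("Inertia (SSE)"); plt.show()

# Mini-Batch K-Means: centroids are updated on small random batches, which is faster
mbk = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=10, random_state=42)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_)
```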

## Lecture 16: *(<mark style="color:orange;">08/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FAPD9awO3kAhWzcj7zkx8%2FLecture_KMEans.pdf?alt=media&token=8c96f7bf-038b-485a-887c-333ac2ecaca2>" %}

### <mark style="color:purple;">Silhouette Coefficient</mark>&#x20;

The **Silhouette Coefficient (or Silhouette Score)** is a metric used to evaluate the quality of clustering in unsupervised learning. It measures how similar a data point is to its own cluster compared to other clusters. The score ranges from **-1 to 1**, where:

* **1** → The data point is well clustered.
* **0** → The data point is on the border between clusters.
* **-1** → The data point is likely misclassified.

$$
S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$

```
where:
a(i) = Average intra-cluster distance 
(distance from i to all other points in the same cluster).

b(i) = Average nearest-cluster distance 
(distance from i to all points in the closest neighboring cluster).
```
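
A quick sketch of computing the score with scikit-learn on synthetic blobs; the dataset and K range are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Higher silhouette = better-separated, more compact clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"K={k}: silhouette = {silhouette_score(X, labels):.3f}")
```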

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FtvblC99FaYciVNQPt0gW%2F12.png?alt=media&#x26;token=a118c112-016c-4011-8e1a-ec1e8f92fe6d" alt=""><figcaption></figcaption></figure>

### <mark style="color:purple;">Bayesian Decision Theory Risk function</mark>

Bayesian Decision Theory provides a probabilistic approach to decision-making under uncertainty. The **Risk Function** quantifies the expected loss when making decisions based on uncertain information.

## Lecture 17: *(<mark style="color:orange;">09/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F5SWFACdWzAnirhbKpUDp%2FLecture_PCA.pdf?alt=media&token=b187137b-741b-4235-a6f9-1eebf6802d3a>" %}

### <mark style="color:purple;">Principal Component Analysis (PCA)</mark>

Locally Linear Embedding

#### <mark style="color:orange;">Eigenvalue</mark>

## Lecture 18: *(<mark style="color:orange;">16/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F4xra4a5Ez6gReR53U2cl%2FLecture_BayesClassification.pdf?alt=media&token=ee4246c2-0ce9-4953-90d6-eaf4246d02b7>" %}

## Lecture : *(<mark style="color:orange;">22/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FsRw0C0iVqEAGwiAPCHL8%2FExample_BayesClass.pdf?alt=media&token=4b31e703-a3f6-45ba-844d-77e5269f079b>" %}

## Lecture : *(<mark style="color:orange;">23/03/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FH6VYXUwvKHPawvfcapfL%2FLecture_Neural%20Networks.pdf?alt=media&token=189d7d37-0349-4569-8515-84985cc5651d>" %}

### <mark style="color:purple;">Neural Networks</mark>

## Lecture : *(<mark style="color:orange;">29/03/2025</mark>`)`*

**Class Recording**

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FoTfFwyEBjVp7hBnCjYEj%2Fnural.png?alt=media&#x26;token=16f8e775-6a08-4355-9163-699df7e53b36" alt="" width="563"><figcaption></figcaption></figure>

### <mark style="color:purple;">Neural Networks</mark>

**Input Encoding**: The original photo is processed into a latent feature space using a CNN.

**Style Conditioning**: A text encoder converts the prompt "Ghibli style" into a style embedding.

**Latent Fusion**: Cross-attention fuses the photo’s content with the Ghibli style.

**Diffusion Refinement**: An iterative diffusion model denoises the fused latent space to align it with the desired style.

**Decoding**: A decoder converts the refined latent representation back into the final stylized image.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FN6jCCrkAMMqlsyq4FWQe%2Fnu.png?alt=media&#x26;token=9506e19c-858b-47d5-9afa-f830d7f4ee2c" alt="" width="563"><figcaption></figcaption></figure>

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FC8YORhcr65sXmlxwEisz%2FNeural_Networks.pdf?alt=media&token=aabb4ac6-3c98-4707-8380-bcd260b83e44>" %}

## Lecture : *(<mark style="color:orange;">30/03/2025</mark>`)`*

**Class Recording**

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2F9cuDGHyGihClHGvYTrFf%2Factivation.png?alt=media&#x26;token=297d4b9b-4223-419f-b23e-07b7a2cf6d43" alt=""><figcaption></figcaption></figure>

## Lecture : *(<mark style="color:orange;">05/04/2025</mark>`)`*

**Class Recording**

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FLYc8Vj8G5BLdnsXtbfjp%2FCNN_250403_115321.pdf?alt=media&token=a2b0f0e9-f378-4bb9-8303-1fb48f456482>" %}

<mark style="color:purple;">**CNN**</mark>

Convolutional Neural Networks are a specialized kind of neural network designed for processing structured grid data like images. They are particularly effective in visual recognition tasks.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2Fn9hRjF0Few9p7MnWlt3A%2FCNN.png?alt=media&#x26;token=407e9a6c-5373-470a-9dcb-eaa93e7f13d0" alt=""><figcaption></figcaption></figure>

## Lecture : *(<mark style="color:orange;">06/04/2025</mark>`)`*

**Class Recording**

<mark style="color:purple;">**CNN**</mark>

{% embed url="<https://onyx-jay-6db.notion.site/DL-Notes-1d2d947d0c3e80539b46ea40c05d2533>" %}
Detailed notes by Akash
{% endembed %}

{% embed url="<https://onyx-jay-6db.notion.site/DL-Notes-1d2d947d0c3e80539b46ea40c05d2533>" %}
Good notes by Ashish
{% endembed %}

{% file src="<https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FpkLV6xht7rhPjJjchM4j%2FANN%2C%20Neurons%2C%20Activation%20Function.pdf?alt=media&token=1c413dc3-b8e1-4aaa-85ec-f13a51e55c80>" %}
Handwritten notes by Ashish
{% endfile %}

## Lecture : *(<mark style="color:orange;">12/04/2025</mark>`)`*

**Class Recording**

### <mark style="color:purple;">**Autoencoders**</mark>

Autoencoders are neural networks designed to learn efficient representations (encodings) of data, typically for dimensionality reduction, denoising, or generative tasks. They work by trying to reconstruct their inputs.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2Fkm3tyEXOzwdZ8XGPxYyE%2Fauto.png?alt=media&#x26;token=7b72c99c-a1ef-4ba8-b587-f5ac99b125eb" alt=""><figcaption></figcaption></figure>

### <mark style="color:purple;">**RNN (Recurrent Neural Network)**</mark> <a href="#bh-v_nri3fq0zyn4sudr9nn1" id="bh-v_nri3fq0zyn4sudr9nn1"></a>

RNNs are neural networks designed for sequential data, where the current output depends not only on the current input but also on previous inputs. They are widely used in tasks involving time series, language, and sequences.

<figure><img src="https://993787502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxTZGvcOxnCaOvsmUzDj5%2Fuploads%2FvrkpdX7s0MB14Q8cnxaj%2Fautoi.png?alt=media&#x26;token=49069b3c-6743-43a9-ab2a-7bbbb32d7742" alt=""><figcaption></figcaption></figure>

## Lecture : *(<mark style="color:orange;">13/04/2025</mark>`)`*

<mark style="color:orange;">QUIZ</mark>
