Day66 Deep Learning Lecture Review - Lecture 1

September 3, 2024 6 minute read

Basic Mathematics, Supervised ML, and Review of Multi-Layered Perceptron

We will explore the intricate intricacies of “The End-to-End Deep Learning lecture of the 2024 Fall at the University of Rochester, diving into its potential and envisioning how it can revolutionize our understanding of the world. I expect to unravel the complexities and breakthroughs of this captivating field; we will navigate through the profound implications and transformative applications across various industries. Join me on this journey to review all concepts from the lectures.

1. Background Math

Vector Norm $\Vert \mathbf{v} \Vert $
- A vector norm is a function that takes a vector as input and outputs a single non-negative real number to show how significant that number is.
- Used for many ML activities
  - Training systems(loss functions) / Regulation / Error Measurement
- $L^p$ Vector Norm: $\Vert \mathbf{x} \Vert_p = \big({\sum_{i=1}^n x_i^2} \big)^{\frac{1}{p}}$
  - $L^2$ Norm (Euclidean Norm): $\Vert \mathbf{x} \Vert_2 = \sqrt{\sum_{i=1}^n x_i^2}$ (most common)
  - $L^1$ Norm (Manhattan Norm): $\Vert \mathbf{x} \Vert_1 = \sum_{i=1}^n {\vert x_i \vert} $

Matrices as Linear Transformation
- Matrix multiplication can be used to transform vectors in useful ways, e.g., $\text{y=Ax}$
- This is crucial to understand. Much of the class involves finding $\text{A}$ matrices to do a particular linear transformation.

Tensors
- Multi-dimensional arrays
- Color images are typically stored as $n \times m \times 3$ tensors (arrays) - Red, Green, and Blue brightness values.

Exponential Function exp(x)
- $e^x = \exp(x)$
- The exponential function $\exp(x)$ is defined as the power to which the number $e$ (approximately 2.71828, and known as Euler’s number) must be raised to yield $x$. Thus, $e^x = \exp(x)$.
- Differentiability and Continuity:
  - Continuous and differentiable across the entire real line.
    - The derivative of $\exp(x)$ is $\exp(x)$ itself, which is a unique property and makes it extremely useful in differential equations and calculus.
    - $K(x_i, x_j) = \exp\left(-\frac{|x_i - x_j|^2}{2\sigma^2}\right)$

Multivariate Gaussian / Normal Distribution
- $m$ \indicates the mean of the $d$ -dimensional Gaussian}
- $\Sigma$ indicates the covariance matrix
  - $f(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^d \vert \Sigma \vert}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{m})^T \Sigma^{-1} (\mathbf{x} - \mathbf{m})\right)$

Vectorization for Making Code Fast
- Avoid for-loops if possible
- Try to directly implement algorithms using matrix and vector operations
- Parallelization is always important

2. Supervised Machine Learning

After seeing a bunch of examples (Input space- $\mathbf x$, Output Space-y), pick a mapping $F:\mathbf x \rightarrow y$ that accurately replicates the input-output pattern of examples.
- With continous values $\rightarrow$ Regression
- Predicting class labels $\rightarrow$ classifier

How well does the model $F$ work after training?
- Evaluate $F$ on the test data to get the generalized results

Error for Regression (predicting continuous variables)
- Error measures to use when trying to predict real values instead of discrete values used in classification
- Mean Squared Error is the most common one
  - $MSE = \frac{1}{N} \sum^N_{i=1} \Vert F(\mathbf X_i) - \mathbf y_i \Vert ^2 _2 $

Discrete Output Space
- With a discrete output space $\rightarrow$ classifiers
  - class $\rightarrow $ labels or categories
  - Nominal variables

Classification Error and Accuracy (Predicting discrete variables )
- Accuracy = Total Correct / Total
- Error Rate = 1-accuracy
- For class imbalanced datasets, often deed to compute per-class accuracy statistics

Overfitting and Underfitting
- Overfit: if the training perforance is much greater than the testing performance
- Underfit: if the model isn’t powerful enough to derive the input-output relationship
- Regularization: mtethods for reducing overfitting

Quantity of Data
- More labeled data, better model we can settle

Data Augmentation
- Free way to get “more” data
- Random crops, rotations, flips, noise injection, color modifications and many others

3. Reveiw of Multi-Layer Perceptron

The Artificial Neuron
- $y=f \big( \mathbf{w}^T \mathbf{x} + b \big)$
- Can use it for classification or regression tasks
- Activation function needs to be chosen for the task
  - It constrains the output range

Some common Activation Functions

Logistic Sigmoid Activation Function (most common)
- Forces output to be between 0 and 1
- Used for classification
Hyperbolic Tangent Activation Function (most common)
- Forces output to be between -1 and 1
Rectified Linear Activation Function
- Output between 0 and postiive infinity

MLP
- Each layer of units $\rightarrow$ matrix multiplication
- Unit’s weights $\rightarrow$ a row of a weight matrix

Why Activation Functions?
- Without activation functions, will not get anny additional non-linearity in a MLP
  - Two linear layers can just be combined into one.

Model Capacity for Parametric Models
- Model capacity ( $\approx$ number of parameters)
  - Greater model capacity $\approx$ more parameters in the model
  - Bigger capacity $\rightarrow$ need more training data to train the model to avoid overfitting
  - Low capacity $\rightarrow$ under-fit the training data significantly
- In a Neural Netrowors: More neurons = more capacity
  - Linear classifier $\rightarrow$ low capacity
  - Deep neural network $\rightarrow$ lot of capacity

How to find Parameters $\mathbf w$ (weights) and $b$ (bias)
- Training : to identify good values for $\mathbf w$ and $b$
- Fitting: for parametric models like neural networks
- Adjust the result with a loss function

Loss Function (Cost Function, Error Function)
- Aim to minimize the loss function on the training data (lower training loss)
  - Depends on what is the type of the model - regressions vs. classification

Optimization Loss Landscape
- Find the global minima

Weight Initialization
- Random initialization: All units all initialized from some random distribution
  - Optimal distribution varies depending on activation function
  - Poor initialization will impact trining greatly
- Transfer initialization: Take weights from another neural network trained on similar inputs from a large dataset.

Bias Initialization
- Hidden layer bias is typcally initialized to Zero
- Output layer bais can be initialized from data for regression set it to the mean of the output vectors in training data $\rightarrow$ reduces “hockey Stick” loss curve

Activation Functions & Loss Functions
- Activation functions define the outputs of the unit
- Output layer activation functions & loss functions : closely related
  - Linear activation function for the out put is typically paired most often with:
    - L2 Loss
    - Huber Loss
  - Logistic Sigmoid function for output is typically paired with:
    - Binary cross-entropy loss (BCE Loss)
  - Softmax for mutually exclusive classifcation
    - Cross-entropy loss

Gradient Descent
- First-order iterative optimization algorithm for finding a local minimum of a fucntion
- For finding the local maximum of a function $\rightarrow$ gradient ascent
- A procedure for minimizing a loss function to find the network’s parameters
- Compute partial derivates of the loss function with repsect to network parameters to give us the slope that telss us what direction to change each parameter get closer to the minima
- $w \leftarrow w - \eta \frac{\partial \text{Loss}}{\partial w}$
  - $\eta$ is the learning rate

Backpropagation: Gradient descent for Multi Layer Networks
- Use gradient descent to change the parameters $\mathbf w$ to reduce the loss

Epochs & Loss Curves
- Gradient descent makes small changes to parameters to improve loss
  - Many small changes are needed
- Epoch: An entire gradient descent pass through the whole training dataset
- Train loss should go down as a function of epochs

When should we update Parameters?
- SGD: Compute update individually for each the N training instances (N updates in an epoch)
- Full Gradient Descent: Update after seeing all the instances in the training set (1 update of in an epoch)
- Mini Batch Gradient Descent: for deep learning
  - Update N / M times, where M is the batch size
Adam
- Adapts the step sizes for different parameters
- Automatically adjusts the learning rate for each parameter based on the history of gradient
- Adam keeps track of two things:
  - The average gradient (the “first moment”) – similar a running average of the direction
  - The average of the squared gradients (the “second moment”) – how steep or flat the terrain is overall
  - If the terrain is very steep in one direction (meaning large gradients), Adam will take smaller steps in that direction. if the terrain is flatter in another direction (smaller gradients), Adam will take larger steps.

Relationship between Minibatch Size and Learning Rate
- When increasing the batch size, the optimal learning rate will change, and this depends on the optimization algorithm
- For SGD: When batch size increases by $k$, multiply the learning rate by $k$.

Share on

Twitter Facebook LinkedIn

Wonha Leah Shin

Day66 Deep Learning Lecture Review - Lecture 1

Basic Mathematics, Supervised ML, and Review of Multi-Layered Perceptron

1. Background Math

2. Supervised Machine Learning

3. Reveiw of Multi-Layer Perceptron

Share on

Leave a comment

You may also enjoy

Day175 - MLOps Review: Data Distribution Shifts And Monitoring (2)

Day174 - MLOps Review: Data Distribution Shifts And Monitoring (1)

Day173 - MLOps Review: Model Deployment And Prediction Service (3)

Day172 - MLOps Review: Model Deployment and Prediction Service (2)