Coursera - Deep Learning Specialization
Neural Networks and Deep Learning from ML Specialization
Week 1 - Neural Networks
NNs are good for unstructured data (images, audio, text) but also work on structured data.
Deep learning has risen because of the amount of data and computational power available; there is a graph of performance vs. amount of data, where bigger models keep improving as the data grows. Also new algorithmic ideas like backpropagation and better activation functions (e.g. ReLU).
Inference is the process of predicting the output of the model given the input. Inference is also called prediction or testing.
Originally trying to mimic the human brain: the brain is made up of neurons connected to each other. The analogy is loose though; modern NNs aren't really models of the brain.
Trendy during the 1980s and 90s (e.g. handwritten digit recognition, the MNIST dataset) but then died out. Now it is back.
Computer vision took off with the 2012 ImageNet challenge: AlexNet (2012), VGGNet (2014), GoogLeNet/Inception (2014), ResNet (2015), Xception (2016), NASNet (2017), EfficientNet (2019).
Recently NLP: chatbots, Google Duplex, GPT-3, etc.
Neurons take input and then send outputs/activations.
Demand prediction for a store: input is the item's features and output is the predicted demand, which depends on awareness, price, etc. Combine neurons to make a model; they form layers: input layer, hidden layers, and output layer.
Layers are fully connected because the network can learn to ignore some features and to combine features into new ones, so a NN automatically does a kind of feature engineering.
Can turn a picture into a vector of pixel values and use that as input to the NN for classification. This works for MNIST but does not scale great to bigger images.
Matrix notation is nice: can do matrix multiplications instead of for loops, and they run in parallel and on a GPU.
So-called forward propagation is computing the outputs from the inputs given the weights (e.g. weights we downloaded); backpropagation is the extra step used during training to compute the gradients.
TensorFlow and NumPy have an inconsistency in how data is represented; shapes are rows first, then columns. np.array([200, 17]) is a 1-D array with 2 elements, while np.array([[200, 17]]) is a 2-D matrix with 1 row and 2 columns; TensorFlow layers expect the 2-D (batch, features) form.
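A minimal sketch of that shape difference, assuming NumPy as in the course:

    import numpy as np

    a = np.array([200, 17])      # 1-D array, shape (2,)
    b = np.array([[200, 17]])    # 2-D matrix, shape (1, 2): 1 row, 2 columns
    c = np.array([[200], [17]])  # 2-D matrix, shape (2, 1): 2 rows, 1 column
    print(a.shape, b.shape, c.shape)  # (2,) (1, 2) (2, 1)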
Sequential strings Dense layers together.
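Hedged sketch of what that looks like in tf.keras (the layer sizes 25/15/1 are just for illustration):

    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    # Stack Dense layers; each layer's output feeds the next one.
    model = Sequential([
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(1, activation='sigmoid'),  # binary output
    ])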
Must implement forward propagation and backpropagation by hand one time to understand them. Then can use a library.
Can first implement a layer with a for loop and dot products, then switch to matrix multiplication to go faster and make the code cleaner (see the sketch after the broadcasting note below).
Why use a non-linear activation? Very important! Otherwise the whole network collapses into linear regression. Options: sigmoid, tanh, ReLU, leaky ReLU, softmax.
Show the speed difference of using NumPy vectorized operations rather than for loops; probably already noticeable with 1000 samples and a simple operation like sigmoid.
Where is broadcasting used in a normal linear layer or activation function? Mainly in the bias add (the bias vector is broadcast across the whole batch) and in the elementwise activation.
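A minimal sketch of the last three notes (names and sizes are made up): a dense layer computed with loops vs. vectorized NumPy, where the bias add in X @ W + b is where broadcasting happens and sigmoid is applied elementwise:

    import time
    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 400))   # 1000 samples, 400 features
    W = rng.standard_normal((400, 25))     # 25 units in the layer
    b = rng.standard_normal(25)

    # Loop version: one sample and one unit at a time.
    start = time.perf_counter()
    A_loop = np.zeros((1000, 25))
    for i in range(1000):
        for j in range(25):
            A_loop[i, j] = sigmoid(np.dot(X[i], W[:, j]) + b[j])
    print("loop:      ", time.perf_counter() - start)

    # Vectorized version: b (shape (25,)) is broadcast over all 1000 rows.
    start = time.perf_counter()
    A_vec = sigmoid(X @ W + b)
    print("vectorized:", time.perf_counter() - start)

    print(np.allclose(A_loop, A_vec))  # True, same result, much faster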
Log loss is actually just maximum likelihood estimation: minimizing -[y log(f(x)) + (1-y) log(1-f(x))] is the same as maximizing the Bernoulli likelihood f(x)^y (1-f(x))^(1-y), which makes sense.
Week 2 - Neural Network Training
BinaryCrossentropy is the same as log loss; CategoricalCrossentropy is the same idea for multiple classes. Epochs are the number of passes over the training set, i.e. how long GD runs; batch size is the number of training samples used to compute each gradient step.
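Hedged sketch of how that maps onto tf.keras, reusing the model from the Sequential sketch above (X_train/y_train and the epoch/batch values are assumptions, not from the course):

    from tensorflow.keras.losses import BinaryCrossentropy

    model.compile(loss=BinaryCrossentropy())   # log loss for a binary output
    # X_train, y_train: assumed NumPy arrays of shape (m, n_features) and (m,)
    model.fit(X_train, y_train, epochs=100, batch_size=32)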
Keras got merged into TensorFlow (tf.keras).
Loss and cost function: often used interchangeably; strictly, the loss is for a single example and the cost is the average loss over the training set.
Compute the gradient for the descent using backpropagation, so we don't analytically compute the gradient by hand anymore.
Linear activation function is just the identity function, i.e. effectively no activation function.
Is ReLU differentiable? max(0, x) is not differentiable at x = 0, but it is differentiable everywhere else, so it is differentiable almost everywhere; in practice libraries just pick 0 or 1 as the derivative at x = 0 and it doesn't matter.
Depending on what output you want, the last layer has a different activation function. For a binary output use sigmoid; for a multi-class output use softmax; for a regression output use linear; if you only want positive values, use ReLU.
ReLU is faster to compute than sigmoid and tanh. It also doesn't saturate for positive inputs, so gradient descent is less likely to stall (sigmoid and tanh go flat on both sides). ReLU is also said to be more biologically plausible than sigmoid and tanh.
For non-final layers the tanh activation is almost always better than the sigmoid. tanh is just a scaled and shifted sigmoid, but it is zero-centered, which has similar benefits to standardizing the input.
What are the pros and cons of leaky ReLU, and what is its derivative? It is max(αz, z) with a small α (e.g. 0.01); the derivative is 1 for z > 0 and α for z < 0, so neurons don't "die" like with plain ReLU, at the cost of one more hyperparameter.
How should we initialize the weights? Small random values (normal or uniform), never all the same for hidden layers; better, use a scaled scheme like Xavier or He initialization, which scales the variance by the layer size. Initialization can have an impact on training speed and even on the final performance of the model.
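A minimal NumPy sketch of the idea, assuming He initialization for a ReLU layer (the layer sizes are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 400, 25

    W_bad = rng.standard_normal((n_in, n_out))                         # std 1: too large for deep nets
    W_he = rng.standard_normal((n_in, n_out)) * np.sqrt(2 / n_in)      # He init, suited to ReLU
    W_xavier = rng.standard_normal((n_in, n_out)) * np.sqrt(1 / n_in)  # Xavier-style, suited to tanh
    b = np.zeros(n_out)                                                # biases can start at zero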
First build a shallow neural network to learn and calculate things by hand.
Multiclass
MNIST dataset, 10 classes, so the decision boundary now separates multiple classes. Softmax is a generalization of logistic regression, i.e. the multiclass version of the sigmoid.
Softmax regression: 4 output neurons z1...z4, one per class.
a_j = e^z_j / (e^z1 + e^z2 + e^z3 + e^z4) for each of the 4 activations; these all sum to 1, so they can be interpreted as probabilities, and the highest probability is the prediction.
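A minimal NumPy sketch of softmax (subtracting the max is a standard numerical-stability trick, not something specific to the course):

    import numpy as np

    def softmax(z):
        # Subtract the max so the exponentials don't overflow; the result is unchanged.
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    z = np.array([2.0, 1.0, 0.1, -1.0])
    a = softmax(z)
    print(a, a.sum())  # probabilities that sum to 1; the largest is the predicted class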
The loss then also becomes categorical cross entropy, or just cross entropy, a generalization of log loss.
SparseCategoricalCrossentropy: "sparse" means the labels are given as integer class indices (e.g. 3) instead of one-hot vectors; the loss itself is the same categorical cross entropy, so it's not a "better" version, just a different label format.
Because of the small intermediate numbers, cross entropy can suffer from floating point round-off errors, so we use a more numerically accurate formulation. For logistic regression we leave it up to TensorFlow whether to explicitly compute the intermediate value after the linear layer (the so-called logits) and then the sigmoid, or to plug the logits directly into the cross entropy function. So instead use a linear activation in the last layer and pass from_logits=True to the cross entropy loss.
These errors are worse for softmax because the numbers get even smaller, so we use the same from_logits=True trick for softmax.
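Hedged sketch of that setup in tf.keras (layer sizes are arbitrary; the key parts are the linear output layer and from_logits=True):

    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential([
        Dense(25, activation='relu'),
        Dense(10, activation='linear'),  # output raw logits, no softmax here
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer='adam',
    )
    # To get probabilities at prediction time, apply softmax to the logits:
    # probs = tf.nn.softmax(model(X_new))   # X_new: assumed new inputs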
Does PyTorch have the same problem? Yes, and it handles it the same way: nn.CrossEntropyLoss expects raw logits and applies log-softmax internally.
Multilabel
Multilabel classification is when there are multiple correct labels at once. For example, a picture with a dog and a cat has the correct labels dog and cat; a picture with a dog and a person has the correct labels dog and person.
Could build a separate network for each label, but this is not good. So we use a single network for all the labels: the output layer has one neuron per label, with sigmoid activation because we want each output to be independently between 0 and 1.
Which loss is used? Binary cross entropy, applied to each output neuron independently.
GD and Backpropagation
Better optimizer algorithms than plain GD: momentum keeps an exponentially weighted average of past gradients, RMSprop divides by a running average of squared gradients to adapt the step size per parameter, and Adam combines both ideas.
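A minimal NumPy sketch of the Adam update for a single parameter vector (standard textbook form; the variable names are mine):

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Momentum-style average of gradients and RMSprop-style average of squared gradients.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v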
Another layer type could be convolutional etc., but we will get to that later. Does a 1-D conv layer do anything? Can't the network find that out itself? The point is that a conv layer bakes in the assumption that nearby inputs are related and shares weights, which a dense layer would have to learn from the data.
backprop uses a computation graph, where for each operation we define the forward and the backwards prop, i.e how the gradient is calculated. Then we can just use the chain rule to calculate the gradient of the cost function with respect to the parameters.
Autodiff is cool: you just define the forward prop and it automatically calculates the backward prop, i.e. the gradient of the cost function with respect to the parameters.
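Hedged sketch of autodiff with tf.GradientTape (a toy one-parameter example, not from the notes):

    import tensorflow as tf

    w = tf.Variable(3.0)
    x, y = 1.0, 1.0

    with tf.GradientTape() as tape:
        y_hat = w * x                 # forward prop
        cost = (y_hat - y) ** 2       # cost J(w)

    grad = tape.gradient(cost, w)     # backward prop, computed automatically
    print(grad.numpy())               # dJ/dw = 2*(w*x - y)*x = 4.0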
Week 3 - Advice for Applying Machine Learning
Diagnostics are tests that you can run to gain insight into what is/isn't working with your model.
training and test set. This is useful to see if the model is overfitting or underfitting. We can also count the number of correct and wrong classifications and calculate metrics like recall etc.
Cross validation: split off a validation set from the training data (or use multiple splits) to tune the hyperparameters of the model such as polynomial degree, regularization parameter, etc.
Cross validation / validation / dev set, and then a test set: use the validation results to pick the best model and then use the test set to get an unbiased estimate of its final performance.
Why do we add the validation set and not just select on the test set? Because if you pick the model that does best on the test set, the test error becomes an optimistically biased estimate of generalization, since you have effectively fit a choice to the test set. K-fold cross validation is not mentioned. Plotting the training loss and the validation loss of multiple models is a good way to see if a model is overfitting or underfitting; the sweet spot of hyperparameters is where both are low.
If the training loss is low and the validation loss is high, then the model is overfitting. If the training loss is high and the validation loss is high, then the model is underfitting. If the training loss is low and the validation loss is low, then the model is good.
establishing baselines is important. For example, if we have a binary classification problem, then we can use a baseline of 50% accuracy, i.e. random guessing. We can also use other already existing models as baselines or human performance as a baseline.
Having more data doesn't always help, especially if the algorithm suffers from high bias. Can see this by looking at the impact of the training set size on the training and validation loss (learning curves).
When improving models, look at bad examples, maybe you can see a pattern, like when classifying cats and dogs, maybe the model is bad at classifying black cats, etc.
When adding more data, you can either add general data or add more data of the kind the model gets wrong. Maybe you can also find new features by looking at the bad examples.
Data augmentation is generating new data from the existing data, useful when you don't have enough. For example, flip a picture of a cat horizontally or make it black and white to get new cat pictures; this is very useful in computer vision for helping the model generalize. For speech recognition you can add noise to the audio clips; for text classification you can add spelling mistakes to the text, etc.
Transfer learning is using a model trained on a different task as a starting point for your model, for example a model trained on ImageNet. This is useful when you don't have enough data or enough computational resources to train from scratch. The idea is that the model has already learned some general features that are useful for your task. Swap the last layer and then either train only the new output layer or continue training all the parameters on your task, which is called fine-tuning.
Transfer learning of an ImageNet model onto cat vs. dog? (sketch below)
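Hedged sketch of that idea in tf.keras, assuming MobileNetV2 pretrained on ImageNet as the base (the choice of base model, image size, and dataset name are mine, not from the notes):

    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

    # Pretrained feature extractor without its ImageNet classification head.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(160, 160, 3), include_top=False, weights='imagenet')
    base.trainable = False   # freeze the base: train only the new head first

    model = Sequential([
        base,
        GlobalAveragePooling2D(),
        Dense(1, activation='sigmoid'),   # cat vs. dog
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy())
    # model.fit(train_ds, epochs=5)   # train_ds: an assumed labeled cat/dog dataset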
Trade-off between accuracy, recall and precision. Some data is skewed, for example 99% of one class and 1% of the other: then you can just predict the majority class and get 99% accuracy, but that is not a good model. So use recall (the percentage of positive examples that are correctly classified) and precision (the percentage of positive predictions that are correct). Rather than taking the normal average of recall and precision, use the F1 score, their harmonic mean: F1 = 2PR / (P + R).
Other parts
Show how to calculate the number of parameters in a NN; useful to see how many parameters you have to train and how much memory the model will take up. For a Dense layer it is (inputs x units) weights plus one bias per unit. Why deep networks? Can think of layers as feature engineering, or as a hierarchy of concepts: in computer vision the first layer can learn edges, the second shapes, the third objects, etc.
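A minimal sketch of the count, assuming 400 input features and the made-up layer sizes from the Sequential sketch above:

    # Dense layer parameters = n_in * n_units (weights) + n_units (biases)
    layer_sizes = [400, 25, 15, 1]   # input features, then three Dense layers (assumed sizes)
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        params = n_in * n_out + n_out
        print(f"Dense({n_out}): {params} parameters")
        total += params
    print("total:", total)   # 10025 + 390 + 16 = 10431
    # In tf.keras, model.summary() prints the same counts per layer.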
For the backward prop, intermediate values from the forward prop are cached, so we can reuse them to calculate the gradients using the chain rule and the computation graph.
parameters vs hyperparameters. parameters are the weights and biases of the model. hyperparameters are the learning rate, number of layers, number of neurons, etc.
sometimes can be a very empirical process to tune the hyperparameters. can use grid search or random search. can also use bayesian optimization. can also use gradient descent to tune the hyperparameters???
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Make sure the training, dev, and test sets come from the same distribution, e.g. don't train on camera photos and test on phone photos; for obvious reasons that is bad.
If overfitting, then can use regularization. If underfitting, then can use a bigger network or more training data or different architecture etc.
in linear regression we use ridge and lasso regression or elastic net, a combination of both. The idea is to keep weights and biases small.
For neural networks, we can use L2 regularization, which is the same as ridge regression. We can also use L1 regularization, which is the same as lasso regression. We can also use dropout regularization, which is a different type of regularization.
For simplicity we can ignore regularization for the biases; lambda is the regularization parameter.
When using matrix notation for the weights, we use the Frobenius norm instead, which is the square root of the sum of the squares of all the entries of the matrix, i.e. the L2 norm of the flattened matrix, so yes, it really is the same thing. The L1 analogue for matrices would be the sum of the absolute values of the entries.
This regularization is also called weight decay, because each gradient step effectively multiplies the weights by a factor slightly less than 1 before the usual update, so they decay over time.
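A minimal NumPy sketch of the L2/Frobenius penalty and the decay factor it produces in the update (following the 1/(2m) cost convention; the variable names are mine):

    import numpy as np

    def l2_cost_term(weights, lambd, m):
        # (lambda / 2m) * sum of squared entries of every weight matrix (Frobenius norm squared)
        return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

    def regularized_update(W, dW_unreg, alpha, lambd, m):
        # The extra gradient term is (lambda/m) * W, which is the same as
        # first multiplying W by (1 - alpha*lambda/m): "weight decay".
        return W * (1 - alpha * lambd / m) - alpha * dW_unreg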
Why does regularization work? It forces the weights to be small, so no single neuron can dominate and the whole network has to share the work; with small weights, z stays in the roughly linear middle region of activations like tanh, so the network behaves more simply and doesn't go into the extremes of the activation functions.
A layer of dropout randomly sets some of the activations to zero during training; with inverted dropout you also divide by the keep probability so the expected scale stays the same. During testing all the activations are used. The idea is that the model can't rely on any one neuron, so it has to learn multiple ways of doing the same thing, which can help with generalization.
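A minimal NumPy sketch of inverted dropout applied to one layer's activations (shapes and keep_prob are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    keep_prob = 0.8
    A = rng.standard_normal((25, 1000))   # activations of one layer, one column per example

    # Training: zero out ~20% of units and scale up the rest so the expected value is unchanged.
    D = rng.random(A.shape) < keep_prob   # random dropout mask
    A_train = (A * D) / keep_prob

    # Test time: no dropout, use A as-is.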
Data augmentation is also a form of regularization, by adding more data or adding noise to the data; can help with generalization.
early stopping is also a form of regularization. The idea is that we stop training when the validation loss starts to increase. This is because the model is starting to overfit. Can be done manually and was a big thing for GANs.
Vanishing and Exploding Gradients
Is especially the case with deep networks. The gradients can get very small or very big. This can cause the model to not train. This is because the gradients are used to update the weights and biases. If the gradients are very small, then the weights and biases won't change. If the gradients are very big, then the weights and biases will change too much.
The gradients can get very small or very big because of the activation functions and the depth: sigmoid/tanh derivatives are small (and near zero when saturated), so multiplying them across many layers shrinks the gradient; ReLU doesn't squash its input, so with large weights the activations and gradients can instead grow layer by layer during backpropagation.
The weight initialization can also cause the gradients to get very small or very big. If the weights are initialized to zero, all neurons in a layer compute the same thing and the symmetry is never broken; if they are initialized too big, the activations and gradients explode. Scaled initializations like Xavier/He (variance about 1/n or 2/n) keep the signal roughly constant across layers.
We can use gradient checking to check if the backprop gradients are correct, by comparing them to a numerical approximation. Seems a bit weird, but it is mainly for debugging: it is far too slow for training, and it doesn't work directly for things like dropout.
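A minimal sketch of the two-sided numerical check on a single parameter (the cost function here is just a stand-in):

    import numpy as np

    def numerical_grad(J, theta, eps=1e-7):
        # Two-sided difference approximation of dJ/dtheta.
        return (J(theta + eps) - J(theta - eps)) / (2 * eps)

    J = lambda w: (2.0 * w - 1.0) ** 2      # toy cost; analytic gradient is 4*(2w - 1)
    w = 0.7
    approx = numerical_grad(J, w)
    exact = 4 * (2 * w - 1)
    print(approx, exact, abs(approx - exact) / (abs(approx) + abs(exact)))  # tiny relative error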