# Lecture 'Deep Learning'

All dates for winter term 17/18

Note! Changes in the course plan:

Thursday, 11.01.18 --> 14:00 lecture

Thursday, 11.01.18 --> 15:45 no exercise!

Thursday, 18.01.18 --> 14:00 lecture

Thursday, 18.01.18 --> 15:45 lecture

Thursday, 25.01.18 --> 14:00 exercise

Thursday, 25.01.18 --> 15:45 exercise

Script (last updated: 18.01.18)

(password-protected, password will be provided only to students in first lecture)

## Exercises

Please do exercise 10 (CNN in TensorFlow) and exercise 11 (speech recognition with a CNN in Keras) from the script for the exercise sessions on 25.01.18. All students who have not yet presented an exercise should present one on that day.

For some exercises, code will be provided as a starting point. This code will also be published in the Deep Learning Book GitHub repository.

## Data

- Convolutions: video with test pattern
- 10x10 audio dataset contains:
  - training data: 10 audio streams of a single word spoken 10 times
  - test data: 10 audio streams of a single word spoken once

## Videos

- Video of original experiments of Hubel and Wiesel from 1959 showing the existence of "simple" and "complex" cells.
- Video of a Convolutional Neural Network demo from 1993. Yes, the CNN model is not really new!
- Mini-batch gradient descent by Andrew Ng. Explains what mini-batches are compared to batches (11m28s)
- Understanding Mini-Batch Gradient Descent by Andrew Ng. Explains what the path to a local minimum looks like for Batch GD, SGD and Mini-batch GD (11m18s)
- Gradient descent with momentum by Andrew Ng. Gives a good intuition in a short time (9m20s)
- RMSProp by Andrew Ng. Again gives a good intuition in a short time (7m41s)
- Adam Optimization Algorithm by Andrew Ng. Mainly presents the formulas and shows that Adam (="Adaptive Moment Estimation") is a combination of the momentum optimizer with the RMSProp optimizer (7m07s)
- Exponentially Weighted Averages by Andrew Ng. RMSProp and Adam use exponentially weighted (moving) averages (EWMA) of squared gradients, so Andrew Ng introduces EWMA as well (5m58s)
- Bias Correction of Exponentially Weighted Averages by Andrew Ng. EWMA are actually quite bad estimates in the initial phase, because they tend to underestimate the average. Here Andrew Ng explains how to correct for this artefact by multiplying the EWMA by 1/(1-beta^t), where t is the number of values seen so far (4m11s)
- Learning Rate Decay by Andrew Ng. Motivates why the learning rate should be decreased during training and shows some formulas that are typically used to decrease it as a function of the number of training epochs (6m44s)
- Tuning Process by Andrew Ng. Recommends sampling hyperparameter values randomly instead of on a grid, and following a coarse-to-fine search (7m10s)
- Normalizing Activations in a Network by Andrew Ng. Explains the key idea of activation normalization: a normalization step is introduced for the activation or output values of the neurons in a layer, such that they have a certain mean and a certain variance. Two new learnable parameters (called gamma and beta in the video) are introduced for each layer, which allow the network to learn a good mean and variance for the activations / output values of the neurons in that layer (8m54s)
- Fitting Batch Norm Into Neural Networks by Andrew Ng. Explains that the normalization is usually computed on the basis of the per-layer activations produced by a mini-batch of samples, and shows that the usual bias vectors are no longer needed: their role is taken over by the batch normalization step, in which we learn the best mean for the activation of each neuron in the layer under consideration (12m55s)
- Why Does Batch Norm Work? by Andrew Ng. Explains why batch normalization helps: 1. it gives later layers a more stable input; 2. it acts as a regularizer, similarly to dropout, because it adds noise to the activation values of the neurons: the mean and variance used for normalization are estimated from the current mini-batch only, not from the whole training set (11m39s)
- Batch Norm At Test Time by Andrew Ng. Explains an important difference between the training and the inference step. During training, the normalization of the neurons' activation values is computed on the basis of a mini-batch. And what do we do at test time? Here Andrew Ng explains that we usually keep track of an (exponentially weighted) moving average of the means and variances computed during training and use these estimates for the normalization step at test (inference) time (5m46s)
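
To make the formulas from the optimizer videos above more concrete, here is a minimal Python sketch of an exponentially weighted moving average with the bias correction factor 1/(1-beta^t), and of a single Adam update that combines a momentum term with an RMSProp-style term. This is not code from the lecture script: the function names `ewma` and `adam_step`, the default values for `lr`, `beta1`, `beta2` and `eps`, and the toy objective f(x) = x^2 are assumptions chosen only for illustration.

```python
def ewma(values, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average of a sequence of numbers.

    With bias_correction=True each estimate is multiplied by
    1 / (1 - beta**t), which compensates for starting the average at 0.
    """
    v = 0.0
    averages = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        if bias_correction:
            averages.append(v / (1.0 - beta ** t))
        else:
            averages.append(v)
    return averages


def adam_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter x (illustrative only).

    m is the EWMA of the gradients (momentum part),
    v is the EWMA of the squared gradients (RMSProp part),
    t is the 1-based step counter used for the bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    x = x - lr * m_hat / (v_hat ** 0.5 + eps)
    return x, m, v


if __name__ == "__main__":
    # A constant signal shows the initial underestimation very clearly.
    signal = [10.0] * 5
    print("without correction:", ewma(signal, bias_correction=False))
    print("with correction:   ", ewma(signal))

    # Toy problem (assumption): minimize f(x) = x^2, whose gradient is 2x.
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, 201):
        x, m, v = adam_step(x, 2.0 * x, m, v, t)
    print("x after 200 Adam steps:", x)
```

For a constant input, the uncorrected average starts far below the true value and only slowly approaches it, while the corrected average matches the input from the first step on. This is the underestimation effect described in the bias correction video.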

## Links

- A nice visualization of a CNN by Adam Harley
- Deep Learning Glossary - A very compact intro/overview of Deep Learning related terms by Denny Britz
- AI and Deep Learning in 2017 – A Year in Review by Denny Britz
- An overview of gradient descent optimization algorithms by Sebastian Ruder

**This is old material from my Deep Learning lecture held in winter term 16/17:**

Slides: Slides of all lectures until January 19, 2017

Exercises:

- Exercise 01: building the OpenCV library and experimenting with convolutions
- Exercise 02: Reading in the MNIST dataset and filtering sample images with a filter bank
- Exercise 03: Perceptron classifier
- Exercise 04: Multi Layer Perceptron feedforward step and performance tests
- Exercise 05: Backpropagation and implementation testing
- Exercise 06: MLP network topologies and transfer functions
- Exercise 07: MLP with TensorFlow
- Exercise 08: CNN with TensorFlow
- Exercise 09: AlexNet CNN with TensorFlow
- Exercise 10: Using a pre-trained CNN model
- Exercise 11: Unsupervised learning of features
- Exercise 12: Long Short Term Memory (LSTM)
- Exercise 13: Hierarchical unsupervised learning of features

Sample solutions for all exercises can be found on GitHub.