1. Activation functions


1.1 sigmoid


  • Saturated neurons kill the gradients (large |x| values drive the gradient to 0).
  • Not zero-centered.
  • exp() is somewhat expensive to compute.

1.2 tanh


  • Zero-centered


  • Saturated neurons still kill the gradients (large |x| values drive the gradient to 0).

1.3 RELU


  • Does not saturate in the positive region, so large values don't kill the gradients.
  • Computationally cheap.
  • Converges much faster than sigmoid and tanh.
  • More biologically plausible than sigmoid.


  • Not zero-centered.
  • If the weights are initialized badly, many neurons (maybe 75% of them) can end up dead, which wastes computation (one workaround is to initialize the bias to a small positive value such as 0.01).

1.4 leaky RELU


  • Doesn't kill the gradients on either side.
  • Computationally efficient.
  • Converges much faster than sigmoid and tanh.
  • Will not die. The 0.01 slope can also be treated as a learnable parameter.

1.5 maxout


  • Generalizes RELU and Leaky RELU (see the sketch below).
  • Doesn’t die!


  • Doubles the number of parameters per neuron.
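
A minimal sketch of a maxout unit with two linear pieces (the variable names and toy sizes below are just illustrative, not from the notes):

import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Maxout with k = 2 linear pieces: elementwise max of two affine maps.
    # RELU is the special case W2 = 0, b2 = 0; Leaky RELU is W2 = 0.01 * W1, b2 = 0.01 * b1.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

# Toy usage: batch of 2 examples, 4 inputs -> 3 maxout units.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
W1, b1 = 0.01 * rng.standard_normal((4, 3)), np.zeros(3)
W2, b2 = 0.01 * rng.standard_normal((4, 3)), np.zeros(3)
print(maxout(x, W1, b1, W2, b2).shape)  # (2, 3)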

1.6 ELU


  • It has all the benefits of RELU
  • Closer to zero mean outputs and adds some robustness to noise.


  • exp() is somewhat expensive to compute.

1.7 In practice

  • Use RELU. Be careful with your learning rates.
  • Try out Leaky RELU/Maxout/ELU
  • Try out tanh but don’t expect much.
  • Don’t use sigmoid!
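
For reference, a minimal NumPy sketch of the activations discussed above (the 0.01 Leaky RELU slope and ELU alpha = 1.0 are common default choices, assumed here):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))               # saturates for large |x|, not zero-centered

def tanh(x):
    return np.tanh(x)                             # zero-centered, but still saturates

def relu(x):
    return np.maximum(0, x)                       # cheap; no saturation for x > 0, but dead for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)          # small negative slope keeps gradients alive for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # RELU-like, closer to zero-mean outputs

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), sep="\n")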

2. Data preprocessing

import numpy as np

# Zero-center the data (subtract the mean of every input feature, assuming X is [N x D]).
# One reason to do this: we want the inputs to take both positive and negative values,
# not to be all positive or all negative.
X -= np.mean(X, axis=0)

# Then divide by the standard deviation. Note: for images we usually skip this step.
X /= np.std(X, axis=0)


3. Weight initialization

3.1 Initialize everything to zero

Doesn't work: every neuron computes the same output and receives the same gradient update, so the symmetry is never broken.


3.2 Random initialization



W = 0.01 * np.random.randn(D, H)
# Works OK for small networks but causes problems in deeper networks
# (the activations shrink toward zero layer by layer).


W = 1.0 * np.random.randn(D, H)
# Also problematic for deeper networks: with this larger scale the activations saturate and the gradients vanish.

3.3 Xavier initialization


# fan_in / fan_out = number of inputs / outputs of the layer
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)


3.4 He initialization (Solution for the RELU issue)


W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
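
A small sketch of these two initializers, plus the usual sanity check of watching per-layer activation statistics in a deep net (the layer sizes and depth below are arbitrary, for illustration only):

import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: keeps the variance of the activations roughly constant for tanh-like units.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # He: the extra factor of 2 compensates for RELU zeroing out half of its inputs.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)

# Push data through 10 RELU layers and watch the activation std per layer.
# With he_init it stays roughly stable; with 0.01 * randn it collapses toward zero.
np.random.seed(0)
h = np.random.randn(1000, 500)
for layer in range(10):
    h = np.maximum(0, h @ he_init(500, 500))
    print(f"layer {layer}: mean={h.mean():.3f}, std={h.std():.3f}")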

4. Batch normalization

The idea is to make the input to every layer have mean = 0 and var = 1.



  1. Compute the mean and variance over the batch.
  2. Subtract the mean and divide by sqrt(variance + epsilon); epsilon just keeps the denominator from being zero.
  3. Then scale and shift: result = gamma * normalizedX + beta, where gamma and beta are learnable parameters. This lets the network learn whatever distribution suits each layer, since a unit Gaussian isn't necessarily best, and makes every layer more flexible (sketched below).
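
A minimal sketch of the batch normalization forward pass described in these steps (training mode only; a real layer also keeps running mean/variance estimates for test time):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N, D] batch of activations; gamma, beta: learnable scale and shift of shape [D].
    mu = x.mean(axis=0)                      # 1. per-feature mean over the batch
    var = x.var(axis=0)                      #    ... and variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # 2. normalize (eps keeps the denominator nonzero)
    return gamma * x_hat + beta              # 3. scale and shift to a learned distribution

# Toy usage: the outputs end up with ~0 mean and ~1 std per feature (before scale/shift).
x = 5.0 * np.random.randn(32, 4) + 3.0
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))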

Benefits of batch normalization:

  • Networks train faster.
  • Allows higher learning rates.
  • Helps reduce the sensitivity to the initial weights.
  • Makes more activation functions viable.
  • Provides some regularization. Because we are calculating mean and variance for each batch that gives a slight regularization effect.

After a conv layer, each activation map (channel) gets its own mean and variance; the statistics are not shared across channels.
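
For conv activations this looks roughly like the following (a sketch, assuming the layout is [N, C, H, W]):

import numpy as np

# Each of the C activation maps gets its own mean/var, computed over the batch and spatial dims.
x = np.random.randn(8, 16, 32, 32)             # N=8 examples, C=16 channels, H=W=32
mu = x.mean(axis=(0, 2, 3), keepdims=True)     # shape [1, 16, 1, 1]
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)
print(x_hat.shape)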

Batch normalization works best in ordinary conv nets and fully connected nets; for recurrent networks and reinforcement learning it is still an active research topic.

5. Babysitting the learning process

  • Preprocessing of data.
  • Choose the architecture.
  • Make a forward pass with regularization disabled and check that the initial loss is reasonable (see the sanity check after this list).
  • Add regularization; the loss should go up!
  • Disable regularization again, take a small subset of the data, and train on it until the loss reaches zero. (You should be able to overfit a small dataset perfectly.)
  • Take your full training data with a small amount of regularization, then try some values for the learning rate.
    • If the loss is barely changing, the learning rate is too small.
    • If you get NaN, your network exploded and the learning rate is too high.
    • Find your learning rate range by locating the smallest value that still changes the loss and the largest value that doesn't explode the network.
  • Do hyperparameter optimization to find the best hyperparameter values.
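
A minimal sketch of the first sanity check (a softmax classifier with regularization disabled; the 10-class setting is just an example):

import numpy as np

# With tiny random weights the scores are roughly zero, so every class gets probability ~1/10
# and the initial loss should be close to -ln(1/10) = ln(10) ≈ 2.302.
num_classes = 10
scores = np.zeros((1, num_classes))
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[0, 0])                 # loss for any correct label
print(loss, np.log(num_classes))            # both ≈ 2.302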

6. Hyperparameter Optimization

  • Use a cross-validation strategy.
  • It's best to search in log space.
  • Adjust your ranges and try again.
  • It's better to use random search than grid search (in log space); see the sketch below.
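
A minimal sketch of random search in log space over the learning rate and regularization strength (the ranges and the commented-out train_and_eval routine are placeholders, not from the notes):

import numpy as np

# Sample hyperparameters uniformly in log space, run a short training for each,
# then tighten the ranges around the best results and repeat.
for trial in range(20):
    lr = 10 ** np.random.uniform(-6, -3)     # learning rate in [1e-6, 1e-3]
    reg = 10 ** np.random.uniform(-5, 5)     # regularization strength in [1e-5, 1e5]
    # val_acc = train_and_eval(lr=lr, reg=reg)   # hypothetical training/validation routine
    print(f"trial {trial}: lr={lr:.2e}, reg={reg:.2e}")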


  1. Open question: how do we tell whether training has actually become faster?