%!TEX root = main.tex
\section{Introduction}
Artificial neural networks have dozens of hyperparameters which influence
their behaviour during training and evaluation. One of these hyperparameters is
the choice of activation functions. While in principle every neuron could have
a different activation function, in practice networks typically use only two
activation functions: the softmax function for the output layer, in order to
obtain a probability distribution over the possible classes, and a single
activation function for all other neurons.

Activation functions should have the following properties:
\begin{itemize}
    \item \textbf{Non-linearity}: A linear activation function in a simple
          feed-forward network leads to a linear function. This means that no
          matter how many layers the network uses, there is an equivalent
          network with only the input and the output layer (see the short
          derivation after this list). Please note that \glspl{CNN} are
          different, as padding and pooling are also non-linear operations.
    \item \textbf{Differentiability}: Activation functions need to be
          differentiable in order to be able to apply gradient descent. It is
          not necessary that they are differentiable at every point. In
          practice, the gradient at non-differentiable points can simply be
          set to zero in order to prevent weight updates at those points.
    \item \textbf{Non-zero gradient}: The sign function is not suitable for
          gradient descent based optimizers as its gradient is zero at all
          differentiable points. An activation function should have infinitely
          many points with non-zero gradient.
\end{itemize}
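To make the non-linearity argument concrete, consider two affine layers with
weight matrices $W_1, W_2$ and biases $b_1, b_2$ (generic symbols, used only
for this illustration) and no non-linear activation in between:
\[
    f(x) = W_2 (W_1 x + b_1) + b_2
         = \underbrace{(W_2 W_1)}_{=: W} x + \underbrace{(W_2 b_1 + b_2)}_{=: b}
         = W x + b
\]
Hence a stack of layers with a linear activation is equivalent to a single
affine layer; the argument extends inductively to any number of layers.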
One of the simplest and most widely used activation functions for \glspl{CNN}
is \gls{ReLU}~\cite{AlexNet-2012}, but others such as
\gls{ELU}~\cite{clevert2015fast}, \gls{PReLU}~\cite{he2015delving},
softplus~\cite{7280459} and softsign~\cite{bergstra2009quadratic} have been
proposed.

Activation functions differ in their range of values and their derivative. The
definitions and other comparisons of eleven activation functions are given
in~\cref{table:activation-functions-overview}.


\section{Important Differences of Proposed Activation Functions}
Theoretical explanations for why one activation function is preferable to
another in some scenarios are the following:
\begin{itemize}
    \item \textbf{Vanishing gradient}: Activation functions like tanh and the
          logistic function saturate outside of the interval $[-5, 5]$, as the
          short calculation after this list illustrates. This means that weight
          updates are very small for preceding neurons, which is especially a
          problem for very deep or recurrent networks as described
          in~\cite{bengio1994learning}. Even if the neurons learn eventually,
          learning is slower~\cite{AlexNet-2012}.
    \item \textbf{Dying ReLU}: The dying \gls{ReLU} problem is similar to the
          vanishing gradient problem. The gradient of the \gls{ReLU} function
          is~0 for all non-positive values. This means that if all elements of
          the training set lead to a negative input for one neuron at any point
          in the training process, this neuron does not get any update and
          hence does not participate in the training process. This problem is
          addressed in~\cite{maas2013rectifier}.
    \item \textbf{Mean unit activation}: Some publications
          like~\cite{clevert2015fast,BatchNormalization-2015} claim that mean
          unit activations close to~0 are desirable, as this speeds up learning
          by reducing the bias shift effect. The speedup of learning is
          supported by many experiments. Hence the possibility of negative
          activations is desirable.
\end{itemize}
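As a rough illustration of this saturation, the logistic function
$\sigma(x) = \frac{1}{1 + e^{-x}}$ has the derivative
\[
    \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},
    \qquad
    \sigma'(5) = \sigma(5)\bigl(1 - \sigma(5)\bigr) \approx 0.0066
\]
During backpropagation the upstream gradient is multiplied by $\sigma'(z)$ at
every logistic unit, so for saturated units ($|z| > 5$) the gradient that
reaches earlier layers shrinks rapidly with depth.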
Those considerations are listed
in~\cref{table:properties-of-activation-functions} for 11~activation functions.
Besides the theoretical properties, empirical results are provided
in~\cref{table:CIFAR-100-accuracies-activation-functions,table:CIFAR-100-timing-activation-functions}.
The baseline network was adjusted so that every activation function except the
one of the output layer was replaced by one of the 11~activation functions.

As expected, \gls{PReLU} and \gls{ELU} performed best. Unexpectedly, the
logistic function, tanh and softplus performed worse than the identity, and it
is unclear why the pure-softmax network performed so much better than the
logistic function.
One hypothesis why the logistic function performs so badly is that it cannot
produce negative outputs. Hence the logistic$^-$ function was developed:
\[\text{logistic}^{-}(x) = \frac{1}{1 + e^{-x}} - 0.5\]
The logistic$^-$ function has the same derivative as the logistic function and
hence still suffers from the vanishing gradient problem.
The network with the logistic$^-$ function achieves an accuracy which is
\SI{11.30}{\percent} better than the network with the logistic function, but is
still \SI{5.54}{\percent} worse than the \gls{ELU}.

Similarly, \gls{ReLU} was adjusted to have a negative output:
\[\text{ReLU}^{-}(x) = \max(-1, x) = \text{ReLU}(x+1) - 1\]
The results of \gls{ReLU}$^-$ are much worse on the training set, but similar
on the test set. This indicates that the possibility of a hard zero, and thus a
sparse representation, is either not important or of similar importance as the
possibility to produce negative outputs. This
contradicts~\cite{glorot2011deep,srivastava2014understanding}.

A key difference between the logistic$^-$ function and \gls{ELU} is that
\gls{ELU} neither suffers from the vanishing gradient problem nor is its range
of values bounded. For this reason, the S2ReLU activation function was
developed:
\begin{align*}
    \StwoReLU(x) &= \ReLU\left(\frac{x}{2} + 1\right) - \ReLU\left(-\frac{x}{2} + 1\right)\\
    &=
    \begin{cases}
        \frac{x}{2} - 1 & \text{if } x \le -2\\
        x               & \text{if } -2 \le x \le 2\\
        \frac{x}{2} + 1 & \text{if } x > 2
    \end{cases}
\end{align*}
This function is similar to SReLUs as introduced in~\cite{jin2016deep}. The
difference is that S2ReLU does not introduce learnable parameters. The S2ReLU
was designed to be symmetric, to be the identity close to zero and to have a
smaller absolute value than the identity farther away. It is easy to compute
and easy to implement.
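A minimal NumPy sketch of these three modified activation functions is given
below. It only mirrors the definitions above and is not the code used for the
experiments; the function names are chosen here for illustration.
\begin{verbatim}
import numpy as np

def logistic_minus(x):
    # Logistic function shifted down by 0.5 so that negative outputs
    # are possible.
    return 1.0 / (1.0 + np.exp(-x)) - 0.5

def relu_minus(x):
    # ReLU shifted so that the minimum output is -1: max(-1, x).
    return np.maximum(-1.0, x)

def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1)
    return (np.maximum(0.0, x / 2.0 + 1.0)
            - np.maximum(0.0, -x / 2.0 + 1.0))
\end{verbatim}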
Those results --- not only the absolute values, but also the relative
comparison --- might depend on the network architecture, the training
algorithm, the initialization and the dataset. Results for MNIST can be found
in~\cref{table:MNIST-accuracies-activation-functions} and for HASYv2
in~\cref{table:HASYv2-accuracies-activation-functions}. For both datasets, the
logistic function has a much shorter training time and a noticeably lower test
accuracy.

\glsunset{LReLU}
\begin{table}[H]
    \centering
    \begin{tabular}{lccc}
        \toprule
        \multirow{2}{*}{Function} & Vanishing & Negative activation & Bounded \\
                                  & gradient  & possible            & activation \\\midrule
        Identity          & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        Logistic          & \cellcolor{red!25}Yes                 & \cellcolor{red!25}No    & \cellcolor{red!25}Yes \\
        Logistic$^-$      & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        Softmax           & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        tanh              & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        Softsign          & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        ReLU              & \cellcolor{yellow!25}Yes\footnotemark & \cellcolor{red!25}No    & \cellcolor{yellow!25}Half-sided \\
        Softplus          & \cellcolor{green!25}No                & \cellcolor{red!25}No    & \cellcolor{yellow!25}Half-sided \\
        S2ReLU            & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        \gls{LReLU}/PReLU & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        ELU               & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        \bottomrule
    \end{tabular}
    \caption[Activation function properties]{Properties of activation functions.}
    \label{table:properties-of-activation-functions}
\end{table}
\footnotetext{The dying ReLU problem is similar to the vanishing gradient problem.}

\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
        \toprule
        \multirow{2}{*}{Function} & \multicolumn{2}{c}{Inference per} & Training & \multirow{2}{*}{Epochs} & Mean total \\\cline{2-3}
                     & 1 Image & 128 Images & time & & training time \\\midrule
        Identity     & \SI{8}{\milli\second} & \SI{42}{\milli\second}          & \SI{31}{\second\per\epoch}          & 108 -- \textbf{148} & \SI{3629}{\second} \\
        Logistic     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 167 & \textbf{\SI{2234}{\second}} \\
        Logistic$^-$ & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \textbf{\SI{22}{\second\per\epoch}} & 133 -- 255          & \SI{3421}{\second} \\
        Softmax      & \SI{7}{\milli\second} & \SI{37}{\milli\second}          & \SI{33}{\second\per\epoch}          & 127 -- 248          & \SI{5250}{\second} \\
        Tanh         & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 125 -- 211          & \SI{3141}{\second} \\
        Softsign     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 122 -- 205          & \SI{3505}{\second} \\
        \gls{ReLU}   & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 118 -- 192          & \SI{3449}{\second} \\
        Softplus     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 165 & \SI{2718}{\second} \\
        S2ReLU       & \textbf{\SI{5}{\milli\second}} & \SI{32}{\milli\second} & \SI{26}{\second\per\epoch}          & 108 -- 209          & \SI{3231}{\second} \\
        \gls{LReLU}  & \SI{7}{\milli\second} & \SI{34}{\milli\second}          & \SI{25}{\second\per\epoch}          & 109 -- 198          & \SI{3388}{\second} \\
        \gls{PReLU}  & \SI{7}{\milli\second} & \SI{34}{\milli\second}          & \SI{28}{\second\per\epoch}          & 131 -- 215          & \SI{3970}{\second} \\
        \gls{ELU}    & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 146 -- 232          & \SI{3692}{\second} \\
        \bottomrule
    \end{tabular}
    \caption[Activation function timing results on CIFAR-100]{Training and
        inference time of adjusted baseline models trained with different
        activation functions on GTX~970 \glspl{GPU} on CIFAR-100. The identity
        was expected to be the fastest function; that it is not is most likely
        an implementation-specific issue of Keras~2.0.4 or TensorFlow~1.1.0.}
    \label{table:CIFAR-100-timing-activation-functions}
\end{table}

\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
        \toprule
        \multirow{2}{*}{Function} & \multicolumn{2}{c}{Single model} & Ensemble & \multicolumn{2}{c}{Epochs}\\\cline{2-3}\cline{5-6}
                    & Accuracy & std & Accuracy & Range & Mean \\\midrule
        Identity    & \SI{99.45}{\percent}          & $\sigma=0.09$              & \SI{99.63}{\percent}          & 55 -- \hphantom{0}77          & 62.2\\%TODO: Really?
        Logistic    & \SI{97.27}{\percent}          & $\sigma=2.10$              & \SI{99.48}{\percent}          & \textbf{37} -- \hphantom{0}76 & \textbf{54.5}\\
        Softmax     & \SI{99.60}{\percent}          & $\boldsymbol{\sigma=0.03}$ & \SI{99.63}{\percent}          & 44 -- \hphantom{0}73          & 55.6\\
        Tanh        & \SI{99.40}{\percent}          & $\sigma=0.09$              & \SI{99.57}{\percent}          & 56 -- \hphantom{0}80          & 67.6\\
        Softsign    & \SI{99.40}{\percent}          & $\sigma=0.08$              & \SI{99.57}{\percent}          & 72 -- 101                     & 84.0\\
        \gls{ReLU}  & \textbf{\SI{99.62}{\percent}} & $\sigma=0.04$              & \textbf{\SI{99.73}{\percent}} & 51 -- \hphantom{0}94          & 71.7\\
        Softplus    & \SI{99.52}{\percent}          & $\sigma=0.05$              & \SI{99.62}{\percent}          & 62 -- \hphantom{0}\textbf{70} & 68.9\\
        \gls{PReLU} & \SI{99.57}{\percent}          & $\sigma=0.07$              & \textbf{\SI{99.73}{\percent}} & 44 -- \hphantom{0}89          & 71.2\\
        \gls{ELU}   & \SI{99.53}{\percent}          & $\sigma=0.06$              & \SI{99.58}{\percent}          & 45 -- 111                     & 72.5\\
        \bottomrule
    \end{tabular}
    \caption[Activation function evaluation results on MNIST]{Test accuracy of
        adjusted baseline models trained with different activation functions
        on MNIST.}
    \label{table:MNIST-accuracies-activation-functions}
\end{table}
\glsreset{LReLU}
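The adjusted baseline models replace every activation function except the
softmax of the output layer. A minimal, hypothetical Keras sketch of this
procedure is given below; the small dense model only stands in for the actual
baseline architecture, and the custom activation is passed as a callable
(Keras/TensorFlow versions as in
\cref{table:CIFAR-100-timing-activation-functions}).
\begin{verbatim}
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Flatten

def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1), written with backend ops
    return K.relu(x / 2.0 + 1.0) - K.relu(-x / 2.0 + 1.0)

def build_model(activation, num_classes=100):
    # Tiny stand-in for the baseline network: every hidden activation is
    # replaced, while the output layer keeps the softmax.
    model = Sequential()
    model.add(Flatten(input_shape=(32, 32, 3)))
    model.add(Dense(128, activation=activation))
    model.add(Dense(128, activation=activation))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model(s2relu)  # or 'relu', 'elu', 'softplus', ...
\end{verbatim}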