%!TEX root = main.tex
\section{Introduction}
Artificial neural networks have dozens of hyperparameters which influence
their behaviour during training and evaluation. One of these hyperparameters is
the choice of activation functions. While in principle every neuron could have
a different activation function, in practice networks typically use only two
activation functions: the softmax function for the output layer, in order to
obtain a probability distribution over the possible classes, and a single
activation function for all other neurons.

Activation functions should have the following properties:
\begin{itemize}
    \item \textbf{Non-linearity}: A linear activation function in a simple
          feed-forward network leads to a linear function. This means that no
          matter how many layers the network uses, there is an equivalent
          network with only the input and the output layer (see the short
          derivation after this list). Please note that \glspl{CNN} are
          different, as padding and pooling are also non-linear operations.
    \item \textbf{Differentiability}: Activation functions need to be
          differentiable in order to be able to apply gradient descent. It is
          not necessary that they are differentiable at every point. In
          practice, the gradient at non-differentiable points can simply be
          set to zero in order to prevent weight updates at those points.
    \item \textbf{Non-zero gradient}: The sign function is not suitable for
          gradient descent based optimizers as its gradient is zero at all
          differentiable points. An activation function should have infinitely
          many points with non-zero gradient.
\end{itemize}
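To make the non-linearity argument concrete, consider two affine layers with
weight matrices $W_1, W_2$ and biases $b_1, b_2$ (generic symbols, used only
for this illustration) and no non-linear activation in between:
\[
    f(x) = W_2 (W_1 x + b_1) + b_2
         = \underbrace{(W_2 W_1)}_{=: W} x + \underbrace{(W_2 b_1 + b_2)}_{=: b}
         = W x + b
\]
Hence a stack of layers with a linear activation is equivalent to a single
affine layer; the argument extends inductively to any number of layers.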
One of the simplest and most widely used activation functions for \glspl{CNN}
is \gls{ReLU}~\cite{AlexNet-2012}, but others such as
\gls{ELU}~\cite{clevert2015fast}, \gls{PReLU}~\cite{he2015delving},
softplus~\cite{7280459} and softsign~\cite{bergstra2009quadratic} have been
proposed.

Activation functions differ in their range of values and their derivative. The
definitions and other comparisons of eleven activation functions are given
in~\cref{table:activation-functions-overview}.


\section{Important Differences of Proposed Activation Functions}
Theoretical explanations for why one activation function is preferable to
another in some scenarios are the following:
\begin{itemize}
    \item \textbf{Vanishing gradient}: Activation functions like tanh and the
          logistic function saturate outside of the interval $[-5, 5]$, as the
          short calculation after this list illustrates. This means that weight
          updates are very small for preceding neurons, which is especially a
          problem for very deep or recurrent networks as described
          in~\cite{bengio1994learning}. Even if the neurons learn eventually,
          learning is slower~\cite{AlexNet-2012}.
    \item \textbf{Dying ReLU}: The dying \gls{ReLU} problem is similar to the
          vanishing gradient problem. The gradient of the \gls{ReLU} function
          is~0 for all non-positive values. This means that if all elements of
          the training set lead to a negative input for one neuron at any point
          in the training process, this neuron does not get any update and
          hence does not participate in the training process. This problem is
          addressed in~\cite{maas2013rectifier}.
    \item \textbf{Mean unit activation}: Some publications
          like~\cite{clevert2015fast,BatchNormalization-2015} claim that mean
          unit activations close to~0 are desirable, as this speeds up learning
          by reducing the bias shift effect. The speedup of learning is
          supported by many experiments. Hence the possibility of negative
          activations is desirable.
\end{itemize}
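As a rough illustration of this saturation, the logistic function
$\sigma(x) = \frac{1}{1 + e^{-x}}$ has the derivative
\[
    \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4},
    \qquad
    \sigma'(5) = \sigma(5)\bigl(1 - \sigma(5)\bigr) \approx 0.0066
\]
During backpropagation the upstream gradient is multiplied by $\sigma'(z)$ at
every logistic unit, so for saturated units ($|z| > 5$) the gradient that
reaches earlier layers shrinks rapidly with depth.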
Those considerations are listed
in~\cref{table:properties-of-activation-functions} for 11~activation functions.
Besides the theoretical properties, empirical results are provided
in~\cref{table:CIFAR-100-accuracies-activation-functions,table:CIFAR-100-timing-activation-functions}.
The baseline network was adjusted so that every activation function except the
one of the output layer was replaced by one of the 11~activation functions.

As expected, \gls{PReLU} and \gls{ELU} performed best. Unexpectedly, the
logistic function, tanh and softplus performed worse than the identity, and it
is unclear why the pure-softmax network performed so much better than the
logistic function.
One hypothesis why the logistic function performs so badly is that it cannot
produce negative outputs. Hence the logistic$^-$ function was developed:
\[\text{logistic}^{-}(x) = \frac{1}{1 + e^{-x}} - 0.5\]
The logistic$^-$ function has the same derivative as the logistic function and
hence still suffers from the vanishing gradient problem.
The network with the logistic$^-$ function achieves an accuracy which is
\SI{11.30}{\percent} better than the network with the logistic function, but is
still \SI{5.54}{\percent} worse than the \gls{ELU}.

Similarly, \gls{ReLU} was adjusted to have a negative output:
\[\text{ReLU}^{-}(x) = \max(-1, x) = \text{ReLU}(x+1) - 1\]
The results of \gls{ReLU}$^-$ are much worse on the training set, but similar
on the test set. This indicates that the possibility of a hard zero, and thus a
sparse representation, is either not important or of similar importance as the
possibility to produce negative outputs. This
contradicts~\cite{glorot2011deep,srivastava2014understanding}.

A key difference between the logistic$^-$ function and \gls{ELU} is that
\gls{ELU} neither suffers from the vanishing gradient problem nor is its range
of values bounded. For this reason, the S2ReLU activation function was
developed:
\begin{align*}
    \StwoReLU(x) &= \ReLU\left(\frac{x}{2} + 1\right) - \ReLU\left(-\frac{x}{2} + 1\right)\\
    &=
    \begin{cases}
        \frac{x}{2} - 1 & \text{if } x \le -2\\
        x               & \text{if } -2 \le x \le 2\\
        \frac{x}{2} + 1 & \text{if } x > 2
    \end{cases}
\end{align*}
This function is similar to SReLUs as introduced in~\cite{jin2016deep}. The
difference is that S2ReLU does not introduce learnable parameters. The S2ReLU
was designed to be symmetric, to be the identity close to zero and to have a
smaller absolute value than the identity farther away. It is easy to compute
and easy to implement.
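A minimal NumPy sketch of these three modified activation functions is given
below. It only mirrors the definitions above and is not the code used for the
experiments; the function names are chosen here for illustration.
\begin{verbatim}
import numpy as np

def logistic_minus(x):
    # Logistic function shifted down by 0.5 so that negative outputs
    # are possible.
    return 1.0 / (1.0 + np.exp(-x)) - 0.5

def relu_minus(x):
    # ReLU shifted so that the minimum output is -1: max(-1, x).
    return np.maximum(-1.0, x)

def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1)
    return (np.maximum(0.0, x / 2.0 + 1.0)
            - np.maximum(0.0, -x / 2.0 + 1.0))
\end{verbatim}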
Those results --- not only the absolute values, but also the relative
comparison --- might depend on the network architecture, the training
algorithm, the initialization and the dataset. Results for MNIST can be found
in~\cref{table:MNIST-accuracies-activation-functions} and for HASYv2
in~\cref{table:HASYv2-accuracies-activation-functions}. For both datasets, the
logistic function has a much shorter training time and a noticeably lower test
accuracy.

\glsunset{LReLU}
\begin{table}[H]
    \centering
    \begin{tabular}{lccc}
        \toprule
        \multirow{2}{*}{Function} & Vanishing & Negative activation & Bounded \\
                                  & gradient  & possible            & activation \\\midrule
        Identity          & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        Logistic          & \cellcolor{red!25}Yes                 & \cellcolor{red!25}No    & \cellcolor{red!25}Yes \\
        Logistic$^-$      & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        Softmax           & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        tanh              & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        Softsign          & \cellcolor{red!25}Yes                 & \cellcolor{green!25}Yes & \cellcolor{red!25}Yes \\
        ReLU              & \cellcolor{yellow!25}Yes\footnotemark & \cellcolor{red!25}No    & \cellcolor{yellow!25}Half-sided \\
        Softplus          & \cellcolor{green!25}No                & \cellcolor{red!25}No    & \cellcolor{yellow!25}Half-sided \\
        S2ReLU            & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        \gls{LReLU}/PReLU & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        ELU               & \cellcolor{green!25}No                & \cellcolor{green!25}Yes & \cellcolor{green!25}No \\
        \bottomrule
    \end{tabular}
    \caption[Activation function properties]{Properties of activation functions.}
    \label{table:properties-of-activation-functions}
\end{table}
\footnotetext{The dying ReLU problem is similar to the vanishing gradient problem.}

\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
        \toprule
        \multirow{2}{*}{Function} & \multicolumn{2}{c}{Inference per} & Training & \multirow{2}{*}{Epochs} & Mean total \\\cline{2-3}
                     & 1 Image & 128 Images & time & & training time \\\midrule
        Identity     & \SI{8}{\milli\second} & \SI{42}{\milli\second}          & \SI{31}{\second\per\epoch}          & 108 -- \textbf{148} & \SI{3629}{\second} \\
        Logistic     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 167 & \textbf{\SI{2234}{\second}} \\
        Logistic$^-$ & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \textbf{\SI{22}{\second\per\epoch}} & 133 -- 255          & \SI{3421}{\second} \\
        Softmax      & \SI{7}{\milli\second} & \SI{37}{\milli\second}          & \SI{33}{\second\per\epoch}          & 127 -- 248          & \SI{5250}{\second} \\
        Tanh         & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 125 -- 211          & \SI{3141}{\second} \\
        Softsign     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 122 -- 205          & \SI{3505}{\second} \\
        \gls{ReLU}   & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 118 -- 192          & \SI{3449}{\second} \\
        Softplus     & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{24}{\second\per\epoch}          & \textbf{101} -- 165 & \SI{2718}{\second} \\
        S2ReLU       & \textbf{\SI{5}{\milli\second}} & \SI{32}{\milli\second} & \SI{26}{\second\per\epoch}          & 108 -- 209          & \SI{3231}{\second} \\
        \gls{LReLU}  & \SI{7}{\milli\second} & \SI{34}{\milli\second}          & \SI{25}{\second\per\epoch}          & 109 -- 198          & \SI{3388}{\second} \\
        \gls{PReLU}  & \SI{7}{\milli\second} & \SI{34}{\milli\second}          & \SI{28}{\second\per\epoch}          & 131 -- 215          & \SI{3970}{\second} \\
        \gls{ELU}    & \SI{6}{\milli\second} & \textbf{\SI{31}{\milli\second}} & \SI{23}{\second\per\epoch}          & 146 -- 232          & \SI{3692}{\second} \\
        \bottomrule
    \end{tabular}
    \caption[Activation function timing results on CIFAR-100]{Training and
        inference time of adjusted baseline models trained with different
        activation functions on GTX~970 \glspl{GPU} on CIFAR-100. The identity
        was expected to be the fastest function; that it is not is most likely
        an implementation-specific issue of Keras~2.0.4 or TensorFlow~1.1.0.}
    \label{table:CIFAR-100-timing-activation-functions}
\end{table}

\begin{table}[H]
    \centering
    \begin{tabular}{lccccc}
        \toprule
        \multirow{2}{*}{Function} & \multicolumn{2}{c}{Single model} & Ensemble & \multicolumn{2}{c}{Epochs}\\\cline{2-3}\cline{5-6}
                    & Accuracy & std & Accuracy & Range & Mean \\\midrule
        Identity    & \SI{99.45}{\percent}          & $\sigma=0.09$              & \SI{99.63}{\percent}          & 55 -- \hphantom{0}77          & 62.2\\%TODO: Really?
        Logistic    & \SI{97.27}{\percent}          & $\sigma=2.10$              & \SI{99.48}{\percent}          & \textbf{37} -- \hphantom{0}76 & \textbf{54.5}\\
        Softmax     & \SI{99.60}{\percent}          & $\boldsymbol{\sigma=0.03}$ & \SI{99.63}{\percent}          & 44 -- \hphantom{0}73          & 55.6\\
        Tanh        & \SI{99.40}{\percent}          & $\sigma=0.09$              & \SI{99.57}{\percent}          & 56 -- \hphantom{0}80          & 67.6\\
        Softsign    & \SI{99.40}{\percent}          & $\sigma=0.08$              & \SI{99.57}{\percent}          & 72 -- 101                     & 84.0\\
        \gls{ReLU}  & \textbf{\SI{99.62}{\percent}} & $\sigma=0.04$              & \textbf{\SI{99.73}{\percent}} & 51 -- \hphantom{0}94          & 71.7\\
        Softplus    & \SI{99.52}{\percent}          & $\sigma=0.05$              & \SI{99.62}{\percent}          & 62 -- \hphantom{0}\textbf{70} & 68.9\\
        \gls{PReLU} & \SI{99.57}{\percent}          & $\sigma=0.07$              & \textbf{\SI{99.73}{\percent}} & 44 -- \hphantom{0}89          & 71.2\\
        \gls{ELU}   & \SI{99.53}{\percent}          & $\sigma=0.06$              & \SI{99.58}{\percent}          & 45 -- 111                     & 72.5\\
        \bottomrule
    \end{tabular}
    \caption[Activation function evaluation results on MNIST]{Test accuracy of
        adjusted baseline models trained with different activation functions
        on MNIST.}
    \label{table:MNIST-accuracies-activation-functions}
\end{table}
\glsreset{LReLU}
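The adjusted baseline models replace every activation function except the
softmax of the output layer. A minimal, hypothetical Keras sketch of this
procedure is given below; the small dense model only stands in for the actual
baseline architecture, and the custom activation is passed as a callable
(Keras/TensorFlow versions as in
\cref{table:CIFAR-100-timing-activation-functions}).
\begin{verbatim}
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Flatten

def s2relu(x):
    # S2ReLU(x) = ReLU(x/2 + 1) - ReLU(-x/2 + 1), written with backend ops
    return K.relu(x / 2.0 + 1.0) - K.relu(-x / 2.0 + 1.0)

def build_model(activation, num_classes=100):
    # Tiny stand-in for the baseline network: every hidden activation is
    # replaced, while the output layer keeps the softmax.
    model = Sequential()
    model.add(Flatten(input_shape=(32, 32, 3)))
    model.add(Dense(128, activation=activation))
    model.add(Dense(128, activation=activation))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model(s2relu)  # or 'relu', 'elu', 'softplus', ...
\end{verbatim}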