%!TEX root = main.tex
\section{Introduction}
Publicly available datasets have helped the computer vision community to
compare new algorithms and develop applications. MNIST~\cite{LeNet-5} in
particular was used thousands of times to train and evaluate models for
classification. However, even rather simple models consistently achieve about
$\SI{99.2}{\percent}$ accuracy on MNIST~\cite{TF-MNIST-2016}. The best models
classify everything except for about 20~instances correctly. This makes
meaningful statements about improvements in classifiers hard. Possible reasons
why current models are so good on MNIST are:
\begin{enumerate*}
\item MNIST has only 10~classes,
\item there are very few (probably no) labeling errors in MNIST,
\item every class has \num{6000}~training samples, and
\item the feature dimensionality is comparatively low.
\end{enumerate*}
Also, applications which need to recognize only Arabic numerals are rare.

Similar to MNIST, \dbName{} is of very low resolution. In contrast to MNIST,
the \dbNameVersion~dataset contains \dbTotalClasses~classes, including Arabic
numerals and Latin characters. Furthermore, \dbNameVersion{} has far fewer
recordings per class than MNIST and is only in black and white, whereas
MNIST is in grayscale.

\dbName{} could be used to train models for semantic segmentation of
non-cursive handwritten documents like mathematical notes or forms.

\section{Terminology}
A \textit{symbol} is an atomic semantic entity which has exactly one visual
appearance when it is handwritten. Examples of symbols are:
$\alpha, \propto, \cdot, x, \int, \sigma, \dots$
%\footnote{The first symbol is an \verb+\alpha+, the second one is a \verb+\propto+.}

While a symbol is a single semantic entity with a given visual appearance, a
\textit{glyph} is a single typesetting entity. Symbols, glyphs and \LaTeX{}
commands do not map onto each other one-to-one:

\begin{itemize}
\item Two different symbols can have the same glyph. For example, the symbols
\verb+\sum+ and \verb+\Sigma+ both render to $\Sigma$, but they have different
semantics and hence they are different symbols.
\item Two different glyphs can correspond to the same semantic entity. An example is
\verb+\varphi+ ($\varphi$) and \verb+\phi+ ($\phi$): Both represent the small
Greek letter \enquote{phi}, but they exist in two different variants. Hence
\verb+\varphi+ and \verb+\phi+ are two different symbols.
\item Examples of different \LaTeX{} commands that represent the same symbol are
\verb+\alpha+ ($\alpha$) and \verb+\upalpha+ ($\upalpha$): Both have the same
semantics and are hand-drawn the same way. This is the case for all \verb+\up+
variants of Greek letters.
\end{itemize}

All elements of the data set are called \textit{recordings} in the following.


\section{How HASY was created}
\dbName{} is derived from the HWRT dataset which was first used and described
in~\cite{Thoma:2014}. HWRT is an on-line recognition dataset, meaning it does
not contain the handwritten symbols as images, but as point sequences. Hence
HWRT contains strictly more information than \dbName. The larger dimension
of each recording's bounding box was scaled to \SI{32}{\pixel}. The image
was then centered within the $\SI{32}{\pixel} \times \SI{32}{\pixel}$ bounding
box.
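A minimal Python sketch of this scale-and-center preprocessing is given below.
It assumes the point sequences have already been rasterized to a grayscale PIL
image with a white background; the function is illustrative and is not the
code that was actually used to render \dbName{}.

\begin{verbatim}
from PIL import Image

def scale_and_center(img, size=32):
    """Scale the larger dimension of a symbol image to `size` pixels
    and center it on a white size x size canvas."""
    w, h = img.size
    factor = size / max(w, h)          # larger dimension -> 32 px
    new_w = max(1, round(w * factor))
    new_h = max(1, round(h * factor))
    img = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("L", (size, size), 255)   # white background
    canvas.paste(img, ((size - new_w) // 2, (size - new_h) // 2))
    # Binarize, since the dataset is black and white; the threshold
    # of 128 is an assumption.
    return canvas.point(lambda p: 0 if p < 128 else 255)
\end{verbatim}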
\begin{figure}[h]
\centering
\includegraphics*[width=\linewidth, keepaspectratio]{figures/sample-images.png}
\caption{100 recordings of the \dbNameVersion{} data set.}
\label{fig:100-data-items}
\end{figure}

HWRT contains exactly the same recordings and classes as \dbName, but \dbName{}
is rendered in order to make it easy to use.

HWRT, and hence \dbName{}, is a merged dataset. $\SI{91.93}{\percent}$ of HWRT
was collected by Detexify~\cite{Kirsch,Kirsch2014}. The remaining recordings
were collected by \href{http://write-math.com}{http://write-math.com}. Both
projects aim at helping users to find \LaTeX{} commands in cases where the
users know how to write the symbol, but not the symbol's name. The user writes
the symbol on a blank canvas in the browser (either via touch devices or with a
mouse). Then the websites show the top-$k$ symbols which the user could have
meant. The user then clicks on the correct symbol to accept it as the correct
symbol. On \href{http://write-math.com}{write-math.com}, other users can also
suggest which symbol could be the correct one.

After collecting the data, Martin Thoma manually inspected each recording. This
manual inspection is a necessary step as anonymous web users could submit any
drawing they wanted for any symbol. This includes many creative recordings as
shown in~\cite{Kirsch,Thoma:2014} as well as loose associations. In some cases,
the correct label was unambiguous and the label was changed accordingly. In
other cases, the recordings were removed from the data set.

It is not possible to determine the exact number of people who contributed
handwritten symbols to the Detexify part of the dataset. The part which was
created with \href{http://write-math.com}{write-math.com} was created by
477~user~IDs. Although user IDs are given in the dataset, they are not
reliable. On the one hand, the Detexify data has the user ID~16925, although
many users contributed to it. Also, some users lent their phones to others
while being logged in to show how write-math.com works. This leads to multiple
users per user ID. On the other hand, some users don't register and use
write-math.com multiple times. This can lead to multiple user IDs for one
person.

\section{Classes}
The \dbNameVersion~dataset contains \dbTotalClasses~classes. Those classes
include the Latin uppercase and lowercase characters (\verb+A-Z+, \verb+a-z+),
the Arabic numerals (\verb+0-9+), 32~different types of arrows, fraktur and
calligraphic Latin characters, brackets and more. See
\cref{table:symbols-of-db-0,table:symbols-of-db-1,table:symbols-of-db-2,table:symbols-of-db-3,table:symbols-of-db-4,table:symbols-of-db-5,table:symbols-of-db-6,table:symbols-of-db-7,table:symbols-of-db-8}
for more information.

\section{Data}
The \dbNameVersion~dataset contains \dbTotalInstances{} black and white images
of the size $\SI{32}{\pixel} \times \SI{32}{\pixel}$. Each image is labeled
with one of \dbTotalClasses~labels. An example of 100~elements of the
\dbNameVersion{} data set is shown in~\cref{fig:100-data-items}.
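The rendered data can be loaded with a few lines of Python. The following
sketch assumes the file layout of the published dataset, with a CSV file
(here \verb+hasy-data-labels.csv+) that lists one image path and its label per
row; the column names \verb+path+ and \verb+latex+ are assumptions.

\begin{verbatim}
import csv
import numpy as np
from PIL import Image

def load_hasy(labels_csv="hasy-data-labels.csv"):
    """Load all recordings as flat 1024-dimensional feature vectors.

    Assumes a CSV with 'path' and 'latex' columns; paths are
    relative to the CSV's directory."""
    features, labels = [], []
    with open(labels_csv) as f:
        for row in csv.DictReader(f):
            img = Image.open(row["path"]).convert("L")
            features.append(np.asarray(img, dtype=np.uint8).flatten())
            labels.append(row["latex"])
    return np.array(features), np.array(labels)
\end{verbatim}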
The average fraction of black pixels is \SI{16}{\percent}, but this is highly
class-dependent: the class-wise average ranges from \SI{3.7}{\percent} for
\enquote{$\dotsc$} to \SI{59.2}{\percent} for \enquote{$\blacksquare$}.

The ten classes with the most recordings are:
\[\int, \sum, \infty, \alpha, \xi, \equiv, \partial, \mathds{R}, \in, \square\]
Those symbols have \num{26780} recordings and thus account for
\SI{16}{\percent} of the data set. In total, 47~classes have more than
\num{1000} recordings. The number of recordings of the remaining classes is
distributed as visualized in~\cref{fig:class-data-distribution}.

\begin{figure}[h]
\centering
\includegraphics*[width=\linewidth, keepaspectratio]{figures/data-dist}
\caption{Distribution of the data among classes. The 47~classes with
more than \num{1000} recordings are not shown.}
\label{fig:class-data-distribution}
\end{figure}

A weakness of \dbNameVersion{} is the amount of available data per class. For
some classes, there are only 51~elements in the test set.

The data has $32\cdot 32 = 1024$ features in $\Set{0, 255}$.
As~\cref{table:pca-explained-variance} shows, \SI{32}{\percent} of the features
can explain~\SI{90}{\percent} of the variance, \SI{54}{\percent} of the
features explain \SI{95}{\percent} of the variance and \SI{86}{\percent} of the
features explain \SI{99}{\percent} of the variance.

\begin{table}[h]
\centering
\begin{tabular}{lccc}
\toprule
Principal Components & 331 & 551 & 882 \\
Explained Variance & \SI{90}{\percent} & \SI{95}{\percent} & \SI{99}{\percent} \\
\bottomrule
\end{tabular}
\caption{The number of principal components necessary to explain
\SI{90}{\percent}, \SI{95}{\percent}, and \SI{99}{\percent}
of the variance.}
\label{table:pca-explained-variance}
\end{table}

The Pearson correlation coefficient was calculated for all pairs of features.
The closer two pixels are together, the more correlated the corresponding
features are, as one can see in~\cref{fig:feature-correlation}. The block-like
structure with a period of 32~features comes from the fact that the images
were flattened row by row for this visualization. The secondary diagonal shows
the correlation of each feature with the pixel directly below it in the image.
Those correlations are expected as symbols are written with continuous lines.
Less easy to explain are the correlations between high-index and low-index
features in the upper right corner of the correlation matrix. Those correspond
to correlations of features in the upper left corner of the image with
features in the lower right corner. One explanation is that symbols which have
a line in the upper left corner are likely $\blacksquare$.

\begin{figure}[h]
\centering
\includegraphics*[width=\linewidth, keepaspectratio]{figures/feature-correlation.pdf}
\caption{Correlation of all $32 \cdot 32 = 1024$ features. The diagonal
shows the correlation of a feature with itself.}
\label{fig:feature-correlation}
\end{figure}
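The numbers in~\cref{table:pca-explained-variance} can be reproduced with
scikit-learn. A sketch, assuming \verb+X+ is the $n \times 1024$ matrix of
flattened images (e.g.\ from \verb+load_hasy+ above):

\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, targets=(0.90, 0.95, 0.99)):
    """Return how many principal components are needed to reach
    each explained-variance target."""
    pca = PCA().fit(X.astype(np.float64))
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted finds the first index at which the cumulative
    # explained variance reaches the target
    return {t: int(np.searchsorted(cumulative, t)) + 1
            for t in targets}
\end{verbatim}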
\section{Classification Challenge}
\subsection{Evaluation}
\dbName{} defines 10~folds which should be used for calculating the accuracy
of any classifier being evaluated on \dbName{} as follows:

\begin{algorithm}[H]
\begin{algorithmic}
\Function{CrossValidation}{Folds $F$}
\State $D \gets \cup_{i=1}^{10} F_i$\Comment{Complete Dataset}
\For{($i=1$; $\;i \leq 10$; $\;i$++)}
\State $A \gets D \setminus F_i$\Comment{Train set}
\State $B \gets F_i$\Comment{Test set}
\State Train Classifier $C_i$ on $A$
\State Calculate accuracy $a_i$ of $C_i$ on $B$
\EndFor
\State \Return ($\frac{1}{10}\sum_{i=1}^{10} a_i$, $\min_i(a_i)$, $\max_i(a_i)$)
\EndFunction
\end{algorithmic}
\caption{Calculate the mean accuracy, the minimum accuracy, and the maximum
accuracy with 10-fold cross-validation.}
\label{alg:seq1}
\end{algorithm}
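In Python, this evaluation protocol corresponds to the following sketch. It
assumes the ten folds are given as \verb+(features, labels)+ pairs and that
\verb+make_classifier+ returns a fresh estimator with the scikit-learn
\verb+fit+/\verb+score+ interface:

\begin{verbatim}
import numpy as np

def cross_validation(folds, make_classifier):
    """10-fold cross-validation as defined above: returns the mean,
    minimum, and maximum test accuracy over the fixed folds."""
    accuracies = []
    for i in range(len(folds)):
        x_test, y_test = folds[i]                      # test fold
        rest = [f for j, f in enumerate(folds) if j != i]
        x_train = np.concatenate([x for x, _ in rest])  # train set
        y_train = np.concatenate([y for _, y in rest])
        clf = make_classifier()
        clf.fit(x_train, y_train)
        accuracies.append(clf.score(x_test, y_test))
    return np.mean(accuracies), min(accuracies), max(accuracies)
\end{verbatim}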
\subsection{Model Baselines}
Eight standard algorithms as well as a simple CNN (TF-CNN) were evaluated by
their accuracy on the raw image data. The neural networks were implemented
with TensorFlow~0.12.1~\cite{tensorflow2015-whitepaper}. All other algorithms
were implemented with sklearn~0.18.1~\cite{scikit-learn}.
\Cref{table:classifier-results} shows the results of the models trained and
tested on MNIST as well as on \dbNameVersion{}:

\begin{table}[h]
\centering
\begin{tabular}{lrrr}
\toprule
\multirow{2}{*}{Classifier} & \multicolumn{3}{c}{Test Accuracy} \\
& MNIST & HASY & min -- max\hphantom{00 } \\\midrule
TF-CNN & \SI{99.20}{\percent} & \SI{81.0}{\percent} & \SI{80.6}{\percent} -- \SI{81.5}{\percent}\\
Random Forest & \SI{96.41}{\percent} & \SI{62.4}{\percent} & \SI{62.1}{\percent} -- \SI{62.8}{\percent}\\
MLP (1 Layer) & \SI{89.09}{\percent} & \SI{62.2}{\percent} & \SI{61.7}{\percent} -- \SI{62.9}{\percent}\\
LDA & \SI{86.42}{\percent} & \SI{46.8}{\percent} & \SI{46.3}{\percent} -- \SI{47.7}{\percent}\\
$k$-NN ($k=3$) & \SI{92.84}{\percent} & \SI{28.4}{\percent} & \SI{27.4}{\percent} -- \SI{29.1}{\percent}\\
$k$-NN ($k=5$) & \SI{92.88}{\percent} & \SI{27.4}{\percent} & \SI{26.9}{\percent} -- \SI{28.3}{\percent}\\
QDA & \SI{55.61}{\percent} & \SI{25.4}{\percent} & \SI{24.9}{\percent} -- \SI{26.2}{\percent}\\
Decision Tree & \SI{65.40}{\percent} & \SI{11.0}{\percent} & \SI{10.4}{\percent} -- \SI{11.6}{\percent}\\
Naive Bayes & \SI{56.15}{\percent} & \SI{8.3}{\percent} & \SI{7.9}{\percent} -- \hphantom{0}\SI{8.7}{\percent}\\
AdaBoost & \SI{73.67}{\percent} & \SI{3.3}{\percent} & \SI{2.1}{\percent} -- \hphantom{0}\SI{3.9}{\percent}\\
\bottomrule
\end{tabular}
\caption{Classification results for the evaluated classifiers. The number of
test samples differs between the folds, but is $\num{16827} \pm 166$. The
decision tree was trained with a maximum depth of~5. The exact structure of
the CNNs is explained in~\cref{subsec:CNNs-Classification}. For $k$-nearest
neighbors, the number of samples per class had to be reduced to 50 for HASY
due to the extraordinarily long testing time of this algorithm.}
\label{table:classifier-results}
\end{table}

The following observations are noteworthy:
\begin{itemize}
\item All algorithms achieve much higher accuracy on MNIST than on
\dbNameVersion{}.
\item While a single decision tree performs much better on MNIST than
QDA, it is exactly the other way around for~\dbName{}. One possible
explanation is that MNIST has grayscale images while \dbName{} has
black and white images.
\end{itemize}


\subsection{Convolutional Neural Networks}\label{subsec:CNNs-Classification}
Convolutional Neural Networks (CNNs) are state of the art on several computer
vision benchmarks like MNIST~\cite{wan2013regularization}, CIFAR-10, CIFAR-100
and SVHN~\cite{huang2016densely}, ImageNet~2012~\cite{deep-residual-networks-2015},
and more. Experiments without preprocessing also showed that even the simplest
CNNs achieve much higher accuracy on \dbNameVersion{} than all other evaluated
classifiers (see~\cref{table:classifier-results}).

\Cref{table:cnn-results} shows the 10-fold cross-validation results for four
architectures.

\begin{table}[H]
\centering
\begin{tabular}{lrrrr}
\toprule
\multirow{2}{*}{Network} & \multirow{2}{*}{Parameters} & \multicolumn{2}{c}{Test Accuracy} & \multirow{2}{*}{Time} \\
& & mean & min -- max\hphantom{00 } & \\\midrule
2-layer & \num{3023537} & \SI{73.8}{\percent} & \SI{72.9}{\percent} -- \SI{74.3}{\percent} & \SI{1.5}{\second}\\
3-layer & \num{1530609} & \SI{78.4}{\percent} & \SI{77.6}{\percent} -- \SI{79.0}{\percent} & \SI{2.4}{\second}\\
4-layer & \num{848753} & \SI{80.5}{\percent} & \SI{79.2}{\percent} -- \SI{80.7}{\percent} & \SI{2.8}{\second}\\
TF-CNN & \num{4592369} & \SI{81.0}{\percent} & \SI{80.6}{\percent} -- \SI{81.5}{\percent} & \SI{2.9}{\second}\\
\bottomrule
\end{tabular}
\caption{Classification results for four CNN architectures. The test time is
the mean time needed to classify all test samples of one fold, averaged over
the ten folds.}
\label{table:cnn-results}
\end{table}

The following architectures were evaluated:
\begin{itemize}
\item 2-layer: A convolutional layer with 32~filters of size $3 \times 3 \times 1$
is followed by a $2 \times 2$ max pooling layer with stride~2. The output
layer is, as in all explored CNN architectures, a fully connected softmax
layer with 369~neurons.
\item 3-layer: Like the 2-layer CNN, but with another convolutional layer with
64~filters of size $3 \times 3 \times 32$ followed by a $2 \times 2$ max
pooling layer with stride~2 before the output layer.
\item 4-layer: Like the 3-layer CNN, but with another convolutional layer with
128~filters of size $3 \times 3 \times 64$ followed by a $2 \times 2$ max
pooling layer with stride~2 before the output layer.
\item TF-CNN: A convolutional layer with 32~filters of size $3 \times 3 \times 1$
is followed by a $2 \times 2$ max pooling layer with stride~2.
Another convolutional layer with 64~filters of size $3 \times 3 \times 32$
and a $2 \times 2$ max pooling layer with stride~2 follow. A fully
connected layer with 1024~units and tanh activation function, a
dropout layer with dropout probability 0.5 and the output softmax
layer are last. This network is described in~\cite{tf-mnist}; a sketch
of it is given after this list.
\end{itemize}
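The original TF-CNN was implemented with TensorFlow~0.12.1. The following is a
rough re-implementation with the modern Keras API; the padding scheme and the
ReLU activations of the convolutional layers are assumptions, as they are not
fixed by the description above.

\begin{verbatim}
import tensorflow as tf

def make_tf_cnn(num_classes=369):
    """Approximate Keras re-implementation of the TF-CNN baseline."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same",
                               activation="relu",
                               input_shape=(32, 32, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(64, 3, padding="same",
                               activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="tanh"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # ADAM was used for training all architectures (see below)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
\end{verbatim}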
For all architectures, ADAM~\cite{kingma2014adam} was used for training. The
combined training and testing time was always less than 6~hours for the
10-fold cross-validation on an Nvidia GeForce GTX Titan Black with CUDA~8 and
CuDNN~5.1.
\clearpage
\subsection{Class Difficulties}
The class-wise accuracy
\[\text{class-accuracy}(c) = \frac{\text{correctly predicted samples of class } c}{\text{total number of samples of class } c}\]
is used to estimate how difficult a class is.
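A sketch of this metric, assuming \verb+y_true+ and \verb+y_pred+ are NumPy
arrays of true and predicted labels:

\begin{verbatim}
import numpy as np

def class_accuracy(y_true, y_pred, c):
    """Fraction of the samples of class c predicted correctly."""
    mask = (y_true == c)
    return float(np.mean(y_pred[mask] == c))
\end{verbatim}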
A total of 32~classes were never classified correctly by TF-CNN and hence have
a class-accuracy of~0. They are shown in~\cref{table:hard-classes}. Some, like
\verb+\mathsection+ and \verb+\S+, are not distinguishable at all. Others, like
\verb+\Longrightarrow+ and \verb+\Rightarrow+, are only distinguishable in some
people's handwriting. Those classes account for \SI{2.8}{\percent} of the data.

\begin{table}[h]
\centering
\begin{tabular}{lcrlc}
\toprule
\LaTeX & Rendered & Total & Confused with & \\\midrule
\verb+\mid+ & $\mid$ & 34 & \verb+|+ & $|$ \\
\verb+\triangle+ & $\triangle$ & 32 & \verb+\Delta+ & $\Delta$ \\
\verb+\mathds{1}+ & $\mathds{1}$ & 32 & \verb+\mathbb{1}+ & \includegraphics{symbols/mathbb1.pdf} \\
\verb+\checked+ & {\mbox {\wasyfamily \char 8}} & 28 & \verb+\checkmark+ & $\checkmark$ \\
\verb+\shortrightarrow+ & $\shortrightarrow$ & 28 & \verb+\rightarrow+ & $\rightarrow$ \\
\verb+\Longrightarrow+ & $\Longrightarrow$ & 27 & \verb+\Rightarrow+ & $\Rightarrow$ \\
\verb+\backslash+ & $\backslash$ & 26 & \verb+\setminus+ & $\setminus$ \\
\verb+\O+ & \O & 24 & \verb+\emptyset+ & $\emptyset$ \\
\verb+\with+ & $\with$ & 21 & \verb+\&+ & $\&$ \\
\verb+\diameter+ & {\mbox {\wasyfamily \char 31}} & 20 & \verb+\emptyset+ & $\emptyset$ \\
\verb+\triangledown+ & $\triangledown$ & 20 & \verb+\nabla+ & $\nabla$ \\
\verb+\longmapsto+ & $\longmapsto$ & 19 & \verb+\mapsto+ & $\mapsto$ \\
\verb+\dotsc+ & $\dotsc$ & 15 & \verb+\dots+ & $\dots$ \\
\verb+\fullmoon+ & {\mbox {\wasyfamily \char 35}} & 15 & \verb+\circ+ & $\circ$ \\
\verb+\varpropto+ & $\varpropto$ & 14 & \verb+\propto+ & $\propto$ \\
\verb+\mathsection+ & $\mathsection$ & 13 & \verb+\S+ & $\S$ \\
\verb+\vartriangle+ & $\vartriangle$ & 12 & \verb+\Delta+ & $\Delta$ \\
\verb+O+ & $O$ & 9 & \verb+\circ+ & $\circ$ \\
\verb+o+ & $o$ & 7 & \verb+\circ+ & $\circ$ \\
\verb+c+ & $c$ & 7 & \verb+\subset+ & $\subset$ \\
\verb+v+ & $v$ & 7 & \verb+\vee+ & $\vee$ \\
\verb+x+ & $x$ & 7 & \verb+\times+ & $\times$ \\
\verb+\mathbb{Z}+ & $\mathbb{Z}$ & 7 & \verb+\mathds{Z}+ & $\mathds{Z}$ \\
\verb+T+ & $T$ & 6 & \verb+\top+ & $\top$ \\
\verb+V+ & $V$ & 6 & \verb+\vee+ & $\vee$ \\
\verb+g+ & $g$ & 6 & \verb+9+ & $9$ \\
\verb+l+ & $l$ & 6 & \verb+|+ & $|$ \\
\verb+s+ & $s$ & 6 & \verb+\mathcal{S}+ & $\mathcal{S}$ \\
\verb+z+ & $z$ & 6 & \verb+\mathcal{Z}+ & $\mathcal{Z}$ \\
\verb+\mathbb{R}+ & $\mathbb{R}$ & 6 & \verb+\mathds{R}+ & $\mathds{R}$ \\
\verb+\mathbb{Q}+ & $\mathbb{Q}$ & 6 & \verb+\mathds{Q}+ & $\mathds{Q}$ \\
\verb+\mathbb{N}+ & $\mathbb{N}$ & 6 & \verb+\mathds{N}+ & $\mathds{N}$ \\
\bottomrule
\end{tabular}
\caption{32~classes which were never classified correctly by the best CNN.}
\label{table:hard-classes}
\end{table}

In contrast, 21~classes have a class-wise accuracy of more than
\SI{99}{\percent} with TF-CNN (see~\cref{table:easy-classes}).

\begin{table}[h]
\centering
\begin{tabular}{lcr}
\toprule
\LaTeX & Rendered & Total\\\midrule
\verb+\forall + & $\forall $ & 214 \\
\verb+\sim + & $\sim $ & 201 \\
\verb+\nabla + & $\nabla $ & 122 \\
\verb+\cup + & $\cup $ & 93 \\
\verb+\neg + & $\neg $ & 85 \\
\verb+\setminus + & $\setminus $ & 52 \\
\verb+\supset + & $\supset $ & 42 \\
\verb+\vdots + & $\vdots $ & 41 \\
\verb+\boxtimes + & $\boxtimes $ & 22 \\
\verb+\nearrow + & $\nearrow $ & 21 \\
\verb+\uplus + & $\uplus $ & 19 \\
\verb+\nvDash + & $\nvDash $ & 15 \\
\verb+\AE + & \AE & 15 \\
\verb+\Vdash + & $\Vdash $ & 14 \\
\verb+\Leftarrow + & $\Leftarrow $ & 14 \\
\verb+\upharpoonright+ & $\upharpoonright$ & 14 \\
\verb+- + & $- $ & 12 \\
\verb+\guillemotleft + & $\guillemotleft $ & 11 \\
\verb+R + & $R $ & 9 \\
\verb+7 + & $7 $ & 8 \\
\verb+\blacktriangleright+ & $\blacktriangleright$ & 6 \\
\bottomrule
\end{tabular}
\caption{21~classes with a class-wise accuracy of more than \SI{99}{\percent}
with TF-CNN.}
\label{table:easy-classes}
\end{table}


\section{Verification Challenge}
In the setting of an on-line symbol recognizer like
\href{http://write-math.com}{write-math.com}, it is important to recognize
when the user enters a symbol which is not known to the classifier. One way to
achieve this is to train a binary classifier which recognizes whether two
recordings belong to the same symbol. This task is similar to face
verification: given two images of faces, decide whether they show the same
person.

For the verification challenge, a training-test split is given. The training
data contains images with their class labels. The test set contains
32~symbols which were not seen by the classifier before. The elements of the
test set are pairs of recorded handwritten symbols $(r_1, r_2)$. There are
three groups of tests:
\begin{enumerate}[label=V\arabic*]
\item $r_1$ and $r_2$ both belong to symbols which are in the training set,
\item $r_1$ belongs to a symbol in the training set, but $r_2$ might not,
\item neither $r_1$ nor $r_2$ belongs to a symbol in the training set.
\end{enumerate}

When evaluating models, the models may not take advantage of knowing whether a
recording $r_1$ or $r_2$ is an instance of a training symbol. For all test
sets, the following numbers should be reported: True Positives (TP),
True Negatives (TN), False Positives (FP), False Negatives (FN), and the
accuracy $= \frac{TP+TN}{TP+TN+FP+FN}$.


% \section{Open Questions}

% There are still a couple of open questions about \dbNameVersion:

% \begin{enumerate}
% \item What is the accuracy of human expert labelers?
% \item What is the variance between human experts labeling the samples?
% \end{enumerate}


\section{Acknowledgment}

I want to thank \enquote{Begabtenstiftung Informatik Karls\-ruhe}, the
Foundation for Gifted Informatics Students in Karlsruhe. Their support helped
me to write this work.