\def\TITLE{Introduction to parallel GP programming}
\input parimacro.tex

% START TYPESET
\begintitle
\vskip 2.5truecm
\centerline{\mine Introduction}
\vskip 1.truecm
\centerline{\mine to}
\vskip 1.truecm
\centerline{\mine parallel GP}
\vskip 1.truecm
\centerline{\sectiontitlebf (version \vers)}
\vskip 1.truecm
\authors
\endtitle

\copyrightpage
\tableofcontents
\openin\std=parallel.aux
\ifeof\std
\else
\input parallel.aux
\fi
\chapno=0

\chapter{Parallel GP interface}

\section{Configuration}

This booklet documents the parallel GP interface. The first chapter describes
configuration and basic concepts; the second one explains how to write GP
code suitable for parallel execution. Two multithread interfaces
are supported:

\item POSIX threads

\item Message Passing Interface (MPI)

POSIX threads are well-suited for single systems, for instance a personal
computer equipped with a multi-core processor, while MPI is used by most
clusters. However, the parallel GP interface does not depend on the
multithread interface: a properly written GP program will work identically
with both. The interfaces are mutually exclusive and one must be chosen at
configuration time when installing the PARI/GP suite.

\subsec{POSIX threads}

POSIX threads are selected by passing the flag \kbd{--mt=pthread} to
\kbd{Configure}. The required library (\kbd{libpthread}) and header files
(\kbd{pthread.h}) are installed by default on most Linux systems. This option
implies \kbd{--enable-tls}, which builds a thread-safe version of the PARI
library. This unfortunately makes the dynamically linked \kbd{gp-dyn} binary
about $25\%$ slower; since \kbd{gp-sta} is only $5\%$ slower, you definitely
want to use the latter binary.

You may also want to pass the flag \kbd{--time=ftime} to \kbd{Configure}
so that \kbd{gettime} and the GP timer report real time instead of cumulated
CPU time.
An alternative is to use \kbd{getwalltime} in your GP scripts
instead of \kbd{gettime}.

You can test whether parallel GP support is working with

\bprog
make test-parallel
@eprog

\subsec{Message Passing Interface}

Configuring MPI is somewhat more difficult, but your MPI installation should
include a script \kbd{mpicc} that takes care of the necessary configuration.
If you have a choice between several MPI implementations, choose OpenMPI.

To configure PARI/GP for MPI, use
\bprog
env CC=mpicc ./Configure --mt=mpi
@eprog

To run a program, you must use the launcher provided by your implementation,
usually \kbd{mpiexec} or \kbd{mpirun}. For instance, to run
the program \kbd{prog.gp} on $10$ nodes, you would use
\bprog
mpirun -np 10 gp prog.gp
@eprog\noindent (or \kbd{mpiexec} instead of \kbd{mpirun}). PARI
requires at least $3$ MPI nodes to work properly. \emph{Caveats:}
\kbd{mpirun} is not suited for interactive use because it does not provide
tty emulation. Moreover, it is not currently possible to interrupt parallel
tasks.

You can test parallel GP support (here using 3 nodes) with

\bprog
make test-parallel RUNTEST="mpirun -np 3"
@eprog

\section{Concept}

In a GP program, the \emph{main thread} executes instructions sequentially
(one after the other), and GP provides functions that execute subtasks in
\emph{secondary threads} in parallel (at the same time). Those functions all
share the prefix \kbd{par}, e.g., \kbd{parfor}, a parallel version of the
sequential \kbd{for}-loop construct.

The subtasks are subject to a stringent limitation: the parallel code
must be free of side effects. For instance, it cannot set or modify a global
variable. In fact, it may not even \emph{access} global variables or
local variables declared with \kbd{local()}.

Due to the overhead of parallelism, we recommend splitting the computation so
that each parallel computation requires at least a few seconds.
On the other
hand, it is generally more efficient to split the computation into small
chunks rather than large ones.

\subsec{Resources}

The number of secondary threads to use is controlled by
\kbd{default(nbthreads)}. The default value of \kbd{nbthreads} depends on the
multithread interface.

\item POSIX threads: the number of CPU threads, i.e., the number of CPU cores
multiplied by the hyperthreading factor. The default can be freely modified.

\item MPI: the number of available process slots minus $1$ (one slot is used by
the master thread), as configured with \kbd{mpirun} (or \kbd{mpiexec}). E.g.,
\kbd{nbthreads} is $9$ after \kbd{mpirun -np 10 gp}.
It is possible to change the default to a lower value, but increasing it will
not work: MPI does not allow spawning new threads at run time. PARI requires
at least $3$ MPI nodes to work properly.

The PARI stack size in secondary threads is controlled by
\kbd{default(threadsize)}, so the total memory allocated is equal to
$\kbd{parisize}+\kbd{nbthreads}\times\kbd{threadsize}$. By default,
$\kbd{threadsize}=\kbd{parisize}$. Setting the \kbd{threadsizemax} default
allows \kbd{threadsize} to grow as needed up to that limit, analogously to
the behaviour of \kbd{parisize} / \kbd{parisizemax}.
We strongly recommend
setting this parameter, since it is very hard to control in advance the amount
of memory threads will require: too small a stack size will result in a
stack overflow, aborting the computation, while too large a value is
very wasteful (the extra reserved but unneeded memory is multiplied by
the number of threads).

\subsec{GP functions}

GP provides the following functions for parallel operations; please
see the documentation of each function for details:

\item \tet{parvector}: parallel version of \kbd{vector};

\item \tet{parapply}: parallel version of \kbd{apply};

\item \tet{parsum}: parallel version of \kbd{sum};

\item \tet{parselect}: parallel version of \kbd{select};

\item \tet{pareval}: evaluate a vector of closures in parallel;

\item \tet{parfor}: parallel version of \kbd{for};

\item \tet{parforprime}: parallel version of \kbd{forprime};

\item \tet{parforvec}: parallel version of \kbd{forvec};

\item \tet{parploth}: parallel version of \kbd{ploth}.

\subsec{PARI functions}
The low-level \kbd{libpari} interface for parallelism is documented
in the \emph{Developer's guide to the PARI library}.
\newpage

\chapter{Writing code suitable for parallel execution}

\section{Exporting global variables}

When parallel execution encounters a global variable, say $V$,
an error such as the following is reported:
\bprog
*** parapply: mt: please use export(V)
@eprog\noindent A global variable is not visible in the parallel execution
unless it is explicitly exported. This may occur in a number of contexts.

\subsec{Example 1: data}

\bprog
? V = [2^256 + 1, 2^193 - 1];
? parvector(#V, i, factor(V[i]))
*** parvector: mt: please use export(V).
@eprog\noindent The problem is fixed as follows:
\bprog
? V = [2^256 + 1, 2^193 - 1];
? export(V)
? parvector(#V, i, factor(V[i]))
@eprog\noindent The following short form is also available, with a different
semantic:
\bprog
? export(V = [2^256 + 1, 2^193 - 1]);
? parvector(#V, i, factor(V[i]))
@eprog\noindent In the latter case the variable $V$ does not exist in the
main thread, only in parallel threads.

\subsec{Example 2: polynomial variable}

\bprog
? f(n) = bnfinit(x^n-2).no;
? parapply(f, [1..50])
*** parapply: mt: please use export(x).
@eprog\noindent You may fix this as in the first example using \kbd{export},
but here there is a more natural solution: use the polynomial indeterminate
\kbd{'x} instead of the global variable \kbd{x} (whose value is \kbd{'x} on
startup, but may no longer be \kbd{'x} at this point):

\bprog
? f(n) = bnfinit('x^n-2).no;
@eprog\noindent or alternatively
\bprog
? f(n) = my(x='x); bnfinit(x^n-2).no;
@eprog
which is more readable if the same polynomial variable is used several times
in the function.

\subsec{Example 3: function}

\bprog
? f(a) = bnfinit('x^8-a).no;
? g(a,b) = parsum(i = a, b, f(i));
? g(37,48)
*** parsum: mt: please use export(f).
? export(f)
? g(37,48)
%4 = 81
@eprog\noindent Note that \kbd{export}$(v)$ freezes the value of $v$
for parallel execution at the time of the export: you may certainly modify
its value later in the main thread, but you need to re-export $v$ if you want
the new value to be used in parallel threads. You may export more than one
variable at once, e.g., \kbd{export}$(a,b,c)$ is accepted. You may also
export \emph{all} variables with dynamic scope (all global variables
and all variables declared with \kbd{local}) using \kbd{exportall()}.
Although convenient, this may be wasteful if most variables are not meant to
be used from parallel threads.
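The freezing behaviour can be seen in a short session; the sketch below uses
\kbd{pareval} with a closure reading the exported variable (the history
numbers are illustrative):
\bprog
? a = 1; export(a);
? pareval([() -> a])
%2 = [1]
? a = 2;  \\ modified in the main thread only
? pareval([() -> a])
%4 = [1]  \\ threads still see the value frozen at export time
? export(a); pareval([() -> a])
%5 = [2]
@eprog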
We recommend to

\item use \kbd{exportall} in the \kbd{gp} interpreter interactively, while
developing code;

\item \kbd{export} a function meant to be called from parallel
threads, just after its definition;

\item use $v = \var{value}$; \kbd{export}$(v)$ when the value
is needed both in the main thread and in secondary threads;

\item use \kbd{export}($v = \var{value}$) when the value is
not needed in the main thread.

\noindent In the two latter forms, $v$ should be considered read-only. It is
actually read-only in secondary threads; trying to change it there will raise
an exception:
\bprog
*** mt: attempt to change exported variable 'v'.
@eprog\noindent You \emph{can} modify it in the main thread, but it must be
exported again so that the new value is accessible to secondary threads:
barring a new \kbd{export}, secondary threads continue to access the old value.

\section{Input and output} If your parallel code needs to write data to
files, split the output into as many files as there are parallel
computations, to avoid concurrent writes to the same file, with its high risk
of data corruption. For example, a parallel version of
\bprog
? f(a) = write("bnf", bnfinit('x^8-a).no);
? for (a = 37, 48, f(a))
@eprog\noindent could be
\bprog
? f(a) = write(Str("bnf-", a), bnfinit('x^8-a).no);
? export(f);
? parfor(i = 37, 48, f(i))
@eprog\noindent which creates the files \kbd{bnf-37} to \kbd{bnf-48}. Of
course you may want to group these files in a subdirectory, which must be
created first.

\section{Using \kbd{parfor} and \kbd{parforprime}}
\kbd{parfor} and \kbd{parforprime} are the most powerful of all parallel GP
functions but, since they have a different interface from \kbd{for} and
\kbd{forprime}, sequential code needs to be adapted. Consider the example
\bprog
for(i = a, b,
  my(c = f(i));
  g(i,c));
@eprog\noindent
where \kbd{f} is a function without side effects.
This can be run in parallel
as follows:
\bprog
parfor(i = a, b,
  f(i),
  c, /*@Ccom the value of \kbd{f(i)} is assigned to \kbd{c} */
  g(i,c));
@eprog\noindent For each $i$, $a \leq i \leq b$, in random order,
this construction assigns \kbd{f(i)} to the (lexically scoped, as
per~\kbd{my}) variable $c$, then calls \kbd{g}$(i,c)$. Only the function
\kbd{f} is evaluated in parallel (in secondary threads); the function \kbd{g}
is evaluated sequentially (in the main thread). Writing \kbd{c = f(i)} in the
parallel section of the code would not work, since the main thread would then
know nothing about $c$ or its content. The only data sent from the main
thread to secondary threads are $f$ and the index $i$, and only $c$ (which
is equal to $f(i)$) is returned from the secondary thread to the main thread.

The following function finds the index of the first component of a vector
$V$ satisfying a predicate, and returns $0$ if none satisfies it:
\bprog
parfirst(pred, V) =
{
  parfor(i = 1, #V,
    pred(V[i]),
    cond,
    if (cond, return(i)));
  return(0);
}
@eprog\noindent This works because, if the second expression
in \kbd{parfor} exits the loop via \kbd{break} / \kbd{return} at index~$i$,
it is guaranteed that all indices $< i$ are also evaluated and the one with
the smallest index is the one that triggers the exit. See \kbd{??parfor} for
details.

The following function is similar to \kbd{parsum}:
\bprog
myparsum(a, b, expr) =
{ my(s = 0);

  parfor(i = a, b,
    expr(i),
    val,
    s += val);
  return(s);
}
@eprog

\section{Sizing parallel tasks}

Dispatching tasks to parallel threads takes time. To limit overhead, split
the computation so that each parallel task requires at least a few
seconds. Consider the following sequential example:
\bprog
? thuemorse(n) = (-1)^n * hammingweight(n);
? s = 0; for(n = 1, 2*10^6, s += thuemorse(n) / n * 1.)
cpu time = 1,064 ms, real time = 1,064 ms.
@eprog\noindent
It is natural to try
\bprog
? export(thuemorse);
? s = 0; parfor(n = 1, 2*10^6, thuemorse(n) / n * 1., S, s += S)
cpu time = 5,637 ms, real time = 3,216 ms.
@eprog\noindent
However, due to the overhead, this is not faster than the sequential
version; in fact, it is much \emph{slower}. To limit overhead, we group
the summation into blocks:
\bprog
? s = 0; parfor(N = 1, 20,
    sum(n = 1 + (N-1)*10^5, N*10^5, thuemorse(n) / n * 1.),
    S, s += S)
cpu time = 1,997 ms, real time = 186 ms.
@eprog\noindent
Try to create at least as many groups as the number of available threads,
to take full advantage of parallelism. Since some of the floating point
additions are done in random order (the ones in a given block occur
successively, in deterministic order), it is possible that some of the
results will differ slightly from one run to the next.

Of course, in this example, \kbd{parsum} is much easier to use since it will
group the summations for you behind the scenes:
\bprog
? parsum(n = 1, 2*10^6, thuemorse(n) / n * 1.)
cpu time = 1,905 ms, real time = 177 ms.
@eprog

\section{Load balancing}

If the parallel tasks require varying time to complete, it is preferable to
perform the slower ones first, when there are more tasks than available
parallel threads. Instead of
\bprog
parvector(36, i, bnfinit('x^i - 2).no)
@eprog\noindent doing
\bprog
parvector(36, i, bnfinit('x^(37-i) - 2).no)
@eprog\noindent will be faster if you have fewer than $36$ threads.
Indeed, \kbd{parvector} schedules tasks by increasing $i$ values, and the
computation time increases steeply with $i$. With $18$ threads, say:

\item in the first form, thread~1 handles both $i = 1$ and $i = 19$,
while thread~18 will likely handle $i = 18$ and $i = 36$.
In fact, it is
likely that the first batch of tasks $i\leq 18$ runs relatively quickly,
but that none of the threads handling a value $i > 18$ (their second task)
will have time to complete before $i = 18$ does. When that thread finishes
$i = 18$, it will pick up the remaining task $i = 36$.

\item in the second form, thread~1 will likely handle
only $i = 36$: tasks $i = 36, 35, \dots, 19$ go to the available
18 threads, and $i = 36$ is likely to finish last, when
$i = 18,\dots, 2$ are already assigned to the other 17 threads.
Since the small values of $i$ will finish almost instantly, $i = 1$ will have
been allocated before the initial thread handling $i = 36$ becomes ready again.

Load distribution is clearly more favourable in the second form.

\vfill\eject
\input index\end