\def\TITLE{Introduction to parallel GP programming}
\input parimacro.tex

% START TYPESET
\begintitle
\vskip 2.5truecm
\centerline{\mine Introduction}
\vskip 1.truecm
\centerline{\mine to}
\vskip 1.truecm
\centerline{\mine parallel GP}
\vskip 1.truecm
\centerline{\sectiontitlebf (version \vers)}
\vskip 1.truecm
\authors
\endtitle

\copyrightpage
\tableofcontents
\openin\std=parallel.aux
\ifeof\std
\else
\input parallel.aux
\fi
\chapno=0

\chapter{Parallel GP interface}

\section{Configuration}

This booklet documents the parallel GP interface. The first chapter describes
configuration and basic concepts, the second one explains how to write GP
code suitable for parallel execution. Two multithread interfaces
are supported:

\item POSIX threads

\item Message passing interface (MPI)

POSIX threads are well-suited for single systems, for instance a personal
computer equipped with a multi-core processor, while MPI is used by most
clusters. However the parallel GP interface does not depend on the
multithread interface: a properly written GP program will work identically
with both. The interfaces are mutually exclusive and one must be chosen at
configuration time when installing the PARI/GP suite.

\subsec{POSIX threads}

POSIX threads are selected by passing the flag \kbd{--mt=pthread} to
\kbd{Configure}. The required library (\kbd{libpthread}) and header files
(\kbd{pthread.h}) are installed by default on most Linux systems. This option
implies \kbd{--enable-tls}, which builds a thread-safe version of the PARI
library. This unfortunately makes the dynamically linked \kbd{gp-dyn} binary
about $25\%$ slower; since \kbd{gp-sta} is only $5\%$ slower, you definitely
want to use the latter binary.
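
For instance, on a Unix-like system the whole build might look as follows
(a minimal sketch; adapt the make target to your setup):
\bprog
./Configure --mt=pthread
make gp
@eprog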

You may also want to pass the flag \kbd{--time=ftime} to \kbd{Configure}
so that \kbd{gettime} and the GP timer report real time instead of cumulative
CPU time. An alternative is to use \kbd{getwalltime} in your GP scripts
instead of \kbd{gettime}.
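
For instance, to measure the wall-clock time spent by a parallel computation,
you may write something along these lines (a small sketch using only built-in
functions, so nothing needs to be exported):
\bprog
? t = getwalltime(); parapply(factor, vector(50, i, 2^i+1)); getwalltime() - t
@eprog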

You can test whether parallel GP support is working with

\bprog
make test-parallel
@eprog

\subsec{Message Passing Interface}

Configuring MPI is somewhat more difficult, but your MPI installation should
include a script \kbd{mpicc} that takes care of the necessary configuration.
If you have a choice between several MPI implementations, choose OpenMPI.

To configure PARI/GP for MPI, use
\bprog
env CC=mpicc ./Configure --mt=mpi
@eprog

To run a program, you must use the launcher provided by your implementation,
usually \kbd{mpiexec} or \kbd{mpirun}. For instance, to run
the program \kbd{prog.gp} on $10$ nodes, you would use
\bprog
mpirun -np 10 gp prog.gp
@eprog\noindent (or \kbd{mpiexec} instead of \kbd{mpirun}). PARI
requires at least $3$ MPI nodes to work properly. \emph{Caveats:}
\kbd{mpirun} is not suited for interactive use because it does not provide
tty emulation. Moreover, it is not currently possible to interrupt parallel
tasks.

You can test parallel GP support (here using 3 nodes) with

\bprog
make test-parallel RUNTEST="mpirun -np 3"
@eprog

\section{Concept}

In a GP program, the \emph{main thread} executes instructions sequentially
(one after the other), and GP provides functions that execute subtasks in
\emph{secondary threads} in parallel (at the same time). Those functions all
share the prefix \kbd{par}, e.g., \kbd{parfor}, a parallel version of the
sequential \kbd{for}-loop construct.

The subtasks are subject to a stringent limitation: the parallel code
must be free of side effects. For instance, it cannot set or modify a global
variable; in fact, it may not even \emph{access} global variables or
local variables declared with \kbd{local()}.

Due to the overhead of parallelism, we recommend splitting the computation so
that each parallel computation requires at least a few seconds. On the other
hand, it is generally more efficient to split the computation into many small
chunks rather than a few large ones.

\subsec{Resources}

The number of secondary threads to use is controlled by
\kbd{default(nbthreads)}. The default value of \kbd{nbthreads} depends on the
multithread interface.

\item POSIX threads: the number of CPU threads, i.e., the number of CPU cores
multiplied by the hyperthreading factor. The default can be freely modified.

\item MPI: the number of available process slots minus $1$ (one slot is used by
the master thread), as configured with \kbd{mpirun} (or \kbd{mpiexec}). E.g.,
\kbd{nbthreads} is $9$ after \kbd{mpirun -np 10 gp}.
It is possible to change the default to a lower value, but increasing it will
not work: MPI does not allow spawning new threads at run time. PARI requires
at least $3$ MPI nodes to work properly.

The PARI stack size in secondary threads is controlled by
\kbd{default(threadsize)}, so the total memory allocated is equal to
$\kbd{parisize}+\kbd{nbthreads}\times\kbd{threadsize}$. By default,
$\kbd{threadsize}=\kbd{parisize}$. Setting the \kbd{threadsizemax} default
allows \kbd{threadsize} to grow as needed up to that limit, analogously to
the behaviour of \kbd{parisize} / \kbd{parisizemax}. We strongly recommend
setting this default, since it is very hard to control in advance the amount
of memory threads will require: too small a stack size will result in a
stack overflow, aborting the computation, while too large a value is
very wasteful (since the extra reserved but unneeded memory is multiplied by
the number of threads).
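
For instance, the following settings (the values are arbitrary, chosen for
illustration only) could go in your \kbd{gprc} or at the top of a script:
\bprog
? default(nbthreads, 8)        \\ use 8 secondary threads (pthread interface)
? default(parisizemax, "2G")   \\ main stack may grow up to 2 GB
? default(threadsizemax, "1G") \\ each thread stack may grow up to 1 GB
@eprog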

\subsec{GP functions}

GP provides the following functions for parallel operations; please
see the documentation of each function for details. A short usage sketch
follows the list.

\item \tet{parvector}: parallel version of \kbd{vector};

\item \tet{parapply}: parallel version of \kbd{apply};

\item \tet{parsum}: parallel version of \kbd{sum};

\item \tet{parselect}: parallel version of \kbd{select};

\item \tet{pareval}: evaluate a vector of closures in parallel;

\item \tet{parfor}: parallel version of \kbd{for};

\item \tet{parforprime}: parallel version of \kbd{forprime};

\item \tet{parforvec}: parallel version of \kbd{forvec};

\item \tet{parploth}: parallel version of \kbd{ploth}.
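
As a quick illustration (a sketch; remember that the expressions evaluated in
secondary threads must be free of side effects), here is how a few of these
functions are called:
\bprog
? parapply(factor, [2^256 + 1, 2^193 - 1]) \\ factor both integers in parallel
? parsum(n = 1, 10^5, eulerphi(n))         \\ parallel summation
? pareval([() -> zeta(2), () -> zeta(3)])  \\ evaluate a vector of closures
@eprog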

\subsec{PARI functions}
The low-level \kbd{libpari} interface for parallelism is documented
in the \emph{Developer's guide to the PARI library}.
\newpage

\chapter{Writing code suitable for parallel execution}

\section{Exporting global variables}

When parallel execution encounters a global variable, say $V$,
an error such as the following is reported:
\bprog
*** parapply: mt: please use export(V)
@eprog\noindent A global variable is not visible in the parallel execution
unless it is explicitly exported. This may occur in a number of contexts.

\subsec{Example 1: data}

\bprog
? V = [2^256 + 1, 2^193 - 1];
? parvector(#V, i, factor(V[i]))
*** parvector: mt: please use export(V).
@eprog\noindent The problem is fixed as follows:
\bprog
? V = [2^256 + 1, 2^193 - 1];
? export(V)
? parvector(#V, i, factor(V[i]))
@eprog\noindent The following short form is also available, with different
semantics:
\bprog
? export(V = [2^256 + 1, 2^193 - 1]);
? parvector(#V, i, factor(V[i]))
@eprog\noindent In the latter case the variable $V$ does not exist in the
main thread, only in parallel threads.

\subsec{Example 2: polynomial variable}

\bprog
? f(n) = bnfinit(x^n-2).no;
? parapply(f, [1..50])
*** parapply: mt: please use export(x).
@eprog\noindent You may fix this as in the first example using \kbd{export},
but here there is a more natural solution: use the polynomial indeterminate
\kbd{'x} instead of the global variable \kbd{x} (whose value is \kbd{'x} on
startup, but may no longer be \kbd{'x} at this point):

\bprog
? f(n) = bnfinit('x^n-2).no;
@eprog\noindent or alternatively
\bprog
? f(n) = my(x='x); bnfinit(x^n-2).no;
@eprog
which is more readable if the same polynomial variable is used several times
in the function.

\subsec{Example 3: function}

\bprog
? f(a) = bnfinit('x^8-a).no;
? g(a,b) = parsum(i = a, b, f(i));
? g(37,48)
*** parsum: mt: please use export(f).
? export(f)
? g(37,48)
%4 = 81
@eprog\noindent Note that \kbd{export}$(v)$ freezes the value of $v$
for parallel execution at the time of the export: you may certainly modify
its value later in the main thread but you need to re-export $v$ if you want
the new value to be used in parallel threads. You may export more than one
variable at once, e.g., \kbd{export}$(a,b,c)$ is accepted. You may also
export \emph{all} variables with dynamic scope (all global variables
and all variables declared with \kbd{local}) using \kbd{exportall()}.
Although convenient, this may be wasteful if most variables are not meant to
be used from parallel threads. We recommend the following:

\item use \kbd{exportall} in the \kbd{gp} interpreter interactively, while
developing code;

\item \kbd{export} a function meant to be called from parallel
threads, just after its definition;

\item use $v = \var{value}$; \kbd{export}$(v)$ when the value
is needed both in the main thread and in secondary threads;

\item use \kbd{export}($v = \var{value}$) when the value is
not needed in the main thread.

\noindent In the two latter forms, $v$ should be considered read-only. It is
actually read-only in secondary threads: trying to change it there raises an
exception:
\bprog
*** mt: attempt to change exported variable 'v'.
@eprog\noindent You \emph{can} modify it in the main thread, but it must be
exported again so that the new value is accessible to secondary threads:
barring a new \kbd{export}, secondary threads continue to access the old value.
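
To illustrate (a small sketch; the names \kbd{a} and \kbd{f} are arbitrary),
assume an exported function depends on an exported global parameter:
\bprog
? a = 1; export(a);
? f(x) = x + a; export(f);
? parvector(3, i, f(i))   \\ secondary threads see a = 1
? a = 10;                 \\ visible in the main thread only
? export(a);              \\ now secondary threads see a = 10 as well
? parvector(3, i, f(i))
@eprog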

\section{Input and output} If your parallel code needs to write data to
files, split the output into as many files as there are parallel
computations, to avoid concurrent writes to the same file, which carry a high
risk of data corruption. For example a parallel version of
\bprog
? f(a) = write("bnf",bnfinit('x^8-a));
? for (a = 37, 48, f(a))
@eprog\noindent could be
\bprog
? f(a) = write(Str("bnf-",a), bnfinit('x^8-a).no);
? export(f);
? parfor(i = 37, 48, f(i))
@eprog\noindent which creates the files \kbd{bnf-37} to \kbd{bnf-48}. Of
course you may want to group these files in a subdirectory, which must be
created first.
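
For instance, the subdirectory can be created from within GP before starting
the loop; a sketch, assuming a Unix-like system and an arbitrary directory
name \kbd{bnf}:
\bprog
? system("mkdir -p bnf")
? f(a) = write(Str("bnf/bnf-", a), bnfinit('x^8-a).no);
? export(f);
? parfor(i = 37, 48, f(i))
@eprog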

\section{Using \kbd{parfor} and \kbd{parforprime}}
\kbd{parfor} and \kbd{parforprime} are the most powerful of all parallel GP
functions but, since they have a different interface from \kbd{for} or
\kbd{forprime}, sequential code needs to be adapted. Consider the example
\bprog
for(i = a, b,
  my(c = f(i));
  g(i,c));
@eprog\noindent

where \kbd{f} is a function without side effects. This can be run in parallel
as follows:
\bprog
parfor(i = a, b,
  f(i),
  c, /*@Ccom the value of \kbd{f(i)} is assigned to \kbd{c} */
  g(i,c));
@eprog\noindent For each $i$, $a \leq i \leq b$, in random order,
this construction assigns \kbd{f(i)} to the (lexically scoped, as
per~\kbd{my}) variable $c$, then calls \kbd{g}$(i,c)$. Only the function
\kbd{f} is evaluated in parallel (in secondary threads), while the function
\kbd{g} is evaluated sequentially (in the main thread). Writing \kbd{c = f(i)}
in the parallel section of the code would not work, since the main thread
would then know nothing about $c$ or its content. The only data sent from the
main thread to secondary threads are \kbd{f} and the index $i$, and only $c$
(which is equal to $f(i)$) is returned from the secondary thread to the main
thread.

The following function returns the index of the first component of a vector
$V$ satisfying a predicate, or $0$ if none satisfies it:
\bprog
parfirst(pred, V) =
{
  parfor(i = 1, #V,
    pred(V[i]),
    cond,
    if (cond, return(i)));
  return(0);
}
@eprog\noindent This works because, if the second expression
in \kbd{parfor} exits the loop via \kbd{break} / \kbd{return} at index~$i$,
it is guaranteed that all indices $< i$ are also evaluated and that the exit
is triggered by the smallest such index. See \kbd{??parfor} for
details.

The following function is similar to \kbd{parsum}:
\bprog
myparsum(a, b, expr) =
{ my(s = 0);

  parfor(i = a, b,
    expr(i),
    val,
    s += val);
  return(s);
}
@eprog
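
\kbd{parforprime} follows the same pattern, iterating over primes instead of
consecutive integers. As a sketch modeled on \kbd{myparsum}, the following
function counts the primes in $[a,b]$ satisfying a side-effect free
predicate \kbd{pred}:
\bprog
myparcount(pred, a, b) =
{ my(n = 0);

  parforprime(p = a, b,
    pred(p),       \\ evaluated in secondary threads
    ok,
    if (ok, n++)); \\ evaluated sequentially in the main thread
  return(n);
}
? myparcount(p -> isprime(2^p - 1), 2, 2000) \\ Mersenne prime exponents
@eprog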

\section{Sizing parallel tasks}

Dispatching tasks to parallel threads takes time. To limit overhead, split
the computation so that each parallel task requires at least a few
seconds. Consider the following sequential example:
\bprog
? thuemorse(n) = (-1)^n * hammingweight(n);
? s = 0; for(n = 1, 2*10^6, s += thuemorse(n) / n * 1.)
cpu time = 1,064 ms, real time = 1,064 ms.
@eprog\noindent
It is natural to try
\bprog
? export(thuemorse);
? s = 0; parfor(n = 1, 2*10^6, thuemorse(n) / n * 1., S, s += S)
cpu time = 5,637 ms, real time = 3,216 ms.
@eprog\noindent
However, due to the overhead, this is not faster than the sequential
version; in fact it is much \emph{slower}. To limit overhead, we group
the summation by blocks:
\bprog
? s = 0; parfor(N = 1, 20, sum(n = 1+(N-1)*10^5, N*10^5,
                               thuemorse(n) / n*1.), S, s += S)
cpu time = 1,997 ms, real time = 186 ms.
@eprog\noindent
Try to create at least as many groups as the number of available threads,
to take full advantage of parallelism. Since some of the floating point
additions are performed in random order (those within a given block occur
successively, in deterministic order), some of the results may differ
slightly from one run to the next.

Of course, in this example, \kbd{parsum} is much easier to use since it will
group the summations for you behind the scenes:
\bprog
? parsum(n = 1, 2*10^6, thuemorse(n) / n * 1.)
cpu time = 1,905 ms, real time = 177 ms.
@eprog

\section{Load balancing}

If the parallel tasks require varying time to complete, it is preferable to
perform the slower ones first, when there are more tasks than available
parallel threads. Instead of
\bprog
parvector(36, i, bnfinit('x^i - 2).no)
@eprog\noindent doing
\bprog
parvector(36, i, bnfinit('x^(37-i) - 2).no)
@eprog\noindent will be faster if you have fewer than $36$ threads.
Indeed, \kbd{parvector} schedules tasks by increasing $i$ values, and the
computation time increases steeply with $i$. With $18$ threads, say:

\item in the first form, thread~1 handles both $i = 1$ and $i = 19$,
while thread~18 will likely handle $i = 18$ and $i = 36$. In fact, it is
likely that the first batch of tasks $i\leq 18$ runs relatively quickly,
but that none of the threads handling a value $i > 18$ (second task) will
have time to complete before the task $i = 18$ does. When that thread finishes
$i = 18$, it will pick the remaining task $i = 36$.

\item in the second form, thread~1 will likely handle
only $i = 36$: tasks $i = 36, 35, \dots, 19$ go to the available
18 threads, and $i = 36$ is likely to finish last, when
$i = 18,\dots, 2$ are already assigned to the other 17 threads.
Since the small values of $i$ will finish almost instantly, $i = 1$ will have
been allocated before the initial thread handling $i = 36$ becomes ready again.

Load distribution is clearly more favourable in the second form.

\vfill\eject
\input index\end