\def\TITLE{Introduction to parallel GP programming}
\input parimacro.tex

% START TYPESET
\begintitle
\vskip 2.5truecm
\centerline{\mine Introduction}
\vskip 1.truecm
\centerline{\mine to}
\vskip 1.truecm
\centerline{\mine parallel GP}
\vskip 1.truecm
\centerline{\sectiontitlebf (version \vers)}
\vskip 1.truecm
\authors
\endtitle

\copyrightpage
\tableofcontents
\openin\std=parallel.aux
\ifeof\std
\else
\input parallel.aux
\fi
\chapno=0

\chapter{Parallel GP interface}

\section{Configuration}

This booklet documents the parallel GP interface. The first chapter describes
configuration and basic concepts, the second one explains how to write GP
code suitable for parallel execution. Two multithread interfaces
are supported:

\item POSIX threads

\item Message passing interface (MPI)

POSIX threads are well-suited for single systems, for instance a personal
computer equipped with a multi-core processor, while MPI is used by most
clusters. However the parallel GP interface does not depend on the
multithread interface: a properly written GP program will work identically
with both. The interfaces are mutually exclusive and one must be chosen at
configuration time when installing the PARI/GP suite.

\subsec{POSIX threads}

POSIX threads are selected by passing the flag \kbd{--mt=pthread} to
\kbd{Configure}. The required library (\kbd{libpthread}) and header files
(\kbd{pthread.h}) are installed by default on most Linux systems. This option
implies \kbd{--enable-tls}, which builds a thread-safe version of the PARI
library. This unfortunately makes the dynamically linked \kbd{gp-dyn} binary
about $25\%$ slower; since \kbd{gp-sta} is only $5\%$ slower, you definitely
want to use the latter binary.
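
For instance, on a Unix-like system the whole build might look as follows
(a minimal sketch; adapt the make target to your setup):
\bprog
./Configure --mt=pthread
make gp
@eprog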

You may also want to pass the flag \kbd{--time=ftime} to \kbd{Configure}
so that \kbd{gettime} and the GP timer report real time instead of cumulative
CPU time. An alternative is to use \kbd{getwalltime} in your GP scripts
instead of \kbd{gettime}.
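
For instance, to measure the wall-clock time spent by a parallel computation,
you may write something along these lines (a small sketch using only built-in
functions, so nothing needs to be exported):
\bprog
? t = getwalltime(); parapply(factor, vector(50, i, 2^i+1)); getwalltime() - t
@eprog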

You can test whether parallel GP support is working with

\bprog
make test-parallel
@eprog

\subsec{Message Passing Interface}

Configuring MPI is somewhat more difficult, but your MPI installation should
include a script \kbd{mpicc} that takes care of the necessary configuration.
If you have a choice between several MPI implementations, choose OpenMPI.

To configure PARI/GP for MPI, use
\bprog
env CC=mpicc ./Configure --mt=mpi
@eprog

To run a program, you must use the launcher provided by your implementation,
usually \kbd{mpiexec} or \kbd{mpirun}. For instance, to run
the program \kbd{prog.gp} on $10$ nodes, you would use
\bprog
mpirun -np 10 gp prog.gp
@eprog\noindent (or \kbd{mpiexec} instead of \kbd{mpirun}). PARI
requires at least $3$ MPI nodes to work properly. \emph{Caveats:}
\kbd{mpirun} is not suited for interactive use because it does not provide
tty emulation. Moreover, it is not currently possible to interrupt parallel
tasks.

You can test parallel GP support (here using 3 nodes) with

\bprog
make test-parallel RUNTEST="mpirun -np 3"
@eprog

\section{Concept}

In a GP program, the \emph{main thread} executes instructions sequentially
(one after the other), and GP provides functions that execute subtasks in
\emph{secondary threads} in parallel (at the same time). Those functions all
share the prefix \kbd{par}, e.g., \kbd{parfor}, a parallel version of the
sequential \kbd{for}-loop construct.

The subtasks are subject to a stringent limitation: the parallel code
must be free of side effects. For instance, it cannot set or modify a global
variable; in fact, it may not even \emph{access} global variables or
local variables declared with \kbd{local()}.

Due to the overhead of parallelism, we recommend splitting the computation so
that each parallel computation requires at least a few seconds. On the other
hand, it is generally more efficient to split the computation into many small
chunks rather than a few large ones.

\subsec{Resources}

The number of secondary threads to use is controlled by
\kbd{default(nbthreads)}. The default value of \kbd{nbthreads} depends on the
multithread interface.

\item POSIX threads: the number of CPU threads, i.e., the number of CPU cores
multiplied by the hyperthreading factor. The default can be freely modified.

\item MPI: the number of available process slots minus $1$ (one slot is used by
the master thread), as configured with \kbd{mpirun} (or \kbd{mpiexec}). E.g.,
\kbd{nbthreads} is $9$ after \kbd{mpirun -np 10 gp}.
It is possible to change the default to a lower value, but increasing it will
not work: MPI does not allow spawning new threads at run time. PARI requires
at least $3$ MPI nodes to work properly.

The PARI stack size in secondary threads is controlled by
\kbd{default(threadsize)}, so the total memory allocated is equal to
$\kbd{parisize}+\kbd{nbthreads}\times\kbd{threadsize}$. By default,
$\kbd{threadsize}=\kbd{parisize}$. Setting the \kbd{threadsizemax} default
allows \kbd{threadsize} to grow as needed up to that limit, analogously to
the behaviour of \kbd{parisize} / \kbd{parisizemax}. We strongly recommend
setting this default, since it is very hard to control in advance the amount
of memory threads will require: too small a stack size will result in a
stack overflow, aborting the computation, while too large a value is
very wasteful (since the extra reserved but unneeded memory is multiplied by
the number of threads).
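
For instance, the following settings (the values are arbitrary, chosen for
illustration only) could go in your \kbd{gprc} or at the top of a script:
\bprog
? default(nbthreads, 8)        \\ use 8 secondary threads (pthread interface)
? default(parisizemax, "2G")   \\ main stack may grow up to 2 GB
? default(threadsizemax, "1G") \\ each thread stack may grow up to 1 GB
@eprog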

\subsec{GP functions}

GP provides the following functions for parallel operations; please
see the documentation of each function for details. A short usage sketch
follows the list.

\item \tet{parvector}: parallel version of \kbd{vector};

\item \tet{parapply}: parallel version of \kbd{apply};

\item \tet{parsum}: parallel version of \kbd{sum};

\item \tet{parselect}: parallel version of \kbd{select};

\item \tet{pareval}: evaluate a vector of closures in parallel;

\item \tet{parfor}: parallel version of \kbd{for};

\item \tet{parforprime}: parallel version of \kbd{forprime};

\item \tet{parforvec}: parallel version of \kbd{forvec};

\item \tet{parploth}: parallel version of \kbd{ploth}.
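
As a quick illustration (a sketch; remember that the expressions evaluated in
secondary threads must be free of side effects), here is how a few of these
functions are called:
\bprog
? parapply(factor, [2^256 + 1, 2^193 - 1]) \\ factor both integers in parallel
? parsum(n = 1, 10^5, eulerphi(n))         \\ parallel summation
? pareval([() -> zeta(2), () -> zeta(3)])  \\ evaluate a vector of closures
@eprog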

\subsec{PARI functions}
The low-level \kbd{libpari} interface for parallelism is documented
in the \emph{Developer's guide to the PARI library}.
\newpage

\chapter{Writing code suitable for parallel execution}

\section{Exporting global variables}

When parallel execution encounters a global variable, say $V$,
an error such as the following is reported:
\bprog
*** parapply: mt: please use export(V)
@eprog\noindent A global variable is not visible in the parallel execution
unless it is explicitly exported. This may occur in a number of contexts.

\subsec{Example 1: data}

\bprog
? V = [2^256 + 1, 2^193 - 1];
? parvector(#V, i, factor(V[i]))
*** parvector: mt: please use export(V).
@eprog\noindent The problem is fixed as follows:
\bprog
? V = [2^256 + 1, 2^193 - 1];
? export(V)
? parvector(#V, i, factor(V[i]))
@eprog\noindent The following short form is also available, with different
semantics:
\bprog
? export(V = [2^256 + 1, 2^193 - 1]);
? parvector(#V, i, factor(V[i]))
@eprog\noindent In the latter case the variable $V$ does not exist in the
main thread, only in parallel threads.

\subsec{Example 2: polynomial variable}

\bprog
? f(n) = bnfinit(x^n-2).no;
? parapply(f, [1..50])
*** parapply: mt: please use export(x).
@eprog\noindent You may fix this as in the first example using \kbd{export},
but here there is a more natural solution: use the polynomial indeterminate
\kbd{'x} instead of the global variable \kbd{x} (whose value is \kbd{'x} on
startup, but may no longer be \kbd{'x} at this point):

\bprog
? f(n) = bnfinit('x^n-2).no;
@eprog\noindent or alternatively
\bprog
? f(n) = my(x='x); bnfinit(x^n-2).no;
@eprog
which is more readable if the same polynomial variable is used several times
in the function.

\subsec{Example 3: function}

\bprog
? f(a) = bnfinit('x^8-a).no;
? g(a,b) = parsum(i = a, b, f(i));
? g(37,48)
*** parsum: mt: please use export(f).
? export(f)
? g(37,48)
%4 = 81
@eprog\noindent Note that \kbd{export}$(v)$ freezes the value of $v$
for parallel execution at the time of the export: you may certainly modify
its value later in the main thread but you need to re-export $v$ if you want
the new value to be used in parallel threads. You may export more than one
variable at once, e.g., \kbd{export}$(a,b,c)$ is accepted. You may also
export \emph{all} variables with dynamic scope (all global variables
and all variables declared with \kbd{local}) using \kbd{exportall()}.
Although convenient, this may be wasteful if most variables are not meant to
be used from parallel threads. We recommend the following:

\item use \kbd{exportall} in the \kbd{gp} interpreter interactively, while
developing code;

\item \kbd{export} a function meant to be called from parallel
threads, just after its definition;

\item use $v = \var{value}$; \kbd{export}$(v)$ when the value
is needed both in the main thread and in secondary threads;

\item use \kbd{export}($v = \var{value}$) when the value is
not needed in the main thread.

\noindent In the two latter forms, $v$ should be considered read-only. It is
actually read-only in secondary threads: trying to change it there raises an
exception:
\bprog
*** mt: attempt to change exported variable 'v'.
@eprog\noindent You \emph{can} modify it in the main thread, but it must be
exported again so that the new value is accessible to secondary threads:
barring a new \kbd{export}, secondary threads continue to access the old value.
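
To illustrate (a small sketch; the names \kbd{a} and \kbd{f} are arbitrary),
assume an exported function depends on an exported global parameter:
\bprog
? a = 1; export(a);
? f(x) = x + a; export(f);
? parvector(3, i, f(i))   \\ secondary threads see a = 1
? a = 10;                 \\ visible in the main thread only
? export(a);              \\ now secondary threads see a = 10 as well
? parvector(3, i, f(i))
@eprog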

\section{Input and output} If your parallel code needs to write data to
files, split the output into as many files as there are parallel
computations, to avoid concurrent writes to the same file, which carry a high
risk of data corruption. For example a parallel version of
\bprog
? f(a) = write("bnf",bnfinit('x^8-a));
? for (a = 37, 48, f(a))
@eprog\noindent could be
\bprog
? f(a) = write(Str("bnf-",a), bnfinit('x^8-a).no);
? export(f);
? parfor(i = 37, 48, f(i))
@eprog\noindent which creates the files \kbd{bnf-37} to \kbd{bnf-48}. Of
course you may want to group these files in a subdirectory, which must be
created first.
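
For instance, the subdirectory can be created from within GP before starting
the loop; a sketch, assuming a Unix-like system and an arbitrary directory
name \kbd{bnf}:
\bprog
? system("mkdir -p bnf")
? f(a) = write(Str("bnf/bnf-", a), bnfinit('x^8-a).no);
? export(f);
? parfor(i = 37, 48, f(i))
@eprog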

\section{Using \kbd{parfor} and \kbd{parforprime}}
\kbd{parfor} and \kbd{parforprime} are the most powerful of all parallel GP
functions but, since they have a different interface from \kbd{for} or
\kbd{forprime}, sequential code needs to be adapted. Consider the example
\bprog
for(i = a, b,
  my(c = f(i));
  g(i,c));
@eprog\noindent

where \kbd{f} is a function without side effects. This can be run in parallel
as follows:
\bprog
parfor(i = a, b,
  f(i),
  c, /*@Ccom the value of \kbd{f(i)} is assigned to \kbd{c} */
  g(i,c));
@eprog\noindent For each $i$, $a \leq i \leq b$, in random order,
this construction assigns \kbd{f(i)} to the (lexically scoped, as
per~\kbd{my}) variable $c$, then calls \kbd{g}$(i,c)$. Only the function
\kbd{f} is evaluated in parallel (in secondary threads), while the function
\kbd{g} is evaluated sequentially (in the main thread). Writing \kbd{c = f(i)}
in the parallel section of the code would not work, since the main thread
would then know nothing about $c$ or its content. The only data sent from the
main thread to secondary threads are \kbd{f} and the index $i$, and only $c$
(which is equal to $f(i)$) is returned from the secondary thread to the main
thread.

The following function returns the index of the first component of a vector
$V$ satisfying a predicate, or $0$ if none satisfies it:
\bprog
parfirst(pred, V) =
{
  parfor(i = 1, #V,
    pred(V[i]),
    cond,
    if (cond, return(i)));
  return(0);
}
@eprog\noindent This works because, if the second expression
in \kbd{parfor} exits the loop via \kbd{break} / \kbd{return} at index~$i$,
it is guaranteed that all indices $< i$ are also evaluated and that the exit
is triggered by the smallest such index. See \kbd{??parfor} for
details.

The following function is similar to \kbd{parsum}:
\bprog
myparsum(a, b, expr) =
{ my(s = 0);

  parfor(i = a, b,
    expr(i),
    val,
    s += val);
  return(s);
}
@eprog
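
\kbd{parforprime} follows the same pattern, iterating over primes instead of
consecutive integers. As a sketch modeled on \kbd{myparsum}, the following
function counts the primes in $[a,b]$ satisfying a side-effect free
predicate \kbd{pred}:
\bprog
myparcount(pred, a, b) =
{ my(n = 0);

  parforprime(p = a, b,
    pred(p),       \\ evaluated in secondary threads
    ok,
    if (ok, n++)); \\ evaluated sequentially in the main thread
  return(n);
}
? myparcount(p -> isprime(2^p - 1), 2, 2000) \\ Mersenne prime exponents
@eprog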

\section{Sizing parallel tasks}

Dispatching tasks to parallel threads takes time. To limit overhead, split
the computation so that each parallel task requires at least a few
seconds. Consider the following sequential example:
\bprog
? thuemorse(n) = (-1)^n * hammingweight(n);
? s = 0; for(n = 1, 2*10^6, s += thuemorse(n) / n * 1.)
cpu time = 1,064 ms, real time = 1,064 ms.
@eprog\noindent
It is natural to try
\bprog
? export(thuemorse);
? s = 0; parfor(n = 1, 2*10^6, thuemorse(n) / n * 1., S, s += S)
cpu time = 5,637 ms, real time = 3,216 ms.
@eprog\noindent
However, due to the overhead, this is not faster than the sequential
version; in fact it is much \emph{slower}. To limit overhead, we group
the summation by blocks:
\bprog
? s = 0; parfor(N = 1, 20, sum(n = 1+(N-1)*10^5, N*10^5,
                               thuemorse(n) / n*1.), S, s += S)
cpu time = 1,997 ms, real time = 186 ms.
@eprog\noindent
Try to create at least as many groups as the number of available threads,
to take full advantage of parallelism. Since some of the floating point
additions are performed in random order (those within a given block occur
successively, in deterministic order), some of the results may differ
slightly from one run to the next.

Of course, in this example, \kbd{parsum} is much easier to use since it will
group the summations for you behind the scenes:
\bprog
? parsum(n = 1, 2*10^6, thuemorse(n) / n * 1.)
cpu time = 1,905 ms, real time = 177 ms.
@eprog

\section{Load balancing}

If the parallel tasks require varying time to complete, it is preferable to
perform the slower ones first, when there are more tasks than available
parallel threads. Instead of
\bprog
parvector(36, i, bnfinit('x^i - 2).no)
@eprog\noindent doing
\bprog
parvector(36, i, bnfinit('x^(37-i) - 2).no)
@eprog\noindent will be faster if you have fewer than $36$ threads.
Indeed, \kbd{parvector} schedules tasks by increasing $i$ values, and the
computation time increases steeply with $i$. With $18$ threads, say:

\item in the first form, thread~1 handles both $i = 1$ and $i = 19$,
while thread~18 will likely handle $i = 18$ and $i = 36$. In fact, it is
likely that the first batch of tasks $i\leq 18$ runs relatively quickly,
but that none of the threads handling a value $i > 18$ (second task) will
have time to complete before the task $i = 18$ does. When that thread finishes
$i = 18$, it will pick the remaining task $i = 36$.

\item in the second form, thread~1 will likely handle
only $i = 36$: tasks $i = 36, 35, \dots, 19$ go to the available
18 threads, and $i = 36$ is likely to finish last, when
$i = 18,\dots, 2$ are already assigned to the other 17 threads.
Since the small values of $i$ will finish almost instantly, $i = 1$ will have
been allocated before the initial thread handling $i = 36$ becomes ready again.

Load distribution is clearly more favourable in the second form.

\vfill\eject
\input index\end