Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download

📚 The CoCalc Library - books, templates and other resources

132955 views
License: OTHER
1
2
# Advanced NumPy
3
4
5
6
```python
7
%matplotlib inline
8
```
9
10
## Quick example: gene expression, without numpy
11
12
FIRST...
13
14
## Even quicker: what is gene expression?
15
16
How genes work:
17
18
![central dogma](images/centraldogma.png)
19
20
What this means:
21
22
![gene expression](images/gene-expression.png)
23
24
How we measure it:
25
26
![rna sequence](images/rnaseq.png)
27
28
## Some fake data
29
30
| | Cell type A | Cell type B | Cell type C | Cell type D |
31
|--------|-------------|-------------|-------------|-------------|
32
| Gene 0 | 100 | 200 | 50 | 400 |
33
| Gene 1 | 50 | 0 | 0 | 100 |
34
| Gene 2 | 350 | 100 | 50 | 200 |
35
36
37
38
```python
39
gene0 = [100, 200, 50, 400]
40
gene1 = [50, 0, 0, 100]
41
gene2 = [350, 100, 50, 200]
42
expression_data = [gene0, gene1, gene2]
43
```
44
45
Why is this a bad idea?
46
47
## Now with NumPy
48
49
50
51
```python
52
import numpy as np
53
a = np.array(expression_data)
54
print(a)
55
```
56
57
```output
58
[[100 200 50 400]
59
[ 50 0 0 100]
60
[350 100 50 200]]
61
```
62
63
We are going to:
64
65
* Obtain an *RPKM* expression matrix
66
* Quantile normalize the data
67
68
using the awesome power of NumPy
69
70
## Inside a numpy ndarray
71
72
73
74
```python
75
def print_info(a):
76
print('number of elements:', a.size)
77
print('number of dimensions:', a.ndim)
78
print('shape:', a.shape)
79
print('data type:', a.dtype)
80
print('strides:', a.strides)
81
print('flags:', a.flags)
82
83
print_info(a)
84
```
85
86
```output
87
number of elements: 12
88
number of dimensions: 2
89
shape: (3, 4)
90
data type: int64
91
strides: (32, 8)
92
flags: C_CONTIGUOUS : True
93
F_CONTIGUOUS : False
94
OWNDATA : True
95
WRITEABLE : True
96
ALIGNED : True
97
UPDATEIFCOPY : False
98
```
99
100
101
102
```python
103
print(a.data)
104
```
105
106
```output
107
<memory at 0x7fc487c7f7e0>
108
```
109
110
111
112
```python
113
abytes = a.ravel().view(dtype=np.uint8)
114
```
115
116
117
118
```python
119
print_info(abytes)
120
```
121
122
```output
123
number of elements: 96
124
number of dimensions: 1
125
shape: (96,)
126
data type: uint8
127
strides: (1,)
128
flags: C_CONTIGUOUS : True
129
F_CONTIGUOUS : True
130
OWNDATA : False
131
WRITEABLE : True
132
ALIGNED : True
133
UPDATEIFCOPY : False
134
```
135
136
137
138
```python
139
print(abytes[:24])
140
```
141
142
```output
143
[100 0 0 0 0 0 0 0 200 0 0 0 0 0 0 0 50 0
144
0 0 0 0 0 0]
145
```
146
147
### Example: take the transpose of `a`
148
149
150
151
```python
152
print_info(a)
153
```
154
155
```output
156
number of elements: 12
157
number of dimensions: 2
158
shape: (3, 4)
159
data type: int64
160
strides: (32, 8)
161
flags: C_CONTIGUOUS : True
162
F_CONTIGUOUS : False
163
OWNDATA : True
164
WRITEABLE : True
165
ALIGNED : True
166
UPDATEIFCOPY : False
167
```
168
169
170
171
```python
172
print_info(a.T)
173
```
174
175
```output
176
number of elements: 12
177
number of dimensions: 2
178
shape: (4, 3)
179
data type: int64
180
strides: (8, 32)
181
flags: C_CONTIGUOUS : False
182
F_CONTIGUOUS : True
183
OWNDATA : False
184
WRITEABLE : True
185
ALIGNED : True
186
UPDATEIFCOPY : False
187
```
188
189
### Example: skipping rows and columns with *slicing*
190
191
192
193
```python
194
print_info(a.T)
195
```
196
197
```output
198
number of elements: 12
199
number of dimensions: 2
200
shape: (4, 3)
201
data type: int64
202
strides: (8, 32)
203
flags: C_CONTIGUOUS : False
204
F_CONTIGUOUS : True
205
OWNDATA : False
206
WRITEABLE : True
207
ALIGNED : True
208
UPDATEIFCOPY : False
209
```
210
211
212
213
```python
214
print_info(a.T[::2])
215
```
216
217
```output
218
number of elements: 6
219
number of dimensions: 2
220
shape: (2, 3)
221
data type: int64
222
strides: (16, 32)
223
flags: C_CONTIGUOUS : False
224
F_CONTIGUOUS : False
225
OWNDATA : False
226
WRITEABLE : True
227
ALIGNED : True
228
UPDATEIFCOPY : False
229
```
230
231
232
233
```python
234
print_info(a.T[::2, ::2])
235
```
236
237
```output
238
number of elements: 4
239
number of dimensions: 2
240
shape: (2, 2)
241
data type: int64
242
strides: (16, 64)
243
flags: C_CONTIGUOUS : False
244
F_CONTIGUOUS : False
245
OWNDATA : False
246
WRITEABLE : True
247
ALIGNED : True
248
UPDATEIFCOPY : False
249
```
250
251
### Getting a copy
252
253
254
255
```python
256
b = a
257
```
258
259
260
261
```python
262
print(b)
263
```
264
265
```output
266
[[100 200 50 400]
267
[ 50 0 0 100]
268
[350 100 50 200]]
269
```
270
271
272
273
```python
274
a[0, 0] = 5
275
print(b)
276
a[0, 0] = 100
277
```
278
279
```output
280
[[ 5 200 50 400]
281
[ 50 0 0 100]
282
[350 100 50 200]]
283
```
284
285
## Advanced operations: axis-wise evaluation
286
287
288
289
```python
290
expr = np.load('../../data/expr.npy')
291
```
292
293
294
295
```python
296
print_info(expr)
297
```
298
299
```output
300
number of elements: 7687500
301
number of dimensions: 2
302
shape: (20500, 375)
303
data type: uint32
304
strides: (4, 82000)
305
flags: C_CONTIGUOUS : False
306
F_CONTIGUOUS : True
307
OWNDATA : False
308
WRITEABLE : True
309
ALIGNED : True
310
UPDATEIFCOPY : False
311
```
312
313
This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the *library size*, which is the total number of reads across a column.
314
315
The `np.sum` function returns the sum of all the elements of an array. With the `axis` argument, you can take the sum *along the given axis*.
316
317
318
319
```python
320
lib_size = np.sum(expr, axis=0)
321
print(lib_size)
322
```
323
324
```output
325
[ 41539237 50824374 35747651 33532890 48162893 83478927 61652756
326
56043182 50066035 44291909 62103960 52177691 39935196 44712296
327
49499950 54994779 64874312 41901128 61124396 74947233 55805778
328
44751480 57419892 45460279 58590618 75900141 79022684 51253368
329
63148322 64114541 53696942 50946576 55709927 60184139 73887491
330
51905092 39721522 42746126 43019952 40435776 57789966 30526018
331
50488171 59600438 49802751 103219262 80445758 76730579 59996715
332
70611975 53788477 57967905 50450027 61725636 52939093 48355220
333
60814140 61419740 52865931 42718020 47981997 50989994 55265609
334
57268101 37174525 65469319 69459796 64633280 52550614 52800398
335
67603908 42568697 54328614 58801154 41408022 55721943 56051351
336
48906539 30074621 56514800 61685565 45155573 55190918 58239673
337
51484868 56456500 58897501 41200475 60510647 45186472 48863100
338
44723293 66857959 43679229 63454731 40459554 44210528 71649897
339
51052494 55893243 23880954 37930574 50358271 67084571 55954779
340
60533989 37534918 58268372 69034028 62187943 31471960 60211327
341
69060349 52800469 54351395 48929065 55783836 57805383 45280443
342
64365686 51738764 45503686 20805588 6231205 58912016 58382737
343
55586370 47844506 88199645 45202180 57012734 35536183 66413476
344
58255198 67400957 63388779 66920588 89665770 54467447 53674778
345
60489113 64271298 54482571 57945567 58481866 30468523 46520205
346
43720270 20860933 62291312 76469475 58273889 62758553 69907767
347
46323880 58981816 57005722 63065832 55354924 53218346 48332445
348
50847760 45896941 50855098 66765635 53976120 43076562 64591516
349
70321285 43164571 19972560 60273622 48626910 62192138 56940941
350
54212832 47915415 60829977 64420702 47877190 21861554 42351337
351
56078062 58080584 65786195 69711708 35936611 61273004 40872795
352
42664688 51610919 42895450 65840053 46914753 52775804 48278133
353
63447113 46209355 46190947 55448740 40136288 61288618 55819950
354
61874813 51645078 52880150 56670662 41643222 54351160 38218219
355
20136141 48512044 54430430 40095409 37135634 47338323 47628500
356
70593750 61983198 33989258 62762404 32312231 54883550 58804271
357
47618042 43385504 53262000 51775467 77796207 60667212 58663579
358
48561056 37776295 41013665 61718896 49617862 46246248 53495544
359
59868475 65570441 57271936 33651719 42949256 56326986 46188444
360
65338751 63280380 72026286 45558125 65005023 37707710 36690847
361
36791707 47339679 40636949 47695871 59847402 65956305 59465169
362
70009556 58111425 60158442 54800853 54455853 68528086 44581634
363
45475847 66490471 42291085 35965990 41479598 58860157 36363286
364
43083690 59348176 58854511 41230056 49617871 43539028 65823859
365
44602672 50578213 57054596 45886960 66240091 66165612 32463206
366
47864659 52971575 52220505 25097393 70434480 32039712 32819373
367
58584329 45039848 59657923 49474656 63730127 65560607 47372745
368
58633031 63905299 54165552 47486106 52398513 45153699 58159323
369
19581961 66938365 70099565 62639887 34726837 63115934 69049347
370
66656207 44194958 43770898 63897405 32572188 21661735 40549383
371
37659411 45886025 64801505 21061525 59949050 31486861 62661761
372
49585558 61242121 69353546 60024113 58447218 61229433 44275809
373
64901114 37277569 36078563 44834451 62844105 53060365 66203234
374
63269672 52694324 61885104 51421981 59063240 43349098 48320365
375
89431418 40367656 61502892 64771848 36282255 45373426 66499413
376
56130605 40281506 45002236 55357976 61842699 30509789 43349038
377
59621075 69058739 67386247 55079081 28732898 56053773 60685111
378
39380186 45394727 59700010 54182166]
379
```
380
381
### Exercise
382
383
Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.
384
385
## Advanced operations: broadcasting
386
387
In order to normalize every column by its corresponding library size, we have to *align* the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use `np.newaxis` to match the dimensions.
388
389
390
391
```python
392
print(expr.shape)
393
print(lib_size.shape)
394
print(lib_size[np.newaxis, :].shape)
395
```
396
397
```output
398
(20500, 375)
399
(375,)
400
(1, 375)
401
```
402
403
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
404
405
406
407
```python
408
np.all(expr / lib_size ==
409
expr / lib_size[np.newaxis, :])
410
```
411
412
413
414
415
```output
416
True
417
```
418
419
420
421
422
423
```python
424
expr_lib = expr / lib_size
425
```
426
427
We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).
428
429
430
431
```python
432
expr_lib *= 1e6
433
```
434
435
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
436
437
438
439
```python
440
gene_len = np.load('../../data/gene-lens.npy')
441
print(gene_len.shape)
442
```
443
444
```output
445
(20500,)
446
```
447
448
### Exercise: broadcast `expr_lib` and `gene_len` together to produce RPKM
449
450
451
452
```python
453
rpkm = expr_lib # FIX THIS
454
```
455
456
457
458
```python
459
import matplotlib.pyplot as plt
460
from scipy import stats
461
462
def plot_col_density(data, xlim=None, *args, **kwargs):
463
# Use gaussian smoothing to estimate the density
464
density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]
465
if xlim is not None:
466
m, M = xlim
467
else:
468
m, M = np.min(data), np.max(data)
469
x = np.linspace(m, M, 100)
470
471
fig, ax = plt.subplots()
472
for density in density_per_col:
473
ax.plot(x, density(x), *args, **kwargs)
474
ax.set_xlabel('log-counts')
475
ax.set_ylabel('frequency')
476
if xlim is not None:
477
ax.set_xlim(xlim)
478
plt.show()
479
480
```
481
482
483
484
```python
485
%matplotlib inline
486
```
487
488
489
490
```python
491
plt.style.use('ggplot')
492
```
493
494
495
496
```python
497
plot_col_density(np.log(expr+1)[:, :20])
498
```
499
500
501
![png](1_advanced_numpy_files/1_advanced_numpy_43_0.png)
502
503
504
505
506
```python
507
plot_col_density(np.log(rpkm + 1)[:, :20], xlim=(0, 250))
508
```
509
510
511
![png](1_advanced_numpy_files/1_advanced_numpy_44_0.png)
512
513
514
### Exercise: 3D broadcasting
515
516
Below, produce the array containing the sum of every element in `x` with every element in `y`
517
518
519
520
```python
521
x = np.random.rand(3, 5)
522
y = np.random.randint(10, size=8)
523
z = x # FIX THIS
524
```
525
526
## Fancy indexing
527
528
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
529
530
531
532
```python
533
values = np.array([0, 5, 99])
534
selector = np.random.randint(0, 3, size=(3, 4))
535
print(selector)
536
print(values[selector])
537
```
538
539
```output
540
[[2 1 1 2]
541
[0 1 0 1]
542
[2 2 1 2]]
543
[[99 5 5 99]
544
[ 0 5 0 5]
545
[99 99 5 99]]
546
```
547
548
### Exercise: quantile normalization
549
550
Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.
551
552
*Hint: look for documentation for `scipy.mstats.rankdata`, `np.sort`, and `np.argsort`.*
553
554
555
556
```python
557
def qnorm(x):
558
"""Quantile normalize an input matrix.
559
560
Parameters
561
----------
562
x : 2D array of float, shape (M, N)
563
The input data, with each column being a
564
distribution to normalize.
565
566
Returns
567
-------
568
xn : 2D array of float, shape (M, N)
569
The normalized data.
570
"""
571
xn = np.copy(x) # replace this by normalizing code
572
return xn
573
```
574
575
576
577
```python
578
logexpr = np.log(expr + 1)
579
logrpkm = np.log(rpkm + 1)
580
```
581
582
583
584
```python
585
logexprn = qnorm(logexpr)
586
logrpkmn = qnorm(logrpkm)
587
```
588
589
590
591
```python
592
plot_col_density(logexprn[:, :20])
593
```
594
595
596
![png](1_advanced_numpy_files/1_advanced_numpy_53_0.png)
597
598
599
600
601
```python
602
plot_col_density(logrpkmn[:, :20], xlim=(0, 0.25))
603
```
604
605
606
![png](1_advanced_numpy_files/1_advanced_numpy_54_0.png)
607
608
609