CoCalc -- 1_advanced

📚 The CoCalc Library - books, templates and other resources
cocalc-examples / scikit-image-tutorials / _sources / lessons / 1_advanced_numpy.txt
¹³²⁹⁶⁴ views
License: OTHER
1

2
# Advanced NumPy
3

4

5

6
```python
7
%matplotlib inline
8
```
9

10
## Quick example: gene expression, without numpy
11

12
FIRST...
13

14
## Even quicker: what is gene expression?
15

16
How genes work:
17

18
![central dogma](images/centraldogma.png)
19

20
What this means:
21

22
![gene expression](images/gene-expression.png)
23

24
How we measure it:
25

26
![rna sequence](images/rnaseq.png)
27

28
## Some fake data
29

30
|        | Cell type A | Cell type B | Cell type C | Cell type D |
31
|--------|-------------|-------------|-------------|-------------|
32
| Gene 0 | 100         | 200         | 50          | 400         |
33
| Gene 1 | 50          | 0           | 0           | 100         |
34
| Gene 2 | 350         | 100         | 50          | 200         |
35

36

37

38
```python
39
gene0 = [100, 200, 50, 400]
40
gene1 = [50, 0, 0, 100]
41
gene2 = [350, 100, 50, 200]
42
expression_data = [gene0, gene1, gene2]
43
```
44

45
Why is this a bad idea?
46

47
## Now with NumPy
48

49

50

51
```python
52
import numpy as np
53
a = np.array(expression_data)
54
print(a)
55
```
56

57
```output
58
[[100 200  50 400]
59
 [ 50   0   0 100]
60
 [350 100  50 200]]
61
```
62

63
We are going to:
64

65
* Obtain an *RPKM* expression matrix
66
* Quantile normalize the data
67

68
using the awesome power of NumPy
69

70
## Inside a numpy ndarray
71

72

73

74
```python
75
def print_info(a):
76
    print('number of elements:', a.size)
77
    print('number of dimensions:', a.ndim)
78
    print('shape:', a.shape)
79
    print('data type:', a.dtype)
80
    print('strides:', a.strides)
81
    print('flags:', a.flags)
82
    
83
print_info(a)
84
```
85

86
```output
87
number of elements: 12
88
number of dimensions: 2
89
shape: (3, 4)
90
data type: int64
91
strides: (32, 8)
92
flags:   C_CONTIGUOUS : True
93
  F_CONTIGUOUS : False
94
  OWNDATA : True
95
  WRITEABLE : True
96
  ALIGNED : True
97
  UPDATEIFCOPY : False
98
```
99

100

101

102
```python
103
print(a.data)
104
```
105

106
```output
107
<memory at 0x7fc487c7f7e0>
108
```
109

110

111

112
```python
113
abytes = a.ravel().view(dtype=np.uint8)
114
```
115

116

117

118
```python
119
print_info(abytes)
120
```
121

122
```output
123
number of elements: 96
124
number of dimensions: 1
125
shape: (96,)
126
data type: uint8
127
strides: (1,)
128
flags:   C_CONTIGUOUS : True
129
  F_CONTIGUOUS : True
130
  OWNDATA : False
131
  WRITEABLE : True
132
  ALIGNED : True
133
  UPDATEIFCOPY : False
134
```
135

136

137

138
```python
139
print(abytes[:24])
140
```
141

142
```output
143
[100   0   0   0   0   0   0   0 200   0   0   0   0   0   0   0  50   0
144
   0   0   0   0   0   0]
145
```
146

147
### Example: take the transpose of `a`
148

149

150

151
```python
152
print_info(a)
153
```
154

155
```output
156
number of elements: 12
157
number of dimensions: 2
158
shape: (3, 4)
159
data type: int64
160
strides: (32, 8)
161
flags:   C_CONTIGUOUS : True
162
  F_CONTIGUOUS : False
163
  OWNDATA : True
164
  WRITEABLE : True
165
  ALIGNED : True
166
  UPDATEIFCOPY : False
167
```
168

169

170

171
```python
172
print_info(a.T)
173
```
174

175
```output
176
number of elements: 12
177
number of dimensions: 2
178
shape: (4, 3)
179
data type: int64
180
strides: (8, 32)
181
flags:   C_CONTIGUOUS : False
182
  F_CONTIGUOUS : True
183
  OWNDATA : False
184
  WRITEABLE : True
185
  ALIGNED : True
186
  UPDATEIFCOPY : False
187
```
188

189
### Example: skipping rows and columns with *slicing*
190

191

192

193
```python
194
print_info(a.T)
195
```
196

197
```output
198
number of elements: 12
199
number of dimensions: 2
200
shape: (4, 3)
201
data type: int64
202
strides: (8, 32)
203
flags:   C_CONTIGUOUS : False
204
  F_CONTIGUOUS : True
205
  OWNDATA : False
206
  WRITEABLE : True
207
  ALIGNED : True
208
  UPDATEIFCOPY : False
209
```
210

211

212

213
```python
214
print_info(a.T[::2])
215
```
216

217
```output
218
number of elements: 6
219
number of dimensions: 2
220
shape: (2, 3)
221
data type: int64
222
strides: (16, 32)
223
flags:   C_CONTIGUOUS : False
224
  F_CONTIGUOUS : False
225
  OWNDATA : False
226
  WRITEABLE : True
227
  ALIGNED : True
228
  UPDATEIFCOPY : False
229
```
230

231

232

233
```python
234
print_info(a.T[::2, ::2])
235
```
236

237
```output
238
number of elements: 4
239
number of dimensions: 2
240
shape: (2, 2)
241
data type: int64
242
strides: (16, 64)
243
flags:   C_CONTIGUOUS : False
244
  F_CONTIGUOUS : False
245
  OWNDATA : False
246
  WRITEABLE : True
247
  ALIGNED : True
248
  UPDATEIFCOPY : False
249
```
250

251
### Getting a copy
252

253

254

255
```python
256
b = a
257
```
258

259

260

261
```python
262
print(b)
263
```
264

265
```output
266
[[100 200  50 400]
267
 [ 50   0   0 100]
268
 [350 100  50 200]]
269
```
270

271

272

273
```python
274
a[0, 0] = 5
275
print(b)
276
a[0, 0] = 100
277
```
278

279
```output
280
[[  5 200  50 400]
281
 [ 50   0   0 100]
282
 [350 100  50 200]]
283
```
284

285
## Advanced operations: axis-wise evaluation
286

287

288

289
```python
290
expr = np.load('../../data/expr.npy')
291
```
292

293

294

295
```python
296
print_info(expr)
297
```
298

299
```output
300
number of elements: 7687500
301
number of dimensions: 2
302
shape: (20500, 375)
303
data type: uint32
304
strides: (4, 82000)
305
flags:   C_CONTIGUOUS : False
306
  F_CONTIGUOUS : True
307
  OWNDATA : False
308
  WRITEABLE : True
309
  ALIGNED : True
310
  UPDATEIFCOPY : False
311
```
312

313
This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the *library size*, which is the total number of reads across a column.
314

315
The `np.sum` function returns the sum of all the elements of an array. With the `axis` argument, you can take the sum *along the given axis*.
316

317

318

319
```python
320
lib_size = np.sum(expr, axis=0)
321
print(lib_size)
322
```
323

324
```output
325
[ 41539237  50824374  35747651  33532890  48162893  83478927  61652756
326
  56043182  50066035  44291909  62103960  52177691  39935196  44712296
327
  49499950  54994779  64874312  41901128  61124396  74947233  55805778
328
  44751480  57419892  45460279  58590618  75900141  79022684  51253368
329
  63148322  64114541  53696942  50946576  55709927  60184139  73887491
330
  51905092  39721522  42746126  43019952  40435776  57789966  30526018
331
  50488171  59600438  49802751 103219262  80445758  76730579  59996715
332
  70611975  53788477  57967905  50450027  61725636  52939093  48355220
333
  60814140  61419740  52865931  42718020  47981997  50989994  55265609
334
  57268101  37174525  65469319  69459796  64633280  52550614  52800398
335
  67603908  42568697  54328614  58801154  41408022  55721943  56051351
336
  48906539  30074621  56514800  61685565  45155573  55190918  58239673
337
  51484868  56456500  58897501  41200475  60510647  45186472  48863100
338
  44723293  66857959  43679229  63454731  40459554  44210528  71649897
339
  51052494  55893243  23880954  37930574  50358271  67084571  55954779
340
  60533989  37534918  58268372  69034028  62187943  31471960  60211327
341
  69060349  52800469  54351395  48929065  55783836  57805383  45280443
342
  64365686  51738764  45503686  20805588   6231205  58912016  58382737
343
  55586370  47844506  88199645  45202180  57012734  35536183  66413476
344
  58255198  67400957  63388779  66920588  89665770  54467447  53674778
345
  60489113  64271298  54482571  57945567  58481866  30468523  46520205
346
  43720270  20860933  62291312  76469475  58273889  62758553  69907767
347
  46323880  58981816  57005722  63065832  55354924  53218346  48332445
348
  50847760  45896941  50855098  66765635  53976120  43076562  64591516
349
  70321285  43164571  19972560  60273622  48626910  62192138  56940941
350
  54212832  47915415  60829977  64420702  47877190  21861554  42351337
351
  56078062  58080584  65786195  69711708  35936611  61273004  40872795
352
  42664688  51610919  42895450  65840053  46914753  52775804  48278133
353
  63447113  46209355  46190947  55448740  40136288  61288618  55819950
354
  61874813  51645078  52880150  56670662  41643222  54351160  38218219
355
  20136141  48512044  54430430  40095409  37135634  47338323  47628500
356
  70593750  61983198  33989258  62762404  32312231  54883550  58804271
357
  47618042  43385504  53262000  51775467  77796207  60667212  58663579
358
  48561056  37776295  41013665  61718896  49617862  46246248  53495544
359
  59868475  65570441  57271936  33651719  42949256  56326986  46188444
360
  65338751  63280380  72026286  45558125  65005023  37707710  36690847
361
  36791707  47339679  40636949  47695871  59847402  65956305  59465169
362
  70009556  58111425  60158442  54800853  54455853  68528086  44581634
363
  45475847  66490471  42291085  35965990  41479598  58860157  36363286
364
  43083690  59348176  58854511  41230056  49617871  43539028  65823859
365
  44602672  50578213  57054596  45886960  66240091  66165612  32463206
366
  47864659  52971575  52220505  25097393  70434480  32039712  32819373
367
  58584329  45039848  59657923  49474656  63730127  65560607  47372745
368
  58633031  63905299  54165552  47486106  52398513  45153699  58159323
369
  19581961  66938365  70099565  62639887  34726837  63115934  69049347
370
  66656207  44194958  43770898  63897405  32572188  21661735  40549383
371
  37659411  45886025  64801505  21061525  59949050  31486861  62661761
372
  49585558  61242121  69353546  60024113  58447218  61229433  44275809
373
  64901114  37277569  36078563  44834451  62844105  53060365  66203234
374
  63269672  52694324  61885104  51421981  59063240  43349098  48320365
375
  89431418  40367656  61502892  64771848  36282255  45373426  66499413
376
  56130605  40281506  45002236  55357976  61842699  30509789  43349038
377
  59621075  69058739  67386247  55079081  28732898  56053773  60685111
378
  39380186  45394727  59700010  54182166]
379
```
380

381
### Exercise
382

383
Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.
384

385
## Advanced operations: broadcasting
386

387
In order to normalize every column by its corresponding library size, we have to *align* the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use `np.newaxis` to match the dimensions.
388

389

390

391
```python
392
print(expr.shape)
393
print(lib_size.shape)
394
print(lib_size[np.newaxis, :].shape)
395
```
396

397
```output
398
(20500, 375)
399
(375,)
400
(1, 375)
401
```
402

403
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
404

405

406

407
```python
408
np.all(expr / lib_size ==
409
       expr / lib_size[np.newaxis, :])
410
```
411

412

413

414

415
```output
416
True
417
```
418

419

420

421

422

423
```python
424
expr_lib = expr / lib_size
425
```
426

427
We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).
428

429

430

431
```python
432
expr_lib *= 1e6
433
```
434

435
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
436

437

438

439
```python
440
gene_len = np.load('../../data/gene-lens.npy')
441
print(gene_len.shape)
442
```
443

444
```output
445
(20500,)
446
```
447

448
### Exercise: broadcast `expr_lib` and `gene_len` together to produce RPKM
449

450

451

452
```python
453
rpkm = expr_lib  # FIX THIS
454
```
455

456

457

458
```python
459
import matplotlib.pyplot as plt
460
from scipy import stats
461

462
def plot_col_density(data, xlim=None, *args, **kwargs):
463
    # Use gaussian smoothing to estimate the density
464
    density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]
465
    if xlim is not None:
466
        m, M = xlim
467
    else:
468
        m, M = np.min(data), np.max(data)
469
    x = np.linspace(m, M, 100)
470

471
    fig, ax = plt.subplots()
472
    for density in density_per_col:
473
        ax.plot(x, density(x), *args, **kwargs)
474
    ax.set_xlabel('log-counts')
475
    ax.set_ylabel('frequency')
476
    if xlim is not None:
477
        ax.set_xlim(xlim)
478
    plt.show()
479

480
```
481

482

483

484
```python
485
%matplotlib inline
486
```
487

488

489

490
```python
491
plt.style.use('ggplot')
492
```
493

494

495

496
```python
497
plot_col_density(np.log(expr+1)[:, :20])
498
```
499

500

501
![png](1_advanced_numpy_files/1_advanced_numpy_43_0.png)
502

503

504

505

506
```python
507
plot_col_density(np.log(rpkm + 1)[:, :20], xlim=(0, 250))
508
```
509

510

511
![png](1_advanced_numpy_files/1_advanced_numpy_44_0.png)
512

513

514
### Exercise: 3D broadcasting
515

516
Below, produce the array containing the sum of every element in `x` with every element in `y`
517

518

519

520
```python
521
x = np.random.rand(3, 5)
522
y = np.random.randint(10, size=8)
523
z = x # FIX THIS
524
```
525

526
## Fancy indexing
527

528
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
529

530

531

532
```python
533
values = np.array([0, 5, 99])
534
selector = np.random.randint(0, 3, size=(3, 4))
535
print(selector)
536
print(values[selector])
537
```
538

539
```output
540
[[2 1 1 2]
541
 [0 1 0 1]
542
 [2 2 1 2]]
543
[[99  5  5 99]
544
 [ 0  5  0  5]
545
 [99 99  5 99]]
546
```
547

548
### Exercise: quantile normalization
549

550
Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.
551

552
*Hint: look for documentation for `scipy.mstats.rankdata`, `np.sort`, and `np.argsort`.*
553

554

555

556
```python
557
def qnorm(x):
558
    """Quantile normalize an input matrix.
559
    
560
    Parameters
561
    ----------
562
    x : 2D array of float, shape (M, N)
563
        The input data, with each column being a
564
        distribution to normalize.
565
        
566
    Returns
567
    -------
568
    xn : 2D array of float, shape (M, N)
569
        The normalized data.
570
    """
571
    xn = np.copy(x) # replace this by normalizing code
572
    return xn
573
```
574

575

576

577
```python
578
logexpr = np.log(expr + 1)
579
logrpkm = np.log(rpkm + 1)
580
```
581

582

583

584
```python
585
logexprn = qnorm(logexpr)
586
logrpkmn = qnorm(logrpkm)
587
```
588

589

590

591
```python
592
plot_col_density(logexprn[:, :20])
593
```
594

595

596
![png](1_advanced_numpy_files/1_advanced_numpy_53_0.png)
597

598

599

600

601
```python
602
plot_col_density(logrpkmn[:, :20], xlim=(0, 0.25))
603
```
604

605

606
![png](1_advanced_numpy_files/1_advanced_numpy_54_0.png)
607

608

609
Product

Resources

Company