📚 The CoCalc Library - books, templates and other resources
License: OTHER
Advanced NumPy
Quick example: gene expression, without numpy
FIRST...
Even quicker: what is gene expression?
How genes work:
What this means:
How we measure it:
Some fake data
Cell type A | Cell type B | Cell type C | Cell type D | |
---|---|---|---|---|
Gene 0 | 100 | 200 | 50 | 400 |
Gene 1 | 50 | 0 | 0 | 100 |
Gene 2 | 350 | 100 | 50 | 200 |
Why is this a bad idea?
Now with NumPy
We are going to:
Obtain an RPKM expression matrix
Quantile normalize the data
using the awesome power of NumPy
Inside a numpy ndarray
Example: take the transpose of a
Example: skipping rows and columns with slicing
Getting a copy
Advanced operations: axis-wise evaluation
This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the library size, which is the total number of reads across a column.
The np.sum
function returns the sum of all the elements of an array. With the axis
argument, you can take the sum along the given axis.
Exercise
Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.
Advanced operations: broadcasting
In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use np.newaxis
to match the dimensions.
However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:
We also multiply by in order to keep the numbers on a readable scale (reads per million reads).
Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).
Exercise: broadcast expr_lib
and gene_len
together to produce RPKM
Exercise: 3D broadcasting
Below, produce the array containing the sum of every element in x
with every element in y
Fancy indexing
You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.
Exercise: quantile normalization
Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.
Hint: look for documentation for scipy.mstats.rankdata
, np.sort
, and np.argsort
.