📚 The CoCalc Library - books, templates and other resources
License: OTHER
1# Advanced NumPy2345```python6%matplotlib inline7```89## Quick example: gene expression, without numpy1011FIRST...1213## Even quicker: what is gene expression?1415How genes work:16171819What this means:20212223How we measure it:24252627## Some fake data2829| | Cell type A | Cell type B | Cell type C | Cell type D |30|--------|-------------|-------------|-------------|-------------|31| Gene 0 | 100 | 200 | 50 | 400 |32| Gene 1 | 50 | 0 | 0 | 100 |33| Gene 2 | 350 | 100 | 50 | 200 |34353637```python38gene0 = [100, 200, 50, 400]39gene1 = [50, 0, 0, 100]40gene2 = [350, 100, 50, 200]41expression_data = [gene0, gene1, gene2]42```4344Why is this a bad idea?4546## Now with NumPy47484950```python51import numpy as np52a = np.array(expression_data)53print(a)54```5556```output57[[100 200 50 400]58[ 50 0 0 100]59[350 100 50 200]]60```6162We are going to:6364* Obtain an *RPKM* expression matrix65* Quantile normalize the data6667using the awesome power of NumPy6869## Inside a numpy ndarray70717273```python74def print_info(a):75print('number of elements:', a.size)76print('number of dimensions:', a.ndim)77print('shape:', a.shape)78print('data type:', a.dtype)79print('strides:', a.strides)80print('flags:', a.flags)8182print_info(a)83```8485```output86number of elements: 1287number of dimensions: 288shape: (3, 4)89data type: int6490strides: (32, 8)91flags: C_CONTIGUOUS : True92F_CONTIGUOUS : False93OWNDATA : True94WRITEABLE : True95ALIGNED : True96UPDATEIFCOPY : False97```9899100101```python102print(a.data)103```104105```output106<memory at 0x7fc487c7f7e0>107```108109110111```python112abytes = a.ravel().view(dtype=np.uint8)113```114115116117```python118print_info(abytes)119```120121```output122number of elements: 96123number of dimensions: 1124shape: (96,)125data type: uint8126strides: (1,)127flags: C_CONTIGUOUS : True128F_CONTIGUOUS : True129OWNDATA : False130WRITEABLE : True131ALIGNED : True132UPDATEIFCOPY : False133```134135136137```python138print(abytes[:24])139```140141```output142[100 0 0 0 0 0 0 0 200 0 0 0 0 0 0 0 50 01430 0 0 0 0 0]144```145146### Example: take the transpose of `a`147148149150```python151print_info(a)152```153154```output155number of elements: 12156number of dimensions: 2157shape: (3, 4)158data type: int64159strides: (32, 8)160flags: C_CONTIGUOUS : True161F_CONTIGUOUS : False162OWNDATA : True163WRITEABLE : True164ALIGNED : True165UPDATEIFCOPY : False166```167168169170```python171print_info(a.T)172```173174```output175number of elements: 12176number of dimensions: 2177shape: (4, 3)178data type: int64179strides: (8, 32)180flags: C_CONTIGUOUS : False181F_CONTIGUOUS : True182OWNDATA : False183WRITEABLE : True184ALIGNED : True185UPDATEIFCOPY : False186```187188### Example: skipping rows and columns with *slicing*189190191192```python193print_info(a.T)194```195196```output197number of elements: 12198number of dimensions: 2199shape: (4, 3)200data type: int64201strides: (8, 32)202flags: C_CONTIGUOUS : False203F_CONTIGUOUS : True204OWNDATA : False205WRITEABLE : True206ALIGNED : True207UPDATEIFCOPY : False208```209210211212```python213print_info(a.T[::2])214```215216```output217number of elements: 6218number of dimensions: 2219shape: (2, 3)220data type: int64221strides: (16, 32)222flags: C_CONTIGUOUS : False223F_CONTIGUOUS : False224OWNDATA : False225WRITEABLE : True226ALIGNED : True227UPDATEIFCOPY : False228```229230231232```python233print_info(a.T[::2, ::2])234```235236```output237number of elements: 4238number of dimensions: 2239shape: (2, 2)240data type: int64241strides: (16, 64)242flags: C_CONTIGUOUS : False243F_CONTIGUOUS : False244OWNDATA : False245WRITEABLE : True246ALIGNED : True247UPDATEIFCOPY : False248```249250### Getting a copy251252253254```python255b = a256```257258259260```python261print(b)262```263264```output265[[100 200 50 400]266[ 50 0 0 100]267[350 100 50 200]]268```269270271272```python273a[0, 0] = 5274print(b)275a[0, 0] = 100276```277278```output279[[ 5 200 50 400]280[ 50 0 0 100]281[350 100 50 200]]282```283284## Advanced operations: axis-wise evaluation285286287288```python289expr = np.load('../../data/expr.npy')290```291292293294```python295print_info(expr)296```297298```output299number of elements: 7687500300number of dimensions: 2301shape: (20500, 375)302data type: uint32303strides: (4, 82000)304flags: C_CONTIGUOUS : False305F_CONTIGUOUS : True306OWNDATA : False307WRITEABLE : True308ALIGNED : True309UPDATEIFCOPY : False310```311312This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the *library size*, which is the total number of reads across a column.313314The `np.sum` function returns the sum of all the elements of an array. With the `axis` argument, you can take the sum *along the given axis*.315316317318```python319lib_size = np.sum(expr, axis=0)320print(lib_size)321```322323```output324[ 41539237 50824374 35747651 33532890 48162893 83478927 6165275632556043182 50066035 44291909 62103960 52177691 39935196 4471229632649499950 54994779 64874312 41901128 61124396 74947233 5580577832744751480 57419892 45460279 58590618 75900141 79022684 5125336832863148322 64114541 53696942 50946576 55709927 60184139 7388749132951905092 39721522 42746126 43019952 40435776 57789966 3052601833050488171 59600438 49802751 103219262 80445758 76730579 5999671533170611975 53788477 57967905 50450027 61725636 52939093 4835522033260814140 61419740 52865931 42718020 47981997 50989994 5526560933357268101 37174525 65469319 69459796 64633280 52550614 5280039833467603908 42568697 54328614 58801154 41408022 55721943 5605135133548906539 30074621 56514800 61685565 45155573 55190918 5823967333651484868 56456500 58897501 41200475 60510647 45186472 4886310033744723293 66857959 43679229 63454731 40459554 44210528 7164989733851052494 55893243 23880954 37930574 50358271 67084571 5595477933960533989 37534918 58268372 69034028 62187943 31471960 6021132734069060349 52800469 54351395 48929065 55783836 57805383 4528044334164365686 51738764 45503686 20805588 6231205 58912016 5838273734255586370 47844506 88199645 45202180 57012734 35536183 6641347634358255198 67400957 63388779 66920588 89665770 54467447 5367477834460489113 64271298 54482571 57945567 58481866 30468523 4652020534543720270 20860933 62291312 76469475 58273889 62758553 6990776734646323880 58981816 57005722 63065832 55354924 53218346 4833244534750847760 45896941 50855098 66765635 53976120 43076562 6459151634870321285 43164571 19972560 60273622 48626910 62192138 5694094134954212832 47915415 60829977 64420702 47877190 21861554 4235133735056078062 58080584 65786195 69711708 35936611 61273004 4087279535142664688 51610919 42895450 65840053 46914753 52775804 4827813335263447113 46209355 46190947 55448740 40136288 61288618 5581995035361874813 51645078 52880150 56670662 41643222 54351160 3821821935420136141 48512044 54430430 40095409 37135634 47338323 4762850035570593750 61983198 33989258 62762404 32312231 54883550 5880427135647618042 43385504 53262000 51775467 77796207 60667212 5866357935748561056 37776295 41013665 61718896 49617862 46246248 5349554435859868475 65570441 57271936 33651719 42949256 56326986 4618844435965338751 63280380 72026286 45558125 65005023 37707710 3669084736036791707 47339679 40636949 47695871 59847402 65956305 5946516936170009556 58111425 60158442 54800853 54455853 68528086 4458163436245475847 66490471 42291085 35965990 41479598 58860157 3636328636343083690 59348176 58854511 41230056 49617871 43539028 6582385936444602672 50578213 57054596 45886960 66240091 66165612 3246320636547864659 52971575 52220505 25097393 70434480 32039712 3281937336658584329 45039848 59657923 49474656 63730127 65560607 4737274536758633031 63905299 54165552 47486106 52398513 45153699 5815932336819581961 66938365 70099565 62639887 34726837 63115934 6904934736966656207 44194958 43770898 63897405 32572188 21661735 4054938337037659411 45886025 64801505 21061525 59949050 31486861 6266176137149585558 61242121 69353546 60024113 58447218 61229433 4427580937264901114 37277569 36078563 44834451 62844105 53060365 6620323437363269672 52694324 61885104 51421981 59063240 43349098 4832036537489431418 40367656 61502892 64771848 36282255 45373426 6649941337556130605 40281506 45002236 55357976 61842699 30509789 4334903837659621075 69058739 67386247 55079081 28732898 56053773 6068511137739380186 45394727 59700010 54182166]378```379380### Exercise381382Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.383384## Advanced operations: broadcasting385386In order to normalize every column by its corresponding library size, we have to *align* the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use `np.newaxis` to match the dimensions.387388389390```python391print(expr.shape)392print(lib_size.shape)393print(lib_size[np.newaxis, :].shape)394```395396```output397(20500, 375)398(375,)399(1, 375)400```401402However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:403404405406```python407np.all(expr / lib_size ==408expr / lib_size[np.newaxis, :])409```410411412413414```output415True416```417418419420421422```python423expr_lib = expr / lib_size424```425426We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).427428429430```python431expr_lib *= 1e6432```433434Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).435436437438```python439gene_len = np.load('../../data/gene-lens.npy')440print(gene_len.shape)441```442443```output444(20500,)445```446447### Exercise: broadcast `expr_lib` and `gene_len` together to produce RPKM448449450451```python452rpkm = expr_lib # FIX THIS453```454455456457```python458import matplotlib.pyplot as plt459from scipy import stats460461def plot_col_density(data, xlim=None, *args, **kwargs):462# Use gaussian smoothing to estimate the density463density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]464if xlim is not None:465m, M = xlim466else:467m, M = np.min(data), np.max(data)468x = np.linspace(m, M, 100)469470fig, ax = plt.subplots()471for density in density_per_col:472ax.plot(x, density(x), *args, **kwargs)473ax.set_xlabel('log-counts')474ax.set_ylabel('frequency')475if xlim is not None:476ax.set_xlim(xlim)477plt.show()478479```480481482483```python484%matplotlib inline485```486487488489```python490plt.style.use('ggplot')491```492493494495```python496plot_col_density(np.log(expr+1)[:, :20])497```498499500501502503504505```python506plot_col_density(np.log(rpkm + 1)[:, :20], xlim=(0, 250))507```508509510511512513### Exercise: 3D broadcasting514515Below, produce the array containing the sum of every element in `x` with every element in `y`516517518519```python520x = np.random.rand(3, 5)521y = np.random.randint(10, size=8)522z = x # FIX THIS523```524525## Fancy indexing526527You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.528529530531```python532values = np.array([0, 5, 99])533selector = np.random.randint(0, 3, size=(3, 4))534print(selector)535print(values[selector])536```537538```output539[[2 1 1 2]540[0 1 0 1]541[2 2 1 2]]542[[99 5 5 99]543[ 0 5 0 5]544[99 99 5 99]]545```546547### Exercise: quantile normalization548549Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.550551*Hint: look for documentation for `scipy.mstats.rankdata`, `np.sort`, and `np.argsort`.*552553554555```python556def qnorm(x):557"""Quantile normalize an input matrix.558559Parameters560----------561x : 2D array of float, shape (M, N)562The input data, with each column being a563distribution to normalize.564565Returns566-------567xn : 2D array of float, shape (M, N)568The normalized data.569"""570xn = np.copy(x) # replace this by normalizing code571return xn572```573574575576```python577logexpr = np.log(expr + 1)578logrpkm = np.log(rpkm + 1)579```580581582583```python584logexprn = qnorm(logexpr)585logrpkmn = qnorm(logrpkm)586```587588589590```python591plot_col_density(logexprn[:, :20])592```593594595596597598599600```python601plot_col_density(logrpkmn[:, :20], xlim=(0, 0.25))602```603604605606607608609