CoCalc -- 2018-02-09.ipynb

Download

📚 The CoCalc Library - books, templates and other resources

📚 The Library - Shared Public Version

cocalc-examples / math157 / lectures / 2018-02-09 / 2018-02-09.ipynb

¹³²⁹³⁰ views
License: OTHER

Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

February 9, 2018: Pandas

Administrivia:

As usual, virtual office hours run 3-7pm, and the homework is due at 8pm.
For homework problem 4c, please take $k=1,2,3,4$ instead of $k=1,2,3,4,5$ to avoid crashing your Jupyter kernel. (If you are having trouble even for $k=4$ , try restarting your project. If that doesn't help, you might need to make your code more memory-efficient.)
Advance warning: next week, my office hours will take place Thursday 3-4 instead of 4-5 due to a schedule conflict.

What is pandas?

"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

It is used all over the world (including at the UCSD Guardian, I am told) as a primary tool for transferring data sets into Python for the purposes of manipulation, visualization, etc.

pandas is a member of the SciPy ecosystem.

Warning: big data = big headaches (sometimes)

When using pandas, it is easy to create very large files in your course project. The disk quota for your project is currently about 300MB; overrunning this may cause unexpected problems (e.g., getting your project stuck in a "locked" state). Contact course staff if this happens, but more importantly do not leave large files in your project any longer than necessary!

10 minutes to pandas

The following is adapted from this tutorial, with a few changes to handle differences between Sage and pure Python. The "10 minutes" is misleading; we will spend the whole hour on this and not get through it all. It would take a few hours to go through everything carefully, which I would recommend if you plan to use this extensively.

In [1]:

import pandas as pd
import numpy as np

One fundamental data structure in pandas is a Series, which is similar to a list.

In [2]:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Out[2]:

    1
    3
    5
  NaN
    6
    8
dtype: object

It is possible to specify alternate labels instead of the default 0,1,...; more on this later.

Another fundamental data structure is a DataFrame, which is basically a list of Series. A good metaphor for how a DataFrame behaves is an Excel spreadsheet; in fact, it is not hard to import and export Excel spreadsheets using this data structure.

In [3]:

dates = pd.date_range('20130101', periods=6) # oops, this gives an error
dates

Out[3]:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-1d8021c635fd> in <module>()
----> 1 dates = pd.date_range('20130101', periods=Integer(6)) # oops, this gives an error
      2 dates
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in date_range(start, end, periods, freq, tz, normalize, name, closed, **kwargs)
   2060     return DatetimeIndex(start=start, end=end, periods=periods,
   2061                          freq=freq, tz=tz, normalize=normalize, name=name,
-> 2062                          closed=closed, **kwargs)
   2063 
   2064 
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/util/_decorators.pyc in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
    302             elif not is_integer(periods):
    303                 msg = 'periods must be a number, got {periods}'
--> 304                 raise TypeError(msg.format(periods=periods))
    305 
    306         if data is None and freq is None:
TypeError: periods must be a number, got 6

Let me take a quick aside to illustrate why this happened. Remember that Sage feeds commands through a preparser before delivering them to Python; that's how you are able to type 3^4 in Sage to mean exponentiation instead of 3**4. But you can also call the preparser yourself to see how it behaves.

In [5]:

preparse("3^4")

Out[5]:

'Integer(3)**Integer(4)'

In [4]:

preparse("dates = pd.date_range('20130101', periods=6)") # Let's see why this happened.

Out[4]:

"dates = pd.date_range('20130101', periods=Integer(6))"

So Sage is making sure that integer literals get created as Sage integers by default, rather than Python integers. pandas doesn't know how to fix this, because it doesn't know anything about Sage; but we can fix this manually.

In [6]:

dates = pd.date_range('20130101', periods=int(6))
dates

Out[6]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Better yet, we can turn off the preparser so that we don't have to keep doing this. Don't forget to turn it back on when you want to switch back to Sage syntax!

In [7]:

preparser(False)

In [8]:

dates = pd.date_range('20130101', periods=6)
dates

Out[8]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

End of aside.

Here we convert a numpy array into a DataFrame.

In [9]:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Out[9]:

Notice that Jupyter knows to how to pretty-print this!

In [10]:

type(df)

Out[10]:

<class 'pandas.core.frame.DataFrame'>

In [0]:

df. # Try tab-completion to see what methods are available

Here we construct a DataFrame in a different way, from a dictionary of list-like objects.

In [11]:

df2 = pd.DataFrame({ 'A' : 1.,
                     "G" : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([1,2,3,4],dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : ['foo', 'bar', 'foo', 'bar'] })

df2

Out[11]:

The data in each column has to be a particular type. However, as a fallback, pandas allows the type object for arbitrary Python objects.

In [12]:

df2.dtypes # Like in numpy, this is an attribute rather than a method.

Out[12]:

A           float64
C           float32
D             int32
E          category
F            object
G    datetime64[ns]
dtype: object

There are various ways to view data in a DataFrame...

In [13]:

df.head(2)

Out[13]:

In [14]:

df.tail()

Out[14]:

In [15]:

df.index

Out[15]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:

df.columns

Out[16]:

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [17]:

df.values

Out[17]:

array([[-0.81038868,  0.82407517,  1.38278369, -0.1436792 ],
       [-0.50901509, -0.43543885, -1.22481693, -0.24919439],
       [ 0.95378815, -0.29844621,  1.34592039,  0.17929153],
       [ 0.66858785,  0.40165846,  0.00276616,  1.01860646],
       [-1.51008067,  0.9570845 , -1.57934228, -0.96390948],
       [-1.3192444 , -0.35134697, -1.12665158, -0.4731958 ]])

It is also easy to get some quick statistics on the data.

In [18]:

df.describe()

Out[18]:

In [19]:

df.T # Transpose, like a matrix

Out[19]:

In [20]:

df.sort_index(axis=1, ascending=False) # Sort by index

Out[20]:

In [21]:

df.sort_values(by='B') # Sort by values in a column

Out[21]:

Now let's look at ways to extract (or "select", to continue the Excel metaphor) sections of data.

In [22]:

df['A'] # Select a single column

Out[22]:

2013-01-01   -0.810389
2013-01-02   -0.509015
2013-01-03    0.953788
2013-01-04    0.668588
2013-01-05   -1.510081
2013-01-06   -1.319244
Freq: D, Name: A, dtype: float64

In [23]:

df[0:3] # Use slice notation to select a range of rows

Out[23]:

In [24]:

df['20130102':'20130104'] # You can also slice by index rather than position.

Out[24]:

In [25]:

df.loc[dates[0]] # Select a single row

Out[25]:

A   -0.810389
B    0.824075
C    1.382784
D   -0.143679
Name: 2013-01-01 00:00:00, dtype: float64

In [26]:

df.loc[:,['A','B']] # Select multiple columns

Out[26]:

In [27]:

df.loc['20130102':'20130104',['A','B']] # Select rows and columns

Out[27]:

In [28]:

df.loc['20130102',['A','B']] # Note the reduction of dimension here...

Out[28]:

A   -0.509015
B   -0.435439
Name: 2013-01-02 00:00:00, dtype: float64

In [29]:

df.loc[dates[0],'A'] # .. and here.

Out[29]:

-0.81038867560804528

In [30]:

df.at[dates[0],'A'] # Another way to select a single entry, may be more efficient in practice.

Out[30]:

-0.81038867560804528

In [31]:

df.iloc[3] # Select by position instead of index

Out[31]:

A    0.668588
B    0.401658
C    0.002766
D    1.018606
Name: 2013-01-04 00:00:00, dtype: float64

In [32]:

df.iloc[3:5,0:2] # Slice by position, a la python

Out[32]:

In [33]:

df.iloc[[1,2,4],[0,2]] # Again, you can use lists instead of ranges

Out[33]:

In [34]:

df.iloc[1:3,:]

Out[34]:

In [35]:

df.iloc[:,1:3]

Out[35]:

In [36]:

df.iloc[1,1]

Out[36]:

-0.43543884541049577

In [37]:

df.iat[1,1] # Equivalent to the previous one, but more efficient

Out[37]:

-0.43543884541049577

You can also select rows or columns based on boolean conditions, as in a list comprehension; but the notation here is more compact.

In [38]:

df[df.A > 0]

Out[38]:

In [39]:

df[df > 0] # Pick out all entries in the DataFrame satisfying this condition

Out[39]:

The missing entries here have been set to np.nan (NaN stands for "Not a Number"). This is the pandas analogue of having an empty cell in an Excel spreadsheet. Operations generally skip over missing data.

In [40]:

dft = df[df > 0]
dft.describe() # Pay attention to the counts

Out[40]:

In [41]:

df2 = df.copy() # Make a copy so we don't mess up the original
df2['E'] = ['one', 'one','two','three','four','three'] # Add a new column
df2

Out[41]:

In [43]:

df2[df2['E'].isin(['two','three'])] # Pick out rows based on whether the entry in a given column belongs to a given list

Out[43]:

You can write values into a single entry, or a range; this is analogous to pasting into an Excel spreadsheet (or in the case of a single value, simply typing in a new value).

In [44]:

df.at[dates[0],'A'] = 0 # Reference by index...

In [45]:

df.iat[0,1] = 0 # ... or position

In [46]:

df.loc[:,'D'] = np.array([5] * len(df)) # Set a whole column

In [47]:

df # See what happened

Out[47]:

In [48]:

df2 = df.copy()
df2[df2 > 0] = -df2 # Select entries based on a criterion, then set those
df2 # No positive entries should appear!

Out[48]:

It is possible to change indices in an existing DataFrame.

In [49]:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Out[49]:

Returning to missing data, one can handle it in various ways.

In [50]:

df1.dropna(how='any') # Drop rows with missing data

Out[50]:

In [51]:

df1.fillna(value=5) # Fill missing data with a default value

Out[51]:

In [52]:

pd.isna(df1) # Create a Boolean array identifying the locations of missing data

Out[52]:

Let me skip a few sections here.

Binary operations
Statistics
Applying functions to the data (like the Python map function on a list)
Histogramming
String processing

One can combine data in various ways, such as concatenation...

In [53]:

df = pd.DataFrame(np.random.randn(10, 4))
df

Out[53]:

In [54]:

pieces = [df[7:], df[3:7], df[:3]]
pd.concat(pieces) # Reassemble

Out[54]:

... or merging, as in a SQL database. (As a mathematician, I like to think of this as a "Cartesian product".)

In [55]:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [56]:

left

Out[56]:

In [57]:

right

Out[57]:

In [58]:

pd.merge(left, right, on='key')

Out[58]:

In [59]:

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [60]:

left

Out[60]:

In [61]:

right

Out[61]:

In [62]:

pd.merge(left, right, on='key')

Out[62]:

In [0]:

Let's skip some more.

Appending rows
Grouping (this means splitting the data on some criteria, applying a different function to each group, then recombining)
Reshaping
Time series (more on these when we focus on statistics)
Categoricals (values limited to a small number of options, e.g., letter grades)

Plotting uses matplotlib.

In [63]:

import matplotlib.pyplot as plt

In [64]:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum() # Cumulative sum
ts.plot(); # Don't print this, just do the plot

Out[64]:

Plotting a DataFrame gives you superimposed plots.

In [65]:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                        columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best');

Out[65]:

<matplotlib.figure.Figure object at 0x7f8dd411d550>

Finally, one can move data in and out of CSV files...

In [66]:

df.to_csv('foo.csv')
pd.read_csv('foo.csv')

Out[66]:

... or Excel spreadsheets.

In [67]:

df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Out[67]:

After running the previous example, try switching to the file view. You can try downloading the new files and trying to open them in a spreadsheet program (Excel, OpenOffice, etc.).

In [0]: