License: OTHER
Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

February 9, 2018: Pandas

Administrivia:

  • As usual, virtual office hours run 3-7pm, and the homework is due at 8pm.

  • For homework problem 4c, please take k=1,2,3,4 instead of k=1,2,3,4,5 to avoid crashing your Jupyter kernel. (If you are having trouble even for k=4, try restarting your project. If that doesn't help, you might need to make your code more memory-efficient.)

  • Advance warning: next week, my office hours will take place Thursday 3-4 instead of 4-5 due to a schedule conflict.

What is pandas?

"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

It is used all over the world (including at the UCSD Guardian, I am told) as a primary tool for transferring data sets into Python for the purposes of manipulation, visualization, etc.

pandas is a member of the SciPy ecosystem.

Warning: big data = big headaches (sometimes)

When using pandas, it is easy to create very large files in your course project. The disk quota for your project is currently about 300MB; overrunning this may cause unexpected problems (e.g., getting your project stuck in a "locked" state). Contact course staff if this happens, but more importantly do not leave large files in your project any longer than necessary!
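One way to keep an eye on disk usage from within Python is to walk your project directory and flag anything large. This is a sketch of my own devising (the `large_files` helper and the 50MB threshold are not part of pandas or CoCalc):

```python
import os

def large_files(root='.', threshold=50 * 1024 * 1024):
    """List (path, size) pairs for files under root larger than threshold bytes."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            if size > threshold:
                hits.append((path, size))
    # Largest files first, so the worst offenders are at the top
    return sorted(hits, key=lambda t: -t[1])
```

Running `large_files()` from your project's home directory should surface any files worth deleting before they eat your quota.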

10 minutes to pandas

The following is adapted from this tutorial, with a few changes to handle differences between Sage and pure Python. The "10 minutes" is misleading; we will spend the whole hour on this and not get through it all. It would take a few hours to go through everything carefully, which I would recommend if you plan to use this extensively.

import pandas as pd
import numpy as np

One fundamental data structure in pandas is a Series, which is similar to a list.

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0      1
1      3
2      5
3    NaN
4      6
5      8
dtype: object

It is possible to specify alternate labels instead of the default 0,1,...; more on this later.
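As a quick preview, here is a small sketch of a Series with custom labels (the course names are made up for illustration):

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, ...
s2 = pd.Series([9.0, 2.5, 7.1], index=['math', 'physics', 'chemistry'])
# Entries can now be accessed by label rather than by position:
s2['physics']  # 2.5
```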

Another fundamental data structure is a DataFrame, which is basically a list of Series. A good metaphor for how a DataFrame behaves is an Excel spreadsheet; in fact, it is not hard to import and export Excel spreadsheets using this data structure.

dates = pd.date_range('20130101', periods=6) # oops, this gives an error
dates
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-1d8021c635fd> in <module>()
----> 1 dates = pd.date_range('20130101', periods=Integer(6)) # oops, this gives an error
      2 dates

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in date_range(start, end, periods, freq, tz, normalize, name, closed, **kwargs)
   2060     return DatetimeIndex(start=start, end=end, periods=periods,
   2061                          freq=freq, tz=tz, normalize=normalize, name=name,
-> 2062                          closed=closed, **kwargs)
   2063
   2064

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/util/_decorators.pyc in wrapper(*args, **kwargs)
    116             else:
    117                 kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
    302         elif not is_integer(periods):
    303             msg = 'periods must be a number, got {periods}'
--> 304             raise TypeError(msg.format(periods=periods))
    305
    306         if data is None and freq is None:

TypeError: periods must be a number, got 6

Let me take a quick aside to illustrate why this happened. Remember that Sage feeds commands through a preparser before delivering them to Python; that's how you are able to type 3^4 in Sage to mean exponentiation instead of 3**4. But you can also call the preparser yourself to see how it behaves.

preparse("3^4")
'Integer(3)**Integer(4)'
preparse("dates = pd.date_range('20130101', periods=6)") # Let's see why this happened.
"dates = pd.date_range('20130101', periods=Integer(6))"

So Sage is making sure that integer literals get created as Sage integers by default, rather than Python integers. pandas doesn't know how to fix this, because it doesn't know anything about Sage; but we can fix this manually.

dates = pd.date_range('20130101', periods=int(6))
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')

Better yet, we can turn off the preparser so that we don't have to keep doing this. Don't forget to turn it back on when you want to switch back to Sage syntax!

preparser(False)
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')

End of aside.

Here we convert a numpy array into a DataFrame.

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Notice that Jupyter knows how to pretty-print this!

type(df)
<class 'pandas.core.frame.DataFrame'>
df. # Try tab-completion to see what methods are available

Here we construct a DataFrame in a different way, from a dictionary of list-like objects.

df2 = pd.DataFrame({'A': 1.,
                    'G': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([1, 2, 3, 4], dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': ['foo', 'bar', 'foo', 'bar']})
df2

The data in each column has to be a particular type. However, as a fallback, pandas allows the type object for arbitrary Python objects.
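A tiny sketch of the fallback in action (the example Series here are my own, not from the tutorial):

```python
import pandas as pd

# A homogeneous column gets a specific dtype...
nums = pd.Series([1, 2, 3])
# ...but mixing incompatible types forces the generic 'object' dtype.
mixed = pd.Series([1, 'two', 3.0])
nums.dtype, mixed.dtype
```

Operations on `object` columns are generally slower, so it's worth checking `dtypes` when importing messy data.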

df2.dtypes # Like in numpy, this is an attribute rather than a method.
A           float64
C           float32
D             int32
E          category
F            object
G    datetime64[ns]
dtype: object

There are various ways to view data in a DataFrame...

df.head(2)
df.tail()
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df.values
array([[-0.81038868,  0.82407517,  1.38278369, -0.1436792 ],
       [-0.50901509, -0.43543885, -1.22481693, -0.24919439],
       [ 0.95378815, -0.29844621,  1.34592039,  0.17929153],
       [ 0.66858785,  0.40165846,  0.00276616,  1.01860646],
       [-1.51008067,  0.9570845 , -1.57934228, -0.96390948],
       [-1.3192444 , -0.35134697, -1.12665158, -0.4731958 ]])

It is also easy to get some quick statistics on the data.

df.describe()
df.T # Transpose, like a matrix
df.sort_index(axis=1, ascending=False) # Sort by labels; axis=1 means column labels
df.sort_values(by='B') # Sort by values in a column

Now let's look at ways to extract (or "select", to continue the Excel metaphor) sections of data.

df['A'] # Select a single column
2013-01-01   -0.810389
2013-01-02   -0.509015
2013-01-03    0.953788
2013-01-04    0.668588
2013-01-05   -1.510081
2013-01-06   -1.319244
Freq: D, Name: A, dtype: float64
df[0:3] # Use slice notation to select a range of rows
df['20130102':'20130104'] # You can also slice by index rather than position.
df.loc[dates[0]] # Select a single row
A   -0.810389
B    0.824075
C    1.382784
D   -0.143679
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:,['A','B']] # Select multiple columns
df.loc['20130102':'20130104',['A','B']] # Select rows and columns
df.loc['20130102',['A','B']] # Note the reduction of dimension here...
A   -0.509015
B   -0.435439
Name: 2013-01-02 00:00:00, dtype: float64
df.loc[dates[0],'A'] # ... and here.
-0.81038867560804528
df.at[dates[0],'A'] # Another way to select a single entry, may be more efficient in practice.
-0.81038867560804528
df.iloc[3] # Select by position instead of index
A    0.668588
B    0.401658
C    0.002766
D    1.018606
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2] # Slice by position, a la python
df.iloc[[1,2,4],[0,2]] # Again, you can use lists instead of ranges
df.iloc[1:3,:]
df.iloc[:,1:3]
df.iloc[1,1]
-0.43543884541049577
df.iat[1,1] # Equivalent to the previous one, but more efficient
-0.43543884541049577

You can also select rows or columns based on boolean conditions, as in a list comprehension; but the notation here is more compact.

df[df.A > 0]
df[df > 0] # Pick out all entries in the DataFrame satisfying this condition

The missing entries here have been set to np.nan (NaN stands for "Not a Number"). This is the pandas analogue of having an empty cell in an Excel spreadsheet. Operations generally skip over missing data.
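A minimal sketch of "operations skip over missing data" (the Series here is my own example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
s.mean()   # (1 + 3) / 2 = 2.0 -- the NaN is excluded, and the divisor is 2, not 3
s.sum()    # 4.0 -- NaN is treated as missing, not as a value
s.count()  # 2  -- counts only the non-missing entries
```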

dft = df[df > 0]
dft.describe() # Pay attention to the counts
df2 = df.copy() # Make a copy so we don't mess up the original
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three'] # Add a new column
df2
df2[df2['E'].isin(['two','three'])] # Pick out rows based on whether the entry in a given column belongs to a given list

You can write values into a single entry, or a range; this is analogous to pasting into an Excel spreadsheet (or in the case of a single value, simply typing in a new value).

df.at[dates[0],'A'] = 0 # Reference by index...
df.iat[0,1] = 0 # ... or position
df.loc[:,'D'] = np.array([5] * len(df)) # Set a whole column
df # See what happened
df2 = df.copy()
df2[df2 > 0] = -df2 # Select entries based on a criterion, then set those
df2 # No positive entries should appear!

It is possible to change indices in an existing DataFrame.

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1

Returning to missing data, one can handle it in various ways.

df1.dropna(how='any') # Drop rows with missing data
df1.fillna(value=5) # Fill missing data with a default value
pd.isna(df1) # Create a Boolean array identifying the locations of missing data

Let me skip a few sections here.

  • Binary operations

  • Statistics

  • Applying functions to the data (like the Python map function on a list)

  • Histogramming

  • String processing
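As a small taste of the "applying functions" item in the list above, here is a sketch with a made-up DataFrame (not from the tutorial):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Apply a function to each column (the default, axis=0)...
col_ranges = df.apply(lambda col: col.max() - col.min())
# ...or to each row (axis=1), much like Python's map on a list.
row_sums = df.apply(lambda row: row.sum(), axis=1)
```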

One can combine data in various ways, such as concatenation...

df = pd.DataFrame(np.random.randn(10, 4))
df
pieces = [df[7:], df[3:7], df[:3]]
pd.concat(pieces) # Reassemble

... or merging, as in a SQL database. (As a mathematician, I like to think of this as a "Cartesian product".)

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
right
pd.merge(left, right, on='key')
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left
right
pd.merge(left, right, on='key')

Let's skip some more.

  • Appending rows

  • Grouping (this means splitting the data on some criteria, applying a different function to each group, then recombining)

  • Reshaping

  • Time series (more on these when we focus on statistics)

  • Categoricals (values limited to a small number of options, e.g., letter grades)
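To give the "grouping" item above some flesh, here is a minimal split-apply-combine sketch (the grades DataFrame is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A', 'B'],
                   'score': [95, 82, 91, 78]})
# Split on 'grade', apply mean() to each group's scores, recombine into a Series
means = df.groupby('grade')['score'].mean()
# means['A'] is (95 + 91) / 2 = 93.0; means['B'] is (82 + 78) / 2 = 80.0
```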

Plotting uses matplotlib.

import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum() # Cumulative sum
ts.plot(); # Don't print the return value, just do the plot
Image in a Jupyter notebook

Plotting a DataFrame gives you superimposed plots.

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure();
df.plot();
plt.legend(loc='best');
<matplotlib.figure.Figure object at 0x7f8dd411d550>
Image in a Jupyter notebook

Finally, one can move data in and out of CSV files...

df.to_csv('foo.csv')
pd.read_csv('foo.csv')

... or Excel spreadsheets.

df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

After running the previous example, try switching to the file view. You can download the new files and open them in a spreadsheet program (Excel, OpenOffice, etc.).