Math 157: Intro to Mathematical Software
UC San Diego, winter 2018
February 9, 2018: Pandas
Administrivia:
As usual, virtual office hours run 3-7pm, and the homework is due at 8pm.
For homework problem 4c, please take instead of to avoid crashing your Jupyter kernel. (If you are having trouble even for , try restarting your project. If that doesn't help, you might need to make your code more memory-efficient.)
Advance warning: next week, my office hours will take place Thursday 3-4 instead of 4-5 due to a schedule conflict.
What is pandas?
"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."
It is used all over the world (including at the UCSD Guardian, I am told) as a primary tool for transferring data sets into Python for the purposes of manipulation, visualization, etc.
pandas is a member of the SciPy ecosystem.
Warning: big data = big headaches (sometimes)
When using pandas, it is easy to create very large files in your course project. The disk quota for your project is currently about 300MB; overrunning this may cause unexpected problems (e.g., getting your project stuck in a "locked" state). Contact course staff if this happens, but more importantly do not leave large files in your project any longer than necessary!
10 minutes to pandas
The following is adapted from this tutorial, with a few changes to handle differences between Sage and pure Python. The "10 minutes" is misleading; we will spend the whole hour on this and not get through it all. It would take a few hours to go through everything carefully, which I would recommend if you plan to use this extensively.
One fundamental data structure in pandas is a Series, which is similar to a list.
It is possible to specify alternate labels instead of the default 0,1,...; more on this later.
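As a minimal sketch (assuming plain Python, or Sage with the preparser turned off; the values and labels are made up for illustration), a Series can be built from a list, with or without explicit labels:

```python
import numpy as np
import pandas as pd

# A Series built from a list; np.nan marks a missing value.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)          # labeled 0, 1, ..., 5 by default

# The same data with alternate labels instead of 0, 1, ...
t = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))
print(t["c"])     # look up a value by its label
```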
Another fundamental data structure is a DataFrame, which is basically a list of Series. A good metaphor for how a DataFrame behaves is an Excel spreadsheet; in fact, it is not hard to import and export Excel spreadsheets using this data structure.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-1d8021c635fd> in <module>()
----> 1 dates = pd.date_range('20130101', periods=Integer(6)) # oops, this gives an error
2 dates
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in date_range(start, end, periods, freq, tz, normalize, name, closed, **kwargs)
2060 return DatetimeIndex(start=start, end=end, periods=periods,
2061 freq=freq, tz=tz, normalize=normalize, name=name,
-> 2062 closed=closed, **kwargs)
2063
2064
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/util/_decorators.pyc in wrapper(*args, **kwargs)
116 else:
117 kwargs[new_arg_name] = new_arg_value
--> 118 return func(*args, **kwargs)
119 return wrapper
120 return _deprecate_kwarg
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.pyc in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
302 elif not is_integer(periods):
303 msg = 'periods must be a number, got {periods}'
--> 304 raise TypeError(msg.format(periods=periods))
305
306 if data is None and freq is None:
TypeError: periods must be a number, got 6
Let me take a quick aside to illustrate why this happened. Remember that Sage feeds commands through a preparser before delivering them to Python; that's how you are able to type 3^4 in Sage to mean exponentiation instead of 3**4. But you can also call the preparser yourself to see how it behaves.
So Sage is making sure that integer literals get created as Sage integers by default, rather than Python integers. pandas doesn't know how to fix this, because it doesn't know anything about Sage; but we can fix this manually.
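One way to make the manual fix, sketched here, is to wrap the literal in int(...). In Sage this converts the Sage integer back into a Python integer; in plain Python it does nothing, so the same line works in both settings.

```python
import pandas as pd

# In Sage, the preparser rewrites the literal 6 as Integer(6);
# int(...) turns that back into a Python int, which pandas accepts.
# In plain Python, 6 is already a Python int and int(6) is a no-op.
dates = pd.date_range("20130101", periods=int(6))
print(dates)
```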
Better yet, we can turn off the preparser so that we don't have to keep doing this. Don't forget to turn it back on when you want to switch back to Sage syntax!
End of aside.
Here we convert a numpy array into a DataFrame.
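A sketch of this conversion, assuming the preparser is off (or plain Python); the random data and the column names A-D are illustrative choices:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)   # six consecutive days
values = np.random.randn(6, 4)                  # 6x4 array of random normal values

# Rows are labeled by date, columns by the letters A-D.
df = pd.DataFrame(values, index=dates, columns=list("ABCD"))
df   # as the last line of a Jupyter cell, this displays the table
```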
Notice that Jupyter knows how to pretty-print this!
Here we construct a DataFrame in a different way, from a dictionary of list-like objects.
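For instance (a sketch with made-up column contents), each dictionary key becomes a column name, and scalar values are repeated to fill the column:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                  # scalar, repeated in every row
    "B": pd.Timestamp("20130102"),             # a single date, also repeated
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                # a plain Python string
})
print(df2)
```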
The data in each column has to be a particular type. However, as a fallback, pandas allows the type object for arbitrary Python objects.
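To see the per-column types, one can inspect the dtypes attribute. In this small sketch (the column names are hypothetical), the third column has no common type and so falls back to object:

```python
import pandas as pd

df = pd.DataFrame({
    "n": [1, 2, 3],                # integers
    "x": [0.5, 1.5, 2.5],          # floats
    "mixed": [1, "two", [3.0]],    # an int, a string, and a list
})
print(df.dtypes)                   # one type per column
```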
There are various ways to view data in a DataFrame...
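A few of the common ones, sketched on a small frame whose entries are just 0 through 23:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range("20130101", periods=6),
                  columns=list("ABCD"))

print(df.head(2))      # first two rows
print(df.tail(3))      # last three rows
print(df.index)        # row labels
print(df.columns)      # column labels
print(df.to_numpy())   # the underlying data as a numpy array
```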
It is also easy to get some quick statistics on the data.
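For example, describe() summarizes every numeric column at once, while methods such as mean() compute a single statistic per column (the data here is invented for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)

means = df.mean()         # just the column means, as a Series
print(means)
```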
Now let's look at ways to extract (or "select", to continue the Excel metaphor) sections of data.
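The main selection idioms can be sketched as follows. Note the asymmetry: label-based slices with loc include both endpoints, whereas position-based slices with iloc exclude the end, as in ordinary Python.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                      # a single column, returned as a Series
first_rows = df[0:3]               # rows by position, like slicing a list
by_label = df.loc["20130102":"20130104", ["A", "B"]]   # both endpoints included
by_position = df.iloc[3:5, 0:2]    # end excluded, as in Python slices
one_value = df.at[dates[0], "A"]   # fast access to a single entry
```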
You can also select rows or columns based on boolean conditions, as in a list comprehension; but the notation here is more compact.
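A sketch of both styles of boolean selection, on invented data: a condition on one column filters rows, while a condition on the whole frame keeps the shape and blanks out the entries that fail the test.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [-1, 2, -3, 4]})

positive_A = df[df["A"] > 0]   # keep only the rows where column A is positive
masked = df[df > 0]            # same shape; entries that fail the test are missing
print(positive_A)
print(masked)
```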
The missing entries here have been set to np.nan (NaN stands for "Not a Number"). This is the pandas analogue of having an empty cell in an Excel spreadsheet. Operations generally skip over missing data.
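A quick check of the skipping behavior, using a three-element Series with one missing value; passing skipna=False turns the skipping off:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                 # the missing value is skipped
average = s.mean()              # mean of the two values that are present
strict = s.sum(skipna=False)    # NaN propagates when skipping is disabled
print(total, average, strict)
```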
You can write values into a single entry, or a range; this is analogous to pasting into an Excel spreadsheet (or in the case of a single value, simply typing in a new value).
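In code, this might look like the following sketch (the frame of zeros is just a blank canvas for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 3)), columns=list("ABC"))

df.at[0, "A"] = 1.5        # single entry, addressed by label
df.iat[1, 1] = 2.5         # single entry, addressed by position
df.loc[2:3, "C"] = 7.0     # a range of rows (label slice, both ends included)
print(df)
```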
It is possible to change indices in an existing DataFrame.
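One way to do this is reindex, sketched below with made-up labels. It returns a new DataFrame with the requested rows and columns; any label that was not in the original shows up as missing data.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Reorder the rows, drop "y", and ask for a new row "w" and a new column "E".
df1 = df.reindex(index=["z", "x", "w"], columns=["A", "B", "E"])
print(df1)
```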
Returning to missing data, one can handle it in various ways.
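Three common options, sketched on a small frame with two holes: drop the affected rows, fill the holes with a default value, or just locate them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df.dropna(how="any")   # keep only rows with no missing values
filled = df.fillna(value=0)      # replace each missing value with 0
mask = df.isna()                 # True exactly where a value is missing
```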
Let me skip a few sections here.
Binary operations
Statistics
Applying functions to the data (like the Python map function on a list)
Histogramming
String processing
One can combine data in various ways, such as concatenation...
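As a small illustration, here a frame is cut into three pieces by row and then glued back together with pd.concat:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["A", "B"])

pieces = [df[:2], df[2:4], df[4:]]   # split into three chunks of rows
rejoined = pd.concat(pieces)          # stack the chunks back together
print(rejoined)
```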
... or merging, as in a SQL database. (As a mathematician, I like to think of this as a "Cartesian product".)
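The Cartesian product is easiest to see when a key is repeated on both sides. In this sketch (the column names lval and rval are arbitrary), each of the two left rows with key "foo" is paired with each of the two right rows:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Join on the shared column; 2 matching rows times 2 matching rows gives 4 rows.
merged = pd.merge(left, right, on="key")
print(merged)
```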
Let's skip some more.
Appending rows
Grouping (this means splitting the data on some criteria, applying a function to each group independently, then recombining)
Reshaping
Time series (more on these when we focus on statistics)
Categoricals (values limited to a small number of options, e.g., letter grades)
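Although we are skipping these sections, the split/apply/recombine idea behind grouping is short enough to sketch. The team and score columns below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [1, 2, 3, 4]})

# Split the rows by team, sum the scores within each team,
# and recombine the results into a Series indexed by team.
totals = df.groupby("team")["score"].sum()
print(totals)
```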
Plotting uses matplotlib.
Plotting a DataFrame gives you superimposed plots.
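A sketch, assuming matplotlib is installed. The Agg backend line is only there so the example also runs outside a notebook; in Jupyter you would leave it out and the plot would appear inline.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: no display available; omit in Jupyter

import numpy as np
import pandas as pd

# Four random walks, one per column.
df = pd.DataFrame(np.random.randn(100, 4).cumsum(axis=0), columns=list("ABCD"))

ax = df.plot()          # one curve per column, all on the same axes
```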
Finally, one can move data in and out of CSV files...
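A minimal round trip might look like this; the file name foo.csv is an arbitrary choice, and the file is written to the current directory:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})

df.to_csv("foo.csv")                        # write, with row labels as the first column
back = pd.read_csv("foo.csv", index_col=0)  # read, using that first column as the index
print(back)
```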
... or Excel spreadsheets.
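The Excel version is almost identical, but pandas relies on an extra package to read and write the .xlsx format. This sketch assumes one such package (for example openpyxl) is installed and reports the problem otherwise; the file and sheet names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

try:
    # Assumption: an Excel engine such as openpyxl is available.
    df.to_excel("foo.xlsx", sheet_name="Sheet1")
    back = pd.read_excel("foo.xlsx", sheet_name="Sheet1", index_col=0)
    print(back)
except ImportError as err:
    back = None
    print("No Excel engine available:", err)
```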
After running the previous example, try switching to the file view. You can download the new files and open them in a spreadsheet program (Excel, OpenOffice, etc.).