This notebook was prepared by Donne Martin. Source and license info is on GitHub.
Pandas
Credits: The following are notes taken while working through Python for Data Analysis by Wes McKinney
Series
DataFrame
Reindexing
Dropping Entries
Indexing, Selecting, Filtering
Arithmetic and Data Alignment
Function Application and Mapping
Sorting and Ranking
Axis Indices with Duplicate Values
Summarizing and Computing Descriptive Statistics
Cleaning Data (Under Construction)
Input and Output (Under Construction)
Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index.
Create a Series:
Get the array representation of a Series:
Index objects are immutable and hold the axis labels and metadata such as names and axis names.
Get the index of the Series:
Create a Series with a custom index:
Get a value from a Series:
Get a set of values from a Series by passing in a list:
Get values greater than 0:
Scalar multiply:
Apply a NumPy math function:
A Series is like a fixed-length, ordered dict.
Create a series by passing in a dict:
Re-order a Series by passing in an index (indices not found are NaN):
Check for NaN with the pandas method:
Check for NaN with the Series method:
Series automatically aligns differently indexed data in arithmetic operations:
Name a Series:
Name a Series index:
Rename a Series' index in place:
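The Series operations above can be sketched roughly as follows. The values, labels, and variable names are illustrative assumptions, not data from the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the original notebook.
ser_1 = pd.Series([1, 1, 2, -3, -5, 8, 13])
values = ser_1.to_numpy()          # array representation
index = ser_1.index                # default integer index

# Custom index, label-based selection, filtering, and vectorized math.
ser_2 = pd.Series([1, 1, 2, -3, -5], index=["a", "b", "c", "d", "e"])
single = ser_2["a"]                # one value by label
subset = ser_2[["a", "c"]]         # several values by a list of labels
positive = ser_2[ser_2 > 0]        # boolean filter
doubled = ser_2 * 2                # scalar multiply
exp = np.exp(ser_2)                # NumPy function; the index is preserved

# Build from a dict, then pass an index; "qux" has no value, so it is NaN.
dict_1 = {"foo": 100, "bar": 200, "baz": 300}
ser_3 = pd.Series(dict_1)
ser_4 = pd.Series(dict_1, index=["foo", "bar", "baz", "qux"])
missing = pd.isnull(ser_4)         # same result as ser_4.isnull()

# Arithmetic aligns on labels, not on position.
total = ser_3 + ser_4

# Name the Series and its index, then replace the index labels in place.
ser_4.name = "foobarbazqux"
ser_4.index.name = "label"
ser_4.index = ["fo", "br", "bz", "qx"]
```

Note that `total["qux"]` is NaN because `"qux"` is present in only one of the two operands.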
DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. A DataFrame has both row and column indices and is analogous to a dict of Series that share an index. Row and column operations are treated roughly symmetrically. A column returned when indexing a DataFrame may be a view of the underlying data rather than a copy. To obtain an explicit copy, use the Series' copy method.
Create a DataFrame:
Create a DataFrame specifying a sequence of columns:
Like Series, columns that are not present in the data are NaN:
Retrieve a column by key, returning a Series:
Retrieve a column by attribute, returning a Series:
Retrieve a row by position:
Update a column by assignment:
Assign a Series to a column (note that a list or array must match the length of the DataFrame, whereas a Series is aligned on the index and any missing labels become NaN):
Assign a new column that doesn't exist to create a new column:
Delete a column:
Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
Transpose the DataFrame:
Create a DataFrame from a dict of Series:
Set the DataFrame index name:
Set the DataFrame columns name:
Return the data contained in a DataFrame as a 2D ndarray:
If the columns are different dtypes, the 2D ndarray's dtype will accommodate all of the columns:
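A minimal sketch of the DataFrame operations above, assuming made-up state, year, and population values (none of them come from the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
data_1 = {"state": ["VA", "VA", "VA", "MD", "MD"],
          "year": [2012, 2013, 2014, 2014, 2015],
          "pop": [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)

# Choose the column order; "unempl" is not in the data, so it is all NaN.
df_2 = pd.DataFrame(data_1, columns=["year", "state", "pop", "unempl"])
all_missing = df_2["unempl"].isnull().all()

states = df_2["state"]      # column by key
years = df_2.year           # column by attribute
first_row = df_2.iloc[0]    # row by position

# An array must match the frame's length; a Series is aligned on the index.
df_2["unempl"] = np.arange(5)
df_2["unempl"] = pd.Series([6.0, 6.0, 6.1], index=[2, 3, 4])

# Assigning to a new name creates a column; del removes it again.
df_2["state_dup"] = df_2["state"]
del df_2["state_dup"]

# Nested dict: outer keys become columns, inner keys become the row index.
pop = {"VA": {2013: 5.1, 2014: 5.2}, "MD": {2014: 4.0, 2015: 4.1}}
df_3 = pd.DataFrame(pop)
df_3_t = df_3.T             # transpose

df_3.index.name = "year"
df_3.columns.name = "state"
arr = df_3.to_numpy()       # 2D ndarray of floats
mixed = df_2.to_numpy()     # mixed column types give dtype=object
```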
Reindexing
Create a new object with the data conformed to a new index. Any missing values are set to NaN.
Reindexing rows returns a new frame with the specified index:
Missing values can be set to something other than NaN:
Interpolate ordered data like a time series:
Reindex columns:
Reindex rows and columns while filling rows:
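Reindex rows and columns by label (the original used .ix, which was removed in pandas 1.0; reindex is the current equivalent):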
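The reindexing steps above might look like the following sketch. The labels and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=["c", "a", "b"])
re_1 = ser.reindex(["a", "b", "c", "d"])                 # "d" becomes NaN
re_2 = ser.reindex(["a", "b", "c", "d"], fill_value=0)   # "d" becomes 0

# Forward-fill gaps in ordered data, as one might for a time series.
ts = pd.Series(["foo", "bar", "baz"], index=[0, 2, 4])
filled = ts.reindex(range(6), method="ffill")

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["a", "c", "e"], columns=["x", "y", "z"])
cols = df.reindex(columns=["x", "z", "w"])               # "w" is all NaN

# Fill the new rows first, then pick the columns.
both = (df.reindex(index=["a", "b", "c", "d"], method="ffill")
          .reindex(columns=["x", "z"]))
```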
Dropping Entries
Drop rows from a Series or DataFrame:
Drop columns from a DataFrame:
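As a short sketch with assumed labels, drop returns a new object and leaves the original unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["a", "b", "c"], columns=["x", "y", "z"])

rows_dropped = df.drop(["a", "b"])          # drop rows by label
cols_dropped = df.drop(["x", "y"], axis=1)  # drop columns by label
ser_dropped = df["x"].drop("c")             # same method on a Series
```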
Indexing, Selecting, Filtering
Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.
Select a value from a Series:
Select a slice from a Series:
Select specific values from a Series:
Select from a Series based on a filter:
Select a slice from a Series with labels (note the end point is inclusive):
Assign to a Series slice (note the end point is inclusive):
Pandas supports indexing into a DataFrame.
Select specified columns from a DataFrame:
Select a slice from a DataFrame:
Select from a DataFrame based on a filter:
Perform a scalar comparison on a DataFrame:
Perform a scalar comparison on a DataFrame, retain the values that pass the filter:
Select a slice of rows from a DataFrame (note the end point is inclusive):
Select a slice of rows from a specific column of a DataFrame:
Select rows based on an arithmetic operation on a specific column:
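The selections above can be sketched as below, using assumed labels and values. The sketch uses .loc for label slices and .iloc for positional slices to make the inclusive versus exclusive end point explicit.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(4), index=["a", "b", "c", "d"])
one = ser["b"]                    # single value by label
pos_slice = ser.iloc[1:3]         # positional slice: end point excluded
some = ser[["b", "d"]]            # specific labels
filtered = ser[ser > 1]           # boolean filter
label_slice = ser.loc["a":"c"]    # label slice: end point included

ser_2 = ser.copy()
ser_2.loc["a":"b"] = 0            # assigns to "a" and "b"

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=["r0", "r1", "r2", "r3"], columns=["x", "y", "z"])
two_cols = df[["x", "z"]]         # specified columns
first_two = df[:2]                # positional row slice
big_x = df[df["x"] > 3]           # rows passing a filter
mask = df > 5                     # element-wise comparison
kept = df[df > 5]                 # failing cells become NaN
rows = df.loc["r0":"r2"]          # label slice of rows, inclusive
rows_y = df.loc["r0":"r2", "y"]   # same rows, one column
arith = df.loc[df["x"] * 2 > 6]   # filter on arithmetic over a column
```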
Arithmetic and Data Alignment
Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:
Set a fill value instead of NaN for indices that do not overlap:
Adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:
Set a fill value instead of NaN for indices that do not overlap:
Like NumPy, pandas supports arithmetic operations between DataFrames and Series.
Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
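A minimal sketch of the alignment rules above, with assumed labels and values:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
s_sum = s1 + s2                       # "a" and "c" do not overlap: NaN
s_fill = s1.add(s2, fill_value=0)     # missing side treated as 0

df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2),
                   index=["r0", "r1"], columns=["x", "y"])
df2 = pd.DataFrame(np.ones((2, 2)),
                   index=["r1", "r2"], columns=["y", "z"])
df_sum = df1 + df2                    # only (r1, y) overlaps
df_fill = df1.add(df2, fill_value=0)  # cells missing on both sides stay NaN

# DataFrame minus Series: match on columns, broadcast down the rows.
row = df1.loc["r0"]
by_row = df1 - row

# Labels that do not match are unioned and filled with NaN.
row_2 = pd.Series([1.0, 5.0], index=["y", "w"])
unioned = df1 - row_2

# Match on the row index instead by passing axis=0 to the method form.
col = df1["x"]
by_col = df1.sub(col, axis=0)
```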
Function Application and Mapping
NumPy ufuncs (element-wise array methods) operate on pandas objects:
Apply a function on 1D arrays to each column:
Apply a function on 1D arrays to each row:
Apply a function and return a DataFrame:
Apply an element-wise Python function to a DataFrame:
Apply an element-wise Python function to a Series:
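The function-application patterns above can be sketched as follows. The data and the helper names spread, min_max, and fmt are hypothetical. The element-wise DataFrame step is written here with Series.map applied per column, since the dedicated DataFrame method has changed name across pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-1.0, 2.0], [3.0, -4.0]],
                  index=["r0", "r1"], columns=["x", "y"])

abs_df = np.abs(df)                       # NumPy ufunc on a DataFrame

def spread(arr):
    """Range of a 1D array: max minus min."""
    return arr.max() - arr.min()

by_col = df.apply(spread)                 # one result per column
by_row = df.apply(spread, axis=1)         # one result per row

def min_max(arr):
    """Return several values per column, so apply yields a DataFrame."""
    return pd.Series([arr.min(), arr.max()], index=["min", "max"])

mm = df.apply(min_max)

def fmt(value):
    """Format one scalar with two decimals."""
    return "%.2f" % value

formatted_col = df["x"].map(fmt)                  # element-wise on a Series
formatted = df.apply(lambda c: c.map(fmt))        # element-wise on each column
```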
Sorting and Ranking
Sort a Series by its index:
Sort a Series by its values:
Sort a DataFrame by its index:
Sort a DataFrame by columns in descending order:
Sort a DataFrame's values by column:
Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:
Rank a Series, breaking ties according to the order in which the values appear in the data:
Rank a Series in descending order, using the maximum rank for the group:
DataFrames can rank over rows or columns.
Rank a DataFrame over rows:
Rank a DataFrame over columns:
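A sketch of the sorting and ranking calls above, assuming a small made-up Series and DataFrame:

```python
import pandas as pd

ser = pd.Series([7, -5, 7, 4], index=["d", "a", "c", "b"])
by_index = ser.sort_index()
by_value = ser.sort_values()

df = pd.DataFrame({"x": [3, 1, 2], "y": [9, 8, 7]}, index=["c", "a", "b"])
df_by_index = df.sort_index()
df_cols_desc = df.sort_index(axis=1, ascending=False)  # columns: y, x
df_by_x = df.sort_values(by="x")

# Ties (the two 7s) share the mean of their ranks by default.
avg_rank = ser.rank()
first_rank = ser.rank(method="first")                  # ties by order seen
desc_max = ser.rank(ascending=False, method="max")     # ties take the max rank

rank_axis0 = df.rank()          # rank within each column, down the rows
rank_axis1 = df.rank(axis=1)    # rank within each row, across the columns
```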
Axis Indices with Duplicate Values
Labels do not have to be unique in Pandas:
Select Series elements:
Select DataFrame elements:
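With duplicate labels, the type of the result depends on how many entries match. A small sketch with assumed labels:

```python
import numpy as np
import pandas as pd

ser = pd.Series(range(4), index=["a", "a", "b", "c"])
unique = ser.index.is_unique      # False: "a" appears twice
dup = ser["a"]                    # duplicated label returns a Series
single = ser["b"]                 # unique label returns a scalar

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=["a", "a", "b", "c"], columns=["x", "y"])
dup_rows = df.loc["a"]            # duplicated label returns a DataFrame
one_row = df.loc["b"]             # unique label returns a Series
```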
Summarizing and Computing Descriptive Statistics
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are skipped by default; for a reduction such as the mean, the result is NaN only when the entire row or column is NA.
Sum over the rows:
Account for NaNs:
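A minimal sketch of how NaN handling affects reductions, using assumed values. Passing skipna=False makes any NaN in a row or column propagate to the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                   "y": [np.nan, np.nan, 6.0]},
                  index=["a", "b", "c"])

col_sums = df.sum()                       # NaN skipped: one sum per column
row_sums = df.sum(axis=1)                 # NaN skipped: one sum per row
strict = df.sum(axis=1, skipna=False)     # any NaN in a row gives NaN
row_means = df.mean(axis=1)               # row "b" is all NaN, so its mean is NaN
```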
Cleaning Data (Under Construction)
Replace
Drop
Concatenate
Set up a DataFrame:
Replace
Replace all occurrences of a string with another string, in place (no copy):
In a specified column, replace all occurrences of a string with another string, in place (no copy):
Drop
Drop the 'population' column and return a copy of the DataFrame:
Concatenate
Concatenate two DataFrames:
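The replace, drop, and concatenate steps above might be sketched as follows. The frame contents are hypothetical; only the 'population' column name comes from the text.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df_1 = pd.DataFrame({"state": ["VA", "VA", "MD"],
                     "population": [5.0, 5.1, 4.0],
                     "year": [2012, 2013, 2014]})

# Replace a string everywhere in the frame, modifying it in place.
df_1.replace("VA", "VIRGINIA", inplace=True)

# Replace only within the "state" column, again in place.
df_1.replace({"state": {"MD": "MARYLAND"}}, inplace=True)

# drop returns a copy; df_1 keeps its "population" column.
df_2 = df_1.drop("population", axis=1)

# Stack two frames with the same columns; ignore_index renumbers the rows.
df_3 = pd.DataFrame({"state": ["NY"], "population": [8.0], "year": [2015]})
df_4 = pd.concat([df_1, df_3], ignore_index=True)
```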
Input and Output (Under Construction)
Reading
Writing
Reading
Read data from a CSV file into a DataFrame (use sep='\t' for TSV):
Get a summary of the DataFrame:
List the first five rows of the DataFrame:
Writing
Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:
View the data directory:
total 1016
-rw-r--r-- 1 donnemartin staff 437903 Jul 7 2015 churn.csv
-rwxr-xr-x 1 donnemartin staff 72050 Jul 7 2015 confusion_matrix.png
-rw-r--r-- 1 donnemartin staff 2902 Jul 7 2015 ozone.csv
-rw-r--r-- 1 donnemartin staff 3324 Apr 1 07:18 ozone_copy.csv
drwxr-xr-x 10 donnemartin staff 340 Jul 7 2015 titanic
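A self-contained sketch of the read and write steps above. Because the data files listed here are not available, it writes a small hypothetical CSV to a temporary directory first; the file names and column names are assumptions.

```python
import os
import tempfile

import pandas as pd

# Create a throwaway CSV so the example does not depend on external files.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample.csv")
dst = os.path.join(tmp_dir, "sample_copy.csv")
pd.DataFrame({"a": [41, 36, 12], "b": [67, 72, 74]}).to_csv(src, index=False)

df = pd.read_csv(src)      # pass sep="\t" for a TSV file
df.info()                  # prints a summary of columns and dtypes
head = df.head()           # first five rows (only three exist here)

# Write a UTF-8 copy without the index or the header row.
df.to_csv(dst, encoding="utf-8", index=False, header=False)

with open(dst, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
```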