Using groupby to summarize information
Import some data from the Seaborn sample data sets, and do some exploration.
Let's work with only cars made in the USA.
Look at what possible values are in the
origin
column.Create a new
DataFrame
namedusa
with only those rows inmpg
for which the value of theorigin
column is usa.Verify there is only 1 value in the
origin
column in `usa.
Now let's specify that we want to create some groupings based on the model_year
attribute and see what is contained in the groups
property of the DataFrameGroupBy
generated by calling groupby
.
Let's see what the values associated with the key 70 in grouped.groups
are.
Remember loc
uses index values, not list positions. Let's prove that a minute. First we'll change the index values.
Now re-compute the groups based on the model_year
attribute.
This hopefully proves that the values in the groups
dictionary represent index values.
Now let's look at grouping by multiple attributes; in particular, we'll group by model_year
followed by cylinders
.
Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this
And we can get the average mpg for that specific set of vehicles by finding them using loc
, and then using the Series
method mean
to compute the average
We can get the average mpg
for every (model_year, cylinders)
combination using the DataFrameGroupBy
object's mean
method.
The index of the DateFrame
that comes back from DataFrameGroupBy.mean
is a multi-level index.
Here's an example from our book that illustrates an important idea when passing a Series
object as the field to group by. Consider the following DataFrame
named df
.
The book uses the following code to compute the average of the data1
column, grouped by the values in the key1
column.
Another way to write this is as follows:
Look carefully at the first example again, particularly focusing on what's being grouped:
Does this raise any questions in your head?
Consider the code in the following 2 cells.
What gives? The reason the original code works is because the index values are the same.
In the version above:
there's no match in key1
from the index values in df['data1']
, so we don't get any groups.
Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our mpg
dataset by using the name
column values as the index.
Now we can group by just the vehicle's make by creating a function that extracts the make from each index value.
We see some problems here:
chevroelt should be chevrolet
chevy is an abbreviation for chevrolet.
A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.
Let's pass the get_make
function as the argument to groupby
and look at the resulting groups.
Now find the average mpg
for each make based on the keys returned by the get_make
function.
Get the average miles per gallon for the 3 makes that appear the most often in our data set.
To get started, sort the result of calling the size
aggregate function on grouped_by_make
in descending order, storing the result in a variable named most_frequently_occurring
.
Now use the index values from most_frequently_occurring
to select elements from mean_mpg_by_make
.
We can also create a new DataFrame
from the Series
objects mean_mpg_by_make
and size_by_make
, sort it using sort_values
, and get the first 3 elements using iloc
.
Finally, look at summarizing by Make
and Year
. Get started by creating a multi-level index on those 2 columns.
Now we can group by a combination of the car's make and year
And find the average miles per gallon for all vehicles manufactured by Ford in 1971.