Kernel: Python 3 (system-wide)

Using groupby to summarize information

Import some data from the Seaborn sample data sets, and do some exploration.

import pandas as pd
import numpy as np
import seaborn as sns

mpg = sns.load_dataset('mpg')
mpg.head()

Let's work with only cars made in the USA.

  • Look at what possible values are in the origin column.

  • Create a new DataFrame named usa with only those rows in mpg for which the value of the origin column is usa.

  • Verify there is only 1 value in the origin column in usa.

mpg['origin'].unique()
usa = mpg[mpg.loc[:, 'origin'] == 'usa']
print(usa['origin'].unique())
print(usa.head())
['usa']
    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0
1  15.0          8         350.0       165.0    3693          11.5
2  18.0          8         318.0       150.0    3436          11.0
3  16.0          8         304.0       150.0    3433          12.0
4  17.0          8         302.0       140.0    3449          10.5

   model_year origin                       name
0          70    usa  chevrolet chevelle malibu
1          70    usa          buick skylark 320
2          70    usa         plymouth satellite
3          70    usa              amc rebel sst
4          70    usa                ford torino

Now let's create some groupings based on the model_year attribute and see what is contained in the groups property of the DataFrameGroupBy object returned by groupby.

grouped = usa.groupby('model_year')
grouped.groups
{70: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 24, 25, 26, 27, 28], 71: [30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 56], 72: [58, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 80, 83], 73: [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 109, 112, 113, 115, 116, 121, 124], 74: [125, 126, 127, 128, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 146], 75: [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 168, 169, 170, 174, 176], 76: [184, 185, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 201, 202, 206, 208, 212, 213, 214, 215], 77: [217, 219, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 234, 236, 237, 238], 78: [245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 269, 271, 272], 79: [280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 295, 296, 298, 300, 301, 302, 305, 306, 307, 308], 80: [311, 313, 314, 315, 316, 323, 336], 81: [338, 339, 340, 341, 342, 344, 350, 351, 352, 363, 364, 365, 366], 82: [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 386, 387, 388, 389, 391, 392, 393, 395, 396, 397]}

Let's see what the values associated with the key 70 in grouped.groups are.

usa.loc[grouped.groups[70]].head(20)
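An equivalent way to pull out a single group is the DataFrameGroupBy method get_group, which returns the rows for one key directly:

# get_group returns the rows belonging to the given key (model_year 70 here).
grouped.get_group(70).head(20)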

Remember that loc uses index values, not list positions. Let's prove that in a minute. First we'll change the index values.

new_index = pd.RangeIndex(100, len(usa) + 100)
usa_with_new_index = usa.set_index(new_index)
usa_with_new_index.head()
#print(len(usa))
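As a quick check of the loc/iloc distinction, the first row of the re-indexed frame can now be reached either by its label (100) or by its position (0); this is a minimal sketch:

# loc uses the new labels (starting at 100); iloc still uses positions.
usa_with_new_index.loc[100].equals(usa_with_new_index.iloc[0])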

Now re-compute the groups based on the model_year attribute.

new_index_grouped = usa_with_new_index.groupby('model_year')
new_index_grouped.groups
{70: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121], 71: [122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 72: [142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159], 73: [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188], 74: [189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203], 75: [204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 76: [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 77: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263], 78: [264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285], 79: [286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 80: [309, 310, 311, 312, 313, 314, 315], 81: [316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328], 82: [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348]}

This hopefully proves that the values in the groups dictionary represent index values.
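To confirm, we can feed those group values back into loc on the re-indexed frame (a quick check using the frames defined above):

# These labels (100-121 for model year 70) only exist in the new index.
usa_with_new_index.loc[new_index_grouped.groups[70]].head()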

Now let's look at grouping by multiple attributes; in particular, we'll group by model_year followed by cylinders.

grouped = usa.groupby(['model_year', 'cylinders'])
grouped.groups
{(70, 6): [15, 16, 17, 24], (70, 8): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25, 26, 27, 28], (71, 4): [30, 32, 46, 49, 56], (71, 6): [33, 34, 35, 36, 37, 45, 47, 48], (71, 8): [38, 39, 40, 41, 42, 43, 44], (72, 4): [58, 60, 61, 80, 83], (72, 8): [62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75], (73, 4): [109, 112], (73, 6): [97, 98, 99, 100, 101, 107, 113], (73, 8): [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 103, 104, 105, 106, 115, 116, 121, 124], (74, 4): [130, 132, 146], (74, 6): [125, 126, 127, 128, 133, 134, 135], (74, 8): [136, 137, 138, 139, 140], (75, 4): [168, 170], (75, 6): [152, 153, 154, 155, 160, 161, 162, 163, 164, 169, 174, 176], (75, 8): [156, 157, 158, 159, 165, 166], (76, 4): [184, 185, 195, 196, 206], (76, 6): [191, 192, 193, 194, 199, 200, 201, 202], (76, 8): [187, 188, 189, 190, 208, 212, 213, 214, 215], (77, 4): [217, 219, 234, 236, 237, 238], (77, 6): [225, 226, 227, 228], (77, 8): [221, 222, 223, 224, 229, 230, 231, 232], (78, 4): [245, 255, 266, 269, 271, 272], (78, 6): [252, 253, 254, 256, 257, 258, 259, 260, 261, 263], (78, 8): [249, 250, 251, 262, 264, 265], (79, 4): [282, 295, 296, 301, 302, 305, 308], (79, 6): [280, 281, 283, 284, 306, 307], (79, 8): [285, 286, 287, 288, 289, 290, 291, 292, 298, 300], (80, 4): [311, 313, 314, 315, 323, 336], (80, 6): [316], (81, 4): [338, 339, 340, 342, 344, 350, 351, 352], (81, 6): [341, 363, 365, 366], (81, 8): [364], (82, 4): [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 388, 391, 392, 393, 395, 396, 397], (82, 6): [386, 387, 389]}

Now we can get the members of the group of 6-cylinder vehicles manufactured in 1970 like this:

usa.loc[grouped.groups[(70,6)]]

And we can get the average mpg (and horsepower) for that specific set of vehicles by selecting the rows and columns with loc and then calling the mean method to compute the averages:

usa.loc[grouped.groups[(70,6)],["mpg", "horsepower"]].mean()
mpg           20.50
horsepower    91.75
dtype: float64

We can get the average mpg for every (model_year, cylinders) combination using the DataFrameGroupBy object's mean method.

means = grouped.mean()
means.head()

The index of the DataFrame that comes back from DataFrameGroupBy.mean is a multi-level index.

means.index
MultiIndex([(70, 6), (70, 8), (71, 4), (71, 6), (71, 8), (72, 4), (72, 8), (73, 4), (73, 6), (73, 8), (74, 4), (74, 6), (74, 8), (75, 4), (75, 6), (75, 8), (76, 4), (76, 6), (76, 8), (77, 4), (77, 6), (77, 8), (78, 4), (78, 6), (78, 8), (79, 4), (79, 6), (79, 8), (80, 4), (80, 6), (81, 4), (81, 6), (81, 8), (82, 4), (82, 6)], names=['model_year', 'cylinders'])
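Since the result has a MultiIndex, a single (model_year, cylinders) combination can be looked up with loc; the value below should match the 20.50 we computed above for the 6-cylinder cars from 1970:

# Look up the average mpg for (model_year=70, cylinders=6) in the MultiIndex.
means.loc[(70, 6), 'mpg']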

Here's an example from our book that illustrates an important idea when passing a Series object as the key to group by. Consider the following DataFrame named df.

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
df

The book uses the following code to compute the average of the data1 column, grouped by the values in the key1 column.

df['data1'].groupby(df['key1']).mean()
key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64
print(df['data1'])
print(df['key1'])
0    1.628402
1   -1.107394
2    0.475426
3   -1.547331
4    1.078556
Name: data1, dtype: float64
0    a
1    a
2    b
3    b
4    a
Name: key1, dtype: object

Another way to write this is as follows:

df.groupby(df['key1']).mean()['data1']
key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64
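A third, equivalent spelling groups the whole DataFrame by the column label and selects data1 from the grouped object before aggregating:

# Group by the 'key1' column label, then aggregate just the 'data1' column.
df.groupby('key1')['data1'].mean()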

Look carefully at the first example again, particularly focusing on what's being grouped:

df['data1'].groupby(df['key1']).mean()

Does this raise any questions in your head?

Consider the code in the following 2 cells.

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])
print(df.head())
print()
print(key1)
  key1 key2     data1     data2
0    a  one  1.628402  0.630337
1    a  two -1.107394  0.022734
2    b  one  0.475426 -0.792190
3    b  two -1.547331 -1.789632
4    a  one  1.078556 -1.490519

5    a
0    b
7    b
8    a
9    a
dtype: object
grouped_by_key1 = df['data1'].groupby(key1)
grouped_by_key1.groups
{'b': [0]}

What gives? The original code works because df['key1'] and df['data1'] share the same index values.

df['data1'].groupby(df['key1']).mean()

In the version above:

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])
grouped_by_key1 = df['data1'].groupby(key1)

only the index value 0 appears in both key1 and df['data1'], so we end up with a single group, 'b' (the value of key1 at index label 0), containing just that one row.
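As a contrast, here is a minimal sketch (key1_aligned is a hypothetical name): if the key Series carries the same index labels as df, every row lands in a group.

# key1_aligned shares df's index (0 through 4), so all five rows are grouped.
key1_aligned = pd.Series(['a', 'b', 'b', 'a', 'a'], index=df.index)
df['data1'].groupby(key1_aligned).groups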

Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our usa DataFrame by using the name column values as the index.

usa_model_index = usa.set_index('name')
usa_model_index.head()

Now we can group by just the vehicle's make by passing groupby a callable that extracts the make from each index value.

# A first version of get_make; the lambda used below does the same thing.
def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    return parts[0]

grouped_by_make = usa_model_index.groupby(lambda name: name.strip().split()[0])
grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevroelt', 'chevrolet', 'chevy', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

We see some problems here:

  • chevroelt should be chevrolet

  • chevy is an abbreviation for chevrolet.

A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.

def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    if parts[0] in ["chevy", "chevrolet", "chevroelt"]:
        return "chevrolet"
    else:
        return parts[0]

Let's pass the get_make function as the argument to groupby and look at the resulting groups.

grouped_by_make = usa_model_index.groupby(get_make)
grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevrolet', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

Now find the average mpg for each make based on the keys returned by the get_make function.

print(grouped_by_make.mean()['mpg'])
print(grouped_by_make['mpg'].mean())
name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64
name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64

Get the average miles per gallon for the 3 makes that appear the most often in our data set.

To get started, sort the result of calling the size aggregate function on grouped_by_make in descending order, storing the result in a variable named most_frequently_occurring.

size_by_make = grouped_by_make.size()
most_frequently_occurring = size_by_make.sort_values(ascending=False)
makes_of_most_frequently_occurring = most_frequently_occurring.index[0:3]
makes_of_most_frequently_occurring
Index(['ford', 'chevrolet', 'plymouth'], dtype='object', name='name')

Now use the first three index values from most_frequently_occurring (stored in makes_of_most_frequently_occurring) to select elements from mean_mpg_by_make.

mean_mpg_by_make = grouped_by_make.mean()['mpg']
mean_mpg_by_make.loc[makes_of_most_frequently_occurring]
name
ford         19.694118
chevrolet    20.219149
plymouth     21.703226
Name: mpg, dtype: float64

We can also create a new DataFrame from the Series objects mean_mpg_by_make and size_by_make, sort it using sort_values, and get the first 3 elements using iloc.

summary = pd.DataFrame(
    data={
        "Avg. MPG": mean_mpg_by_make,
        "Num. Vehicles": size_by_make
    }
)
summary.sort_values('Num. Vehicles', ascending=False).iloc[0:3]
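If you prefer, DataFrame.nlargest is a shortcut for the sort-then-slice pattern above:

# Take the 3 rows with the largest 'Num. Vehicles' values.
summary.nlargest(3, 'Num. Vehicles')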

Finally, let's look at summarizing by make and year. Get started by creating a multi-level index on the name and model_year columns.

usa_model_year_idx = usa.set_index(['name', 'model_year'])
usa_model_year_idx.head()

Now we can group by a combination of the car's make and model year, and find the average miles per gallon for all vehicles manufactured by Ford in 1971. One possible approach is sketched below.
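This is a sketch, not the only way to do it: build one grouping key per index level, reusing the get_make function defined above for the name level, group by both keys, and then look up ('ford', 71) in the result.

# Extract each level of the MultiIndex; map the name level through get_make.
make_key = usa_model_year_idx.index.get_level_values('name').map(get_make)
year_key = usa_model_year_idx.index.get_level_values('model_year')

by_make_year = usa_model_year_idx.groupby([make_key, year_key])

# Average mpg for Ford vehicles with model_year 71.
by_make_year['mpg'].mean().loc[('ford', 71)]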