Kernel: Python 3 (system-wide)

Using groupby to summarize information

Import some data from the Seaborn sample data sets, and do some exploration.

import pandas as pd
import numpy as np
import seaborn as sns

mpg = sns.load_dataset('mpg')
mpg.head()

Let's work with only cars made in the USA.

  • Look at what possible values are in the origin column.

  • Create a new DataFrame named usa with only those rows in mpg for which the value of the origin column is usa.

  • Verify there is only 1 value in the origin column in usa.

mpg['origin'].unique()
usa = mpg[mpg.loc[:, 'origin'] == 'usa']
print(usa['origin'].unique())
print(usa.head())
['usa']
    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0
1  15.0          8         350.0       165.0    3693          11.5
2  18.0          8         318.0       150.0    3436          11.0
3  16.0          8         304.0       150.0    3433          12.0
4  17.0          8         302.0       140.0    3449          10.5

   model_year origin                       name
0          70    usa  chevrolet chevelle malibu
1          70    usa          buick skylark 320
2          70    usa         plymouth satellite
3          70    usa              amc rebel sst
4          70    usa                ford torino

Now let's create some groupings based on the model_year attribute and see what is contained in the groups property of the DataFrameGroupBy object returned by groupby.

grouped = usa.groupby('model_year')
grouped.groups
{70: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 24, 25, 26, 27, 28], 71: [30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 56], 72: [58, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 80, 83], 73: [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 109, 112, 113, 115, 116, 121, 124], 74: [125, 126, 127, 128, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 146], 75: [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 168, 169, 170, 174, 176], 76: [184, 185, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 201, 202, 206, 208, 212, 213, 214, 215], 77: [217, 219, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 234, 236, 237, 238], 78: [245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 269, 271, 272], 79: [280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 295, 296, 298, 300, 301, 302, 305, 306, 307, 308], 80: [311, 313, 314, 315, 316, 323, 336], 81: [338, 339, 340, 341, 342, 344, 350, 351, 352, 363, 364, 365, 366], 82: [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 386, 387, 388, 389, 391, 392, 393, 395, 396, 397]}

Let's see what the values associated with the key 70 in grouped.groups are.

usa.loc[grouped.groups[70]].head(20)
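An equivalent way to pull out a single group is the DataFrameGroupBy method get_group, which returns the rows for one key directly:

# get_group returns the rows belonging to the given key (model_year 70 here).
grouped.get_group(70).head(20)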

Remember that loc uses index values, not list positions. Let's prove that in a minute. First we'll change the index values.

new_index = pd.RangeIndex(100, len(usa) + 100)
usa_with_new_index = usa.set_index(new_index)
usa_with_new_index.head()
#print(len(usa))
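As a quick check of the loc/iloc distinction, the first row of the re-indexed frame can now be reached either by its label (100) or by its position (0); this is a minimal sketch:

# loc uses the new labels (starting at 100); iloc still uses positions.
usa_with_new_index.loc[100].equals(usa_with_new_index.iloc[0])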

Now re-compute the groups based on the model_year attribute.

new_index_grouped = usa_with_new_index.groupby('model_year')
new_index_grouped.groups
{70: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121], 71: [122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 72: [142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159], 73: [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188], 74: [189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203], 75: [204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 76: [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 77: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263], 78: [264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285], 79: [286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 80: [309, 310, 311, 312, 313, 314, 315], 81: [316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328], 82: [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348]}

This hopefully proves that the values in the groups dictionary represent index values.
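To confirm, we can feed those group values back into loc on the re-indexed frame (a quick check using the frames defined above):

# These labels (100-121 for model year 70) only exist in the new index.
usa_with_new_index.loc[new_index_grouped.groups[70]].head()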

Now let's look at grouping by multiple attributes; in particular, we'll group by model_year followed by cylinders.

grouped = usa.groupby(['model_year', 'cylinders'])
grouped.groups
{(70, 6): [15, 16, 17, 24], (70, 8): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25, 26, 27, 28], (71, 4): [30, 32, 46, 49, 56], (71, 6): [33, 34, 35, 36, 37, 45, 47, 48], (71, 8): [38, 39, 40, 41, 42, 43, 44], (72, 4): [58, 60, 61, 80, 83], (72, 8): [62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75], (73, 4): [109, 112], (73, 6): [97, 98, 99, 100, 101, 107, 113], (73, 8): [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 103, 104, 105, 106, 115, 116, 121, 124], (74, 4): [130, 132, 146], (74, 6): [125, 126, 127, 128, 133, 134, 135], (74, 8): [136, 137, 138, 139, 140], (75, 4): [168, 170], (75, 6): [152, 153, 154, 155, 160, 161, 162, 163, 164, 169, 174, 176], (75, 8): [156, 157, 158, 159, 165, 166], (76, 4): [184, 185, 195, 196, 206], (76, 6): [191, 192, 193, 194, 199, 200, 201, 202], (76, 8): [187, 188, 189, 190, 208, 212, 213, 214, 215], (77, 4): [217, 219, 234, 236, 237, 238], (77, 6): [225, 226, 227, 228], (77, 8): [221, 222, 223, 224, 229, 230, 231, 232], (78, 4): [245, 255, 266, 269, 271, 272], (78, 6): [252, 253, 254, 256, 257, 258, 259, 260, 261, 263], (78, 8): [249, 250, 251, 262, 264, 265], (79, 4): [282, 295, 296, 301, 302, 305, 308], (79, 6): [280, 281, 283, 284, 306, 307], (79, 8): [285, 286, 287, 288, 289, 290, 291, 292, 298, 300], (80, 4): [311, 313, 314, 315, 323, 336], (80, 6): [316], (81, 4): [338, 339, 340, 342, 344, 350, 351, 352], (81, 6): [341, 363, 365, 366], (81, 8): [364], (82, 4): [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 388, 391, 392, 393, 395, 396, 397], (82, 6): [386, 387, 389]}

Now we can get the members of the group of 6-cylinder vehicles manufactured in 1970 like this:

usa.loc[grouped.groups[(70,6)]]

And we can get the average mpg (and horsepower) for that specific set of vehicles by selecting the rows and columns with loc and then calling the mean method to compute the averages:

usa.loc[grouped.groups[(70,6)],["mpg", "horsepower"]].mean()
mpg           20.50
horsepower    91.75
dtype: float64

We can get the average mpg for every (model_year, cylinders) combination using the DataFrameGroupBy object's mean method.

means = grouped.mean()
means.head()

The index of the DataFrame that comes back from DataFrameGroupBy.mean is a multi-level index.

means.index
MultiIndex([(70, 6), (70, 8), (71, 4), (71, 6), (71, 8), (72, 4), (72, 8), (73, 4), (73, 6), (73, 8), (74, 4), (74, 6), (74, 8), (75, 4), (75, 6), (75, 8), (76, 4), (76, 6), (76, 8), (77, 4), (77, 6), (77, 8), (78, 4), (78, 6), (78, 8), (79, 4), (79, 6), (79, 8), (80, 4), (80, 6), (81, 4), (81, 6), (81, 8), (82, 4), (82, 6)], names=['model_year', 'cylinders'])
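Since the result has a MultiIndex, a single (model_year, cylinders) combination can be looked up with loc; the value below should match the 20.50 we computed above for the 6-cylinder cars from 1970:

# Look up the average mpg for (model_year=70, cylinders=6) in the MultiIndex.
means.loc[(70, 6), 'mpg']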

Here's an example from our book that illustrates an important idea when passing a Series object as the key to group by. Consider the following DataFrame named df.

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
df

The book uses the following code to compute the average of the data1 column, grouped by the values in the key1 column.

df['data1'].groupby(df['key1']).mean()
key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64
print(df['data1'])
print(df['key1'])
0    1.628402
1   -1.107394
2    0.475426
3   -1.547331
4    1.078556
Name: data1, dtype: float64
0    a
1    a
2    b
3    b
4    a
Name: key1, dtype: object

Another way to write this is as follows:

df.groupby(df['key1']).mean()['data1']
key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64
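A third, equivalent spelling groups the whole DataFrame by the column label and selects data1 from the grouped object before aggregating:

# Group by the 'key1' column label, then aggregate just the 'data1' column.
df.groupby('key1')['data1'].mean()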

Look carefully at the first example again, particularly focusing on what's being grouped:

df['data1'].groupby(df['key1']).mean()

Does this raise any questions in your head?

Consider the code in the following 2 cells.

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])
print(df.head())
print()
print(key1)
  key1 key2     data1     data2
0    a  one  1.628402  0.630337
1    a  two -1.107394  0.022734
2    b  one  0.475426 -0.792190
3    b  two -1.547331 -1.789632
4    a  one  1.078556 -1.490519

5    a
0    b
7    b
8    a
9    a
dtype: object
grouped_by_key1 = df['data1'].groupby(key1)
grouped_by_key1.groups
{'b': [0]}

What gives? The original code works because df['key1'] and df['data1'] share the same index values.

df['data1'].groupby(df['key1']).mean()

In the version above:

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])
grouped_by_key1 = df['data1'].groupby(key1)

only the index value 0 appears in both key1 and df['data1'], so we end up with a single group, 'b' (the value of key1 at index label 0), containing just that one row.
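As a contrast, here is a minimal sketch (key1_aligned is a hypothetical name): if the key Series carries the same index labels as df, every row lands in a group.

# key1_aligned shares df's index (0 through 4), so all five rows are grouped.
key1_aligned = pd.Series(['a', 'b', 'b', 'a', 'a'], index=df.index)
df['data1'].groupby(key1_aligned).groups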

Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our usa DataFrame by using the name column values as the index.

usa_model_index = usa.set_index('name')
usa_model_index.head()

Now we can group by just the vehicle's make by passing groupby a callable that extracts the make from each index value.

# A first version of get_make; the lambda used below does the same thing.
def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    return parts[0]

grouped_by_make = usa_model_index.groupby(lambda name: name.strip().split()[0])
grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevroelt', 'chevrolet', 'chevy', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

We see some problems here:

  • chevroelt should be chevrolet

  • chevy is an abbreviation for chevrolet.

A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.

def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    if parts[0] in ["chevy", "chevrolet", "chevroelt"]:
        return "chevrolet"
    else:
        return parts[0]

Let's pass the get_make function as the argument to groupby and look at the resulting groups.

grouped_by_make = usa_model_index.groupby(get_make)
grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevrolet', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

Now find the average mpg for each make based on the keys returned by the get_make function.

print(grouped_by_make.mean()['mpg'])
print(grouped_by_make['mpg'].mean())
name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64
name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64

Get the average miles per gallon for the 3 makes that appear the most often in our data set.

To get started, sort the result of calling the size aggregate function on grouped_by_make in descending order, storing the result in a variable named most_frequently_occurring.

size_by_make = grouped_by_make.size()
most_frequently_occurring = size_by_make.sort_values(ascending=False)
makes_of_most_frequently_occurring = most_frequently_occurring.index[0:3]
makes_of_most_frequently_occurring
Index(['ford', 'chevrolet', 'plymouth'], dtype='object', name='name')

Now use the first three index values from most_frequently_occurring (stored in makes_of_most_frequently_occurring) to select elements from mean_mpg_by_make.

mean_mpg_by_make = grouped_by_make.mean()['mpg']
mean_mpg_by_make.loc[makes_of_most_frequently_occurring]
name
ford         19.694118
chevrolet    20.219149
plymouth     21.703226
Name: mpg, dtype: float64

We can also create a new DataFrame from the Series objects mean_mpg_by_make and size_by_make, sort it using sort_values, and get the first 3 elements using iloc.

summary = pd.DataFrame(
    data={
        "Avg. MPG": mean_mpg_by_make,
        "Num. Vehicles": size_by_make
    }
)
summary.sort_values('Num. Vehicles', ascending=False).iloc[0:3]
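If you prefer, DataFrame.nlargest is a shortcut for the sort-then-slice pattern above:

# Take the 3 rows with the largest 'Num. Vehicles' values.
summary.nlargest(3, 'Num. Vehicles')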

Finally, let's look at summarizing by make and year. Get started by creating a multi-level index on the name and model_year columns.

usa_model_year_idx = usa.set_index(['name', 'model_year'])
usa_model_year_idx.head()

Now we can group by a combination of the car's make and model year, and find the average miles per gallon for all vehicles manufactured by Ford in 1971. One possible approach is sketched below.
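This is a sketch, not the only way to do it: build one grouping key per index level, reusing the get_make function defined above for the name level, group by both keys, and then look up ('ford', 71) in the result.

# Extract each level of the MultiIndex; map the name level through get_make.
make_key = usa_model_year_idx.index.get_level_values('name').map(get_make)
year_key = usa_model_year_idx.index.get_level_values('model_year')

by_make_year = usa_model_year_idx.groupby([make_key, year_key])

# Average mpg for Ford vehicles with model_year 71.
by_make_year['mpg'].mean().loc[('ford', 71)]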