9. Pivot Tables

2025. 6. 19. 00:16Python/Pandas

  • The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
  • Pivot tables as essentially a multidimensional version of groupby aggregation.
  • That is, you split-apply-combine, but both the split and the combine happen accross not a one-dimensional index, but across a two-dimensional grid.

Motivating Pivot Tables

  • We'll use the database of passengers on the Titanic, available through the Seaborn library.
# In[1]
titanic=sns.load_dataset('titanic')
titanic.head()
# Out[1]
  survived    pclass       sex     age    sibsp    parch       fare    embarked    class      who    adult_male    deck    embark_town    alive    alone
0         0         3      male    22.0        1        0     7.2500           S    Third      man          True     NaN    Southampton       no    False
1         1         1    female    38.0        1        0    71.2833           C    First    woman         False       C      Cherbourg      yes    False
2         1         3    female    26.0        0        0     7.9250           S    Third    woman         False     NaN    Southampton      yes     True
3         1         1    female    35.0        1        0    53.1000           S    First    woman         False       C    Southampton      yes    False
4         0         3      male    35.0        0        0     8.0500           S    Third      man          True     NaN    Southampton       no     True

Pivot Tables by Hand

  • We might begin by grouping according to sex, survival status, or combination thereof.
# In[2]
titanic.groupby('sex')[['survived']].mean() # make a DataFrame
"""titanic.groupby('sex')['survived'].mean() return Series"""
# Out[2]
        survived
sex    
female    0.742038
male    0.188908

# In[3]
titanic.groupby(['sex','class'])['survived'].aggregate('mean').unstack()
# Out[3]
class    First        Second        Third
sex            
female    0.968085    0.921053    0.500000
male    0.368852    0.157407    0.135447
  • The above code gives us a better idea of how both sex and class affected survival, but the code is starting to look a bit garbled(=confused).
  • The long string of code is not particularly easy to read or use.
  • This two-dimensional groupby is common enough that Pandas includes a convenience routine, pivot_table, which succinctly(간단명료한) handles this type of multidimensional aggregation.

Pivot Table Syntax

  • Here is the equivalent to the preceding operation using the DataFrame.pivot_table method
# In[4]
titanic.pivot_table('survived',index='sex',columns='class',aggfunc='mean')
# Out[4]
class    First        Second        Third
sex            
female    0.968085    0.921053    0.500000
male    0.368852    0.157407    0.135447

Multilevel Pivot Tables

  • Just as in a groupby, the grouping in pivot tables can be specified with multiple levels and via a number of options.
  • We'll bin the age using the pd.cut function.
# In[5]
age=pd.cut(titanic['age'],[0,18,80])
titanic.pivot_table('survived',['sex',age],'class')
# Out[5]
class            First        Second        Third
sex       age            
female    (0, 18]    0.909091    1.000000    0.511628
       (18, 80]    0.972973    0.900000    0.423729
male    (0, 18]    0.800000    0.600000    0.215686
       (18, 80]    0.375000    0.071429    0.133663
  • We can apply the same strategy when working with the columns as well; let's add info on the fare paid, using pd.qcut to automatically compute quantiles.
# In[6]
fare=pd.qcut(titanic['fare'],2)
titanic.pivot_table('survived',['sex',age],[fare,'class'])
# Out[6]
fare           (-0.001, 14.454]                      (14.454, 512.329]
class           First      Second       Third       First      Second       Third
sex        age                        
female    (0, 18]     NaN    1.000000    0.714286    0.909091    1.000000    0.318182
       (18, 80]     NaN    0.880000    0.444444    0.972973    0.914286    0.391304
male    (0, 18]     NaN    0.000000    0.260870    0.800000    0.818182    0.178571
       (18, 80]     0.0    0.098039    0.125000    0.391304    0.030303    0.192308
  • The result is a four-dimensional aggregation with hierarchical indices, shown in a grid demonstrating the relationship between the values.

If you don't know about the pd.cut and pd.qcut, please reference this url :
Difference of pd.cut() and pd.qcut()

Additional Pivot Table Options

  • The full call signature of the DataFrame.pivot_table method is as follows.
DataFrame.pivot_table(data, values=None, index=None, columns=None, 
aggfunc='mean', fill_value=None, margins=False, dropna=True, 
margins_name='All', observed=False, sort=True)
  • fill_value and dropna have to do with missing data and are fairly straightforward.

  • The aggfunc keyword controls what type of aggregation is applied, which is mean by default.
  • As with groupby, the aggregation specification can be a string representing one of several common choices like sum , mean, count etc.
  • Or a function that implements an aggregation like np.sum(), min(), sum() etc.
  • Additionally, it can be specified as a dictionary mapping a column to any of the desired options.
# In[7]
titanic.pivot_table(index='sex',columns='class',
aggfunc={'survived':sum,'fare':'mean'})
# Out[7]
        fare                                survived
class    First        Second        Third        First    Second    Third
sex                        
female    106.125798    21.970121    16.118810    91        70        72
male     67.226127    19.741782    12.661633    45        17        47
  • We've omitted the values keyword here; when specifying a mapping for aggfunc, this is determined automatically.
  • At times it's useful to compute totals along each grouping. This can be done via the margins keyword.
# In[8]
titanic.pivot_table('survived',index='sex',columns='class',margins=True)
# Out[8]
class    First        Second        Third        All
sex                
female    0.968085    0.921053    0.500000    0.742038
male    0.368852    0.157407    0.135447    0.188908
All        0.629630    0.472826    0.242363    0.383838
  • The margin label can be specified with the margins_name keyword; it defaults to "All"

'Python > Pandas' 카테고리의 다른 글

11. Working with Time Series  (0) 2025.06.19
10. Vectorized String Operations  (0) 2025.06.19
8. Aggregation and Grouping  (0) 2025.06.19
7. Combining Datasets: merge and join  (0) 2025.06.19
6. Combining Datasets: concat and append  (0) 2025.06.19