5. Hierarchical Indexing

2025. 6. 19. 00:06ㆍPython/Pandas

A Multiply Indexed Series

Bad Way

# In[1]
import numpy as np
import pandas as pd

index=[('California',2010),('California',2020),('New York',2010),
       ('New York',2020),('Texas',2010),('Texas',2020)]
populations=[37253956,39538223,19378102,20201249,25145561,29145505]
pop=pd.Series(populations,index=index)
pop

# Out[1]
(California, 2010)    37253956
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
(Texas, 2020)         29145505
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this tuple index.

# In[2]
pop[('California',2020):('Texas',2010)]

# Out[2]
(California, 2020)    39538223
(New York, 2010)      19378102
(New York, 2020)      20201249
(Texas, 2010)         25145561
dtype: int64

But the convenience ends there. If you need anything more specific, such as selecting every value for a single year, you'll need to do some messy munging to make it happen.
It works, but it is not as clean as the slicing syntax we've learned in Pandas.
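
As a rough sketch of what that munging looks like, the snippet below rebuilds the tuple-keyed Series and picks out the 2020 rows by testing every key by hand. The names pop_tuples and mask_2020 are made up for this illustration.

```python
import pandas as pd

# Same tuple-keyed Series as in In[1]
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
pop_tuples = pd.Series(populations, index=index)

# Build a boolean mask by inspecting the second element of each tuple key
mask_2020 = [key[1] == 2020 for key in pop_tuples.index]
pop_2020 = pop_tuples[mask_2020]
print(pop_2020)
```

The result is correct, but the year has to be dug out of each tuple in a Python-level loop, which is the part a MultiIndex removes.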

Better Way: Pandas MultiIndex

We can create a multi-index from the tuples.

# In[3]
index=pd.MultiIndex.from_tuples(index)

The MultiIndex represents multiple levels of indexing as well as multiple labels for each data point which encode these levels.
If we reindex our series with MultiIndex, we see the hierarchical representation of the data.

# In[4]
pop=pop.reindex(index)
pop

# Out[4]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
We can also use Pandas slicing notation, for example to select all entries whose second index is 2020.

# In[5]
pop[:,2020]

# Out[5]
California    39538223
New York      20201249
Texas         29145505
dtype: int64

MultiIndex as Extra Dimension

The unstack method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.

# In[6]
pop_df=pop.unstack()
pop_df

# Out[6]
                2010      2020
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505

The stack method provides the opposite operation.

# In[7]
pop_df.stack()

# Out[7]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame.
Each extra level in a multi-index represents an extra dimension of data.
We might want to add another column of data for each state and year (here, the population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame.

# In[8]
pop_df=pd.DataFrame({'total':pop,'under18':[9284094,8898092,4318033,4181528,6879014,7432474]})
pop_df

# Out[8]
                    total  under18
California 2010  37253956  9284094
           2020  39538223  8898092
New York   2010  19378102  4318033
           2020  20201249  4181528
Texas      2010  25145561  6879014
           2020  29145505  7432474

In addition, all the ufuncs and other functionality work with hierarchical indices as well.

# In[9]
f_u18=pop_df['under18']/pop_df['total']
f_u18.unstack()

# Out[9]
                2010      2020
California  0.249211  0.225050
New York    0.222831  0.206994
Texas       0.273568  0.255013
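
To illustrate the point about ufuncs, here is a small sketch that applies a NumPy ufunc directly to a multiply indexed Series. The variable names (log_pop, in_millions) are only for this example; the data is the same population Series used above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]]
)
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index,
)

# A ufunc works element-wise and the result keeps the same MultiIndex
log_pop = np.log10(pop)

# Ordinary arithmetic behaves the same way
in_millions = pop / 1e6

print(log_pop)
print(in_millions)
```

Both results carry the original two-level index, so they can be unstacked or sliced exactly like pop.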

For more information about the stack and unstack methods, see the pandas documentation.

Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor.

# In[10]
df=pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
df

# Out[10]
        data1     data2
a 1  0.627660  0.158404
  2  0.181580  0.043981
b 1  0.297599  0.338398
  2  0.592384  0.886842

If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.

# In[11]
data={('California',2010):37253956,('California',2020):39538223,('New York',2010):19378102,('New York',2020):20201249,('Texas',2010):25145561,('Texas',2020):29145505}
pd.Series(data)

# Out[11]
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

Explicit MultiIndex Constructors

For more flexibility in how the index is constructed, you can instead use the constructor methods available in the pd.MultiIndex class.
You can construct a MultiIndex from a simple list of arrays giving the index values within each level.

# In[12]
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])

# Out[12]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point.

# In[13]
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])

# Out[13]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a Cartesian product of single indices.

# In[14]
pd.MultiIndex.from_product([['a','b'],[1,2]])

# Out[14]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Similarly, you can construct a MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists that reference these labels).

# In[15]
pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])

# Out[15]
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
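
To make the levels/codes encoding a bit more concrete, this sketch builds an index with from_product, inspects its levels and codes attributes, and then rebuilds an identical index from them. The names mi and rebuilt are just for illustration.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# levels: the distinct values available at each level
print(mi.levels)

# codes: for each row, the position of its label inside the matching level
print(mi.codes)

# Passing both back to the constructor reproduces the same index
rebuilt = pd.MultiIndex(levels=mi.levels, codes=mi.codes)
print(rebuilt.equals(mi))
```

Each row's label is levels[i][codes[i][row]], which is why the codes for the first level are 0, 0, 1, 1 and for the second level 0, 1, 0, 1.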

MultiIndex Level Names

Sometimes it is convenient to name the levels of the MultiIndex.
This can be accomplished by passing the names argument to any of the previously discussed MultiIndex constructors, or by setting the names attribute of the index.

# In[16]
pop.index.names=['state','year']
pop

# Out[16]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
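
The other route mentioned above, passing names to a constructor, might look like the following sketch. The variables named_index, named_pop, and years are hypothetical names chosen for this example.

```python
import pandas as pd

# Name the levels at construction time instead of assigning .names afterwards
named_index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'],
)
named_pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=named_index,
)

# Named levels can then be referred to by name rather than by position
years = named_pop.index.get_level_values('year')
print(list(named_pop.index.names))
print(list(years))
```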

MultiIndex for Columns

# In[17]
# hierarchical indices and columns
index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns=pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']],names=['subject','type'])

# mock some data
data=np.round(np.random.randn(4,6),1)
data[:, ::2]*=10
data+=37

# create the DataFrame
health_data=pd.DataFrame(data,index=index,columns=columns)
health_data

# Out[17]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8

This is fundamentally four-dimensional data: the dimensions are the subject, the measurement type, the year, and the visit number.
We can index the top-level column by the person's name and get a full DataFrame containing just that person's information.

# In[18]
health_data['Guido']

# Out[18]
type          HR  Temp
year visit
2013 1      36.0  36.0
     2      29.0  35.8
2014 1      34.0  37.2
     2      43.0  36.0
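
One way to check the "four-dimensional" claim is to count index levels on both axes. This is a minimal sketch with freshly generated mock values (a seeded generator, so the numbers differ from Out[17]); health and n_dims are names invented here.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Mock data only; the seed just makes the example repeatable
rng = np.random.default_rng(0)
health = pd.DataFrame(np.round(rng.standard_normal((4, 6)), 1),
                      index=index, columns=columns)

# Two row levels plus two column levels give four dimensions in total
n_dims = health.index.nlevels + health.columns.nlevels
print(n_dims)

# Selecting one subject drops the 'subject' level and leaves 'type' as columns
guido = health['Guido']
print(guido.columns.name, guido.shape)
```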

Indexing and Slicing a MultiIndex

Multiply Indexed Series

# In[19]
pop

# Out[19]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

We can access single elements by indexing with multiple terms.

# In[20]
pop['California',2010]

# Out[20]
37253956

The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.

# In[21]
pop['California']

# Out[21]
year
2010    37253956
2020    39538223
dtype: int64

Partial slicing is available as well, as long as the MultiIndex is sorted.

# In[22]
pop.loc['California':'New York']

# Out[22]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
dtype: int64

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index.

# In[23]
pop[:,2010]

# Out[23]
state
California    37253956
New York      19378102
Texas         25145561
dtype: int64

Other types of indexing and selection work as well, such as selection based on Boolean masks and selection based on fancy indexing.

# In[24]
pop[pop>22000000]

# Out[24]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

# In[25]
pop[['California','Texas']]

# Out[25]
state       year
California  2010    37253956
            2020    39538223
Texas       2010    25145561
            2020    29145505
dtype: int64

Multiply Indexed DataFrames

# In[26]
health_data

# Out[26]
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      41.0  37.2  36.0  36.0  43.0  37.9
     2      50.0  35.8  29.0  35.8  41.0  37.8
2014 1      34.0  36.5  34.0  37.2  58.0  38.1
     2      40.0  37.0  43.0  36.0  21.0  38.8

The syntax used for multiply indexed Series applies to the columns.

# In[27]
health_data['Guido','HR']

# Out[27]
year  visit
2013  1        36.0
      2        29.0
2014  1        34.0
      2        43.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers (the older ix indexer has been removed from pandas).

# In[28]
health_data.iloc[:2,:2]

# Out[28]
subject      Bob
type          HR  Temp
year visit
2013 1      41.0  37.2
     2      50.0  35.8

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc and iloc can be passed a tuple of multiple indices.

# In[29]
health_data.loc[:,('Bob','HR')]

# Out[29]
year  visit
2013  1        41.0
      2        50.0
2014  1        34.0
      2        40.0
Name: (Bob, HR), dtype: float64
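
Tuples can be passed for the rows and the columns at the same time. As a small sketch, the snippet below rebuilds health_data from the values printed in Out[17] (so the numbers are fixed rather than random) and pulls out single cells; bob_hr_first and sue_temp_last are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Values copied from Out[17]
values = [[41.0, 37.2, 36.0, 36.0, 43.0, 37.9],
          [50.0, 35.8, 29.0, 35.8, 41.0, 37.8],
          [34.0, 36.5, 34.0, 37.2, 58.0, 38.1],
          [40.0, 37.0, 43.0, 36.0, 21.0, 38.8]]
health_data = pd.DataFrame(values, index=index, columns=columns)

# (year, visit) tuple for the row, (subject, type) tuple for the column
bob_hr_first = health_data.loc[(2013, 1), ('Bob', 'HR')]
sue_temp_last = health_data.loc[(2014, 2), ('Sue', 'Temp')]
print(bob_hr_first, sue_temp_last)
```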

Working with slices within these index tuples is not especially convenient.
If you try to create a slice within a tuple, it will lead to a syntax error.
You could get around this by building the desired slice explicitly using Python's built-in slice function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.

# In[30]
idx=pd.IndexSlice
health_data.loc[idx[:,1],idx[:,'HR']]

# Out[30]
subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      41.0  36.0  43.0
2014 1      34.0  34.0  58.0
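
For comparison, here is a sketch of the built-in slice alternative next to IndexSlice. It assumes the same health_data values as Out[17], rebuilt here so the block runs on its own; via_idx and via_slice are names chosen for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = [[41.0, 37.2, 36.0, 36.0, 43.0, 37.9],
          [50.0, 35.8, 29.0, 35.8, 41.0, 37.8],
          [34.0, 36.5, 34.0, 37.2, 58.0, 38.1],
          [40.0, 37.0, 43.0, 36.0, 21.0, 38.8]]
health_data = pd.DataFrame(values, index=index, columns=columns)

# IndexSlice lets us write ':' inside the tuple
idx = pd.IndexSlice
via_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]

# The same selection spelled out with slice(None) in place of ':'
via_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

print(via_slice)
print(via_idx.equals(via_slice))
```

Both forms select the first visit of each year and the HR column of each subject; IndexSlice is mostly a readability win.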

Rearranging Multi-Indexes

Sorted and Unsorted Indices

Many of the MultiIndex slicing operations will fail if the index is not sorted.

# In[31]
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)
data.index.names=['char','int']
data

# Out[31]
char  int
a     1      0.601307
      2      0.623240
c     1      0.194030
      2      0.969886
b     1      0.931100
      2      0.700467
dtype: float64

You can't take a partial slice of this index: an attempt such as data['a':'b'] raises an error.
For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (lexicographical) order.
Pandas provides the sort_index method for this purpose (the older sortlevel method has been removed from pandas).

# In[32]
data=data.sort_index()
data

# Out[32]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
c     1      0.194030
      2      0.969886
dtype: float64

With the index sorted in this way, partial slicing will work as expected.

# In[33]
data['a':'b']

# Out[33]
char  int
a     1      0.601307
      2      0.623240
b     1      0.931100
      2      0.700467
dtype: float64
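
To see the failure and the fix side by side, the sketch below checks whether the index is sorted, attempts the partial slice, and then retries after sort_index. It uses np.arange instead of random values so the result is repeatable. Catching KeyError is an assumption that relies on pandas' UnsortedIndexError being a KeyError subclass; slice_failed and data_sorted are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6), index=index)

# 'c' comes before 'b', so the index is not in sorted order
print(data.index.is_monotonic_increasing)

try:
    data['a':'b']
    slice_failed = False
except KeyError as err:  # UnsortedIndexError derives from KeyError
    slice_failed = True
    print(type(err).__name__, err)

# After sorting, the same partial slice goes through
data_sorted = data.sort_index()
print(data_sorted.index.is_monotonic_increasing)
print(data_sorted['a':'b'])
```

Note that sort_index returns a new object, so the result has to be assigned (as in In[32]) for the sorted order to stick.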

Stacking and Unstacking Indices

It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use.
If you pass the level option, the index level it names (by position or by name) becomes the columns of the result; by default the innermost level is used.

# In[34]
pop.unstack()

# Out[34]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505

# In[35]
pop.unstack(level=0) # pop.unstack(level='state')

# Out[35]
state  California  New York     Texas
year
2010     37253956  19378102  25145561
2020     39538223  20201249  29145505

# In[36]
pop.unstack(level=1) # pop.unstack(level='year')

# Out[36]
year            2010      2020
state
California  37253956  39538223
New York    19378102  20201249
Texas       25145561  29145505

The opposite of unstack is stack, which can be used to recover the original series.

# In[37]
pop.unstack().stack()

# Out[37]
state       year
California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64
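
As a quick sanity check on the round trip, this sketch unstacks by level name and stacks back, then compares the result with the original. The names wide, by_state, and roundtrip are made up for the example; some pandas versions may print a warning about the stack implementation, which does not change the result here.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'],
)
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index,
)

# Levels can be referred to by name instead of by position
wide = pop.unstack(level='year')       # years become the columns
by_state = pop.unstack(level='state')  # states become the columns

# Stacking the wide table brings back the two-level Series
roundtrip = wide.stack()
print(roundtrip.index.equals(pop.index))
print(roundtrip.tolist() == pop.tolist())
```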

Index Setting and Resetting

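Another way to rearrange hierarchical data is to turn the index labels into columns with the reset_index method. For a Series, the name argument sets the column name for the data.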

# In[38]
pop_flat=pop.reset_index(name='population')
pop_flat

# Out[38]
        state  year  population
0  California  2010    37253956
1  California  2020    39538223
2    New York  2010    19378102
3    New York  2020    20201249
4       Texas  2010    25145561
5       Texas  2020    29145505

A common pattern is to build a MultiIndex from the column values.
This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame.

# In[39]
pop_flat.set_index(['state','year'])

# Out[39]
                 population
state      year
California 2010    37253956
           2020    39538223
New York   2010    19378102
           2020    20201249
Texas      2010    25145561
           2020    29145505
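
Putting reset_index and set_index together, a round trip might be sketched as follows. The name restored is hypothetical, and selecting the 'population' column at the end is only there to get back a Series for comparison.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'],
)
pop = pd.Series(
    [37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
    index=index,
)

# Index levels become ordinary columns; the data column is called 'population'
pop_flat = pop.reset_index(name='population')
print(list(pop_flat.columns))

# set_index returns a new DataFrame; pop_flat itself is left unchanged
restored = pop_flat.set_index(['state', 'year'])['population']
print(restored.index.equals(pop.index))
```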
