5. Hierarchical Indexing
2025. 6. 19. 00:06ㆍPython/Pandas
A Multiply Indexed Series
Bad Way
# In[1]
import numpy as np
import pandas as pd
index=[('California',2010),('California',2020),('New York',2010),('New York',2020),('Texas',2010),('Texas',2020)]
populations=[37253956,39538223,19378102,20201249,25145561,29145505]
pop=pd.Series(populations,index=index)
pop
# Out[1]
(California, 2010) 37253956
(California, 2020) 39538223
(New York, 2010) 19378102
(New York, 2020) 20201249
(Texas, 2010) 25145561
(Texas, 2020) 29145505
dtype: int64
- With this indexing scheme, you can straightforwardly index or slice the series based on this tuple index
# In[2]
pop[('California',2020):('Texas',2010)]
# Out[2]
(California, 2020) 39538223
(New York, 2010) 19378102
(New York, 2020) 20201249
(Texas, 2010) 25145561
dtype: int64
- But the convenience ends there. For example, if you need to select all values from 2020, you'll need to do some messy munging to make it happen.
- The result will not be as clean as the slicing syntax we've learned in Pandas.
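As a rough illustration of the munging mentioned above, the sketch below selects the 2020 entries from the tuple-indexed series by inspecting every key by hand. The names `pop_tuples`, `mask`, and `pop_2020` are made up for this example, and a Boolean mask is only one of several ways to do it.

```python
import pandas as pd

# Same data as In[1], rebuilt so the sketch runs on its own.
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
pop_tuples = pd.Series(populations, index=index)

# Build a Boolean mask by checking the second element of every tuple key.
mask = [key[1] == 2020 for key in pop_tuples.index]
pop_2020 = pop_tuples[mask]
print(pop_2020)
```

This works, but the selection logic lives in a hand-written loop over the keys rather than in the index itself.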
Better Way: Pandas MultiIndex
- We can create a multi-index from the tuples.
# In[3]
index=pd.MultiIndex.from_tuples(index)
The MultiIndex represents multiple levels of indexing as well as multiple labels for each data point which encode these levels.
If we reindex our series with MultiIndex, we see the hierarchical representation of the data.
# In[4]
pop=pop.reindex(index)
pop
# Out[4]
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
Some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
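To see how the levels and the labels that encode them are stored, one option is to inspect the `levels` and `codes` attributes of the index. This is a small sketch; the variable names `tuples` and `mi` are only for illustration.

```python
import pandas as pd

tuples = [('California', 2010), ('California', 2020),
          ('New York', 2010), ('New York', 2020),
          ('Texas', 2010), ('Texas', 2020)]
mi = pd.MultiIndex.from_tuples(tuples)

# levels: the distinct values available in each level of the index.
print(mi.levels)
# codes: for every row, the position of its label within each level.
print(mi.codes)
```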
We can also use Pandas slicing notation. For example, to access all data for 2020:
# In[5]
pop[:,2020]
# Out[5]
California 39538223
New York 20201249
Texas 29145505
dtype: int64
MultiIndex as Extra Dimension
The `unstack` method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
# In[6]
pop_df=pop.unstack()
pop_df
# Out[6]
2010 2020
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
- The `stack` method provides the opposite operation.
# In[7]
pop_df.stack()
# Out[7]
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
- Just as we were able to use multi-indexing to manipulate two-dimensional data within a one-dimensional Series, we can also use it to manipulate data of three or more dimensions in a Series or DataFrame.
- Each extra level in a multi-index represents an extra dimension of data.
- We might want to add another column of data for each state and year (here, the population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame.
# In[8]
pop_df=pd.DataFrame({'total':pop,'under18':[9284094,8898092,4318033,4181528,6879014,7432474]})
pop_df
# Out[8]
total under18
California 2010 37253956 9284094
2020 39538223 8898092
New York 2010 19378102 4318033
2020 20201249 4181528
Texas 2010 25145561 6879014
2020 29145505 7432474
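To make the idea that each extra level represents an extra dimension more concrete, here is a minimal sketch with a three-level index. The third level, named `group` here, is a hypothetical arrangement of the same numbers, not something the examples above use.

```python
import pandas as pd

# Hypothetical three-level index: state, year, and an illustrative 'group' level.
index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2010, 2020], ['total', 'under18']],
    names=['state', 'year', 'group'])
values = [37253956, 9284094, 39538223, 8898092,
          25145561, 6879014, 29145505, 7432474]
s = pd.Series(values, index=index)

# Three levels in a one-dimensional Series.
print(s.index.nlevels)

# Moving the 'group' level into the columns leaves (state, year) on the rows.
wide = s.unstack('group')
print(wide)
```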
- In addition, all the ufuncs and other functionality work with hierarchical indices as well.
# In[9]
f_u18=pop_df['under18']/pop_df['total']
f_u18.unstack()
# Out[9]
2010 2020
California 0.249211 0.225050
New York 0.222831 0.206994
Texas 0.273568 0.255013
For more information about the `stack` and `unstack` methods, see their entries in the pandas documentation.
Methods of MultiIndex Creation
- The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor.
# In[10]
df=pd.DataFrame(np.random.rand(4,2),index=[['a','a','b','b'],[1,2,1,2]],columns=['data1','data2'])
df
# Out[10]
data1 data2
a 1 0.627660 0.158404
2 0.181580 0.043981
b 1 0.297599 0.338398
2 0.592384 0.886842
- If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
# In[11]
data={('California',2010):37253956,('California',2020):39538223,('New York',2010):19378102,('New York',2020):20201249,('Texas',2010):25145561,('Texas',2020):29145505}
pd.Series(data)
# Out[11]
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
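A quick way to confirm that the dictionary route really produces a MultiIndex is to check the type of the resulting index. This is a small sketch with a shortened dictionary; `series_from_dict` is an illustrative name.

```python
import pandas as pd

data = {('California', 2010): 37253956, ('California', 2020): 39538223,
        ('Texas', 2010): 25145561, ('Texas', 2020): 29145505}
series_from_dict = pd.Series(data)

# The tuple keys are turned into a two-level MultiIndex.
print(type(series_from_dict.index).__name__)
print(series_from_dict.index.nlevels)
```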
Explicit MultiIndex Constructors
- For more flexibility in how the index is constructed, you can instead use the constructor methods available in the `pd.MultiIndex` class.
- You can construct a MultiIndex from a simple list of arrays giving the index values within each level.
# In[12]
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
# Out[12]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
- You can construct it from a list of tuples giving the multiple index values of each point.
# In[13]
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
# Out[13]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
- You can even construct it from a Cartesian product of single indices.
# In[14]
pd.MultiIndex.from_product([['a','b'],[1,2]])
# Out[14]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
- Similarly, you can construct a MultiIndex directly using its internal encoding by passing `levels` (a list of lists containing the available index values for each level) and `codes` (a list of lists that reference these labels).
# In[15]
pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
# Out[15]
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
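Since Out[12] through Out[15] look identical, it may help to check that the four constructors really build the same index. The sketch below compares them with `MultiIndex.equals`; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Compare every other construction against the first one.
all_equal = all(from_arrays.equals(other)
                for other in (from_tuples, from_product, from_codes))
print(all_equal)
```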
MultiIndex Level Names
- Sometimes it is convenient to name the levels of the MultiIndex.
- This can be accomplished by passing the `names` argument to any of the previously discussed MultiIndex constructors, or by setting the `names` attribute of the index.
# In[16]
pop.index.names=['state','year']
pop
# Out[16]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
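In[16] sets the `names` attribute after the fact. The other route mentioned above, passing `names` to a constructor, could look like the following sketch, which uses `from_product` as one possible choice of constructor.

```python
import pandas as pd

populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]

# Name the levels at construction time instead of assigning index.names later.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series(populations, index=index)
print(pop.index.names)
```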
MultiIndex for Columns
# In[17]
# hierarchical indices and columns
index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])
columns=pd.MultiIndex.from_product([['Bob','Guido','Sue'],['HR','Temp']],names=['subject','type'])
# mock some data
data=np.round(np.random.randn(4,6),1)
data[:, ::2]*=10
data+=37
# create the DataFrame
health_data=pd.DataFrame(data,index=index,columns=columns)
health_data
# Out[17]
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 41.0 37.2 36.0 36.0 43.0 37.9
2 50.0 35.8 29.0 35.8 41.0 37.8
2014 1 34.0 36.5 34.0 37.2 58.0 38.1
2 40.0 37.0 43.0 36.0 21.0 38.8
- This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
- We can index the top-level column by the person's name and get a full DataFrame containing just that person's information.
# In[18]
health_data['Guido']
# Out[18]
type HR Temp
year visit
2013 1 36.0 36.0
2 29.0 35.8
2014 1 34.0 37.2
2 43.0 36.0
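Indexing by name selects on the top column level. To select on the lower level instead (for example every subject's `HR` column), one option is the `xs` cross-section method. The sketch below uses placeholder numbers from `np.arange` rather than the random values above, so the data itself is an assumption.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Placeholder values 0..23 so that the result is reproducible.
data = np.arange(24, dtype=float).reshape(4, 6)
health_data = pd.DataFrame(data, index=index, columns=columns)

# Take the 'HR' cross-section of the 'type' column level.
hr_only = health_data.xs('HR', axis=1, level='type')
print(hr_only)
```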
Indexing and Slicing a MultiIndex
Multiply Indexed Series
# In[19]
pop
# Out[19]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
- We can access single elements by indexing with multiple terms.
# In[20]
pop['California',2010]
# Out[20]
37253956
- The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
# In[21]
pop['California']
# Out[21]
year
2010 37253956
2020 39538223
dtype: int64
- Partial slicing is available as well, as long as the MultiIndex is sorted.
# In[22]
pop.loc['California':'New York']
# Out[22]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
dtype: int64
- With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index
# In[23]
pop[:,2010]
# Out[23]
state
California 37253956
New York 19378102
Texas 25145561
dtype: int64
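- Other types of indexing and selection work as well, for example selection with a Boolean mask (In[24]) and fancy indexing with a list of labels (In[25]).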
# In[24]
pop[pop>22000000]
# Out[24]
state year
California 2010 37253956
2020 39538223
Texas 2010 25145561
2020 29145505
dtype: int64
# In[25]
pop[['California','Texas']]
# Out[25]
state year
California 2010 37253956
2020 39538223
Texas 2010 25145561
2020 29145505
dtype: int64
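The selections in this subsection can also be written with the explicit `loc` indexer, which some readers may find less ambiguous than plain brackets. This is a sketch of the equivalents, rebuilt on the same population data; the result names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series([37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
                index=index)

single = pop.loc['California', 2010]          # one element
all_2010 = pop.loc[:, 2010]                   # partial indexing on the lower level
two_states = pop.loc['California':'New York'] # partial slice on the top level
print(single)
print(all_2010)
print(two_states)
```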
Multiply Indexed DataFrames
# In[26]
health_data
# Out[26]
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 41.0 37.2 36.0 36.0 43.0 37.9
2 50.0 35.8 29.0 35.8 41.0 37.8
2014 1 34.0 36.5 34.0 37.2 58.0 38.1
2 40.0 37.0 43.0 36.0 21.0 38.8
- The syntax used for multiply indexed Series applies to the columns.
# In[27]
health_data['Guido','HR']
# Out[27]
year visit
2013 1 36.0
2 29.0
2014 1 34.0
2 43.0
Name: (Guido, HR), dtype: float64
- Also, as with the single-index case, we can use the `loc` and `iloc` indexers (the older `ix` indexer was removed in pandas 1.0).
# In[28]
health_data.iloc[:2,:2]
# Out[28]
subject Bob
type HR Temp
year visit
2013 1 41.0 37.2
2 50.0 35.8
- These indexers provide an array-like view of the underlying two-dimensional data, but each individual index passed to `loc` can be a tuple of labels, one per level (`iloc` remains purely positional).
# In[29]
health_data.loc[:,('Bob','HR')]
# Out[29]
year visit
2013 1 41.0
2 50.0
2014 1 34.0
2 40.0
Name: (Bob, HR), dtype: float64
- Working with slices within these index tuples is not convenient.
- If you try to create a slice within a tuple, it will lead to a syntax error.
- You could get around this by building the desired slice explicitly using Python's built-in `slice` function, but a better way in this context is to use an `IndexSlice` object, which Pandas provides for precisely this situation.
# In[30]
idx=pd.IndexSlice
health_data.loc[idx[:,1],idx[:,'HR']]
# Out[30]
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 41.0 36.0 43.0
2014 1 34.0 34.0 58.0
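For comparison, the sketch below writes the same selection both ways: once with `pd.IndexSlice` and once with the built-in `slice` function. It uses placeholder values from `np.arange` instead of the random data above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                           index=index, columns=columns)

idx = pd.IndexSlice
# First visit of every year, HR column of every subject.
with_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]
# The same selection spelled out with explicit slice(None) objects.
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]
print(with_idx.equals(with_slice))
```

The `IndexSlice` form keeps the familiar colon syntax, which is the main reason to prefer it.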
Rearranging Multi-Indexes
Sorted and Unsorted Indices
- Many of the MultiIndex slicing operations will fail if the index is not sorted.
# In[31]
index=pd.MultiIndex.from_product([['a','c','b'],[1,2]])
data=pd.Series(np.random.rand(6),index=index)
data.index.names=['char','int']
data
# Out[31]
char int
a 1 0.601307
2 0.623240
c 1 0.194030
2 0.969886
b 1 0.931100
2 0.700467
dtype: float64
- You can't take a partial slice of this index; trying to do so results in an error.
- For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted order.
- Pandas provides the `sort_index` method to sort the data (the older `sortlevel` method of Series and DataFrame has been removed from recent versions of Pandas).
# In[32]
data=data.sort_index()
data
# Out[32]
char int
a 1 0.601307
2 0.623240
b 1 0.931100
2 0.700467
c 1 0.194030
2 0.969886
dtype: float64
- With the index sorted in this way, partial slicing will work as expected.
# In[33]
data['a':'b']
# Out[33]
char int
a 1 0.601307
2 0.623240
b 1 0.931100
2 0.700467
dtype: float64
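Before slicing, it can be useful to check whether an index is sorted at all. One way to do that, sketched here, is the `is_monotonic_increasing` property of the index. The data uses `np.arange` instead of random numbers so that the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6), index=index)

# The labels appear in the order a, c, b, so the index is not sorted.
print(data.index.is_monotonic_increasing)

sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)

# With a sorted index the partial slice is safe.
part = sorted_data.loc['a':'b']
print(part)
```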
Stacking and Unstacking Indices
- It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use.
- If you pass the `level` argument, the index level it names becomes the columns of the result.
# In[34]
pop.unstack()
# Out[34]
year 2010 2020
state
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
# In[35]
pop.unstack(level=0) # pop.unstack(level='state')
# Out[35]
state California New York Texas
year
2010 37253956 19378102 25145561
2020 39538223 20201249 29145505
# In[36]
pop.unstack(level=1) # pop.unstack(level='year')
# Out[36]
year 2010 2020
state
California 37253956 39538223
New York 19378102 20201249
Texas 25145561 29145505
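The two `level` choices above produce the same table in different orientations. The sketch below, rebuilt on the same population data, checks that unstacking `state` gives the transpose of unstacking `year`; the names `by_state` and `by_year` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series([37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
                index=index)

by_state = pop.unstack(level='state')  # states become the columns
by_year = pop.unstack(level='year')    # years become the columns

# Same numbers, with rows and columns swapped.
same_values = (by_state.values == by_year.T.values).all()
print(same_values)
```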
- The opposite of `unstack` is `stack`, which can be used to recover the original series.
# In[37]
pop.unstack().stack()
# Out[37]
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
dtype: int64
Index Setting and Resetting
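- Another way to rearrange hierarchical data is to turn the index labels into columns with the `reset_index` method. For a Series, the `name` argument sets the name of the column that holds the data.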
# In[38]
pop_flat=pop.reset_index(name='population')
pop_flat
# Out[38]
state year population
0 California 2010 37253956
1 California 2020 39538223
2 New York 2010 19378102
3 New York 2020 20201249
4 Texas 2010 25145561
5 Texas 2020 29145505
- A common pattern is to build a MultiIndex from the column values.
- This can be done with the `set_index` method of the DataFrame, which returns a multiply indexed DataFrame.
# In[39]
pop_flat.set_index(['state','year'])
# Out[39]
population
state year
California 2010 37253956
2020 39538223
New York 2010 19378102
2020 20201249
Texas 2010 25145561
2020 29145505
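As a sanity check, `reset_index` and `set_index` can be chained to confirm that they undo each other. This is a sketch on the same population data, with `restored` as an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2010, 2020]],
    names=['state', 'year'])
pop = pd.Series([37253956, 39538223, 19378102, 20201249, 25145561, 29145505],
                index=index)

# Flatten the index into ordinary columns, then rebuild it from those columns.
pop_flat = pop.reset_index(name='population')
restored = pop_flat.set_index(['state', 'year'])['population']

print(list(pop_flat.columns))
print(restored.index.equals(pop.index))
```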