- There are three fundamental Pandas structures: Series, DataFrame, and Index.
# In[1]
import numpy as np
import pandas as pd
Pandas Series Object
- A Pandas Series is a one-dimensional array of indexed data.
# In[2]
data=pd.Series([0.25,0.5,0.75,1.0])
data
# Out[2]
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
- A Series combines a sequence of values with an explicit sequence of indices, which we can access with the values and index attributes.
# In[3]
print(data.values)
print(data.index)
# Out[3]
[0.25 0.5 0.75 1. ]
RangeIndex(start=0, stop=4, step=1)
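- As a quick check of what the two attributes return, the sketch below inspects their types. It assumes the same float data as In[2]; for other dtypes, values may return a different array type.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# For plain numeric data, .values is an ordinary Numpy array.
values_is_ndarray = isinstance(data.values, np.ndarray)

# With no index given, Pandas builds a RangeIndex (0, 1, 2, ...).
index_is_range = isinstance(data.index, pd.RangeIndex)

print(values_is_ndarray, index_is_range)
```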
- As with a Numpy array, data can be accessed by the associated index via the familiar Python square-bracket notation.
# In[4]
print(data[1])
print(data[1:3])
# Out[4]
0.5
1 0.50
2 0.75
dtype: float64
- The Pandas Series is much more general and flexible than the one-dimensional Numpy array that it emulates.
Series as Generalized Numpy array
- A Numpy array has an implicitly defined integer index used to access the values.
- A Pandas Series has an explicitly defined index associated with the values.
- This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer; it can consist of values of any desired type, such as strings.
# In[5]
data=pd.Series([0.25,0.5,0.75,1.0],index=['b','a','d','c'])
data
# Out[5]
b 0.25
a 0.50
d 0.75
c 1.00
dtype: float64
# In[6]
data['b']
# Out[6]
0.25
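- The explicit index does not have to be made of strings. As a small sketch (the label values here are chosen only for illustration), integer labels can be non-contiguous and unordered, and square brackets then look up the label rather than the position.

```python
import pandas as pd

# Integer labels that are neither contiguous nor sorted.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# 5 is treated as a label here, so this returns the second value.
value_at_label_5 = data[5]
print(value_at_label_5)  # 0.5
```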
Series as Specialized Dictionary
- A dictionary is a structure that maps arbitrary keys to a set of arbitrary values.
- A Series is a structure that maps typed keys to a set of typed values.
- Just as the type-specific compiled code behind a Numpy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than a Python dictionary for certain operations.
# In[7]
population_dict={'California':39538223,'Texas':29145505,'Florida':21538187,'New York':20201249,'Pennsylvania':13002700}
population=pd.Series(population_dict)
population
# Out[7]
California 39538223
Texas 29145505
Florida 21538187
New York 20201249
Pennsylvania 13002700
dtype: int64
# In[8]
population['California']
# Out[8]
39538223
- Unlike a dictionary, though, the Series also supports array-style operations such as slicing.
# In[9]
population['California':'Florida']
# Out[9]
California 39538223
Texas 29145505
Florida 21538187
dtype: int64
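- One detail worth noticing in Out[9] is that the label 'Florida' is included in the result. The sketch below contrasts a label-based slice with a position-based slice, using the explicit loc and iloc accessors to make the difference visible.

```python
import pandas as pd

population = pd.Series({
    'California': 39538223,
    'Texas': 29145505,
    'Florida': 21538187,
    'New York': 20201249,
    'Pennsylvania': 13002700,
})

# Label-based slice: the end label is included.
by_label = population.loc['California':'Florida']

# Position-based slice: the end position is excluded, as in Numpy.
by_position = population.iloc[0:2]

print(list(by_label.index))     # ['California', 'Texas', 'Florida']
print(list(by_position.index))  # ['California', 'Texas']
```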
Constructing Series Objects
- A Pandas Series is constructed as pd.Series(data, index=index), where index is an optional argument and data can be one of many entities.
- data can be a list or Numpy array, like this:
# In[10]
pd.Series([2,4,6])
# Out[10]
0 2
1 4
2 6
dtype: int64
- data can be a scalar, which is repeated to fill the specified index.
# In[11]
pd.Series(5,index=[100,200,300])
# Out[11]
100 5
200 5
300 5
dtype: int64
- Or it can be a dictionary, in which case index defaults to the dictionary keys
# In[12]
pd.Series({2:'a',1:'b',3:'c'})
# Out[12]
2 a
1 b
3 c
dtype: object
- The index can be explicitly set to control the order or the subset of keys used.
# In[13]
pd.Series({2:'a',1:'b',3:'c'},index=[1,2])
# Out[13]
1 b
2 a
dtype: object
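- The reverse case may also be useful to see: when the requested index contains a label that is not a key of the dictionary, that entry is filled with NaN. A minimal sketch, where the extra label 4 is added only for illustration:

```python
import pandas as pd

# Key 3 is dropped because it is not requested;
# label 4 has no matching key, so its value is missing.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2, 4])

print(s)
print(pd.isna(s.loc[4]))  # True
```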
Pandas DataFrame Object
DataFrame as Generalized Numpy Array
- If a Series is an analog of a one-dimensional array with explicit indices, a DataFrame is an analog of a two-dimensional array with explicit row and column indices.
# In[14]
area_dict={'California':423967,'Texas':695662,'Florida':170312,'New York':141297,'Pennsylvania':119280}
area=pd.Series(area_dict)
area
# Out[14]
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
dtype: int64
# In[15]
states=pd.DataFrame({'population':population,'area':area})
states
# Out[15]
population area
California 39538223 423967
Texas 29145505 695662
Florida 21538187 170312
New York 20201249 141297
Pennsylvania 13002700 119280
- Like the Series object, the DataFrame has an index attribute that gives access to the index labels.
# In[16]
states.index
# Out[16]
Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')
- Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels.
# In[17]
states.columns
# Out[17]
Index(['population', 'area'], dtype='object')
DataFrame as Specialized Dictionary
- Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.
# In[18]
states['area']
# Out[18]
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
Name: area, dtype: int64
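- The dictionary analogy can be checked directly. In the sketch below, membership tests and keys() operate on column labels, not row labels, and a column lookup returns a Series. This is also where a DataFrame differs from a two-dimensional Numpy array, in which data[0] returns the first row.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'Florida': 170312, 'New York': 141297,
                  'Pennsylvania': 119280})
states = pd.DataFrame({'population': population, 'area': area})

# "in" checks the column labels, like the keys of a dictionary.
print('area' in states)        # True
print('California' in states)  # False

# keys() returns the column labels.
column_labels = list(states.keys())
print(column_labels)           # ['population', 'area']

# Looking up a column gives back a Series.
area_column = states['area']
```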
Constructing DataFrame Object
From a single Series object
- A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.
# In[19]
pd.DataFrame(population,columns=['population'])
# Out[19]
population
California 39538223
Texas 29145505
Florida 21538187
New York 20201249
Pennsylvania 13002700
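- As an alternative that is not used in the notes above, Series.to_frame also builds a single-column DataFrame. The sketch assumes the same population data as In[7] and should give the same result as In[19].

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# name sets the label of the single resulting column.
frame = population.to_frame(name='population')

print(frame.shape)          # (5, 1)
print(list(frame.columns))  # ['population']
```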
From a list of dicts
# In[20]
data=[{'a':i,'b':2*i} for i in range(3)]
pd.DataFrame(data)
# Out[20]
a b
0 0 0
1 1 2
2 2 4
- If some keys in the dictionary are missing, Pandas will fill them in with NaN (Not a Number) values.
# In[21]
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])
# Out[21]
a b c
0 1.0 2 NaN
1 NaN 3 4.0
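- Out[21] shows 1.0 and 4.0 where integers were passed in. A short sketch of why: NaN is a floating-point value, so columns that contain it are stored as floats, while the complete column b stays an integer column.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are upcast to float; 'b' is not.
print(df.dtypes)

# Count the missing entries in the whole frame.
missing_count = int(df.isna().sum().sum())
print(missing_count)  # 2
```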
From a dictionary of Series objects
- A DataFrame can be constructed from a dictionary of Series objects.
- We saw this before; please refer to In[15].
From a two-dimensional Numpy array
- Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
- If omitted, an integer index will be used for each.
# In[22]
pd.DataFrame(np.random.rand(3,2),columns=['foo','bar'],index=['a','b','c'])
# Out[22]
foo bar
a 0.466496 0.888614
b 0.228347 0.613272
c 0.912784 0.961023
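- To see the default behaviour when the names are omitted, the sketch below builds a DataFrame from a small array without columns or index. It uses np.arange instead of random data so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# No columns or index given: both default to integer ranges.
df = pd.DataFrame(np.arange(6).reshape(3, 2))

print(df.index)    # row labels 0, 1, 2
print(df.columns)  # column labels 0, 1
```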
From a Numpy structured array
- A Pandas DataFrame operates much like a structured array, and can be created directly from one.
# In[23]
A=np.zeros(3,dtype=[('A','i8'),('B','f8')])
A
# Out[23]
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
# In[24]
pd.DataFrame(A)
# Out[24]
A B
0 0 0.0
1 0 0.0
2 0 0.0
Pandas Index Object
- The Series and DataFrame objects both contain an explicit index that lets you reference and modify data.
- The Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
# In[25]
ind=pd.Index([2,3,5,7,11])
ind
# Out[25]
Int64Index([2, 3, 5, 7, 11], dtype='int64')
Index as Immutable array
- The Index in many ways operates like an array.
# In[26]
print(ind[1])
print(ind[::2])
print(ind.size, ind.shape, ind.ndim, ind.dtype)
# Out[26]
3
Int64Index([2, 5, 11], dtype='int64')
5 (5,) 1 int64
- One difference between Index objects and Numpy arrays is that the indices are immutable.
- That is, they cannot be modified via the normal means.
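- A minimal sketch of what this looks like in practice: assigning to an element of an Index raises a TypeError, and the Index is left unchanged. The mutated flag is only a helper for this example.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # item assignment is not allowed on an Index
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)

print(list(ind))  # [2, 3, 5, 7, 11]
```

- This immutability makes it safer to share an index between several Series and DataFrame objects without side effects from accidental modification.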
Index as Ordered Set
- The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.
# In[27]
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2,3,5,7,11])
# In[28]
print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))
# Out[28]
Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
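- Two more set-like operations, sketched with the same indA and indB as In[27]: difference returns the labels found only in the first Index, and the in operator tests membership.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels that are in indA but not in indB.
only_in_a = indA.difference(indB)
print(list(only_in_a))  # [1, 9]

# Membership test, as with a Python set.
print(3 in indA)   # True
print(2 in indA)   # False
```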