1. Introducing Pandas Objects

njh1008 2025. 6. 18. 23:58
  • There are three fundamental Pandas structures: Series, DataFrame, and Index.
# In[1]
import numpy as np 
import pandas as pd

Pandas Series Object

  • A Pandas Series is a one-dimensional array of indexed data.
# In[2]
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
# Out[2]
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
  • A Series combines a sequence of values with an explicit sequence of indices, which we can access with the values and index attributes.
# In[3]
print(data.values)
print(data.index)
# Out[3]
[0.25 0.5  0.75 1.  ]
RangeIndex(start=0, stop=4, step=1)
  • As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation.
# In[4]
print(data[1])
print(data[1:3])
# Out[4]
0.5
1    0.50
2    0.75
dtype: float64
  • The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.

Series as Generalized NumPy Array

  • A NumPy array has an implicitly defined integer index used to access the values.
  • A Pandas Series has an explicitly defined index associated with the values.
  • This explicit index definition gives the Series object additional capabilities.
# In[5]
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['b', 'a', 'd', 'c'])
data
# Out[5]
b    0.25
a    0.50
d    0.75
c    1.00
dtype: float64

# In[6]
data['b']
# Out[6]
0.25
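
One consequence of the explicit index is that labels do not have to be contiguous or ordered integers. The sketch below is a minimal illustration with made-up labels (2, 5, 3, 7 are assumptions chosen for the example, not values from these notes): square-bracket access looks up the label, not the position.

```python
import pandas as pd

# Hypothetical labels: integers that are neither contiguous nor sorted.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# data[5] finds the value stored under the label 5 (the second element),
# not the element at position 5 (which does not exist here).
value_at_label_5 = data[5]
print(value_at_label_5)
```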

Series as Specialized Dictionary

  • A dictionary is a structure that maps arbitrary keys to a set of arbitrary values.
  • A Series is a structure that maps typed keys to a set of typed values.
  • Just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than a Python dictionary for certain operations.
# In[7]
population_dict = {'California': 39538223, 'Texas': 29145505,
                   'Florida': 21538187, 'New York': 20201249,
                   'Pennsylvania': 13002700}
population = pd.Series(population_dict)
population
# Out[7]
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

# In[8]
population['California']
# Out[8]
39538223
  • Unlike a dictionary, though, the Series also supports array-style operations such as slicing.
# In[9]
population['California':'Florida']
# Out[9]
California    39538223
Texas         29145505
Florida       21538187
dtype: int64
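
The dictionary analogy and the slicing behavior above can be checked side by side. This is a small sketch reusing the population figures from these notes; the variable names has_texas, labels, and subset are introduced only for illustration. Note that a slice by label includes its end label, unlike a positional Python slice.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# Dictionary-style operations work on the index labels.
has_texas = 'Texas' in population      # membership tests the labels
labels = list(population.keys())       # keys() returns the index

# Array-style slicing by label keeps both endpoints.
subset = population['California':'Florida']
print(has_texas)
print(labels)
print(len(subset))
```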

Constructing Series Objects

  • A Pandas Series is constructed with a call of the form pd.Series(data, index=index).
  • index is an optional argument, and data can be one of many entities.
  • data can be a list or NumPy array, in which case index defaults to an integer sequence:
# In[10]
pd.Series([2,4,6])
# Out[10]
0    2
1    4
2    6
dtype: int64

  • data can be a scalar, which is repeated to fill the specified index.
# In[11]
pd.Series(5, index=[100, 200, 300])
# Out[11]
100    5
200    5
300    5
dtype: int64
  • Or it can be a dictionary, in which case index defaults to the dictionary keys
# In[12]
pd.Series({2: 'a', 1: 'b', 3: 'c'})
# Out[12]
2    a
1    b
3    c
dtype: object
  • The index can be explicitly set to control the order or the subset of keys used.
# In[13]
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2])
# Out[13]
1    b
2    a
dtype: object
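
A related case worth trying is an index that asks for a key the dictionary does not contain. The sketch below assumes a label 4 that is absent from the dictionary (a value chosen only for this illustration); Pandas keeps the requested label and marks its value as missing.

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is filled with NaN.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2, 4])
print(s)

# Key 3 is dropped because it was not requested in the index.
print(3 in s)
```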

Pandas DataFrame Object

DataFrame as Generalized NumPy Array

  • If a Series is an analog of a one-dimensional array with explicit indices, a DataFrame is an analog of a two-dimensional array with explicit row and column indices.
# In[14]
area_dict = {'California': 423967, 'Texas': 695662, 'Florida': 170312,
             'New York': 141297, 'Pennsylvania': 119280}
area = pd.Series(area_dict)
area
# Out[14]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

# In[15]
states = pd.DataFrame({'population': population, 'area': area})
states
# Out[15]
              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280
  • Like the Series object, the DataFrame has an index attribute that gives access to the index labels.
# In[16]
states.index
# Out[16]
Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')
  • Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels.
# In[17]
states.columns
# Out[17]
Index(['population', 'area'], dtype='object')
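
To make the two-dimensional array analogy concrete, the sketch below rebuilds a shortened version of states (only two rows, an assumption made to keep the example small) and inspects its shape and its underlying values. The names n_rows, n_cols, and raw are illustrative.

```python
import pandas as pd

# Shortened to two states for brevity.
population = pd.Series({'California': 39538223, 'Texas': 29145505})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

# shape is (number of rows, number of columns), as for a 2D array.
n_rows, n_cols = states.shape

# values exposes the data as a plain two-dimensional NumPy array,
# without the row and column labels.
raw = states.values
print(n_rows, n_cols)
print(raw.ndim)
```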

DataFrame as Specialized Dictionary

  • Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.
# In[18]
states['area']
# Out[18]
California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
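
The dictionary view of a DataFrame can be probed with ordinary dictionary idioms. This is a minimal sketch on a two-row version of states (shortened as an assumption for brevity): membership and keys() act on the column names, not on the row labels.

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

# Keys are column names; each one maps to a Series of column data.
column_names = list(states.keys())
area_column = states['area']

print(column_names)
print(type(area_column).__name__, area_column.name)

# Membership tests columns, so a row label is not "in" the DataFrame.
print('area' in states, 'California' in states)
```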

Constructing DataFrame Object

From a single Series object

  • A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.
# In[19]
pd.DataFrame(population, columns=['population'])
# Out[19]
              population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700

From a list of dicts

# In[20]
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)
# Out[20]
   a  b
0  0  0
1  1  2
2  2  4
  • If some keys are missing from a dictionary, Pandas fills them in with NaN (Not a Number) values.
# In[21]
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# Out[21]
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
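
The output above shows 1.0 and 4.0 rather than 1 and 4. A short sketch of why, using the same input: NaN is a floating-point value, so a column that receives one is stored as float64, while the complete column b stays int64. The names df and missing_per_column are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are upcast to float to hold NaN.
print(df.dtypes)

# isna() flags the filled-in cells; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```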

From a dictionary of Series objects

  • A DataFrame can also be constructed from a dictionary of Series objects, as shown earlier in In[15].

From a two-dimensional NumPy array

  • Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
  • If omitted, an integer index will be used for each.
# In[22]
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
# Out[22]
        foo       bar
a  0.466496  0.888614
b  0.228347  0.613272
c  0.912784  0.961023
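
For the case where the names are omitted, the following sketch shows the integer defaults. It uses np.arange instead of random data (a substitution made here so the result is reproducible); the name df is illustrative.

```python
import numpy as np
import pandas as pd

# A deterministic 3x2 array in place of random numbers.
arr = np.arange(6).reshape(3, 2)

# No columns= or index= given: both default to integer ranges.
df = pd.DataFrame(arr)
print(df.index)
print(df.columns)
```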

From a NumPy structured array

  • A Pandas DataFrame operates much like a structured array, and can be created directly from one.
# In[23]
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
# Out[23]
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

# In[24]
pd.DataFrame(A)
# Out[24]
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
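
With all-zero data it is hard to see what carries over, so here is a small variant with non-zero records (the values 1, 2.5, 2, 3.5 are made up for illustration). The field names become column labels and each field keeps its own dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical records: an integer field 'A' and a float field 'B'.
records = np.array([(1, 2.5), (2, 3.5)],
                   dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(records)
print(df)
print(df.dtypes)
```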

Pandas Index Object

  • The Series and DataFrame objects both contain an explicit index that lets you reference and modify data.
  • The Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
# In[25]
ind = pd.Index([2, 3, 5, 7, 11])
ind
# Out[25]
Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as Immutable Array

  • The Index in many ways operates like an array.
# In[26]
print(ind[1])
print(ind[::2])
print(ind.size, ind.shape, ind.ndim, ind.dtype)
# Out[26]
3
Int64Index([2, 5, 11], dtype='int64')
5 (5,) 1 int64
  • One difference between Index objects and NumPy arrays is that the indices are immutable.
  • That is, they cannot be modified via the normal means.
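
A quick way to see the immutability is to attempt an item assignment, which would succeed on a NumPy array. In this sketch the assignment raises a TypeError and the Index is left unchanged; the flag name raised is introduced only for the example.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0          # allowed on a NumPy array, rejected on an Index
except TypeError as err:
    raised = True
    print(err)

# The contents are exactly what they were before the attempt.
print(list(ind))
```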

Index as Ordered Set

  • The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.
# In[27]
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# In[28]
print(indA.intersection(indB))
print(indA.union(indB))
print(indA.symmetric_difference(indB))
# Out[28]
Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
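
As a sanity check on the set analogy, the sketch below compares the Index methods with Python's built-in set operators on the same values, and adds difference, which is also an Index method. The name only_in_A is illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Convert to plain Python sets to compare against the built-in operators.
setA, setB = set(indA.tolist()), set(indB.tolist())

print(set(indA.intersection(indB).tolist()) == (setA & setB))
print(set(indA.union(indB).tolist()) == (setA | setB))
print(set(indA.symmetric_difference(indB).tolist()) == (setA ^ setB))

# Elements of indA that do not appear in indB.
only_in_A = indA.difference(indB)
print(only_in_A.tolist())
```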
