4. Handling Missing Data
Trade-offs in Missing Data Conventions
- A number of approaches have been developed to track the presence of missing data in a table or DataFrame.
- They revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
- In the masking approach, the mask might be an entirely separate Boolean array, or it might involve appropriation of one bit in the data representation to locally indicate the null status of a value.
- In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with `NaN`, a special value that is part of the IEEE floating-point specification.
- Use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation.
- A sentinel value reduces the range of valid values that can be represented, and may require extra logic in CPU and GPU arithmetic, because common special values like `NaN` are not available for all data types.
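As a purely illustrative sketch of the two strategies (not how Pandas is implemented internally): NumPy's masked arrays realize the mask approach, while an in-band value such as -9999 plays the sentinel role.

```python
import numpy as np

# Mask strategy: a separate Boolean array marks the missing entries
m = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(m.sum())  # 4 -- the masked entry is ignored

# Sentinel strategy: reserve an in-band value (here -9999) to mean "missing"
data = np.array([1, -9999, 3])
print(data[data != -9999].sum())  # 4
```

Note the trade-offs the text describes: the mask costs an extra Boolean array, while the sentinel makes -9999 unusable as a real value.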
Missing Data in Pandas
- The way in which Pandas handles missing values is constrained by its reliance on the Numpy package, which does not have a built-in notion of NA values for non-floating-point data types.
- For these reasons, Pandas has two modes of storing and manipulating null values.
- The default mode is to use a sentinel-based missing-data scheme, with the sentinel value `NaN` or `None` depending on the type of the data.
- You can opt in to using the nullable data types Pandas provides, which results in the creation of an accompanying mask array to track missing entries. These missing entries are then presented to the user as the special `pd.NA` value.
- In either case, the data operations and manipulations provided by the Pandas API will handle and propagate those missing entries in a predictable manner.
None as a Sentinel Value
- `None` is a Python object, which means that any array containing `None` must have `dtype=object`.
# In[1]
import numpy as np

vals1=np.array([1,None,2,3])
vals1
# Out[1]
array([1, None, 2, 3], dtype=object)
- The `dtype=object` means that the best common type representation Numpy could infer for the contents of the array is that they are Python objects.
- Because Python does not support arithmetic operations with `None`, aggregations like `sum` or `min` will generally lead to an error.
- For this reason, Pandas does not use `None` as a sentinel in its numerical arrays.
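As a minimal illustration of that error, summing the object array above raises a `TypeError`, because `int + None` is undefined:

```python
import numpy as np

vals1 = np.array([1, None, 2, 3])  # dtype=object
try:
    vals1.sum()
except TypeError as e:
    print("TypeError:", e)  # unsupported operand type(s) for int + NoneType
```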
NaN: Missing Numerical Data
- `NaN` is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
# In[2]
vals2=np.array([1,np.nan,3,4])
vals2
# Out[2]
array([ 1., nan, 3., 4.])
- Numpy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
- `NaN` is a bit like a data virus; it infects any other object it touches.
- Regardless of the operation, the result of arithmetic with `NaN` will be another `NaN`.
# In[3]
print(1+np.nan)
print(0*np.nan)
# Out[3]
nan
nan
- This means that aggregates over the values are well defined but not always useful.
# In[4]
vals2.sum(),vals2.min(),vals2.max()
# Out[4]
(nan, nan, nan)
- Numpy does provide `NaN`-aware versions of aggregations that will ignore these missing values.
# In[5]
np.nansum(vals2),np.nanmin(vals2),np.nanmax(vals2)
# Out[5]
(8.0, 1.0, 4.0)
NaN and None in Pandas
# In[6]
import pandas as pd

pd.Series([1,np.nan,2,None])
# Out[6]
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
- For types that don't have an available sentinel value, Pandas automatically typecasts when NA values are present.
- If we set a value in an integer array to `np.nan`, it will automatically be upcast to a floating-point type to accommodate the NA.
# In[7]
x=pd.Series(range(2),dtype=int)
x
# Out[7]
0 0
1 1
dtype: int64
# In[8]
x[0]=None
x
# Out[8]
0 NaN
1 1.0
dtype: float64
Pandas handling of NAs by type

| Typeclass | Conversion when storing NAs | NA sentinel value |
|---|---|---|
| floating | No change | `np.nan` |
| object | No change | `None` or `np.nan` |
| integer | Cast to `float64` | `np.nan` |
| boolean | Cast to `object` | `None` or `np.nan` |
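As a quick check of the table, constructing Series with NAs already present shows the casts directly:

```python
import numpy as np
import pandas as pd

# Integers upcast to float64 when an NA is present
print(pd.Series([1, np.nan, 2]).dtype)       # float64

# Booleans fall back to object when an NA is present
print(pd.Series([True, None, False]).dtype)  # object
```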
Pandas Nullable Dtypes
- Pandas later added nullable dtypes, which are distinguished from regular dtypes by capitalization of their names.
- For backward compatibility, these nullable dtypes are only used if specifically requested.
# In[9]
pd.Series([1,np.nan,2,None,pd.NA],dtype='Int32')
# Out[9]
0 1
1 <NA>
2 2
3 <NA>
4 <NA>
dtype: Int32
- This representation can be used interchangeably with the others in all the operations explored through the rest of this chapter.
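As a brief illustration of that interchangeability (a minimal sketch using the same `Int32` dtype as above), `pd.NA` propagates through element-wise arithmetic, while aggregations skip it by default:

```python
import pandas as pd

s = pd.Series([1, None, 2], dtype='Int32')
print(s + 1)    # <NA> propagates through element-wise arithmetic
print(s.sum())  # aggregations skip NA by default -> 3
```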
Operating on Null Values
- Pandas provides several methods for detecting, removing, and replacing null values in Pandas data structures.
- `isnull`: Generates a Boolean mask indicating missing values.
- `notnull`: Opposite of `isnull`.
- `dropna`: Returns a filtered version of the data.
- `fillna`: Returns a copy of the data with missing values filled or imputed.
Detecting Null Values
# In[10]
data=pd.Series([1,np.nan,'hello',None])
# In[11]
data.isnull()
# Out[11]
0 False
1 True
2 False
3 True
dtype: bool
# In[12]
data[data.notnull()]
# Out[12]
0 1
2 hello
dtype: object
- The `isnull` and `notnull` methods produce similar Boolean results for DataFrame objects.
Dropping Null Values
# In[13]
data.dropna()
# Out[13]
0 1
2 hello
dtype: object
- We cannot drop single values from a DataFrame; we can only drop entire rows or columns.
# In[14]
df=pd.DataFrame([[1 , np.nan, 2],
[2 , 3, 5],
[np.nan, 4, 6]])
df
# Out[14]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
- By default, `dropna` will drop all rows in which any null value is present.
# In[15]
df.dropna()
# Out[15]
0 1 2
1 2.0 3.0 5
- Alternatively, you can drop NA values along a different axis, using `axis=1` or `axis='columns'`.
# In[16]
df.dropna(axis=1)
# Out[16]
2
0 2
1 5
2 6
- This drop can be specified through the `how` or `thresh` parameters.
- The default is `how='any'`, such that any row or column containing a null value will be dropped.
- You can also specify `how='all'`, which will only drop rows/columns that contain all null values.
# In[17]
df[3]=np.nan
df
# Out[17]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[18]
df.dropna(axis=1,how='all')
# Out[18]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
- For finer-grained control, the `thresh` parameter lets you specify a minimum number of non-null values for the row/column to be kept.
# In[19]
df.dropna(axis=0,thresh=3)
# Out[19]
0 1 2 3
1 2.0 3.0 5 NaN
Filling Null Values
# In[20]
data=pd.Series([1,np.nan,2,None,3],index=list('abcde'),dtype='Int32')
data
# Out[20]
a 1
b <NA>
c 2
d <NA>
e 3
dtype: Int32
- We can fill NA entries with a single value, such as zero.
# In[21]
data.fillna(0)
# Out[21]
a 1
b 0
c 2
d 0
e 3
dtype: Int32
- We can specify a forward fill to propagate the previous value forward.
# In[22]
data.fillna(method='ffill') # forward fill
# Out[22]
a 1
b 1
c 2
d 2
e 3
dtype: Int32
- Or we can specify a backward fill to propagate the next value backward.
# In[23]
data.fillna(method='bfill') # backward fill
# Out[23]
a 1
b 2
c 2
d 3
e 3
dtype: Int32
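Note that recent Pandas releases deprecate the `method` argument to `fillna` (around version 2.1, if memory serves); the dedicated `ffill` and `bfill` methods give the same results:

```python
import pandas as pd

data = pd.Series([1, None, 2, None, 3], index=list('abcde'), dtype='Int32')
print(data.ffill())  # same result as fillna(method='ffill')
print(data.bfill())  # same result as fillna(method='bfill')
```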
- In the case of a DataFrame, the options are similar, but we can also specify an `axis` along which the fills should take place.
# In[24]
df
# Out[24]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[25]
df.fillna(method='ffill',axis=1)
# Out[25]
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0