12. High-Performance Pandas: eval and query
- The power of the PyData stack is built upon the ability of Numpy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax.
- While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue (excessive) overhead in computational time and memory use.
- To address this, Pandas includes some methods that allow you to directly access C-speed operations without the costly allocation of intermediate arrays: eval and query.
Motivating query and eval: Compound Expressions
# In[1]
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1000000)
y = rng.random(1000000)
%timeit x + y
# Out[1]
4.26 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- This is much faster than doing the addition via a Python loop or comprehension.
- But this abstraction can become less efficient when computing compound expressions.
# In[2]
mask=(x>0.5)&(y<0.5)
- Because Numpy evaluates each subexpression, this is roughly equivalent to the following code.
# In[3]
tmp1=(x>0.5)
tmp2=(y<0.5)
mask=tmp1&tmp2
- In other words, every intermediate step is explicitly allocated in memory.
- If the x and y arrays are very large, this can lead to significant memory and computational overhead.
- The NumExpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
# In[4]
import numexpr
mask_numexpr=numexpr.evaluate('(x>0.5)&(y<0.5)')
np.all(mask==mask_numexpr)
# Out[4]
True
- The benefit here is that NumExpr evaluates the expression in a way that avoids temporary arrays where possible, and thus can be much more efficient than Numpy, especially for long sequences of computations on large arrays.
- The Pandas eval and query tools are essentially Pandas-specific wrappers of NumExpr functionality.
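- As a rough check of the NumExpr speed claim above, you can time the two versions of the mask computation side by side; this is only a sketch, and the actual numbers depend on your machine and on how many threads NumExpr uses.
%timeit (x > 0.5) & (y < 0.5)                       # Numpy: allocates temporary arrays
%timeit numexpr.evaluate('(x > 0.5) & (y < 0.5)')   # NumExpr: element by element, no full temporaries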
pandas.eval for Efficient Operations
- The eval function in Pandas uses string expressions to efficiently compute operations on DataFrame objects.
# In[5]
import pandas as pd

nrows, ncols = 100000, 100
df1, df2, df3, df4 = (pd.DataFrame(rng.random((nrows, ncols))) for i in range(4))
- To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum.
# In[6]
%timeit df1+df2+df3+df4
# Out[6]
167 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- The same result can be computed via pd.eval by constructing the expression as a string.
- The eval version of this expression is about 50% faster, while giving the same result.
# In[7]
%timeit pd.eval('df1+df2+df3+df4')
# Out[7]
73.5 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# In[8]
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))
# Out[8]
True
pd.eval supports a wide range of operations.
# In[10]
df1,df2,df3,df4,df5=(pd.DataFrame(rng.integers(0,1000,(100,3)))
for i in range(5))
Arithmetic operators
# In[11]
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
# Out[11]
True
Comparison operators
# In[12]
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
# Out[12]
True
Bitwise operators
# In[13]
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
# Out[13]
True
- Additionally, it supports the use of the literals and and or in Boolean expressions.
# In[14]
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
# Out[14]
True
Object attributes and indices
- pd.eval supports access to object attributes via the obj.attr syntax and to indexes via the obj[index] syntax.
# In[15]
result1=df2.T[0]+df3.iloc[1]
result2=pd.eval('df2.T[0]+df3.iloc[1]')
np.allclose(result1,result2)
# Out[15]
True
Other operators
- Other operations, such as function calls, conditional statements, loops, and other more involved constructs, are currently not implemented in pd.eval.
- If you'd like to execute these types of expressions, you can use the NumExpr library itself.
For more information about np.allclose, see: About np.allclose
DataFrame.eval for Column-Wise Operations
- Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval method that works in similar ways.
- The benefit of the eval method is that columns can be referred to by name.
# In[16]
df=pd.DataFrame(rng.random((1000,3)),columns=['A','B','C'])
df.head()
# Out[16]
A B C
0 0.850888 0.966709 0.958690
1 0.820126 0.385686 0.061402
2 0.059729 0.831768 0.652259
3 0.244774 0.140322 0.041711
4 0.818205 0.753384 0.578851
- By using pd.eval, we can compute expressions with the three columns.
# In[17]
result1=(df['A']+df['B'])/(df['C']-1)
result2=pd.eval("(df.A+df.B)/(df.C-1)")
np.allclose(result1,result2)
# Out[17]
True
- With the DataFrame.eval method, we can treat column names as variables within the evaluated expression, and the result is what we would wish; a sketch of this follows.
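- A minimal sketch of the same computation done with DataFrame.eval itself, where the columns are referred to by bare name rather than through the df. prefix:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)   # True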
Assignment in DataFrame.eval
- DataFrame.eval also allows assignment to any column.
- We can use df.eval to create a new column 'D' and assign to it a value computed from the other columns.
- If inplace=True, the assignment modifies the original DataFrame; with the default inplace=False, a new DataFrame containing the result is returned instead.
# In[18]
df.eval('D=(A+B)/C',inplace=True)
df.head()
# Out[18]
A B C D
0 0.850888 0.966709 0.958690 1.895916
1 0.820126 0.385686 0.061402 19.638139
2 0.059729 0.831768 0.652259 1.366782
3 0.244774 0.140322 0.041711 9.232370
4 0.818205 0.753384 0.578851 2.715013
- In the same way, any existing column can be modified.
# In[19]
df.eval('D=(A-B)/C',inplace=True)
df.head()
# Out[19]
A B C D
0 0.850888 0.966709 0.958690 -0.120812
1 0.820126 0.385686 0.061402 7.075399
2 0.059729 0.831768 0.652259 -1.183638
3 0.244774 0.140322 0.041711 2.504142
4 0.818205 0.753384 0.578851 0.111982
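- DataFrame.eval also accepts a multi-line expression, so several columns can be computed in one call; a minimal sketch (multiple assignment requires a reasonably recent pandas version), shown with the default inplace=False so that df itself is left unchanged:
df.eval('''
E = (A + B) / C
F = (A - B) / C
''').head()   # returns a new DataFrame with the extra columns E and F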
Local Variables in DataFrame.eval
- The DataFrame.eval method supports an additional syntax that lets it work with local Python variables.
# In[20]
column_mean = df.mean(axis=1)
result1=df['A']+column_mean
result2=df.eval('A+@column_mean')
np.allclose(result1,result2)
# Out[20]
True
- The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
- This @ character is only supported by the DataFrame.eval method, not by the pandas.eval function, because pandas.eval only has access to the one (Python) namespace.
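- A quick way to see the difference between the two namespaces; this is only a sketch, and the exact exception raised for the unsupported @ prefix may vary with the pandas version:
try:
    pd.eval('df.A + @column_mean')              # top-level pd.eval does not accept the @ prefix
except Exception as err:
    print('pd.eval rejected the expression:', err)

df.eval('A + @column_mean').head()              # DataFrame.eval resolves @column_mean from the local namespace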
The DataFrame.query Method
- The DataFrame has another method based on evaluated strings, called query.
# In[21]
result1=df[(df.A < 0.5) & (df.B < 0.5)]
result2=pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1,result2)
# Out[21]
True
- As with the example used in our discussion of DataFrame.eval, this is an expression involving columns of the DataFrame.
- However, it cannot be expressed using the DataFrame.eval syntax.
- Instead, for this type of filtering operation, you can use the query method.
# In[22]
result2=df.query('A < 0.5 and B < 0.5')
np.allclose(result1,result2)
# Out[22]
True
- In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand.
- The query method also accepts the @ flag to mark local variables.
# In[23]
Cmean=df['C'].mean()
result1=df[(df.A < Cmean) & (df.B < Cmean)]
result2=df.query('A < @Cmean and B < @Cmean')
np.allclose(result1,result2)
# Out[23]
True
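- Under the hood, query is essentially a DataFrame.eval of the Boolean expression followed by masking, so the following sketch selects the same rows:
result3 = df[df.eval('A < @Cmean and B < @Cmean')]
np.allclose(result2, result3)   # True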
Performance: When to Use These Functions
- When considering whether to use eval and query, there are two considerations: computation time and memory use.
- Every compound expression involving Numpy arrays or Pandas DataFrames will result in the implicit creation of temporary arrays.
- If the size of the temporary DataFrames is significant compared to your available system memory, then it's a good idea to use an eval or query expression.
- You can check the approximate size of your array in bytes like this:
# In[24]
df.values.nbytes
# Out[24]
32000
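- For a DataFrame, df.memory_usage() gives a per-column breakdown in bytes that also accounts for the index; a small sketch:
df.memory_usage()         # bytes used by the index and by each column
df.memory_usage().sum()   # total; slightly larger than df.values.nbytes because the index is included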
- On the performance side, eval can be faster even when you are not maxing out your system memory.
- The difference in computation time between the traditional methods and the eval/query approach is usually not significant.
- The benefit of eval/query is mainly in the saved memory, and in the sometimes cleaner syntax they offer.
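- Note that these functions rely on the NumExpr engine when it is installed; you can also select the engine explicitly, as in this sketch (if NumExpr is missing, pandas falls back to the slower pure-Python engine):
pd.eval('df1 + df2 + df3 + df4', engine='numexpr')   # the default whenever NumExpr is available
pd.eval('df1 + df2 + df3 + df4', engine='python')    # pure-Python evaluation, no NumExpr speedup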
For more information on eval/query, you can refer to the Pandas documentation.