12. High-Performance Pandas: eval and query

2025. 6. 19. 00:21 · Python/Pandas

  • The power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into lower-level compiled code via an intuitive higher-level syntax.
  • While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue (excessive) overhead in computational time and memory use.
  • To address this, Pandas includes methods that let you access C-speed operations directly, without the costly allocation of intermediate arrays: eval and query.

Motivating query and eval: Compound Expressions

# In[1]
import numpy as np
rng=np.random.default_rng(42)
x=rng.random(1000000)
y=rng.random(1000000)
%timeit x+y
# Out[1]
4.26 ms ± 915 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • This is much faster than doing the addition via a Python loop or comprehension.
  • But this abstraction can become less efficient when computing compound expressions.
# In[2]
mask=(x>0.5)&(y<0.5)
  • Because NumPy evaluates each subexpression separately, this is roughly equivalent to the following code.
# In[3]
tmp1=(x>0.5)
tmp2=(y<0.5)
mask=tmp1&tmp2
  • In other words, every intermediate step is explicitly allocated in memory.
  • If the x and y arrays are very large, this can lead to significant memory and computational overhead.
  • The NumExpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
# In[4]
import numexpr
mask_numexpr=numexpr.evaluate('(x>0.5)&(y<0.5)')
np.all(mask==mask_numexpr)
# Out[4]
True
  • The benefit here is that NumExpr evaluates the expression in a way that avoids temporary arrays where possible, and thus can be much more efficient than NumPy, especially for long sequences of computations on large arrays.
  • The Pandas eval and query tools are essentially Pandas-specific wrappers of NumExpr functionality.
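  • To make the cost of the intermediate arrays concrete, here is a minimal sketch that repeats the unrolled masking computation and adds up the memory held by the temporaries. The name temp_bytes is introduced here only for illustration; the arrays are built the same way as above.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# Each comparison materializes a full Boolean array before & is applied.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Boolean arrays use one byte per element, so the two temporaries
# together occupy twice the size of the final mask.
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(temp_bytes, mask.nbytes)
```

  • With float64 intermediates (for example in arithmetic expressions), each temporary would be eight times larger than these Boolean ones.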

pandas.eval for Efficient Operations

  • The eval function in Pandas uses string expressions to efficiently compute operations on DataFrame objects.
# In[5]
import pandas as pd
nrows,ncols=100000,100
df1,df2,df3,df4=(pd.DataFrame(rng.random((nrows,ncols))) for _ in range(4))
  • To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum.
# In[6]
%timeit df1+df2+df3+df4
# Out[6]
167 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • The same result can be computed via pd.eval by constructing the expression as a string.
  • In the timings recorded here, the eval version takes less than half the time of the plain expression (73.5 ms versus 167 ms), while giving the same result.
# In[7]
%timeit pd.eval('df1+df2+df3+df4')
# Out[7]
73.5 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# In[8]
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))
# Out[8]
True
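  • As a hedged aside, pd.eval accepts an engine argument. According to the Pandas documentation, the default tries NumExpr and falls back to plain Python evaluation when NumExpr is not installed. The sketch below forces engine='python' on smaller frames (the sizes are chosen here only to keep it quick) to show that the result is the same either way; only speed and memory behavior should differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.random((1000, 10))) for _ in range(4))

expected = df1 + df2 + df3 + df4

# engine="python" evaluates the expression with ordinary pandas
# operations, so it works even when NumExpr is not installed.
result_python = pd.eval("df1 + df2 + df3 + df4", engine="python")

same = np.allclose(expected, result_python)
print(same)
```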
  • pd.eval supports a wide range of operations.
# In[10]
df1,df2,df3,df4,df5=(pd.DataFrame(rng.integers(0,1000,(100,3)))
                     for _ in range(5))
  1. Arithmetic operators

    # In[11]
    result1=-df1*df2/(df3+df4)-df5
    result2=pd.eval('-df1*df2/(df3+df4)-df5')
    np.allclose(result1,result2)
    # Out[11]
    True
  2. Comparison operators

    # In[12]
    result1=(df1<df2)&(df2<=df3)&(df3!=df4)
    result2=pd.eval('df1<df2<=df3!=df4')
    np.allclose(result1,result2)
    # Out[12]
    True
  3. Bitwise operators

    # In[13]
    result1=(df1<0.5)&(df2<0.5)|(df3<df4)
    result2=pd.eval('(df1<0.5)&(df2<0.5)|(df3<df4)')
    np.allclose(result1,result2)
    # Out[13]
    True
  • Additionally, it supports the use of the literal keywords 'and' and 'or' in Boolean expressions.
# In[14]
result3=pd.eval('(df1<0.5) and (df2<0.5) or (df3<df4)')
np.allclose(result1,result3)
# Out[14]
True
  4. Object attributes and indices
  • pd.eval supports access to object attributes via the obj.attr syntax and indexes via the obj[index] syntax.
# In[15]
result1=df2.T[0]+df3.iloc[1]
result2=pd.eval('df2.T[0]+df3.iloc[1]')
np.allclose(result1,result2)
# Out[15]
True
  5. Other operators
  • Other operations, such as arbitrary function calls, conditional statements, loops, and other more involved constructs, are currently not implemented in pd.eval.
  • If you'd like to execute these types of expressions, you can use the NumExpr library itself.

For more on np.allclose, see the related post: About np.allclose

DataFrame.eval for Column-Wise Operations

  • Just as Pandas has a top-level pd.eval function, DataFrame objects have an eval method that works in similar ways.
  • The benefit of the eval method is that columns can be referred to by name.
# In[16]
df=pd.DataFrame(rng.random((1000,3)),columns=['A','B','C'])
df.head()
# Out[16]
           A           B           C
0    0.850888    0.966709    0.958690
1    0.820126    0.385686    0.061402
2    0.059729    0.831768    0.652259
3    0.244774    0.140322    0.041711
4    0.818205    0.753384    0.578851
  • By using pd.eval, we can compute expressions with the three columns.
# In[17]
result1=(df['A']+df['B'])/(df['C']-1)
result2=pd.eval("(df.A+df.B)/(df.C-1)")
np.allclose(result1,result2)
# Out[17]
True
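  • With DataFrame.eval, column names are treated as variables within the evaluated expression, so the same computation can be written more succinctly as df.eval('(A+B)/(C-1)'), and the result is what we would expect.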
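  • The column-name shorthand of DataFrame.eval can be sketched as follows. This is a minimal, self-contained version that rebuilds a DataFrame with the same shape and column names as above; result3 is a name chosen here for the df.eval result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

# Explicit column access with ordinary pandas operations.
result1 = (df["A"] + df["B"]) / (df["C"] - 1)

# Inside df.eval, bare names A, B, C refer to the columns of df.
result3 = df.eval("(A + B) / (C - 1)")

same = np.allclose(result1, result3)
print(same)
```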

Assignment in DataFrame.eval

  • DataFrame.eval also allows assignment to any column.
  • We can use df.eval to create a new column 'D' and assign to it a value computed from the other columns.
  • With inplace=True, the original DataFrame is modified. With inplace=False (the default), the original is left unchanged and a new DataFrame containing the result is returned.
# In[18]
df.eval('D=(A+B)/C',inplace=True)
df.head()
# Out[18]
           A           B           C            D
0    0.850888    0.966709    0.958690     1.895916
1    0.820126    0.385686    0.061402    19.638139
2    0.059729    0.831768    0.652259     1.366782
3    0.244774    0.140322    0.041711     9.232370
4    0.818205    0.753384    0.578851     2.715013
  • In the same way, any existing column can be modified.
# In[19]
df.eval('D=(A-B)/C',inplace=True)
df.head()
# Out[19]
           A           B           C            D
0    0.850888    0.966709    0.958690    -0.120812
1    0.820126    0.385686    0.061402     7.075399
2    0.059729    0.831768    0.652259    -1.183638
3    0.244774    0.140322    0.041711     2.504142
4    0.818205    0.753384    0.578851     0.111982
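  • The difference between the two inplace settings can be checked with a small sketch. The three-column frame below uses made-up values chosen so the result is easy to verify by hand; new_df and returned are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [2.0, 4.0]})

# Default (inplace=False): a new DataFrame is returned, df is untouched.
new_df = df.eval("D = (A + B) / C")
print("D" in df.columns, "D" in new_df.columns)

# inplace=True: df itself gains the column and the call returns None.
returned = df.eval("D = (A + B) / C", inplace=True)
print(returned is None, df["D"].tolist())
```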

Local Variables in DataFrame.eval

  • The DataFrame.eval method supports an additional syntax that lets it work with local Python variables.
# In[20]
column_mean=df.mean(axis=1)
result1=df['A']+column_mean
result2=df.eval('A+@column_mean')
np.allclose(result1,result2)
# Out[20]
True
  • The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
  • This @ character is only supported by the DataFrame.eval method, not by the pandas.eval function, because the pandas.eval function only has access to one namespace (the Python namespace).
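  • A minimal sketch of the two namespaces: in df.eval the local variable needs the @ prefix, while in pd.eval every name already refers to a Python variable, so the DataFrame is spelled out and no prefix is used. The names via_method and via_function are chosen here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])
column_mean = df.mean(axis=1)

# DataFrame.eval: bare names are columns, @names are Python variables.
via_method = df.eval("A + @column_mean")

# pd.eval: every name is a Python variable, so no @ prefix is used.
via_function = pd.eval("df.A + column_mean")

same = np.allclose(via_method, via_function)
print(same)
```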

The DataFrame.query Method

  • The DataFrame has another method based on evaluated strings, called query.
# In[21]
result1=df[(df.A < 0.5) & (df.B < 0.5)]
result2=pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1,result2)
# Out[21]
True
  • As with the example used in our discussion of DataFrame.eval, this is an expression involving columns of the DataFrame.
  • However, this filtering operation cannot be expressed using the DataFrame.eval syntax.
  • Instead, for this type of filtering operation, you can use the query method.
# In[22]
result2=df.query('A < 0.5 and B < 0.5')
np.allclose(result1,result2)
# Out[22]
True
  • In addition to being a more efficient computation, this is much easier to read and understand than the masking expression.
  • The query method also accepts the @ flag to mark local variables.
# In[23]
Cmean=df['C'].mean()
result1=df[(df.A < Cmean) & (df.B < Cmean)]
result2=df.query('A < @Cmean and B < @Cmean')
np.allclose(result1,result2)
# Out[23]
True
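  • The filtering behavior of query can be verified by hand on a tiny frame. The values below are made up so that only the first row passes both conditions, and threshold is a hypothetical local variable referenced with @.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.7, 0.3, 0.9], "B": [0.2, 0.1, 0.8, 0.6]})
threshold = 0.5  # hypothetical local variable, referenced with @ below

# Boolean-mask version and query version of the same filter.
masked = df[(df.A < threshold) & (df.B < threshold)]
queried = df.query("A < @threshold and B < @threshold")

# DataFrame.equals compares the two frames as a whole, a stricter
# check than np.allclose on the raw values.
print(masked.equals(queried), queried.index.tolist())
```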

Performance: When to Use These Functions

  • When considering whether to use eval and query, there are two considerations: computation time and memory use.
  • Every compound expression involving NumPy arrays or Pandas DataFrames will result in the implicit creation of temporary arrays.
  • If the size of the temporary DataFrames is significant compared to your available system memory, then it's a good idea to use an eval or query expression.
  • You can check the approximate size of your array in bytes like this.
# In[24]
df.values.nbytes
# Out[24]
32000
  • On the performance side, eval can be faster even when you are not maxing out your system memory.
  • In practice, however, the difference in computation time between the traditional methods and the eval/query methods is usually not significant.
  • The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.
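  • As a rough back-of-the-envelope sketch of the memory consideration, the size of one frame can be multiplied by the number of temporaries an expression is expected to create. The count of two temporaries for a four-term sum is an assumption made here for illustration (one per intermediate addition), not a measured value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 4)), columns=["A", "B", "C", "D"])

# 1000 rows x 4 float64 columns x 8 bytes each.
frame_bytes = df.to_numpy().nbytes

# Rough assumption: a chained sum of four such frames, f1 + f2 + f3 + f4,
# creates two intermediate results before the final one.
n_temporaries = 2
temp_bytes = n_temporaries * frame_bytes
print(frame_bytes, temp_bytes)
```

  • If an estimate like temp_bytes is small compared to available memory, the traditional expression is usually fine; if it is large, eval or query becomes more attractive.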

For more information on eval and query, you can refer to the following Pandas documentation:

  1. pandas.eval
  2. pandas.DataFrame.eval
  3. pandas.DataFrame.query
