Donut Logo Data Donut

Using DuckDB with Pandas DataFrames

Thrown together by Austin on 2024-09-13.

If you’ve worked with data in python, you may have used pandas DataFrames before:

import pandas as pd

df = pd.DataFrame({'u': [1, 10, 100, 100], 'v': [1, 2, 4, 8]})
df
Indexuv
011
1102
21004
31008

DuckDB can query dataframes directly from memory.

import duckdb

q = '''
SELECT
  MEAN(u) AS mean_u,
  MEAN(v) AS mean_v,
FROM df;
'''

result = duckdb.query(q)
result

Result:

┌────────┬────────┐
│ mean_u │ mean_v │
│ double │ double │
├────────┼────────┤
│ 52.75  │  3.75  │
└────────┴────────┘

This is pretty neat. Perhaps more useful, DuckDB query results can be immediatedly converted into DataFrames.

result = duckdb.query('''
SELECT
  *
FROM 'data/iris.csv' AS t;
''')
result_df = result.df()
result_df
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
5.13.51.40.2se
4.93.01.40.2setosa
4.73.21.30.2setosa
4.63.11.50.2setosa
5.03.61.40.2setosa
6.73.05.22.3virginica
6.32.55.01.9virginica
6.53.05.22.0virginica
6.23.45.42.3virginica
5.93.05.11.8virginica

This provides the user the opportunity to do analysis on the data with pandas.

pedal_width_stats_df = result_df.groupby('species')['petal_length'].agg(['min', 'median', 'mean', 'max'])
pedal_width_stats_df
speciesminmedianmeanmax
se1.41.401.4000001.4
setosa1.01.501.4653061.9
versicolor3.04.354.2600005.1
virginica4.55.555.5520006.9

Or, if preferred, we can stick with SQL.

result = duckdb.query('''
WITH Cte AS (
  SELECT * FROM 'data/iris.csv'
)
SELECT
    species,
    MIN(petal_length) AS min,
    MEDIAN(petal_length) AS median,
    MEAN(petal_length) AS mean,
    MAX(petal_length) AS max,
FROM Cte
GROUP BY species
''')
result
┌────────────┬────────┬────────┬────────────────────┬────────┐
│  species   │  min   │ median │        mean        │  max   │
│  varchar   │ double │ double │       double       │ double │
├────────────┼────────┼────────┼────────────────────┼────────┤
│ se         │    1.4 │    1.4 │                1.4 │    1.4 │
│ setosa     │    1.0 │    1.5 │ 1.4653061224489796 │    1.9 │
│ versicolor │    3.0 │   4.35 │               4.26 │    5.1 │
│ virginica  │    4.5 │   5.55 │              5.552 │    6.9 │
└────────────┴────────┴────────┴────────────────────┴────────┘

The interoperability between pandas and DuckDB provides a lot of flexibility. You can try it out in a JupyterLite notebook right in your browser.