Using DuckDB with Pandas DataFrames
Thrown together by Austin on 2024-09-13.
If you’ve worked with data in python, you may have used pandas DataFrames before:
import pandas as pd
df = pd.DataFrame({'u': [1, 10, 100, 100], 'v': [1, 2, 4, 8]})
df
Index | u | v |
---|---|---|
0 | 1 | 1 |
1 | 10 | 2 |
2 | 100 | 4 |
3 | 100 | 8 |
DuckDB can query dataframes directly from memory.
import duckdb
q = '''
SELECT
MEAN(u) AS mean_u,
MEAN(v) AS mean_v,
FROM df;
'''
result = duckdb.query(q)
result
Result:
┌────────┬────────┐
│ mean_u │ mean_v │
│ double │ double │
├────────┼────────┤
│ 52.75 │ 3.75 │
└────────┴────────┘
This is pretty neat. Perhaps more useful, DuckDB query results can be immediatedly converted into DataFrames.
result = duckdb.query('''
SELECT
*
FROM 'data/iris.csv' AS t;
''')
result_df = result.df()
result_df
sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | se |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | virginica |
6.3 | 2.5 | 5.0 | 1.9 | virginica |
6.5 | 3.0 | 5.2 | 2.0 | virginica |
6.2 | 3.4 | 5.4 | 2.3 | virginica |
5.9 | 3.0 | 5.1 | 1.8 | virginica |
This provides the user the opportunity to do analysis on the data with pandas.
pedal_width_stats_df = result_df.groupby('species')['petal_length'].agg(['min', 'median', 'mean', 'max'])
pedal_width_stats_df
species | min | median | mean | max |
---|---|---|---|---|
se | 1.4 | 1.40 | 1.400000 | 1.4 |
setosa | 1.0 | 1.50 | 1.465306 | 1.9 |
versicolor | 3.0 | 4.35 | 4.260000 | 5.1 |
virginica | 4.5 | 5.55 | 5.552000 | 6.9 |
Or, if preferred, we can stick with SQL.
result = duckdb.query('''
WITH Cte AS (
SELECT * FROM 'data/iris.csv'
)
SELECT
species,
MIN(petal_length) AS min,
MEDIAN(petal_length) AS median,
MEAN(petal_length) AS mean,
MAX(petal_length) AS max,
FROM Cte
GROUP BY species
''')
result
┌────────────┬────────┬────────┬────────────────────┬────────┐
│ species │ min │ median │ mean │ max │
│ varchar │ double │ double │ double │ double │
├────────────┼────────┼────────┼────────────────────┼────────┤
│ se │ 1.4 │ 1.4 │ 1.4 │ 1.4 │
│ setosa │ 1.0 │ 1.5 │ 1.4653061224489796 │ 1.9 │
│ versicolor │ 3.0 │ 4.35 │ 4.26 │ 5.1 │
│ virginica │ 4.5 │ 5.55 │ 5.552 │ 6.9 │
└────────────┴────────┴────────┴────────────────────┴────────┘
The interoperability between pandas and DuckDB provides a lot of flexibility. You can try it out in a JupyterLite notebook right in your browser.