# /// script | |
# requires-python = ">=3.11" | |
# dependencies = [ | |
# "marimo", | |
# "duckdb==1.2.2", | |
# "polars==1.27.0", | |
# "numpy==2.2.4", | |
# "pyarrow==19.0.1", | |
# "pandas==2.2.3", | |
# "sqlglot==26.12.1", | |
# "plotly==5.23.1", | |
# "statsmodels==0.14.4", | |
# ] | |
# /// | |
import marimo | |
__generated_with = "0.13.4" | |
app = marimo.App(width="medium") | |
def _(mo): | |
mo.md( | |
rf""" | |
<p align="center"> | |
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/> | |
</p> | |
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
# 🦆 **DuckDB**: An Embeddable Analytical Database System | |
## What is DuckDB? | |
[DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server. | |
--- | |
## ⚡ Key Features | |
| Feature | Description | | |
|:---------|:-------------| | |
| **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment | | |
| **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries | | |
| **Vectorized Execution** | Performs operations on entire columns at once, significantly speeding up data processing | | |
| **ACID Transactions** | Ensures data integrity and reliability across operations | | |
| **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more | | |
        | **Zero External Dependencies** | The engine ships as a single self-contained library with no external dependencies, making setup and deployment straightforward |
| **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures | | |
--- | |
## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file): | |
- **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis. | |
- **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server. | |
- **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline. | |
- **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models. | |
- **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment. | |
- **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server. | |
--- | |
### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python): | |
- Python installation: | |
``` | |
pip install duckdb | |
``` | |
``` | |
        conda install python-duckdb -c conda-forge
``` | |
/// attention | Note | |
        DuckDB works with any reasonably recent Python 3 (this notebook pins Python 3.11+ in its script metadata). You also need `pip` or `conda` available to install it.
/// | |
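        ---

        Once installed, a quick smoke test can be run straight from Python. This is a minimal sketch (separate from the notebook cells below) that uses DuckDB's default in-memory database:

        ```python
        import duckdb

        # runs against the default in-memory database; no file or server is involved
        duckdb.sql("SELECT 42 AS answer").show()
        ```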
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html) | |
DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage. | |
--- | |
| Feature | In-Memory Connection | File-Based Connection | | |
|:---------|:---------------------|:----------------------| | |
| Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) | | |
| Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later | | |
| Performance | Faster for most operations | Slightly slower but provides persistence | | |
        | Creation | `duckdb.connect(':memory:')` | `duckdb.connect('filename.db')` |
| Multiple Connection Access | Limited to single connection | Multiple connections can access the same database | | |
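
        As a minimal sketch of the two connection styles (plus the default), assuming only that `duckdb` is installed:

        ```python
        import duckdb

        mem_con = duckdb.connect(":memory:")       # temporary, lives only in this process
        file_con = duckdb.connect("analytics.db")  # persisted to a file on disk
        default_con = duckdb.connect()             # no argument also yields an in-memory database
        ```

        The file name `analytics.db` above is only an illustration; the cells below use `example.db`.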
""" | |
) | |
return | |
def _(os): | |
# Remove previous database if it exists | |
if os.path.exists("example.db"): | |
os.remove("example.db") | |
if not os.path.exists("data"): | |
os.makedirs("data") | |
return | |
def _(mo): | |
_df = mo.sql( | |
f""" | |
-- Print the DuckDB version | |
SELECT version() AS version_info | |
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
""" | |
## Creating DuckDB Connections | |
Let's create both types of DuckDB connections and explore their characteristics. | |
1. **In-memory connection**: Data exists only during the current session | |
2. **File-based connection**: Data persists between sessions | |
We'll then demonstrate the key differences between these connection types. | |
""" | |
) | |
return | |
def _(duckdb): | |
# Create an in-memory DuckDB connection | |
memory_db = duckdb.connect(":memory:") | |
# Create a file-based DuckDB connection | |
file_db = duckdb.connect("example.db") | |
return file_db, memory_db | |
def _(file_db, memory_db): | |
# Test both connections | |
memory_db.execute( | |
"CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)" | |
) | |
memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')") | |
file_db.execute( | |
"CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)" | |
) | |
file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')") | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
## Testing Connection Persistence | |
Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist. | |
1. First, we'll query our tables to confirm the data was properly inserted | |
2. Then, we'll simulate an application restart by creating new connections | |
3. Finally, we'll check which data persists after the "restart" | |
""" | |
) | |
return | |
def _(mo): | |
mo.md(r"""## Current Database Contents""") | |
return | |
def _(mem_test, memory_db, mo): | |
_df = mo.sql( | |
f""" | |
SELECT * FROM mem_test | |
""", | |
engine=memory_db | |
) | |
return | |
def _(file_db, file_test, mo): | |
_df = mo.sql( | |
f""" | |
SELECT * FROM file_test | |
""", | |
engine=file_db | |
) | |
return | |
def _(): | |
# We don't actually close the connections here since we need them for later cells | |
# Just a placeholder for the concept | |
return | |
def _(mo): | |
mo.md(rf"""## 🔄 Simulating Application Restart...""") | |
return | |
def _(duckdb): | |
# Create new connections (simulating restart) | |
new_memory_db = duckdb.connect(":memory:") | |
new_file_db = duckdb.connect("example.db") | |
return new_file_db, new_memory_db | |
def _(new_memory_db): | |
# Try to query tables in the new memory connection | |
try: | |
new_memory_db.execute("SELECT * FROM mem_test").df() | |
memory_persistence = "✅ Data persisted in memory (unexpected)" | |
memory_data_available = True | |
except Exception as e: | |
memory_persistence = "❌ Data lost from memory (expected behavior)" | |
memory_data_available = False | |
return memory_data_available, memory_persistence | |
def _(new_file_db): | |
# Try to query tables in the new file connection | |
try: | |
file_data = new_file_db.execute("SELECT * FROM file_test").df() | |
file_persistence = "✅ Data persisted in file (expected behavior)" | |
file_data_available = True | |
except Exception as e: | |
file_persistence = "❌ Data lost from file (unexpected)" | |
file_data_available = False | |
file_data = None | |
return file_data, file_data_available, file_persistence | |
def _( | |
file_data_available, | |
file_persistence, | |
memory_data_available, | |
memory_persistence, | |
mo, | |
): | |
# Create an interactive display to show persistence results | |
persistence_results = mo.ui.table( | |
{ | |
"Connection Type": ["In-Memory Database", "File-Based Database"], | |
"Persistence Status": [memory_persistence, file_persistence], | |
"Data Available After Restart": [ | |
memory_data_available, | |
file_data_available, | |
], | |
} | |
) | |
return (persistence_results,) | |
def _(mo, persistence_results): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## Persistence Test Results""")], align="center"), | |
persistence_results, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(file_data, file_data_available, mo): | |
if file_data_available: | |
mo.md("### Persisted File-Based Data:") | |
mo.ui.table(file_data) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html) | |
DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints. | |
## Table Creation Options | |
DuckDB supports various table creation options, including: | |
- **Basic tables** with column definitions | |
- **Temporary tables** that exist only during the session | |
- **CREATE OR REPLACE** to recreate tables | |
- **Primary keys** and other constraints | |
- **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc. | |
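
        Temporary tables and `CREATE OR REPLACE` are not used in the cells below, so here is a small illustrative sketch (assuming an existing connection named `con`):

        ```python
        # session-scoped scratch table; dropped automatically when the connection closes
        con.execute("CREATE TEMP TABLE staging (id INTEGER, note VARCHAR)")

        # recreate a table from a query result, replacing any previous definition
        con.execute("CREATE OR REPLACE TABLE numbers AS SELECT * FROM range(5)")
        ```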
""" | |
) | |
return | |
def _(file_db, new_memory_db):
    # DROP TABLE IF EXISTS is already a no-op when the table is absent,
    # so no try/except wrapper is needed
    new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
    file_db.execute("DROP TABLE IF EXISTS users_file")
    return
def _(file_db, new_memory_db): | |
# Create advanced users table in memory database with primary key | |
new_memory_db.execute(""" | |
CREATE TABLE users_memory ( | |
id INTEGER PRIMARY KEY, | |
name VARCHAR NOT NULL, | |
age INTEGER CHECK (age > 0), | |
email VARCHAR UNIQUE, | |
registration_date DATE DEFAULT CURRENT_DATE, | |
last_login TIMESTAMP, | |
account_balance DECIMAL(10,2) DEFAULT 0.00 | |
) | |
""") | |
# Create users table in file database | |
file_db.execute(""" | |
CREATE TABLE users_file ( | |
id INTEGER PRIMARY KEY, | |
name VARCHAR NOT NULL, | |
age INTEGER CHECK (age > 0), | |
email VARCHAR UNIQUE, | |
registration_date DATE DEFAULT CURRENT_DATE, | |
last_login TIMESTAMP, | |
account_balance DECIMAL(10,2) DEFAULT 0.00 | |
) | |
""") | |
return | |
def _(new_memory_db): | |
# Get table schema information using DuckDB's internal system tables | |
memory_schema = new_memory_db.execute(""" | |
SELECT column_name, data_type, is_nullable | |
FROM information_schema.columns | |
WHERE table_name = 'users_memory' | |
ORDER BY ordinal_position | |
""").df() | |
return (memory_schema,) | |
def _(memory_schema, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## 🔍 Table Schema Information """)], align="center" | |
), | |
mo.ui.table(memory_schema), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert) | |
DuckDB supports multiple ways to insert data: | |
1. **INSERT INTO VALUES**: Insert specific values | |
2. **INSERT INTO SELECT**: Insert data from query results | |
3. **Parameterized inserts**: Using prepared statements | |
4. **Bulk inserts**: For efficient loading of multiple rows | |
Let's demonstrate these different insertion methods: | |
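
        Of these, the cells below focus on parameterized single-row and bulk inserts. A small illustrative sketch of `INSERT INTO ... SELECT` and `executemany` (the table names are hypothetical, `con` is an existing connection):

        ```python
        # INSERT INTO ... SELECT: copy query results into another table
        con.execute("INSERT INTO users_archive SELECT * FROM users_memory WHERE age >= 40")

        # parameterized bulk insert via executemany
        rows = [(101, "Ivy"), (102, "Jack")]
        con.executemany("INSERT INTO simple_users (id, name) VALUES (?, ?)", rows)
        ```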
""" | |
) | |
return | |
def _(date): | |
today = date.today() | |
# First check if records already exist to avoid duplicate key errors | |
def safe_insert(connection, table_name, data): | |
""" | |
Safely insert data into a table by checking for existing IDs first | |
""" | |
# Check which IDs already exist in the table | |
existing_ids = ( | |
connection.execute(f"SELECT id FROM {table_name}") | |
.fetchdf()["id"] | |
.tolist() | |
) | |
# Filter out data with IDs that already exist | |
new_data = [record for record in data if record[0] not in existing_ids] | |
if not new_data: | |
print( | |
f"No new records to insert into {table_name}. All IDs already exist." | |
) | |
return 0 | |
# Prepare the placeholders for the SQL statement | |
placeholders = ", ".join( | |
["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data) | |
) | |
# Flatten the list of tuples for parameter binding | |
flat_data = [item for sublist in new_data for item in sublist] | |
# Perform the insertion | |
if flat_data: | |
columns = "(id, name, age, email, registration_date, last_login, account_balance)" | |
connection.execute( | |
f"INSERT INTO {table_name} {columns} VALUES {placeholders}", | |
flat_data, | |
) | |
return len(new_data) | |
return 0 | |
return (safe_insert,) | |
def _(): | |
# Prepare the data | |
user_data = [ | |
( | |
1, | |
"Alice", | |
25, | |
"[email protected]", | |
"2021-01-01", | |
"2023-01-15 14:30:00", | |
1250.75, | |
), | |
( | |
2, | |
"Bob", | |
30, | |
"[email protected]", | |
"2021-02-01", | |
"2023-02-10 09:15:22", | |
750.50, | |
), | |
( | |
3, | |
"Charlie", | |
35, | |
"[email protected]", | |
"2021-03-01", | |
"2023-03-05 17:45:10", | |
3200.25, | |
), | |
( | |
4, | |
"David", | |
40, | |
"[email protected]", | |
"2021-04-01", | |
"2023-04-20 10:30:45", | |
1800.00, | |
), | |
( | |
5, | |
"Emma", | |
45, | |
"[email protected]", | |
"2021-05-01", | |
"2023-05-12 11:20:30", | |
2500.00, | |
), | |
( | |
6, | |
"Frank", | |
50, | |
"[email protected]", | |
"2021-06-01", | |
"2023-06-18 16:10:15", | |
900.25, | |
), | |
] | |
return (user_data,) | |
def _(file_db, new_memory_db, safe_insert, user_data): | |
# Safely insert data into memory database | |
safe_insert(new_memory_db, "users_memory", user_data) | |
# Safely insert data into file database | |
safe_insert(file_db, "users_file", user_data) | |
return | |
def _(): | |
# If you need to add just one record, you can use a similar approach: | |
new_user = ( | |
7, | |
"Grace", | |
28, | |
"[email protected]", | |
"2021-07-01", | |
"2023-07-22 13:45:10", | |
1675.50, | |
) | |
return (new_user,) | |
def _(new_memory_db, new_user): | |
# Check if the ID exists before inserting | |
if not new_memory_db.execute( | |
"SELECT id FROM users_memory WHERE id = ?", [new_user[0]] | |
).fetchone(): | |
new_memory_db.execute( | |
""" | |
INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
new_user, | |
) | |
print(f"Added user {new_user[1]} to users_memory") | |
else: | |
print(f"User with ID {new_user[0]} already exists in users_memory") | |
return | |
def _(file_db, new_user): | |
# Do the same for the file database | |
if not file_db.execute( | |
"SELECT id FROM users_file WHERE id = ?", [new_user[0]] | |
).fetchone(): | |
file_db.execute( | |
""" | |
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
new_user, | |
) | |
print(f"Added user {new_user[1]} to users_file") | |
else: | |
print(f"User with ID {new_user[0]} already exists in users_file") | |
return | |
def _(new_memory_db):
    # First try to update. DuckDB's Python API does not populate a reliable
    # rowcount, but an UPDATE returns the number of affected rows as its result,
    # so we fetch that count instead.
    rows_updated = new_memory_db.execute(
        """
        UPDATE users_memory
        SET name = ?, age = ?, email = ?,
            registration_date = ?, last_login = ?, account_balance = ?
        WHERE id = ?
        """,
        (
            "Henry",
            33,
            "henry@example.com",
            "2021-08-01",
            "2023-08-05 09:10:15",
            3100.75,
            8,  # id must be the last parameter to match the WHERE clause
        ),
    ).fetchone()[0]
    return (rows_updated,)
def _(mo, new_memory_db, rows_updated):
    # If no rows were updated, perform an insert instead (a manual upsert)
    if rows_updated == 0:
new_memory_db.execute( | |
""" | |
INSERT INTO users_memory | |
(id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
( | |
8, | |
"Henry", | |
33, | |
"[email protected]", | |
"2021-08-01", | |
"2023-08-05 09:10:15", | |
3100.75, | |
), | |
) | |
mo.md( | |
f""" | |
Upserted Henry into users_memory. | |
""" | |
) | |
return | |
def _(file_db, mo): | |
# For DuckDB using ON CONFLICT, we need to specify the conflict target column | |
file_db.execute( | |
""" | |
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
ON CONFLICT (id) DO UPDATE SET | |
name = EXCLUDED.name, | |
age = EXCLUDED.age, | |
email = EXCLUDED.email, | |
registration_date = EXCLUDED.registration_date, | |
last_login = EXCLUDED.last_login, | |
account_balance = EXCLUDED.account_balance | |
""", | |
( | |
8, | |
"Henry", | |
33, | |
"[email protected]", | |
"2021-08-01", | |
"2023-08-05 09:10:15", | |
3100.75, | |
), | |
) | |
mo.md( | |
f""" | |
Upserted Henry into users_file. | |
""" | |
) | |
return | |
def _(new_memory_db): | |
# Display memory data using DuckDB's query capabilities | |
memory_results = new_memory_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
registration_date, | |
last_login, | |
account_balance | |
FROM users_memory | |
ORDER BY id | |
""").df() | |
return (memory_results,) | |
def _(file_db): | |
# Display file data with formatting | |
file_results = file_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
registration_date, | |
last_login, | |
CAST(account_balance AS DECIMAL(10,2)) AS account_balance | |
FROM users_file | |
ORDER BY id | |
""").df() | |
return (file_results,) | |
def _(file_results, memory_results, mo): | |
tabs = mo.ui.tabs( | |
{ | |
"In-Memory Database": mo.ui.table(memory_results), | |
"File-Based Database": mo.ui.table(file_results), | |
} | |
) | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## 📊 Database Contents After Insertion""")], | |
align="center", | |
), | |
tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select) | |
There are multiple ways to leverage DuckDB's SQL capabilities in marimo: | |
1. **Direct execution**: Using DuckDB connections to execute SQL | |
2. **marimo SQL**: Using marimo's built-in SQL engine | |
3. **Interactive queries**: Combining UI elements with SQL execution | |
Let's explore these approaches: | |
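
        For reference, marimo's SQL engine is exposed as `mo.sql`, which can query Python DataFrames by their variable name. A minimal sketch (with a hypothetical DataFrame `users_df`):

        ```python
        # marimo resolves `users_df` from the notebook's globals
        result_df = mo.sql("SELECT name, age FROM users_df WHERE age > 30")
        ```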
""" | |
) | |
return | |
def _(mo): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## 🔍 Query with marimo SQL""")], align="center"), | |
mo.md( | |
"### marimo has its own [built-in SQL engine](https://docs.marimo.io/guides/working_with_data/sql/) that can work with DataFrames." | |
), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(memory_results, mo): | |
# Create a SQL selector for users with age threshold | |
age_threshold = mo.ui.slider( | |
25, 50, value=30, label="Minimum Age", full_width=True, show_value=True | |
) | |
# Create a function to filter users based on the slider value | |
def filtered_users(): | |
# Use DuckDB directly instead of mo.sql with users param | |
filtered_df = memory_results[memory_results["age"] >= age_threshold.value] | |
filtered_df = filtered_df.sort_values("age") | |
return mo.ui.table(filtered_df) | |
return age_threshold, filtered_users | |
def _(age_threshold, filtered_users, mo): | |
layout = mo.vstack( | |
[ | |
mo.md("### Select minimum age:"), | |
age_threshold, | |
mo.md("### Users meeting age criteria:"), | |
filtered_users(), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
layout | |
return | |
def _(mo): | |
mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""") | |
return | |
def _(pl): | |
# Create a Polars DataFrame | |
polars_df = pl.DataFrame( | |
{ | |
"id": [101, 102, 103], | |
"name": ["Product A", "Product B", "Product C"], | |
"price": [29.99, 49.99, 19.99], | |
"category": ["Electronics", "Furniture", "Books"], | |
} | |
) | |
return (polars_df,) | |
def _(mo, polars_df): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Original Polars DataFrame:""")], align="center" | |
), | |
mo.ui.table(polars_df), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db, polars_df): | |
# Register the Polars DataFrame as a DuckDB table in memory connection | |
new_memory_db.register("products_polars", polars_df) | |
# Query the registered table | |
polars_query_result = new_memory_db.execute( | |
"SELECT * FROM products_polars WHERE price > 25" | |
).df() | |
return (polars_query_result,) | |
def _(mo, polars_query_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## DuckDB Query Result (From Polars Data):""")], | |
align="center", | |
), | |
mo.ui.table(polars_query_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
# Demonstrate a more complex query | |
complex_query_result = new_memory_db.execute(""" | |
SELECT | |
category, | |
COUNT(*) as product_count, | |
AVG(price) as avg_price, | |
MIN(price) as min_price, | |
MAX(price) as max_price | |
FROM products_polars | |
GROUP BY category | |
ORDER BY avg_price DESC | |
""").df() | |
return (complex_query_result,) | |
def _(complex_query_result, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregated Product Data by Category:""")], | |
align="center", | |
), | |
mo.ui.table(complex_query_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""") | |
return | |
def _(new_memory_db): | |
# Create another table to join with | |
new_memory_db.execute(""" | |
CREATE TABLE IF NOT EXISTS departments ( | |
id INTEGER, | |
department_name VARCHAR, | |
manager_id INTEGER | |
) | |
""") | |
return | |
def _(new_memory_db): | |
new_memory_db.execute(""" | |
INSERT INTO departments VALUES | |
(101, 'Engineering', 1), | |
(102, 'Marketing', 2), | |
(103, 'Finance', NULL) | |
""") | |
return | |
def _(new_memory_db): | |
# Execute a join query | |
join_result = new_memory_db.execute(""" | |
SELECT | |
u.id, | |
u.name, | |
u.age, | |
d.department_name | |
FROM users_memory u | |
LEFT JOIN departments d ON u.id = d.manager_id | |
ORDER BY u.id | |
""").df() | |
return (join_result,) | |
def _(mo): | |
mo.md( | |
rf""" | |
<!-- Display the join result --> | |
## Join Result (Users and Departments): | |
""" | |
) | |
return | |
def _(join_result, mo): | |
mo.ui.table(join_result) | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
<!-- Demonstrate different types of joins --> | |
## Different Types of Joins | |
""" | |
) | |
return | |
def _(new_memory_db): | |
# Inner join | |
inner_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
INNER JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Right join | |
right_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
RIGHT JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Full outer join | |
full_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
FULL OUTER JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Cross join | |
cross_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
CROSS JOIN departments d | |
""").df() | |
# Self join (Joining user table with itself to find users with the same age) | |
self_join = new_memory_db.execute(""" | |
SELECT u1.id, u1.name, u2.name AS same_age_user | |
FROM users_memory u1 | |
JOIN users_memory u2 ON u1.age = u2.age AND u1.id <> u2.id | |
""").df() | |
# Semi join (Finding users who are also managers) | |
semi_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, u.age | |
FROM users_memory u | |
WHERE EXISTS ( | |
SELECT 1 FROM departments d | |
WHERE u.id = d.manager_id | |
) | |
""").df() | |
# Anti join (Finding users who are not managers) | |
anti_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, u.age | |
FROM users_memory u | |
WHERE NOT EXISTS ( | |
SELECT 1 FROM departments d | |
WHERE u.id = d.manager_id | |
) | |
""").df() | |
return ( | |
anti_join, | |
cross_join, | |
full_join, | |
inner_join, | |
right_join, | |
self_join, | |
semi_join, | |
) | |
def _(mo, new_memory_db): | |
# Display base table side by side | |
users = new_memory_db.execute("SELECT * FROM users_memory").df() | |
departments = new_memory_db.execute("SELECT * FROM departments").df() | |
base_tables = mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""# Base Tables""")], align="center"), | |
mo.ui.tabs( | |
{ | |
"User Table": mo.ui.table(users), | |
"Departments Table": mo.ui.table(departments), | |
} | |
), | |
] | |
) | |
base_tables | |
return | |
def _( | |
anti_join, | |
cross_join, | |
full_join, | |
inner_join, | |
join_result, | |
mo, | |
right_join, | |
self_join, | |
semi_join, | |
): | |
join_description = { | |
"Left Join": "Shows all records from the left table and matching records from the right table. Non-matches filled with NULL.", | |
"Inner Join": "Shows only the records where there's a match in both tables.", | |
"Right Join": "Shows all records from the right table and matching records from the left table. Non-matches filled with NULL.", | |
"Full Outer Join": "Shows all records from both tables, with NULL values where there's no match.", | |
"Cross Join": "Returns the Cartesian product - all possible combinations of rows from both tables.", | |
"Self Join": "Joins a table with itself, used to compare rows within the same table.", | |
"Semi Join": "Returns rows from the first table where one or more matches exist in the second table.", | |
"Anti Join": "Returns rows from the first table where no matches exist in the second table.", | |
} | |
join_tabs = mo.ui.tabs( | |
{ | |
"Left Join": mo.ui.table(join_result), | |
"Inner Join": mo.ui.table(inner_join), | |
"Right Join": mo.ui.table(right_join), | |
"Full Outer Join": mo.ui.table(full_join), | |
"Cross Join": mo.ui.table(cross_join), | |
"Self Join": mo.ui.table(self_join), | |
"Semi Join": mo.ui.table(semi_join), | |
"Anti Join": mo.ui.table(anti_join), | |
} | |
) | |
return join_description, join_tabs | |
def _(join_description, join_tabs, mo): | |
join_display = mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""# SQL Join Operations""")], align="center"), | |
mo.md(f"**{join_tabs.value}**: {join_description[join_tabs.value]}"), | |
mo.md("## Join Results"), | |
join_tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
join_display | |
return | |
def _(mo): | |
mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""") | |
return | |
def _(new_memory_db): | |
# Execute an aggregate query | |
agg_result = new_memory_db.execute(""" | |
SELECT | |
AVG(age) as avg_age, | |
MAX(age) as max_age, | |
MIN(age) as min_age, | |
COUNT(*) as total_users, | |
SUM(account_balance) as total_balance | |
FROM users_memory | |
""").df() | |
return (agg_result,) | |
def _(agg_result, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregate Results (All Users):""")], align="center" | |
), | |
mo.ui.table(agg_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
age_groups = new_memory_db.execute(""" | |
SELECT | |
CASE | |
WHEN age < 30 THEN 'Under 30' | |
WHEN age BETWEEN 30 AND 40 THEN '30 to 40' | |
ELSE 'Over 40' | |
END as age_group, | |
COUNT(*) as count, | |
AVG(age) as avg_age, | |
AVG(account_balance) as avg_balance | |
FROM users_memory | |
GROUP BY 1 | |
ORDER BY 1 | |
""").df() | |
return (age_groups,) | |
def _(age_groups, mo):
    # The table is already included in the vstack below, so no separate call is needed
    mo.vstack(
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregate Results (Grouped by Age Range):""")], | |
align="center", | |
), | |
mo.ui.table(age_groups), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
window_result = new_memory_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
account_balance, | |
RANK() OVER (ORDER BY account_balance DESC) as balance_rank, | |
account_balance - AVG(account_balance) OVER () as diff_from_avg, | |
account_balance / SUM(account_balance) OVER () * 100 as pct_of_total | |
FROM users_memory | |
ORDER BY balance_rank | |
""").df() | |
return (window_result,) | |
def _(mo, window_result): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## Window Functions Example""")], align="center"), | |
mo.ui.table(window_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""") | |
return | |
def _(new_memory_db): | |
polars_result = new_memory_db.execute( | |
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age""" | |
).pl() | |
return (polars_result,) | |
def _(mo, polars_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Query Result as Polars DataFrame:""")], | |
align="center", | |
), | |
mo.ui.table(polars_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
pandas_result = new_memory_db.execute( | |
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age""" | |
).fetch_df() | |
return (pandas_result,) | |
def _(mo, pandas_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Same Query Result as Pandas DataFrame:""")], | |
align="center", | |
), | |
mo.ui.table(pandas_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Differences in DataFrame Handling""")], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md( | |
f"""## Polars: Filter users over 35 and calculate average balance""" | |
) | |
], | |
align="start", | |
), | |
], | |
gap=2, justify="space-between", | |
) | |
return | |
def _(mo, pl, polars_result): | |
def _(): | |
polars_filtered = polars_result.filter(pl.col("age") > 35) | |
polars_avg = polars_filtered.select( | |
pl.col("account_balance").mean().alias("avg_balance") | |
) | |
layout = mo.vstack( | |
[ | |
mo.md("### Filtered Polars DataFrame (Age > 35):"), | |
mo.ui.table(polars_filtered), | |
mo.md("### Average Account Balance:"), | |
mo.ui.table(polars_avg), | |
], | |
gap=2, | |
) | |
return layout | |
_() | |
return | |
def _(mo, pandas_result): | |
pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean() | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Pandas: Same operation in pandas style""")], | |
align="center", | |
), | |
mo.vstack( | |
[mo.md(f"""### Average balance: {pandas_avg:.2f}""")], | |
align="start", | |
), | |
] | |
) | |
return | |
def _(mo): | |
mo.md("""# 9. Data Visualization with DuckDB and Plotly""") | |
return | |
def _(age_groups, mo, new_memory_db, plotly_express): | |
# User distribution by age group | |
fig1 = plotly_express.bar( | |
age_groups, | |
x="age_group", | |
y="count", | |
title="User Distribution by Age Group", | |
labels={"count": "Number of Users", "age_group": "Age Group"}, | |
color="age_group", | |
color_discrete_sequence=plotly_express.colors.qualitative.Plotly, | |
) | |
fig1.update_traces( | |
text=age_groups["count"], | |
textposition="outside", | |
) | |
fig1.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Average balance by age group | |
fig2 = plotly_express.bar( | |
age_groups, | |
x="age_group", | |
y="avg_balance", | |
title="Average Account Balance by Age Group", | |
labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"}, | |
color="age_group", | |
color_discrete_sequence=plotly_express.colors.qualitative.Plotly, | |
) | |
fig2.update_traces( | |
text=[f"${val:.2f}" for val in age_groups["avg_balance"]], | |
textposition="outside", | |
) | |
fig2.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Age vs Account Balance scatter plot | |
scatter_data = new_memory_db.execute( | |
""" | |
SELECT | |
name, | |
age, | |
account_balance | |
FROM users_memory | |
ORDER BY age | |
""" | |
).df() | |
fig3 = plotly_express.scatter( | |
scatter_data, | |
x="age", | |
y="account_balance", | |
title="Age vs. Account Balance", | |
labels={"account_balance": "Account Balance ($)", "age": "Age"}, | |
color_discrete_sequence=["#FF7F0E"], | |
trendline="ols", | |
hover_data=["age", "account_balance"], | |
size_max=15, | |
) | |
fig3.update_traces(marker=dict(size=12)) | |
fig3.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Distribution of account balances | |
balance_data = new_memory_db.execute( | |
""" | |
SELECT | |
name, | |
account_balance | |
FROM users_memory | |
ORDER BY account_balance DESC | |
""" | |
).df() | |
fig4 = plotly_express.pie( | |
balance_data, | |
names="name", | |
values="account_balance", | |
title="Distribution of Account Balances", | |
labels={"account_balance": "Account Balance ($)", "name": "User"}, | |
color_discrete_sequence=plotly_express.colors.qualitative.Pastel, | |
) | |
fig4.update_traces(textinfo="percent+label", textposition="inside") | |
fig4.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
category_tabs = mo.ui.tabs( | |
{ | |
"Age Group Analysis": mo.vstack( | |
[ | |
mo.ui.tabs( | |
{ | |
"User Distribution": mo.ui.plotly(fig1), | |
"Average Balance": mo.ui.plotly(fig2), | |
} | |
) | |
], | |
gap=2, | |
justify="space-between", | |
), | |
"Financial Analysis": mo.vstack( | |
[ | |
mo.ui.tabs( | |
{ | |
"Age vs Balance": mo.ui.plotly(fig3), | |
"Balance Distribution": mo.ui.plotly(fig4), | |
} | |
) | |
], | |
gap=2, | |
justify="space-between", | |
), | |
}, | |
lazy=True, | |
) | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Select a visualization category:""")], | |
align="start", | |
), | |
category_tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
/// admonition | | |
## Database Management Best Practices | |
/// | |
### Closing Connections | |
It's important to close database connections when you're done with them, especially for file-based connections: | |
```python | |
memory_db.close() | |
file_db.close() | |
``` | |
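
        Connections can also be used as context managers, which closes them automatically when the block exits (a small sketch):

        ```python
        with duckdb.connect("mydb.db") as conn:
            conn.execute("SELECT 1")
        # conn is closed here
        ```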
### Transaction Management | |
DuckDB supports transactions, which can be useful for more complex operations: | |
```python | |
conn = duckdb.connect('mydb.db') | |
conn.begin() # Start transaction | |
try: | |
conn.execute("INSERT INTO users VALUES (1, 'Test User')") | |
conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1") | |
conn.commit() # Commit changes | |
except: | |
conn.rollback() # Undo changes if error | |
raise | |
``` | |
### Query Performance | |
DuckDB is optimized for analytical queries. For best performance: | |
- Use appropriate data types | |
- Create indexes for frequently queried columns | |
- For large datasets, consider partitioning | |
- Use prepared statements for repeated queries | |
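
        A short illustrative sketch of the last two points (index creation and a reusable parameterized query); the table and column names are placeholders:

        ```python
        conn = duckdb.connect("mydb.db")

        # an explicit index can help highly selective lookups on a large table
        conn.execute("CREATE INDEX idx_users_age ON users (age)")

        # parameter binding avoids rebuilding the SQL string and protects against injection
        conn.execute("SELECT * FROM users WHERE age > ?", [30])
        ```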
""" | |
) | |
return | |
def _(mo): | |
mo.md(rf"""## 10. Interactive DuckDB Dashboard with marimo and Plotly""") | |
return | |
def _(mo): | |
# Create an interactive filter for age range | |
min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age") | |
max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age") | |
return max_age, min_age | |
def _(max_age, min_age, new_memory_db): | |
# Create a function to filter data and update visualizations | |
def get_filtered_data(min_val=min_age.value, max_val=max_age.value): | |
# Get filtered data based on slider values using parameterized query for safety | |
return new_memory_db.execute( | |
""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
account_balance, | |
registration_date | |
FROM users_memory | |
WHERE age >= ? AND age <= ? | |
ORDER BY age | |
""", | |
[min_val, max_val], | |
).df() | |
return (get_filtered_data,) | |
def _(get_filtered_data): | |
def get_metrics(data=get_filtered_data()): | |
return { | |
"user count": len(data), | |
"avg_balance": data["account_balance"].mean() if len(data) > 0 else 0, | |
"total_balance": data["account_balance"].sum() if len(data) > 0 else 0, | |
} | |
return (get_metrics,) | |
def _(get_metrics, mo): | |
def metrics_display(metrics=get_metrics()): | |
return mo.hstack( | |
[ | |
mo.vstack( | |
[ | |
mo.md("### Selected Users"), | |
mo.md(f"## {metrics['user count']}"), | |
], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md("### Average Balance"), | |
mo.md(f"## ${metrics['avg_balance']:.2f}"), | |
], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md("### Total Balance"), | |
mo.md(f"## ${metrics['total_balance']:.2f}"), | |
], | |
align="center", | |
), | |
], | |
justify="space-between", | |
gap=2, | |
) | |
return (metrics_display,) | |
def _(get_filtered_data, max_age, min_age, mo, plotly_express): | |
def create_visualization( | |
data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value | |
): | |
if len(data) == 0: | |
return mo.ui.text("No data available for the selected age range.") | |
# Create visualizations for filtered data | |
fig1 = plotly_express.bar( | |
data, | |
x="name", | |
y="account_balance", | |
title=f"Account Balance by User (Age {min_val} - {max_val})", | |
labels={"account_balance": "Account Balance ($)", "name": "User"}, | |
color="account_balance", | |
color_continuous_scale=plotly_express.colors.sequential.Plasma, | |
text_auto=".2s", | |
) | |
fig1.update_layout( | |
height=400, | |
xaxis_tickangle=-45, | |
margin=dict(t=50, b=70, l=50, r=30), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
fig1.update_traces( | |
textposition="outside", | |
) | |
fig2 = plotly_express.histogram( | |
data, | |
x="age", | |
nbins=min(10, len(set(data["age"]))), | |
title=f"Age Distribution (Age {min_val} - {max_val})", | |
color_discrete_sequence=["#4C78A8"], | |
opacity=0.8, | |
histnorm="probability density", | |
) | |
fig2.update_layout( | |
height=400, | |
margin=dict(t=50, b=70, l=50, r=30), | |
bargap=0.1, | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
fig3 = plotly_express.scatter( | |
data, | |
x="age", | |
y="account_balance", | |
title=f"Age vs. Account Balance (Age {min_val} - {max_val})", | |
labels={"account_balance": "Account Balance ($)", "age": "Age"}, | |
color="age", | |
color_continuous_scale="Viridis", | |
size_max=25, | |
size="account_balance", | |
hover_name="name", | |
) | |
fig3.update_layout( | |
height=400, | |
margin=dict(t=50, b=70, l=50, r=30), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
return mo.ui.tabs( | |
{ | |
"Account Balance by User": mo.ui.plotly(fig1), | |
"Age Distribution": mo.ui.plotly(fig2), | |
"Age vs. Account Balance": mo.ui.plotly(fig3), | |
} | |
) | |
return (create_visualization,) | |
def _( | |
create_visualization, | |
get_filtered_data, | |
max_age, | |
metrics_display, | |
min_age, | |
mo, | |
): | |
def dashboard( | |
min_val=min_age.value, | |
max_val=max_age.value, | |
metrics=metrics_display(), | |
data=get_filtered_data(), | |
visualization=create_visualization(), | |
): | |
return mo.vstack( | |
[ | |
mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"), | |
metrics, | |
mo.md("### Data Table"), | |
mo.ui.table(data, page_size=5), | |
mo.md("### Visualizations"), | |
visualization, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
dashboard() | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
# Summary and Key Takeaways | |
In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered: | |
1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent). | |
2. **Table creation**: We created tables with various data types, constraints, and primary keys. | |
3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading. | |
4. **SQL queries**: We executed various SQL queries directly and through marimo's UI components. | |
5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames. | |
6. **Joins and relationships**: We performed JOIN operations between tables to combine related data. | |
7. **Aggregation**: We used aggregate functions to summarize and analyze data. | |
8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames. | |
9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions. | |
10. **Visualization**: We created interactive visualizations and dashboards with Plotly and marimo. | |
DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases. | |
### Next Steps | |
- Try loading larger datasets into DuckDB | |
- Experiment with more complex queries and window functions | |
- Use DuckDB's COPY functionality to import/export data from/to files | |
- Create more advanced interactive dashboards with marimo and Plotly | |
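
        For the COPY suggestion above, a minimal sketch looks like this (the file name is illustrative):

        ```python
        # export a table to Parquet, then query the file directly
        conn.execute("COPY users_memory TO 'users.parquet' (FORMAT PARQUET)")
        conn.execute("SELECT COUNT(*) FROM 'users.parquet'").fetchone()
        ```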
""" | |
) | |
return | |
def _(): | |
import marimo as mo | |
import duckdb | |
import polars as pl | |
import os | |
from datetime import date | |
import plotly.express as plotly_express | |
import plotly.graph_objects as plotly_graph_objects | |
import numpy as np | |
return date, duckdb, mo, os, pl, plotly_express | |
if __name__ == "__main__": | |
app.run() | |