marimo-learn / duckdb /01_getting_started.py
Azmi-84
Hide code cells in getting started guide for improved user experience
34f04d3
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "marimo",
# "duckdb==1.2.2",
# "polars==1.27.0",
# "numpy==2.2.4",
# "pyarrow==19.0.1",
# "pandas==2.2.3",
# "sqlglot==26.12.1",
# "plotly==5.23.1",
# "statsmodels==0.14.4",
# ]
# ///
import marimo
__generated_with = "0.13.4"
app = marimo.App(width="medium")
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<p align="center">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
</p>
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
# 🦆 **DuckDB**: An Embeddable Analytical Database System
## What is DuckDB?
[DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
---
## ⚡ Key Features
| Feature | Description |
|:---------|:-------------|
| **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment |
| **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries |
| **Vectorized Execution** | Performs operations on entire columns at once, significantly speeding up data processing |
| **ACID Transactions** | Ensures data integrity and reliability across operations |
| **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more |
| **Zero External Dependencies** | Minimal dependencies, making setup and deployment straightforward |
| **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures |
---
## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file):
- **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis.
- **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server.
- **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline.
- **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models.
- **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment.
- **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server.
---
### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python):
- Python installation:
```
pip install duckdb
```
```
conda install python-duckdb -c conda-forge.
```
<!-- >**_Note_:** DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system. -->
/// attention | Note
DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
///
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
---
| Feature | In-Memory Connection | File-Based Connection |
|:---------|:---------------------|:----------------------|
| Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) |
| Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later |
| Performance | Faster for most operations | Slightly slower but provides persistence |
| Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
| Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
"""
)
return
@app.cell(hide_code=True)
def _(os):
# Remove previous database if it exists
if os.path.exists("example.db"):
os.remove("example.db")
if not os.path.exists("data"):
os.makedirs("data")
return
@app.cell(hide_code=True)
def _(mo):
_df = mo.sql(
f"""
-- Print the DuckDB version
SELECT version() AS version_info
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
"""
## Creating DuckDB Connections
Let's create both types of DuckDB connections and explore their characteristics.
1. **In-memory connection**: Data exists only during the current session
2. **File-based connection**: Data persists between sessions
We'll then demonstrate the key differences between these connection types.
"""
)
return
@app.cell(hide_code=True)
def _(duckdb):
# Create an in-memory DuckDB connection
memory_db = duckdb.connect(":memory:")
# Create a file-based DuckDB connection
file_db = duckdb.connect("example.db")
return file_db, memory_db
@app.cell(hide_code=True)
def _(file_db, memory_db):
# Test both connections
memory_db.execute(
"CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)"
)
memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')")
file_db.execute(
"CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)"
)
file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')")
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
## Testing Connection Persistence
Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
1. First, we'll query our tables to confirm the data was properly inserted
2. Then, we'll simulate an application restart by creating new connections
3. Finally, we'll check which data persists after the "restart"
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""## Current Database Contents""")
return
@app.cell(hide_code=True)
def _(mem_test, memory_db, mo):
_df = mo.sql(
f"""
SELECT * FROM mem_test
""",
engine=memory_db
)
return
@app.cell(hide_code=True)
def _(file_db, file_test, mo):
_df = mo.sql(
f"""
SELECT * FROM file_test
""",
engine=file_db
)
return
@app.cell
def _():
# We don't actually close the connections here since we need them for later cells
# Just a placeholder for the concept
return
@app.cell(hide_code=True)
def _(mo):
mo.md(rf"""## 🔄 Simulating Application Restart...""")
return
@app.cell(hide_code=True)
def _(duckdb):
# Create new connections (simulating restart)
new_memory_db = duckdb.connect(":memory:")
new_file_db = duckdb.connect("example.db")
return new_file_db, new_memory_db
@app.cell(hide_code=True)
def _(new_memory_db):
# Try to query tables in the new memory connection
try:
new_memory_db.execute("SELECT * FROM mem_test").df()
memory_persistence = "✅ Data persisted in memory (unexpected)"
memory_data_available = True
except Exception as e:
memory_persistence = "❌ Data lost from memory (expected behavior)"
memory_data_available = False
return memory_data_available, memory_persistence
@app.cell(hide_code=True)
def _(new_file_db):
# Try to query tables in the new file connection
try:
file_data = new_file_db.execute("SELECT * FROM file_test").df()
file_persistence = "✅ Data persisted in file (expected behavior)"
file_data_available = True
except Exception as e:
file_persistence = "❌ Data lost from file (unexpected)"
file_data_available = False
file_data = None
return file_data, file_data_available, file_persistence
@app.cell(hide_code=True)
def _(
file_data_available,
file_persistence,
memory_data_available,
memory_persistence,
mo,
):
# Create an interactive display to show persistence results
persistence_results = mo.ui.table(
{
"Connection Type": ["In-Memory Database", "File-Based Database"],
"Persistence Status": [memory_persistence, file_persistence],
"Data Available After Restart": [
memory_data_available,
file_data_available,
],
}
)
return (persistence_results,)
@app.cell(hide_code=True)
def _(mo, persistence_results):
mo.vstack(
[
mo.vstack([mo.md(f"""## Persistence Test Results""")], align="center"),
persistence_results,
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(file_data, file_data_available, mo):
if file_data_available:
mo.md("### Persisted File-Based Data:")
mo.ui.table(file_data)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
## Table Creation Options
DuckDB supports various table creation options, including:
- **Basic tables** with column definitions
- **Temporary tables** that exist only during the session
- **CREATE OR REPLACE** to recreate tables
- **Primary keys** and other constraints
- **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
"""
)
return
@app.cell(hide_code=True)
def _(file_db, new_memory_db):
# For the memory database
try:
new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
except:
pass
# For the file database
try:
file_db.execute("DROP TABLE IF EXISTS users_file")
except:
pass
return
@app.cell(hide_code=True)
def _(file_db, new_memory_db):
# Create advanced users table in memory database with primary key
new_memory_db.execute("""
CREATE TABLE users_memory (
id INTEGER PRIMARY KEY,
name VARCHAR NOT NULL,
age INTEGER CHECK (age > 0),
email VARCHAR UNIQUE,
registration_date DATE DEFAULT CURRENT_DATE,
last_login TIMESTAMP,
account_balance DECIMAL(10,2) DEFAULT 0.00
)
""")
# Create users table in file database
file_db.execute("""
CREATE TABLE users_file (
id INTEGER PRIMARY KEY,
name VARCHAR NOT NULL,
age INTEGER CHECK (age > 0),
email VARCHAR UNIQUE,
registration_date DATE DEFAULT CURRENT_DATE,
last_login TIMESTAMP,
account_balance DECIMAL(10,2) DEFAULT 0.00
)
""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Get table schema information using DuckDB's internal system tables
memory_schema = new_memory_db.execute("""
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'users_memory'
ORDER BY ordinal_position
""").df()
return (memory_schema,)
@app.cell(hide_code=True)
def _(memory_schema, mo):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## 🔍 Table Schema Information """)], align="center"
),
mo.ui.table(memory_schema),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
DuckDB supports multiple ways to insert data:
1. **INSERT INTO VALUES**: Insert specific values
2. **INSERT INTO SELECT**: Insert data from query results
3. **Parameterized inserts**: Using prepared statements
4. **Bulk inserts**: For efficient loading of multiple rows
Let's demonstrate these different insertion methods:
"""
)
return
@app.cell(hide_code=True)
def _(date):
today = date.today()
# First check if records already exist to avoid duplicate key errors
def safe_insert(connection, table_name, data):
"""
Safely insert data into a table by checking for existing IDs first
"""
# Check which IDs already exist in the table
existing_ids = (
connection.execute(f"SELECT id FROM {table_name}")
.fetchdf()["id"]
.tolist()
)
# Filter out data with IDs that already exist
new_data = [record for record in data if record[0] not in existing_ids]
if not new_data:
print(
f"No new records to insert into {table_name}. All IDs already exist."
)
return 0
# Prepare the placeholders for the SQL statement
placeholders = ", ".join(
["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data)
)
# Flatten the list of tuples for parameter binding
flat_data = [item for sublist in new_data for item in sublist]
# Perform the insertion
if flat_data:
columns = "(id, name, age, email, registration_date, last_login, account_balance)"
connection.execute(
f"INSERT INTO {table_name} {columns} VALUES {placeholders}",
flat_data,
)
return len(new_data)
return 0
return (safe_insert,)
@app.cell(hide_code=True)
def _():
# Prepare the data
user_data = [
(
1,
"Alice",
25,
"[email protected]",
"2021-01-01",
"2023-01-15 14:30:00",
1250.75,
),
(
2,
"Bob",
30,
"[email protected]",
"2021-02-01",
"2023-02-10 09:15:22",
750.50,
),
(
3,
"Charlie",
35,
"[email protected]",
"2021-03-01",
"2023-03-05 17:45:10",
3200.25,
),
(
4,
"David",
40,
"[email protected]",
"2021-04-01",
"2023-04-20 10:30:45",
1800.00,
),
(
5,
"Emma",
45,
"[email protected]",
"2021-05-01",
"2023-05-12 11:20:30",
2500.00,
),
(
6,
"Frank",
50,
"[email protected]",
"2021-06-01",
"2023-06-18 16:10:15",
900.25,
),
]
return (user_data,)
@app.cell(hide_code=True)
def _(file_db, new_memory_db, safe_insert, user_data):
# Safely insert data into memory database
safe_insert(new_memory_db, "users_memory", user_data)
# Safely insert data into file database
safe_insert(file_db, "users_file", user_data)
return
@app.cell(hide_code=True)
def _():
# If you need to add just one record, you can use a similar approach:
new_user = (
7,
"Grace",
28,
"[email protected]",
"2021-07-01",
"2023-07-22 13:45:10",
1675.50,
)
return (new_user,)
@app.cell(hide_code=True)
def _(new_memory_db, new_user):
# Check if the ID exists before inserting
if not new_memory_db.execute(
"SELECT id FROM users_memory WHERE id = ?", [new_user[0]]
).fetchone():
new_memory_db.execute(
"""
INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
new_user,
)
print(f"Added user {new_user[1]} to users_memory")
else:
print(f"User with ID {new_user[0]} already exists in users_memory")
return
@app.cell(hide_code=True)
def _(file_db, new_user):
# Do the same for the file database
if not file_db.execute(
"SELECT id FROM users_file WHERE id = ?", [new_user[0]]
).fetchone():
file_db.execute(
"""
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
new_user,
)
print(f"Added user {new_user[1]} to users_file")
else:
print(f"User with ID {new_user[0]} already exists in users_file")
return
@app.cell(hide_code=True)
def _(new_memory_db):
# First try to update
cursor = new_memory_db.execute(
"""
UPDATE users_memory
SET name = ?, age = ?, email = ?,
registration_date = ?, last_login = ?, account_balance = ?
WHERE id = ?
""",
(
"Henry",
33,
"[email protected]",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
8, # ID should be the last parameter
),
)
return (cursor,)
@app.cell(hide_code=True)
def _(cursor, mo, new_memory_db):
# If no rows were updated, perform an insert
if cursor.rowcount == 0:
new_memory_db.execute(
"""
INSERT INTO users_memory
(id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
8,
"Henry",
33,
"[email protected]",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
),
)
mo.md(
f"""
Upserted Henry into users_memory.
"""
)
return
@app.cell(hide_code=True)
def _(file_db, mo):
# For DuckDB using ON CONFLICT, we need to specify the conflict target column
file_db.execute(
"""
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
name = EXCLUDED.name,
age = EXCLUDED.age,
email = EXCLUDED.email,
registration_date = EXCLUDED.registration_date,
last_login = EXCLUDED.last_login,
account_balance = EXCLUDED.account_balance
""",
(
8,
"Henry",
33,
"[email protected]",
"2021-08-01",
"2023-08-05 09:10:15",
3100.75,
),
)
mo.md(
f"""
Upserted Henry into users_file.
"""
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Display memory data using DuckDB's query capabilities
memory_results = new_memory_db.execute("""
SELECT
id,
name,
age,
email,
registration_date,
last_login,
account_balance
FROM users_memory
ORDER BY id
""").df()
return (memory_results,)
@app.cell(hide_code=True)
def _(file_db):
# Display file data with formatting
file_results = file_db.execute("""
SELECT
id,
name,
age,
email,
registration_date,
last_login,
CAST(account_balance AS DECIMAL(10,2)) AS account_balance
FROM users_file
ORDER BY id
""").df()
return (file_results,)
@app.cell(hide_code=True)
def _(file_results, memory_results, mo):
tabs = mo.ui.tabs(
{
"In-Memory Database": mo.ui.table(memory_results),
"File-Based Database": mo.ui.table(file_results),
}
)
mo.vstack(
[
mo.vstack(
[mo.md(f"""## 📊 Database Contents After Insertion""")],
align="center",
),
tabs,
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
# [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
1. **Direct execution**: Using DuckDB connections to execute SQL
2. **marimo SQL**: Using marimo's built-in SQL engine
3. **Interactive queries**: Combining UI elements with SQL execution
Let's explore these approaches:
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.vstack(
[
mo.vstack([mo.md(f"""## 🔍 Query with marimo SQL""")], align="center"),
mo.md(
"### marimo has its own [built-in SQL engine](https://docs.marimo.io/guides/working_with_data/sql/) that can work with DataFrames."
),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(memory_results, mo):
# Create a SQL selector for users with age threshold
age_threshold = mo.ui.slider(
25, 50, value=30, label="Minimum Age", full_width=True, show_value=True
)
# Create a function to filter users based on the slider value
def filtered_users():
# Use DuckDB directly instead of mo.sql with users param
filtered_df = memory_results[memory_results["age"] >= age_threshold.value]
filtered_df = filtered_df.sort_values("age")
return mo.ui.table(filtered_df)
return age_threshold, filtered_users
@app.cell(hide_code=True)
def _(age_threshold, filtered_users, mo):
layout = mo.vstack(
[
mo.md("### Select minimum age:"),
age_threshold,
mo.md("### Users meeting age criteria:"),
filtered_users(),
],
gap=2,
justify="space-between",
)
layout
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
return
@app.cell(hide_code=True)
def _(pl):
# Create a Polars DataFrame
polars_df = pl.DataFrame(
{
"id": [101, 102, 103],
"name": ["Product A", "Product B", "Product C"],
"price": [29.99, 49.99, 19.99],
"category": ["Electronics", "Furniture", "Books"],
}
)
return (polars_df,)
@app.cell(hide_code=True)
def _(mo, polars_df):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Original Polars DataFrame:""")], align="center"
),
mo.ui.table(polars_df),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(new_memory_db, polars_df):
# Register the Polars DataFrame as a DuckDB table in memory connection
new_memory_db.register("products_polars", polars_df)
# Query the registered table
polars_query_result = new_memory_db.execute(
"SELECT * FROM products_polars WHERE price > 25"
).df()
return (polars_query_result,)
@app.cell(hide_code=True)
def _(mo, polars_query_result):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## DuckDB Query Result (From Polars Data):""")],
align="center",
),
mo.ui.table(polars_query_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Demonstrate a more complex query
complex_query_result = new_memory_db.execute("""
SELECT
category,
COUNT(*) as product_count,
AVG(price) as avg_price,
MIN(price) as min_price,
MAX(price) as max_price
FROM products_polars
GROUP BY category
ORDER BY avg_price DESC
""").df()
return (complex_query_result,)
@app.cell(hide_code=True)
def _(complex_query_result, mo):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Aggregated Product Data by Category:""")],
align="center",
),
mo.ui.table(complex_query_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Create another table to join with
new_memory_db.execute("""
CREATE TABLE IF NOT EXISTS departments (
id INTEGER,
department_name VARCHAR,
manager_id INTEGER
)
""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
new_memory_db.execute("""
INSERT INTO departments VALUES
(101, 'Engineering', 1),
(102, 'Marketing', 2),
(103, 'Finance', NULL)
""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Execute a join query
join_result = new_memory_db.execute("""
SELECT
u.id,
u.name,
u.age,
d.department_name
FROM users_memory u
LEFT JOIN departments d ON u.id = d.manager_id
ORDER BY u.id
""").df()
return (join_result,)
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- Display the join result -->
## Join Result (Users and Departments):
"""
)
return
@app.cell
def _(join_result, mo):
mo.ui.table(join_result)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
<!-- Demonstrate different types of joins -->
## Different Types of Joins
"""
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Inner join
inner_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
INNER JOIN departments d ON u.id = d.manager_id
""").df()
# Right join
right_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
RIGHT JOIN departments d ON u.id = d.manager_id
""").df()
# Full outer join
full_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
FULL OUTER JOIN departments d ON u.id = d.manager_id
""").df()
# Cross join
cross_join = new_memory_db.execute("""
SELECT u.id, u.name, d.department_name
FROM users_memory u
CROSS JOIN departments d
""").df()
# Self join (Joining user table with itself to find users with the same age)
self_join = new_memory_db.execute("""
SELECT u1.id, u1.name, u2.name AS same_age_user
FROM users_memory u1
JOIN users_memory u2 ON u1.age = u2.age AND u1.id <> u2.id
""").df()
# Semi join (Finding users who are also managers)
semi_join = new_memory_db.execute("""
SELECT u.id, u.name, u.age
FROM users_memory u
WHERE EXISTS (
SELECT 1 FROM departments d
WHERE u.id = d.manager_id
)
""").df()
# Anti join (Finding users who are not managers)
anti_join = new_memory_db.execute("""
SELECT u.id, u.name, u.age
FROM users_memory u
WHERE NOT EXISTS (
SELECT 1 FROM departments d
WHERE u.id = d.manager_id
)
""").df()
return (
anti_join,
cross_join,
full_join,
inner_join,
right_join,
self_join,
semi_join,
)
@app.cell(hide_code=True)
def _(mo, new_memory_db):
# Display base table side by side
users = new_memory_db.execute("SELECT * FROM users_memory").df()
departments = new_memory_db.execute("SELECT * FROM departments").df()
base_tables = mo.vstack(
[
mo.vstack([mo.md(f"""# Base Tables""")], align="center"),
mo.ui.tabs(
{
"User Table": mo.ui.table(users),
"Departments Table": mo.ui.table(departments),
}
),
]
)
base_tables
return
@app.cell(hide_code=True)
def _(
anti_join,
cross_join,
full_join,
inner_join,
join_result,
mo,
right_join,
self_join,
semi_join,
):
join_description = {
"Left Join": "Shows all records from the left table and matching records from the right table. Non-matches filled with NULL.",
"Inner Join": "Shows only the records where there's a match in both tables.",
"Right Join": "Shows all records from the right table and matching records from the left table. Non-matches filled with NULL.",
"Full Outer Join": "Shows all records from both tables, with NULL values where there's no match.",
"Cross Join": "Returns the Cartesian product - all possible combinations of rows from both tables.",
"Self Join": "Joins a table with itself, used to compare rows within the same table.",
"Semi Join": "Returns rows from the first table where one or more matches exist in the second table.",
"Anti Join": "Returns rows from the first table where no matches exist in the second table.",
}
join_tabs = mo.ui.tabs(
{
"Left Join": mo.ui.table(join_result),
"Inner Join": mo.ui.table(inner_join),
"Right Join": mo.ui.table(right_join),
"Full Outer Join": mo.ui.table(full_join),
"Cross Join": mo.ui.table(cross_join),
"Self Join": mo.ui.table(self_join),
"Semi Join": mo.ui.table(semi_join),
"Anti Join": mo.ui.table(anti_join),
}
)
return join_description, join_tabs
@app.cell(hide_code=True)
def _(join_description, join_tabs, mo):
join_display = mo.vstack(
[
mo.vstack([mo.md(f"""# SQL Join Operations""")], align="center"),
mo.md(f"**{join_tabs.value}**: {join_description[join_tabs.value]}"),
mo.md("## Join Results"),
join_tabs,
],
gap=2,
justify="space-between",
)
join_display
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
# Execute an aggregate query
agg_result = new_memory_db.execute("""
SELECT
AVG(age) as avg_age,
MAX(age) as max_age,
MIN(age) as min_age,
COUNT(*) as total_users,
SUM(account_balance) as total_balance
FROM users_memory
""").df()
return (agg_result,)
@app.cell(hide_code=True)
def _(agg_result, mo):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Aggregate Results (All Users):""")], align="center"
),
mo.ui.table(agg_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
age_groups = new_memory_db.execute("""
SELECT
CASE
WHEN age < 30 THEN 'Under 30'
WHEN age BETWEEN 30 AND 40 THEN '30 to 40'
ELSE 'Over 40'
END as age_group,
COUNT(*) as count,
AVG(age) as avg_age,
AVG(account_balance) as avg_balance
FROM users_memory
GROUP BY 1
ORDER BY 1
""").df()
return (age_groups,)
@app.cell(hide_code=True)
def _(age_groups, mo):
mo.ui.table(age_groups)
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Aggregate Results (Grouped by Age Range):""")],
align="center",
),
mo.ui.table(age_groups),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
window_result = new_memory_db.execute("""
SELECT
id,
name,
age,
account_balance,
RANK() OVER (ORDER BY account_balance DESC) as balance_rank,
account_balance - AVG(account_balance) OVER () as diff_from_avg,
account_balance / SUM(account_balance) OVER () * 100 as pct_of_total
FROM users_memory
ORDER BY balance_rank
""").df()
return (window_result,)
@app.cell(hide_code=True)
def _(mo, window_result):
mo.vstack(
[
mo.vstack([mo.md(f"""## Window Functions Example""")], align="center"),
mo.ui.table(window_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
return
@app.cell(hide_code=True)
def _(new_memory_db):
polars_result = new_memory_db.execute(
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
).pl()
return (polars_result,)
@app.cell(hide_code=True)
def _(mo, polars_result):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Query Result as Polars DataFrame:""")],
align="center",
),
mo.ui.table(polars_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(new_memory_db):
pandas_result = new_memory_db.execute(
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age"""
).fetch_df()
return (pandas_result,)
@app.cell(hide_code=True)
def _(mo, pandas_result):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Same Query Result as Pandas DataFrame:""")],
align="center",
),
mo.ui.table(pandas_result),
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Differences in DataFrame Handling""")],
align="center",
),
mo.vstack(
[
mo.md(
f"""## Polars: Filter users over 35 and calculate average balance"""
)
],
align="start",
),
],
gap=2, justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo, pl, polars_result):
def _():
polars_filtered = polars_result.filter(pl.col("age") > 35)
polars_avg = polars_filtered.select(
pl.col("account_balance").mean().alias("avg_balance")
)
layout = mo.vstack(
[
mo.md("### Filtered Polars DataFrame (Age > 35):"),
mo.ui.table(polars_filtered),
mo.md("### Average Account Balance:"),
mo.ui.table(polars_avg),
],
gap=2,
)
return layout
_()
return
@app.cell(hide_code=True)
def _(mo, pandas_result):
pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean()
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Pandas: Same operation in pandas style""")],
align="center",
),
mo.vstack(
[mo.md(f"""### Average balance: {pandas_avg:.2f}""")],
align="start",
),
]
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md("""# 9. Data Visualization with DuckDB and Plotly""")
return
@app.cell(hide_code=True)
def _(age_groups, mo, new_memory_db, plotly_express):
# User distribution by age group
fig1 = plotly_express.bar(
age_groups,
x="age_group",
y="count",
title="User Distribution by Age Group",
labels={"count": "Number of Users", "age_group": "Age Group"},
color="age_group",
color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
)
fig1.update_traces(
text=age_groups["count"],
textposition="outside",
)
fig1.update_layout(
height=450,
margin=dict(t=50, b=50, l=50, r=25),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
# Average balance by age group
fig2 = plotly_express.bar(
age_groups,
x="age_group",
y="avg_balance",
title="Average Account Balance by Age Group",
labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"},
color="age_group",
color_discrete_sequence=plotly_express.colors.qualitative.Plotly,
)
fig2.update_traces(
text=[f"${val:.2f}" for val in age_groups["avg_balance"]],
textposition="outside",
)
fig2.update_layout(
height=450,
margin=dict(t=50, b=50, l=50, r=25),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
# Age vs Account Balance scatter plot
scatter_data = new_memory_db.execute(
"""
SELECT
name,
age,
account_balance
FROM users_memory
ORDER BY age
"""
).df()
fig3 = plotly_express.scatter(
scatter_data,
x="age",
y="account_balance",
title="Age vs. Account Balance",
labels={"account_balance": "Account Balance ($)", "age": "Age"},
color_discrete_sequence=["#FF7F0E"],
trendline="ols",
hover_data=["age", "account_balance"],
size_max=15,
)
fig3.update_traces(marker=dict(size=12))
fig3.update_layout(
height=450,
margin=dict(t=50, b=50, l=50, r=25),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
# Distribution of account balances
balance_data = new_memory_db.execute(
"""
SELECT
name,
account_balance
FROM users_memory
ORDER BY account_balance DESC
"""
).df()
fig4 = plotly_express.pie(
balance_data,
names="name",
values="account_balance",
title="Distribution of Account Balances",
labels={"account_balance": "Account Balance ($)", "name": "User"},
color_discrete_sequence=plotly_express.colors.qualitative.Pastel,
)
fig4.update_traces(textinfo="percent+label", textposition="inside")
fig4.update_layout(
height=450,
margin=dict(t=50, b=50, l=50, r=25),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
category_tabs = mo.ui.tabs(
{
"Age Group Analysis": mo.vstack(
[
mo.ui.tabs(
{
"User Distribution": mo.ui.plotly(fig1),
"Average Balance": mo.ui.plotly(fig2),
}
)
],
gap=2,
justify="space-between",
),
"Financial Analysis": mo.vstack(
[
mo.ui.tabs(
{
"Age vs Balance": mo.ui.plotly(fig3),
"Balance Distribution": mo.ui.plotly(fig4),
}
)
],
gap=2,
justify="space-between",
),
},
lazy=True,
)
mo.vstack(
[
mo.vstack(
[mo.md(f"""## Select a visualization category:""")],
align="start",
),
category_tabs,
],
gap=2,
justify="space-between",
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
/// admonition |
## Database Management Best Practices
///
### Closing Connections
It's important to close database connections when you're done with them, especially for file-based connections:
```python
memory_db.close()
file_db.close()
```
### Transaction Management
DuckDB supports transactions, which can be useful for more complex operations:
```python
conn = duckdb.connect('mydb.db')
conn.begin() # Start transaction
try:
conn.execute("INSERT INTO users VALUES (1, 'Test User')")
conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1")
conn.commit() # Commit changes
except:
conn.rollback() # Undo changes if error
raise
```
### Query Performance
DuckDB is optimized for analytical queries. For best performance:
- Use appropriate data types
- Create indexes for frequently queried columns
- For large datasets, consider partitioning
- Use prepared statements for repeated queries
"""
)
return
@app.cell(hide_code=True)
def _(mo):
mo.md(rf"""## 10. Interactive DuckDB Dashboard with marimo and Plotly""")
return
@app.cell(hide_code=True)
def _(mo):
# Create an interactive filter for age range
min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age")
max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age")
return max_age, min_age
@app.cell(hide_code=True)
def _(max_age, min_age, new_memory_db):
# Create a function to filter data and update visualizations
def get_filtered_data(min_val=min_age.value, max_val=max_age.value):
# Get filtered data based on slider values using parameterized query for safety
return new_memory_db.execute(
"""
SELECT
id,
name,
age,
email,
account_balance,
registration_date
FROM users_memory
WHERE age >= ? AND age <= ?
ORDER BY age
""",
[min_val, max_val],
).df()
return (get_filtered_data,)
@app.cell(hide_code=True)
def _(get_filtered_data):
def get_metrics(data=get_filtered_data()):
return {
"user count": len(data),
"avg_balance": data["account_balance"].mean() if len(data) > 0 else 0,
"total_balance": data["account_balance"].sum() if len(data) > 0 else 0,
}
return (get_metrics,)
@app.cell(hide_code=True)
def _(get_metrics, mo):
def metrics_display(metrics=get_metrics()):
return mo.hstack(
[
mo.vstack(
[
mo.md("### Selected Users"),
mo.md(f"## {metrics['user count']}"),
],
align="center",
),
mo.vstack(
[
mo.md("### Average Balance"),
mo.md(f"## ${metrics['avg_balance']:.2f}"),
],
align="center",
),
mo.vstack(
[
mo.md("### Total Balance"),
mo.md(f"## ${metrics['total_balance']:.2f}"),
],
align="center",
),
],
justify="space-between",
gap=2,
)
return (metrics_display,)
@app.cell(hide_code=True)
def _(get_filtered_data, max_age, min_age, mo, plotly_express):
def create_visualization(
data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value
):
if len(data) == 0:
return mo.ui.text("No data available for the selected age range.")
# Create visualizations for filtered data
fig1 = plotly_express.bar(
data,
x="name",
y="account_balance",
title=f"Account Balance by User (Age {min_val} - {max_val})",
labels={"account_balance": "Account Balance ($)", "name": "User"},
color="account_balance",
color_continuous_scale=plotly_express.colors.sequential.Plasma,
text_auto=".2s",
)
fig1.update_layout(
height=400,
xaxis_tickangle=-45,
margin=dict(t=50, b=70, l=50, r=30),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
fig1.update_traces(
textposition="outside",
)
fig2 = plotly_express.histogram(
data,
x="age",
nbins=min(10, len(set(data["age"]))),
title=f"Age Distribution (Age {min_val} - {max_val})",
color_discrete_sequence=["#4C78A8"],
opacity=0.8,
histnorm="probability density",
)
fig2.update_layout(
height=400,
margin=dict(t=50, b=70, l=50, r=30),
bargap=0.1,
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
fig3 = plotly_express.scatter(
data,
x="age",
y="account_balance",
title=f"Age vs. Account Balance (Age {min_val} - {max_val})",
labels={"account_balance": "Account Balance ($)", "age": "Age"},
color="age",
color_continuous_scale="Viridis",
size_max=25,
size="account_balance",
hover_name="name",
)
fig3.update_layout(
height=400,
margin=dict(t=50, b=70, l=50, r=30),
hoverlabel=dict(bgcolor="white", font_size=12),
template="plotly_white",
)
return mo.ui.tabs(
{
"Account Balance by User": mo.ui.plotly(fig1),
"Age Distribution": mo.ui.plotly(fig2),
"Age vs. Account Balance": mo.ui.plotly(fig3),
}
)
return (create_visualization,)
@app.cell(hide_code=True)
def _(
create_visualization,
get_filtered_data,
max_age,
metrics_display,
min_age,
mo,
):
def dashboard(
min_val=min_age.value,
max_val=max_age.value,
metrics=metrics_display(),
data=get_filtered_data(),
visualization=create_visualization(),
):
return mo.vstack(
[
mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"),
metrics,
mo.md("### Data Table"),
mo.ui.table(data, page_size=5),
mo.md("### Visualizations"),
visualization,
],
gap=2,
justify="space-between",
)
dashboard()
return
@app.cell(hide_code=True)
def _(mo):
mo.md(
rf"""
# Summary and Key Takeaways
In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent).
2. **Table creation**: We created tables with various data types, constraints, and primary keys.
3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading.
4. **SQL queries**: We executed various SQL queries directly and through marimo's UI components.
5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames.
6. **Joins and relationships**: We performed JOIN operations between tables to combine related data.
7. **Aggregation**: We used aggregate functions to summarize and analyze data.
8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames.
9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions.
10. **Visualization**: We created interactive visualizations and dashboards with Plotly and marimo.
DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases.
### Next Steps
- Try loading larger datasets into DuckDB
- Experiment with more complex queries and window functions
- Use DuckDB's COPY functionality to import/export data from/to files
- Create more advanced interactive dashboards with marimo and Plotly
"""
)
return
@app.cell(hide_code=True)
def _():
import marimo as mo
import duckdb
import polars as pl
import os
from datetime import date
import plotly.express as plotly_express
import plotly.graph_objects as plotly_graph_objects
import numpy as np
return date, duckdb, mo, os, pl, plotly_express
if __name__ == "__main__":
app.run()