# /// script | |
# requires-python = ">=3.11" | |
# dependencies = [ | |
# "marimo", | |
# "duckdb==1.2.2", | |
# "polars==1.27.0", | |
# "numpy==2.2.4", | |
# "pyarrow==19.0.1", | |
# "pandas==2.2.3", | |
# "sqlglot==26.12.1", | |
# "plotly==5.23.1", | |
# "statsmodels==0.14.4", | |
# ] | |
# /// | |
import marimo | |
__generated_with = "0.13.4" | |
app = marimo.App(width="medium") | |
def _(mo): | |
mo.md( | |
rf""" | |
<p align="center"> | |
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/> | |
</p> | |
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
# 🦆 **DuckDB**: An Embeddable Analytical Database System | |
## What is DuckDB? | |
[DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server. | |
--- | |
## ⚡ Key Features | |
| Feature | Description | | |
|:---------|:-------------| | |
| **In-Process Architecture** | Runs directly within your application's memory space - no separate server needed, simplifying deployment | | |
| **Columnar Storage** | Data stored in columns instead of rows, dramatically improving performance for analytical queries | | |
| **Vectorized Execution** | Performs operations on entire columns at once, significantly speeding up data processing | | |
| **ACID Transactions** | Ensures data integrity and reliability across operations | | |
| **Multi-Language Support** | Provides APIs for `Python`, `R`, `Java`, `C++`, and more | | |
        | **Zero External Dependencies** | The engine ships as a single self-contained library with no external dependencies, making setup and deployment straightforward |
| **High Portability** | Works across various operating systems (Windows, macOS, Linux) and hardware architectures | | |
--- | |
## [Use Cases](https://github.com/davidgasquez/awesome-duckdb?tab=readme-ov-file): | |
- **Data Analysis and Exploration:** DuckDB is ideal for quickly querying and analyzing datasets, especially for initial exploratory analysis. | |
- **Embedded Analytics in Applications:** You can integrate DuckDB directly into your applications to provide analytical capabilities without the need for a separate database server. | |
- **ETL (Extract, Transform, Load) Processes:** DuckDB can be used to perform initial data transformation and cleaning steps as part of an ETL pipeline. | |
- **Data Science and Machine Learning Workflows:** It's a lightweight alternative to larger databases for prototyping data analysis and machine learning models. | |
- **Rapid Prototyping of Data Analysis Pipelines:** Quickly test and iterate on data analysis ideas without the complexity of setting up a full-blown database environment. | |
- **Small to Medium Datasets:** DuckDB shines when working with datasets that don't require the massive scalability of a traditional database server. | |
--- | |
### [Installation](https://duckdb.org/docs/installation/?version=stable&environment=python): | |
- Python installation: | |
``` | |
pip install duckdb | |
``` | |
``` | |
        conda install python-duckdb -c conda-forge
``` | |
/// attention | Note | |
        DuckDB works with any reasonably recent Python 3 (this notebook pins Python 3.11+ in its script metadata). You also need `pip` or `conda` available to install it.
/// | |
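        ---

        Once installed, a quick smoke test can be run straight from Python. This is a minimal sketch (separate from the notebook cells below) that uses DuckDB's default in-memory database:

        ```python
        import duckdb

        # runs against the default in-memory database; no file or server is involved
        duckdb.sql("SELECT 42 AS answer").show()
        ```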
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html) | |
DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage. | |
--- | |
| Feature | In-Memory Connection | File-Based Connection | | |
|:---------|:---------------------|:----------------------| | |
| Persistence | Temporary (lost when session ends) | Stored on disk (persists between sessions) | | |
| Use Cases | Quick analysis, ephemeral data, testing | Long-term storage, data that needs to be accessed later | | |
| Performance | Faster for most operations | Slightly slower but provides persistence | | |
        | Creation | `duckdb.connect(':memory:')` | `duckdb.connect('filename.db')` |
| Multiple Connection Access | Limited to single connection | Multiple connections can access the same database | | |
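
        As a minimal sketch of the two connection styles (plus the default), assuming only that `duckdb` is installed:

        ```python
        import duckdb

        mem_con = duckdb.connect(":memory:")       # temporary, lives only in this process
        file_con = duckdb.connect("analytics.db")  # persisted to a file on disk
        default_con = duckdb.connect()             # no argument also yields an in-memory database
        ```

        The file name `analytics.db` above is only an illustration; the cells below use `example.db`.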
""" | |
) | |
return | |
def _(os): | |
# Remove previous database if it exists | |
if os.path.exists("example.db"): | |
os.remove("example.db") | |
if not os.path.exists("data"): | |
os.makedirs("data") | |
return | |
def _(mo): | |
_df = mo.sql( | |
f""" | |
-- Print the DuckDB version | |
SELECT version() AS version_info | |
""" | |
) | |
return | |
def _(mo): | |
mo.md( | |
""" | |
## Creating DuckDB Connections | |
Let's create both types of DuckDB connections and explore their characteristics. | |
1. **In-memory connection**: Data exists only during the current session | |
2. **File-based connection**: Data persists between sessions | |
We'll then demonstrate the key differences between these connection types. | |
""" | |
) | |
return | |
def _(duckdb): | |
# Create an in-memory DuckDB connection | |
memory_db = duckdb.connect(":memory:") | |
# Create a file-based DuckDB connection | |
file_db = duckdb.connect("example.db") | |
return file_db, memory_db | |
def _(file_db, memory_db): | |
# Test both connections | |
memory_db.execute( | |
"CREATE TABLE IF NOT EXISTS mem_test (id INTEGER, name VARCHAR)" | |
) | |
memory_db.execute("INSERT INTO mem_test VALUES (1, 'Memory Test')") | |
file_db.execute( | |
"CREATE TABLE IF NOT EXISTS file_test (id INTEGER, name VARCHAR)" | |
) | |
file_db.execute("INSERT INTO file_test VALUES (1, 'File Test')") | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
## Testing Connection Persistence | |
Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist. | |
1. First, we'll query our tables to confirm the data was properly inserted | |
2. Then, we'll simulate an application restart by creating new connections | |
3. Finally, we'll check which data persists after the "restart" | |
""" | |
) | |
return | |
def _(mo): | |
mo.md(r"""## Current Database Contents""") | |
return | |
def _(mem_test, memory_db, mo): | |
_df = mo.sql( | |
f""" | |
SELECT * FROM mem_test | |
""", | |
engine=memory_db | |
) | |
return | |
def _(file_db, file_test, mo): | |
_df = mo.sql( | |
f""" | |
SELECT * FROM file_test | |
""", | |
engine=file_db | |
) | |
return | |
def _(): | |
# We don't actually close the connections here since we need them for later cells | |
# Just a placeholder for the concept | |
return | |
def _(mo): | |
mo.md(rf"""## 🔄 Simulating Application Restart...""") | |
return | |
def _(duckdb): | |
# Create new connections (simulating restart) | |
new_memory_db = duckdb.connect(":memory:") | |
new_file_db = duckdb.connect("example.db") | |
return new_file_db, new_memory_db | |
def _(new_memory_db): | |
# Try to query tables in the new memory connection | |
try: | |
new_memory_db.execute("SELECT * FROM mem_test").df() | |
memory_persistence = "✅ Data persisted in memory (unexpected)" | |
memory_data_available = True | |
except Exception as e: | |
memory_persistence = "❌ Data lost from memory (expected behavior)" | |
memory_data_available = False | |
return memory_data_available, memory_persistence | |
def _(new_file_db): | |
# Try to query tables in the new file connection | |
try: | |
file_data = new_file_db.execute("SELECT * FROM file_test").df() | |
file_persistence = "✅ Data persisted in file (expected behavior)" | |
file_data_available = True | |
except Exception as e: | |
file_persistence = "❌ Data lost from file (unexpected)" | |
file_data_available = False | |
file_data = None | |
return file_data, file_data_available, file_persistence | |
def _( | |
file_data_available, | |
file_persistence, | |
memory_data_available, | |
memory_persistence, | |
mo, | |
): | |
# Create an interactive display to show persistence results | |
persistence_results = mo.ui.table( | |
{ | |
"Connection Type": ["In-Memory Database", "File-Based Database"], | |
"Persistence Status": [memory_persistence, file_persistence], | |
"Data Available After Restart": [ | |
memory_data_available, | |
file_data_available, | |
], | |
} | |
) | |
return (persistence_results,) | |
def _(mo, persistence_results): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## Persistence Test Results""")], align="center"), | |
persistence_results, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(file_data, file_data_available, mo): | |
if file_data_available: | |
mo.md("### Persisted File-Based Data:") | |
mo.ui.table(file_data) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html) | |
DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints. | |
## Table Creation Options | |
DuckDB supports various table creation options, including: | |
- **Basic tables** with column definitions | |
- **Temporary tables** that exist only during the session | |
- **CREATE OR REPLACE** to recreate tables | |
- **Primary keys** and other constraints | |
- **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc. | |
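
        Temporary tables and `CREATE OR REPLACE` are not used in the cells below, so here is a small illustrative sketch (assuming an existing connection named `con`):

        ```python
        # session-scoped scratch table; dropped automatically when the connection closes
        con.execute("CREATE TEMP TABLE staging (id INTEGER, note VARCHAR)")

        # recreate a table from a query result, replacing any previous definition
        con.execute("CREATE OR REPLACE TABLE numbers AS SELECT * FROM range(5)")
        ```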
""" | |
) | |
return | |
def _(file_db, new_memory_db):
    # DROP TABLE IF EXISTS is already a no-op when the table is absent,
    # so no try/except wrapper is needed
    new_memory_db.execute("DROP TABLE IF EXISTS users_memory")
    file_db.execute("DROP TABLE IF EXISTS users_file")
    return
def _(file_db, new_memory_db): | |
# Create advanced users table in memory database with primary key | |
new_memory_db.execute(""" | |
CREATE TABLE users_memory ( | |
id INTEGER PRIMARY KEY, | |
name VARCHAR NOT NULL, | |
age INTEGER CHECK (age > 0), | |
email VARCHAR UNIQUE, | |
registration_date DATE DEFAULT CURRENT_DATE, | |
last_login TIMESTAMP, | |
account_balance DECIMAL(10,2) DEFAULT 0.00 | |
) | |
""") | |
# Create users table in file database | |
file_db.execute(""" | |
CREATE TABLE users_file ( | |
id INTEGER PRIMARY KEY, | |
name VARCHAR NOT NULL, | |
age INTEGER CHECK (age > 0), | |
email VARCHAR UNIQUE, | |
registration_date DATE DEFAULT CURRENT_DATE, | |
last_login TIMESTAMP, | |
account_balance DECIMAL(10,2) DEFAULT 0.00 | |
) | |
""") | |
return | |
def _(new_memory_db): | |
# Get table schema information using DuckDB's internal system tables | |
memory_schema = new_memory_db.execute(""" | |
SELECT column_name, data_type, is_nullable | |
FROM information_schema.columns | |
WHERE table_name = 'users_memory' | |
ORDER BY ordinal_position | |
""").df() | |
return (memory_schema,) | |
def _(memory_schema, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## 🔍 Table Schema Information """)], align="center" | |
), | |
mo.ui.table(memory_schema), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert) | |
DuckDB supports multiple ways to insert data: | |
1. **INSERT INTO VALUES**: Insert specific values | |
2. **INSERT INTO SELECT**: Insert data from query results | |
3. **Parameterized inserts**: Using prepared statements | |
4. **Bulk inserts**: For efficient loading of multiple rows | |
Let's demonstrate these different insertion methods: | |
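
        Of these, the cells below focus on parameterized single-row and bulk inserts. A small illustrative sketch of `INSERT INTO ... SELECT` and `executemany` (the table names are hypothetical, `con` is an existing connection):

        ```python
        # INSERT INTO ... SELECT: copy query results into another table
        con.execute("INSERT INTO users_archive SELECT * FROM users_memory WHERE age >= 40")

        # parameterized bulk insert via executemany
        rows = [(101, "Ivy"), (102, "Jack")]
        con.executemany("INSERT INTO simple_users (id, name) VALUES (?, ?)", rows)
        ```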
""" | |
) | |
return | |
def _(date): | |
today = date.today() | |
# First check if records already exist to avoid duplicate key errors | |
def safe_insert(connection, table_name, data): | |
""" | |
Safely insert data into a table by checking for existing IDs first | |
""" | |
# Check which IDs already exist in the table | |
existing_ids = ( | |
connection.execute(f"SELECT id FROM {table_name}") | |
.fetchdf()["id"] | |
.tolist() | |
) | |
# Filter out data with IDs that already exist | |
new_data = [record for record in data if record[0] not in existing_ids] | |
if not new_data: | |
print( | |
f"No new records to insert into {table_name}. All IDs already exist." | |
) | |
return 0 | |
# Prepare the placeholders for the SQL statement | |
placeholders = ", ".join( | |
["(" + ", ".join(["?"] * len(new_data[0])) + ")"] * len(new_data) | |
) | |
# Flatten the list of tuples for parameter binding | |
flat_data = [item for sublist in new_data for item in sublist] | |
# Perform the insertion | |
if flat_data: | |
columns = "(id, name, age, email, registration_date, last_login, account_balance)" | |
connection.execute( | |
f"INSERT INTO {table_name} {columns} VALUES {placeholders}", | |
flat_data, | |
) | |
return len(new_data) | |
return 0 | |
return (safe_insert,) | |
def _(): | |
# Prepare the data | |
user_data = [ | |
( | |
1, | |
"Alice", | |
25, | |
"[email protected]", | |
"2021-01-01", | |
"2023-01-15 14:30:00", | |
1250.75, | |
), | |
( | |
2, | |
"Bob", | |
30, | |
"[email protected]", | |
"2021-02-01", | |
"2023-02-10 09:15:22", | |
750.50, | |
), | |
( | |
3, | |
"Charlie", | |
35, | |
"[email protected]", | |
"2021-03-01", | |
"2023-03-05 17:45:10", | |
3200.25, | |
), | |
( | |
4, | |
"David", | |
40, | |
"[email protected]", | |
"2021-04-01", | |
"2023-04-20 10:30:45", | |
1800.00, | |
), | |
( | |
5, | |
"Emma", | |
45, | |
"[email protected]", | |
"2021-05-01", | |
"2023-05-12 11:20:30", | |
2500.00, | |
), | |
( | |
6, | |
"Frank", | |
50, | |
"[email protected]", | |
"2021-06-01", | |
"2023-06-18 16:10:15", | |
900.25, | |
), | |
] | |
return (user_data,) | |
def _(file_db, new_memory_db, safe_insert, user_data): | |
# Safely insert data into memory database | |
safe_insert(new_memory_db, "users_memory", user_data) | |
# Safely insert data into file database | |
safe_insert(file_db, "users_file", user_data) | |
return | |
def _(): | |
# If you need to add just one record, you can use a similar approach: | |
new_user = ( | |
7, | |
"Grace", | |
28, | |
"[email protected]", | |
"2021-07-01", | |
"2023-07-22 13:45:10", | |
1675.50, | |
) | |
return (new_user,) | |
def _(new_memory_db, new_user): | |
# Check if the ID exists before inserting | |
if not new_memory_db.execute( | |
"SELECT id FROM users_memory WHERE id = ?", [new_user[0]] | |
).fetchone(): | |
new_memory_db.execute( | |
""" | |
INSERT INTO users_memory (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
new_user, | |
) | |
print(f"Added user {new_user[1]} to users_memory") | |
else: | |
print(f"User with ID {new_user[0]} already exists in users_memory") | |
return | |
def _(file_db, new_user): | |
# Do the same for the file database | |
if not file_db.execute( | |
"SELECT id FROM users_file WHERE id = ?", [new_user[0]] | |
).fetchone(): | |
file_db.execute( | |
""" | |
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
new_user, | |
) | |
print(f"Added user {new_user[1]} to users_file") | |
else: | |
print(f"User with ID {new_user[0]} already exists in users_file") | |
return | |
def _(new_memory_db):
    # First try to update. DuckDB's Python API does not populate a reliable
    # rowcount, but an UPDATE returns the number of affected rows as its result,
    # so we fetch that count instead.
    rows_updated = new_memory_db.execute(
        """
        UPDATE users_memory
        SET name = ?, age = ?, email = ?,
            registration_date = ?, last_login = ?, account_balance = ?
        WHERE id = ?
        """,
        (
            "Henry",
            33,
            "henry@example.com",
            "2021-08-01",
            "2023-08-05 09:10:15",
            3100.75,
            8,  # id must be the last parameter to match the WHERE clause
        ),
    ).fetchone()[0]
    return (rows_updated,)
def _(mo, new_memory_db, rows_updated):
    # If no rows were updated, perform an insert instead (a manual upsert)
    if rows_updated == 0:
new_memory_db.execute( | |
""" | |
INSERT INTO users_memory | |
(id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
""", | |
( | |
8, | |
"Henry", | |
33, | |
"[email protected]", | |
"2021-08-01", | |
"2023-08-05 09:10:15", | |
3100.75, | |
), | |
) | |
mo.md( | |
f""" | |
Upserted Henry into users_memory. | |
""" | |
) | |
return | |
def _(file_db, mo): | |
# For DuckDB using ON CONFLICT, we need to specify the conflict target column | |
file_db.execute( | |
""" | |
INSERT INTO users_file (id, name, age, email, registration_date, last_login, account_balance) | |
VALUES (?, ?, ?, ?, ?, ?, ?) | |
ON CONFLICT (id) DO UPDATE SET | |
name = EXCLUDED.name, | |
age = EXCLUDED.age, | |
email = EXCLUDED.email, | |
registration_date = EXCLUDED.registration_date, | |
last_login = EXCLUDED.last_login, | |
account_balance = EXCLUDED.account_balance | |
""", | |
( | |
8, | |
"Henry", | |
33, | |
"[email protected]", | |
"2021-08-01", | |
"2023-08-05 09:10:15", | |
3100.75, | |
), | |
) | |
mo.md( | |
f""" | |
Upserted Henry into users_file. | |
""" | |
) | |
return | |
def _(new_memory_db): | |
# Display memory data using DuckDB's query capabilities | |
memory_results = new_memory_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
registration_date, | |
last_login, | |
account_balance | |
FROM users_memory | |
ORDER BY id | |
""").df() | |
return (memory_results,) | |
def _(file_db): | |
# Display file data with formatting | |
file_results = file_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
registration_date, | |
last_login, | |
CAST(account_balance AS DECIMAL(10,2)) AS account_balance | |
FROM users_file | |
ORDER BY id | |
""").df() | |
return (file_results,) | |
def _(file_results, memory_results, mo): | |
tabs = mo.ui.tabs( | |
{ | |
"In-Memory Database": mo.ui.table(memory_results), | |
"File-Based Database": mo.ui.table(file_results), | |
} | |
) | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## 📊 Database Contents After Insertion""")], | |
align="center", | |
), | |
tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
# [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select) | |
There are multiple ways to leverage DuckDB's SQL capabilities in marimo: | |
1. **Direct execution**: Using DuckDB connections to execute SQL | |
2. **marimo SQL**: Using marimo's built-in SQL engine | |
3. **Interactive queries**: Combining UI elements with SQL execution | |
Let's explore these approaches: | |
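
        For reference, marimo's SQL engine is exposed as `mo.sql`, which can query Python DataFrames by their variable name. A minimal sketch (with a hypothetical DataFrame `users_df`):

        ```python
        # marimo resolves `users_df` from the notebook's globals
        result_df = mo.sql("SELECT name, age FROM users_df WHERE age > 30")
        ```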
""" | |
) | |
return | |
def _(mo): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## 🔍 Query with marimo SQL""")], align="center"), | |
mo.md( | |
"### marimo has its own [built-in SQL engine](https://docs.marimo.io/guides/working_with_data/sql/) that can work with DataFrames." | |
), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(memory_results, mo): | |
# Create a SQL selector for users with age threshold | |
age_threshold = mo.ui.slider( | |
25, 50, value=30, label="Minimum Age", full_width=True, show_value=True | |
) | |
# Create a function to filter users based on the slider value | |
def filtered_users(): | |
# Use DuckDB directly instead of mo.sql with users param | |
filtered_df = memory_results[memory_results["age"] >= age_threshold.value] | |
filtered_df = filtered_df.sort_values("age") | |
return mo.ui.table(filtered_df) | |
return age_threshold, filtered_users | |
def _(age_threshold, filtered_users, mo): | |
layout = mo.vstack( | |
[ | |
mo.md("### Select minimum age:"), | |
age_threshold, | |
mo.md("### Users meeting age criteria:"), | |
filtered_users(), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
layout | |
return | |
def _(mo): | |
mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""") | |
return | |
def _(pl): | |
# Create a Polars DataFrame | |
polars_df = pl.DataFrame( | |
{ | |
"id": [101, 102, 103], | |
"name": ["Product A", "Product B", "Product C"], | |
"price": [29.99, 49.99, 19.99], | |
"category": ["Electronics", "Furniture", "Books"], | |
} | |
) | |
return (polars_df,) | |
def _(mo, polars_df): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Original Polars DataFrame:""")], align="center" | |
), | |
mo.ui.table(polars_df), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db, polars_df): | |
# Register the Polars DataFrame as a DuckDB table in memory connection | |
new_memory_db.register("products_polars", polars_df) | |
# Query the registered table | |
polars_query_result = new_memory_db.execute( | |
"SELECT * FROM products_polars WHERE price > 25" | |
).df() | |
return (polars_query_result,) | |
def _(mo, polars_query_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## DuckDB Query Result (From Polars Data):""")], | |
align="center", | |
), | |
mo.ui.table(polars_query_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
# Demonstrate a more complex query | |
complex_query_result = new_memory_db.execute(""" | |
SELECT | |
category, | |
COUNT(*) as product_count, | |
AVG(price) as avg_price, | |
MIN(price) as min_price, | |
MAX(price) as max_price | |
FROM products_polars | |
GROUP BY category | |
ORDER BY avg_price DESC | |
""").df() | |
return (complex_query_result,) | |
def _(complex_query_result, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregated Product Data by Category:""")], | |
align="center", | |
), | |
mo.ui.table(complex_query_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""") | |
return | |
def _(new_memory_db): | |
# Create another table to join with | |
new_memory_db.execute(""" | |
CREATE TABLE IF NOT EXISTS departments ( | |
id INTEGER, | |
department_name VARCHAR, | |
manager_id INTEGER | |
) | |
""") | |
return | |
def _(new_memory_db): | |
new_memory_db.execute(""" | |
INSERT INTO departments VALUES | |
(101, 'Engineering', 1), | |
(102, 'Marketing', 2), | |
(103, 'Finance', NULL) | |
""") | |
return | |
def _(new_memory_db): | |
# Execute a join query | |
join_result = new_memory_db.execute(""" | |
SELECT | |
u.id, | |
u.name, | |
u.age, | |
d.department_name | |
FROM users_memory u | |
LEFT JOIN departments d ON u.id = d.manager_id | |
ORDER BY u.id | |
""").df() | |
return (join_result,) | |
def _(mo): | |
mo.md( | |
rf""" | |
<!-- Display the join result --> | |
## Join Result (Users and Departments): | |
""" | |
) | |
return | |
def _(join_result, mo): | |
mo.ui.table(join_result) | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
<!-- Demonstrate different types of joins --> | |
## Different Types of Joins | |
""" | |
) | |
return | |
def _(new_memory_db): | |
# Inner join | |
inner_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
INNER JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Right join | |
right_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
RIGHT JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Full outer join | |
full_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
FULL OUTER JOIN departments d ON u.id = d.manager_id | |
""").df() | |
# Cross join | |
cross_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, d.department_name | |
FROM users_memory u | |
CROSS JOIN departments d | |
""").df() | |
# Self join (Joining user table with itself to find users with the same age) | |
self_join = new_memory_db.execute(""" | |
SELECT u1.id, u1.name, u2.name AS same_age_user | |
FROM users_memory u1 | |
JOIN users_memory u2 ON u1.age = u2.age AND u1.id <> u2.id | |
""").df() | |
# Semi join (Finding users who are also managers) | |
semi_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, u.age | |
FROM users_memory u | |
WHERE EXISTS ( | |
SELECT 1 FROM departments d | |
WHERE u.id = d.manager_id | |
) | |
""").df() | |
# Anti join (Finding users who are not managers) | |
anti_join = new_memory_db.execute(""" | |
SELECT u.id, u.name, u.age | |
FROM users_memory u | |
WHERE NOT EXISTS ( | |
SELECT 1 FROM departments d | |
WHERE u.id = d.manager_id | |
) | |
""").df() | |
return ( | |
anti_join, | |
cross_join, | |
full_join, | |
inner_join, | |
right_join, | |
self_join, | |
semi_join, | |
) | |
def _(mo, new_memory_db): | |
# Display base table side by side | |
users = new_memory_db.execute("SELECT * FROM users_memory").df() | |
departments = new_memory_db.execute("SELECT * FROM departments").df() | |
base_tables = mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""# Base Tables""")], align="center"), | |
mo.ui.tabs( | |
{ | |
"User Table": mo.ui.table(users), | |
"Departments Table": mo.ui.table(departments), | |
} | |
), | |
] | |
) | |
base_tables | |
return | |
def _( | |
anti_join, | |
cross_join, | |
full_join, | |
inner_join, | |
join_result, | |
mo, | |
right_join, | |
self_join, | |
semi_join, | |
): | |
join_description = { | |
"Left Join": "Shows all records from the left table and matching records from the right table. Non-matches filled with NULL.", | |
"Inner Join": "Shows only the records where there's a match in both tables.", | |
"Right Join": "Shows all records from the right table and matching records from the left table. Non-matches filled with NULL.", | |
"Full Outer Join": "Shows all records from both tables, with NULL values where there's no match.", | |
"Cross Join": "Returns the Cartesian product - all possible combinations of rows from both tables.", | |
"Self Join": "Joins a table with itself, used to compare rows within the same table.", | |
"Semi Join": "Returns rows from the first table where one or more matches exist in the second table.", | |
"Anti Join": "Returns rows from the first table where no matches exist in the second table.", | |
} | |
join_tabs = mo.ui.tabs( | |
{ | |
"Left Join": mo.ui.table(join_result), | |
"Inner Join": mo.ui.table(inner_join), | |
"Right Join": mo.ui.table(right_join), | |
"Full Outer Join": mo.ui.table(full_join), | |
"Cross Join": mo.ui.table(cross_join), | |
"Self Join": mo.ui.table(self_join), | |
"Semi Join": mo.ui.table(semi_join), | |
"Anti Join": mo.ui.table(anti_join), | |
} | |
) | |
return join_description, join_tabs | |
def _(join_description, join_tabs, mo): | |
join_display = mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""# SQL Join Operations""")], align="center"), | |
mo.md(f"**{join_tabs.value}**: {join_description[join_tabs.value]}"), | |
mo.md("## Join Results"), | |
join_tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
join_display | |
return | |
def _(mo): | |
mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""") | |
return | |
def _(new_memory_db): | |
# Execute an aggregate query | |
agg_result = new_memory_db.execute(""" | |
SELECT | |
AVG(age) as avg_age, | |
MAX(age) as max_age, | |
MIN(age) as min_age, | |
COUNT(*) as total_users, | |
SUM(account_balance) as total_balance | |
FROM users_memory | |
""").df() | |
return (agg_result,) | |
def _(agg_result, mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregate Results (All Users):""")], align="center" | |
), | |
mo.ui.table(agg_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
age_groups = new_memory_db.execute(""" | |
SELECT | |
CASE | |
WHEN age < 30 THEN 'Under 30' | |
WHEN age BETWEEN 30 AND 40 THEN '30 to 40' | |
ELSE 'Over 40' | |
END as age_group, | |
COUNT(*) as count, | |
AVG(age) as avg_age, | |
AVG(account_balance) as avg_balance | |
FROM users_memory | |
GROUP BY 1 | |
ORDER BY 1 | |
""").df() | |
return (age_groups,) | |
def _(age_groups, mo):
    # The table is already included in the vstack below, so no separate call is needed
    mo.vstack(
[ | |
mo.vstack( | |
[mo.md(f"""## Aggregate Results (Grouped by Age Range):""")], | |
align="center", | |
), | |
mo.ui.table(age_groups), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
window_result = new_memory_db.execute(""" | |
SELECT | |
id, | |
name, | |
age, | |
account_balance, | |
RANK() OVER (ORDER BY account_balance DESC) as balance_rank, | |
account_balance - AVG(account_balance) OVER () as diff_from_avg, | |
account_balance / SUM(account_balance) OVER () * 100 as pct_of_total | |
FROM users_memory | |
ORDER BY balance_rank | |
""").df() | |
return (window_result,) | |
def _(mo, window_result): | |
mo.vstack( | |
[ | |
mo.vstack([mo.md(f"""## Window Functions Example""")], align="center"), | |
mo.ui.table(window_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""") | |
return | |
def _(new_memory_db): | |
polars_result = new_memory_db.execute( | |
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age""" | |
).pl() | |
return (polars_result,) | |
def _(mo, polars_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Query Result as Polars DataFrame:""")], | |
align="center", | |
), | |
mo.ui.table(polars_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(new_memory_db): | |
pandas_result = new_memory_db.execute( | |
"""SELECT * FROM users_memory WHERE age > 25 ORDER BY age""" | |
).fetch_df() | |
return (pandas_result,) | |
def _(mo, pandas_result): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Same Query Result as Pandas DataFrame:""")], | |
align="center", | |
), | |
mo.ui.table(pandas_result), | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Differences in DataFrame Handling""")], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md( | |
f"""## Polars: Filter users over 35 and calculate average balance""" | |
) | |
], | |
align="start", | |
), | |
], | |
gap=2, justify="space-between", | |
) | |
return | |
def _(mo, pl, polars_result): | |
def _(): | |
polars_filtered = polars_result.filter(pl.col("age") > 35) | |
polars_avg = polars_filtered.select( | |
pl.col("account_balance").mean().alias("avg_balance") | |
) | |
layout = mo.vstack( | |
[ | |
mo.md("### Filtered Polars DataFrame (Age > 35):"), | |
mo.ui.table(polars_filtered), | |
mo.md("### Average Account Balance:"), | |
mo.ui.table(polars_avg), | |
], | |
gap=2, | |
) | |
return layout | |
_() | |
return | |
def _(mo, pandas_result): | |
pandas_avg = pandas_result[pandas_result["age"] > 35]["account_balance"].mean() | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Pandas: Same operation in pandas style""")], | |
align="center", | |
), | |
mo.vstack( | |
[mo.md(f"""### Average balance: {pandas_avg:.2f}""")], | |
align="start", | |
), | |
] | |
) | |
return | |
def _(mo): | |
mo.md("""# 9. Data Visualization with DuckDB and Plotly""") | |
return | |
def _(age_groups, mo, new_memory_db, plotly_express): | |
# User distribution by age group | |
fig1 = plotly_express.bar( | |
age_groups, | |
x="age_group", | |
y="count", | |
title="User Distribution by Age Group", | |
labels={"count": "Number of Users", "age_group": "Age Group"}, | |
color="age_group", | |
color_discrete_sequence=plotly_express.colors.qualitative.Plotly, | |
) | |
fig1.update_traces( | |
text=age_groups["count"], | |
textposition="outside", | |
) | |
fig1.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Average balance by age group | |
fig2 = plotly_express.bar( | |
age_groups, | |
x="age_group", | |
y="avg_balance", | |
title="Average Account Balance by Age Group", | |
labels={"avg_balance": "Average Balance ($)", "age_group": "Age Group"}, | |
color="age_group", | |
color_discrete_sequence=plotly_express.colors.qualitative.Plotly, | |
) | |
fig2.update_traces( | |
text=[f"${val:.2f}" for val in age_groups["avg_balance"]], | |
textposition="outside", | |
) | |
fig2.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Age vs Account Balance scatter plot | |
scatter_data = new_memory_db.execute( | |
""" | |
SELECT | |
name, | |
age, | |
account_balance | |
FROM users_memory | |
ORDER BY age | |
""" | |
).df() | |
fig3 = plotly_express.scatter( | |
scatter_data, | |
x="age", | |
y="account_balance", | |
title="Age vs. Account Balance", | |
labels={"account_balance": "Account Balance ($)", "age": "Age"}, | |
color_discrete_sequence=["#FF7F0E"], | |
trendline="ols", | |
hover_data=["age", "account_balance"], | |
size_max=15, | |
) | |
fig3.update_traces(marker=dict(size=12)) | |
fig3.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
# Distribution of account balances | |
balance_data = new_memory_db.execute( | |
""" | |
SELECT | |
name, | |
account_balance | |
FROM users_memory | |
ORDER BY account_balance DESC | |
""" | |
).df() | |
fig4 = plotly_express.pie( | |
balance_data, | |
names="name", | |
values="account_balance", | |
title="Distribution of Account Balances", | |
labels={"account_balance": "Account Balance ($)", "name": "User"}, | |
color_discrete_sequence=plotly_express.colors.qualitative.Pastel, | |
) | |
fig4.update_traces(textinfo="percent+label", textposition="inside") | |
fig4.update_layout( | |
height=450, | |
margin=dict(t=50, b=50, l=50, r=25), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
category_tabs = mo.ui.tabs( | |
{ | |
"Age Group Analysis": mo.vstack( | |
[ | |
mo.ui.tabs( | |
{ | |
"User Distribution": mo.ui.plotly(fig1), | |
"Average Balance": mo.ui.plotly(fig2), | |
} | |
) | |
], | |
gap=2, | |
justify="space-between", | |
), | |
"Financial Analysis": mo.vstack( | |
[ | |
mo.ui.tabs( | |
{ | |
"Age vs Balance": mo.ui.plotly(fig3), | |
"Balance Distribution": mo.ui.plotly(fig4), | |
} | |
) | |
], | |
gap=2, | |
justify="space-between", | |
), | |
}, | |
lazy=True, | |
) | |
mo.vstack( | |
[ | |
mo.vstack( | |
[mo.md(f"""## Select a visualization category:""")], | |
align="start", | |
), | |
category_tabs, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
return | |
def _(mo): | |
mo.md( | |
r""" | |
/// admonition | | |
## Database Management Best Practices | |
/// | |
### Closing Connections | |
It's important to close database connections when you're done with them, especially for file-based connections: | |
```python | |
memory_db.close() | |
file_db.close() | |
``` | |
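
        Connections can also be used as context managers, which closes them automatically when the block exits (a small sketch):

        ```python
        with duckdb.connect("mydb.db") as conn:
            conn.execute("SELECT 1")
        # conn is closed here
        ```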
### Transaction Management | |
DuckDB supports transactions, which can be useful for more complex operations: | |
```python | |
conn = duckdb.connect('mydb.db') | |
conn.begin() # Start transaction | |
try: | |
conn.execute("INSERT INTO users VALUES (1, 'Test User')") | |
conn.execute("UPDATE balances SET amount = amount - 100 WHERE user_id = 1") | |
conn.commit() # Commit changes | |
except: | |
conn.rollback() # Undo changes if error | |
raise | |
``` | |
### Query Performance | |
DuckDB is optimized for analytical queries. For best performance: | |
- Use appropriate data types | |
- Create indexes for frequently queried columns | |
- For large datasets, consider partitioning | |
- Use prepared statements for repeated queries | |
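
        A short illustrative sketch of the last two points (index creation and a reusable parameterized query); the table and column names are placeholders:

        ```python
        conn = duckdb.connect("mydb.db")

        # an explicit index can help highly selective lookups on a large table
        conn.execute("CREATE INDEX idx_users_age ON users (age)")

        # parameter binding avoids rebuilding the SQL string and protects against injection
        conn.execute("SELECT * FROM users WHERE age > ?", [30])
        ```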
""" | |
) | |
return | |
def _(mo): | |
mo.md(rf"""## 10. Interactive DuckDB Dashboard with marimo and Plotly""") | |
return | |
def _(mo): | |
# Create an interactive filter for age range | |
min_age = mo.ui.slider(20, 50, value=25, label="Minimum Age") | |
max_age = mo.ui.slider(20, 50, value=50, label="Maximum Age") | |
return max_age, min_age | |
def _(max_age, min_age, new_memory_db): | |
# Create a function to filter data and update visualizations | |
def get_filtered_data(min_val=min_age.value, max_val=max_age.value): | |
# Get filtered data based on slider values using parameterized query for safety | |
return new_memory_db.execute( | |
""" | |
SELECT | |
id, | |
name, | |
age, | |
email, | |
account_balance, | |
registration_date | |
FROM users_memory | |
WHERE age >= ? AND age <= ? | |
ORDER BY age | |
""", | |
[min_val, max_val], | |
).df() | |
return (get_filtered_data,) | |
def _(get_filtered_data): | |
def get_metrics(data=get_filtered_data()): | |
return { | |
"user count": len(data), | |
"avg_balance": data["account_balance"].mean() if len(data) > 0 else 0, | |
"total_balance": data["account_balance"].sum() if len(data) > 0 else 0, | |
} | |
return (get_metrics,) | |
def _(get_metrics, mo): | |
def metrics_display(metrics=get_metrics()): | |
return mo.hstack( | |
[ | |
mo.vstack( | |
[ | |
mo.md("### Selected Users"), | |
mo.md(f"## {metrics['user count']}"), | |
], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md("### Average Balance"), | |
mo.md(f"## ${metrics['avg_balance']:.2f}"), | |
], | |
align="center", | |
), | |
mo.vstack( | |
[ | |
mo.md("### Total Balance"), | |
mo.md(f"## ${metrics['total_balance']:.2f}"), | |
], | |
align="center", | |
), | |
], | |
justify="space-between", | |
gap=2, | |
) | |
return (metrics_display,) | |
def _(get_filtered_data, max_age, min_age, mo, plotly_express): | |
def create_visualization( | |
data=get_filtered_data(), min_val=min_age.value, max_val=max_age.value | |
): | |
if len(data) == 0: | |
return mo.ui.text("No data available for the selected age range.") | |
# Create visualizations for filtered data | |
fig1 = plotly_express.bar( | |
data, | |
x="name", | |
y="account_balance", | |
title=f"Account Balance by User (Age {min_val} - {max_val})", | |
labels={"account_balance": "Account Balance ($)", "name": "User"}, | |
color="account_balance", | |
color_continuous_scale=plotly_express.colors.sequential.Plasma, | |
text_auto=".2s", | |
) | |
fig1.update_layout( | |
height=400, | |
xaxis_tickangle=-45, | |
margin=dict(t=50, b=70, l=50, r=30), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
fig1.update_traces( | |
textposition="outside", | |
) | |
fig2 = plotly_express.histogram( | |
data, | |
x="age", | |
nbins=min(10, len(set(data["age"]))), | |
title=f"Age Distribution (Age {min_val} - {max_val})", | |
color_discrete_sequence=["#4C78A8"], | |
opacity=0.8, | |
histnorm="probability density", | |
) | |
fig2.update_layout( | |
height=400, | |
margin=dict(t=50, b=70, l=50, r=30), | |
bargap=0.1, | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
fig3 = plotly_express.scatter( | |
data, | |
x="age", | |
y="account_balance", | |
title=f"Age vs. Account Balance (Age {min_val} - {max_val})", | |
labels={"account_balance": "Account Balance ($)", "age": "Age"}, | |
color="age", | |
color_continuous_scale="Viridis", | |
size_max=25, | |
size="account_balance", | |
hover_name="name", | |
) | |
fig3.update_layout( | |
height=400, | |
margin=dict(t=50, b=70, l=50, r=30), | |
hoverlabel=dict(bgcolor="white", font_size=12), | |
template="plotly_white", | |
) | |
return mo.ui.tabs( | |
{ | |
"Account Balance by User": mo.ui.plotly(fig1), | |
"Age Distribution": mo.ui.plotly(fig2), | |
"Age vs. Account Balance": mo.ui.plotly(fig3), | |
} | |
) | |
return (create_visualization,) | |
def _( | |
create_visualization, | |
get_filtered_data, | |
max_age, | |
metrics_display, | |
min_age, | |
mo, | |
): | |
def dashboard( | |
min_val=min_age.value, | |
max_val=max_age.value, | |
metrics=metrics_display(), | |
data=get_filtered_data(), | |
visualization=create_visualization(), | |
): | |
return mo.vstack( | |
[ | |
mo.md(f"### Interactive Dashboard (Age {min_val} - {max_val})"), | |
metrics, | |
mo.md("### Data Table"), | |
mo.ui.table(data, page_size=5), | |
mo.md("### Visualizations"), | |
visualization, | |
], | |
gap=2, | |
justify="space-between", | |
) | |
dashboard() | |
return | |
def _(mo): | |
mo.md( | |
rf""" | |
# Summary and Key Takeaways | |
In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered: | |
1. **Connection types**: We learned the difference between in-memory databases (temporary) and file-based databases (persistent). | |
2. **Table creation**: We created tables with various data types, constraints, and primary keys. | |
3. **Data insertion**: We demonstrated different ways to insert data, including single inserts and bulk loading. | |
4. **SQL queries**: We executed various SQL queries directly and through marimo's UI components. | |
5. **Integration with Polars**: We showed how DuckDB can work seamlessly with Polars DataFrames. | |
6. **Joins and relationships**: We performed JOIN operations between tables to combine related data. | |
7. **Aggregation**: We used aggregate functions to summarize and analyze data. | |
8. **Data conversion**: We converted DuckDB results to both Polars and Pandas DataFrames. | |
9. **Best practices**: We reviewed best practices for managing DuckDB connections and transactions. | |
10. **Visualization**: We created interactive visualizations and dashboards with Plotly and marimo. | |
DuckDB is an excellent tool for data analysis, especially for analytical workloads. Its in-process nature makes it fast and easy to use, while its SQL compatibility makes it accessible for anyone familiar with SQL databases. | |
### Next Steps | |
- Try loading larger datasets into DuckDB | |
- Experiment with more complex queries and window functions | |
- Use DuckDB's COPY functionality to import/export data from/to files | |
- Create more advanced interactive dashboards with marimo and Plotly | |
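
        For the COPY suggestion above, a minimal sketch looks like this (the file name is illustrative):

        ```python
        # export a table to Parquet, then query the file directly
        conn.execute("COPY users_memory TO 'users.parquet' (FORMAT PARQUET)")
        conn.execute("SELECT COUNT(*) FROM 'users.parquet'").fetchone()
        ```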
""" | |
) | |
return | |
def _(): | |
import marimo as mo | |
import duckdb | |
import polars as pl | |
import os | |
from datetime import date | |
import plotly.express as plotly_express | |
import plotly.graph_objects as plotly_graph_objects | |
import numpy as np | |
return date, duckdb, mo, os, pl, plotly_express | |
if __name__ == "__main__": | |
app.run() | |