petergy committed
Commit 305d481 · 1 Parent(s): 3fb7b66

Daft - Chapter 01

Files changed (1)
  1. daft/01_what_makes_daft_special.py +311 -0
daft/01_what_makes_daft_special.py ADDED
@@ -0,0 +1,311 @@
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "daft==0.4.14",
#     "marimo",
# ]
# ///

import marimo

__generated_with = "0.13.6"
app = marimo.App(width="medium")


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        # What Makes Daft Special?

        > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.

        Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
        """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🎯 Introducing Daft: A Unified Data Engine

        Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.

        The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.
        """
    )
    return


@app.cell(hide_code=True)
def _(daft, mo):
    mo.md(f"""You're running Daft version: `{daft.__version__}`""")
    return


@app.cell(hide_code=True)
def _(df_with_discount, discount_slider, mo):
    mo.vstack([
        discount_slider,
        df_with_discount.collect(),
    ])
    return


@app.cell
def _(daft, discount_slider):
    # Let's create a very simple Daft DataFrame
    df = daft.from_pydict(
        {
            "id": [1, 2, 3],
            "product_name": ["Laptop", "Mouse", "Keyboard"],
            "price": [1200, 25, 75],
        }
    )

    # Perform a basic operation: calculate a new price after discount
    df_with_discount = df.with_column(
        "discounted_price",
        df["price"] * (1 - discount_slider.value),
    )
    return (df_with_discount,)


@app.cell(hide_code=True)
def _(mo):
    discount_slider = mo.ui.slider(
        start=0.05,
        stop=0.5,
        step=0.05,
        label="Discount Rate:",
        show_value=True,
    )
    return (discount_slider,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🦀 Built with Rust: Performance and Simplicity

        One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:

        * **Performance**: [Rust](https://www.rust-lang.org/) is known for its speed and memory efficiency. Unlike systems built on the Java Virtual Machine (JVM), Rust doesn't have a garbage collector that can introduce unpredictable pauses. This often translates to faster execution and more predictable performance.
        * **Efficient Python Integration**: Daft uses Rust's native Python bindings. This allows Python code (like your DataFrame operations or User-Defined Functions, which we'll cover later) to interact closely with the Rust engine. This can reduce the overhead often seen when bridging Python with JVM-based systems (e.g., PySpark), especially for custom Python logic.
        * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.

        Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
        """
    )
    return


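@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        To make the Arrow point a bit more concrete, the cell below round-trips a small table between `pyarrow` and Daft. This is a minimal illustrative sketch rather than part of the original chapter; it assumes `pyarrow` is importable (Daft itself depends on it) and uses `daft.from_arrow` / `DataFrame.to_arrow` for the conversion.
        """
    )
    return


@app.cell
def _(daft):
    # A small Arrow table created directly with pyarrow
    import pyarrow as pa

    arrow_table = pa.table({"city": ["Vienna", "Budapest"], "population_m": [2.0, 1.7]})

    # Arrow data moves into a Daft DataFrame and back without a lossy conversion step
    df_from_arrow = daft.from_arrow(arrow_table)
    df_from_arrow.to_arrow()
    return

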
@app.cell(hide_code=True)
def _(mo):
    mo.center(
        mo.image(
            src="https://minio.peter.gy/static/assets/marimo/learn/daft/daft-anti-spark-social-club.jpeg",
            alt="Daft Anti Spark Social Club Meme",
            caption="💡 Fun Fact: Creators of Daft are proud members of the 'Anti Spark Social Club'.",
            width=512,
            height=682,
        )
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!""")
    return


@app.cell
def _(daft):
    trillion_rows_df = (
        daft.range(1_000_000_000_000)
        .with_column("times_2", daft.col("id") * 2)
        .filter(daft.col("id") % 2 == 0)
    )
    trillion_rows_df
    return (trillion_rows_df,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:""")
    return


@app.cell(hide_code=True)
def _(mo, trillion_rows_df):
    mo.mermaid(trillion_rows_df.explain(format='mermaid').split('\nSet')[0][11:-3])
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""")
    return


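@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        To see the "only on request" part in action, the illustrative cell below (an addition to the original chapter) asks for just a few rows of the trillion-row plan. Execution happens only when `.collect()` is called, and because only three rows are requested, Daft should finish quickly instead of materializing the full range.
        """
    )
    return


@app.cell
def _(trillion_rows_df):
    # Execution is triggered here; limiting to 3 rows keeps the work (and memory) tiny
    trillion_rows_df.limit(3).collect()
    return

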
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🌐 Scale Your Work: From Laptop to Cluster

        Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:

        * **Locally**: Utilizing multiple cores on your laptop or a single powerful machine for development or processing moderately sized datasets.
        * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines (see the configuration sketch right after this section).

        This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
        """
    )
    return


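@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        As a preview of that later chapter, here is a minimal sketch of what the switch typically looks like. It is not executed in this notebook and assumes `ray` is installed and a cluster address is available; the DataFrame code itself stays unchanged.

        ```python
        import daft

        # Point Daft at a Ray cluster instead of the default local runner
        daft.context.set_runner_ray(address="ray://<cluster-head-address>:10001")

        # Everything below is identical to the local version
        df = daft.range(1_000_000_000_000).filter(daft.col("id") % 2 == 0)
        df.limit(5).collect()
        ```
        """
    )
    return

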
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🖼️ Handling More Than Just Tables: Multimodal Data Support

        Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.

        Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.

        As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
        """
    )
    return


@app.cell
def _(daft):
    (
        # Fetch open data from the National Gallery of Art
        daft.read_csv(
            "https://github.com/NationalGalleryOfArt/opendata/raw/refs/heads/main/data/published_images.csv"
        )
        # Work only with the first 5 rows to reduce the latency of image fetching in this example
        .limit(5)
        # Select the object ID and the image thumbnail URL
        .select(
            daft.col("depictstmsobjectid").alias("objectid"),
            daft.col("iiifthumburl")
            # Download the content from the URL (string -> bytes)
            .url.download(on_error="null")
            # Decode the image bytes into an image object (bytes -> image)
            .image.decode()
            .alias("thumbnail"),
        )
        # Use Daft's built-in image resizing function to create smaller thumbnails
        .with_column(
            "thumbnail_resized",
            # Resize the 'thumbnail' image column
            daft.col("thumbnail").image.resize(w=32, h=32),
        )
        # Execute the plan and bring the results into memory
        .collect()
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""")
    return


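@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        As a tiny preview of that UDF API (covered properly in a later chapter), the illustrative cell below defines a custom function over a column. It is a minimal sketch: the column names and logic are made up for demonstration, and it assumes the `@daft.udf` decorator with a `return_dtype`, where the decorated function receives each column argument as a `daft.Series`.
        """
    )
    return


@app.cell
def _(daft):
    # A custom function that Daft applies to a whole column (a daft.Series) at a time
    @daft.udf(return_dtype=daft.DataType.int64())
    def name_length(names):
        return [len(name) for name in names.to_pylist()]

    daft.from_pydict({"artist": ["Vermeer", "Monet", "van Gogh"]}).with_column(
        "name_length", name_length(daft.col("artist"))
    ).collect()
    return

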
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces

        Daft aims to be developer-friendly by offering flexible ways to interact with your data:

        * **Pythonic DataFrame API**: If you've used Pandas, Polars, or similar libraries, Daft's Python API for DataFrames will feel quite natural. It provides a rich set of methods for data manipulation, transformation, and analysis.
        * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system (demonstrated below).

        This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
        """
    )
    return


@app.cell
def _(daft):
    df_simple = daft.from_pydict(
        {
            "item_code": [101, 102, 103, 104],
            "quantity": [5, 0, 12, 7],
            "region": ["North", "South", "North", "East"],
        }
    )
    return (df_simple,)


@app.cell
def _(df_simple):
    # Pandas-flavored API
    df_simple.where((df_simple["quantity"] > 0) & (df_simple["region"] == "North")).collect()
    return


@app.cell
def _(daft, df_simple):
    # Polars-flavored API
    df_simple.where((daft.col("quantity") > 0) & (daft.col("region") == "North")).collect()
    return


@app.cell
def _(daft, df_simple):
    # SQL Interface (df_simple must be in this cell's scope so daft.sql can resolve it by name)
    daft.sql("SELECT * FROM df_simple WHERE quantity > 0 AND region = 'North'").collect()
    return


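@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        You can also mix the two styles. The illustrative cell below (an addition to the original chapter) uses `daft.sql_expr`, which translates a SQL expression string into a native Daft expression that plugs straight into the Python DataFrame API.
        """
    )
    return


@app.cell
def _(daft, df_simple):
    # A SQL expression string compiled into a Daft expression, used inside the Python API
    df_simple.where(daft.sql_expr("quantity > 0 AND region = 'North'")).collect()
    return

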
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## 🟣 The Daft Advantage: Putting It All Together

        So, what makes Daft special? It's the combination of these design choices:

        * A **Rust-based core engine** provides a solid foundation for performance and memory management.
        * **Built-in scalability** means your code can transition from local development to distributed clusters (with Ray) with minimal changes.
        * **Native handling of multimodal data** opens doors for complex ML/AI and analytics tasks that go beyond traditional tabular data.
        * **Developer-centric Python and SQL APIs** offer flexibility and ease of use.

        These elements combine to make Daft a versatile tool for tackling modern data challenges.

        And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows 🚀.
        """
    )
    return


@app.cell
def _():
    import daft
    import marimo as mo
    return daft, mo


if __name__ == "__main__":
    app.run()