# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "altair==5.5.0",
#     "beautifulsoup4==4.13.3",
#     "httpx==0.28.1",
#     "marimo",
#     "nest-asyncio==1.6.0",
#     "numba==0.61.0",
#     "numpy==2.1.3",
#     "polars==1.24.0",
# ]
# ///

import marimo

__generated_with = "0.11.17"
app = marimo.App(width="medium")


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
# User-Defined Functions

_By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.

Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic.

In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to effectively incorporate them into your Polars workflows. We'll walk through a complete, practical example.
"""
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
## ⚖️ The Cost of UDFs

> Performance vs. Flexibility

Polars' built-in expressions are highly optimized for speed and parallel processing. User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*.

However, UDFs become inevitable when you need to:

- **Integrate external libraries:** Use functionality not directly available in Polars.
- **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions.

Let's dive into a real-world project where UDFs were the only way to get the job done, demonstrating a scenario where native Polars expressions simply weren't sufficient.
"""
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
## 📊 Project Overview

> Scraping and Analyzing Observable Notebook Statistics

If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks – indicators of popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web.
"""
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.hstack(
        [
            mo.image(
                "https://minio.peter.gy/static/assets/marimo/learn/polars/14_d3-gallery.png?0",
                width=600,
                caption="Screenshot of https://observablehq.com/@d3/gallery",
            ),
            mo.image(
                "https://minio.peter.gy/static/assets/marimo/learn/polars/14_plot-gallery.png?0",
                width=600,
                caption="Screenshot of https://observablehq.com/@observablehq/plot-gallery",
            ),
        ]
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Then, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme.
This will involve multiple steps, showcasing different UDF approaches.""")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.mermaid('''
    graph LR;
        url_df --> |"UDF: Fetch HTML"| html_df
        html_df --> |"UDF: Parse with BeautifulSoup"| parsed_html_df
        parsed_html_df --> |"Native Polars: Extract Data"| notebooks_df
        notebooks_df --> |"UDF: Get Notebook Stats"| notebook_stats_df
        notebook_stats_df --> |"Numba UDF: Compute Popularity"| notebook_popularity_df
    ''')
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks.""")
    return


@app.cell(hide_code=True)
def _(pl):
    url_df = pl.from_dict(
        {
            "url": [
                "https://observablehq.com/@d3/gallery",
                "https://observablehq.com/@observablehq/plot-gallery",
            ]
        }
    )
    url_df
    return (url_df,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
## 🔂 Element-Wise UDFs

> Processing Value by Value

The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column.

Our first task is to fetch the HTML content for each URL in `url_df`. We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression.

You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization.
Think of it as giving Polars a "heads-up" about the data type it should expect.
"""
    )
    return


@app.cell(hide_code=True)
def _(httpx, pl, url_df):
    html_df = url_df.with_columns(
        html=pl.col("url").map_elements(
            # Called once per row: one blocking HTTP GET per URL,
            # with the response body stored as the cell value.
            lambda url: httpx.get(url).text,
            return_dtype=pl.String,
        )
    )
    html_df
    return (html_df,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this.

These Observable pages are built with [Next.js](https://nextjs.org/), which helpfully serializes page properties as JSON within the HTML. This simplifies our UDF: we'll extract the raw JSON from the `