petergy committed on
Commit
7cfe107
·
1 Parent(s): 9b7fd9a

Polars chapter on strings - initial

Files changed (1)
  1. polars/09_strings.py +963 -0
polars/09_strings.py ADDED
@@ -0,0 +1,963 @@
1
+ import marimo
2
+
3
+ __generated_with = "0.11.13"
4
+ app = marimo.App(width="medium")
5
+
6
+
7
+ @app.cell
8
+ def _(mo):
9
+ mo.md(
10
+ r"""
11
+ # Strings
12
+
13
+ _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
14
+
15
+ In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it—the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner.
16
+
17
+ We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions.
18
+ """
19
+ )
20
+ return
21
+
22
+
23
+ @app.cell
24
+ def _(mo):
25
+ mo.md(
26
+ r"""
27
+ ## 🛠️ Parsing & Conversion
28
+
29
+ Let's warm up with one of the most frequent use cases: parsing raw strings into various formats.
30
+ We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types.
31
+ """
32
+ )
33
+ return
34
+
35
+
36
+ @app.cell(hide_code=True)
37
+ def _(pl):
38
+ pip_metadata_raw_df = pl.DataFrame(
39
+ [
40
+ '{"package": "polars", "version": "1.24.0", "released_at": "2025-03-02T20:31:12+0000", "size_mb": "30.9"}',
41
+ '{"package": "marimo", "version": "0.11.14", "released_at": "2025-03-04T00:28:57+0000", "size_mb": "10.7"}',
42
+ ],
43
+ schema={"raw_json": pl.String},
44
+ )
45
+ pip_metadata_raw_df
46
+ return (pip_metadata_raw_df,)
47
+
48
+
49
+ @app.cell
50
+ def _(mo):
51
+ mo.md(r"""We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs, and the [`unnest`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to give each parsed attribute its own column.""")
52
+ return
53
+
54
+
55
+ @app.cell
56
+ def _(pip_metadata_raw_df, pl):
57
+ pip_metadata_df = pip_metadata_raw_df.select(json=pl.col('raw_json').str.json_decode()).unnest('json')
58
+ pip_metadata_df
59
+ return (pip_metadata_df,)
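For intuition, the single-row equivalent of decoding and unnesting can be sketched with the standard library's `json` module. This is only a plain-Python illustration, not how Polars implements `json_decode`; the variable names are made up for the example.

```python
import json

raw = '{"package": "polars", "version": "1.24.0", "released_at": "2025-03-02T20:31:12+0000", "size_mb": "30.9"}'

# Decoding yields one key per attribute, mirroring the columns we get after `unnest`.
record = json.loads(raw)
columns = list(record)

# Every value is still a string, because the JSON source quoted all of them.
value_types = {key: type(value).__name__ for key, value in record.items()}

print(columns)
print(value_types)
```

The type summary is the reason the following cells convert `size_mb` and `released_at` explicitly.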
60
+
61
+
62
+ @app.cell
63
+ def _(mo):
64
+ mo.md(r"""This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns.""")
65
+ return
66
+
67
+
68
+ @app.cell
69
+ def _(mo):
70
+ mo.md(r"""Since the `size_mb` column should hold decimal numbers, we use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion.""")
71
+ return
72
+
73
+
74
+ @app.cell
75
+ def _(pip_metadata_df, pl):
76
+ pip_metadata_df.select(
77
+ 'package',
78
+ 'version',
79
+ pl.col('size_mb').str.to_decimal(),
80
+ )
81
+ return
82
+
83
+
84
+ @app.cell
85
+ def _(mo):
86
+ mo.md(
87
+ r"""
88
+ Moving on to the `released_at` attribute, which indicates the exact time when a given Python package was released, we have a few more options to consider. We can convert to the `Date`, `Datetime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion; all we need is to provide the desired format string.
89
+
90
+ Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings.
91
+
92
+ Here's a quick reference:
93
+
94
+ | Specifier | Meaning |
95
+ |-----------|--------------------|
96
+ | `%Y` | Year (e.g., 2025) |
97
+ | `%m` | Month (01-12) |
98
+ | `%d` | Day (01-31) |
99
+ | `%H` | Hour (00-23) |
+ | `%M` | Minute (00-59) |
+ | `%S` | Second (00-60) |
100
+ | `%z` | UTC offset |
101
+
102
+ The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string.
103
+ """
104
+ )
105
+ return
106
+
107
+
108
+ @app.cell
109
+ def _(pip_metadata_df, pl):
110
+ pip_metadata_df.select(
111
+ 'package',
112
+ 'version',
113
+ release_date=pl.col('released_at').str.to_date('%Y-%m-%dT%H:%M:%S%z'),
114
+ release_datetime=pl.col('released_at').str.to_datetime('%Y-%m-%dT%H:%M:%S%z'),
115
+ release_time=pl.col('released_at').str.to_time('%Y-%m-%dT%H:%M:%S%z'),
116
+ )
117
+ return
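A quick way to sanity-check a format string before handing it to Polars is Python's own `datetime.strptime`. For the specifiers used here (`%Y`, `%m`, `%d`, `%H`, `%M`, `%S`, `%z`) Python and `chrono` agree, but that is an assumption that holds for this format only; other specifiers can differ between the two.

```python
from datetime import datetime

FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# Parse one of the raw `released_at` strings with the same format string.
parsed = datetime.strptime("2025-03-02T20:31:12+0000", FORMAT)

# The three granularities discussed above: date, datetime, and time.
release_date = parsed.date()
release_time = parsed.time()
utc_offset_seconds = parsed.utcoffset().total_seconds()

print(parsed.isoformat(), release_date, release_time)
```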
118
+
119
+
120
+ @app.cell
121
+ def _(mo):
122
+ mo.md(r"""Alternatively, instead of using three different functions to perform the conversions, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html), which takes the desired temporal data type as its first parameter.""")
123
+ return
124
+
125
+
126
+ @app.cell
127
+ def _(pip_metadata_df, pl):
128
+ pip_metadata_df.select(
129
+ 'package',
130
+ 'version',
131
+ release_date=pl.col('released_at').str.strptime(pl.Date, '%Y-%m-%dT%H:%M:%S%z'),
132
+ release_datetime=pl.col('released_at').str.strptime(pl.Datetime, '%Y-%m-%dT%H:%M:%S%z'),
133
+ release_time=pl.col('released_at').str.strptime(pl.Time, '%Y-%m-%dT%H:%M:%S%z'),
134
+ )
135
+ return
136
+
137
+
138
+ @app.cell
139
+ def _(mo):
140
+ mo.md(r"""And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax.""")
141
+ return
142
+
143
+
144
+ @app.cell
145
+ def _(pip_metadata_raw_df, pl):
146
+ pip_metadata_raw_df.select(
147
+ package=pl.col("raw_json").str.json_path_match("$.package"),
148
+ version=pl.col("raw_json").str.json_path_match("$.version"),
149
+ size_mb=pl.col("raw_json")
150
+ .str.json_path_match("$.size_mb")
151
+ .str.to_decimal(),
152
+ )
153
+ return
154
+
155
+
156
+ @app.cell
157
+ def _(mo):
158
+ mo.md(
159
+ r"""
160
+ ## 📊 Dataset Overview
161
+
162
+ Now that we've gotten our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module.
163
+
164
+ At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member.
165
+
166
+ Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples.
167
+ """
168
+ )
169
+ return
170
+
171
+
172
+ @app.cell(hide_code=True)
173
+ def _(pl):
174
+ def list_members(expr, namespace) -> list[dict]:
175
+ """Iterates through the attributes of `expr` and returns their metadata"""
176
+ members = []
177
+ for attrname in expr.__dir__():
178
+ is_namespace = attrname in pl.Expr._accessors
179
+ is_private = attrname.startswith("_")
180
+ if is_namespace or is_private:
181
+ continue
182
+
183
+ attr = getattr(expr, attrname)
184
+ members.append(
185
+ {
186
+ "namespace": namespace,
187
+ "member": attrname,
188
+ "docstring": attr.__doc__,
189
+ }
190
+ )
191
+ return members
192
+
193
+
194
+ def list_expr_meta() -> list[dict]:
195
+ # Dummy expression instance to 'crawl'
196
+ expr = pl.lit("")
197
+ root_members = list_members(expr, "root")
198
+ namespaced_members: list[list[dict]] = [
199
+ list_members(getattr(expr, namespace), namespace)
200
+ for namespace in pl.Expr._accessors
201
+ ]
202
+ return sum(namespaced_members, root_members)
203
+
204
+
205
+ expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member')
206
+ expressions_df
207
+ return expressions_df, list_expr_meta, list_members
208
+
209
+
210
+ @app.cell
211
+ def _(mo):
212
+ mo.md(r"""As the following visualization shows, `str` is one of the richest Polars expression namespaces, with dozens of functions in it.""")
213
+ return
214
+
215
+
216
+ @app.cell(hide_code=True)
217
+ def _(alt, expressions_df):
218
+ expressions_df.plot.bar(
219
+ x=alt.X("count(member):Q", title='Count of Expressions'),
220
+ y=alt.Y("namespace:N", title='Namespace').sort("-x"),
221
+ )
222
+ return
223
+
224
+
225
+ @app.cell
226
+ def _(mo):
227
+ mo.md(
228
+ r"""
229
+ ## 📏 Length Calculation
230
+
231
+ A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters it contains; however, in certain scenarios it is also useful to know how many bytes are required to store the text.
232
+
233
+ The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations.
234
+
235
+ Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of.
236
+ """
237
+ )
238
+ return
239
+
240
+
241
+ @app.cell
242
+ def _(expressions_df, pl):
243
+ docstring_length_df = expressions_df.with_columns(
244
+ docstring_len_chars=pl.col("docstring").str.len_chars(),
245
+ docstring_len_bytes=pl.col("docstring").str.len_bytes(),
246
+ )
247
+ docstring_length_df
248
+ return (docstring_length_df,)
249
+
250
+
251
+ @app.cell
252
+ def _(mo):
253
+ mo.md(r"""As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is because the docstrings include characters that take more than a single byte to encode in UTF-8, such as "╞", which is used to draw the separator between a dataframe's header and body.""")
254
+ return
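The gap between the two measures is easy to reproduce in plain Python, where `len` counts characters and encoding to UTF-8 exposes the byte count. The sample strings are chosen for illustration only.

```python
separator = "╞"   # box-drawing character found in dataframe previews
ascii_text = "shape"

# Character counts: one per code point.
char_counts = (len(ascii_text), len(separator))

# Byte counts: ASCII needs one byte per character, "╞" needs three in UTF-8.
byte_counts = (len(ascii_text.encode("utf-8")), len(separator.encode("utf-8")))

print(char_counts, byte_counts)
```

A docstring made up of ASCII only would sit exactly on the diagonal of the scatterplot; every box-drawing character pushes it two bytes above.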
255
+
256
+
257
+ @app.cell
258
+ def _(alt, docstring_length_df):
259
+ docstring_length_df.plot.point(
260
+ x=alt.X('docstring_len_chars', title='Docstring Length (Chars)'),
261
+ y=alt.Y('docstring_len_bytes', title='Docstring Length (Bytes)'),
262
+ tooltip=['namespace', 'member', 'docstring_len_chars', 'docstring_len_bytes'],
263
+ )
264
+ return
265
+
266
+
267
+ @app.cell
268
+ def _(mo):
269
+ mo.md(
270
+ r"""
271
+ ## 🔠 Case Conversion
272
+
273
+ Another frequent string transformation is lowercasing, uppercasing, and titlecasing. We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html) and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so.
274
+ """
275
+ )
276
+ return
277
+
278
+
279
+ @app.cell
280
+ def _(expressions_df, pl):
281
+ expressions_df.select(
282
+ member_lower=pl.col('member').str.to_lowercase(),
283
+ member_upper=pl.col('member').str.to_uppercase(),
284
+ member_title=pl.col('member').str.to_titlecase(),
285
+ )
286
+ return
287
+
288
+
289
+ @app.cell
290
+ def _(mo):
291
+ mo.md(
292
+ r"""
293
+ ## ➕ Padding
294
+
295
+ Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`.
296
+
297
+ In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider.
298
+ """
299
+ )
300
+ return
301
+
302
+
303
+ @app.cell(hide_code=True)
304
+ def _(mo):
305
+ padding = mo.ui.slider(0, 16, step=1, value=8, label="Padding Size")
306
+ return (padding,)
307
+
308
+
309
+ @app.cell
310
+ def _(expressions_df, padding, pl):
311
+ padded_df = expressions_df.select("namespace").unique().select(
312
+ "namespace",
313
+ namespace_front_padded=pl.col("namespace").str.pad_start(padding.value, "_"),
314
+ namespace_back_padded=pl.col("namespace").str.pad_end(padding.value, "_"),
315
+ namespace_zfilled=pl.col("namespace").str.zfill(padding.value),
316
+ )
317
+ return (padded_df,)
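The three padding variants have direct counterparts among Python's built-in string methods, which makes for a compact mental model. This is a plain-Python sketch for a single value, assuming a target width of 8; it is not the Polars implementation.

```python
namespace = "str"
width = 8

front_padded = namespace.rjust(width, "_")  # counterpart of pad_start(width, "_")
back_padded = namespace.ljust(width, "_")   # counterpart of pad_end(width, "_")
zero_filled = namespace.zfill(width)        # counterpart of zfill(width)

# A target width shorter than the string leaves the string unchanged.
unchanged = "struct".rjust(3, "_")

print(front_padded, back_padded, zero_filled, unchanged)
```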
318
+
319
+
320
+ @app.cell(hide_code=True)
321
+ def _(mo, padded_df, padding):
322
+ mo.vstack([
323
+ padding,
324
+ padded_df,
325
+ ])
326
+ return
327
+
328
+
329
+ @app.cell
330
+ def _(mo):
331
+ mo.md(
332
+ r"""
333
+ ## 🔄 Replacing
334
+
335
+ Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html).
336
+
337
+ As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for.
338
+ """
339
+ )
340
+ return
341
+
342
+
343
+ @app.cell
344
+ def _(expressions_df, pl):
345
+ expressions_df.select(
346
+ "member",
347
+ member_kebab_case_partial=pl.col("member").str.replace("_", "-"),
348
+ member_kebab_case=pl.col("member").str.replace_all("_", "-"),
349
+ ).sort(pl.col("member").str.len_chars(), descending=True)
350
+ return
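The difference between the two expressions can be reproduced on a single value with Python's `str.replace`, whose optional count argument limits the number of substitutions. Keep in mind that the Polars versions interpret the pattern as a regular expression unless `literal=True` is passed, which makes no difference for a plain underscore.

```python
member = "json_path_match"

# Only the first underscore is replaced, as with `replace`.
partial = member.replace("_", "-", 1)

# Every underscore is replaced, as with `replace_all`.
full = member.replace("_", "-")

print(partial, full)
```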
351
+
352
+
353
+ @app.cell
354
+ def _(mo):
355
+ mo.md(
356
+ r"""
357
+ A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance.
358
+
359
+ In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression.
360
+ """
361
+ )
362
+ return
363
+
364
+
365
+ @app.cell
366
+ def _(expressions_df, pl):
367
+ expressions_df.select(
368
+ "member",
369
+ member_modified=pl.col("member").str.replace_many(
370
+ {
371
+ "min": "minimum",
372
+ "max": "maximum",
373
+ }
374
+ ),
375
+ )
376
+ return
377
+
378
+
379
+ @app.cell
380
+ def _(mo):
381
+ mo.md(
382
+ r"""
383
+ ## 🔍 Searching & Matching
384
+
385
+ A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern.
386
+
387
+ Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check.
388
+ """
389
+ )
390
+ return
391
+
392
+
393
+ @app.cell
394
+ def _(expressions_df, pl):
395
+ expressions_df.select(
396
+ "namespace",
397
+ "member",
398
+ is_converter=pl.col("member").str.starts_with("to_"),
399
+ ).sort(-pl.col("is_converter").cast(pl.Int8))
400
+ return
401
+
402
+
403
+ @app.cell
404
+ def _(mo):
405
+ mo.md(
406
+ r"""
407
+ As you have gained familiarity with the expression API throughout this course, you might have noticed that some members end with an underscore, such as `or_`, because the name without the underscore is a reserved Python keyword.
408
+
409
+ Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords.
410
+ """
411
+ )
412
+ return
413
+
414
+
415
+ @app.cell
416
+ def _(expressions_df, pl):
417
+ expressions_df.select(
418
+ "namespace",
419
+ "member",
420
+ is_escaped_keyword=pl.col("member").str.ends_with("_"),
421
+ ).sort(-pl.col("is_escaped_keyword").cast(pl.Int8))
422
+ return
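A trailing underscore alone is only a heuristic. The standard library's `keyword` module lets us confirm that the name without the underscore really is reserved. The sketch below uses a small hand-picked sample of member names rather than the full `expressions_df`.

```python
import keyword

# Hand-picked sample of member names, for illustration only.
members = ["and_", "or_", "not_", "to_decimal", "len_chars"]

# Ends with "_" AND the remainder is a reserved Python keyword.
escaped_keywords = [
    name for name in members
    if name.endswith("_") and keyword.iskeyword(name[:-1])
]

# The prefix check from the previous example, in plain Python.
converters = [name for name in members if name.startswith("to_")]

print(escaped_keywords, converters)
```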
423
+
424
+
425
+ @app.cell
426
+ def _(mo):
427
+ mo.md(
428
+ r"""
429
+ Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members.
430
+
431
+ As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring.
432
+ """
433
+ )
434
+ return
435
+
436
+
437
+ @app.cell
438
+ def _(expressions_df, pl):
439
+ expressions_df.with_columns(
440
+ is_deprecated=pl.col('docstring').str.contains('.. deprecated', literal=True),
441
+ has_parameters=pl.col('docstring').str.contains('Parameters'),
442
+ has_examples=pl.col('docstring').str.contains('Examples'),
443
+ has_related_members=pl.col('docstring').str.contains('See Also'),
444
+ has_url=pl.col('docstring').str.contains('https?://'),
445
+ )
446
+ return
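The `literal=True` flag on the first `contains` call matters because `.` is a regex metacharacter. The plain-Python sketch below, using an invented sentence, shows how the same needle behaves when read as a regular expression versus as a literal substring.

```python
import re

needle = ".. deprecated"
text = "no deprecated members here"

# Regex reading: each "." matches any character, so "no deprecated" matches.
regex_hit = re.search(needle, text) is not None

# Literal reading: the two dots must actually be present in the text.
literal_hit = needle in text

print(regex_hit, literal_hit)
```

Without the flag, any docstring that merely mentions the word after two arbitrary characters would be flagged as deprecated.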
447
+
448
+
449
+ @app.cell
450
+ def _(mo):
451
+ mo.md(r"""For scenarios where we want to check for any one of several substrings at once, we can use the [`contains_any`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains_any.html) expression.""")
452
+ return
453
+
454
+
455
+ @app.cell
456
+ def _(expressions_df, pl):
457
+ expressions_df.with_columns(
458
+ has_reference=pl.col('docstring').str.contains_any(['See Also', 'https://'])
459
+ )
460
+ return
461
+
462
+
463
+ @app.cell
464
+ def _(mo):
465
+ mo.md(
466
+ r"""
467
+ From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments are going on within each of these examples, right? That's not as simple as checking for a pre-defined literal string containment though, because variables can have arbitrary names - any valid Python identifier is allowed. While the `contains` function supports checking for regular expressions instead of literal strings too, it would not suffice for this exercise because it only tells us whether there is at least a single occurrence of the sought pattern rather than telling us the exact number of matches.
468
+
469
+ Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars.
470
+
471
+ In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`:
472
+
473
+ - `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier).
474
+ - `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores.
475
+ - ` = ` matches a space, equals sign, and space (indicating assignment).
476
+
477
+ This finds variable assignments like `x = ` or `df_result = ` in docstrings.
478
+ """
479
+ )
480
+ return
481
+
482
+
483
+ @app.cell
484
+ def _(expressions_df, pl):
485
+ expressions_df.with_columns(
486
+ variable_assignment_count=pl.col('docstring').str.count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = '),
487
+ )
488
+ return
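To see what the pattern does and does not count, here is the same regular expression applied with Python's `re` module to a made-up docstring fragment. For a pattern this simple the Rust `regex` crate and Python's `re` behave the same, but that equivalence is an assumption that does not hold for every pattern.

```python
import re

ASSIGNMENT = r"[a-zA-Z_][a-zA-Z0-9_]* = "

# Invented docstring fragment with two assignments and one comparison.
docstring = (
    ">>> df = pl.DataFrame({'a': [1, 2]})\n"
    ">>> out = df.select(pl.col('a').sum())\n"
    ">>> out.item() == 3\n"
)

assignments = re.findall(ASSIGNMENT, docstring)
assignment_count = len(assignments)

print(assignments, assignment_count)
```

The comparison on the last line is not counted because `==` never yields the sequence space, equals sign, space. Keyword arguments written with spaces around `=` would be counted, so the result is an approximation.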
489
+
490
+
491
+ @app.cell
492
+ def _(mo):
493
+ mo.md(r"""A related task is to *find* the first index at which a particular pattern occurs, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring, identified by the Python shell prompt `">>>"`.""")
494
+ return
495
+
496
+
497
+ @app.cell
498
+ def _(expressions_df, pl):
499
+ expressions_df.with_columns(
500
+ code_example_start=pl.col('docstring').str.find('>>>'),
501
+ )
502
+ return
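One subtlety is worth keeping in mind before using the result for slicing: the Polars documentation describes the value returned by `find` as a byte offset (and `null` when nothing matches), which can differ from the character index whenever multi-byte characters precede the match. This is worth verifying against your installed version. The plain-Python sketch below, on an invented string, shows the two notions side by side.

```python
docstring = "Header ╞ body\n>>> pl.col('a')"

# Index counted in characters.
char_index = docstring.find(">>>")

# Index counted in UTF-8 bytes; "╞" contributes three bytes instead of one.
byte_offset = docstring.encode("utf-8").find(b">>>")

# Python reports a missing pattern as -1 rather than a null value.
missing = "no example here".find(">>>")

print(char_index, byte_offset, missing)
```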
503
+
504
+
505
+ @app.cell
506
+ def _(mo):
507
+ mo.md(
508
+ r"""
509
+ ## ✂️ Slicing and Substrings
510
+
511
+ Sometimes we are only interested in a particular substring. We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices.
512
+ """
513
+ )
514
+ return
515
+
516
+
517
+ @app.cell
518
+ def _(mo):
519
+ slice = mo.ui.slider(1, 50, step=1, value=25, label="Slice Size")
520
+ return (slice,)
521
+
522
+
523
+ @app.cell
524
+ def _(expressions_df, pl, slice):
525
+ sliced_df = expressions_df.select(
526
+ # Original string
527
+ "docstring",
528
+ # First `slice.value` chars (25 with the default slider value)
529
+ docstring_head=pl.col("docstring").str.head(slice.value),
530
+ # `2 * slice.value` chars starting at offset `slice.value` (50 chars after the first 25 by default)
531
+ docstring_slice=pl.col("docstring").str.slice(slice.value, 2*slice.value),
532
+ # Last `slice.value` chars (25 with the default slider value)
533
+ docstring_tail=pl.col("docstring").str.tail(slice.value),
534
+ )
535
+ return (sliced_df,)
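Note that `slice` takes an offset and a *length*, not a start and an end index. In Python slicing terms, the three columns above correspond to the following sketch, using a made-up docstring and `n = 5` in place of the slider value.

```python
docstring = "Compute the absolute value of every element."
n = 5

head = docstring[:n]                # counterpart of str.head(n)
middle = docstring[n : n + 2 * n]   # counterpart of str.slice(n, 2 * n)
tail = docstring[-n:]               # counterpart of str.tail(n)

print(repr(head), repr(middle), repr(tail))
```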
536
+
537
+
538
+ @app.cell
539
+ def _(mo, slice, sliced_df):
540
+ mo.vstack([
541
+ slice,
542
+ sliced_df,
543
+ ])
544
+ return
545
+
546
+
547
+ @app.cell
548
+ def _(mo):
549
+ mo.md(
550
+ r"""
551
+ ## ➗ Splitting
552
+
553
+ Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing.
554
+
555
+ The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this.
556
+
557
+ The primary difference between these string splitting utilities is that `split` produces a list of variable length based on the number of resulting segments, `splitn` returns a struct of exactly `n` fields (the last field holds the unsplit remainder, and missing fields are `null`), while `split_exact` performs `n` splits and returns a struct of `n + 1` fields.
558
+ """
559
+ )
560
+ return
561
+
562
+
563
+ @app.cell
564
+ def _(expressions_df, pl):
565
+ expressions_df.select(
566
+ 'member',
567
+ member_name_parts=pl.col('member').str.split('_'),
568
+ member_name_parts_n=pl.col('member').str.splitn('_', n=2),
569
+ member_name_parts_exact=pl.col('member').str.split_exact('_', n=2),
570
+ )
571
+ return
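The difference between `split` and `splitn` can be mimicked in plain Python with the `maxsplit` argument of `str.split`. The helper `splitn_sketch` below is hypothetical and only approximates the documented behaviour for a single value, padding with `None` where Polars would produce `null`.

```python
def splitn_sketch(text: str, by: str, n: int) -> list:
    """Hypothetical helper: at most n pieces, padded with None to length n."""
    parts = text.split(by, n - 1)  # n - 1 splits give at most n pieces
    return parts + [None] * (n - len(parts))


# Variable-length result, one entry per segment.
all_parts = "json_path_match".split("_")

# Fixed-length result: the last piece keeps the unsplit remainder.
two_parts = splitn_sketch("json_path_match", "_", 2)

# Fewer segments than requested: the missing piece is padded.
padded = splitn_sketch("len", "_", 2)

print(all_parts, two_parts, padded)
```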
572
+
573
+
574
+ @app.cell
575
+ def _(mo):
576
+ mo.md(r"""As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces.""")
577
+ return
578
+
579
+
580
+ @app.cell(hide_code=True)
581
+ def _(alt, expressions_df, pl, random):
582
+ wc_df = (
583
+ expressions_df.select(pl.col("member").str.split("_"))
584
+ .explode("member")
585
+ .group_by("member")
586
+ .agg(pl.len())
587
+ # Generating random x and y coordinates to distribute the words in the 2D space
588
+ .with_columns(
589
+ x=pl.col("member").map_elements(
590
+ lambda e: random.randint(0, 32),
591
+ return_dtype=pl.UInt8,
592
+ ),
593
+ y=pl.col("member").map_elements(
594
+ lambda e: random.randint(0, 16),
595
+ return_dtype=pl.UInt8,
596
+ ),
597
+ )
598
+ )
599
+
600
+ alt.Chart(wc_df).mark_text(baseline="middle").encode(
601
+ x=alt.X("x:O", axis=None),
602
+ y=alt.Y("y:O", axis=None),
603
+ text="member:N",
604
+ color=alt.Color("len:Q", scale=alt.Scale(scheme="bluepurple")),
605
+ size=alt.Size("len:Q", legend=None),
606
+ tooltip=["member", "len"],
607
+ ).configure_view(strokeWidth=0)
608
+ return (wc_df,)
609
+
610
+
611
+ @app.cell
612
+ def _(mo):
613
+ mo.md(
614
+ r"""
615
+ ## 🔗 Concatenation & Joining
616
+
617
+ Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one.
618
+
619
+ The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``.
620
+ """
621
+ )
622
+ return
623
+
624
+
625
+ @app.cell
626
+ def _(expressions_df, pl):
627
+ descriptions_df = expressions_df.select(
628
+ description=pl.concat_str(
629
+ [
630
+ pl.lit("- Expression "),
631
+ pl.lit("`"),
632
+ "member",
633
+ pl.lit("`"),
634
+ pl.lit(" belongs to namespace "),
635
+ pl.lit("`"),
636
+ "namespace",
637
+ pl.lit("`"),
638
+ ],
639
+ )
640
+ )
641
+ descriptions_df
642
+ return (descriptions_df,)
643
+
644
+
645
+ @app.cell
646
+ def _(mo):
647
+ mo.md(
648
+ r"""
649
+ Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line.
650
+
651
+ We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.join.html) expression to do so.
652
+ """
653
+ )
654
+ return
655
+
656
+
657
+ @app.cell
658
+ def _(descriptions_df, pl):
659
+ descriptions_df.select(pl.col('description').str.join('\n'))
660
+ return
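The two directions of concatenation map onto familiar Python idioms: an f-string per row for the horizontal step and `str.join` across rows for the vertical one. The sketch below uses two hand-picked rows instead of the full dataframe.

```python
# Two sample (member, namespace) rows, for illustration only.
rows = [("len_chars", "str"), ("year", "dt")]

# Horizontal: combine the fields of one row into a single string.
descriptions = [
    f"- Expression `{member}` belongs to namespace `{namespace}`"
    for member, namespace in rows
]

# Vertical: collapse all rows into one newline-separated string.
markdown = "\n".join(descriptions)

print(markdown)
```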
661
+
662
+
663
+ @app.cell(hide_code=True)
664
+ def _(descriptions_df, mo, pl):
665
+ mo.md(f"""In fact, since the string we constructed is valid markdown, we can render it directly using marimo's `mo.md` utility!
666
+
667
+ ---
668
+
669
+ {descriptions_df.select(pl.col('description').str.join('\n')).to_numpy().squeeze().tolist()}
670
+ """)
671
+ return
672
+
673
+
674
+ @app.cell
675
+ def _(mo):
676
+ mo.md(
677
+ r"""
678
+ ## 🔍 Pattern-based Extraction
679
+
680
+ In the vast majority of the cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content.
681
+
682
+ In the example below that's exactly what we do. We scan the `docstring` of each API member and extract URLs from it with [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html), using a simple regular expression that matches http and https URLs.
683
+
684
+ Note that `extract` stops after the first match and returns a scalar result (or `null` if there was no match), while `extract_all` returns a (potentially empty) list of matches.
685
+ """
686
+ )
687
+ return
688
+
689
+
690
+ @app.cell
691
+ def _(expressions_df, pl):
692
+ url_pattern = r'(https?://[^\s>]+)'
693
+ expressions_df.with_columns(
694
+ "docstring",
695
+ url_match=pl.col('docstring').str.extract(url_pattern),
696
+ url_matches=pl.col('docstring').str.extract_all(url_pattern),
697
+ ).filter(pl.col('url_match').is_not_null())
698
+ return (url_pattern,)
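The scalar-versus-list distinction mirrors `re.search` and `re.findall` in Python. The sketch below applies the same `url_pattern` to an invented sentence; it illustrates the matching behaviour only and assumes the pattern means the same in Python's `re` as in the Rust `regex` crate, which is the case for this simple expression.

```python
import re

url_pattern = r"(https?://[^\s>]+)"
docstring = "See <https://docs.rs/regex/latest/regex/> and http://example.com for details."

# Counterpart of `extract`: the first match, or None when there is none.
first = re.search(url_pattern, docstring)
first_url = first.group(1) if first else None

# Counterpart of `extract_all`: every match, possibly an empty list.
all_urls = re.findall(url_pattern, docstring)
no_urls = re.findall(url_pattern, "no links in here")

print(first_url, all_urls, no_urls)
```

The `[^\s>]` character class is what stops the first URL before the closing angle bracket.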
699
+
700
+
701
+ @app.cell
702
+ def _(mo):
703
+ mo.md(
704
+ r"""
705
+ Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way.
706
+
707
+ [`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that.
708
+
709
+ Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two named capture groups, `height` and `width`, and pass it to `extract_groups`. After execution, each `docstring` yields a struct with one field per group: fully structured data we can further process downstream!
710
+ """
711
+ )
712
+ return
713
+
714
+
715
+ @app.cell
716
+ def _(expressions_df, pl):
717
+ expressions_df.with_columns(
718
+ example_df_shape=pl.col('docstring').str.extract_groups(r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"),
719
+ )
720
+ return
721
+
722
+
723
+ @app.cell
724
+ def _(mo):
725
+ mo.md(
726
+ r"""
727
+ ## 🧹 Stripping
728
+
729
+ Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text. [`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this.
730
+
731
+ All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us.
732
+ """
733
+ )
734
+ return
735
+
736
+
737
+ @app.cell
738
+ def _(expressions_df, pl):
739
+ expressions_df.select(
740
+ "member",
741
+ member_front_stripped=pl.col("member").str.strip_chars_start("a"),
742
+ member_back_stripped=pl.col("member").str.strip_chars_end("n"),
743
+ member_fully_stripped=pl.col("member").str.strip_chars("na"),
744
+ )
745
+ return
746
+
747
+
748
+ @app.cell
749
+ def _(mo):
750
+ mo.md(
751
+ r"""
752
+ Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set.
753
+
754
+ That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html).
755
+
756
+ Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name.
757
+ """
758
+ )
759
+ return
760
+
761
+
762
+ @app.cell
763
+ def _(expressions_df, pl):
764
+ expressions_df.select(
765
+ "member",
766
+ member_prefix_stripped=pl.col("member").str.strip_prefix("to_"),
767
+ member_suffix_stripped=pl.col("member").str.strip_suffix("_with"),
768
+ ).slice(20)
769
+ return
770
+
771
+
772
+ @app.cell
773
+ def _(mo):
774
+ mo.md(
775
+ r"""
776
+ ## 🔑 Encoding & Decoding
777
+
778
+ Should you find yourself in need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression.
779
+ """
780
+ )
781
+ return
782
+
783
+
784
+ @app.cell
785
+ def _(expressions_df, pl):
786
+ encoded_df = expressions_df.select(
787
+ "member",
788
+ member_base64=pl.col('member').str.encode('base64'),
789
+ member_hex=pl.col('member').str.encode('hex'),
790
+ )
791
+ encoded_df
792
+ return (encoded_df,)
793
+
794
+
795
+ @app.cell
796
+ def _(mo):
797
+ mo.md(r"""And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression.""")
798
+ return
799
+
800
+
801
+ @app.cell
802
+ def _(encoded_df, pl):
803
+ encoded_df.with_columns(
804
+ member_base64_decoded=pl.col('member_base64').str.decode('base64').cast(pl.String),
805
+ member_hex_decoded=pl.col('member_hex').str.decode('hex').cast(pl.String),
806
+ )
807
+ return
808
+
809
+
810
+ @app.cell
811
+ def _(mo):
812
+ mo.md(
813
+ r"""
814
+ ## 🚀 Application: Dynamic Execution of Polars Examples
815
+
816
+ Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored.
817
+
818
+ We use string expressions to extract the raw Python source code of the examples from the docstrings. The interactive Marimo environment then lets us select an expression from a searchable dropdown and edit its example in a fully functional code editor, whose output is rendered with Marimo's rich display utilities.
819
+
820
+ In other words, we will use Polars to execute Polars. ❄️ How cool is that?
821
+ """
822
+ )
823
+ return
824
+
825
+
826
+ @app.cell(hide_code=True)
827
+ def _(
828
+ example_editor,
829
+ execution_result,
830
+ expression,
831
+ expression_description,
832
+ expression_docs_link,
833
+ mo,
834
+ ):
835
+ mo.vstack(
836
+ [
837
+ expression,
838
+ mo.hstack([expression_description, expression_docs_link]),
839
+ example_editor,
840
+ execution_result,
841
+ ]
842
+ )
843
+ return
844
+
845
+
846
+ @app.cell(hide_code=True)
847
+ def _(mo, selected_expression_record):
848
+ expression_description = mo.md(selected_expression_record["description"])
849
+ expression_docs_link = mo.md(
850
+ f"🐻‍❄️ [Official Docs](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.{selected_expression_record['expr']}.html)"
851
+ )
852
+ return expression_description, expression_docs_link
853
+
854
+
855
+ @app.cell(hide_code=True)
856
+ def _(example_editor, execute_code):
857
+ execution_result = execute_code(example_editor.value)
858
+ return (execution_result,)
859
+
860
+
861
+ @app.cell(hide_code=True)
862
+ def _(code_df, mo):
863
+ expression = mo.ui.dropdown(code_df.get_column('expr'), value='arr.all', searchable=True)
864
+ return (expression,)
865
+
866
+
867
+ @app.cell(hide_code=True)
868
+ def _(code_df, expression):
869
+ selected_expression_record = code_df.filter(expr=expression.value).to_dicts()[0]
870
+ return (selected_expression_record,)
871
+
872
+
873
+ @app.cell(hide_code=True)
874
+ def _(mo, selected_expression_record):
875
+ example_editor = mo.ui.code_editor(value=selected_expression_record["code"])
876
+ return (example_editor,)
877
+
878
+
879
+ @app.cell(hide_code=True)
880
+ def _(expressions_df, pl):
881
+ code_df = (
882
+ expressions_df.select(
883
+ expr=pl.when(pl.col("namespace") == "root")
884
+ .then("member")
885
+ .otherwise(pl.concat_str(["namespace", "member"], separator=".")),
886
+ description=pl.col("docstring")
887
+ .str.split("\n\n")
888
+ .list.get(0)
889
+ .str.slice(9),
890
+ docstring_lines=pl.col("docstring").str.split("\n"),
891
+ )
892
+ .with_row_index()
893
+ .explode("docstring_lines")
894
+ .rename({"docstring_lines": "docstring_line"})
895
+ .with_columns(pl.col("docstring_line").str.strip_chars(" "))
896
+ .filter(pl.col("docstring_line").str.starts_with(">>> ") | pl.col("docstring_line").str.starts_with("... "))
897
+ .with_columns(pl.col("docstring_line").str.slice(4))
898
+ .group_by(pl.exclude("docstring_line"), maintain_order=True)
899
+ .agg(code=pl.col("docstring_line").str.join("\n"))
900
+ .drop("index")
901
+ )
902
+ return (code_df,)
903
+
904
+
905
+ @app.cell(hide_code=True)
906
+ def _():
907
+ def execute_code(code: str):
908
+ import ast
909
+
910
+ # Create a new local namespace for execution
911
+ local_namespace = {}
912
+
913
+ # Parse the code into an AST to identify the last expression
914
+ parsed_code = ast.parse(code)
915
+
916
+ # Check if there's at least one statement
917
+ if not parsed_code.body:
918
+ return None
919
+
920
+ # If the last statement is an expression, we'll need to get its value
921
+ last_is_expr = isinstance(parsed_code.body[-1], ast.Expr)
922
+
923
+ if last_is_expr:
924
+ # Split the code: everything except the last statement, and the last statement
925
+ last_expr = ast.Expression(parsed_code.body[-1].value)
926
+
927
+ # Remove the last statement from the parsed code
928
+ parsed_code.body = parsed_code.body[:-1]
929
+
930
+ # Execute everything except the last statement
931
+ if parsed_code.body:
932
+ exec(
933
+ compile(parsed_code, "<string>", "exec"),
934
+ globals(),
935
+ local_namespace,
936
+ )
937
+
938
+ # Execute the last statement and get its value
939
+ result = eval(
940
+ compile(last_expr, "<string>", "eval"), globals(), local_namespace
941
+ )
942
+ return result
943
+ else:
944
+ # If the last statement is not an expression (e.g., an assignment),
945
+ # execute the entire code and return None
946
+ exec(code, globals(), local_namespace)
947
+ return None
948
+ return (execute_code,)
949
+
950
+
951
+ @app.cell(hide_code=True)
952
+ def _():
953
+ import polars as pl
954
+ import marimo as mo
955
+ import altair as alt
956
+ import random
957
+
958
+ random.seed(14)
959
+ return alt, mo, pl, random
960
+
961
+
962
+ if __name__ == "__main__":
963
+ app.run()