Joram Mutenge committed on
Commit
4c17152
·
1 Parent(s): cbef791

notebook on basic operations in polars

Files changed (1)
  1. polars/04_basic_operations.py +623 -0
polars/04_basic_operations.py ADDED
@@ -0,0 +1,623 @@
1
+ import marimo
2
+
3
+ __generated_with = "0.11.13"
4
+ app = marimo.App(width="medium")
5
+
6
+
7
+ @app.cell
8
+ def _():
9
+ import marimo as mo
10
+ return (mo,)
11
+
12
+
13
+ @app.cell(hide_code=True)
14
+ def _(mo):
15
+ mo.md(
16
+ r"""
17
+ # Basic operations on data
18
+ _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
19
+
20
+ In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new).
21
+ """
22
+ )
23
+ return
24
+
25
+
26
+ @app.cell
27
+ def _():
28
+ import polars as pl
29
+
30
+ df = pl.DataFrame(
31
+ {
32
+ "software": [
33
+ "Lotus-123",
34
+ "WordStar",
35
+ "dBase III",
36
+ "VisiCalc",
37
+ "WinZip",
38
+ "MS-DOS",
39
+ "HyperCard",
40
+ "WordPerfect",
41
+ "Excel",
42
+ "Photoshop",
43
+ "Visual Studio",
44
+ "Slack",
45
+ "Zoom",
46
+ "Notion",
47
+ "Figma",
48
+ "Spotify",
49
+ "VSCode",
50
+ "Docker",
51
+ ],
52
+ "users": [
53
+ 10000,
54
+ 4500,
55
+ 2500,
56
+ 3000,
57
+ 1800,
58
+ 17000,
59
+ 2200,
60
+ 1900,
61
+ 500000,
62
+ 12000000,
63
+ 1500000,
64
+ 3000000,
65
+ 4000000,
66
+ 2000000,
67
+ 2500000,
68
+ 4500000,
69
+ 6000000,
70
+ 3500000,
71
+ ],
72
+ "category": ["Vintage"] * 8 + ["Modern"] * 10,
73
+ "year": [
74
+ 1985,
75
+ 1980,
76
+ 1984,
77
+ 1979,
78
+ 1991,
79
+ 1981,
80
+ 1987,
81
+ 1982,
82
+ 1987,
83
+ 1990,
84
+ 1997,
85
+ 2013,
86
+ 2011,
87
+ 2016,
88
+ 2016,
89
+ 2008,
90
+ 2015,
91
+ 2013,
92
+ ],
93
+ }
94
+ )
95
+
96
+ df
97
+ return df, pl
98
+
99
+
100
+ @app.cell(hide_code=True)
101
+ def _(mo):
102
+ mo.md(
103
+ r"""
104
+ ## Arithmetic
105
+ ### Addition
106
+ Let's add 42 users to each piece of software. This means adding 42 to each value under **users**.
107
+ """
108
+ )
109
+ return
110
+
111
+
112
+ @app.cell
113
+ def _(df, pl):
114
+ df.with_columns(pl.col("users") + 42)
115
+ return
116
+
117
+
118
+ @app.cell(hide_code=True)
119
+ def _(mo):
120
+ mo.md(r"""Another way to perform the above operation is to use the built-in `add` method.""")
121
+ return
122
+
123
+
124
+ @app.cell
125
+ def _(df, pl):
126
+ df.with_columns(pl.col("users").add(42))
127
+ return
128
+
129
+
130
+ @app.cell(hide_code=True)
131
+ def _(mo):
132
+ mo.md(
133
+ r"""
134
+ ### Subtraction
135
+ Let's subtract 42 users from each piece of software.
136
+ """
137
+ )
138
+ return
139
+
140
+
141
+ @app.cell
142
+ def _(df, pl):
143
+ df.with_columns(pl.col("users") - 42)
144
+ return
145
+
146
+
147
+ @app.cell(hide_code=True)
148
+ def _(mo):
149
+ mo.md(r"""Alternatively, you could subtract like this:""")
150
+ return
151
+
152
+
153
+ @app.cell
154
+ def _(df, pl):
155
+ df.with_columns(pl.col("users").sub(42))
156
+ return
157
+
158
+
159
+ @app.cell(hide_code=True)
160
+ def _(mo):
161
+ mo.md(
162
+ r"""
163
+ ### Division
164
+ Suppose the **users** values are inflated. We can reduce them by dividing by 1000. Here's how to do it.
165
+ """
166
+ )
167
+ return
168
+
169
+
170
+ @app.cell
171
+ def _(df, pl):
172
+ df.with_columns(pl.col("users") / 1000)
173
+ return
174
+
175
+
176
+ @app.cell(hide_code=True)
177
+ def _(mo):
178
+ mo.md(r"""Or we could do it with the built-in `truediv` method.""")
179
+ return
180
+
181
+
182
+ @app.cell
183
+ def _(df, pl):
184
+ df.with_columns(pl.col("users").truediv(1000))
185
+ return
186
+
187
+
188
+ @app.cell(hide_code=True)
189
+ def _(mo):
190
+ mo.md(r"""If we didn't care about the remainder after division (i.e., we wanted to drop the digits after the decimal point), we could do it like this.""")
191
+ return
192
+
193
+
194
+ @app.cell
195
+ def _(df, pl):
196
+ df.with_columns(pl.col("users").floordiv(1000))
197
+ return
198
+
199
+
200
+ @app.cell(hide_code=True)
201
+ def _(mo):
202
+ mo.md(
203
+ r"""
204
+ ### Multiplication
205
+ Let's pretend the **users** values are deflated and increase them by multiplying by 100.
206
+ """
207
+ )
208
+ return
209
+
210
+
211
+ @app.cell
212
+ def _(df, pl):
213
+ (df.with_columns(pl.col("users") * 100))
214
+ return
215
+
216
+
217
+ @app.cell(hide_code=True)
218
+ def _(mo):
219
+ mo.md(r"""Polars also has a built-in method, `mul`, for multiplication.""")
220
+ return
221
+
222
+
223
+ @app.cell
224
+ def _(df, pl):
225
+ df.with_columns(pl.col("users").mul(100))
226
+ return
227
+
228
+
229
+ @app.cell(hide_code=True)
230
+ def _(mo):
231
+ mo.md(r"""So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000.""")
232
+ return
233
+
234
+
235
+ @app.cell
236
+ def _(df, pl):
237
+ (df.with_columns(decade=pl.col("year").floordiv(10).mul(10)))
238
+ return
239
+
240
+
241
+ @app.cell(hide_code=True)
242
+ def _(mo):
243
+ mo.md(r"""We could also name the new column with `alias`, as follows:""")
244
+ return
245
+
246
+
247
+ @app.cell
248
+ def _(df, pl):
249
+ df.with_columns((pl.col("year").floordiv(10).mul(10)).alias("decade"))
250
+ return
251
+
252
+
253
+ @app.cell(hide_code=True)
254
+ def _(mo):
255
+ mo.md(
256
+ r"""
257
+ **Tip**
258
+ Polars encourages you to write your operations as a chain. This lets you take advantage of the query optimizer when the chain runs lazily. We'll build upon the above code as a chain.
259
+
260
+ ## Comparison
261
+ ### Equal
262
+ Let's get all the software categorized as Vintage.
263
+ """
264
+ )
265
+ return
266
+
267
+
268
+ @app.cell
269
+ def _(df, pl):
270
+ (
271
+ df.with_columns(decade=pl.col("year").floordiv(10).mul(10))
272
+ .filter(pl.col("category") == "Vintage")
273
+ )
274
+ return
275
+
276
+
277
+ @app.cell(hide_code=True)
278
+ def _(mo):
279
+ mo.md(r"""We could also do a double comparison. VisiCalc is the only software that's vintage and from the 1970s. Let's perform this comparison operation.""")
280
+ return
281
+
282
+
283
+ @app.cell
284
+ def _(df, pl):
285
+ (
286
+ df.with_columns(decade=pl.col("year").floordiv(10).mul(10))
287
+ .filter(pl.col("category") == "Vintage")
288
+ .filter(pl.col("decade") == 1970)
289
+ )
290
+ return
291
+
292
+
293
+ @app.cell(hide_code=True)
294
+ def _(mo):
295
+ mo.md(
296
+ r"""
297
+ We could also do this comparison in one line, if readability is not a concern.
298
+
299
+ **Notice** that we must enclose each of the two expressions around the `&` in parentheses, because `&` binds more tightly than `==` in Python.
300
+ """
301
+ )
302
+ return
303
+
304
+
305
+ @app.cell
306
+ def _(df, pl):
307
+ (
308
+ df.with_columns(decade=pl.col("year").floordiv(10).mul(10))
309
+ .filter((pl.col("category") == "Vintage") & (pl.col("decade") == 1970))
310
+ )
311
+ return
312
+
313
+
314
+ @app.cell(hide_code=True)
315
+ def _(mo):
316
+ mo.md(r"""We can also use the built-in `eq` method for equality comparisons.""")
317
+ return
318
+
319
+
320
+ @app.cell
321
+ def _(df, pl):
322
+ (df
323
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
324
+ .filter(pl.col('category').eq('Vintage'))
325
+ )
326
+ return
327
+
328
+
329
+ @app.cell(hide_code=True)
330
+ def _(mo):
331
+ mo.md(
332
+ r"""
333
+ ### Not equal
334
+ We can also check whether a value is not equal to another. Here, we keep the rows whose category is not Vintage.
335
+ """
336
+ )
337
+ return
338
+
339
+
340
+ @app.cell
341
+ def _(df, pl):
342
+ (df
343
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
344
+ .filter(pl.col('category') != 'Vintage')
345
+ )
346
+ return
347
+
348
+
349
+ @app.cell(hide_code=True)
350
+ def _(mo):
351
+ mo.md(r"""Or with the built-in function.""")
352
+ return
353
+
354
+
355
+ @app.cell
356
+ def _(df, pl):
357
+ (df
358
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
359
+ .filter(pl.col('category').ne('Vintage'))
360
+ )
361
+ return
362
+
363
+
364
+ @app.cell(hide_code=True)
365
+ def _(mo):
366
+ mo.md(r"""Or, if you want to be extra clever, you can negate the equality check with the `~` operator.""")
367
+ return
368
+
369
+
370
+ @app.cell
371
+ def _(df, pl):
372
+ (df
373
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
374
+ .filter(~pl.col('category').eq('Vintage'))
375
+ )
376
+ return
377
+
378
+
379
+ @app.cell(hide_code=True)
380
+ def _(mo):
381
+ mo.md(
382
+ r"""
383
+ ### Greater than
384
+ Let's get the software where the year is greater than 2008 from the above dataframe.
385
+ """
386
+ )
387
+ return
388
+
389
+
390
+ @app.cell
391
+ def _(df, pl):
392
+ (df
393
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
394
+ .filter(~pl.col('category').eq('Vintage'))
395
+ .filter(pl.col('year') > 2008)
396
+ )
397
+ return
398
+
399
+
400
+ @app.cell(hide_code=True)
401
+ def _(mo):
402
+ mo.md(r"""Or, if we wanted the year 2008 to be included, we could use greater than or equal to.""")
403
+ return
404
+
405
+
406
+ @app.cell
407
+ def _(df, pl):
408
+ (df
409
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
410
+ .filter(~pl.col('category').eq('Vintage'))
411
+ .filter(pl.col('year') >= 2008)
412
+ )
413
+ return
414
+
415
+
416
+ @app.cell(hide_code=True)
417
+ def _(mo):
418
+ mo.md(r"""We could do the previous two operations with built-in methods. Here it is with greater than.""")
419
+ return
420
+
421
+
422
+ @app.cell
423
+ def _(df, pl):
424
+ (df
425
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
426
+ .filter(~pl.col('category').eq('Vintage'))
427
+ .filter(pl.col('year').gt(2008))
428
+ )
429
+ return
430
+
431
+
432
+ @app.cell(hide_code=True)
433
+ def _(mo):
434
+ mo.md(r"""And here it is with greater than or equal to.""")
435
+ return
436
+
437
+
438
+ @app.cell
439
+ def _(df, pl):
440
+ (df
441
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
442
+ .filter(~pl.col('category').eq('Vintage'))
443
+ .filter(pl.col('year').ge(2008))
444
+ )
445
+ return
446
+
447
+
448
+ @app.cell(hide_code=True)
449
+ def _(mo):
450
+ mo.md(
451
+ r"""
452
+ **Note**: For "less than" and "less than or equal to", you can use the operators `<` and `<=`. Alternatively, you can use the built-in methods `lt` and `le`, respectively.
453
+
454
+ ### Is between
455
+ Polars also allows us to filter on a range of values. Let's get the modern software where the year is between 2013 and 2016. This is inclusive on both ends (i.e., both years are part of the result).
456
+ """
457
+ )
458
+ return
459
+
460
+
461
+ @app.cell
462
+ def _(df, pl):
463
+ (df
464
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
465
+ .filter(pl.col('category').eq('Modern'))
466
+ .filter(pl.col('year').is_between(2013, 2016))
467
+ )
468
+ return
469
+
470
+
471
+ @app.cell(hide_code=True)
472
+ def _(mo):
473
+ mo.md(
474
+ r"""
475
+ ### Or operator
476
+ If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator.
477
+
478
+ Let's get software that is either modern or from the 1980s.
479
+ """
480
+ )
481
+ return
482
+
483
+
484
+ @app.cell
485
+ def _(df, pl):
486
+ (df
487
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
488
+ .filter((pl.col('category') == 'Modern') | (pl.col('decade') == 1980))
489
+ )
490
+ return
491
+
492
+
493
+ @app.cell(hide_code=True)
494
+ def _(mo):
495
+ mo.md(
496
+ r"""
497
+ ## Conditionals
498
+ Polars also allows you to create new columns based on a condition. Let's create a column *status* that will indicate if the software is "discontinued" or "in use".
499
+
500
+ Here's a list of products that are no longer in use.
501
+ """
502
+ )
503
+ return
504
+
505
+
506
+ @app.cell
507
+ def _():
508
+ discontinued_list = ['Lotus-123', 'WordStar', 'dBase III', 'VisiCalc', 'MS-DOS', 'HyperCard']
509
+ return (discontinued_list,)
510
+
511
+
512
+ @app.cell(hide_code=True)
513
+ def _(mo):
514
+ mo.md(r"""Here's how we can get a dataframe of the products that are discontinued.""")
515
+ return
516
+
517
+
518
+ @app.cell
519
+ def _(df, discontinued_list, pl):
520
+ (df
521
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
522
+ .filter(pl.col('software').is_in(discontinued_list))
523
+ )
524
+ return
525
+
526
+
527
+ @app.cell(hide_code=True)
528
+ def _(mo):
529
+ mo.md(r"""Now, let's create the *status* column.""")
530
+ return
531
+
532
+
533
+ @app.cell
534
+ def _(df, discontinued_list, pl):
535
+ (df
536
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
537
+ .with_columns(pl.when(pl.col('software').is_in(discontinued_list))
538
+ .then(pl.lit('Discontinued'))
539
+ .otherwise(pl.lit('In use'))
540
+ .alias('status')
541
+ )
542
+ )
543
+ return
544
+
545
+
546
+ @app.cell(hide_code=True)
547
+ def _(mo):
548
+ mo.md(
549
+ r"""
550
+ ## Unique counts
551
+ Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame.
552
+ """
553
+ )
554
+ return
555
+
556
+
557
+ @app.cell
558
+ def _(df, discontinued_list, pl):
559
+ (df
560
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
561
+ .with_columns(pl.when(pl.col('software').is_in(discontinued_list))
562
+ .then(pl.lit('Discontinued'))
563
+ .otherwise(pl.lit('In use'))
564
+ .alias('status')
565
+ )
566
+ .select('decade').unique()
567
+ )
568
+ return
569
+
570
+
571
+ @app.cell(hide_code=True)
572
+ def _(mo):
573
+ mo.md(r"""Finally, let's find out how many software products were used in each decade.""")
574
+ return
575
+
576
+
577
+ @app.cell
578
+ def _(df, discontinued_list, pl):
579
+ (df
580
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
581
+ .with_columns(pl.when(pl.col('software').is_in(discontinued_list))
582
+ .then(pl.lit('Discontinued'))
583
+ .otherwise(pl.lit('In use'))
584
+ .alias('status')
585
+ )
586
+ ['decade'].value_counts()
587
+ )
588
+ return
589
+
590
+
591
+ @app.cell(hide_code=True)
592
+ def _(mo):
593
+ mo.md(r"""We could also rewrite the above code as follows:""")
594
+ return
595
+
596
+
597
+ @app.cell
598
+ def _(df, discontinued_list, pl):
599
+ (df
600
+ .with_columns(decade=pl.col('year').floordiv(10).mul(10))
601
+ .with_columns(pl.when(pl.col('software').is_in(discontinued_list))
602
+ .then(pl.lit('Discontinued'))
603
+ .otherwise(pl.lit('In use'))
604
+ .alias('status')
605
+ )
606
+ .select('decade').to_series().value_counts()
607
+ )
608
+ return
609
+
610
+
611
+ @app.cell(hide_code=True)
612
+ def _(mo):
613
+ mo.md(r"""Hopefully, we've piqued your interest to try out Polars the next time you analyze your data.""")
614
+ return
615
+
616
+
617
+ @app.cell
618
+ def _():
619
+ return
620
+
621
+
622
+ if __name__ == "__main__":
623
+ app.run()