Koushik Khan committed on
Commit 3bdaf96 · 1 Parent(s): 3f17228

updated visual aspects

Files changed (1):
  1. polars/01_why_polars.py +167 -27

polars/01_why_polars.py CHANGED
@@ -2,8 +2,11 @@
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
# ]
# ///
import marimo

__generated_with = "0.11.0"
@@ -41,73 +44,179 @@ def _(mo):
Pandas has long been the go-to library for data manipulation and analysis in Python. However, as datasets grow larger and more complex, Pandas often struggles with performance and memory limitations. This is where Polars shines. Polars is a modern, high-performance DataFrame library designed to address the shortcomings of Pandas while providing a user-friendly experience.

Below, we’ll explore key reasons why Polars is a better choice in many scenarios, along with examples.

- ## (a) Easier & Intuitive Syntax

Polars is designed with a syntax that is very similar to PySpark while being intuitive like SQL. This makes it easier for data professionals to transition to Polars without a steep learning curve. For example:

**Example: Filtering and Aggregating Data**

- **In Pandas:**
- ```{python}
import pandas as pd

- df = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
                   "Male", "Female", "Male", "Female"],
        "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
-       "Height_CM": [150, 170, 146.5, 142, 155, 165, 170.8, 130, 132.5, 162]
    }
)

# query: average height of male and female after the age of 15 years

# step-1: filter
- filtered_df = df[df["Age"] > 15]

# step-2: groupby and aggregation
- result = filtered_df.groupby("Gender").mean()
```

- **In Polars:**
- ```{python}
import polars as pl

- df = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
                   "Male", "Female", "Male", "Female"],
        "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
-       "Height_CM": [150, 170, 146.5, 142, 155, 165, 170.8, 130, 132.5, 162]
    }
)

# query: average height of male and female after the age of 15 years

# filter, groupby and aggregation using method chaining
- result = df_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
```

Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while expressing the whole query in a *single* chain.

Additionally, Polars supports SQL-like operations *natively*, allowing you to write SQL queries directly on a Polars DataFrame:

- ```{python}
- result = df.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
```

- ## (b) Large Collection of Built-in APIs

Polars boasts an **extremely expressive API**, enabling you to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly enhances performance.

- ## (c) Query Optimization

A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.

For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:

- ```{python}
(
    df
    .groupby(by="Category").agg(pl.col("Number1").mean())
@@ -116,39 +225,55 @@
```

If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.

- ## (d) Scalability - Handling Large Datasets in Memory

Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.

**Example: Processing a Large Dataset**
In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:

- ```{python}
# This may fail with large datasets
df = pd.read_csv("large_dataset.csv")
```

In Polars, the same operation is seamless:

- ```{python}
df = pl.read_csv("large_dataset.csv")
```

Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:

- ```{python}
df = pl.scan_csv("large_dataset.csv")  # Lazy DataFrame
result = df.filter(pl.col("A") > 1).groupby("A").agg(pl.sum("B")).collect()  # Execute
```

- ## (e) Compatibility with Other ML Libraries

Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.

**Example: Preprocessing Data for Scikit-learn**

- ```{python}
import polars as pl
from sklearn.linear_model import LinearRegression

@@ -164,17 +289,32 @@

Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:

- ```{python}
# Convert to Pandas DataFrame
pandas_df = df.to_pandas()

# Convert to NumPy array
numpy_array = df.to_numpy()
```

- ## (f) Rich Functionality

- Polars supports advanced operations like **date handling**, **window functions**, **joins**, and **nested data types**, making it a versatile tool for data manipulation.
"""
)
return
@@ -184,7 +324,7 @@ def _(mo):
def _(mo):
mo.md(
"""
- # Why Not PySpark?

While **PySpark** is undoubtedly a versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.
 
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
+ #     "pandas==2.2.3",
+ #     "polars==1.22.0",
# ]
# ///
+
import marimo

__generated_with = "0.11.0"
 
Pandas has long been the go-to library for data manipulation and analysis in Python. However, as datasets grow larger and more complex, Pandas often struggles with performance and memory limitations. This is where Polars shines. Polars is a modern, high-performance DataFrame library designed to address the shortcomings of Pandas while providing a user-friendly experience.

Below, we’ll explore key reasons why Polars is a better choice in many scenarios, along with examples.
+ """
+ )
+ return
+

+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (a) Easier & Intuitive Syntax 📝

Polars is designed with a syntax that is very similar to PySpark while being intuitive like SQL. This makes it easier for data professionals to transition to Polars without a steep learning curve. For example:

**Example: Filtering and Aggregating Data**

+ ```python
import pandas as pd

+ df_pd = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
                   "Male", "Female", "Male", "Female"],
        "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
+       "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
    }
)

# query: average height of male and female after the age of 15 years

# step-1: filter
+ filtered_df_pd = df_pd[df_pd["Age"] > 15]

# step-2: groupby and aggregation
+ result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
```
+ """
+ )
+ return
+
+
+ @app.cell
+ def _():
+     import pandas as pd
+
+     df_pd = pd.DataFrame(
+         {
+             "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
+                        "Male", "Female", "Male", "Female"],
+             "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
+             "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
+         }
+     )
+
+     # query: average height of male and female after the age of 15 years
+
+     # step-1: filter
+     filtered_df_pd = df_pd[df_pd["Age"] > 15]
+
+     # step-2: groupby and aggregation
+     result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
+     result_pd
+     return df_pd, filtered_df_pd, pd, result_pd
+
+
+ @app.cell
+ def _(mo):
+ mo.md(
+ r"""
+ The same example can be worked out in Polars as below:

+ ```python
import polars as pl

+ df_pl = pl.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
                   "Male", "Female", "Male", "Female"],
        "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
+       "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
    }
)

# query: average height of male and female after the age of 15 years

# filter, groupby and aggregation using method chaining
+ result_pl = df_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
+ result_pl
```
+ """
+ )
+ return
+
+
+ @app.cell
+ def _():
+     import polars as pl
+
+     df_pl = pl.DataFrame(
+         {
+             "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
+                        "Male", "Female", "Male", "Female"],
+             "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
+             "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
+         }
+     )
+
+     # query: average height of male and female after the age of 15 years
+
+     # filter, groupby and aggregation using method chaining
+     result_pl = df_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
+     result_pl
+     return df_pl, pl, result_pl
+

+ @app.cell
+ def _(mo):
+ mo.md(
+ """
Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while expressing the whole query in a *single* chain.

Additionally, Polars supports SQL-like operations *natively*, allowing you to write SQL queries directly on a Polars DataFrame:

+ ```python
+ import polars as pl
+
+ df_pl = pl.DataFrame(
+     {
+         "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
+                    "Male", "Female", "Male", "Female"],
+         "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
+         "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
+     }
+ )
+
+ # query: average height of male and female after the age of 15 years
+ result = df_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
+ result
```
+ """
+ )
+ return
+

+ @app.cell
+ def _(df_pl):
+     result = df_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
+     result
+     return (result,)
+
+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (b) Large Collection of Built-in APIs ⚙️

  Polars boasts an **extremely expressive API**, enabling you to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly enhances performance.
+ """
+ )
+ return
+

+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (c) Query Optimization 📈

  A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.

For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:

+ ```python
(
    df
    .group_by("Category").agg(pl.col("Number1").mean())
    .filter(pl.col("Category").is_in(["A", "B"]))
)
```

  If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.
+ """
+ )
+ return

+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (d) Scalability - Handling Large Datasets in Memory ⬆️

Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.

**Example: Processing a Large Dataset**
In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:

+ ```python
# This may fail with large datasets
df = pd.read_csv("large_dataset.csv")
```

In Polars, the same operation is seamless:

+ ```python
df = pl.read_csv("large_dataset.csv")
```

Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:

+ ```python
df = pl.scan_csv("large_dataset.csv")  # Lazy DataFrame
result = df.filter(pl.col("A") > 1).group_by("A").agg(pl.sum("B")).collect()  # Execute
```
+ """
+ )
+ return

+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (e) Compatibility with Other ML Libraries 🤝

Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.

**Example: Preprocessing Data for Scikit-learn**

+ ```python
import polars as pl
from sklearn.linear_model import LinearRegression

...
```

Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:

+ ```python
# Convert to Pandas DataFrame
pandas_df = df.to_pandas()

# Convert to NumPy array
numpy_array = df.to_numpy()
```
+ """
+ )
+ return
+
+
+ @app.cell
+ def _(mo):
+ mo.md(
+ """
+ ## (f) Rich Functionality ⚡
+
+ Polars supports advanced operations like

+ - **date handling**
+ - **window functions**
+ - **joins**
+ - **nested data types**

+ making it a versatile tool for data manipulation.
"""
)
return

def _(mo):
mo.md(
"""
+ # Why Not PySpark? ⁉️

While **PySpark** is undoubtedly a versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.