Koushik Khan committed on
Commit
24c07d4
·
unverified ·
1 Parent(s): cd95559

updated why_polars

Files changed (1)
  1. polars/001_why_polars.py +40 -67
polars/001_why_polars.py CHANGED
@@ -36,14 +36,14 @@ def _(mo):
36
 
37
  Below, we’ll explore key reasons why Polars is a better choice in many scenarios, along with examples.
38
 
39
- ## (a) Easier Syntax Similar
40
 
41
  Polars is designed with a syntax that is very similar to PySpark while being intuitive like SQL. This makes it easier for data professionals to transition to Polars without a steep learning curve. For example:
42
 
43
  **Example: Filtering and Aggregating Data**
44
 
45
  **In Pandas:**
46
- ```
47
  import pandas as pd
48
 
49
  df = pd.DataFrame(
@@ -65,7 +65,7 @@ def _(mo):
65
  ```
66
 
67
  **In Polars:**
68
- ```
69
  import polars as pl
70
 
71
  df = pl.DataFrame(
@@ -87,42 +87,62 @@ def _(mo):
87
 
88
  Additionally, Polars supports SQL-like operations *natively*, allowing you to write SQL queries directly on a Polars DataFrame:
89
 
90
- ```
91
  result = df.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
92
  ```
93
 
94
- ## (b) Scalability - Handling Large Datasets in Memory
95
 
96
  Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.
97
 
98
  **Example: Processing a Large Dataset**
99
  In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:
100
 
101
- ```
102
  # This may fail with large datasets
103
  df = pd.read_csv("large_dataset.csv")
104
  ```
105
 
106
  In Polars, the same operation is seamless:
107
 
108
- ```
109
  df = pl.read_csv("large_dataset.csv")
110
  ```
111
 
112
  Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:
113
 
114
- ```
115
  df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame
116
  result = df.filter(pl.col("A") > 1).group_by("A").agg(pl.sum("B")).collect() # Execute
117
  ```
118
 
119
- ## (c) Compatibility with Other Machine Learning Libraries
120
 
121
  Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.
122
 
123
  **Example: Preprocessing Data for Scikit-learn**
124
 
125
- ```
126
  import polars as pl
127
  from sklearn.linear_model import LinearRegression
128
 
@@ -138,7 +158,7 @@ def _(mo):
138
 
139
  Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:
140
 
141
- ```
142
  # Convert to Pandas DataFrame
143
  pandas_df = df.to_pandas()
144
 
@@ -146,67 +166,25 @@ def _(mo):
146
  numpy_array = df.to_numpy()
147
  ```
148
 
149
- **(d) Additional Advantages of Polars**
150
-
151
- - Rich Functionality: Polars supports advanced operations like window functions, joins, and nested data types, making it a versatile tool for data manipulation.
152
-
153
- - Query Optimization: Polars is significantly faster than Pandas due to its parallelized and vectorized operations. Benchmarks often show Polars outperforming Pandas by 10x or more.
154
 
155
- - Memory Efficiency: Polars uses memory more efficiently, reducing the risk of out-of-memory errors.
156
-
157
- - Lazy API: The lazy evaluation API allows for query optimization and deferred execution, which is particularly useful for complex workflows.
158
  """
159
  )
160
  return
161
 
162
 
163
  @app.cell
164
- def _():
165
- import pandas as pd
166
-
167
- df = pd.DataFrame(
168
- {
169
- "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
170
- "Male", "Female", "Male", "Female"],
171
- "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
172
- "Height_CM": [150, 170, 146.5, 142, 155, 165, 170.8, 130, 132.5, 162]
173
- }
174
- )
175
-
176
- # query: average height of male and female after the age of 15 years
177
- filtered_df = df[df["Age"] > 15]
178
- result = filtered_df.groupby("Gender").mean()["Height_CM"]
179
- result
180
- return df, filtered_df, pd, result
181
-
182
-
183
- @app.cell
184
- def _():
185
- import polars as pl
186
- return (pl,)
187
 
 
188
 
189
- @app.cell
190
- def _(pl):
191
- df_pl = pl.DataFrame(
192
- {
193
- "Gender": ["Male", "Female", "Male", "Female", "Male", "Female",
194
- "Male", "Female", "Male", "Female"],
195
- "Age": [13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
196
- "Height_CM": [150.0, 170.0, 146.5, 142.0, 155.0, 165.0, 170.8, 130.0, 132.5, 162.0]
197
- }
198
  )
199
-
200
- # df_pl
201
- # query: average height of male and female after the age of 15 years
202
- result_pl = df_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
203
- result_pl
204
- return df_pl, result_pl
205
-
206
-
207
- @app.cell
208
- def _(df_pl):
209
- df_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
210
  return
211
 
212
 
@@ -223,10 +201,5 @@ def _(mo):
223
  return
224
 
225
 
226
- @app.cell
227
- def _():
228
- return
229
-
230
-
231
  if __name__ == "__main__":
232
  app.run()
 
36
 
37
  Below, we’ll explore key reasons why Polars is a better choice in many scenarios, along with examples.
38
 
39
+ ## (a) Easier & Intuitive Syntax
40
 
41
  Polars is designed with a syntax that is very similar to PySpark while being intuitive like SQL. This makes it easier for data professionals to transition to Polars without a steep learning curve. For example:
42
 
43
  **Example: Filtering and Aggregating Data**
44
 
45
  **In Pandas:**
46
+ ```{python}
47
  import pandas as pd
48
 
49
  df = pd.DataFrame(
 
65
  ```
66
 
67
  **In Polars:**
68
+ ```{python}
69
  import polars as pl
70
 
71
  df = pl.DataFrame(
 
87
 
88
  Additionally, Polars supports SQL-like operations *natively*, allowing you to write SQL queries directly on a Polars DataFrame:
89
 
90
+ ```{python}
91
  result = df.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
92
  ```
93
 
94
+ ## (b) Large Collection of Built-in APIs
95
+
96
+ Polars boasts an **extremely expressive API**, enabling you to perform virtually any operation using built-in methods. In contrast, Pandas often requires complex operations to fall back on the `apply` method with a lambda function. The issue with `apply` is that it calls a Python function once per row, looping through the DataFrame sequentially, which is inefficient. Polars' built-in expressions instead operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly enhances performance.
97
+
98
+ ## (c) Query Optimization
99
+
100
+ A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.
101
+
102
+ For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:
103
+
104
+ ```{python}
105
+ (
106
+ df
107
+ .group_by("Category").agg(pl.col("Number1").mean())
108
+ .filter(pl.col("Category").is_in(["A", "B"]))
109
+ )
110
+ ```
111
+
112
+ If executed eagerly, the `group_by` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame down to the relevant categories ("A" and "B") and only then performing the `group_by` on the reduced dataset. This minimizes unnecessary computation and significantly improves efficiency.
113
+
114
+ ## (d) Scalability - Handling Large Datasets in Memory
115
 
116
  Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.
117
 
118
  **Example: Processing a Large Dataset**
119
  In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:
120
 
121
+ ```{python}
122
  # This may fail with large datasets
123
  df = pd.read_csv("large_dataset.csv")
124
  ```
125
 
126
  In Polars, the same operation is seamless:
127
 
128
+ ```{python}
129
  df = pl.read_csv("large_dataset.csv")
130
  ```
131
 
132
  Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:
133
 
134
+ ```{python}
135
  df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame
136
  result = df.filter(pl.col("A") > 1).group_by("A").agg(pl.sum("B")).collect() # Execute
137
  ```
138
 
139
+ ## (e) Compatibility with Other ML Libraries
140
 
141
  Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.
142
 
143
  **Example: Preprocessing Data for Scikit-learn**
144
 
145
+ ```{python}
146
  import polars as pl
147
  from sklearn.linear_model import LinearRegression
148
 
 
158
 
159
  Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:
160
 
161
+ ```{python}
162
  # Convert to Pandas DataFrame
163
  pandas_df = df.to_pandas()
164
 
 
166
  numpy_array = df.to_numpy()
167
  ```
168
 
169
+ ## (f) Rich Functionality
170
 
171
+ Polars supports advanced operations like **date handling**, **window functions**, **joins**, and **nested data types**, making it a versatile tool for data manipulation.
172
  """
173
  )
174
  return
175
 
176
 
177
  @app.cell
178
+ def _(mo):
179
+ mo.md(
180
+ """
181
+ # Why Not PySpark?
 
182
 
183
+ While **PySpark** is undoubtedly a versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.
184
 
185
+ When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware.
186
+ """
187
  )
188
  return
189
 
190
 
 
201
  return
202
 
203
 
204
  if __name__ == "__main__":
205
  app.run()