Commit 0d74625 by Mustjaab · unverified · 1 Parent(s): e338d9a

Add files via upload

Files changed (1)
  1. DuckDB_Loading_CSVs.py +164 -0
DuckDB_Loading_CSVs.py ADDED
@@ -0,0 +1,164 @@
+ # /// script
+ # requires-python = ">=3.10"
+ # dependencies = [
+ #     "marimo",
+ #     "duckdb",
+ #     "plotly.express"
+ # ]
+ # ///
+
+
+ import marimo
+
+ __generated_with = "0.11.17"
+ app = marimo.App(width="medium")
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     import plotly.express as px
+     return mo, px
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""# Loading CSVs with DuckDB""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         <p> When I first learned about DuckDB, it was a game changer. I used to load the data I wanted to work with into database software like MS SQL Server, and then build a bridge to an IDE for the language I wanted to use, such as Python or R; it was quite the hassle. DuckDB changed all that: now I could just import the data file into the IDE or notebook, make a DuckDB connection, and there we go! But then I realized I didn't even need to import the file with Python first. I could query the CSV file directly using SQL through a DuckDB connection.</p>
+
+         ## Introduction
+         <p> I found this dataset on the evolution of AI research by discipline from <a href="https://oecd.ai/en/data?selectedArea=ai-research&selectedVisualization=16731"> OECD </a>, and it piqued my interest. I feel like publications in natural language processing jumped drastically in the mid-2010s, and I'm excited to find out if that's the case. </p>
+
+         <p> In this notebook, we'll: </p>
+         <ul>
+         <li> Import the CSV file into the notebook</li>
+         <li> Create another table within the database based on the CSV</li>
+         <li> Dig into how publications on natural language processing have evolved over the years</li>
+         </ul>
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""## Load the CSV""")
+     return
+
+
+ @app.cell
+ def _(mo):
+     _df = mo.sql(
+         f"""
+         /* Another way to load the CSV could be
+         SELECT *
+         FROM read_csv('AI_Research_Data.csv')
+         */
+         SELECT *
+         FROM "AI_Research_Data.csv"
+         LIMIT 5;
+         """
+     )
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""## Create Another Table""")
+     return
+
+
+ @app.cell
+ def _(mo):
+     Discipline_Analysis = mo.sql(
+         f"""
+         -- Build a table from the CSV that contains only the specified columns
+         CREATE TABLE Domain_Analysis AS
+             SELECT Year, Concept, publications FROM "AI_Research_Data.csv"
+         """
+     )
+     return Discipline_Analysis, Domain_Analysis
+
+
+ @app.cell
+ def _(Domain_Analysis, mo):
+     Analysis = mo.sql(
+         f"""
+         SELECT *
+         FROM Domain_Analysis
+         GROUP BY Concept, Year, publications
+         ORDER BY Year
+         """
+     )
+     return (Analysis,)
+
+
+ @app.cell
+ def _(Domain_Analysis, mo):
+     _df = mo.sql(
+         f"""
+         SELECT
+             AVG(CASE WHEN Year < 2020 THEN publications END) AS avg_pre_2020,
+             AVG(CASE WHEN Year >= 2020 THEN publications END) AS avg_2020_onward
+         FROM Domain_Analysis
+         WHERE Concept = 'Natural language processing';
+         """
+     )
+     return
+
+
+ @app.cell
+ def _(Domain_Analysis, mo):
+     NLP_Analysis = mo.sql(
+         f"""
+         SELECT
+             publications,
+             CASE
+                 WHEN Year < 2020 THEN 'Pre-2020'
+                 ELSE '2020-onward'
+             END AS period
+         FROM Domain_Analysis
+         WHERE Year >= 2000
+         AND Concept = 'Natural language processing';
+         """,
+         output=False
+     )
+     return (NLP_Analysis,)
+
+
+ @app.cell
+ def _(NLP_Analysis, px):
+     # Refer to the query's columns by name rather than by position
+     px.box(NLP_Analysis, x="period", y="publications", color="period")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(r"""<p> We can see a marked increase in NLP publications from 2020 onward, which makes sense given the rapid emergence of commercial large language models and AI assistants. </p>""")
+     return
+
+
+ @app.cell(hide_code=True)
+ def _(mo):
+     mo.md(
+         r"""
+         ## Conclusion
+         <p> In this notebook, we learned how to:</p>
+         <ul>
+         <li> Load a CSV into DuckDB </li>
+         <li> Create other tables using the imported CSV </li>
+         <li> Seamlessly analyze and visualize data across SQL and Python cells</li>
+         </ul>
+         """
+     )
+     return
+
+
+ if __name__ == "__main__":
+     app.run()
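
Note on the `AVG(CASE WHEN ...)` cell: the query computes two conditional averages over the rows for one concept, and rows that fall outside a `CASE` branch become NULL and are ignored by `AVG`. As a rough illustration of that logic only, here is a plain-Python sketch using the standard library. The sample rows, the `SAMPLE_CSV` name, and the `period_averages` helper are all hypothetical and invented for this note; the numbers are not taken from the OECD dataset, and this is not how DuckDB evaluates the query internally.

```python
import csv
import io
from statistics import mean

# Hypothetical sample rows: invented for illustration, NOT from the OECD data.
SAMPLE_CSV = """Year,Concept,publications
2018,Natural language processing,100
2019,Natural language processing,200
2020,Natural language processing,400
2021,Natural language processing,600
2021,Computer vision,900
"""


def period_averages(csv_text, concept, split_year=2020):
    """Mirror AVG(CASE WHEN Year < split_year ...) / AVG(CASE WHEN Year >= ...)."""
    before, onward = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Concept"] != concept:  # plays the role of WHERE Concept = ...
            continue
        # Each row lands in exactly one bucket, like the two CASE branches.
        bucket = before if int(row["Year"]) < split_year else onward
        bucket.append(float(row["publications"]))
    # SQL AVG over zero non-NULL values yields NULL; None stands in for it here.
    return (
        mean(before) if before else None,
        mean(onward) if onward else None,
    )


avg_pre_2020, avg_2020_onward = period_averages(
    SAMPLE_CSV, "Natural language processing"
)
print(avg_pre_2020, avg_2020_onward)
```

With real data, the same comparison is better left to the SQL cell in the notebook; the sketch is only meant to make the bucketing explicit.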