{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# media_stores.ipynb\n",
    "> A notebook for storing all types of media as vector stores\n",
    "\n",
    "In this notebook, we'll implement the functionality required to interact with many types of media stores: not just text files and PDFs, but also images, audio, and video.\n",
    "\n",
    "Below are some references for integrating different media types into vector stores.\n",
    "\n",
    "- YouTube: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/youtube_audio\n",
    "- Websites:\n",
    "  - https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/\n",
    "  - https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_base\n",
    "  - Extracting relevant information from website: https://www.oncrawl.com/technical-seo/extract-relevant-text-content-from-html-page/\n",
    "\n",
    ":::{.callout-caution}\n",
    "These notebooks are development notebooks, meaning that they are meant to be run locally or somewhere that supports navigating a full repository (in other words, not Google Colab, unless you clone the entire repository to Drive and then mount it). However, if you're able to do all of those steps, you're likely also able to figure out the required pip installs for development there.\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "skip_exec: true\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| default_exp MediaVectorStores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "# import libraries here\n",
    "import os\n",
    "import itertools\n",
    "\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.document_loaders.unstructured import UnstructuredFileLoader\n",
    "from langchain.document_loaders.generic import GenericLoader\n",
    "from langchain.document_loaders.parsers import OpenAIWhisperParser\n",
    "from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader\n",
    "from langchain.document_loaders import WebBaseLoader, UnstructuredURLLoader\n",
    "from langchain.docstore.document import Document\n",
    "\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.chains import RetrievalQAWithSourcesChain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we will not export the following packages to our module because in this exploration we have decided to go with langchain implementations, or they are only used for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#exploration\n",
    "import trafilatura\n",
    "import requests\n",
    "import justext"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Media to Text Converters\n",
    "In this section, we provide a set of converters that can either read text and convert it to other useful text, or read YouTube or Websites and convert them into text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Standard Text Splitter\n",
    "Here we define a standard text splitter. This can be used on any text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def rawtext_to_doc_split(text, chunk_size=1500, chunk_overlap=150):\n",
    "  \n",
    "  # Quick type checking: wrap a single input in a list\n",
    "  if not isinstance(text, list):\n",
    "    text = [text]\n",
    "\n",
    "  # Create splitter\n",
    "  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,\n",
    "                                                 chunk_overlap=chunk_overlap,\n",
    "                                                 add_start_index = True)\n",
    "  \n",
    "  # Split into doc segments (the emptiness check avoids an IndexError on an empty list)\n",
    "  if text and isinstance(text[0], Document):\n",
    "    doc_segments = text_splitter.split_documents(text)\n",
    "  else:\n",
    "    doc_segments = text_splitter.split_documents(text_splitter.create_documents(text))\n",
    "\n",
    "  # Make into one big list\n",
    "  doc_segments = list(itertools.chain(*doc_segments)) if doc_segments and isinstance(doc_segments[0], list) else doc_segments\n",
    "\n",
    "  return doc_segments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='This is a', metadata={}),\n",
       " Document(page_content='sentence.', metadata={}),\n",
       " Document(page_content='This is', metadata={}),\n",
       " Document(page_content='another', metadata={}),\n",
       " Document(page_content='sentence.', metadata={}),\n",
       " Document(page_content='This is a', metadata={}),\n",
       " Document(page_content='a third', metadata={}),\n",
       " Document(page_content='sentence.', metadata={})]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# test basic functionality\n",
    "rawtext_to_doc_split([\"This is a sentence. This is another sentence.\", \"This is a third sentence.\"], chunk_size=10, chunk_overlap=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll write a quick function to do a unit test on the function we just wrote."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_split_texts():\n",
    "    \n",
    "    # basic behavior\n",
    "    text = \"This is a sample text that we will use to test the splitter function.\"\n",
    "    expected_output = [\"This is a sample text that we will use to test the splitter function.\"]\n",
    "    out_splits = [doc.page_content for doc in rawtext_to_doc_split(text)]\n",
    "    assert all([target==expected for target, expected in zip(expected_output, out_splits)]), ('The basic splitter functionality is incorrect, and does not correctly ' +\n",
    "                                                                                              'use chunk_size and chunk_overlap on chunks <1500.')\n",
    "    \n",
    "    # try a known result with variable chunk_length and chunk_overlap\n",
    "    text = (\"This is a sample text that we will use to test the splitter function. It should split the \" +\n",
    "            \"text into multiple chunks of size 1500 with an overlap of 150 characters. This is the second chunk.\")\n",
    "    expected_output = ['This is a sample text that we will use to test the',\n",
    "                       'test the splitter function. It should split the',\n",
    "                       'split the text into multiple chunks of size 1500',\n",
    "                       'size 1500 with an overlap of 150 characters. This',\n",
    "                       'This is the second chunk.']\n",
    "    out_splits = [doc.page_content for doc in rawtext_to_doc_split(text, 50, 10)]\n",
    "    assert all([target==expected for target, expected in zip(expected_output, out_splits)]), 'The splitter does not correctly use chunk_size and chunk_overlap.'\n",
    "\n",
    "# Run test\n",
    "test_split_texts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function is used for testing to make sure single files and lists can be accommodated, and that what are returned are lists of documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a set of tests to make sure that this works on both lists and single inputs\n",
    "def test_converters_inputs(test_fcn, files_list=None):\n",
    "    if files_list is None:\n",
    "        single_file = 'The cat was super cute and adorable'\n",
    "        multiple_files = [single_file, 'The dog was also cute and her wet nose is always so cold!']\n",
    "    elif isinstance(files_list, str):\n",
    "        single_file = files_list\n",
    "        multiple_files = [single_file, single_file]\n",
    "    elif isinstance(files_list, list):\n",
    "        single_file = files_list[0]\n",
    "        multiple_files = files_list\n",
    "    else:\n",
    "        raise TypeError(\"You've passed in a files_list which is not a string, a list, or None\")\n",
    "\n",
    "    # test for single file\n",
    "    res = test_fcn(single_file)\n",
    "    assert isinstance(res, list), f'FAILED ASSERT in {test_fcn.__name__}. A single file should return a list.'\n",
    "    assert not isinstance(res[0], list), f'FAILED ASSERT in {test_fcn.__name__}. A single file should return a 1-dimensional list.'\n",
    "\n",
    "    # test for multiple files\n",
    "    res = test_fcn(multiple_files)\n",
    "    assert isinstance(res, list), f'FAILED ASSERT in {test_fcn.__name__}. A list of files should return a list.'\n",
    "    assert not isinstance(res[0], list), f'FAILED ASSERT in {test_fcn.__name__}. A list of files should return a 1-dimensional list with all documents combined.'\n",
    "\n",
    "    # test that the return type of elements should be Document\n",
    "    assert all([isinstance(doc, Document) for doc in res]), f'FAILED ASSERT in {test_fcn.__name__}. The return type of elements should be Document.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test behavior of standard text splitter\n",
    "test_converters_inputs(rawtext_to_doc_split)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### File or Files\n",
    "Functions which load a single file or files from a directory, including pdfs, text files, html, images, and more. See [Unstructured File Documentation](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "## A single File\n",
    "def _file_to_text(single_file, chunk_size = 1000, chunk_overlap=150):\n",
    "\n",
    "  # Create loader and get segments\n",
    "  loader = UnstructuredFileLoader(single_file)\n",
    "  doc_segments = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=chunk_size,\n",
    "                                                                      chunk_overlap=chunk_overlap,\n",
    "                                                                      add_start_index=True))\n",
    "  return doc_segments\n",
    "\n",
    "\n",
    "## Multiple files\n",
    "def files_to_text(files_list, chunk_size=1000, chunk_overlap=150):\n",
    "  \n",
    "  # Quick type checking\n",
    "  if not isinstance(files_list, list):\n",
    "    files_list = [files_list]\n",
    "\n",
    "  # Workaround: load each file separately, because UnstructuredFileLoader accepts a list of files but cannot yet split them correctly\n",
    "  all_segments = [_file_to_text(single_file, chunk_size=chunk_size, chunk_overlap=chunk_overlap) for single_file in files_list]\n",
    "  # Flatten the per-file lists into one list of documents (also safe for an empty files_list)\n",
    "  all_segments = list(itertools.chain.from_iterable(all_segments))\n",
    "\n",
    "  return all_segments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
       " Document(page_content='traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the', metadata={'source': '../roadnottaken.txt', 'start_index': 82}),\n",
       " Document(page_content='it bent in the undergrowth;\\r\\rThen took the other, as just as fair,\\rAnd having perhaps the better', metadata={'source': '../roadnottaken.txt', 'start_index': 152}),\n",
       " Document(page_content='perhaps the better claim,\\rBecause it was grassy and wanted wear;\\rThough as for that the passing', metadata={'source': '../roadnottaken.txt', 'start_index': 230}),\n",
       " Document(page_content='that the passing there\\rHad worn them really about the same,\\r\\rAnd both that morning equally lay\\rIn', metadata={'source': '../roadnottaken.txt', 'start_index': 309}),\n",
       " Document(page_content='equally lay\\rIn leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing', metadata={'source': '../roadnottaken.txt', 'start_index': 392}),\n",
       " Document(page_content='day! Yet knowing how way leads on to way,\\rI doubted if I should ever come back. I shall be telling', metadata={'source': '../roadnottaken.txt', 'start_index': 474}),\n",
       " Document(page_content='I shall be telling this with a sigh\\rSomewhere ages and ages hence:\\rTwo roads diverged in a wood,', metadata={'source': '../roadnottaken.txt', 'start_index': 554}),\n",
       " Document(page_content='diverged in a wood, and IэI took the one less traveled by,\\rAnd that has made all the difference.', metadata={'source': '../roadnottaken.txt', 'start_index': 631}),\n",
       " Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
       " Document(page_content='traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the', metadata={'source': '../roadnottaken.txt', 'start_index': 82})]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ensure basic behavior\n",
    "res = files_to_text(['../roadnottaken.txt', '../roadnottaken.txt'], chunk_size=100, chunk_overlap=20)\n",
    "res[:11]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_converters_inputs(files_to_text, '../roadnottaken.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### YouTube\n",
    "This works by first downloading each video's audio and then transcribing it to text with the OpenAI Whisper parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def youtube_to_text(urls, save_dir = \"content\"):\n",
    "  # Transcribe the videos to text\n",
    "  # save_dir: directory to save audio files\n",
    "\n",
    "  if not isinstance(urls, list):\n",
    "    urls = [urls]\n",
    "  \n",
    "  youtube_loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())\n",
    "  youtube_docs = youtube_loader.load()\n",
    "  \n",
    "  return youtube_docs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's demonstrate the functionality using some existing YouTube videos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Two Karpathy lecture videos\n",
    "urls = [\"https://youtu.be/kCc8FmEb1nY\", \"https://youtu.be/VMj-3S1tku0\"]\n",
    "youtube_text = youtube_to_text(urls)\n",
    "youtube_text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additional YouTube helper functions are included below. These two retrieve the transcript of a video and save its text to a file.\n",
    "\n",
    "<p style=\"color:red\"><strong>Note that in this stage of development, the following cannot be tested due to YouTube download errors.</strong></p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def save_text(text, text_name = None):\n",
    "  if not text_name:\n",
    "    text_name = text[:20]\n",
    "  text_path = os.path.join(\"/content\",text_name+\".txt\")\n",
    "  \n",
    "  with open(text_path, \"x\") as f:\n",
    "    f.write(text)\n",
    "  # Return the location at which the transcript is saved\n",
    "  return text_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def get_youtube_transcript(yt_url, save_transcript = False, temp_audio_dir = \"sample_data\"):\n",
    "  # Transcribe the videos to text; if save_transcript is True, also save the transcript to a file in /content\n",
    "  # temp_audio_dir: directory to save audio files\n",
    "\n",
    "  youtube_docs = youtube_to_text(yt_url, save_dir = temp_audio_dir)\n",
    "  \n",
    "  # Combine doc\n",
    "  combined_docs = [doc.page_content for doc in youtube_docs]\n",
    "  combined_text = \" \".join(combined_docs)\n",
    "  \n",
    "  # Save text to file\n",
    "  video_path = youtube_docs[0].metadata[\"source\"]\n",
    "  youtube_name = os.path.splitext(os.path.basename(video_path))[0]\n",
    "\n",
    "  save_path = None\n",
    "  if save_transcript:\n",
    "    save_path = save_text(combined_text, youtube_name)\n",
    "  \n",
    "  return youtube_docs, save_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Websites\n",
    "We have a few different approaches to reading website text. Some approaches are specifically provided through langchain and some are other packages that seem to be performant. We'll show the pros/cons of each approach below.\n",
    "\n",
    "#### Langchain: WebBaseLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def website_to_text_web(url, chunk_size = 1500, chunk_overlap=100):\n",
    "  \n",
    "    # Url can be a single string or list\n",
    "    website_loader = WebBaseLoader(url)\n",
    "    website_raw = website_loader.load()\n",
    "\n",
    "    # Split the loaded pages into chunked documents\n",
    "    website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)\n",
    "  \n",
    "    # Return the split documents\n",
    "    return website_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now for a quick test to ensure functionality..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "demo_urls = [\"https://www.espn.com/\", \"https://www.vanderbilt.edu/undergrad-datascience/faq\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content=\"ESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n        Skip to main content\\n    \\n\\n        Skip to navigation\\n    \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\nSearch\\n\\n\\n\\nscores\\n\\n\\n\\nNFLMLBNBANHLSoccerGolf…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n  \\n\\nSUBSCRIBE NOW\\n\\n\\n\\n\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPGA TOUR LIVE\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nCrossFit Games\\n\\n\\n\\n\\n\\n\\n\\nSlamBall\\n\\n\\n\\n\\n\\n\\n\\nThe Ultimate Fighter: Season 31\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\n\\nQuick Links\\n\\n\\n\\n\\nWomen's World Cup\\n\\n\\n\\n\\n\\n\\n\\nNHL Free Agency\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nNBA Trade Machine\\n\\n\\n\\n\\n\\n\\n\\nThe Basketball Tournament\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n      Manage Favorites\\n      \\n\\n\\n\\nCustomize ESPNSign UpLog InESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN 
Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content=\"How can your team win the national title? Connelly breaks down what needs to go right for all 17 contendersThe fewer things that have to go right to win a title, the better a team's chances of taking the crown. Here's what has to fall each contender's way.7hBill ConnellyDale Zanine/USA TODAY SportsPosition U 2023: Is USC on the verge of taking over QBU from Oklahoma?Which schools produce the most talent at each position?1dDavid HaleConnelly's conference previews: Intel on all 133 FBS teamsTOP HEADLINESFreeze 'uncomfortable' as Auburn opens campTexans' Metchie relied on faith amid cancer fightHornets have new owners after MJ sale finalizedMiami coach expects rough treatment of MessiDrexel basketball player found dead in apartmentGermany exits WWC after draw with South KoreaBrady takes minority stake in English soccer teamDeep dish: Cubs' output at plate best since 1897Re-drafting 2018 NFL class 5 years laterWHAT HAPPENED IN INDY?Inside the shocking feud between Jonathan Taylor and the ColtsHe was the NFL's leading rusher two seasons ago and wanted an extension with the Colts, but now he wants out. How things got so bad for Taylor and Indianapolis.8hStephen HolderZach Bolinger/Icon Sportswire'THE BEST IN THE WORLD RIGHT NOW'Why Stephen A. is convinced Tyreek Hill is the NFL's top WR2h2:57WYNDHAM CHAMPIONSHIPCONTINUES THROUGH SUNDAYShane Lowry fluffs shot, drains birdie chip immediately after3h0:35Countdown to FedEx Cup Playoffs, AIG Open and the Ryder CupDiana Taurasi, 10,000\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content=\"Lowry fluffs shot, drains birdie chip immediately after3h0:35Countdown to FedEx Cup Playoffs, AIG Open and the Ryder CupDiana Taurasi, 10,000 points and the shot that made WNBA scoring history1dMLB SCOREBOARDTHURSDAY'S GAMESSee AllTrivia: Can you guess the right player?HERE COMES HELPBring on the reinforcements! 10 returning players as good as a trade deadline blockbusterInjured stars expected to come off the IL soon -- or have already -- could rock MLB's playoff races.7hAlden GonzalezJay Biggerstaff-USA TODAY Sports'CLEARLY THE ACC IS STRUGGLING'Finebaum: FSU is better off leaving the ACC5h1:04Thamel's realignment buzz: Latest on Pac-12, Big 12 and ACCAN AGGRESSIVE STRATEGYHow the Big 12 landed Colorado and shook up college footballThe Big 12 learned lessons two years ago after getting burned by Texas and Oklahoma. It resulted in a more aggressive strategy that could dramatically change the sport.2dHeather DinichRaymond Carlin/Icon Sportswire Top HeadlinesFreeze 'uncomfortable' as Auburn opens campTexans' Metchie relied on faith amid cancer fightHornets have new owners after MJ sale finalizedMiami coach expects rough treatment of MessiDrexel basketball player found dead in apartmentGermany exits WWC after draw with South KoreaBrady takes minority stake in English soccer teamDeep dish: Cubs' output at plate best since 1897Re-drafting 2018 NFL class 5 years laterFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InICYMI0:54Serena Williams, Alexis Ohanian\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content='2018 NFL class 5 years laterFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InICYMI0:54Serena Williams, Alexis Ohanian use drones to reveal gender of 2nd childSerena Williams and her husband Alexis Ohanian find out the gender of their second child in a spectacular display of drones. Best of ESPN+Todd Kirkland/Getty ImagesMLB 2023 trade deadline: Winners, losers and in-betweenersThe 2023 trade deadline is over! Who crushed it, and who left much to be desired? We weigh in on all 30 clubs.AP Photo/Matt YorkLowe: Why Bradley Beal could unlock KD, Book and the most dangerous version of the Phoenix Suns yetWith Kevin Durant, Devin Booker and Beal, Phoenix is already an inner-circle title contender. But if the Suns continue a Beal experiment the Wizards ran last season? Good luck.Cliff Welch/Icon SportswirePredicting 10 NFL starting quarterback battles: Who is QB1?We talked to people around the NFL and projected the QB1 for 10 unsettled situations, including a wide-open race in Tampa Bay. Trending NowAP Photo/Julio Cortez\\'Revis Island\\' resonates long after Hall of Famer\\'s retirementDarrelle Revis made his name as a dominant corner but might be best known for his \"island\" moniker players still adopt today.Illustration by ESPNThe wild life of Gardner MinshewFour colleges, three NFL teams, two Manias and the hug that broke the internet. It\\'s been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.Illustration by ESPNBest 2023 Women\\'s World Cup', metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content=\"that broke the internet. It's been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.Illustration by ESPNBest 2023 Women's World Cup players: Morgan, Caicedo, moreESPN's expert panel selected the top 25 players of the Women's World Cup to keep an eye on, from Sophia Smith to Sam Kerr and more. How to Watch on ESPN+(AP Photo/Koji Sasahara, File)How to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here's everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivateMock Draft NowSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice With a Mock DraftSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice with a Mock Draft\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content=\"ESPN+\\n\\n\\n\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPGA TOUR LIVE\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nCrossFit Games\\n\\n\\n\\n\\n\\n\\n\\nSlamBall\\n\\n\\n\\n\\n\\n\\n\\nThe Ultimate Fighter: Season 31\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\n\\nQuick Links\\n\\n\\n\\n\\nWomen's World Cup\\n\\n\\n\\n\\n\\n\\n\\nNHL Free Agency\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nNBA Trade Machine\\n\\n\\n\\n\\n\\n\\n\\nThe Basketball Tournament\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCopyright: © ESPN Enterprises, Inc. All rights reserved.\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
       " Document(page_content='Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University\\n\\n\\n\\n\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSkip to main content\\n\\nlink\\n\\n\\n\\n\\n\\nHome\\nPeople\\nMinor\\n\\nMinor Requirements\\nCourse Descriptions\\nCourse Schedule\\nHow to Declare the Minor\\nChoosing a Minor\\n\\n\\nResearch and Immersion\\n\\nResearch and Immersion Overview\\nDS 3850 Research in Data Science\\nDSI Summer Research Program\\nData Science for Social Good\\nResearch Immersion in Data Science\\nDSI Internship\\n\\n\\nFAQ\\nNews\\nForms\\nContact and Email List\\nData Science Institute\\n \\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\tUndergraduate Data Science \\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFrequently Asked Questions\\nDeclaring the Minor\\n\\n\\n\\nHow do I declare the Data Science Minor?Use the forms and follow the procedures for your home college. See How to Declare the Data Science Minor.\\n\\n\\nWhen should I declare the Data Science Minor?While minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?First, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved.\\n\\n\\nI am a first-year A&S student. Can I really declare the Data Science Minor now?Yes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?Juniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options.\\n\\n\\nI am a rising senior or current senior and cannot register for DS 1000. Why?Rising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place.\\n\\n\\nCollege-Specific Information\\n\\n\\n\\nWhat college is the home of the Data Science Minor?The Data Science Minor is a trans-institutional minor, shared by A&S, Blair, Engineering, and Peabody.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I am an A&S student. Do DS courses count as A&S courses?All courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly.\\n\\n\\nWhat are the unique credit hour rules for the Data Science Minor?Students electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements.\\n\\n\\nInfo About the Courses\\n\\n\\n\\nDS 1000Thank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='What computer programming course should I take?See What Programming Course To Take? In general, students interested in data science and scientific computing (not in computer science per se) should learn Python (and R).\\n\\n\\nHow do I find courses approved for the data science minor on YES?On YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.)\\n\\n\\nCan other courses, besides those listed, count towards the Data Science Minor?New courses, special topics courses, or graduate-level courses that seem related to data science could count as electives. Contact the Director of Undergraduate Data Science to request consideration.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='Why doesn’t CS 1104 count towards the Data Science Minor?It does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for\\xa0the two courses. See also What Programming Course To Take?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?After taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this.\\n\\n\\nWhat is the difference between CS 1100 and DS 1100?Nothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I have taken CS 1101. What computer programming course should I take next?You have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take?\\n\\n\\nECON 3750 and MATH 3670 are listed both as satisfying the core machine learning requirement and as electives. If I take one, will it double-count for both requirements?No. They are listed under both because a student who takes one of the other machine learning\\xa0courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='Can I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?Yes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective.\\nCS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors.\\n\\n\\nWhy doesn’t MATH 2820 count towards the Data Science Minor?It does, as a prerequisite to MATH 2821, which counts towards the minor. The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the\\xa0two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.\\n\\n\\nResearch and Immersion Information\\n\\n\\n\\nCan I do research for course credit?Yes, you can do research for course credit (including DS 3850). More information can be found here: https://www.vanderbilt.edu/undergrad-datascience/ds-3850-research-in-data-science/', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='I am interested in the Undergraduate Data Science Immersion Program. How can I participate.Some competitive summer immersion programs include DSI-SPR and Data Science for Social Good (DSSG). More information can be found on the following websites.\\n\\nhttps://www.vanderbilt.edu/datascience/academics/undergraduate/summer-research-program/\\nhttps://www.vanderbilt.edu/datascience/data-science-for-social-good/\\n\\nTo get involved in data-science-oriented research with a faculty member, you will need to reach out to the faculty member. Pointers can be found here: https://www.vanderbilt.edu/undergrad-datascience/research-and-immersion-overview/. Having that research count towards the immersion requirement will be between your faculty mentor and your faculty immersion coordinator.\\nAdditional information about research opportunities will be posted on the website in the future.\\n\\xa0\\n\\n\\nContact\\n\\n\\n\\nHow do I ask a question about the Data Science?If you have questions about the Data Science Minor or Immersion opportunities in data science, please email us: [email protected]\\n\\n\\nTo whom can I petition if the Director denies my request?The Governing Board of the Data Science Minor acts as the college-level oversight body for this trans-institutional minor and would be the appropriate next step for petitions related to the minor.\\n\\n\\n\\n\\n\\n\\n\\nData Science News\\n\\n\\n\\n Opportunities for Capstone Projects and Research Experience\\n\\n\\n\\n Attention Graduate Students! We’re Hiring!', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
       " Document(page_content='Data Science News\\n\\n\\n\\n Opportunities for Capstone Projects and Research Experience\\n\\n\\n\\n Attention Graduate Students! We’re Hiring!\\n\\n\\n\\n Vanderbilt student-athlete drives sports performance through data analysis\\n\\n\\n\\n New Course: DS 3891 Special Topics: Intro to Generative AI\\n\\n\\n\\n Now Accepting Applications: DS Minor Teaching Fellowship for graduate students\\n\\n\\n\\n Join Our Team: Student Worker Positions Available for Fall 2023 Semester!\\n\\n\\n\\n\\n\\nVIEW MORE EVENTS >\\n\\n\\n\\n\\nYour Vanderbilt\\n\\nAlumni\\nCurrent Students\\nFaculty & Staff\\nInternational Students\\nMedia\\nParents & Family\\nProspective Students\\nResearchers\\nSports Fans\\nVisitors & Neighbors\\n\\n\\n\\n\\n \\n\\n\\n\\nQuick Links\\n\\nPeopleFinder\\nLibraries\\nNews\\nCalendar\\nMaps\\nA-Z\\n\\n\\n\\n\\n\\n\\n\\n\\n\\\\n Vanderbilt University · All rights reserved. Site Development: Digital Strategies (Division of Communications)\\nVanderbilt University is committed to principles of equal opportunity and affirmative action. Accessibility information. Vanderbilt®, Vanderbilt University®, V Oak Leaf Design®, Star V Design® and Anchor Down® are trademarks of The Vanderbilt University', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions |   Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'})]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the results\n",
    "res_web = website_to_text_web(demo_urls)\n",
    "\n",
    "res_web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#unit testbed\n",
    "test_converters_inputs(website_to_text_web, demo_urls)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Something interesting that we notice here is the proliferation of new lines that aren't for the best.\n",
    "\n",
    "#### Langchain: UnstructuredURLLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def website_to_text_unstructured(web_urls, chunk_size = 1500, chunk_overlap=100):\n",
    "\n",
    "    # Make sure it's a list\n",
    "    if not isinstance(web_urls, list):\n",
    "        web_urls = [web_urls]\n",
    "  \n",
    "    # Url can be a single string or list\n",
    "    website_loader = UnstructuredURLLoader(web_urls)\n",
    "    website_raw = website_loader.load()\n",
    "\n",
    "    website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)\n",
    "  \n",
    "    # Return individual docs or list\n",
    "    return website_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content=\"Menu\\n\\nESPN\\n\\nSearch\\n\\n\\n\\nscores\\n\\nNFL\\n\\nMLB\\n\\nNBA\\n\\nNHL\\n\\nSoccer\\n\\nGolf\\n\\n…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFL\\n\\nMore ESPN\\n\\nFantasy\\n\\nListen\\n\\nWatch\\n\\nESPN+\\n\\nSUBSCRIBE NOW\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\nPGA TOUR LIVE\\n\\nLittle League Baseball: Regionals\\n\\nMLB: Select Games\\n\\nCrossFit Games\\n\\nSlamBall\\n\\nThe Ultimate Fighter: Season 31\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\nQuick Links\\n\\nWomen's World Cup\\n\\nNHL Free Agency\\n\\nNBA Free Agency Buzz\\n\\nNBA Trade Machine\\n\\nThe Basketball Tournament\\n\\nFantasy Football: Sign Up\\n\\nHow To Watch PGA TOUR\\n\\nFavorites\\n\\nManage Favorites\\n\\nCustomize ESPN\\n\\nESPN Sites\\n\\nESPN Deportes\\n\\nAndscape\\n\\nespnW\\n\\nESPNFC\\n\\nX Games\\n\\nSEC Network\\n\\nESPN Apps\\n\\nESPN\\n\\nESPN Fantasy\\n\\nFollow ESPN\\n\\nFacebook\\n\\nX/Twitter\\n\\nInstagram\\n\\nSnapchat\\n\\nTikTok\\n\\nYouTube\\n\\nHow can your team win the national title? Connelly breaks down what needs to go right for all 17 contendersThe fewer things that have to go right to win a title, the better a team's chances of taking the crown. Here's what has to fall each contender's way.7hBill ConnellyDale Zanine/USA TODAY Sports\\n\\nPosition U 2023: Is USC on the verge of taking over QBU from Oklahoma?Which schools produce the most talent at each position?1dDavid Hale\\n\\nConnelly's conference previews: Intel on all 133 FBS teams\\n\\nTOP HEADLINES\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\", metadata={'source': 'https://www.espn.com/'}),\n",
       " Document(page_content=\"TOP HEADLINES\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\\n\\nHornets have new owners after MJ sale finalized\\n\\nMiami coach expects rough treatment of Messi\\n\\nDrexel basketball player found dead in apartment\\n\\nGermany exits WWC after draw with South Korea\\n\\nBrady takes minority stake in English soccer team\\n\\nDeep dish: Cubs' output at plate best since 1897\\n\\nRe-drafting 2018 NFL class 5 years later\\n\\nWHAT HAPPENED IN INDY?\\n\\nInside the shocking feud between Jonathan Taylor and the ColtsHe was the NFL's leading rusher two seasons ago and wanted an extension with the Colts, but now he wants out. How things got so bad for Taylor and Indianapolis.8hStephen HolderZach Bolinger/Icon Sportswire\\n\\n'THE BEST IN THE WORLD RIGHT NOW'\\n\\nWhy Stephen A. is convinced Tyreek Hill is the NFL's top WR\\n\\n2h\\n\\n2:57\\n\\nWYNDHAM CHAMPIONSHIP\\n\\nCONTINUES THROUGH SUNDAY\\n\\nShane Lowry fluffs shot, drains birdie chip immediately after\\n\\n4h\\n\\n0:35\\n\\nCountdown to FedEx Cup Playoffs, AIG Open and the Ryder Cup\\n\\nDiana Taurasi, 10,000 points and the shot that made WNBA scoring history1d\\n\\nMLB SCOREBOARDTHURSDAY'S GAMES\\n\\nSee All\\n\\nTrivia: Can you guess the right player?\\n\\nHERE COMES HELP\\n\\nBring on the reinforcements! 10 returning players as good as a trade deadline blockbusterInjured stars expected to come off the IL soon -- or have already -- could rock MLB's playoff races.7hAlden GonzalezJay Biggerstaff-USA TODAY Sports\\n\\n'CLEARLY THE ACC IS STRUGGLING'\", metadata={'source': 'https://www.espn.com/'}),\n",
       " Document(page_content=\"'CLEARLY THE ACC IS STRUGGLING'\\n\\nFinebaum: FSU is better off leaving the ACC\\n\\n5h\\n\\n1:04\\n\\nThamel's realignment buzz: Latest on Pac-12, Big 12 and ACC\\n\\nAN AGGRESSIVE STRATEGY\\n\\nHow the Big 12 landed Colorado and shook up college footballThe Big 12 learned lessons two years ago after getting burned by Texas and Oklahoma. It resulted in a more aggressive strategy that could dramatically change the sport.2dHeather DinichRaymond Carlin/Icon Sportswire\\n\\nTop Headlines\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\\n\\nHornets have new owners after MJ sale finalized\\n\\nMiami coach expects rough treatment of Messi\\n\\nDrexel basketball player found dead in apartment\\n\\nGermany exits WWC after draw with South Korea\\n\\nBrady takes minority stake in English soccer team\\n\\nDeep dish: Cubs' output at plate best since 1897\\n\\nRe-drafting 2018 NFL class 5 years later\\n\\nFavorites\\n\\nFantasy\\n\\nManage Favorites\\n\\nFantasy Home\\n\\nCustomize ESPN\\n\\nICYMI\\n\\n0:54\\n\\nSerena Williams, Alexis Ohanian use drones to reveal gender of 2nd childSerena Williams and her husband Alexis Ohanian find out the gender of their second child in a spectacular display of drones.\\n\\nBest of ESPN+\\n\\nTodd Kirkland/Getty Images\\n\\nMLB 2023 trade deadline: Winners, losers and in-betweenersThe 2023 trade deadline is over! Who crushed it, and who left much to be desired? We weigh in on all 30 clubs.\\n\\nAP Photo/Matt York\", metadata={'source': 'https://www.espn.com/'}),\n",
       " Document(page_content='AP Photo/Matt York\\n\\nLowe: Why Bradley Beal could unlock KD, Book and the most dangerous version of the Phoenix Suns yetWith Kevin Durant, Devin Booker and Beal, Phoenix is already an inner-circle title contender. But if the Suns continue a Beal experiment the Wizards ran last season? Good luck.\\n\\nCliff Welch/Icon Sportswire\\n\\nPredicting 10 NFL starting quarterback battles: Who is QB1?We talked to people around the NFL and projected the QB1 for 10 unsettled situations, including a wide-open race in Tampa Bay.\\n\\nTrending Now\\n\\nAP Photo/Julio Cortez\\n\\n\\'Revis Island\\' resonates long after Hall of Famer\\'s retirementDarrelle Revis made his name as a dominant corner but might be best known for his \"island\" moniker players still adopt today.\\n\\nIllustration by ESPN\\n\\nThe wild life of Gardner MinshewFour colleges, three NFL teams, two Manias and the hug that broke the internet. It\\'s been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.\\n\\nIllustration by ESPN\\n\\nBest 2023 Women\\'s World Cup players: Morgan, Caicedo, moreESPN\\'s expert panel selected the top 25 players of the Women\\'s World Cup to keep an eye on, from Sophia Smith to Sam Kerr and more.\\n\\nHow to Watch on ESPN+\\n\\n(AP Photo/Koji Sasahara, File)\\n\\nHow to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here\\'s everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+.\\n\\nSign up to play the #1 Fantasy game!', metadata={'source': 'https://www.espn.com/'}),\n",
       " Document(page_content=\"Sign up to play the #1 Fantasy game!\\n\\nCreate A League\\n\\nJoin Public League\\n\\nReactivate\\n\\nMock Draft Now\\n\\nSign up for FREE!\\n\\nCreate A League\\n\\nJoin a Public League\\n\\nReactivate a League\\n\\nPractice With a Mock Draft\\n\\nSign up for FREE!\\n\\nCreate A League\\n\\nJoin a Public League\\n\\nReactivate a League\\n\\nPractice with a Mock Draft\\n\\nESPN+\\n\\nWatch Now\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\nPGA TOUR LIVE\\n\\nLittle League Baseball: Regionals\\n\\nMLB: Select Games\\n\\nCrossFit Games\\n\\nSlamBall\\n\\nThe Ultimate Fighter: Season 31\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\nQuick Links\\n\\nWomen's World Cup\\n\\nNHL Free Agency\\n\\nNBA Free Agency Buzz\\n\\nNBA Trade Machine\\n\\nThe Basketball Tournament\\n\\nFantasy Football: Sign Up\\n\\nHow To Watch PGA TOUR\\n\\nESPN Sites\\n\\nESPN Deportes\\n\\nAndscape\\n\\nespnW\\n\\nESPNFC\\n\\nX Games\\n\\nSEC Network\\n\\nESPN Apps\\n\\nESPN\\n\\nESPN Fantasy\\n\\nFollow ESPN\\n\\nFacebook\\n\\nX/Twitter\\n\\nInstagram\\n\\nSnapchat\\n\\nTikTok\\n\\nYouTube\\n\\nTerms of Use\\n\\nPrivacy Policy\\n\\nYour US State Privacy Rights\\n\\nChildren's Online Privacy Policy\\n\\nInterest-Based Ads\\n\\nAbout Nielsen Measurement\\n\\nDo Not Sell or Share My Personal Information\\n\\nContact Us\\n\\nDisney Ad Sales Site\\n\\nWork for ESPN\", metadata={'source': 'https://www.espn.com/'}),\n",
       " Document(page_content='Skip to main content\\n\\nlink\\n\\nHome\\n\\nPeople\\n\\nMinor\\n\\n\\tMinor Requirements\\n\\tCourse Descriptions\\n\\tCourse Schedule\\n\\tHow to Declare the Minor\\n\\tChoosing a Minor\\n\\nResearch and Immersion\\n\\n\\tResearch and Immersion Overview\\n\\tDS 3850 Research in Data Science\\n\\tDSI Summer Research Program\\n\\tData Science for Social Good\\n\\tResearch Immersion in Data Science\\n\\tDSI Internship\\n\\nFAQ\\n\\nNews\\n\\nForms\\n\\nContact and Email List\\n\\nData Science Institute\\n\\nUndergraduate Data Science\\n\\nFrequently Asked Questions\\n\\nDeclaring the Minor\\n\\nHow do I declare the Data Science Minor?\\n\\nUse the forms and follow the procedures for your home college. See How to Declare the Data Science Minor.\\n\\nWhen should I declare the Data Science Minor?\\n\\nWhile minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late.\\n\\nI declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='I declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?\\n\\nFirst, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved.\\n\\nI am a first-year A&S student. Can I really declare the Data Science Minor now?\\n\\nYes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor.\\n\\nI am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='I am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?\\n\\nJuniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options.\\n\\nI am a rising senior or current senior and cannot register for DS 1000. Why?\\n\\nRising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place.\\n\\nCollege-Specific Information\\n\\nWhat college is the home of the Data Science Minor?\\n\\nThe Data Science Minor is a trans-institutional minor, shared by A&S, Blair, Engineering, and Peabody.\\n\\nI am an A&S student. Do DS courses count as A&S courses?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='I am an A&S student. Do DS courses count as A&S courses?\\n\\nAll courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly.\\n\\nWhat are the unique credit hour rules for the Data Science Minor?\\n\\nStudents electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements.\\n\\nInfo About the Courses\\n\\nDS 1000\\n\\nThank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester.\\n\\nWhat computer programming course should I take?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='What computer programming course should I take?\\n\\nSee What Programming Course To Take? In general, students interested in data science and scientific computing (not in computer science per se) should learn Python (and R).\\n\\nHow do I find courses approved for the data science minor on YES?\\n\\nOn YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.)\\n\\nCan other courses, besides those listed, count towards the Data Science Minor?\\n\\nNew courses, special topics courses, or graduate-level courses that seem related to data science could count as electives. Contact the Director of Undergraduate Data Science to request consideration.\\n\\nWhy doesn’t CS 1104 count towards the Data Science Minor?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='Why doesn’t CS 1104 count towards the Data Science Minor?\\n\\nIt does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for\\xa0the two courses. See also What Programming Course To Take?\\n\\nI see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='I see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?\\n\\nAfter taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this.\\n\\nWhat is the difference between CS 1100 and DS 1100?\\n\\nNothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='Nothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.\\n\\nI have taken CS 1101. What computer programming course should I take next?\\n\\nYou have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take?\\n\\nECON 3750 and MATH 3670 are listed both as satisfying the core machine learning requirement and as electives. If I take one, will it double-count for both requirements?\\n\\nNo. They are listed under both because a student who takes one of the other machine learning\\xa0courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements.\\n\\nCan I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='Can I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?\\n\\nYes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective.\\n\\nCS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors.\\n\\nWhy doesn’t MATH 2820 count towards the Data Science Minor?\\n\\nIt does, as a prerequisite to MATH 2821, which counts towards the minor. The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the\\xa0two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.\\n\\nResearch and Immersion Information\\n\\nCan I do research for course credit?\\n\\nYes, you can do research for course credit (including DS 3850). More information can be found here: https://www.vanderbilt.edu/undergrad-datascience/ds-3850-research-in-data-science/\\n\\nI am interested in the Undergraduate Data Science Immersion Program. How can I participate.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='I am interested in the Undergraduate Data Science Immersion Program. How can I participate.\\n\\nSome competitive summer immersion programs include DSI-SPR and Data Science for Social Good (DSSG). More information can be found on the following websites.\\n\\nhttps://www.vanderbilt.edu/datascience/academics/undergraduate/summer-research-program/\\n\\nhttps://www.vanderbilt.edu/datascience/data-science-for-social-good/\\n\\nTo get involved in data-science-oriented research with a faculty member, you will need to reach out to the faculty member. Pointers can be found here: https://www.vanderbilt.edu/undergrad-datascience/research-and-immersion-overview/. Having that research count towards the immersion requirement will be between your faculty mentor and your faculty immersion coordinator.\\n\\nAdditional information about research opportunities will be posted on the website in the future.\\n\\nContact\\n\\nHow do I ask a question about the Data Science?\\n\\nIf you have questions about the Data Science Minor or Immersion opportunities in data science, please email us: [email protected]\\n\\nTo whom can I petition if the Director denies my request?\\n\\nThe Governing Board of the Data Science Minor acts as the college-level oversight body for this trans-institutional minor and would be the appropriate next step for petitions related to the minor.\\n\\nData Science News\\n\\nOpportunities for Capstone Projects and Research Experience\\n\\nAttention Graduate Students! We’re Hiring!', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
       " Document(page_content='Data Science News\\n\\nOpportunities for Capstone Projects and Research Experience\\n\\nAttention Graduate Students! We’re Hiring!\\n\\nVanderbilt student-athlete drives sports performance through data analysis\\n\\nNew Course: DS 3891 Special Topics: Intro to Generative AI\\n\\nNow Accepting Applications: DS Minor Teaching Fellowship for graduate students\\n\\nJoin Our Team: Student Worker Positions Available for Fall 2023 Semester!\\n\\nVIEW MORE EVENTS >\\n\\nYour Vanderbilt\\n\\nAlumni\\n\\nCurrent Students\\n\\nFaculty & Staff\\n\\nInternational Students\\n\\nMedia\\n\\nParents & Family\\n\\nProspective Students\\n\\nResearchers\\n\\nSports Fans\\n\\nVisitors & Neighbors\\n\\nQuick Links\\n\\nPeopleFinder\\n\\nLibraries\\n\\nNews\\n\\nCalendar\\n\\nMaps\\n\\nA-Z\\n\\\\n                    Site Development: Digital Strategies (Division of Communications)\\n                    Vanderbilt University is committed to principles of equal opportunity and affirmative action. Accessibility information. Vanderbilt®, Vanderbilt University®, V Oak Leaf Design®, Star V Design® and Anchor Down® are trademarks of The Vanderbilt University', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'})]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the results\n",
    "res_unstructured = website_to_text_unstructured(demo_urls)\n",
    "res_unstructured"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#unit testb\n",
    "test_converters_inputs(website_to_text_unstructured, demo_urls)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also see here that there's something to be said about the unstructured approach which appears to be more conservative in the number of newline characters but still appears to preserve content. However, the gain is not overly significant.\n",
    "\n",
    "#### Trafilatura Parsing\n",
    "\n",
    "[Tralifatura](https://trafilatura.readthedocs.io/en/latest/) is a Python and command-line utility which attempts to extracts the most relevant information from a given website.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def website_trafilatura(url):\n",
    "  downloaded = trafilatura.fetch_url(url)\n",
    "  return trafilatura.extract(downloaded)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of characters in example: 1565 \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPHI\\nMIA\\n||\\n56-49\\n57-49\\n||\\n||\\n||\\n||\\n||\\n6:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nMIL\\nWSH\\n||\\n57-49\\n44-62\\n||\\n||\\n||\\n||\\n||\\n7:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nTB\\nNYY\\n||\\n64-44\\n55-50\\n||\\n||\\n||\\n||\\n||\\n7:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nBAL\\nTOR\\n||\\n64-41\\n59-47\\n||\\n||\\n||\\n||\\n||\\n7:07 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nLAA\\nATL\\n||\\n55-51\\n67-36\\n||\\n||\\n||\\n||\\n||\\n7:20 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCIN\\nCHC\\n||\\n58-49\\n53-52\\n||\\n||\\n||\\n||\\n||\\n8:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCLE\\nHOU\\n||\\n53-53\\n59-47\\n||\\n||\\n||\\n||\\n||\\n8:10 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nSD\\nCOL\\n||\\n52-54\\n41-64\\n||\\n||\\n||\\n||\\n||\\n8:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nBOS\\nSEA\\n||\\n56-49\\n54-51\\n||\\n||\\n||\\n||\\n||\\n9:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nARI\\nSF\\n||\\n56-50\\n58-48\\n||\\n||\\n||\\n||\\n||\\n9:45 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nJPN\\nESP\\n||\\n4\\n0\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCRC\\nZAM\\n||\\n1\\n3\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCAN\\nAUS\\n||\\n0\\n4\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nIRL\\nNGA\\n||\\n0\\n0\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPOR\\nUSA\\n||\\nWLDDL\\nDWWWW\\n||\\n||\\n||\\n||\\n||\\n3:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nVIE\\nNED\\n||\\nLLLLL\\nDWWWL\\n||\\n||\\n||\\n||\\n||\\n3:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCHN\\nENG\\n||\\nWLDWW\\nWWDLW\\n||\\n||\\n||\\n||\\n||\\n7:00 AM 
ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nHAI\\nDEN\\n||\\nLLLWL\\nLWLWW\\n||\\n||\\n||\\n||\\n||\\n7:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nAME\\nCLB\\n||\\nWWLLW\\nWLDDW\\n||\\n||\\n||\\n||\\n||\\n8:00 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPUE\\nCHI\\n||\\nLLLDL\\nWWWWL\\n||\\n||\\n||\\n||\\n||\\n8:00 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nTOL\\nCOL\\n||\\nWLWDW\\nLDDWL\\n||\\n||\\n||\\n||\\n||\\n9:30 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nGDL\\nSKC\\n||\\nLWWWL\\nLLDDW\\n||\\n||\\n||\\n||\\n||\\n10:00 PM ET\\n||\\n|\\n|\\n|'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trafilatura_text = website_trafilatura(demo_urls[0])\n",
    "print('Total number of characters in example:', len(trafilatura_text), '\\n')\n",
    "trafilatura_text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This output is SUBSTANTIALLY shorter with a length of 1565 characters. However, the problem is that the main article on the page actually isn't captured at all.\n",
    "\n",
    "#### jusText\n",
    "\n",
    "[jusText](https://pypi.org/project/jusText/) is another Python library for extracting content from a website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def website_justext(url):\n",
    "  response = requests.get(url)\n",
    "  paragraphs = justext.justext(response.content, justext.get_stoplist(\"English\"))\n",
    "  content = [paragraph.text for paragraph in paragraphs \\\n",
    "            if not paragraph.is_boilerplate]\n",
    "  text = \" \".join(content)\n",
    "  return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "''"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Ensure behavior\n",
    "justext_text = website_justext(demo_urls[0])\n",
    "justext_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Declaring the Minor While minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late. First, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved. Yes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor. Juniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options. Rising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. 
DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place. All courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly. Students electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements. Info About the Courses Thank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester. On YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.) It does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. 
While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for the two courses. See also What Programming Course To Take? After taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this. You have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take? No. 
They are listed under both because a student who takes one of the other machine learning courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements. Yes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective. CS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors. It does, as a prerequisite to MATH 2821, which counts towards the minor. The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.'"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Try a different URL to see if behavior improves\n",
    "justext_text = website_justext(demo_urls[1])\n",
    "justext_text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we see that we may prefer to stick with the langchain implementations. The first jusText example returned an empty string, although previous work demonstrates that on a different day, it worked well (note that the ESPN's content was different). With the second URL, parts of the website, particularly the headers, is actually missing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Document Segments\n",
    "Now, the precursor to creating vector stores/embeddings is to create document segments. Since we have a variety of sources, we will keep this in mind as we develop the following function.\n",
    "\n",
    ":::{.callout-warning}\n",
    "Note that the `get_document_segments` currently is meant to be used in one single pass with `context_info` being all of a single file type. [Issue #150](https://github.com/vanderbilt-data-science/lo-achievement/issues/150) is meant to expand this functionality so that if many files are uploaded, the software will be able to handle this.\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def get_document_segments(context_info, data_type, chunk_size = 1500, chunk_overlap=100):\n",
    "\n",
    "    load_fcn = None\n",
    "    addtnl_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}\n",
    "\n",
    "    # Define function use to do the loading\n",
    "    if data_type == 'text':\n",
    "        load_fcn = rawtext_to_doc_split\n",
    "    elif data_type == 'web_page':\n",
    "        load_fcn = website_to_text_unstructured\n",
    "    elif data_type == 'youtube_video':\n",
    "        load_fcn = youtube_to_text\n",
    "    else:\n",
    "        load_fcn = files_to_text\n",
    "    \n",
    "    # Get the document segments\n",
    "    doc_segments = load_fcn(context_info, **addtnl_params)\n",
    "\n",
    "    return doc_segments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Vector Stores from Document Segments\n",
    "The last step here will be in the creation of vector stores from the provided document segments. We will allow for the usage of either Chroma or DeepLake and enforce OpenAIEmbeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def create_local_vector_store(document_segments, **retriever_kwargs):\n",
    "    embeddings = OpenAIEmbeddings()\n",
    "    db = Chroma.from_documents(document_segments, embeddings)\n",
    "    retriever = db.as_retriever(**retriever_kwargs)\n",
    "    \n",
    "    return db, retriever"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unit test of vector store and segment creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from getpass import getpass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "openai_api_key = getpass()\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
    "\n",
    "llm = ChatOpenAI(model_name = 'gpt-3.5-turbo-16k')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_files = ['../roadnottaken.txt', '../2302.11382.pdf']\n",
    "\n",
    "#get vector store\n",
    "segs = get_document_segments(test_files, data_type='other', chunk_size = 1000, chunk_overlap = 100)\n",
    "chroma_db, vs_retriever = create_local_vector_store(segs)\n",
    "\n",
    "#create test retrievalqa\n",
    "qa_chain = RetrievalQA.from_chain_type(llm=openai_llm, chain_type=\"stuff\", retriever=vs_retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the undergrowth;\\r\\rThen took the other, as just as fair,\\rAnd having perhaps the better claim,\\rBecause it was grassy and wanted wear;\\rThough as for that the passing there\\rHad worn them really about the same,\\r\\rAnd both that morning equally lay\\rIn leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing how way leads on to way,\\rI doubted if I should ever come back. I shall be telling this with a sigh\\rSomewhere ages and ages hence:\\rTwo roads diverged in a wood, and IэI took the one less traveled by,\\rAnd that has made all the difference.', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
       " Document(page_content='any unnecessary steps,” is useful in flagging inaccuracies in the user’s original request so that the final recipe is efficient.', metadata={'source': '../2302.11382.pdf', 'start_index': 92662}),\n",
       " Document(page_content='The third statement provides an optional way for the user to stop the output generation process. This step is not always needed, but can be useful in situations where there may be the potential for ambiguity regarding whether or not the user- provided input between inputs is meant as a refinement for the next generation or a command to stop. For example, an explicit stop phrase could be created if the user was generating data related to road signs, where the user might want to enter a refinement of the generation like “stop” to indicate that a stop sign should be added to the output.', metadata={'source': '../2302.11382.pdf', 'start_index': 72043}),\n",
       " Document(page_content='“When I ask you a question, generate three addi- tional questions that would help you give a more accurate answer. Assume that I know little about the topic that we are discussing and please define any terms that are not general knowledge. When I have answered the three questions, combine the answers to produce the final answers to my original question.”\\n\\nOne point of variation in this pattern is where the facts are output. Given that the facts may be terms that the user is not familiar with, it is preferable if the list of facts comes after the output. This after-output presentation ordering allows the user to read and understand the statements before seeing what statements should be checked. The user may also determine additional facts prior to realizing the fact list at the end should be checked.', metadata={'source': '../2302.11382.pdf', 'start_index': 57473})]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check for functionality\n",
    "chroma_db.similarity_search('The street was forked and I did not know which way to go')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#check qa chain for functionality\n",
    "ans = qa_chain({'question':'What is the best prompt to use when I want the model to take on a certain attitude of a person?'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'question': 'What is the best prompt to use when I want the model to take on a certain attitude of a person?',\n",
       " 'answer': 'The best prompt to use when you want the model to take on a certain attitude of a person is to provide a persona for the model to embody. This can be expressed as a job description, title, fictional character, historical figure, or any other attributes associated with a well-known type of person. The prompt should specify the outputs that this persona would create. Additionally, personas can also represent inanimate or non-human entities, such as a Linux terminal or a database. In this case, the prompt should specify how the inputs should be delivered to the entity and what outputs the entity should produce. It is also possible to provide a better version of the question and prompt the model to ask if the user would like to use the better version instead.\\n',\n",
       " 'sources': '../2302.11382.pdf',\n",
       " 'source_documents': [Document(page_content='4) Example Implementation: A sample prompt for a flipped\\n\\ninteraction is shown below:\\n\\n“From now on, I would like you to ask me questions to deploy a Python application to AWS. When you have enough information to deploy the application, create a Python script to automate the deployment.”\\n\\n2) Motivation: Users may not know what types of outputs or details are important for an LLM to focus on to achieve a given task. They may know, however, the role or type of person that they would normally ask to get help with these things. The Persona pattern enables the users to express what they need help with without knowing the exact details of the outputs they need.', metadata={'source': '../2302.11382.pdf', 'start_index': 36397}),\n",
       "  Document(page_content='ments:\\n\\nContextual Statements Act as persona X Provide outputs that persona X would create\\n\\nThe first statement conveys the idea that the LLM needs to act as a specific persona and provide outputs that such a persona would. This persona can be expressed in a number of ways, ranging from a job description, title, fictional char- acter, historical figure, etc. The persona should elicit a set of attributes associated with a well-known job title, type of person, etc.2\\n\\n5) Consequences: One consideration when designing the prompt is how much to dictate to the LLM regarding what information to collect prior to termination. In the example above, the flipped interaction is open-ended and can vary sig- nificantly in the final generated artifact. This open-endedness makes the prompt generic and reusable, but may potentially ask additional questions that could be skipped if more context is given.', metadata={'source': '../2302.11382.pdf', 'start_index': 37872}),\n",
       "  Document(page_content='In this example, the LLM is instructed to provide outputs that a ”security reviewer” would. The prompt further sets the stage that code is going to be evaluated. Finally, the user refines the persona by scoping the persona further to outputs regarding the code.\\n\\nPersonas can also represent inanimate or non-human en- tities, such as a Linux terminal, a database, or an animal’s perspective. When using this pattern to represent these entities, it can be useful to also specify how you want the inputs delivered to the entity, such as “assume my input is what the owner is saying to the dog and your output is the sounds the dog is making”. An example prompt for a non-human entity that uses a “pretend to be” wording is shown below:\\n\\n“You are going to pretend to be a Linux terminal for a computer that has been compromised by an attacker. When I type in a command, you are going the Linux to output terminal would produce.”\\n\\nthe corresponding text\\n\\nthat', metadata={'source': '../2302.11382.pdf', 'start_index': 41330}),\n",
       "  Document(page_content='the corresponding text\\n\\nthat\\n\\nThis prompt is designed to simulate a computer that has been compromised by an attacker and is being controlled through a Linux terminal. The prompt specifies that the user will input commands into the terminal, and in response, the simulated terminal will output the corresponding text that would be produced by a real Linux terminal. This prompt is more prescriptive in the persona and asks the LLM to, not only be a Linux terminal, but to further act as a computer that has been compromised by an attacker.\\n\\n3) Structure and Key Ideas: Fundamental contextual state-\\n\\nments:\\n\\nContextual Statements Within scope X, suggest a better version of the question to use instead (Optional) prompt me if I would like to use the better version instead', metadata={'source': '../2302.11382.pdf', 'start_index': 42256})]}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the full result: answer, sources, and source documents\n",
    "ans"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both checks work: the similarity search returns documents, and the QA chain returns an answer along with its sources. Let's integrate this functionality into the code base."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}