
# Exploring Alternative Media Document Sources
Test how YouTube videos and websites can be loaded as document sources for a vector store.

- YouTube: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/youtube_audio
- Websites:
 - https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/
 - https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_base
 - Extracting relevant information from website: https://www.oncrawl.com/technical-seo/extract-relevant-text-content-from-html-page/



## Libraries

In [None]:
# install libraries here
# -q flag for "quiet" install
!pip install -q langchain
!pip install -q openai
!pip install -q unstructured
!pip install -q tiktoken
!pip install typing_extensions==4.5.0

In [None]:
%pip install -q trafilatura
%pip install -q justext

In [None]:
%pip install yt_dlp
%pip install pydub

In [2]:
# import libraries here
import os
import time
import pprint
from getpass import getpass

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

from langchain.document_loaders.unstructured import UnstructuredFileLoader

In [5]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [40]:
from langchain.document_loaders import WebBaseLoader
import trafilatura
import requests
import justext

In [3]:
# Export requirements.txt (if needed)
%pip freeze > requirements.txt

## API Keys

Use these cells to load the API keys this notebook requires. The code cell below uses the standard-library `getpass` module to prompt for the OpenAI key without echoing it.

In [4]:
openai_api_key = getpass()
os.environ["OPENAI_API_KEY"] = openai_api_key

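When the notebook is re-run, the key is often already present in the environment. A minimal sketch that prompts only when it is missing is shown below; `load_api_key` and its `prompt_fn` parameter are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in env_var, prompting only when it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed characters; the prompt text is illustrative
        key = prompt_fn(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` in place of the bare `getpass()` call avoids retyping the key on every kernel restart where the variable is already exported.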


In [35]:
def splitter(text):
    """Split text into chunks of at most 1500 characters that overlap by 150."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
    return text_splitter.split_text(text)
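To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch using fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `naive_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def naive_chunks(text, chunk_size=1500, chunk_overlap=150):
    """Cut text into fixed-width windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the 150-character overlap plays at the notebook's scale.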

## YouTube

In [36]:
def youtube_transcript(urls, save_dir="content"):
    """Download the audio of each video and transcribe it with OpenAI Whisper.

    urls: list of YouTube video URLs
    save_dir: directory where the downloaded audio files are saved
    """
    youtube_loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
    youtube_docs = youtube_loader.load()
    # Join the transcript pieces into a single string
    combined_docs = [doc.page_content for doc in youtube_docs]
    return " ".join(combined_docs)

In [None]:
# Two Karpathy lecture videos
urls = ["https://youtu.be/kCc8FmEb1nY", "https://youtu.be/VMj-3S1tku0"]
youtube_text = youtube_transcript(urls)
youtube_text

## Websites

In [25]:
url = "https://www.espn.com/"

### WebBaseLoader

In [42]:
def website_webbase(url):
 website_loader = WebBaseLoader(url)
 website_data = website_loader.load()
 # Combine doc
 combined_docs = [doc.page_content for doc in website_data]
 text = " ".join(combined_docs)
 return text

In [43]:
webbase_text = website_webbase(url)
webbase_text

"\n\n\n\n\n\n\n\n\nESPN - Serving Sports Fans. Anytime. Anywhere.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n Skip to main content\n \n\n Skip to navigation\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<\n\n>\n\n\n\n\n\n\n\n\n\nMenuESPN\n\n\nSearch\n\n\n\nscores\n\n\n\nNFLNBANHLMLBSoccerTennis…NCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1GolfHorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\nSUBSCRIBE NOW\n\n\n\n\n\nThe Ultimate Fighter: Season 31\n\n\n\n\n\n\n\nWimbledon: Select Courts\n\n\n\n\n\n\n\nNBA Summer League: Select Games\n\n\n\n\n\n\n\nProjecting Messi's Performance In MLS\n\n\nQuick Links\n\n\n\n\nNBA Summer League\n\n\n\n\n\n\n\nNBA Free Agency Buzz\n\n\n\n\n\n\n\nNBA Trade Machine\n\n\n\n\n\n\n\n2023 MLB

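The `WebBaseLoader` output above is dominated by long runs of newlines. One way to tidy it before splitting is to drop empty lines and squeeze repeated spaces. This is a minimal sketch; `collapse_whitespace` is a hypothetical helper, not part of LangChain, and it does not remove navigation or menu text.

```python
import re


def collapse_whitespace(text):
    """Strip each line, drop empty lines, and squeeze runs of spaces or tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "\n\n\nESPN - Serving Sports Fans.\n\n \n  Skip to   main content\n \n"
print(collapse_whitespace(sample))
```

Applied as `collapse_whitespace(webbase_text)`, this shortens the string considerably but leaves the boilerplate itself in place, which is what the extraction libraries in the next sections try to address.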
### Trafilatura Parsing

[Trafilatura](https://trafilatura.readthedocs.io/en/latest/) is a Python package and command-line utility that attempts to extract the most relevant text from a given web page.

In [44]:
def website_trafilatura(url):
    # fetch_url returns None when the download fails
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)

In [45]:
trafilatura_text = website_trafilatura(url)
trafilatura_text

'|Sports|\n|scores||News|\n|© 2023 ESPN Internet Ventures. Terms of Use and Privacy Policy and Safety Information / Your California Privacy Rights are applicable to you. All rights reserved.|\n|\nMore From ESPN:\n|\nESPN en Español | Andscape | FiveThirtyEight | ESPN FC | ESPNCricinfo'

### jusText

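[jusText](https://pypi.org/project/jusText/) is another Python library for extracting content from a web page. It removes boilerplate such as navigation links, headers, and footers and keeps the paragraphs of main text.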

In [46]:
def website_justext(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    # Keep only the paragraphs jusText does not classify as boilerplate
    content = [paragraph.text for paragraph in paragraphs if not paragraph.is_boilerplate]
    return " ".join(content)

In [47]:
justext_text = website_justext(url)
justext_text

"Trending Now The Philadelphia 76ers started the 1972-73 season by losing 21 of 23 games. They'd finish with the worst record in NBA history, 9-73. This is the story of what the team learned about themselves through turmoil."