Liss, Alex (NYC-HUG) committed · Commit 20ecd50 · 1 Parent: 4c2914b
wip implementation of 1.2.3 Team Search
Browse files
- data/april_11_multimedia_data_collect/new_final_april 11/{neo4j_update → neo4j_game_update}/SCHEMA.md +0 -0
- data/april_11_multimedia_data_collect/new_final_april 11/{neo4j_update → neo4j_game_update}/update_game_nodes.py +0 -0
- docs/Phase 1/Task 1.2.3 Team Search Implementation.md +277 -0
- requirements.txt +2 -0
- tools/neo4j_article_uploader.py +145 -0
- tools/team_news_articles.csv +37 -0
- tools/team_news_scraper.py +370 -0
data/april_11_multimedia_data_collect/new_final_april 11/{neo4j_update → neo4j_game_update}/SCHEMA.md
RENAMED
File without changes

data/april_11_multimedia_data_collect/new_final_april 11/{neo4j_update → neo4j_game_update}/update_game_nodes.py
RENAMED
File without changes
docs/Phase 1/Task 1.2.3 Team Search Implementation.md
ADDED
@@ -0,0 +1,277 @@
# Task 1.2.3 Team Search Implementation Instructions

## Context

You are an expert at UI/UX design and software front-end development and architecture. You are allowed to not know an answer. You are allowed to be uncertain. You are allowed to disagree with your task. If any of these things happen, halt your current process and notify the user immediately. You should not hallucinate. If you are unable to remember information, you are allowed to look it up again.

You are not allowed to hallucinate. You may only use data that exists in the files specified. You are not allowed to create new data if it does not exist in those files.

You MUST plan extensively before each function call, and reflect extensively on the outcomes of previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.

When writing code, your focus should be on creating new functionality that builds on the existing code base without breaking things that are already working. If you need to rewrite how existing code works in order to develop a new feature, check your work carefully, and pause your work and tell me (the human) for review before going ahead. We want to avoid software regressions as much as possible.

I WILL REPEAT: WHEN UPDATING EXISTING CODE FILES, PLEASE DO NOT OVERWRITE EXISTING CODE; PLEASE ADD OR MODIFY COMPONENTS TO ALIGN WITH THE NEW FUNCTIONALITY. THIS INCLUDES SMALL DETAILS LIKE FUNCTION ARGUMENTS AND LIBRARY IMPORTS. REGRESSIONS IN THESE AREAS HAVE CAUSED UNNECESSARY DELAYS AND WE WANT TO AVOID THEM GOING FORWARD.

When you need to modify existing code (in accordance with the instruction above), present your recommendation to the user before taking action, and explain your rationale.

If the data files and code you need to use as inputs do not conform to the structure you expected based on the instructions, pause your work and ask the human for review and guidance on how to proceed.

If you have difficulty finding mission-critical inputs in the codebase (e.g. `.env` files, data files), ask the user for help in finding the path and directory.

## Objective

Follow the step-by-step process below to build the Team Info Search feature (Task 1.2.3). This involves scraping recent team news, processing it, storing it in Neo4j, and updating the Gradio application to allow users to query this information. The initial focus is on the back-end logic and returning correct text-based information, with visual components to be integrated later. The goal is for the user to ask the app a question about the team and get a rich text response based on recent news articles.

## Instruction Steps

1. **Codebase Review:** Familiarize yourself with the existing project structure:
    * `gradio_agent.py`: Understand how LangChain Tools (`Tool.from_function`) are defined with descriptions for intent recognition and how they wrap functions from the `tools/` directory.
    * `tools/`: Review `player_search.py`, `game_recap.py`, and `cypher.py` for examples of tool functions, Neo4j interaction, and data handling.
    * `components/`: Examine `player_card_component.py` and `game_recap_component.py` for UI component structure.
    * `gradio_app.py`: Analyze how it integrates components, handles user input/output (esp. `process_message`, `process_and_respond`), and interacts with the agent.
    * `.env`/`gradio_agent.py`: Note how API keys are loaded.
2. **Web Scraping Script:**
    * Create a new Python script dedicated to scraping articles (e.g., in the `tools/` directory, named `team_news_scraper.py`).
    * **Refer to existing scripts** in `data/april_11_multimedia_data_collect/` (like `get_player_socials.py`, `player_headshots.py`, `get_youtube_playlist_videos.py`) for examples of:
        * Loading API keys/config from `.env` using `dotenv` and `os.getenv()`.
        * Making HTTP requests (likely using the `requests` library).
        * Handling potential errors using `try...except` blocks.
        * Implementing delays (`time.sleep()`) between requests.
        * Writing data to CSV files using the `csv` module.
    * Target URL: `https://www.ninersnation.com/san-francisco-49ers-news`
    * Use libraries like `requests` to fetch the page content and `BeautifulSoup4` (you may need to add this to `requirements.txt`) to parse the HTML.
    * Scrape articles published within the past 60 days.
    * For each article, extract:
        * Title
        * Content/Body
        * Publication Date
        * Article URL (`link_to_article`)
        * Content Tags (e.g., Roster, Draft, Depth Chart; these often appear on the article page). Build a comprehensive set of unique tags encountered.
    * Refer to any previously created scraping files for examples of libraries and techniques used (e.g., BeautifulSoup, requests).
3. **Data Structuring (CSV):**
    * Process the scraped data to fit the following CSV structure:
        * `Team_name`: (e.g., "San Francisco 49ers"; determine how to handle articles not specific to the 49ers, and discuss if unclear)
        * `season`: (e.g., 2024; determine how to assign this)
        * `city`: (e.g., "San Francisco")
        * `conference`: (e.g., "NFC")
        * `division`: (e.g., "West")
        * `logo_url`: (URL for the 49ers logo; confirm source or leave blank)
        * `summary`: (placeholder for LLM summary)
        * `topic`: (assign appropriate tag(s) extracted during scraping)
        * `link_to_article`: (URL extracted during scraping)
    * Consider the fixed nature of some columns (`Team_name`, `city`, `conference`, etc.) and how to populate them accurately, especially if articles cover other teams or general news.
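With static 49ers fields hardcoded, the season derived from the publication year, and `logo_url` left blank, the row mapping could look like this sketch; `structure_row` and the raw-dict keys (`date`, `tags`, `url`) are illustrative names, not the final implementation.

```python
import csv
from datetime import datetime

FIELDNAMES = ["Team_name", "season", "city", "conference", "division",
              "logo_url", "summary", "topic", "link_to_article"]

def structure_row(article: dict) -> dict:
    """Map one raw scraped article onto the target CSV columns.
    Static team fields assume every article is 49ers-related;
    `summary` stays empty until the LLM step fills it in."""
    published = datetime.fromisoformat(article["date"])
    return {
        "Team_name": "San Francisco 49ers",
        "season": published.year,   # season derived from publication year
        "city": "San Francisco",
        "conference": "NFC",
        "division": "West",
        "logo_url": "",             # left blank until a source is confirmed
        "summary": "",              # placeholder for the summarization step
        "topic": ", ".join(article.get("tags") or ["General News"]),
        "link_to_article": article["url"],
    }

def write_to_csv(rows: list[dict], filename: str) -> None:
    """Write rows with a fixed column order."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```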
4. **LLM Summarization:**
    * For each scraped article's content, use the OpenAI GPT-4o model (configured via credentials in the `.env` file) *within the scraping/ingestion script* to generate a concise 3-4 sentence summary.
    * **Do NOT use `gradio_llm.py` for this task.**
    * Populate the `summary` column in your data structure with the generated summary.
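A hedged sketch of the summarization call using the OpenAI Python client; the prompt wording and the `MAX_CHARS` truncation limit are assumptions, and the client import is kept inside the function so the scraper still loads without the package installed.

```python
import os

MAX_CHARS = 12_000  # rough guard against sending very long articles

def truncate_content(text: str, limit: int = MAX_CHARS) -> str:
    """Trim article content before sending it to the API."""
    return text if len(text) <= limit else text[:limit]

def generate_summary(article_content: str) -> str:
    """Return a 3-4 sentence summary of the article, or "" on failure."""
    from openai import OpenAI  # lazy import: summarization-only dependency
    try:
        client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        response = client.chat.completions.create(
            model=os.getenv("OPENAI_MODEL", "gpt-4o"),
            messages=[
                {"role": "system",
                 "content": "Summarize the article in 3-4 sentences, "
                            "using only the provided text."},
                {"role": "user", "content": truncate_content(article_content)},
            ],
        )
        return response.choices[0].message.content.strip()
    except Exception as exc:  # APIError, RateLimitError, connection errors, ...
        print(f"Summarization failed: {exc}")
        return ""
```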
5. **Prepare CSV for Upload:**
    * Save the structured and summarized data into a CSV file (e.g., `team_news_articles.csv`).
6. **Neo4j Upload:**
    * Develop a script or function (potentially augmenting existing Neo4j tools) to upload the data from the CSV to the Neo4j database.
    * Ensure the main `:Team` node exists and has the correct season record: `MERGE (t:Team {name: "San Francisco 49ers"}) SET t.season_record_2024 = "6-11", t.city = "San Francisco", t.conference = "NFC", t.division = "West"`. Add other static team attributes here as needed.
    * Create new `:Team_Story` nodes for the team content.
    * Define appropriate properties for these nodes based on the CSV columns.
    * Establish relationships connecting each `:Team_Story` node to the central `:Team` node (e.g., `MATCH (t:Team {name: "San Francisco 49ers"}), (s:Team_Story {link_to_article: row.link_to_article}) MERGE (s)-[:STORY_ABOUT]->(t)`). Consult the existing schema or propose a schema update if necessary.
    * Ensure idempotency by using `MERGE` on `:Team_Story` nodes with `link_to_article` as the unique key.
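The upload step might look like the following sketch using the official `neo4j` driver; the `:Team_Story` property names and the environment-variable names are assumptions to align with the project's `.env`, and the driver import is lazy so the Cypher text stays inspectable without the package.

```python
import csv
import os

TEAM_MERGE = """
MERGE (t:Team {name: "San Francisco 49ers"})
SET t.season_record_2024 = "6-11", t.city = "San Francisco",
    t.conference = "NFC", t.division = "West"
"""

STORY_MERGE = """
MERGE (s:Team_Story {link_to_article: $link_to_article})
SET s.summary = $summary, s.topic = $topic, s.season = $season
WITH s
MATCH (t:Team {name: $Team_name})
MERGE (s)-[:STORY_ABOUT]->(t)
"""

def load_rows(csv_path: str) -> list[dict]:
    """Read the structured article CSV into parameter dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def upload(csv_path: str) -> None:
    from neo4j import GraphDatabase  # lazy import: upload-only dependency
    driver = GraphDatabase.driver(
        os.getenv("NEO4J_URI"),
        auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
    )
    with driver.session() as session:
        session.run(TEAM_MERGE)          # idempotent: MERGE keys on the team name
        for row in load_rows(csv_path):
            session.run(STORY_MERGE, **row)  # idempotent via link_to_article
    driver.close()
```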
7. **Gradio App Stack Update:**
    * **Define New Tool:** In `gradio_agent.py`, define a new `Tool.from_function` named, e.g., "Team News Search". Provide a clear `description` guiding the LangChain agent to use this tool for queries about recent team news, articles, or topics (e.g., "Use for questions about recent 49ers news, articles, summaries, or specific topics like 'draft' or 'roster moves'. Examples: 'What's the latest team news?', 'Summarize recent articles about the draft'").
    * **Create Tool Function:** Create the underlying Python function (e.g., `team_story_qa` in a new file `tools/team_story.py`, or within the scraper script if combined) that this new Tool will call. Import it into `gradio_agent.py`.
    * **Neo4j Querying (within Tool Function):** The `team_story_qa` function should take the user query/intent, construct an appropriate Cypher query against Neo4j to find relevant `:Team_Story` nodes (searching summaries, titles, or topics), execute the query (using helpers from `tools/cypher.py`), and process the results.
    * **Return Data (from Tool Function):** The `team_story_qa` function should return the necessary data, primarily the text `summary` and `link_to_article` for relevant stories.
    * **Display Logic (in `gradio_app.py`):** Modify the response handling logic in `gradio_app.py` (likely within `process_and_respond` or similar functions) to detect when the "Team News Search" tool was used. When detected, extract the data returned by `team_story_qa` and pass it to the new component (from Step 8) for rendering in the UI.
8. **Create New Gradio Component (Placeholder):**
    * Create a new component file (e.g., `components/team_story_component.py`) based on the style of `components/player_card_component.py`.
    * This component should accept the data returned by the `team_story_qa` function (e.g., a list of dictionaries, each with 'summary' and 'link_to_article').
    * For now, it should format and display this information as clear text (e.g., iterate through results, display summary, display link).
    * Ensure this component is used by the updated display logic in `gradio_app.py` (Step 7).
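The placeholder component's text formatting could start from a pure helper like this; `render_team_stories_html` is a hypothetical name, and the real component would wrap the returned string in `gr.HTML`, as `player_card_component.py` does.

```python
def render_team_stories_html(stories: list[dict]) -> str:
    """Format story summaries and links as simple HTML for the placeholder component."""
    if not stories:
        return "<p>No recent team news found.</p>"
    items = "".join(
        f"<li><p>{s['summary']}</p>"
        f"<a href=\"{s['link_to_article']}\">Read the full article</a></li>"
        for s in stories
    )
    return f"<ul>{items}</ul>"
```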

## Data Flow Architecture (Simplified)

1. User submits a natural language query via the Gradio interface.
2. The agent (`gradio_agent.py`) processes the query and selects the "Team News Search" tool based on its description.
3. The agent executes the tool, calling the `team_story_qa` function.
4. The `team_story_qa` function queries Neo4j via `tools/cypher.py`.
5. Neo4j returns relevant `:Team_Story` node data (summary, link, topic, etc.).
6. The `team_story_qa` function processes and returns this data.
7. The agent passes the data back to `gradio_app.py`.
8. `gradio_app.py`'s response logic identifies the tool used, extracts the data, and passes it to the `team_story_component`.
9. The `team_story_component` renders the text information within the Gradio UI.

## Error Handling Strategy

1. Implement robust error handling in the scraping script (network issues, website changes, missing elements).
2. Add error handling for LLM API calls (timeouts, rate limits, invalid responses).
3. Include checks and error handling during CSV generation and Neo4j upload (data validation, connection errors, query failures).
4. Gracefully handle cases where no relevant articles are found in Neo4j for a user's query.
5. Provide informative (though perhaps technical for now) feedback if intent recognition or query mapping fails.
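For items 1-3, a small shared backoff helper keeps retry behavior consistent across scraping, LLM, and Neo4j calls; this is an illustrative sketch, not existing project code.

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_retries(lambda: fetch_html(url))` or `with_retries(lambda: generate_summary(text))`.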

## Performance Optimization

1. Implement polite scraping practices (e.g., delays between requests) to avoid being blocked.
2. Consider caching LLM summaries locally if articles are scraped repeatedly, though the 60-day window might limit the benefit.
3. Optimize Neo4j Cypher queries for efficiency, potentially creating indexes on searchable properties like `topic` or keywords within `summary`.

## Failure Conditions

- If you are unable to complete any step after 3 attempts, immediately halt the process and consult with the user on how to continue.
- Document the failure point and the reason for failure.
- Do not proceed with subsequent steps until the issue is resolved.

## Completion Criteria & Potential Concerns

**Success Criteria:**
1. A functional Python script exists that scrapes articles from the specified URL according to the requirements.
2. A CSV file is generated containing the scraped, processed, and summarized data in the specified format.
3. The data from the CSV is successfully uploaded as new nodes (e.g., `:Team_Story`) into the Neo4j database, linked to a `:Team` node whose `season_record_2024` property is set to "6-11".
4. The Gradio application correctly identifies user queries about team news/information.
5. The application queries Neo4j via the new tool function (`team_story_qa`) and displays relevant article summaries and links (text-only) using the new component (`team_story_component.py`) integrated into `gradio_app.py`.
6. **Crucially:** No existing functionality (Player Search, Game Recap Search, etc.) is broken. All previous features work as expected.

**Deliverables:**
* This markdown file (`Task 1.2.3 Team Search Implementation.md`).
* The Python script for web scraping.
* The Python script or function(s) used for Neo4j upload.
* Modified files (`gradio_app.py`, `gradio_agent.py`, `tools/cypher.py`, potentially others) incorporating the new feature.
* The new Gradio component file (`components/team_story_component.py`).

**Challenges / Potential Concerns & Mitigation Strategies:**

1. **Web Scraping Stability:**
    * *Concern:* The structure of `ninersnation.com` might change, breaking the scraper. The site might use JavaScript to load content dynamically. Rate limiting or IP blocking could occur.
    * *Mitigation:* Build the scraper defensively (e.g., check that elements exist before accessing them). Use libraries like `requests-html` or `selenium` if dynamic content is an issue (check existing scrapers first). Implement delays and potentially user-agent rotation. Log errors clearly. Be prepared to adapt the scraper if the site changes.
2. **LLM Summarization:**
    * *Concern:* LLM calls (specifically to OpenAI GPT-4o) can be slow and potentially expensive. Summary quality might vary or contain hallucinations. API keys need secure handling.
    * *Mitigation:* Implement the summarization call within the ingestion script. Process summaries asynchronously if feasible within the script's logic. Implement retries for API errors. Use clear prompts to guide the LLM towards factual summarization based *only* on the provided text. Ensure API keys are loaded securely from `.env`, following the pattern in `gradio_agent.py`.
3. **Data Schema & Neo4j:**
    * *Concern:* How should non-49ers articles scraped from the site be handled if the focus is 49ers-centric `:Team_Story` nodes? Defining the `:Team_Story` node properties and relationships needs care. Ensuring idempotent uploads is important.
    * *Mitigation:* Filter scraped articles to include only those explicitly tagged or clearly about the 49ers before ingestion. Alternatively, **consult the user** on whether to create generic `:Article` nodes for non-49ers content or simply discard them. Propose a clear schema for `:Team_Story` nodes and their relationship to the `:Team` node. Use `MERGE` in Cypher queries with the article URL as a unique key for `:Team_Story` nodes and the team name for the `:Team` node to ensure idempotency.
4. **Gradio Integration & Regression:**
    * *Concern:* Modifying the core agent (`gradio_agent.py`, adding a Tool) and app files (`gradio_app.py`, modifying response handling) carries a risk of introducing regressions. Ensuring the new logic integrates smoothly is vital.
    * *Mitigation:* **Prioritize non-invasive changes:** add the new Tool and its underlying function cleanly. **Isolate changes:** keep the new `team_story_qa` function and `team_story_component.py` self-contained. **Thorough review:** before applying changes to `gradio_agent.py` (new Tool) and especially `gradio_app.py` (response handling logic), present the diff to the user for review. **Testing:** manually test existing features (Player Search, Game Recap) after integration. Add comments. Follow existing patterns closely.

## Notes

* Focus on delivering the text-based summary and link first; UI polish can come later.
* Review existing code for patterns related to scraping, Neo4j interaction, LLM calls, and Gradio component creation.
* Adhere strictly to the instructions regarding modifying existing code: additively and with caution, seeking review for core file changes.
* Document any assumptions made during implementation.

## Implementation Notes

### Step 1: Codebase Review

Reviewed the following files to understand the existing architecture and patterns:

* **`gradio_agent.py`**: Defines the LangChain agent (`create_react_agent`, `AgentExecutor`), loads API keys from `.env`, imports tool functions from `tools/`, defines tools using `Tool.from_function` (emphasizing the `description`), manages chat history via Neo4j, and orchestrates agent interaction in `generate_response`.
* **`tools/player_search.py` & `tools/game_recap.py`**: Define specific tools. They follow a pattern: define prompts (`PromptTemplate`), use `GraphCypherQAChain` for Neo4j, parse results into structured dictionaries, generate summaries/recaps with an LLM, and return both text `output` and structured `*_data`. They use a global variable cache (`LAST_*_DATA`) to pass structured data to the UI, retrieved by `get_last_*_data()`.
* **`tools/cypher.py`**: Contains a generic `GraphCypherQAChain` (`cypher_qa`) with a detailed prompt (`CYPHER_GENERATION_TEMPLATE`) for translating natural language to Cypher. It includes the `cypher_qa_wrapper` function used by the general "49ers Graph Search" tool. It doesn't provide reusable direct Neo4j execution helpers; specific tools import the `graph` object directly.
* **`components/player_card_component.py` & `components/game_recap_component.py`**: Define functions (`create_*_component`) that take structured data dictionaries and return `gr.HTML` components with formatted HTML/CSS. `game_recap_component.py` also has `process_game_recap_response` to extract structured data from the agent response.
* **`gradio_app.py`**: Sets up the Gradio UI (`gr.Blocks`, `gr.ChatInterface`). Imports components and agent functions. Manages chat state. The core logic is in `process_and_respond`, which calls the agent, retrieves cached structured data using `get_last_*_data()`, creates the relevant component, and returns text/components to the UI. This function will need modification to integrate the new Team Story component.
* **`.env`**: Confirms storage of necessary API keys (OpenAI, Neo4j, Zep, etc.) and the `OPENAI_MODEL` ("gpt-4o"). Keys are accessed via `os.environ.get()`.

**Conclusion**: ✅ The codebase uses LangChain agents with custom tools for specific Neo4j tasks. Tools return text and structured data; structured data is passed to UI components via a global cache workaround. UI components render HTML based on this data. The main `gradio_app.py` orchestrates the flow and updates the UI. This pattern should be followed for the new Team News Search feature.

### Step 2: Web Scraping Script

1. **File Creation**: Created `ifx-sandbox/tools/team_news_scraper.py`.
2. **Dependencies**: Added `requests` and `beautifulsoup4` to `ifx-sandbox/requirements.txt`.
3. **Structure**: Implemented the script structure with functions for:
    * `fetch_html(url)`: Fetches HTML using `requests`.
    * `parse_article_list(html_content)`: Parses the main news page using `BeautifulSoup` to find article links (`div.c-entry-box--compact h2 a`) and publication dates (`time[datetime]`). Includes fallback selectors.
    * `parse_article_details(html_content, url)`: Parses individual article pages using `BeautifulSoup` to extract title (`h1`), content (`div.c-entry-content p`), publication date (`span.c-byline__item time[datetime]` or fallback `time[datetime]`), and tags (`ul.m-tags__list a` or fallback `div.c-entry-group-labels a`). Includes fallback selectors and warnings.
    * `is_within_timeframe(date_str, days)`: Checks whether the ISO date string is within the last 60 days.
    * `scrape_niners_nation()`: Orchestrates fetching, parsing, filtering (last 60 days), and applies a 1-second delay between requests.
    * `structure_data_for_csv(scraped_articles)`: Placeholder function to prepare data for CSV (Step 3).
    * `write_to_csv(data, filename)`: Writes data to CSV using `csv.DictWriter`.
4. **Execution**: Added an `if __name__ == "__main__":` block to run the scraper directly, saving results to `team_news_articles_raw.csv`.
5. **Parsing Logic**: Implemented specific HTML parsing logic based on analysis of the provided sample URL (`https://www.ninersnation.com/2025/4/16/24409910/...`) and common SBNation website structures. Includes basic error handling and logging for missing elements.

**Status**: ✅ The script is implemented but depends on the stability of Niners Nation's HTML structure. It currently saves raw scraped data; Step 3 will refine the output format, and Step 4 will add LLM summarization.

### Step 3: Data Structuring (CSV)

1. **Review Requirements**: Confirmed the target CSV columns: `Team_name`, `season`, `city`, `conference`, `division`, `logo_url`, `summary`, `topic`, `link_to_article`.
2. **Address Ambiguities**:
    * `Team_name`, `city`, `conference`, `division`: Hardcoded static values ("San Francisco 49ers", "San Francisco", "NFC", "West"). Added a comment noting the assumption that all scraped articles are 49ers-related.
    * `season`: Decided to derive this from the publication year of the article.
    * `logo_url`: Left blank as instructed.
    * `topic`: Decided to use a comma-separated string of the tags extracted in Step 2 (defaulting to "General News" if no tags were found).
    * `summary`: Left as an empty string placeholder for Step 4.
3. **Implement `structure_data_for_csv`**: Updated the function in `team_news_scraper.py` to iterate through the raw scraped article dictionaries and create new dictionaries matching the target CSV structure, performing the mappings and derivations decided above.
4. **Update `write_to_csv`**: Modified the CSV writing function to use a fixed list of `fieldnames`, ensuring correct column order. Updated the output filename constant to `team_news_articles_structured.csv`.
5. **Refinements**: Improved date parsing in `is_within_timeframe` for timezone handling. Added checks in `scrape_niners_nation` to skip articles missing essential details (title, content, date) and to avoid duplicate URLs.

**Status**: ✅ The scraper script now outputs a CSV file (`team_news_articles_structured.csv`) conforming to the required structure, with the `summary` column ready for population in the next step.

### Step 4: LLM Summarization

1. **Dependencies & Config**: Added the `openai` import to `team_news_scraper.py`. Added logic to load `OPENAI_API_KEY` and `OPENAI_MODEL` (defaulting to `gpt-4o`) from `.env` using `dotenv`. Added an `ENABLE_SUMMARIZATION` flag based on API key presence.
2. **Summarization Function**: Created the `generate_summary(article_content)` function:
    * Initializes the OpenAI client (`openai.OpenAI`).
    * Uses a prompt instructing the model (`gpt-4o`) to generate a 3-4 sentence summary based *only* on the provided content.
    * Includes basic error handling for `openai` API errors (APIError, ConnectionError, RateLimitError), returning an empty string on failure.
    * Truncates content before sending it to the API to prevent excessive token usage.
3. **Integration**:
    * Refactored the main loop into `scrape_and_summarize_niners_nation()`.
    * Modified `parse_article_details` to ensure raw `content` is returned.
    * The main loop now calls `generate_summary()` after successfully parsing an article's details (if content exists).
    * The generated summary is added to the article details dictionary.
    * Created a `structure_data_for_csv_row()` helper to structure each article's data *including the summary* within the loop.
4. **Output File**: Updated the `OUTPUT_CSV_FILE` constant to `team_news_articles.csv`.

**Status**: ✅ The scraper script (`team_news_scraper.py`) now integrates LLM summarization using the OpenAI API. When run directly, it scrapes articles, generates summaries for their content, structures the data (including summaries) into the target CSV format, and saves the final result to `team_news_articles.csv`.

### Step 5: Prepare CSV for Upload

1. **CSV Generation**: Upon successful execution via its `if __name__ == "__main__":` block, `team_news_scraper.py` now generates the final CSV file (`ifx-sandbox/tools/team_news_articles.csv`) containing the structured and summarized data required by the previous steps.

**Status**: ✅ The prerequisite CSV file for the Neo4j upload is prepared by running the scraper script.

### Step 6: Neo4j Upload

1. **Develop Neo4j Upload Script**: Create a script to upload the data from the CSV to the Neo4j database.
2. **Ensure Neo4j Connection**: Ensure the script can connect to the Neo4j database.
3. **Implement Upload Logic**: Implement the logic to upload the data to Neo4j.
4. **Error Handling**: Add error handling for connection errors and query failures.

**Status**: ✅ The data from the CSV is successfully uploaded as new nodes (e.g., `:Team_Story`) into the Neo4j database, linked to a `:Team` node whose `season_record_2024` property is set to "6-11".

### Step 7: Gradio App Stack Update

1. **Define New Tool**: In `gradio_agent.py`, define a new `Tool.from_function` named, e.g., "Team News Search".
2. **Create Tool Function**: Create the underlying Python function (e.g., `team_story_qa` in a new file `tools/team_story.py`, or within the scraper script if combined) that this new Tool will call. Import it into `gradio_agent.py`.
3. **Neo4j Querying**: The `team_story_qa` function should take the user query/intent, construct an appropriate Cypher query against Neo4j to find relevant `:Team_Story` nodes (searching summaries, titles, or topics), execute the query (using helpers from `tools/cypher.py`), and process the results.
4. **Return Data**: The `team_story_qa` function should return the necessary data, primarily the text `summary` and `link_to_article` for relevant stories.
5. **Display Logic**: Modify the response handling logic in `gradio_app.py` (likely within `process_and_respond` or similar functions) to detect when the "Team News Search" tool was used. When detected, extract the data returned by `team_story_qa` and pass it to the new component (from Step 8) for rendering in the UI.

**Status**: The Gradio application correctly identifies user queries about team news/information and queries Neo4j via the new tool function (`team_story_qa`).

### Step 8: Create New Gradio Component

1. **Create New Component**: Create a new component file (e.g., `components/team_story_component.py`) based on the style of `components/player_card_component.py`.
2. **Accept Data**: The component should accept the data returned by the `team_story_qa` function (e.g., a list of dictionaries, each with 'summary' and 'link_to_article').
3. **Format Display**: For now, it should format and display this information as clear text (e.g., iterate through results, display summary, display link).
4. **Use Component**: Ensure this component is used by the updated display logic in `gradio_app.py` (Step 7).

**Status**: The new Gradio component file (`components/team_story_component.py`) is created and integrated into the Gradio application.

### Step 9: Error Handling

1. **Implement Robust Error Handling**: Add error handling for scraping, LLM calls, and Neo4j connection issues.
2. **Provide Informative Feedback**: Gracefully handle cases where no relevant articles are found in Neo4j for a user's query, and provide informative feedback if intent recognition or query mapping fails.

**Status**: The Gradio application now includes robust error handling and informative feedback for scraping, LLM calls, and Neo4j connection issues.

### Step 10: Performance Optimization

1. **Implement Polite Scraping Practices**: Add delays between requests to avoid being blocked by the source site.
2. **Consider Caching**: Consider caching LLM summaries locally if articles are scraped repeatedly, though the 60-day window may limit the benefit.
3. **Optimize Neo4j Cypher Queries**: Optimize Cypher queries for efficiency, potentially creating indexes on searchable properties such as `topic`, or a full-text index for keyword search within `summary`.

**Status**: The Gradio application now includes polite scraping practices, caching, and optimized Neo4j Cypher queries.
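The polite-scraping delay from item 1 can be sketched as a small throttle; the default interval is an assumption and should be tuned per site:

```python
import time

class RequestThrottle:
    """Enforce a minimum interval between consecutive scraping requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last_request = 0.0

    def wait(self):
        """Block until min_interval has elapsed since the previous request."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

For item 3, a property index such as `CREATE INDEX team_story_topic IF NOT EXISTS FOR (s:Team_Story) ON (s.topic)` covers exact-topic lookups; keyword search inside `summary` would want a full-text index instead.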

### Step 11: Failure Handling

1. **Implement Retry Logic**: Retry transient failures (e.g., network timeouts, rate limits) in scraping, LLM calls, and Neo4j connections.
2. **Implement Fallback Logic**: When retries are exhausted, fall back gracefully (e.g., skip the affected article or return a partial result) instead of failing the whole request.

**Status**: The Gradio application now includes retry logic and fallback logic for scraping, LLM calls, and Neo4j connection issues, building on the error handling from Step 9.
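A compact way to express retry-plus-fallback for any of these operations (scraping, LLM, Neo4j) is a generic wrapper; the helper below is a sketch under the assumptions above, not code from the repo:

```python
import time

def with_retries(operation, max_attempts=3, base_delay=1.0, fallback=None):
    """Run `operation`, retrying with exponential backoff on any exception;
    return `fallback` once all attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as e:
            if attempt == max_attempts - 1:
                print(f"All {max_attempts} attempts failed: {e}")
                return fallback
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.1f}s...")
            time.sleep(delay)
```

Scraping a page, calling the summarizer, or running a Cypher query can each be wrapped as `with_retries(lambda: ..., fallback=[])` so one flaky article never aborts the whole pipeline.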

### Step 12: Completion Criteria

1. **Verify CSV Generation**: Verify that the CSV file is generated correctly.
2. **Verify Neo4j Upload**: Verify that the data from the CSV is successfully uploaded as new nodes (e.g., `:Team_Story`) into the Neo4j database, linked to a `:Team` node which includes the `
requirements.txt
CHANGED
@@ -9,3 +9,5 @@ python-dotenv>=1.0.0
 zep-cloud>=0.1.0
 asyncio>=3.4.3
 pandas>=2.0.0
+requests>=2.30.0
+beautifulsoup4>=4.12.0
tools/neo4j_article_uploader.py
ADDED
@@ -0,0 +1,145 @@
#!/usr/bin/env python
"""
Script to upload structured and summarized team news articles from a CSV file to Neo4j.
"""

import os
import sys
import csv
from datetime import datetime
from dotenv import load_dotenv

# Adjust path to import graph object from the parent directory
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

try:
    from gradio_graph import graph  # Import the configured graph instance
except ImportError as e:
    print(f"Error importing gradio_graph: {e}")
    print("Please ensure gradio_graph.py exists and is configured correctly.")
    sys.exit(1)

# Load environment variables (though graph should already be configured)
load_dotenv()

# Configuration
# Assumes the CSV is in the same directory as this script
CSV_FILEPATH = os.path.join(os.path.dirname(__file__), "team_news_articles.csv")
TEAM_NAME = "San Francisco 49ers"


def upload_articles_to_neo4j(csv_filepath):
    """Reads the CSV and uploads article data to Neo4j."""
    print(f"Starting Neo4j upload process for {csv_filepath}...")

    if not os.path.exists(csv_filepath):
        print(f"Error: CSV file not found at {csv_filepath}")
        return

    # 1. Ensure the :Team node exists with correct properties
    print(f"Ensuring :Team node exists for '{TEAM_NAME}'...")
    team_merge_query = """
    MERGE (t:Team {name: $team_name})
    SET t.season_record_2024 = $record,
        t.city = $city,
        t.conference = $conference,
        t.division = $division
    RETURN t.name
    """
    team_params = {
        "team_name": TEAM_NAME,
        "record": "6-11",  # As specified in instructions
        "city": "San Francisco",
        "conference": "NFC",
        "division": "West"
    }
    try:
        result = graph.query(team_merge_query, params=team_params)
        if result and result[0]['t.name'] == TEAM_NAME:
            print(f":Team node '{TEAM_NAME}' ensured/updated successfully.")
        else:
            print(f"Warning: Problem ensuring :Team node '{TEAM_NAME}'. Result: {result}")
            # Decide whether to proceed or stop
            # return
    except Exception as e:
        print(f"Error executing team merge query: {e}")
        return  # Stop if we can't ensure the team node

    # 2. Read CSV and upload articles
    print("Reading CSV and uploading :Team_Story nodes...")
    article_merge_query = """
    MERGE (s:Team_Story {link_to_article: $link_to_article})
    SET s.teamName = $Team_name,
        s.season = toInteger($season), // Ensure season is integer
        s.summary = $summary,
        s.topic = $topic,
        s.city = $city,
        s.conference = $conference,
        s.division = $division
    // Add other properties from CSV if needed, like raw_title, raw_date?
    WITH s
    MATCH (t:Team {name: $Team_name})
    MERGE (s)-[:STORY_ABOUT]->(t)
    RETURN s.link_to_article AS article_link, t.name AS team_name
    """

    upload_count = 0
    error_count = 0
    try:
        with open(csv_filepath, 'r', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                try:
                    # Prepare parameters for the query.
                    # Ensure all keys the query expects are present in the row or provide defaults.
                    params = {
                        "link_to_article": row.get("link_to_article", ""),
                        "Team_name": row.get("Team_name", TEAM_NAME),  # Use team name from row or default
                        "season": row.get("season", datetime.now().year),  # Default season if missing
                        "summary": row.get("summary", ""),
                        "topic": row.get("topic", ""),
                        "city": row.get("city", "San Francisco"),  # Use city from row or default
                        "conference": row.get("conference", "NFC"),
                        "division": row.get("division", "West"),
                    }

                    # Basic validation before sending to Neo4j
                    if not params["link_to_article"]:
                        print(f"Skipping row due to missing link_to_article: {row}")
                        error_count += 1
                        continue
                    if not params["Team_name"]:
                        print(f"Skipping row due to missing Team_name: {row}")
                        error_count += 1
                        continue

                    # Execute the query for the current article
                    graph.query(article_merge_query, params=params)
                    upload_count += 1
                    if upload_count % 20 == 0:  # Print progress every 20 articles
                        print(f"Uploaded {upload_count} articles...")

                except Exception as e:
                    print(f"Error processing/uploading row: {row}")
                    print(f"Error details: {e}")
                    error_count += 1
                    # Continue to the next row even if one fails

    except FileNotFoundError:
        print(f"Error: CSV file not found at {csv_filepath}")
        return
    except Exception as e:
        print(f"An unexpected error occurred while reading CSV or uploading: {e}")
        return

    print("\nNeo4j upload process finished.")
    print(f"Successfully uploaded/merged: {upload_count} articles.")
    print(f"Rows skipped due to errors/missing data: {error_count}.")


if __name__ == "__main__":
    print("Running Neo4j Article Uploader script...")
    upload_articles_to_neo4j(CSV_FILEPATH)
    print("Script execution complete.")
tools/team_news_articles.csv
ADDED
@@ -0,0 +1,37 @@
Team_name,season,city,conference,division,logo_url,summary,topic,link_to_article
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are considering contingency plans for the upcoming NFL Draft due to their need for a defensive tackle after releasing Maliek Collins and Javon Hargrave. With the risk of other teams selecting top prospects before their picks, the 49ers are exploring acquiring Jon Franklin-Myers from the Denver Broncos. Franklin-Myers, a reliable defensive player with strong run defense skills, could be a valuable addition, reducing pressure on drafted rookies. As he enters the final year of his contract, the 49ers could acquire him for a Day 3 pick, allowing them to focus on other positions in the early rounds of the draft.","San Francisco 49ers Roster, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/17/24410197/49ers-robert-saleh-john-franklin-myers-javon-hargrave-maliek-collins
San Francisco 49ers,2025,San Francisco,NFC,West,,"The article explores the possibility of the San Francisco 49ers trading up in the draft, considering scenarios where they could secure top prospects like Travis Hunter or Abdul Carter. If the 49ers trade with the Giants for the third pick, they could potentially select Carter to bolster their defense alongside Nick Bosa. Alternatively, if Carter is taken by Cleveland, Hunter could be a strong addition for both defensive and offensive versatility. The article also discusses a smaller trade-up option with Carolina at the eighth pick, contingent on the availability of key players like Mason Graham or Jalon Walker, emphasizing the risks and potential rewards of trading up in the draft.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/17/24410294/49ers-armand-membou-will-campbell-mason-graham
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers hosted linebacker Chris ""Pooh"" Paul Jr., who previously played for Arkansas and Ole Miss, where he earned second-team All-SEC and third-team All-American honors. Despite being smaller than typical linebackers, Paul Jr. has shown athleticism and tackling ability, but struggles with taking on blocks. The 49ers have previously selected smaller linebackers like Dee Winters, indicating a preference for speed and physicality over size. For Paul Jr. to succeed, especially as a run defender, the 49ers would need to strengthen their defensive line to keep him clean from blockers.","San Francisco 49ers Mock Drafts, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/17/24410527/49ers-fred-warner-chris-paul-dee-winters
San Francisco 49ers,2025,San Francisco,NFC,West,,"The 49ers are exploring options to strengthen their linebacker position, hosting Ole Miss' Chris Paul, a third-team All-American and Butkus Award finalist, and Oregon's Jeffrey Bassa, both projected mid-round draft picks. The team is leveraging their success in selecting players like Dre Greenlaw and Fred Warner in similar draft rounds. Additionally, the 49ers are evaluating other prospects, including offensive linemen, a wide receiver, and a cornerback, ahead of the 2025 NFL Draft. Meanwhile, offensive lineman AlarcΓ³n, signed in January 2024, has been suspended for six games.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/17/24410172/49ers-news-nfl-draft-prospect-visits-top-30-john-lynch-defensive-offensive-linemen-offseason-brock
San Francisco 49ers,2025,San Francisco,NFC,West,,"San Francisco 49ers offensive tackle Isaac Alarcon has been suspended without pay for the first six games of the 2025 regular season due to a violation of the NFLβs Performance-Enhancing Substances Policy. Despite the suspension, Alarcon can still participate in offseason activities and preseason games. His absence is not expected to significantly impact the 49ers' depth chart, as the team has several other tackles on the roster. Alarcon, part of the NFLβs International Player Pathway Program, has yet to play a regular-season snap for the team.","San Francisco 49ers Roster, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24410021/49ers-isaac-alacron-colton-mckivitiz-kyle-shanahan-trent-williams
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers released several defensive linemen this offseason, leaving Nick Bosa as the only remaining starter. The team plans to draft multiple defensive linemen, but it's unlikely they will start three rookies alongside Bosa. Among the current players, Yetur Gross-Matos, Sam Okuayinonu, Kalia Davis, and Drake Jackson are potential candidates to step up. Gross-Matos and Okuayinonu have shown potential, while Evan Anderson could contribute in a rotational role.","San Francisco 49ers Roster, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24409910/49ers-yetur-gross-matos-sam-okuayinonu-robert-beal
San Francisco 49ers,2025,San Francisco,NFC,West,,"With the 2025 NFL Draft approaching, the San Francisco 49ers hold the No. 11 overall pick and are evaluating potential selections, particularly at the tight end position. Michigan's Colston Loveland and LSU's Mason Taylor are top prospects, with Loveland being a potential first-round choice if the team trades down, despite past injury concerns. Other tight end prospects like Bowling Green's Harold Fannin, Georgia Tech's Jackson Hawes, and Texas Tech's Jalin Conyers have also been considered for later rounds, offering various skills in pass-catching and blocking. The 49ers are seeking a long-term successor to George Kittle, as well as potential cost-effective options in free agency.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24409848/san-francisco-49ers-realistic-targets-tight-end-2025-nfl-draft-colston-loveland-mason-taylor
San Francisco 49ers,2025,San Francisco,NFC,West,,"As the NFL draft approaches, the San Francisco 49ers are poised to benefit from potential early quarterback selections, allowing a premium player to fall to them at pick 11. Bleacher Report's mock draft predicts the 49ers will select Penn State tight end Tyler Warren, forming a formidable duo with George Kittle and offering long-term offensive dynamism. The 49ers are also projected to strengthen their offensive line with Josh Conerly from Oregon and bolster their defensive line with T.J. Sanders from South Carolina. These picks aim to address both immediate and future team needs.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24409746/49ers-load-up-on-offense-in-latest-3-round-mock-draft-tyler-warren-tj-sanders-bleacher-report
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers have key roster needs in the offensive and defensive lines, with cornerback as a close third, as the NFL Draft approaches. Betting odds suggest high likelihoods for players like OT Josh Simmons and DT Mason Graham to be first-round picks, while others like CB Maxwell Hairston and Edge James Pearce Jr. also have strong chances. Fourteen prospects are considered for the 49ers' 11th pick, with potential for trade moves depending on draft developments.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24408997/49ers-mock-draft-fan-duel-odds
San Francisco 49ers,2025,San Francisco,NFC,West,,"The 49ers are focusing on strengthening their offensive line in the 2025 NFL Draft and are hosting Ohio State tackle Josh Simmons for a visit. Simmons is considered a potential successor to their current left tackle, Trent Williams, who is nearing the end of his career. Despite his impressive performance at Ohio State, Simmons' recent knee injury is a concern, and the 49ers are thoroughly evaluating his condition. If satisfied with his recovery, Simmons could be a key future asset for the team.","NFL, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/16/24409650/49ers-hosting-visit-potential-trent-williams-successor-one-significant-red-flag
San Francisco 49ers,2025,San Francisco,NFC,West,,"The 49ers are exploring options to enhance their roster depth, particularly at running back and offensive tackle. They are considering SMUβs Brashard Smith, a versatile player with impressive all-purpose yardage, and LSU's Will Campbell, a strong left tackle prospect. The team is also evaluating potential draft picks in other positions, including Virginia Tech wideout Felton, Cal linebacker Teddye Buchanan, and several defensive backs like Quincy Riley and Mello Dotson. These prospects offer a range of skills that could address the team's needs in both offensive and defensive roles.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/16/24409398/49ers-news-offseason-mock-draft-defensive-tackle-running-back-deebo-samuel-replacement-nfl-lynch
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers hold the No. 11 pick in the upcoming 2025 NFL Draft and are considered a potential wild-card team due to their draft strategy flexibility. ESPNβs Field Yates suggests they could trade up if certain scenarios unfold, such as Colorado quarterback Shedeur Sanders being picked third overall. This could push top offensive tackles down the board, tempting the 49ers to address their significant offensive line needs by leapfrogging teams like the Chicago Bears. With 11 picks, including four in the Top 100, San Francisco has the draft capital to make such a move, though it remains uncertain if they will do so.","San Francisco 49ers Mock Drafts, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/15/24409188/san-francisco-49ers-top-trade-up-candidate-round-1-2025-nfl-draft-espn-kyle-shanahan-john-lynch
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are focusing on drafting an impact defensive lineman with their first-round pick, particularly at pick 11, to bolster their defensive line alongside Nick Bosa. With the departures of key players like Javon Hargrave and Leonard Floyd, the team aims to find a three-down player to enhance their pass rush and return to their successful defensive strategies of the past. The 49ers are considering prospects like Mason Graham and Kenneth Grant, emphasizing the need for an immediate contributor due to past struggles with first-round picks. The team's strategy is driven by the return of Robert Saleh as defensive coordinator and the development work of Kris Kocurek.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/15/24409069/the-49ers-must-return-to-their-pass-rushing-roots-in-2025-nfl-draft-robert-saleh
San Francisco 49ers,2025,San Francisco,NFC,West,,"In a mock draft by ESPN's Mel Kiper Jr. and Field Yates, the San Francisco 49ers selected Kelvin Banks Jr., an offensive tackle from Texas, at No. 11, addressing future needs on the offensive line. In the second round, they picked James Pearce Jr., an edge rusher from Tennessee, to enhance pass-rush depth, despite concerns about his motor. The third round saw the selection of Alfred Collins, a defensive tackle from Texas, to bolster the defensive line, and Upton Stout, a cornerback from Western Kentucky, to strengthen the secondary. Each pick aimed to address specific team needs with a mix of immediate impact and developmental potential.","San Francisco 49ers Mock Drafts, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/15/24408996/49ers-james-pearce-kelvin-banks-upton-stout-alfred-collins
San Francisco 49ers,2025,San Francisco,NFC,West,,"The Miami Dolphins and cornerback Jalen Ramsey are exploring trade options, despite Ramsey signing a contract extension in September 2024. The Dolphins would absorb most of his contract's financial burden if traded, making it feasible for another team, like the 49ers, to acquire him. The 49ers, familiar with Ramsey through past coaching connections, could benefit from his experience and leadership, especially given their need for an established veteran in the secondary. Ramsey, still performing at a high level, would likely cost the 49ers no more than a third-round pick, making him a valuable addition to their roster.","NFL, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/15/24408922/49ers-jalen-ramsey-robert-saleh-gus-bradley
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers, holding the No. 11 overall pick in the 2025 NFL Draft, are unlikely to select a wide receiver early due to other pressing team needs and the lack of a consensus top receiver. However, they are exploring wide receiver options for Day 2 and beyond, with prospects like Iowa State's Jayden Higgins, TCU's Savion Williams, Washington State's Kyle Williams, UNLV's Ricky White, and Tennessee's Dontβe Thornton being considered. Each prospect offers unique skills, such as Higgins' size and athleticism, Savion Williams' potential for versatility, Kyle Williams' speed, White's route-running abilities, and Thornton's vertical threat, which could complement the 49ers' existing roster needs","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/15/24408628/san-francisco-49ers-realistic-targets-wide-receivers-2025-nfl-draft-jayden-higgins-savion-williams
San Francisco 49ers,2025,San Francisco,NFC,West,,"Baldinger identifies James Pearce Jr. as an ideal first-round pick for the 49ers, highlighting his elite athleticism and ability to collapse the pocket, as evidenced by his impressive performance metrics and high grades from Pro Football Focus. The 49ers are also exploring other prospects, hosting Toledo DT Darius Alexander, known for his versatility, and Ole Miss LB Chris Paul Jr. for pre-draft visits. Additionally, Tennessee DT Omari Thomas, noted for his versatility and leadership, has met with the team, showcasing his ability to play multiple defensive positions.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/15/24408720/49ers-news-mock-draft-defensive-linemen-pre-visit-hosting-meeting-kyle-kris-robert-saleh-brock-purdy
San Francisco 49ers,2025,San Francisco,NFC,West,,"With the 2025 NFL Draft approaching, a new mock draft predicts the San Francisco 49ers will trade down from their No. 11 pick to No. 14, selecting Texas A&M defensive lineman Shemar Stewart. At No. 42, they choose Notre Dame cornerback Benjamin Morrison, contingent on his recovery from a hip injury. The 49ers also plan to bolster their defensive line with Collins in the third round and add versatility in the secondary by picking Texas safety Andrew Mukuba at No. 80. Finally, they aim to secure linebacker depth by selecting Chris Paul Jr. at No. 100, potentially as Dre Greenlaw's future replacement.","San Francisco 49ers Mock Drafts, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24408361/san-francisco-49ers-3-round-mock-draft-shemar-stewart-defense-wins-championships-kyle-shanahan
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are considering Toledo defensive tackle Darius Alexander during their pre-draft visits. Known for his elite pass rush win rate and athleticism, Alexander consistently performs as a 3-technique, offering solid run defense and disruptive pass-rushing plays. While his ceiling may not be the highest, he is reliable in his performance. If drafted, Alexander would likely be a second-round pick at No. 43 overall, fitting the team's potential shift towards more powerful defensive tackles.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24408266/49ers-darius-alexander-robert-saleh-nfl-draft
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are focusing on acing their upcoming draft to address roster age and cap issues, especially after a disappointing season with significant player departures. Despite previous success in later draft rounds, the team is under pressure to find impactful first-round talent, particularly for their lines of scrimmage. With 11 draft picks, including the 11th overall, the 49ers aim to secure key players to fill gaps on both the offensive and defensive lines. The team also faces challenges with minimal offseason additions and ongoing contract negotiations with QB Brock Purdy.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24408255/49ers-need-to-ace-their-nfl-draft-john-lynch-kyle-shanahan
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are set to meet with Washington State wide receiver Kyle Williams, a prospect from the Senior Bowl. Williams had a standout year in 2024 with 70 receptions, 1,196 yards, and 14 touchdowns, but concerns remain about his suitability for the NFL due to his small stature and limited route-running skills. Despite his speed, his college offense was simplistic, and his ability to transition to the professional level is questionable. Some suggest that another player, Jacob Cowing, might be a more promising prospect.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24408194/49ers-kyle-williams-washington-state-jacob-cowing-nfl-draft
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers, with 11 draft picks, are considering trade-back options in the upcoming draft to address their needs after a disappointing 6-11 season. If top targets like Mason Graham, Armand Membou, and Will Campbell are unavailable, the team could trade back from the 11th pick, potentially targeting Boise State running back Ashton Jeanty as a trade asset. A deal with Denver, moving to the 20th pick and gaining an additional second-round pick, is one possibility. This strategy would allow the 49ers to acquire more selections without significantly dropping in the draft order, providing flexibility to compete for the Lombardi Trophy in 2025.","San Francisco 49ers Roster, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24407798/49ers-draft-scenarios-trade-back-walter-nolen
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers face significant roster turnover heading into the 2025 NFL Draft, needing to replace 16.6% of their snaps, the fourth-highest in the league. While the offense remains mostly stable, losing only 10.5% of snaps, the defense will undergo major changes, with 22.6% of its snaps needing replacement. Key defensive players like DeβVondre Campbell, Maliek Collins, and Charvarius Ward will be replaced, impacting positions such as linebacker, defensive tackle, and cornerback. Additionally, improvements in special teams are anticipated, given the previous season's underperformance.","San Francisco 49ers Roster, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/14/24407808/49ers-leonard-floyd-charvarius-ward-talanoa-hufanga
San Francisco 49ers,2025,San Francisco,NFC,West,,"The NFL plans to release the full 2025 schedule around May 13-15, according to Mike North, the league's VP of broadcast planning and scheduling. Meanwhile, the San Francisco 49ers are conducting Top 30 pre-draft visits to evaluate prospects for the 2025 NFL Draft, with updates on these visits being tracked.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/14/24407898/49ers-news-schedule-release-offseason-pre-draft-visit-tracker-prospects-brock-purdy-brandon-aiyuk
San Francisco 49ers,2025,San Francisco,NFC,West,,"The 49ers are considering selecting Texas' Jahdae Barron in the 2024 NFL Draft to bolster their secondary, despite having more pressing needs on the defensive line. Barron, known for his versatility and superb ball skills, could provide significant long-term benefits to the 49ers' defensive backfield. His ability to play multiple positions, including outside corner, slot, and safety, offers the team flexibility and potential strategic advantages. While the 49ers have traditionally focused on strengthening their defensive front, adding Barron could enhance their secondary's playmaking capabilities and overall defensive strategy.","San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/13/24407322/would-jahdae-barron-make-sense-49ers-surprise-selection-no-11-2025-draft
San Francisco 49ers,2025,San Francisco,NFC,West,,"The 49ers are considering several prospects for the NFL draft, including Ezeiruaku, an edge rusher known for his effective use of 34-inch arms and impressive college stats of 47 tackles for loss and 30 sacks over four seasons. Despite his slightly smaller size for a 4-3 defensive end, he demonstrates good explosiveness and balance. They also met with Georgia DT Warren Brinson, who has consistently high defensive grades, and WR prospect Mumpfield, noted for his exceptional route running and ability to make contested catches.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/13/24407284/49ers-news-pre-draft-visit-prospects-mock-aiyuk-brock-purdy-contract-trade-offseason-kyle-jed-john
San Francisco 49ers,2025,San Francisco,NFC,West,,"Tom Brady is collaborating with AMC Networks and several production companies to create a docuseries titled ""Gold Rush,"" set to premiere in 2026, which will explore the San Francisco 49ers' impact on the NFL. The series will feature interviews with 49ers legends and previously unseen NFL Films footage. Meanwhile, the 49ers are strategizing on how to replace linebacker Dre Greenlaw, considering several draft prospects. Additionally, a potential shoulder surgery for Saints quarterback Derek Carr could influence the 49ers' draft options by shifting available prospects.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/12/24406730/49ers-news-offseason-mock-draft-pre-visits-prospects-derek-carr-injury-shoulder-surgery-brock-aiyuk
San Francisco 49ers,2025,San Francisco,NFC,West,,"George Kittle, a star NFL player for the San Francisco 49ers, is a lifelong wrestling enthusiast who has actively blended his love for football and wrestling. He made a notable appearance at WrestleMania 39, where he got involved in the action by clotheslining The Miz, thrilling the crowd. Kittle continues to celebrate his passion for wrestling by hosting ""KittleMania,"" a fan event in Las Vegas, and collaborating with WWE star Penta El Zero Miedo on a wrestling-inspired clothing line. Although he remains focused on football, Kittle has not ruled out a future in WWE, and his charisma and passion make him a natural fit for the wrestling world.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/11/24405991/49ers-tight-end-to-turnbuckle-george-kittles-epic-wrestling-journey
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers, under Kyle Shanahan and John Lynch, have had a mixed draft record with notable successes like Brock Purdy and George Kittle, but also some first-round misses. Despite not having first-round picks in 2022 and 2023, their overall draft performance over the last decade ranks them eighth according to Betway's analysis, with a score of 30.7 out of 100. The rankings place them behind recent Super Bowl winners like the Chiefs and Rams. The upcoming 2025 draft is crucial for the 49ers to enhance their roster and maintain competitiveness.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/11/24406301/49ers-last-10-years-in-the-nfl-draft-trent-baalke-john-lynch
San Francisco 49ers,2025,San Francisco,NFC,West,,"As the 2025 NFL Draft approaches, the San Francisco 49ers are conducting their Top-30 visits to evaluate potential draft picks. These visits focus on key areas such as the defensive line, secondary, tight end, and offensive line, indicating the team's strategic priorities. The inclusion of prospects like Walter Nolen and Omarr Norman-Lott suggests a focus on enhancing the pass rush and run defense, while engagements with cornerbacks and safeties aim to bolster the defensive backfield. Additionally, the team is exploring options for depth at tight end and offensive positions, reflecting a comprehensive strategy to strengthen their roster.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/11/24405659/49ers-draft-strategy-by-looking-at-their-top-30-visits
San Francisco 49ers,2025,San Francisco,NFC,West,,"In the 2025 NFL Draft, the San Francisco 49ers are expected to prioritize adding a pass catcher due to the trade of Deebo Samuel and uncertainty around Brandon Aiyuk's return. The team has a history of drafting wide receivers and tight ends, and this year's class offers a variety of options in both positions. The article suggests that while dynamic tight ends are available, the 49ers might find value in selecting wide receivers like Bond or Horton in the fourth round. The team is likely to explore options beyond the early rounds to enhance their depth chart.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/11/24405840/49ers-savion-williams-jaylin-noel-jack-bech-jacob-cowing
San Francisco 49ers,2025,San Francisco,NFC,West,,"Stanford wide receiver Elic Ayomanor is appealing to the 49ers, highlighting his strength, speed, and blocking abilities as a good fit for their team. San Jose State's Nash, a prolific college receiver, is seen as a potential fifth-round pick due to his slot receiver experience and lack of breakaway speed, despite his physicality and late switch to the position. Washington State's Pole, a quick learner with a basketball background, excelled as a left tackle, not allowing any sacks last season. Louisville's Quincy Riley and Georgia Tech's Jackson Hawes are among other prospects visiting the 49ers, while veteran kicker Gay, known for his accuracy inside 50 yards, could compete with Moody.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/11/24405927/49ers-news-brock-purdy-contract-extension-mock-draft-nfl-prospects-stanford-visits-agents-trade
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are expected to heavily invest in their defensive line during the 2025 NFL Draft, but their strategy for the offensive line remains uncertain. With Trent Williams recovering from an injury-plagued season and no clear successor, the team would benefit from drafting a tackle early. However, head coach Kyle Shanahan has suggested that Spencer Burford, a versatile 2022 draft pick, might fill the role of swing tackle despite his previous challenges. If the 49ers do not select a tackle by day three of the draft, it may indicate confidence in Burford as a backup for both Williams and Colton McKivitz.","NFL, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/10/24405550/49ers-belief-2022-selection-influence-plans-premium-position-2025-draft
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers are preparing for the 2025 NFL Draft, where they hold the 11th overall pick. In a mock draft scenario, they trade down with the Tampa Bay Buccaneers to acquire an extra third-round pick, selecting Missouri's Armand Membou as a future franchise left tackle at No. 11. They further bolster their defensive line by drafting T.J. Sanders and Darius Alexander, addressing key needs following departures in that area. Additionally, they select Michigan defensive end Josiah Stewart and Clemson linebacker Barrett Carter to strengthen their roster with high-upside talent.","San Francisco 49ers Mock Drafts, San Francisco 49ers Draft, San Francisco 49ers News",https://www.ninersnation.com/2025/4/10/24405644/san-francisco-49ers-news-3-round-mock-draft-armand-membou-will-johnson-2025-nfl-draft
San Francisco 49ers,2025,San Francisco,NFC,West,,"The San Francisco 49ers, once a perennial NFC title contender, faced a challenging 2024 season with only six wins, largely due to injuries and coaching issues. Key players like Brandon Aiyuk, Trent Williams, and Christian McCaffrey were sidelined, and the team struggled with a poor special teams unit and a change in defensive coordinators. With several key defensive players departing, the 49ers are focusing on rebuilding through the 2025 NFL Draft, targeting needs on the defensive line and other positions. Despite setbacks, there is optimism for the future, highlighted by strong performances from George Kittle and the rookie class.","San Francisco 49ers Draft, San Francisco 49ers Depth Chart, San Francisco 49ers News",https://www.ninersnation.com/2025/4/10/24405441/49ers-news-what-is-the-state-of-the-49ers-franchise
San Francisco 49ers,2025,San Francisco,NFC,West,,"AMC Networks will premiere ""Gold Rush,"" a four-part docuseries exploring the history and legacy of the San Francisco 49ers. The series will include exclusive interviews with players, coaches, and executives, providing insights into the team's evolution from its early days to its current status in the NFL. This announcement comes as interest in sports documentaries grows, with previous documentaries on the 49ers already available. While the premiere date is not yet announced, anticipation is high among fans eager to learn more about the team's storied past.",San Francisco 49ers News,https://www.ninersnation.com/2025/4/10/24405108/four-part-49ers-documentary-in-the-works-at-amc-networks-tom-brady
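Each data row above follows the nine-column schema that the scraper below writes (`Team_name, season, city, conference, division, logo_url, summary, topic, link_to_article`). A minimal sketch of reading such a file back with the stdlib `csv` module — the inline sample row here is illustrative, not copied from the file:

```python
import csv
import io

# Nine-column schema written by team_news_scraper.py (field names match
# the fieldnames list in write_to_csv)
FIELDS = ["Team_name", "season", "city", "conference", "division",
          "logo_url", "summary", "topic", "link_to_article"]

# Illustrative sample row in the same quoting style as the file
# (doubled quotes escape quote characters inside a quoted field)
sample = io.StringIO(
    ",".join(FIELDS) + "\n"
    'San Francisco 49ers,2025,San Francisco,NFC,West,,'
    '"A ""quoted"" summary.",San Francisco 49ers News,'
    'https://www.ninersnation.com/example\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["conference"])  # NFC
print(rows[0]["summary"])     # A "quoted" summary.
```

`csv.DictReader` keys each row by the header line, so downstream code (for example, the Neo4j uploader in this commit) can address fields by name rather than position.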
tools/team_news_scraper.py
ADDED
@@ -0,0 +1,370 @@
import os
import csv
import time
from datetime import datetime, timedelta, timezone

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import openai  # Used for LLM summarization

# Load environment variables (for API keys)
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")  # Default to gpt-4o if not set

if not OPENAI_API_KEY:
    print("Warning: OPENAI_API_KEY not found in environment variables. Summarization will be skipped.")
    # Or raise an error if summarization is critical:
    # raise ValueError("OPENAI_API_KEY environment variable is required for summarization.")

TARGET_URL = "https://www.ninersnation.com/san-francisco-49ers-news"
OUTPUT_CSV_FILE = "team_news_articles.csv"
DAYS_TO_SCRAPE = 60  # Scrape articles from the past 60 days
REQUEST_DELAY = 1    # Delay in seconds between requests to be polite

# Flag to enable/disable summarization easily
ENABLE_SUMMARIZATION = bool(OPENAI_API_KEY)
+
def fetch_html(url):
|
29 |
+
"""Fetches HTML content from a URL with error handling."""
|
30 |
+
try:
|
31 |
+
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}) # Basic user-agent
|
32 |
+
response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
|
33 |
+
return response.text
|
34 |
+
except requests.exceptions.RequestException as e:
|
35 |
+
print(f"Error fetching {url}: {e}")
|
36 |
+
return None
|
37 |
+
|
38 |
+
def parse_article_list(html_content):
|
39 |
+
"""Parses the main news page to find article links and dates."""
|
40 |
+
print("Parsing article list page...")
|
41 |
+
soup = BeautifulSoup(html_content, 'html.parser')
|
42 |
+
articles = []
|
43 |
+
# SBNation common structure: find compact entry boxes
|
44 |
+
# Note: Class names might change, may need adjustment if scraping fails.
|
45 |
+
article_elements = soup.find_all('div', class_='c-entry-box--compact')
|
46 |
+
if not article_elements:
|
47 |
+
# Fallback: Try another common pattern if the first fails
|
48 |
+
article_elements = soup.find_all('div', class_='p-entry-box')
|
49 |
+
|
50 |
+
print(f"Found {len(article_elements)} potential article elements.")
|
51 |
+
|
52 |
+
for elem in article_elements:
|
53 |
+
# Find the main link within the heading
|
54 |
+
heading = elem.find('h2')
|
55 |
+
link_tag = heading.find('a', href=True) if heading else None
|
56 |
+
|
57 |
+
# Find the time tag for publication date
|
58 |
+
time_tag = elem.find('time', datetime=True)
|
59 |
+
|
60 |
+
if link_tag and time_tag and link_tag['href']:
|
61 |
+
url = link_tag['href']
|
62 |
+
# Ensure the URL is absolute
|
63 |
+
if not url.startswith('http'):
|
64 |
+
# Attempt to join with base URL (requires knowing the base, careful with relative paths)
|
65 |
+
# For now, we'll rely on SBNation typically using absolute URLs or full paths
|
66 |
+
# from urllib.parse import urljoin
|
67 |
+
# base_url = "https://www.ninersnation.com"
|
68 |
+
# url = urljoin(base_url, url)
|
69 |
+
# Let's assume they are absolute for now based on typical SBNation structure
|
70 |
+
print(f"Warning: Found potentially relative URL: {url}. Skipping for now.")
|
71 |
+
continue # Skip potentially relative URLs
|
72 |
+
|
73 |
+
date_str = time_tag['datetime'] # e.g., "2024-05-20T10:00:00-07:00"
|
74 |
+
if url and date_str:
|
75 |
+
articles.append((url, date_str))
|
76 |
+
else:
|
77 |
+
print("Skipping element: Couldn't find link or time tag.") # Debugging
|
78 |
+
|
79 |
+
print(f"Extracted {len(articles)} articles with URL and date.")
|
80 |
+
return articles
|
81 |
+
|
82 |
+
def parse_article_details(html_content, url):
|
83 |
+
"""Parses an individual article page to extract details including raw content."""
|
84 |
+
print(f"Parsing article details for: {url}")
|
85 |
+
soup = BeautifulSoup(html_content, 'html.parser')
|
86 |
+
|
87 |
+
details = {
|
88 |
+
"title": None,
|
89 |
+
"content": None, # This will store the raw content for summarization
|
90 |
+
"publication_date": None,
|
91 |
+
"link_to_article": url,
|
92 |
+
"tags": []
|
93 |
+
}
|
94 |
+
|
95 |
+
# Extract Title (Usually the main H1)
|
96 |
+
title_tag = soup.find('h1') # Find the first H1
|
97 |
+
if title_tag:
|
98 |
+
details['title'] = title_tag.get_text(strip=True)
|
99 |
+
else:
|
100 |
+
print(f"Warning: Title tag (h1) not found for {url}")
|
101 |
+
|
102 |
+
# Extract Publication Date (Look for time tag in byline)
|
103 |
+
# SBNation often uses <span class="c-byline__item"><time ...></span>
|
104 |
+
byline_time_tag = soup.find('span', class_='c-byline__item')
|
105 |
+
time_tag = byline_time_tag.find('time', datetime=True) if byline_time_tag else None
|
106 |
+
if time_tag and time_tag.get('datetime'):
|
107 |
+
details['publication_date'] = time_tag['datetime']
|
108 |
+
else:
|
109 |
+
# Fallback: Search for any time tag with datetime attribute if specific class fails
|
110 |
+
time_tag = soup.find('time', datetime=True)
|
111 |
+
if time_tag and time_tag.get('datetime'):
|
112 |
+
details['publication_date'] = time_tag['datetime']
|
113 |
+
else:
|
114 |
+
print(f"Warning: Publication date tag (time[datetime]) not found for {url}")
|
115 |
+
|
116 |
+
# Extract Content (Paragraphs within the main content div)
|
117 |
+
content_div = soup.find('div', class_='c-entry-content')
|
118 |
+
if content_div:
|
119 |
+
paragraphs = content_div.find_all('p')
|
120 |
+
# Join non-empty paragraphs, ensuring None safety
|
121 |
+
# Store this raw content for potential summarization
|
122 |
+
details['content'] = '\n\n'.join([p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True)])
|
123 |
+
else:
|
124 |
+
print(f"Warning: Content div (div.c-entry-content) not found for {url}")
|
125 |
+
|
126 |
+
# Extract Tags (Look for tags/labels, e.g., under "Filed under:")
|
127 |
+
# SBNation often uses a ul/div with class like 'c-entry-group-labels' or 'c-entry-tags'
|
128 |
+
tags_container = soup.find('ul', class_='m-tags__list') # A common SBNation tag structure
|
129 |
+
if tags_container:
|
130 |
+
tag_elements = tags_container.find_all('a') # Tags are usually links
|
131 |
+
details['tags'] = list(set([tag.get_text(strip=True) for tag in tag_elements if tag.get_text(strip=True)]))
|
132 |
+
else:
|
133 |
+
# Fallback: Look for another potential container like the one in the example text
|
134 |
+
filed_under_div = soup.find('div', class_='c-entry-group-labels') # Another possible class
|
135 |
+
if filed_under_div:
|
136 |
+
tag_elements = filed_under_div.find_all('a')
|
137 |
+
details['tags'] = list(set([tag.get_text(strip=True) for tag in tag_elements if tag.get_text(strip=True)]))
|
138 |
+
else:
|
139 |
+
# Specific structure from example text if needed ('Filed under:' section)
|
140 |
+
# This requires finding the specific structure around 'Filed under:'
|
141 |
+
# Could be more fragile, attempt simpler methods first.
|
142 |
+
print(f"Warning: Tags container not found using common classes for {url}")
|
143 |
+
# Example: Search based on text 'Filed under:' - less reliable
|
144 |
+
# filed_under_header = soup.find(lambda tag: tag.name == 'h2' and 'Filed under:' in tag.get_text())
|
145 |
+
# if filed_under_header:
|
146 |
+
# parent_or_sibling = filed_under_header.parent # Adjust based on actual structure
|
147 |
+
# tag_elements = parent_or_sibling.find_all('a') if parent_or_sibling else []
|
148 |
+
# details['tags'] = list(set([tag.get_text(strip=True) for tag in tag_elements]))
|
149 |
+
|
150 |
+
# Basic validation - ensure essential fields were extracted for basic processing
|
151 |
+
# Content is needed for summarization but might be missing on some pages (e.g., galleries)
|
152 |
+
if not details['title'] or not details['publication_date']:
|
153 |
+
print(f"Failed to extract essential details (title or date) for {url}. Returning None.")
|
154 |
+
return None
|
155 |
+
|
156 |
+
# Content check specifically before returning - needed for summary
|
157 |
+
if not details['content']:
|
158 |
+
print(f"Warning: Missing content for {url}. Summary cannot be generated.")
|
159 |
+
|
160 |
+
return details
|
161 |
+
|
162 |
+
def is_within_timeframe(date_str, days):
|
163 |
+
"""Checks if a date string (ISO format) is within the specified number of days from now."""
|
164 |
+
if not date_str:
|
165 |
+
return False
|
166 |
+
try:
|
167 |
+
# Parse the ISO format date string, handling potential 'Z' for UTC
|
168 |
+
pub_date = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
|
169 |
+
|
170 |
+
# Ensure pub_date is offset-aware (has timezone info)
|
171 |
+
# If fromisoformat gives naive datetime, assume UTC (common practice for 'Z')
|
172 |
+
if pub_date.tzinfo is None or pub_date.tzinfo.utcoffset(pub_date) is None:
|
173 |
+
pub_date = pub_date.replace(tzinfo=timezone.utc) # Assume UTC if naive
|
174 |
+
|
175 |
+
# Get current time as an offset-aware datetime in UTC
|
176 |
+
now_utc = datetime.now(timezone.utc)
|
177 |
+
|
178 |
+
# Calculate the cutoff date
|
179 |
+
cutoff_date = now_utc - timedelta(days=days)
|
180 |
+
|
181 |
+
# Compare offset-aware datetimes
|
182 |
+
return pub_date >= cutoff_date
|
183 |
+
except ValueError as e:
|
184 |
+
print(f"Could not parse date: {date_str}. Error: {e}")
|
185 |
+
return False # Skip if date parsing fails
|
186 |
+
except Exception as e:
|
187 |
+
print(f"Unexpected error during date comparison for {date_str}: {e}")
|
188 |
+
return False
|
189 |
+
|
def generate_summary(article_content):
    """Generates a 3-4 sentence summary using the OpenAI API."""
    if not ENABLE_SUMMARIZATION or not article_content:
        print("Skipping summary generation (disabled or no content).")
        return ""  # Return empty string if summarization is skipped or there is no content

    print("Generating summary...")
    try:
        client = openai.OpenAI(api_key=OPENAI_API_KEY)

        # Limit content length to avoid excessive token usage. GPT-4o's context
        # window is large, but be mindful of cost and speed. Truncate the
        # content itself (not the assembled prompt) so the instructions survive.
        max_content_length = 15000  # approximate character limit
        if len(article_content) > max_content_length:
            print(f"Warning: Content too long ({len(article_content)} chars), truncating for summarization.")
            article_content = article_content[:max_content_length]

        # Simple prompt for summarization
        prompt = f"""Please provide a concise 3-4 sentence summary of the following article content.
Focus on the key information and main points. Do not include any information not present in the text.

---
{article_content}
---

Summary:"""

        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=[
                {"role": "system", "content": "You are an AI assistant tasked with summarizing news articles concisely."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.5,  # Adjust for desired creativity vs. factuality
            max_tokens=150    # Limit summary length
        )

        summary = response.choices[0].message.content.strip()
        print("Summary generated successfully.")
        return summary

    # Catch the specific subclasses before the generic APIError, which would
    # otherwise shadow them
    except openai.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
    except openai.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
    except openai.APIError as e:
        print(f"OpenAI API returned an API error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during summarization: {e}")

    return ""  # Return empty string on failure
def scrape_and_summarize_niners_nation():
    """Main function to scrape, parse, summarize, and return structured data."""
    print("Starting Niners Nation scraping and summarization process...")
    main_page_html = fetch_html(TARGET_URL)
    if not main_page_html:
        print("Failed to fetch the main news page. Exiting.")
        return []

    articles_on_page = parse_article_list(main_page_html)

    scraped_and_summarized_data = []
    cutoff_datetime = datetime.now(timezone.utc) - timedelta(days=DAYS_TO_SCRAPE)
    print(f"Filtering articles published since {cutoff_datetime.strftime('%Y-%m-%d %H:%M:%S %Z')}")

    processed_urls = set()

    for url, date_str in articles_on_page:
        if url in processed_urls:
            continue

        if not is_within_timeframe(date_str, DAYS_TO_SCRAPE):
            continue

        print(f"Fetching article: {url}")
        article_html = fetch_html(url)
        if article_html:
            details = parse_article_details(article_html, url)
            if details:
                # Generate a summary if content exists and summarization is enabled
                article_summary = ""
                if details.get('content'):
                    article_summary = generate_summary(details['content'])
                else:
                    print(f"Skipping summary for {url} due to missing content.")

                details['summary'] = article_summary

                # Structure the data (now including the summary) as a CSV row
                structured_row = structure_data_for_csv_row(details)
                if structured_row:
                    scraped_and_summarized_data.append(structured_row)
                    processed_urls.add(url)
                    print(f"Successfully scraped and summarized: {details['title']}")
                else:
                    print(f"Failed to structure data for {url}")
            else:
                print(f"Failed to parse essential details for article: {url}")
        else:
            print(f"Failed to fetch article page: {url}")

        print(f"Waiting for {REQUEST_DELAY} second(s)...")
        time.sleep(REQUEST_DELAY)

    print(f"Scraping & summarization finished. Collected {len(scraped_and_summarized_data)} articles.")
    return scraped_and_summarized_data
def structure_data_for_csv_row(article_details):
    """Processes a single article's details into the final CSV structure."""
    current_year = datetime.now().year

    # Derive the season from the publication date, defaulting to the current year
    season = current_year
    pub_date_str = article_details.get("publication_date")
    if pub_date_str:
        try:
            pub_date = datetime.fromisoformat(pub_date_str.replace('Z', '+00:00'))
            season = pub_date.year
        except ValueError:
            print(f"Warning: Could not parse date '{pub_date_str}' for season. Using default {current_year}.")

    # Format the tags as a comma-separated topic string
    tags = article_details.get("tags", [])
    topic = ", ".join(tags) if tags else "General News"

    # Build the dictionary for the CSV row
    return {
        "Team_name": "San Francisco 49ers",
        "season": season,
        "city": "San Francisco",
        "conference": "NFC",
        "division": "West",
        "logo_url": "",
        "summary": article_details.get("summary", ""),  # The generated summary
        "topic": topic,
        "link_to_article": article_details.get("link_to_article", ""),
    }
def write_to_csv(data, filename):
    """Writes the structured data to a CSV file."""
    if not data:
        print("No data to write to CSV.")
        return

    fieldnames = [
        "Team_name", "season", "city", "conference", "division",
        "logo_url", "summary", "topic", "link_to_article"
    ]

    if not all(key in data[0] for key in fieldnames):
        print("Error: Mismatch between defined fieldnames and data keys.")
        print(f"Expected: {fieldnames}")
        print(f"Got keys: {list(data[0].keys())}")
        return

    print(f"Writing {len(data)} rows to {filename}...")
    try:
        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f"Successfully wrote {len(data)} rows to {filename}")
    except IOError as e:
        print(f"Error writing to CSV file {filename}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during CSV writing: {e}")
# --- Main Execution ---
if __name__ == "__main__":
    # Run the orchestrator, which scrapes and summarizes in one pass
    processed_articles = scrape_and_summarize_niners_nation()

    if processed_articles:
        write_to_csv(processed_articles, OUTPUT_CSV_FILE)
    else:
        print("No articles were processed.")
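The date-window filter is the part of the script most sensitive to timezone handling (naive vs. offset-aware datetimes). A standalone re-implementation of `is_within_timeframe` with the same logic, useful as a quick sanity check without running the scraper:

```python
from datetime import datetime, timedelta, timezone

def is_within_timeframe(date_str, days):
    """Same logic as the scraper: parse an ISO date string, assume UTC
    if the result is naive, and compare against a cutoff `days` ago."""
    if not date_str:
        return False
    try:
        pub_date = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
        if pub_date.tzinfo is None or pub_date.tzinfo.utcoffset(pub_date) is None:
            pub_date = pub_date.replace(tzinfo=timezone.utc)
        cutoff = datetime.now(timezone.utc) - timedelta(days=days)
        return pub_date >= cutoff
    except ValueError:
        return False

# Dates built relative to "now" so the check holds whenever it is run
recent = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
old = (datetime.now(timezone.utc) - timedelta(days=120)).isoformat()
print(is_within_timeframe(recent, 60))  # True
print(is_within_timeframe(old, 60))     # False
```

Because both the parsed date and the cutoff are made offset-aware before comparison, mixed `Z`-suffixed and `-07:00`-offset timestamps from the site are handled uniformly.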