Upload 9 files
- Dockerfile +24 -0
- README.md +172 -11
- analysis_service.py +288 -0
- app.py +237 -0
- docker-compose.yml +20 -0
- models.py +68 -0
- requirements.txt +10 -0
- twitter_service.py +945 -0
- vercel.json +15 -0
Dockerfile
ADDED
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create logs directory for application logs
RUN mkdir -p logs

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV LOG_LEVEL=INFO

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md
CHANGED
@@ -1,11 +1,172 @@
# WesternFront: India-Pakistan Conflict Tracker API

A FastAPI application that leverages unofficial Twitter access via Twikit and Google's Gemini AI to monitor and analyze India-Pakistan tensions in real-time.

## Overview

WesternFront is an AI-powered conflict tracker that:

1. Collects tweets from reliable news sources covering India-Pakistan relations without using the official Twitter API
2. Analyzes these tweets using Google's Gemini AI to assess the current conflict situation
3. Provides RESTful endpoints to access the analysis
4. Updates the analysis periodically and on demand

## Core Components

### Twitter Data Collection
- Uses [Twikit](https://github.com/d60/twikit) for unofficial Twitter access
- Fetches tweets from a predefined list of reliable sources
- Implements caching to avoid unnecessary requests

### AI Analysis with Gemini
- Analyzes collected tweets to assess India-Pakistan tensions
- Generates comprehensive reports including:
  - Current situation summary
  - Key developments in the last 24-48 hours
  - Information reliability assessment
  - Regional stability implications
  - Tension level classification (Low/Medium/High/Critical)

### FastAPI Server
- Endpoint for on-demand analysis updates
- Endpoint to get the latest analysis
- Background task system for periodic updates
- Health check endpoint
- Source list and keyword management

## Getting Started

### Prerequisites

- Python 3.9+
- Docker (optional)

### Environment Setup

1. Clone the repository
2. Copy `.env.example` to `.env` and fill in the required values:

```
# Twitter Credentials
TWITTER_USERNAME=your_twitter_username
TWITTER_PASSWORD=your_twitter_password
TWITTER_EMAIL=your_twitter_email

# Google Gemini API Key
GEMINI_API_KEY=your_gemini_api_key

# Application Settings
UPDATE_INTERVAL_MINUTES=60
CACHE_EXPIRY_MINUTES=120
LOG_LEVEL=INFO
```

### Installation

#### Local Development

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
uvicorn app:app --reload
```

#### Docker Deployment

```bash
# Build the Docker image
docker build -t westernfront .

# Run the container
docker run -p 8000:8000 --env-file .env westernfront
```

## API Endpoints

### Root Endpoint
- `GET /`: Basic API information

### Health Check
- `GET /health`: Check the health of the API and its components

### Analysis
- `GET /analysis`: Get the latest conflict analysis
- `POST /analysis/update`: Trigger an analysis update
  - Request Body: `{ "force": boolean }` (optional, defaults to false)

### News Sources
- `GET /sources`: Get the current list of news sources
- `POST /sources`: Update the list of news sources
  - Request Body: Array of NewsSource objects

### Keywords
- `GET /keywords`: Get the current search keywords
- `POST /keywords`: Update the search keywords
  - Request Body: Array of strings

### Tension Levels
- `GET /tension-levels`: Get the available tension levels

## Data Models

### News Source
```json
{
  "name": "BBC News",
  "twitter_handle": "BBCWorld",
  "country": "UK",
  "reliability_score": 0.9,
  "is_active": true
}
```

### Conflict Analysis
```json
{
  "analysis_id": "uuid",
  "generated_at": "2023-05-01T12:00:00Z",
  "latest_status": "...",
  "situation_summary": "...",
  "key_developments": [
    {
      "title": "Development 1",
      "description": "...",
      "sources": ["@BBCWorld", "@Reuters"],
      "timestamp": "2023-05-01T10:30:00Z"
    }
  ],
  "reliability_assessment": "...",
  "regional_implications": "...",
  "tension_level": "Medium",
  "source_tweets": [],
  "update_triggered_by": "scheduled"
}
```

## Implementation Notes

- The application uses asyncio for handling concurrent requests
- Implements in-memory caching (can be extended to Redis)
- Rate limiting and throttling for Twitter scraping to avoid blocking
- Proper error handling and logging via loguru
- Secure credential management via environment variables

## Future Enhancements

- Redis integration for more robust caching
- User authentication for API access
- Email/notification alerts for critical tension levels
- Historical data storage and trend analysis
- Additional data sources beyond Twitter

## License

MIT License

## Disclaimer

This application is designed for educational and research purposes. The analysis provided should not be used as the sole source for critical decision-making related to regional conflicts.
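For illustration, here is a minimal client sketch for the source, keyword, and tension-level endpoints documented in the README above. It is not part of the uploaded files and assumes the API is already running locally on port 8000.

```python
# Minimal sketch of the management endpoints described in the README above.
# Assumes the API is reachable at http://localhost:8000; not part of the repository.
import httpx

BASE_URL = "http://localhost:8000"

sources = [
    {
        "name": "BBC News",
        "twitter_handle": "BBCWorld",
        "country": "UK",
        "reliability_score": 0.9,
        "is_active": True,
    }
]

with httpx.Client(base_url=BASE_URL, timeout=30.0) as client:
    print(client.post("/sources", json=sources).json())        # {"message": "News sources updated", "count": 1}
    print(client.post("/keywords", json=["Kashmir", "LOC"]).json())
    print(client.get("/tension-levels").json())                 # ["Low", "Medium", "High", "Critical"]
```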
analysis_service.py
ADDED
@@ -0,0 +1,288 @@
import os
import uuid
from datetime import datetime
from typing import Dict, List

import google.generativeai as genai
from loguru import logger
from tenacity import RetryError, retry, stop_after_attempt, wait_exponential

from models import ConflictAnalysis, KeyDevelopment, TensionLevel, Tweet


class AnalysisService:
    """Service for analyzing tweets using Google's Gemini AI."""

    def __init__(self):
        self.api_key = os.getenv("GEMINI_API_KEY")
        self.model = None
        self.search_keywords = [
            "India Pakistan", "Kashmir", "LOC", "Line of Control",
            "border tension", "ceasefire", "military", "diplomatic relations",
            "India-Pakistan", "cross-border", "terrorism", "bilateral relations"
        ]
        self.initialize()

    def initialize(self) -> bool:
        """Initialize the Gemini AI client."""
        if not self.api_key:
            logger.error("GEMINI_API_KEY not provided")
            return False

        try:
            logger.info("Initializing Gemini AI")
            genai.configure(api_key=self.api_key)
            # Configure model with lower temperature for more factual responses
            generation_config = {
                "temperature": 0.1,
                "top_p": 0.95,
                "top_k": 40
            }
            self.model = genai.GenerativeModel('gemini-2.0-flash', generation_config=generation_config)
            logger.info("Gemini AI initialized successfully")
            return True
        except Exception as e:
            logger.error(f"Failed to initialize Gemini AI: {str(e)}")
            return False

    def _prepare_prompt(self, tweets: List[Tweet]) -> str:
        """Prepare the prompt for analysis with intelligence sources data."""
        # Sort tweets by recency to help with latest status identification
        sorted_tweets = sorted(tweets, key=lambda x: x.created_at if hasattr(x, 'created_at') else datetime.now(), reverse=True)

        source_entries = [
            f"DATA POINT {i+1}: [TIMESTAMP: {tweet.created_at if hasattr(tweet, 'created_at') else 'unknown'}, SOURCE: @{tweet.author}]\n{tweet.text}"
            for i, tweet in enumerate(sorted_tweets)
        ]
        intelligence_data = "\n\n".join(source_entries)

        prompt = f"""
INTELLIGENCE BRIEF: INDIA-PAKISTAN SITUATION ANALYSIS
DATE: {datetime.now().strftime("%Y-%m-%d")}
CLASSIFICATION: STRATEGIC ASSESSMENT

SOURCE DATA:
{intelligence_data}

ANALYTICAL PARAMETERS:
- Analyze the data points objectively without commentary
- Identify factual developments and official statements
- Assess tension levels based on concrete actions and statements
- Maintain professional, analytical tone throughout
- Cite specific data points in all assessments
- Do not introduce information not present in the data points
- Include exact timestamps when available

REQUIRED OUTPUT FORMAT:
{{
    "latest_status": "Most recent significant development with exact timestamp and source citation",
    "situation_summary": "Precise assessment of current Indo-Pak situation with timestamps and citations",
    "key_developments": [
        {{
            "title": "Precise event designation",
            "description": "Factual account with supporting evidence and timestamps",
            "sources": ["@source1", "@source2"]
        }}
    ],
    "reliability_assessment": {{
        "source_credibility": "Assessment of source authority and reliability",
        "information_gaps": "Specific identification of intelligence gaps",
        "confidence_rating": "HIGH|MEDIUM|LOW based on data quality"
    }},
    "regional_implications": {{
        "security": "Concrete security implications based on factual developments",
        "diplomatic": "Diplomatic consequences with specific references",
        "economic": "Economic impacts if applicable to current situation"
    }},
    "tension_level": "LOW|MEDIUM|HIGH|CRITICAL",
    "tension_rationale": "Specific evidence supporting tension level assessment"
}}

IMPORTANT DIRECTIVES:
- Return ONLY valid JSON without any additional text or markdown formatting
- Do not use conversational language or first-person perspective
- Focus on factual analysis, not speculation
- Prioritize verified information from official channels
- Highlight the most recent developments in the latest_status section
"""
        return prompt

    @retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(3))
    async def _call_gemini(self, prompt: str) -> Dict:
        """Call the Gemini API with retry logic and improved parsing."""
        if not self.model:
            if not self.initialize():
                logger.error("Could not analyze tweets, Gemini AI not initialized")
                raise Exception("Gemini AI initialization failed")

        try:
            logger.info("Calling Gemini API for conflict analysis")
            response = await self.model.generate_content_async(prompt)
            result = response.text

            import json
            import re

            # Better JSON extraction with multiple patterns
            json_match = re.search(r'```(?:json)?\n(.*?)\n```', result, re.DOTALL)
            if json_match:
                result = json_match.group(1)
            else:
                # Try to find JSON objects with or without formatting
                json_pattern = r'({[\s\S]*})'
                json_match = re.search(json_pattern, result)
                if json_match:
                    result = json_match.group(1)

            # Clean the result of any non-JSON content
            result = re.sub(r'```', '', result).strip()

            # Parse JSON with error handling
            try:
                analysis_data = json.loads(result)
                logger.info("Successfully received and parsed Gemini response")
                return analysis_data
            except json.JSONDecodeError as e:
                logger.error(f"JSON parsing error: {str(e)}")
                # Attempt cleanup and retry parsing
                result = re.sub(r'[\n\r\t]', ' ', result)
                result = re.search(r'({.*})', result).group(1) if re.search(r'({.*})', result) else result
                analysis_data = json.loads(result)
                logger.info("Successfully parsed Gemini response after cleanup")
                return analysis_data

        except Exception as e:
            logger.error(f"Error calling Gemini API: {str(e)}")
            logger.debug(f"Raw response content: {result if 'result' in locals() else 'No response'}")
            raise

    def _extract_tension_level(self, level_text: str) -> TensionLevel:
        """Extract tension level enum from text."""
        level_text = level_text.lower()
        if "critical" in level_text:
            return TensionLevel.CRITICAL
        elif "high" in level_text:
            return TensionLevel.HIGH
        elif "medium" in level_text:
            return TensionLevel.MEDIUM
        else:
            return TensionLevel.LOW

    def _process_key_developments(self, developments_data: List[Dict]) -> List[KeyDevelopment]:
        """Process key developments from API response."""
        key_developments = []
        for dev in developments_data:
            key_developments.append(
                KeyDevelopment(
                    title=dev.get("title", "Unnamed Development"),
                    description=dev.get("description", "No description provided"),
                    sources=dev.get("sources", []),
                    timestamp=datetime.now()
                )
            )
        return key_developments

    def _format_reliability_assessment(self, reliability_data: Dict) -> str:
        """Format reliability assessment data into a structured string."""
        if isinstance(reliability_data, str):
            return reliability_data

        if isinstance(reliability_data, dict):
            sections = []
            if "source_credibility" in reliability_data:
                sections.append(f"SOURCE CREDIBILITY: {reliability_data['source_credibility']}")
            if "information_gaps" in reliability_data:
                sections.append(f"INFORMATION GAPS: {reliability_data['information_gaps']}")
            if "confidence_rating" in reliability_data:
                sections.append(f"CONFIDENCE: {reliability_data['confidence_rating']}")

            if sections:
                return "\n\n".join(sections)

        return str(reliability_data)

    def _format_regional_implications(self, implications_data: Dict) -> str:
        """Format regional implications data into a structured string."""
        if isinstance(implications_data, str):
            return implications_data

        if isinstance(implications_data, dict):
            sections = []
            if "security" in implications_data:
                sections.append(f"SECURITY: {implications_data['security']}")
            if "diplomatic" in implications_data:
                sections.append(f"DIPLOMATIC: {implications_data['diplomatic']}")
            if "economic" in implications_data:
                sections.append(f"ECONOMIC: {implications_data['economic']}")

            if sections:
                return "\n\n".join(sections)

        return str(implications_data)

    async def analyze_tweets(self, tweets: List[Tweet], trigger: str = "scheduled") -> ConflictAnalysis:
        """Analyze tweets using Gemini AI and generate a conflict analysis."""
        if not tweets:
            logger.warning("No tweets provided for analysis")
            return None

        try:
            prompt = self._prepare_prompt(tweets)
            analysis_data = await self._call_gemini(prompt)

            # Process and extract data with proper error handling
            key_developments = self._process_key_developments(analysis_data.get("key_developments", []))

            # Format complex nested structures if present
            reliability_assessment = self._format_reliability_assessment(
                analysis_data.get("reliability_assessment", "No reliability assessment provided")
            )

            regional_implications = self._format_regional_implications(
                analysis_data.get("regional_implications", "No regional implications provided")
            )

            # Extract tension rationale if available
            tension_info = analysis_data.get("tension_level", "Low")
            tension_rationale = analysis_data.get("tension_rationale", "")

            # Combine tension level and rationale if both exist
            if tension_rationale:
                tension_display = f"{tension_info} - {tension_rationale}"
            else:
                tension_display = tension_info

            # Get the latest status
            latest_status = analysis_data.get("latest_status", "No recent status update available")

            analysis = ConflictAnalysis(
                analysis_id=str(uuid.uuid4()),
                generated_at=datetime.now(),
                situation_summary=analysis_data.get("situation_summary", "No summary provided"),
                key_developments=key_developments,
                reliability_assessment=reliability_assessment,
                regional_implications=regional_implications,
                tension_level=self._extract_tension_level(tension_display),
                source_tweets=tweets,
                update_triggered_by=trigger,
                latest_status=latest_status  # Added new parameter
            )

            logger.info(f"Generated conflict analysis with ID: {analysis.analysis_id}")
            return analysis

        except RetryError as e:
            logger.error(f"Failed to generate analysis after multiple retries: {str(e)}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error in tweet analysis: {str(e)}")
            return None

    def get_search_keywords(self) -> List[str]:
        """Get the current search keywords."""
        return self.search_keywords

    def update_search_keywords(self, keywords: List[str]) -> None:
        """Update the search keywords."""
        self.search_keywords = keywords
        logger.info(f"Updated search keywords. New count: {len(keywords)}")
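The `_call_gemini` method above strips markdown fences from the model reply before parsing it as JSON. The standalone sketch below, which is not part of the upload and uses an invented sample reply, replays that extraction step so the tolerated formats are easier to see.

```python
# Standalone replay of the JSON-extraction step in _call_gemini above.
# The sample reply is invented; it mimics a Gemini answer wrapped in a markdown fence.
import json
import re

fence = "`" * 3
sample_reply = (
    "Here is the assessment:\n"
    f"{fence}json\n"
    '{"tension_level": "MEDIUM", "tension_rationale": "Routine statements only"}\n'
    f"{fence}"
)

result = sample_reply
# First try a fenced ```json block, as the service does.
match = re.search(r'```(?:json)?\n(.*?)\n```', result, re.DOTALL)
if match:
    result = match.group(1)
else:
    # Otherwise fall back to the first brace-delimited object.
    match = re.search(r'({[\s\S]*})', result)
    if match:
        result = match.group(1)

result = re.sub(r'```', '', result).strip()
data = json.loads(result)
print(data["tension_level"])  # MEDIUM
```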
app.py
ADDED
@@ -0,0 +1,237 @@
import asyncio
import os
from datetime import datetime
from typing import Dict, List, Optional

from dotenv import load_dotenv
from fastapi import BackgroundTasks, Depends, FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware
from loguru import logger

from analysis_service import AnalysisService
from models import (ConflictAnalysis, HealthCheck, NewsSource, TensionLevel,
                    Tweet, UpdateRequest)
from twitter_service import TwitterService

# Load environment variables from .env file
load_dotenv()

# Configure logging
os.makedirs("logs", exist_ok=True)
logger.add("logs/app.log", rotation="500 MB", level=os.getenv("LOG_LEVEL", "INFO"))

# Create FastAPI application
app = FastAPI(
    title="WesternFront API",
    description="AI-powered conflict tracker for monitoring India-Pakistan tensions",
    version="1.0.0"
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Adjust this for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Services
twitter_service = TwitterService()
analysis_service = AnalysisService()

# In-memory store for latest analysis
latest_analysis: Optional[ConflictAnalysis] = None
last_update_time: Optional[datetime] = None


async def get_twitter_service() -> TwitterService:
    """Dependency to get the Twitter service."""
    return twitter_service


async def get_analysis_service() -> AnalysisService:
    """Dependency to get the Analysis service."""
    return analysis_service


@app.on_event("startup")
async def startup_event():
    """Initialize services on startup."""
    logger.info("Starting up WesternFront API")

    # Initialize Twitter service
    initialized = await twitter_service.initialize()
    if not initialized:
        logger.warning("Twitter service initialization failed. Some features may not work.")

    # Schedule first update
    background_tasks = BackgroundTasks()
    background_tasks.add_task(update_analysis_task)

    # Set up periodic update task
    asyncio.create_task(periodic_update())


@app.on_event("shutdown")
async def shutdown_event():
    """Clean up resources on shutdown."""
    logger.info("Shutting down WesternFront API")
    if twitter_service and hasattr(twitter_service, 'close'):
        await twitter_service.close()


async def update_analysis_task(trigger: str = "scheduled") -> None:
    """Task to update the conflict analysis."""
    global latest_analysis, last_update_time

    try:
        logger.info(f"Starting analysis update ({trigger})")

        # Get tweets related to India-Pakistan conflict
        keywords = analysis_service.get_search_keywords()
        tweets = await twitter_service.get_related_tweets(keywords, days_back=2)

        if not tweets:
            logger.warning("No relevant tweets found for analysis")
            return

        logger.info(f"Found {len(tweets)} relevant tweets for analysis")

        # Analyze tweets
        analysis = await analysis_service.analyze_tweets(tweets, trigger)

        if analysis:
            latest_analysis = analysis
            last_update_time = datetime.now()
            logger.info(f"Analysis updated successfully. Tension level: {analysis.tension_level}")
        else:
            logger.error("Failed to generate analysis")

    except Exception as e:
        logger.error(f"Error in update_analysis_task: {str(e)}")


async def periodic_update() -> None:
    """Periodically update the analysis."""
    update_interval = int(os.getenv("UPDATE_INTERVAL_MINUTES", 60))

    while True:
        try:
            await asyncio.sleep(update_interval * 60)  # Convert to seconds
            await update_analysis_task("scheduled")
        except Exception as e:
            logger.error(f"Error in periodic_update: {str(e)}")
            await asyncio.sleep(300)  # Wait 5 minutes if there was an error


@app.get("/", response_model=Dict)
async def root():
    """Root endpoint with basic information about the API."""
    return {
        "name": "WesternFront API",
        "description": "AI-powered conflict tracker for India-Pakistan tensions",
        "version": "1.0.0"
    }


@app.get("/health", response_model=HealthCheck)
async def health_check():
    """Health check endpoint."""
    twitter_initialized = hasattr(twitter_service, 'http_client') and twitter_service.http_client is not None
    gemini_initialized = analysis_service.model is not None

    return HealthCheck(
        status="healthy",
        version="1.0.0",
        timestamp=datetime.now(),
        last_update=last_update_time,
        components_status={
            "twitter_service": twitter_initialized,
            "analysis_service": gemini_initialized
        }
    )


@app.get("/analysis", response_model=Optional[ConflictAnalysis])
async def get_latest_analysis():
    """Get the latest conflict analysis."""
    if not latest_analysis:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="No analysis available yet. Try triggering an update."
        )
    return latest_analysis


@app.post("/analysis/update", response_model=Dict)
async def trigger_update(
    request: UpdateRequest,
    background_tasks: BackgroundTasks
):
    """Trigger an analysis update."""
    if request.force:
        # Clear cache to force fresh tweets
        twitter_service.tweet_cache.clear()

    # Add update task to background tasks
    background_tasks.add_task(update_analysis_task, "manual")

    return {
        "message": "Analysis update triggered",
        "timestamp": datetime.now(),
        "force_refresh": request.force
    }


@app.get("/sources", response_model=List[NewsSource])
async def get_news_sources(
    twitter: TwitterService = Depends(get_twitter_service)
):
    """Get the current list of news sources."""
    return twitter.get_sources()


@app.post("/sources", response_model=Dict)
async def update_news_sources(
    sources: List[NewsSource],
    twitter: TwitterService = Depends(get_twitter_service)
):
    """Update the list of news sources."""
    twitter.update_sources(sources)
    return {
        "message": "News sources updated",
        "count": len(sources)
    }


@app.get("/keywords", response_model=List[str])
async def get_search_keywords(
    analysis: AnalysisService = Depends(get_analysis_service)
):
    """Get the current search keywords."""
    return analysis.get_search_keywords()


@app.post("/keywords", response_model=Dict)
async def update_search_keywords(
    keywords: List[str],
    analysis: AnalysisService = Depends(get_analysis_service)
):
    """Update the search keywords."""
    analysis.update_search_keywords(keywords)
    return {
        "message": "Search keywords updated",
        "count": len(keywords)
    }


@app.get("/tension-levels", response_model=List[str])
async def get_tension_levels():
    """Get the available tension levels."""
    return [level.value for level in TensionLevel]


if __name__ == "__main__":
    import uvicorn
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
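Because `update_analysis_task` runs as a FastAPI background task, `GET /analysis` keeps returning 404 until the first update has finished. The rough sketch below, not part of the upload and assuming a locally running server, shows one way a caller can trigger an update and then poll for the result.

```python
# Rough polling sketch against a locally running app.py; not part of the upload.
# POST /analysis/update only schedules the work, so poll /analysis until it exists.
import asyncio
from typing import Optional

import httpx


async def wait_for_analysis(base_url: str = "http://localhost:8000", attempts: int = 10) -> Optional[dict]:
    async with httpx.AsyncClient(base_url=base_url, timeout=30.0) as client:
        await client.post("/analysis/update", json={"force": False})
        for _ in range(attempts):
            resp = await client.get("/analysis")
            if resp.status_code == 200:
                return resp.json()
            await asyncio.sleep(30)  # scraping plus Gemini analysis can take a while
    return None


if __name__ == "__main__":
    print(asyncio.run(wait_for_analysis()))
```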
docker-compose.yml
ADDED
@@ -0,0 +1,20 @@
version: '3'

services:
  westernfront-api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    volumes:
      - ./logs:/app/logs
    env_file:
      - .env
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
models.py
ADDED
@@ -0,0 +1,68 @@
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional

from pydantic import BaseModel, Field


class TensionLevel(str, Enum):
    """Enum for tension levels between India and Pakistan."""
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    CRITICAL = "Critical"


class NewsSource(BaseModel):
    """Model for a news source."""
    name: str
    twitter_handle: str
    country: str
    reliability_score: float = Field(ge=0.0, le=1.0)
    is_active: bool = True


class Tweet(BaseModel):
    """Model for a tweet."""
    id: str
    text: str
    author: str
    created_at: datetime
    engagement: Dict[str, int] = {"likes": 0, "retweets": 0, "replies": 0, "views": 0}
    url: str


class KeyDevelopment(BaseModel):
    """Model for a key development in the conflict."""
    title: str
    description: str
    sources: List[str]
    timestamp: Optional[datetime] = None


class ConflictAnalysis(BaseModel):
    """Model for a conflict analysis."""
    analysis_id: str
    generated_at: datetime
    latest_status: str  # Added this field
    situation_summary: str
    key_developments: List[KeyDevelopment]
    reliability_assessment: str
    regional_implications: str
    tension_level: TensionLevel
    source_tweets: List[Tweet]
    update_triggered_by: str


class UpdateRequest(BaseModel):
    """Model for an update request."""
    force: bool = False


class HealthCheck(BaseModel):
    """Model for a health check response."""
    status: str
    version: str
    timestamp: datetime
    last_update: Optional[datetime] = None
    components_status: Dict[str, bool]
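As a quick check of how these models fit together, here is a sketch that builds and serializes a `ConflictAnalysis` with Pydantic v2 (the version pinned in requirements.txt). It is not part of the upload and every field value is invented for illustration.

```python
# Sketch showing how the models above validate and serialize; not part of the upload.
# All field values are invented for illustration.
from datetime import datetime, timezone

from models import ConflictAnalysis, KeyDevelopment, TensionLevel

analysis = ConflictAnalysis(
    analysis_id="demo-0001",
    generated_at=datetime.now(timezone.utc),
    latest_status="No significant change reported in the sampled sources.",
    situation_summary="Routine diplomatic statements; no new incidents in the data points.",
    key_developments=[
        KeyDevelopment(
            title="Example development",
            description="Placeholder description.",
            sources=["@BBCWorld"],
        )
    ],
    reliability_assessment="CONFIDENCE: LOW",
    regional_implications="SECURITY: No concrete implications identified.",
    tension_level=TensionLevel.LOW,
    source_tweets=[],
    update_triggered_by="manual",
)

print(analysis.model_dump_json(indent=2))  # the same shape served by GET /analysis
```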
requirements.txt
ADDED
@@ -0,0 +1,10 @@
fastapi==0.103.1
uvicorn[standard]==0.23.2
python-dotenv==1.0.0
loguru==0.7.0
google-generativeai==0.3.0
tenacity==8.2.2
cachetools==5.3.0
pydantic==2.3.0
httpx==0.24.1
beautifulsoup4==4.12.2
twitter_service.py
ADDED
@@ -0,0 +1,945 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import asyncio
|
2 |
+
import json
|
3 |
+
import os
|
4 |
+
import re
|
5 |
+
import time
|
6 |
+
import random
|
7 |
+
from datetime import datetime, timedelta
|
8 |
+
from typing import Dict, List, Optional, Tuple
|
9 |
+
from urllib.parse import urlparse, quote
|
10 |
+
|
11 |
+
import httpx
|
12 |
+
from bs4 import BeautifulSoup
|
13 |
+
from cachetools import TTLCache
|
14 |
+
from fastapi import HTTPException
|
15 |
+
from loguru import logger
|
16 |
+
|
17 |
+
from models import NewsSource, Tweet
|
18 |
+
|
19 |
+
|
20 |
+
class FingerprintRandomizer:
|
21 |
+
"""Randomizes browser fingerprints to evade detection"""
|
22 |
+
|
23 |
+
def __init__(self):
|
24 |
+
# Common screen resolutions
|
25 |
+
self.resolutions = [
|
26 |
+
(1920, 1080), (1366, 768), (1280, 720),
|
27 |
+
(1440, 900), (1536, 864), (2560, 1440),
|
28 |
+
(1680, 1050), (1920, 1200), (1024, 768)
|
29 |
+
]
|
30 |
+
|
31 |
+
# Common color depths
|
32 |
+
self.color_depths = [24, 30, 32]
|
33 |
+
|
34 |
+
# Common platforms
|
35 |
+
self.platforms = [
|
36 |
+
"Win32", "MacIntel", "Linux x86_64",
|
37 |
+
"Linux armv8l", "iPhone", "iPad"
|
38 |
+
]
|
39 |
+
|
40 |
+
# Browser variants
|
41 |
+
self.browsers = ["Chrome", "Firefox", "Safari", "Edge"]
|
42 |
+
|
43 |
+
# Common languages
|
44 |
+
self.languages = [
|
45 |
+
"en-US", "en-GB", "en-CA", "fr-FR", "de-DE",
|
46 |
+
"es-ES", "it-IT", "pt-BR", "ja-JP", "zh-CN"
|
47 |
+
]
|
48 |
+
|
49 |
+
# Common timezone offsets
|
50 |
+
self.timezone_offsets = [-60, -120, -180, -240, 0, 60, 120, 180, 330, 480, 540]
|
51 |
+
|
52 |
+
def generate_headers(self):
|
53 |
+
"""Generate randomized headers that mimic a real browser"""
|
54 |
+
browser = random.choice(self.browsers)
|
55 |
+
platform = random.choice(self.platforms)
|
56 |
+
language = random.choice(self.languages)
|
57 |
+
|
58 |
+
user_agent = self._generate_user_agent(browser, platform)
|
59 |
+
|
60 |
+
headers = {
|
61 |
+
"User-Agent": user_agent,
|
62 |
+
"Accept": self._generate_accept_header(browser),
|
63 |
+
"Accept-Language": f"{language},en;q=0.9",
|
64 |
+
"Accept-Encoding": "gzip, deflate, br",
|
65 |
+
"Connection": "keep-alive",
|
66 |
+
}
|
67 |
+
|
68 |
+
# Add browser-specific headers
|
69 |
+
if browser == "Chrome" or browser == "Edge":
|
70 |
+
headers["sec-ch-ua"] = f'"Google Chrome";v="{random.randint(90, 110)}", "Chromium";v="{random.randint(90, 110)}"'
|
71 |
+
headers["sec-ch-ua-mobile"] = "?0"
|
72 |
+
headers["sec-ch-ua-platform"] = f'"{platform}"'
|
73 |
+
|
74 |
+
# Randomize header order (matters for fingerprinting)
|
75 |
+
return dict(sorted(headers.items(), key=lambda x: random.random()))
|
76 |
+
|
77 |
+
def _generate_user_agent(self, browser, platform):
|
78 |
+
"""Generate a realistic user agent string"""
|
79 |
+
if browser == "Chrome":
|
80 |
+
chrome_version = f"{random.randint(90, 110)}.0.{random.randint(1000, 9999)}.{random.randint(10, 999)}"
|
81 |
+
if "Win" in platform:
|
82 |
+
return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
|
83 |
+
elif "Mac" in platform:
|
84 |
+
return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{random.randint(11, 15)}_{random.randint(1, 7)}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
|
85 |
+
else:
|
86 |
+
return f"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
|
87 |
+
elif browser == "Firefox":
|
88 |
+
ff_version = f"{random.randint(80, 100)}.0"
|
89 |
+
if "Win" in platform:
|
90 |
+
return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
|
91 |
+
elif "Mac" in platform:
|
92 |
+
return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.{random.randint(11, 15)}; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
|
93 |
+
else:
|
94 |
+
return f"Mozilla/5.0 (X11; Linux i686; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
|
95 |
+
elif browser == "Safari":
|
96 |
+
webkit_version = f"605.1.{random.randint(1, 15)}"
|
97 |
+
safari_version = f"{random.randint(13, 16)}.{random.randint(0, 1)}"
|
98 |
+
return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{random.randint(11, 15)}_{random.randint(1, 7)}) AppleWebKit/{webkit_version} (KHTML, like Gecko) Version/{safari_version} Safari/{webkit_version}"
|
99 |
+
elif browser == "Edge":
|
100 |
+
edge_version = f"{random.randint(90, 110)}.0.{random.randint(1000, 9999)}.{random.randint(10, 999)}"
|
101 |
+
return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{edge_version} Safari/537.36 Edg/{edge_version}"
|
102 |
+
|
103 |
+
def _generate_accept_header(self, browser):
|
104 |
+
"""Generate browser-specific Accept header"""
|
105 |
+
if browser == "Chrome" or browser == "Edge":
|
106 |
+
return "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
|
107 |
+
elif browser == "Firefox":
|
108 |
+
return "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
|
109 |
+
elif browser == "Safari":
|
110 |
+
return "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
|
111 |
+
|
112 |
+
|
113 |
+
class CookieManager:
|
114 |
+
"""Intelligently manages cookies to maintain sessions"""
|
115 |
+
|
116 |
+
def __init__(self):
|
117 |
+
self.cookies_by_domain = {}
|
118 |
+
self.cookie_jar_path = os.path.join(os.path.dirname(__file__), '.cookie_store')
|
119 |
+
os.makedirs(self.cookie_jar_path, exist_ok=True)
|
120 |
+
self.load_cookies()
|
121 |
+
|
122 |
+
def load_cookies(self):
|
123 |
+
"""Load cookies from storage"""
|
124 |
+
try:
|
125 |
+
for filename in os.listdir(self.cookie_jar_path):
|
126 |
+
if filename.endswith('.json'):
|
127 |
+
domain = filename[:-5] # Remove .json
|
128 |
+
file_path = os.path.join(self.cookie_jar_path, filename)
|
129 |
+
with open(file_path, 'r') as f:
|
130 |
+
try:
|
131 |
+
cookie_data = json.load(f)
|
132 |
+
self.cookies_by_domain[domain] = cookie_data
|
133 |
+
except json.JSONDecodeError:
|
134 |
+
logger.warning(f"Invalid cookie file for {domain}, skipping")
|
135 |
+
except Exception as e:
|
136 |
+
logger.error(f"Error loading cookies: {e}")
|
137 |
+
|
138 |
+
def save_cookies(self):
|
139 |
+
"""Save cookies to storage"""
|
140 |
+
for domain, cookies in self.cookies_by_domain.items():
|
141 |
+
file_path = os.path.join(self.cookie_jar_path, f"{domain}.json")
|
142 |
+
try:
|
143 |
+
with open(file_path, 'w') as f:
|
144 |
+
json.dump(cookies, f)
|
145 |
+
except Exception as e:
|
146 |
+
logger.error(f"Error saving cookies for {domain}: {e}")
|
147 |
+
|
148 |
+
def update_cookies(self, url, response_cookies):
|
149 |
+
"""Update cookies from a response"""
|
150 |
+
domain = urlparse(url).netloc
|
151 |
+
|
152 |
+
if domain not in self.cookies_by_domain:
|
153 |
+
self.cookies_by_domain[domain] = {}
|
154 |
+
|
155 |
+
# Update with new cookies
|
156 |
+
for name, value in response_cookies.items():
|
157 |
+
self.cookies_by_domain[domain][name] = value
|
158 |
+
|
159 |
+
# Save updated cookies
|
160 |
+
self.save_cookies()
|
161 |
+
|
162 |
+
def get_cookies_for_url(self, url):
|
163 |
+
"""Get cookies for a specific URL"""
|
164 |
+
domain = urlparse(url).netloc
|
165 |
+
return self.cookies_by_domain.get(domain, {})
|
166 |
+
|
167 |
+
def clear_cookies_for_domain(self, domain):
|
168 |
+
"""Clear cookies for a specific domain"""
|
169 |
+
if domain in self.cookies_by_domain:
|
170 |
+
del self.cookies_by_domain[domain]
|
171 |
+
file_path = os.path.join(self.cookie_jar_path, f"{domain}.json")
|
172 |
+
if os.path.exists(file_path):
|
173 |
+
os.remove(file_path)
|
174 |
+
|
175 |
+
|
176 |
+
class NitterBypass:
|
177 |
+
"""Advanced Nitter rate limit bypass system"""
|
178 |
+
|
179 |
+
def __init__(self, fingerprint_randomizer, cookie_manager):
|
180 |
+
# Expanded list of Nitter instances for better rotation
|
181 |
+
self.instances = [
|
182 |
+
"https://nitter.net",
|
183 |
+
"https://nitter.lacontrevoie.fr",
|
184 |
+
"https://nitter.1d4.us",
|
185 |
+
"https://nitter.poast.org",
|
186 |
+
"https://nitter.unixfox.eu",
|
187 |
+
"https://nitter.kavin.rocks",
|
188 |
+
"https://nitter.privacydev.net",
|
189 |
+
"https://nitter.projectsegfau.lt",
|
190 |
+
"https://nitter.pussthecat.org",
|
191 |
+
"https://nitter.42l.fr",
|
192 |
+
"https://nitter.fdn.fr",
|
193 |
+
"https://nitter.cz",
|
194 |
+
"https://bird.habedieehre.com",
|
195 |
+
"https://tweet.lambda.dance",
|
196 |
+
"https://nitter.cutelab.space",
|
197 |
+
"https://nitter.fly.dev",
|
198 |
+
"https://notabird.site",
|
199 |
+
"https://nitter.weiler.dev",
|
200 |
+
"https://nitter.sethforprivacy.com",
|
201 |
+
"https://nitter.mask.sh",
|
202 |
+
"https://nitter.space",
|
203 |
+
"https://nitter.hu",
|
204 |
+
"https://nitter.moomoo.me",
|
205 |
+
"https://nitter.grimneko.de",
|
206 |
+
]
|
207 |
+
|
208 |
+
self.fingerprint_randomizer = fingerprint_randomizer
|
209 |
+
self.cookie_manager = cookie_manager
|
210 |
+
|
211 |
+
# Tracking usage statistics per instance
|
212 |
+
self.usage_counts = {instance: 0 for instance in self.instances}
|
213 |
+
self.success_counts = {instance: 0 for instance in self.instances}
|
214 |
+
self.failure_counts = {instance: 0 for instance in self.instances}
|
215 |
+
self.response_times = {instance: [] for instance in self.instances}
|
216 |
+
|
217 |
+
# Track banned instances with timeout
|
218 |
+
self.banned_instances = set()
|
219 |
+
self.banned_time = {}
|
220 |
+
self.ban_duration = 3600 # Default 1 hour ban time
|
221 |
+
|
222 |
+
# Client collection, one per instance
|
223 |
+
self.clients = {}
|
224 |
+
|
225 |
+
# Request flow control
|
226 |
+
self.last_request_time = 0
|
227 |
+
self.min_request_interval = 2.0
|
228 |
+
self.request_jitter = True # Add random jitter to requests
|
229 |
+
|
230 |
+
# Dynamic proxy rotation (if available)
|
231 |
+
self.proxies = self._load_proxies()
|
232 |
+
self.proxy_index = 0
|
233 |
+
|
234 |
+
def _load_proxies(self):
|
235 |
+
"""Load proxy list if available"""
|
236 |
+
proxies = []
|
237 |
+
try:
|
238 |
+
proxy_file = os.path.join(os.path.dirname(__file__), 'proxies.txt')
|
239 |
+
if os.path.exists(proxy_file):
|
240 |
+
with open(proxy_file, 'r') as f:
|
241 |
+
for line in f:
|
242 |
+
line = line.strip()
|
243 |
+
if line and not line.startswith('#'):
|
244 |
+
proxies.append(line)
|
245 |
+
logger.info(f"Loaded {len(proxies)} proxies")
|
246 |
+
except Exception as e:
|
247 |
+
logger.error(f"Error loading proxies: {e}")
|
248 |
+
return proxies
|
249 |
+
|
250 |
+
def _get_next_proxy(self):
|
251 |
+
"""Get next proxy in rotation"""
|
252 |
+
if not self.proxies:
|
253 |
+
return None
|
254 |
+
|
255 |
+
proxy = self.proxies[self.proxy_index]
|
256 |
+
self.proxy_index = (self.proxy_index + 1) % len(self.proxies)
|
257 |
+
return proxy
|
258 |
+
|
259 |
+
async def initialize(self):
|
260 |
+
"""Initialize Nitter bypass system"""
|
261 |
+
# Create clients for each instance
|
262 |
+
for instance in self.instances:
|
263 |
+
await self._initialize_client(instance)
|
264 |
+
|
265 |
+
# Test instances to determine which are working
|
266 |
+
await self._test_instances()
|
267 |
+
|
268 |
+
async def _initialize_client(self, instance):
|
269 |
+
"""Create an HTTP client for an instance"""
|
270 |
+
headers = self.fingerprint_randomizer.generate_headers()
|
271 |
+
|
272 |
+
# Get proxy if available
|
273 |
+
proxy = self._get_next_proxy()
|
274 |
+
proxies = {"all://": proxy} if proxy else None
|
275 |
+
|
276 |
+
# Create client with unique settings for this instance
|
277 |
+
self.clients[instance] = httpx.AsyncClient(
|
278 |
+
timeout=30.0,
|
279 |
+
follow_redirects=True,
|
280 |
+
headers=headers,
|
281 |
+
http2=True,
|
282 |
+
limits=httpx.Limits(max_connections=5, max_keepalive_connections=2),
|
283 |
+
proxies=proxies
|
284 |
+
)
|
285 |
+
|
286 |
+
# Initialize with cookies if we have any
|
287 |
+
domain = urlparse(instance).netloc
|
288 |
+
cookies = self.cookie_manager.get_cookies_for_url(instance)
|
289 |
+
if cookies:
|
290 |
+
for name, value in cookies.items():
|
291 |
+
self.clients[instance].cookies.set(name, value, domain=domain)
|
292 |
+
|
293 |
+
async def _test_instances(self):
|
294 |
+
"""Test all instances to check availability"""
|
295 |
+
for instance in self.instances:
|
296 |
+
try:
|
297 |
+
start_time = time.time()
|
298 |
+
client = self.clients[instance]
|
299 |
+
|
300 |
+
# Add custom parameter to avoid caches
|
301 |
+
params = {"_": str(int(time.time()))}
|
302 |
+
|
303 |
+
response = await client.get(f"{instance}/", params=params, timeout=5.0)
|
304 |
+
end_time = time.time()
|
305 |
+
|
306 |
+
if response.status_code == 200:
|
307 |
+
logger.debug(f"Instance {instance} is available, response time: {end_time - start_time:.2f}s")
|
308 |
+
|
309 |
+
# Update cookies from response
|
310 |
+
self.cookie_manager.update_cookies(instance, dict(client.cookies))
|
311 |
+
|
312 |
+
# Track response time for prioritization
|
313 |
+
self.response_times[instance].append(end_time - start_time)
|
314 |
+
if len(self.response_times[instance]) > 5:
|
315 |
+
self.response_times[instance].pop(0) # Keep only last 5 measurements
|
316 |
+
|
317 |
+
else:
|
318 |
+
logger.warning(f"Instance {instance} returned status {response.status_code}")
|
319 |
+
if response.status_code in [429, 403, 503]:
|
320 |
+
self.banned_instances.add(instance)
|
321 |
+
self.banned_time[instance] = time.time()
|
322 |
+
except Exception as e:
|
323 |
+
logger.warning(f"Instance {instance} test failed: {e}")
|
324 |
+
self.banned_instances.add(instance)
|
325 |
+
self.banned_time[instance] = time.time()
|
326 |
+
|
327 |
+
# Add delay between tests
|
328 |
+
await asyncio.sleep(random.uniform(0.5, 1.5))
|
329 |
+
|
330 |
+
def _get_best_instance(self):
|
331 |
+
"""Select the best instance based on health metrics"""
|
332 |
+
now = time.time()
|
333 |
+
|
334 |
+
# Unban instances that have served their time
|
335 |
+
for instance in list(self.banned_instances):
|
336 |
+
if instance in self.banned_time and now - self.banned_time[instance] > self.ban_duration:
|
337 |
+
self.banned_instances.remove(instance)
|
338 |
+
logger.info(f"Unbanned instance {instance} after timeout")
|
339 |
+
|
340 |
+
# Get available instances
|
341 |
+
available = [i for i in self.instances if i not in self.banned_instances]
|
342 |
+
if not available:
|
343 |
+
# If all are banned, try the least recently banned one
|
344 |
+
if self.banned_time:
|
345 |
+
instance = min(self.banned_time.items(), key=lambda x: x[1])[0]
|
346 |
+
logger.warning(f"All instances banned, trying least recent: {instance}")
|
347 |
+
return instance
|
348 |
+
else:
|
349 |
+
# Fallback to any instance
|
350 |
+
return random.choice(self.instances)
|
351 |
+
|
352 |
+
# Calculate a health score for each instance
|
353 |
+
health_scores = {}
|
354 |
+
for instance in available:
|
355 |
+
# Base score
|
356 |
+
score = 100
|
357 |
+
|
358 |
+
# Adjust for success rate
|
359 |
+
total_requests = self.success_counts[instance] + self.failure_counts[instance]
|
360 |
+
if total_requests > 0:
|
361 |
+
success_rate = self.success_counts[instance] / total_requests
|
362 |
+
score *= (0.5 + 0.5 * success_rate) # Weight success rate as 50% of score
|
363 |
+
|
364 |
+
# Adjust for response time
|
365 |
+
if self.response_times[instance]:
|
366 |
+
avg_response_time = sum(self.response_times[instance]) / len(self.response_times[instance])
|
367 |
+
# Faster responses get higher scores (up to 1.5x bonus for fast responses)
|
368 |
+
speed_factor = min(1.5, max(0.5, 1.0 / (avg_response_time / 2)))
|
369 |
+
score *= speed_factor
|
370 |
+
|
371 |
+
# Adjust for usage count (prefer less used instances)
|
372 |
+
usage_penalty = min(0.9, 0.5 + 0.5 / (1 + self.usage_counts[instance] / 5))
|
373 |
+
score *= usage_penalty
|
374 |
+
|
375 |
+
health_scores[instance] = score
|
376 |
+
|
377 |
+
# Select from top 3 instances with probability weighted by health score
|
378 |
+
top_instances = sorted(health_scores.items(), key=lambda x: x[1], reverse=True)[:3]
|
379 |
+
if not top_instances:
|
380 |
+
return random.choice(available)
|
381 |
+
|
382 |
+
# Extract instances and scores
|
383 |
+
instances = [i[0] for i in top_instances]
|
384 |
+
scores = [i[1] for i in top_instances]
|
385 |
+
|
386 |
+
# Normalize scores for weighted random selection
|
387 |
+
total_score = sum(scores)
|
388 |
+
if total_score > 0:
|
389 |
+
probabilities = [score / total_score for score in scores]
|
390 |
+
chosen = random.choices(instances, weights=probabilities, k=1)[0]
|
391 |
+
else:
|
392 |
+
chosen = random.choice(instances)
|
393 |
+
|
394 |
+
# Update usage count
|
395 |
+
self.usage_counts[chosen] += 1
|
396 |
+
|
397 |
+
return chosen
|
398 |
+
|
399 |
+
    async def request(self, path, params=None):
        """Make an intelligent request to a Nitter instance"""
        if params is None:
            params = {}

        # Add random parameter to avoid caching
        params["_nonce"] = str(random.randint(10000, 99999999))

        # Rate limiting with jitter
        now = time.time()
        since_last = now - self.last_request_time
        if since_last < self.min_request_interval:
            if self.request_jitter:
                # Add jitter to make request patterns less predictable
                jitter = random.uniform(1.0, 3.0)
                delay = self.min_request_interval - since_last + jitter
            else:
                delay = self.min_request_interval - since_last
            await asyncio.sleep(delay)

        # Get the best instance
        instance = self._get_best_instance()
        client = self.clients[instance]

        # Update headers with new fingerprint to avoid detection
        client.headers.update(self.fingerprint_randomizer.generate_headers())

        # Update cookies
        domain = urlparse(instance).netloc
        cookies = self.cookie_manager.get_cookies_for_url(instance)
        for name, value in cookies.items():
            client.cookies.set(name, value, domain=domain)

        # Update request timestamp
        self.last_request_time = time.time()

        url = f"{instance}{path}"

        try:
            # Make the request with timing
            start_time = time.time()
            response = await client.get(url, params=params)
            end_time = time.time()
            response_time = end_time - start_time

            # Update cookies from response
            if len(response.cookies) > 0:
                self.cookie_manager.update_cookies(url, dict(response.cookies))

            # Update response time tracking
            self.response_times[instance].append(response_time)
            if len(self.response_times[instance]) > 5:
                self.response_times[instance].pop(0)

            # Handle response based on status code
            if response.status_code == 200:
                # Success
                self.success_counts[instance] += 1
                return response
            elif response.status_code in [429, 403, 503]:
                # Rate limited or banned
                logger.warning(f"Rate limit detected on {instance}: {response.status_code}")
                self.failure_counts[instance] += 1
                self.banned_instances.add(instance)
                self.banned_time[instance] = time.time()

                # Different ban durations based on response
                if response.status_code == 429:  # Rate limited
                    self.ban_duration = min(self.ban_duration * 2, 7200)  # Max 2 hour ban, increasing
                else:  # Other error
                    self.ban_duration = 1800  # 30 minute ban

                # Retry with a different instance
                return await self.request(path, params)
            else:
                # Other error
                logger.error(f"Error with {instance}: HTTP {response.status_code}")
                self.failure_counts[instance] += 1

                # Don't immediately ban for non-rate-limit errors
                if self.failure_counts[instance] > 3:  # After 3 failures, ban temporarily
                    self.banned_instances.add(instance)
                    self.banned_time[instance] = time.time()
                    self.ban_duration = 900  # 15 minute ban

                # Retry with a different instance
                return await self.request(path, params)

        except httpx.HTTPError as e:
            logger.error(f"HTTP error with {instance}: {str(e)}")
            self.failure_counts[instance] += 1

            # Ban instance after HTTP errors
            self.banned_instances.add(instance)
            self.banned_time[instance] = time.time()

            # Retry with a different instance
            return await self.request(path, params)

        except Exception as e:
            logger.error(f"Error with {instance}: {str(e)}")
            self.failure_counts[instance] += 1

            # Ban instance after errors
            self.banned_instances.add(instance)
            self.banned_time[instance] = time.time()

            # Retry with a different instance
            return await self.request(path, params)

    async def close(self):
        """Close all HTTP clients"""
        for client in self.clients.values():
            await client.aclose()


class TwitterService:
    """Service for collecting tweets via web scraping using Nitter alternative frontends."""

    def __init__(self):
        self.cache_expiry = int(os.getenv("CACHE_EXPIRY_MINUTES", 120))

        # Initialize advanced components for rate limit bypass
        self.fingerprint_randomizer = FingerprintRandomizer()
        self.cookie_manager = CookieManager()
        self.nitter_bypass = None  # Will be initialized later

        # Enhanced cache with TTL and persistence
        self.tweet_cache_dir = os.path.join(os.path.dirname(__file__), ".tweet_cache")
        os.makedirs(self.tweet_cache_dir, exist_ok=True)
        self.in_memory_cache = TTLCache(maxsize=100, ttl=self.cache_expiry * 60)

        # Statistics and monitoring
        self.stats = {
            "requests": 0,
            "cache_hits": 0,
            "rate_limits": 0,
            "errors": 0,
            "success": 0
        }
        self.last_stats_reset = time.time()

        # Default trusted news sources - focused on India-Pakistan relations
        self.news_sources = [
            NewsSource(name="Shiv Aroor", twitter_handle="ShivAroor", country="India", reliability_score=0.85),
            NewsSource(name="Sidhant Sibal", twitter_handle="sidhant", country="India", reliability_score=0.85),
            NewsSource(name="Indian Air Force", twitter_handle="IAF_MCC", country="India", reliability_score=0.95),
            NewsSource(name="Indian Army", twitter_handle="adgpi", country="India", reliability_score=0.95),
            NewsSource(name="Indian Defence Ministry", twitter_handle="SpokespersonMoD", country="India", reliability_score=0.95),
            NewsSource(name="MIB India", twitter_handle="MIB_India", country="India", reliability_score=0.95),
            NewsSource(name="Indian External Affairs Minister", twitter_handle="DrSJaishankar", country="India", reliability_score=0.95),
        ]

    async def initialize(self) -> bool:
        """Initialize the Twitter service."""
        try:
            logger.info("Initializing Twitter service with advanced bypass techniques")

            # Initialize the Nitter bypass engine
            self.nitter_bypass = NitterBypass(self.fingerprint_randomizer, self.cookie_manager)
            await self.nitter_bypass.initialize()

            # Schedule background health checks for instances
            asyncio.create_task(self._background_maintenance())

            logger.info("Twitter service initialized successfully with bypass capabilities")
            return True

        except Exception as e:
            logger.error(f"Failed to initialize Twitter service: {str(e)}")
            return False

    async def _background_maintenance(self):
        """Run background maintenance tasks"""
        while True:
            try:
                # Wait between maintenance cycles
                await asyncio.sleep(900)  # 15 minutes

                # Log statistics
                self._log_statistics()

                # Clean up cache files
                self._cleanup_expired_cache()

                # Reset statistics periodically
                if time.time() - self.last_stats_reset > 3600:  # Reset every hour
                    self.stats = {key: 0 for key in self.stats}
                    self.last_stats_reset = time.time()

            except Exception as e:
                logger.error(f"Error in background maintenance: {str(e)}")

    def _log_statistics(self):
        """Log service statistics"""
        total_requests = max(1, self.stats["requests"])
        cache_hit_rate = self.stats["cache_hits"] / total_requests * 100
        error_rate = (self.stats["errors"] + self.stats["rate_limits"]) / total_requests * 100

        logger.info(f"TwitterService stats - Requests: {total_requests}, " +
                    f"Cache hits: {self.stats['cache_hits']} ({cache_hit_rate:.1f}%), " +
                    f"Rate limits: {self.stats['rate_limits']}, " +
                    f"Errors: {self.stats['errors']} ({error_rate:.1f}%)")

    def _cleanup_expired_cache(self):
        """Clean up expired cache files"""
        now = time.time()
        expiry_time = self.cache_expiry * 60

        try:
            for filename in os.listdir(self.tweet_cache_dir):
                if not filename.endswith('.json'):
                    continue

                file_path = os.path.join(self.tweet_cache_dir, filename)

                try:
                    file_modified_time = os.path.getmtime(file_path)
                    if now - file_modified_time > expiry_time:
                        os.remove(file_path)
                        logger.debug(f"Removed expired cache file: {filename}")
                except Exception as e:
                    logger.error(f"Error cleaning up cache file {filename}: {e}")
        except Exception as e:
            logger.error(f"Error during cache cleanup: {e}")

    def _get_cache_path(self, key):
        """Get filesystem path for a cache key"""
        # Create a safe filename from the cache key
        safe_key = re.sub(r'[^a-zA-Z0-9_-]', '_', key)
        return os.path.join(self.tweet_cache_dir, f"{safe_key}.json")

    def _get_from_cache(self, cache_key):
        """Get tweets from cache (memory or disk)"""
        # Check memory cache first
        if cache_key in self.in_memory_cache:
            self.stats["cache_hits"] += 1
            return self.in_memory_cache[cache_key]

        # Check disk cache
        cache_path = self._get_cache_path(cache_key)
        if os.path.exists(cache_path):
            try:
                with open(cache_path, 'r') as f:
                    cache_data = json.load(f)

                # Check if cache is still valid
                if time.time() - cache_data['timestamp'] < self.cache_expiry * 60:
                    # Convert dictionaries back to Tweet objects
                    tweets = []
                    for tweet_dict in cache_data['tweets']:
                        # Parse created_at back to datetime if it's stored as a string
                        if 'created_at' in tweet_dict and isinstance(tweet_dict['created_at'], str):
                            try:
                                tweet_dict['created_at'] = datetime.fromisoformat(tweet_dict['created_at'])
                            except ValueError:
                                tweet_dict['created_at'] = datetime.now()

                        tweets.append(Tweet(**tweet_dict))

                    # Restore to memory cache and return
                    self.in_memory_cache[cache_key] = tweets
                    self.stats["cache_hits"] += 1
                    return tweets
                else:
                    # Cache expired, remove file
                    os.remove(cache_path)
            except Exception as e:
                logger.error(f"Error reading cache file {cache_path}: {e}")

        return None

    def _save_to_cache(self, cache_key, tweets):
        """Save tweets to cache (memory and disk)"""
        # Save to memory cache
        self.in_memory_cache[cache_key] = tweets

        # Convert tweets to dictionaries for JSON serialization
        tweet_dicts = []
        for tweet in tweets:
            tweet_dicts.append({
                'id': tweet.id,
                'text': tweet.text,
                'author': tweet.author,
                'created_at': tweet.created_at.isoformat() if hasattr(tweet.created_at, 'isoformat') else str(tweet.created_at),
                'engagement': tweet.engagement,
                'url': tweet.url
            })

        # Save to disk cache
        cache_path = self._get_cache_path(cache_key)
        try:
            with open(cache_path, 'w') as f:
                json.dump({
                    'tweets': tweet_dicts,
                    'timestamp': time.time()
                }, f)
        except Exception as e:
            logger.error(f"Error writing to cache file {cache_path}: {e}")

    async def get_tweets_from_source(self, source: NewsSource, limit: int = 20, retries: int = 3) -> List[Tweet]:
        """Get tweets from a specific Twitter source using advanced bypass techniques."""
        cache_key = f"{source.twitter_handle}_{limit}"

        # Check cache first
        cached_tweets = self._get_from_cache(cache_key)
        if cached_tweets:
            logger.debug(f"Returning cached tweets for {source.twitter_handle}")
            return cached_tweets

        self.stats["requests"] += 1

        # Extract tweets with retry logic
        all_attempts = retries + 1
        tweets = []

        for attempt in range(all_attempts):
            try:
                logger.info(f"Fetching tweets from {source.twitter_handle} (attempt {attempt + 1}/{all_attempts})")

                # Build path with randomization to avoid caching patterns
                path = f"/{source.twitter_handle}"
                params = {
                    "f": "tweets",  # Filter to tweets only
                    "r": str(random.randint(10000, 99999))  # Random param to bypass caches
                }

                # Get the response using our bypass system
                response = await self.nitter_bypass.request(path, params)

                if response.status_code == 200:
                    # Success - extract tweets
                    self.stats["success"] += 1

                    # Parse the HTML
                    soup = BeautifulSoup(response.text, "html.parser")

                    # Find tweet containers
                    tweet_containers = soup.select(".timeline-item")

                    for container in tweet_containers[:limit]:
                        try:
                            # Extract tweet ID from the permalink
                            permalink_element = container.select_one(".tweet-link")
                            if not permalink_element:
                                continue

                            permalink = permalink_element.get("href", "")
                            tweet_id = permalink.split("/")[-1]

                            # Extract tweet text
                            text_element = container.select_one(".tweet-content")
                            tweet_text = text_element.get_text().strip() if text_element else ""

                            # Extract timestamp
                            time_element = container.select_one(".tweet-date")
                            timestamp = time_element.find("a").get("title") if time_element and time_element.find("a") else None

                            if timestamp:
                                try:
                                    created_at = datetime.strptime(timestamp, "%d/%m/%Y, %H:%M:%S")
                                except ValueError:
                                    created_at = datetime.now()
                            else:
                                created_at = datetime.now()

                            # Extract engagement metrics
                            stats_container = container.select_one(".tweet-stats")
                            engagement = {"likes": 0, "retweets": 0, "replies": 0, "views": 0}

                            if stats_container:
                                stats = stats_container.select(".icon-container")
                                for stat in stats:
                                    stat_text = stat.get_text().strip()
                                    if "retweet" in stat.get("class", []):
                                        engagement["retweets"] = self._parse_count(stat_text)
                                    elif "heart" in stat.get("class", []):
                                        engagement["likes"] = self._parse_count(stat_text)
                                    elif "comment" in stat.get("class", []):
                                        engagement["replies"] = self._parse_count(stat_text)

                            tweet_url = f"https://x.com/{source.twitter_handle}/status/{tweet_id}"

                            tweets.append(
                                Tweet(
                                    id=tweet_id,
                                    text=tweet_text,
                                    author=source.twitter_handle,
                                    created_at=created_at,
                                    engagement=engagement,
                                    url=tweet_url
                                )
                            )
                        except Exception as e:
                            logger.error(f"Error processing tweet from {source.twitter_handle}: {str(e)}")

                    # Cache the results
                    if tweets:
                        self._save_to_cache(cache_key, tweets)
                        logger.info(f"Fetched and cached {len(tweets)} tweets from {source.twitter_handle}")

                    return tweets

                elif response.status_code == 429:
                    # Rate limited
                    self.stats["rate_limits"] += 1
                    logger.warning(f"Rate limited (429) when fetching tweets from {source.twitter_handle}")

                    if attempt < retries:
                        backoff_time = min(30 * (2 ** attempt), 300)  # Exponential backoff, max 5 minutes
                        logger.info(f"Retrying in {backoff_time}s...")
                        await asyncio.sleep(backoff_time)
                    else:
                        logger.error(f"Failed to fetch tweets from {source.twitter_handle} after {retries} retries: HTTP 429")
                        return []

                else:
                    # Other error
                    self.stats["errors"] += 1
                    logger.error(f"Failed to fetch tweets from {source.twitter_handle}: HTTP {response.status_code}")

                    if attempt < retries:
                        await asyncio.sleep(5)
                        continue
                    else:
                        return []

            except Exception as e:
                self.stats["errors"] += 1
                logger.error(f"Error fetching tweets from {source.twitter_handle}: {str(e)}")

                if attempt < retries:
                    await asyncio.sleep(5)
                    continue

        return []  # Return empty list if all retries failed

    def _parse_count(self, count_text: str) -> int:
        """Parse count text like '1.2K' into integer value."""
        try:
            count_text = count_text.strip()
            if not count_text:
                return 0

            if 'K' in count_text:
                return int(float(count_text.replace('K', '')) * 1000)
            elif 'M' in count_text:
                return int(float(count_text.replace('M', '')) * 1000000)
            else:
                return int(count_text)
        except (ValueError, TypeError):
            return 0

    async def get_related_tweets(self, keywords: List[str], days_back: int = 2) -> List[Tweet]:
        """
        Get tweets related to specific keywords from trusted news sources only.
        Uses intelligent batching and failover strategies.
        """
        all_tweets = []
        cutoff_date = datetime.now() - timedelta(days=days_back)

        # Process sources in smaller batches with smart ordering
        active_sources = [source for source in self.news_sources if source.is_active]

        # Sort sources by reliability score (prioritize higher scores)
        active_sources.sort(key=lambda s: s.reliability_score, reverse=True)

        # Dynamic batch size - larger when we have fewer sources to optimize throughput
        source_count = len(active_sources)
        batch_size = max(1, min(3, 10 // source_count if source_count > 0 else 3))

        logger.info(f"Collecting tweets from {len(active_sources)} trusted news sources")

        for i in range(0, len(active_sources), batch_size):
            batch_sources = active_sources[i:i+batch_size]

            # Process batch with smart concurrency
            tasks = []
            for source in batch_sources:
                # Adaptive limit based on source reliability
                fetch_limit = int(50 * min(1.5, source.reliability_score))
                tasks.append(self.get_tweets_from_source(source, limit=fetch_limit))

            source_tweets_list = await asyncio.gather(*tasks)

            # Process batch results
            batch_tweets = []
            for source_tweets in source_tweets_list:
                # Filter tweets by keywords and date
                for tweet in source_tweets:
                    if (tweet.created_at >= cutoff_date and
                            any(keyword.lower() in tweet.text.lower() for keyword in keywords)):
                        batch_tweets.append(tweet)

            all_tweets.extend(batch_tweets)

            # Dynamic delay between batches based on results
            # If we got fewer tweets than expected, slow down more
            batch_delay = random.uniform(2.0, 5.0)
            if len(batch_tweets) < batch_size * 3:  # Fewer than 3 tweets per source
                batch_delay += random.uniform(3.0, 7.0)  # Add extra delay

            await asyncio.sleep(batch_delay)

        # If we have very few results, try with more relaxed filtering
        if len(all_tweets) < 5 and active_sources:
            logger.info("Few relevant tweets found, trying more relaxed filtering")

            # Take top 3 most reliable sources
            key_sources = active_sources[:min(3, len(active_sources))]
            tasks = [self.get_tweets_from_source(source, limit=100, retries=5) for source in key_sources]
            more_tweets_list = await asyncio.gather(*tasks)

            # Process with more relaxed keyword matching
            for source_tweets in more_tweets_list:
                for tweet in source_tweets:
                    # Use partial keyword matching
                    if tweet.created_at >= cutoff_date:
                        for keyword in keywords:
                            # Split keyword into parts and check if any part matches
                            keyword_parts = keyword.lower().split()
                            if any(part in tweet.text.lower() for part in keyword_parts if len(part) > 3):
                                if tweet.id not in [t.id for t in all_tweets]:
                                    all_tweets.append(tweet)
                                break

        # Sort by recency
        all_tweets.sort(key=lambda x: x.created_at, reverse=True)

        logger.info(f"Found {len(all_tweets)} tweets from trusted sources related to keywords: {keywords}")
        return all_tweets

    def update_sources(self, sources: List[NewsSource]) -> None:
        """Update the list of trusted news sources."""
        self.news_sources = sources
        # Clear cache when sources are updated
        self.in_memory_cache.clear()
        logger.info(f"Updated trusted news sources. New count: {len(sources)}")

    def get_sources(self) -> List[NewsSource]:
        """Get the current list of trusted news sources."""
        return self.news_sources

    async def close(self):
        """Clean up resources."""
        if self.nitter_bypass:
            await self.nitter_bypass.close()
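As a quick illustration of how the pieces above fit together, here is a minimal, hypothetical driver script (the keywords, printed fields, and entry point are only examples, not part of the service itself); it assumes twitter_service.py is importable from the working directory:

    import asyncio

    from twitter_service import TwitterService


    async def main():
        service = TwitterService()
        if not await service.initialize():
            raise RuntimeError("TwitterService failed to initialize")
        try:
            # Illustrative keywords; callers supply whatever terms they track.
            tweets = await service.get_related_tweets(["India", "Pakistan", "ceasefire"], days_back=2)
            for tweet in tweets[:5]:
                print(tweet.created_at, tweet.author, tweet.text[:80])
        finally:
            await service.close()


    asyncio.run(main())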
vercel.json
ADDED
@@ -0,0 +1,15 @@
{
  "version": 2,
  "builds": [
    {
      "src": "/app.py",
      "use": "@vercel/python"
    }
  ],
  "routes": [
    {
      "src": "/(.*)",
      "dest": "/app.py"
    }
  ]
}
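This config builds app.py with the @vercel/python runtime and rewrites every request path to it, leaving URL dispatch to the FastAPI app itself. The route "src" value is treated as a regular expression; a tiny sketch (the sample paths are illustrative) of what the catch-all pattern accepts:

    import re

    # "/(.*)" captures everything after the leading slash, so any path is forwarded to app.py.
    route_src = re.compile(r"/(.*)")
    for path in ["/", "/docs", "/any/nested/path"]:
        match = route_src.match(path)
        print(path, "->", repr(match.group(1)) if match else "no match")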