abhisheksan committed on
Commit
0905727
·
verified ·
1 Parent(s): 82c6c01

Upload 9 files

Browse files
Files changed (9)
  1. Dockerfile +24 -0
  2. README.md +172 -11
  3. analysis_service.py +288 -0
  4. app.py +237 -0
  5. docker-compose.yml +20 -0
  6. models.py +68 -0
  7. requirements.txt +10 -0
  8. twitter_service.py +945 -0
  9. vercel.json +15 -0
Dockerfile ADDED
@@ -0,0 +1,24 @@
1
+ FROM python:3.11-slim
2
+
3
+ WORKDIR /app
4
+
5
+ # Install dependencies
6
+ COPY requirements.txt .
7
+ RUN pip install --no-cache-dir -r requirements.txt
8
+
9
+ # Create logs directory for application logs
10
+ RUN mkdir -p logs
11
+
12
+ # Copy application code
13
+ COPY . .
14
+
15
+ # Set environment variables
16
+ ENV PYTHONDONTWRITEBYTECODE=1
17
+ ENV PYTHONUNBUFFERED=1
18
+ ENV LOG_LEVEL=INFO
19
+
20
+ # Expose the port the app runs on
21
+ EXPOSE 8000
22
+
23
+ # Command to run the application
24
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,11 +1,172 @@
1
- ---
2
- title: Westernfront
3
- emoji: 🦀
4
- colorFrom: indigo
5
- colorTo: blue
6
- sdk: docker
7
- pinned: false
8
- short_description: westernfront ai backend
9
- ---
10
-
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # WesternFront: India-Pakistan Conflict Tracker API
2
+
3
+ A FastAPI application that combines unofficial Twitter data collection with Google's Gemini AI to monitor and analyze India-Pakistan tensions in real time.
4
+
5
+ ## Overview
6
+
7
+ WesternFront is an AI-powered conflict tracker that:
8
+
9
+ 1. Collects tweets from reliable news sources covering India-Pakistan relations without using the official Twitter API
10
+ 2. Analyzes these tweets using Google's Gemini AI to assess the current conflict situation
11
+ 3. Provides RESTful endpoints to access the analysis
12
+ 4. Updates analysis periodically and on-demand
13
+
14
+ ## Core Components
15
+
16
+ ### Twitter Data Collection
17
+ - Scrapes public [Nitter](https://github.com/zedeus/nitter) front-ends with httpx and BeautifulSoup for unofficial Twitter access (see the sketch below)
18
+ - Fetches tweets from a predefined list of reliable sources
19
+ - Implements caching to avoid unnecessary requests
20
+
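+ A minimal sketch of the collection step (the Nitter instance URL and handle are placeholders; `twitter_service.py` layers instance rotation, fingerprint randomization, throttling and caching on top of this):
+
+ ```python
+ import httpx
+ from bs4 import BeautifulSoup
+
+ async def fetch_timeline(handle: str, instance: str = "https://nitter.net") -> list[str]:
+     """Fetch recent tweet texts for a handle from a Nitter front-end."""
+     async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+         response = await client.get(f"{instance}/{handle}", params={"f": "tweets"})
+         response.raise_for_status()
+     # Each ".timeline-item" holds one tweet; ".tweet-content" holds its text
+     soup = BeautifulSoup(response.text, "html.parser")
+     return [item.get_text(strip=True) for item in soup.select(".timeline-item .tweet-content")]
+ ```
+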
21
+ ### AI Analysis with Gemini
22
+ - Analyzes collected tweets to assess India-Pakistan tensions
23
+ - Generates comprehensive reports including:
24
+ - Current situation summary
25
+ - Key developments in the last 24-48 hours
26
+ - Information reliability assessment
27
+ - Regional stability implications
28
+ - Tension level classification (Low/Medium/High/Critical)
29
+
30
+ ### FastAPI Server
31
+ - Endpoint for on-demand analysis updates
32
+ - Endpoint to get latest analysis
33
+ - Background task system for periodic updates
34
+ - Health check endpoint
35
+ - Source list and keyword management
36
+
37
+ ## Getting Started
38
+
39
+ ### Prerequisites
40
+
41
+ - Python 3.9+
42
+ - Docker (optional)
43
+
44
+ ### Environment Setup
45
+
46
+ 1. Clone the repository
47
+ 2. Copy `.env.example` to `.env` and fill in the required values:
48
+ ```
49
+ # Twitter Credentials
50
+ TWITTER_USERNAME=your_twitter_username
51
+ TWITTER_PASSWORD=your_twitter_password
52
+ TWITTER_EMAIL=your_twitter_email
53
+
54
+ # Google Gemini API Key
55
+ GEMINI_API_KEY=your_gemini_api_key
56
+
57
+ # Application Settings
58
+ UPDATE_INTERVAL_MINUTES=60
59
+ CACHE_EXPIRY_MINUTES=120
60
+ LOG_LEVEL=INFO
61
+ ```
62
+
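+ The application reads these values at startup; a minimal sketch of the pattern used in `app.py` and the services:
+
+ ```python
+ import os
+ from dotenv import load_dotenv
+
+ load_dotenv()  # pick up .env from the working directory
+
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
+ UPDATE_INTERVAL_MINUTES = int(os.getenv("UPDATE_INTERVAL_MINUTES", 60))
+ CACHE_EXPIRY_MINUTES = int(os.getenv("CACHE_EXPIRY_MINUTES", 120))
+ LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
+ ```
+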
63
+ ### Installation
64
+
65
+ #### Local Development
66
+
67
+ ```bash
68
+ # Create virtual environment
69
+ python -m venv venv
70
+ source venv/bin/activate # On Windows: venv\Scripts\activate
71
+
72
+ # Install dependencies
73
+ pip install -r requirements.txt
74
+
75
+ # Run the application
76
+ uvicorn app:app --reload
77
+ ```
78
+
79
+ #### Docker Deployment
80
+
81
+ ```bash
82
+ # Build the Docker image
83
+ docker build -t westernfront .
84
+
85
+ # Run the container
86
+ docker run -p 8000:8000 --env-file .env westernfront
87
+ ```
88
+
89
+ ## API Endpoints
90
+
91
+ ### Root Endpoint
92
+ - `GET /`: Basic API information
93
+
94
+ ### Health Check
95
+ - `GET /health`: Check the health of the API and its components
96
+
97
+ ### Analysis
98
+ - `GET /analysis`: Get the latest conflict analysis
99
+ - `POST /analysis/update`: Trigger an analysis update
100
+ - Request Body: `{ "force": boolean }` (optional, defaults to false)
101
+
102
+ ### News Sources
103
+ - `GET /sources`: Get the current list of news sources
104
+ - `POST /sources`: Update the list of news sources
105
+ - Request Body: Array of NewsSource objects
106
+
107
+ ### Keywords
108
+ - `GET /keywords`: Get the current search keywords
109
+ - `POST /keywords`: Update the search keywords
110
+ - Request Body: Array of strings
111
+
112
+ ### Tension Levels
113
+ - `GET /tension-levels`: Get the available tension levels
114
+
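+ An example client session against the endpoints above, using `httpx` (a local run on port 8000 is assumed; adjust the base URL for your deployment):
+
+ ```python
+ import httpx
+
+ with httpx.Client(base_url="http://localhost:8000", timeout=30.0) as client:
+     # Trigger a fresh analysis run in the background
+     print(client.post("/analysis/update", json={"force": True}).json())
+
+     # Fetch the latest analysis (returns 404 until the first run completes)
+     resp = client.get("/analysis")
+     if resp.status_code == 200:
+         analysis = resp.json()
+         print(analysis["tension_level"], analysis["situation_summary"])
+ ```
+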
115
+ ## Data Models
116
+
117
+ ### News Source
118
+ ```json
119
+ {
120
+ "name": "BBC News",
121
+ "twitter_handle": "BBCWorld",
122
+ "country": "UK",
123
+ "reliability_score": 0.9,
124
+ "is_active": true
125
+ }
126
+ ```
127
+
128
+ ### Conflict Analysis
129
+ ```json
130
+ {
131
+ "analysis_id": "uuid",
132
+ "generated_at": "2023-05-01T12:00:00Z",
133
+ "situation_summary": "...",
134
+ "key_developments": [
135
+ {
136
+ "title": "Development 1",
137
+ "description": "...",
138
+ "sources": ["@BBCWorld", "@Reuters"],
139
+ "timestamp": "2023-05-01T10:30:00Z"
140
+ }
141
+ ],
142
+ "reliability_assessment": "...",
143
+ "regional_implications": "...",
144
+ "tension_level": "Medium",
145
+ "source_tweets": [],
146
+ "update_triggered_by": "scheduled"
147
+ }
148
+ ```
149
+
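+ Both payloads map onto the Pydantic models in `models.py`. A sketch of re-validating a `GET /analysis` response client-side (assumes the repo is on the import path and Pydantic v2, as pinned in `requirements.txt`):
+
+ ```python
+ import httpx
+ from models import ConflictAnalysis  # models shipped in this repo
+
+ payload = httpx.get("http://localhost:8000/analysis").json()
+ analysis = ConflictAnalysis.model_validate(payload)  # raises if fields are missing or invalid
+ print(analysis.tension_level, len(analysis.key_developments))
+ ```
+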
150
+ ## Implementation Notes
151
+
152
+ - The application uses asyncio for handling concurrent requests
153
+ - Implements in-memory caching via `cachetools` (can be extended to Redis); see the sketch below
154
+ - Rate limiting and throttling for Twitter scraping to avoid blocking
155
+ - Proper error handling and logging via loguru
156
+ - Secure credential management via environment variables
157
+
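+ An illustrative sketch of the caching-plus-throttling pattern (the numbers are examples, not the service's exact tuning):
+
+ ```python
+ import asyncio
+ import time
+ from cachetools import TTLCache
+
+ cache = TTLCache(maxsize=100, ttl=120 * 60)  # entries expire after the cache window
+ MIN_INTERVAL = 2.0                           # seconds between outbound requests
+ _last_request = 0.0
+
+ async def throttled_fetch(key, fetch_fn):
+     global _last_request
+     if key in cache:                         # serve repeated queries from memory
+         return cache[key]
+     wait = MIN_INTERVAL - (time.time() - _last_request)
+     if wait > 0:                             # space out requests to avoid blocks
+         await asyncio.sleep(wait)
+     _last_request = time.time()
+     result = await fetch_fn()                # fetch_fn is any async callable
+     cache[key] = result
+     return result
+ ```
+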
158
+ ## Future Enhancements
159
+
160
+ - Redis integration for more robust caching
161
+ - User authentication for API access
162
+ - Email/notification alerts for critical tension levels
163
+ - Historical data storage and trend analysis
164
+ - Additional data sources beyond Twitter
165
+
166
+ ## License
167
+
168
+ MIT License
169
+
170
+ ## Disclaimer
171
+
172
+ This application is designed for educational and research purposes. The analysis provided should not be used as the sole source for critical decision-making related to regional conflicts.
analysis_service.py ADDED
@@ -0,0 +1,288 @@
1
+ import os
2
+ import uuid
3
+ from datetime import datetime
4
+ from typing import Dict, List
5
+
6
+ import google.generativeai as genai
7
+ from loguru import logger
8
+ from tenacity import RetryError, retry, stop_after_attempt, wait_exponential
9
+
10
+ from models import ConflictAnalysis, KeyDevelopment, TensionLevel, Tweet
11
+
12
+
13
+ class AnalysisService:
14
+ """Service for analyzing tweets using Google's Gemini AI."""
15
+
16
+ def __init__(self):
17
+ self.api_key = os.getenv("GEMINI_API_KEY")
18
+ self.model = None
19
+ self.search_keywords = [
20
+ "India Pakistan", "Kashmir", "LOC", "Line of Control",
21
+ "border tension", "ceasefire", "military", "diplomatic relations",
22
+ "India-Pakistan", "cross-border", "terrorism", "bilateral relations"
23
+ ]
24
+ self.initialize()
25
+
26
+ def initialize(self) -> bool:
27
+ """Initialize the Gemini AI client."""
28
+ if not self.api_key:
29
+ logger.error("GEMINI_API_KEY not provided")
30
+ return False
31
+
32
+ try:
33
+ logger.info("Initializing Gemini AI")
34
+ genai.configure(api_key=self.api_key)
35
+ # Configure model with lower temperature for more factual responses
36
+ generation_config = {
37
+ "temperature": 0.1,
38
+ "top_p": 0.95,
39
+ "top_k": 40
40
+ }
41
+ self.model = genai.GenerativeModel('gemini-2.0-flash', generation_config=generation_config)
42
+ logger.info("Gemini AI initialized successfully")
43
+ return True
44
+ except Exception as e:
45
+ logger.error(f"Failed to initialize Gemini AI: {str(e)}")
46
+ return False
47
+
48
+ def _prepare_prompt(self, tweets: List[Tweet]) -> str:
49
+ """Prepare the prompt for analysis with intelligence sources data."""
50
+ # Sort tweets by recency to help with latest status identification
51
+ sorted_tweets = sorted(tweets, key=lambda x: x.created_at if hasattr(x, 'created_at') else datetime.now(), reverse=True)
52
+
53
+ source_entries = [
54
+ f"DATA POINT {i+1}: [TIMESTAMP: {tweet.created_at if hasattr(tweet, 'created_at') else 'unknown'}, SOURCE: @{tweet.author}]\n{tweet.text}"
55
+ for i, tweet in enumerate(sorted_tweets)
56
+ ]
57
+ intelligence_data = "\n\n".join(source_entries)
58
+
59
+ prompt = f"""
60
+ INTELLIGENCE BRIEF: INDIA-PAKISTAN SITUATION ANALYSIS
61
+ DATE: {datetime.now().strftime("%Y-%m-%d")}
62
+ CLASSIFICATION: STRATEGIC ASSESSMENT
63
+
64
+ SOURCE DATA:
65
+ {intelligence_data}
66
+
67
+ ANALYTICAL PARAMETERS:
68
+ - Analyze the data points objectively without commentary
69
+ - Identify factual developments and official statements
70
+ - Assess tension levels based on concrete actions and statements
71
+ - Maintain professional, analytical tone throughout
72
+ - Cite specific data points in all assessments
73
+ - Do not introduce information not present in the data points
74
+ - Include exact timestamps when available
75
+
76
+ REQUIRED OUTPUT FORMAT:
77
+ {{
78
+ "latest_status": "Most recent significant development with exact timestamp and source citation",
79
+ "situation_summary": "Precise assessment of current Indo-Pak situation with timestamps and citations",
80
+ "key_developments": [
81
+ {{
82
+ "title": "Precise event designation",
83
+ "description": "Factual account with supporting evidence and timestamps",
84
+ "sources": ["@source1", "@source2"]
85
+ }}
86
+ ],
87
+ "reliability_assessment": {{
88
+ "source_credibility": "Assessment of source authority and reliability",
89
+ "information_gaps": "Specific identification of intelligence gaps",
90
+ "confidence_rating": "HIGH|MEDIUM|LOW based on data quality"
91
+ }},
92
+ "regional_implications": {{
93
+ "security": "Concrete security implications based on factual developments",
94
+ "diplomatic": "Diplomatic consequences with specific references",
95
+ "economic": "Economic impacts if applicable to current situation"
96
+ }},
97
+ "tension_level": "LOW|MEDIUM|HIGH|CRITICAL",
98
+ "tension_rationale": "Specific evidence supporting tension level assessment"
99
+ }}
100
+
101
+ IMPORTANT DIRECTIVES:
102
+ - Return ONLY valid JSON without any additional text or markdown formatting
103
+ - Do not use conversational language or first-person perspective
104
+ - Focus on factual analysis, not speculation
105
+ - Prioritize verified information from official channels
106
+ - Highlight the most recent developments in the latest_status section
107
+ """
108
+ return prompt
109
+
110
+ @retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(3))
111
+ async def _call_gemini(self, prompt: str) -> Dict:
112
+ """Call the Gemini API with retry logic and improved parsing."""
113
+ if not self.model:
114
+ if not self.initialize():
115
+ logger.error("Could not analyze tweets, Gemini AI not initialized")
116
+ raise Exception("Gemini AI initialization failed")
117
+
118
+ try:
119
+ logger.info("Calling Gemini API for conflict analysis")
120
+ response = await self.model.generate_content_async(prompt)
121
+ result = response.text
122
+
123
+ import json
124
+ import re
125
+
126
+ # Better JSON extraction with multiple patterns
127
+ json_match = re.search(r'```(?:json)?\n(.*?)\n```', result, re.DOTALL)
128
+ if json_match:
129
+ result = json_match.group(1)
130
+ else:
131
+ # Try to find JSON objects with or without formatting
132
+ json_pattern = r'({[\s\S]*})'
133
+ json_match = re.search(json_pattern, result)
134
+ if json_match:
135
+ result = json_match.group(1)
136
+
137
+ # Clean the result of any non-JSON content
138
+ result = re.sub(r'```', '', result).strip()
139
+
140
+ # Parse JSON with error handling
141
+ try:
142
+ analysis_data = json.loads(result)
143
+ logger.info("Successfully received and parsed Gemini response")
144
+ return analysis_data
145
+ except json.JSONDecodeError as e:
146
+ logger.error(f"JSON parsing error: {str(e)}")
147
+ # Attempt cleanup and retry parsing
148
+ result = re.sub(r'[\n\r\t]', ' ', result)
149
+ result = re.search(r'({.*})', result).group(1) if re.search(r'({.*})', result) else result
150
+ analysis_data = json.loads(result)
151
+ logger.info("Successfully parsed Gemini response after cleanup")
152
+ return analysis_data
153
+
154
+ except Exception as e:
155
+ logger.error(f"Error calling Gemini API: {str(e)}")
156
+ logger.debug(f"Raw response content: {result if 'result' in locals() else 'No response'}")
157
+ raise
158
+
159
+ def _extract_tension_level(self, level_text: str) -> TensionLevel:
160
+ """Extract tension level enum from text."""
161
+ level_text = level_text.lower()
162
+ if "critical" in level_text:
163
+ return TensionLevel.CRITICAL
164
+ elif "high" in level_text:
165
+ return TensionLevel.HIGH
166
+ elif "medium" in level_text:
167
+ return TensionLevel.MEDIUM
168
+ else:
169
+ return TensionLevel.LOW
170
+
171
+ def _process_key_developments(self, developments_data: List[Dict]) -> List[KeyDevelopment]:
172
+ """Process key developments from API response."""
173
+ key_developments = []
174
+ for dev in developments_data:
175
+ key_developments.append(
176
+ KeyDevelopment(
177
+ title=dev.get("title", "Unnamed Development"),
178
+ description=dev.get("description", "No description provided"),
179
+ sources=dev.get("sources", []),
180
+ timestamp=datetime.now()
181
+ )
182
+ )
183
+ return key_developments
184
+
185
+ def _format_reliability_assessment(self, reliability_data: Dict) -> str:
186
+ """Format reliability assessment data into a structured string."""
187
+ if isinstance(reliability_data, str):
188
+ return reliability_data
189
+
190
+ if isinstance(reliability_data, dict):
191
+ sections = []
192
+ if "source_credibility" in reliability_data:
193
+ sections.append(f"SOURCE CREDIBILITY: {reliability_data['source_credibility']}")
194
+ if "information_gaps" in reliability_data:
195
+ sections.append(f"INFORMATION GAPS: {reliability_data['information_gaps']}")
196
+ if "confidence_rating" in reliability_data:
197
+ sections.append(f"CONFIDENCE: {reliability_data['confidence_rating']}")
198
+
199
+ if sections:
200
+ return "\n\n".join(sections)
201
+
202
+ return str(reliability_data)
203
+
204
+ def _format_regional_implications(self, implications_data: Dict) -> str:
205
+ """Format regional implications data into a structured string."""
206
+ if isinstance(implications_data, str):
207
+ return implications_data
208
+
209
+ if isinstance(implications_data, dict):
210
+ sections = []
211
+ if "security" in implications_data:
212
+ sections.append(f"SECURITY: {implications_data['security']}")
213
+ if "diplomatic" in implications_data:
214
+ sections.append(f"DIPLOMATIC: {implications_data['diplomatic']}")
215
+ if "economic" in implications_data:
216
+ sections.append(f"ECONOMIC: {implications_data['economic']}")
217
+
218
+ if sections:
219
+ return "\n\n".join(sections)
220
+
221
+ return str(implications_data)
222
+
223
+ async def analyze_tweets(self, tweets: List[Tweet], trigger: str = "scheduled") -> ConflictAnalysis:
224
+ """Analyze tweets using Gemini AI and generate a conflict analysis."""
225
+ if not tweets:
226
+ logger.warning("No tweets provided for analysis")
227
+ return None
228
+
229
+ try:
230
+ prompt = self._prepare_prompt(tweets)
231
+ analysis_data = await self._call_gemini(prompt)
232
+
233
+ # Process and extract data with proper error handling
234
+ key_developments = self._process_key_developments(analysis_data.get("key_developments", []))
235
+
236
+ # Format complex nested structures if present
237
+ reliability_assessment = self._format_reliability_assessment(
238
+ analysis_data.get("reliability_assessment", "No reliability assessment provided")
239
+ )
240
+
241
+ regional_implications = self._format_regional_implications(
242
+ analysis_data.get("regional_implications", "No regional implications provided")
243
+ )
244
+
245
+ # Extract tension rationale if available
246
+ tension_info = analysis_data.get("tension_level", "Low")
247
+ tension_rationale = analysis_data.get("tension_rationale", "")
248
+
249
+ # Combine tension level and rationale if both exist
250
+ if tension_rationale:
251
+ tension_display = f"{tension_info} - {tension_rationale}"
252
+ else:
253
+ tension_display = tension_info
254
+
255
+ # Get the latest status
256
+ latest_status = analysis_data.get("latest_status", "No recent status update available")
257
+
258
+ analysis = ConflictAnalysis(
259
+ analysis_id=str(uuid.uuid4()),
260
+ generated_at=datetime.now(),
261
+ situation_summary=analysis_data.get("situation_summary", "No summary provided"),
262
+ key_developments=key_developments,
263
+ reliability_assessment=reliability_assessment,
264
+ regional_implications=regional_implications,
265
+ tension_level=self._extract_tension_level(tension_display),
266
+ source_tweets=tweets,
267
+ update_triggered_by=trigger,
268
+ latest_status=latest_status # Added new parameter
269
+ )
270
+
271
+ logger.info(f"Generated conflict analysis with ID: {analysis.analysis_id}")
272
+ return analysis
273
+
274
+ except RetryError as e:
275
+ logger.error(f"Failed to generate analysis after multiple retries: {str(e)}")
276
+ return None
277
+ except Exception as e:
278
+ logger.error(f"Unexpected error in tweet analysis: {str(e)}")
279
+ return None
280
+
281
+ def get_search_keywords(self) -> List[str]:
282
+ """Get the current search keywords."""
283
+ return self.search_keywords
284
+
285
+ def update_search_keywords(self, keywords: List[str]) -> None:
286
+ """Update the search keywords."""
287
+ self.search_keywords = keywords
288
+ logger.info(f"Updated search keywords. New count: {len(keywords)}")
app.py ADDED
@@ -0,0 +1,237 @@
1
+ import asyncio
2
+ import os
3
+ from datetime import datetime
4
+ from typing import Dict, List, Optional
5
+
6
+ from dotenv import load_dotenv
7
+ from fastapi import BackgroundTasks, Depends, FastAPI, HTTPException, status
8
+ from fastapi.middleware.cors import CORSMiddleware
9
+ from loguru import logger
10
+
11
+ from analysis_service import AnalysisService
12
+ from models import (ConflictAnalysis, HealthCheck, NewsSource, TensionLevel,
13
+ Tweet, UpdateRequest)
14
+ from twitter_service import TwitterService
15
+
16
+ # Load environment variables from .env file
17
+ load_dotenv()
18
+
19
+ # Configure logging
20
+ os.makedirs("logs", exist_ok=True)
21
+ logger.add("logs/app.log", rotation="500 MB", level=os.getenv("LOG_LEVEL", "INFO"))
22
+
23
+ # Create FastAPI application
24
+ app = FastAPI(
25
+ title="WesternFront API",
26
+ description="AI-powered conflict tracker for monitoring India-Pakistan tensions",
27
+ version="1.0.0"
28
+ )
29
+
30
+ # Add CORS middleware
31
+ app.add_middleware(
32
+ CORSMiddleware,
33
+ allow_origins=["*"], # Adjust this for production
34
+ allow_credentials=True,
35
+ allow_methods=["*"],
36
+ allow_headers=["*"],
37
+ )
38
+
39
+ # Services
40
+ twitter_service = TwitterService()
41
+ analysis_service = AnalysisService()
42
+
43
+ # In-memory store for latest analysis
44
+ latest_analysis: Optional[ConflictAnalysis] = None
45
+ last_update_time: Optional[datetime] = None
46
+
47
+
48
+ async def get_twitter_service() -> TwitterService:
49
+ """Dependency to get the Twitter service."""
50
+ return twitter_service
51
+
52
+
53
+ async def get_analysis_service() -> AnalysisService:
54
+ """Dependency to get the Analysis service."""
55
+ return analysis_service
56
+
57
+
58
+ @app.on_event("startup")
59
+ async def startup_event():
60
+ """Initialize services on startup."""
61
+ logger.info("Starting up WesternFront API")
62
+
63
+ # Initialize Twitter service
64
+ initialized = await twitter_service.initialize()
65
+ if not initialized:
66
+ logger.warning("Twitter service initialization failed. Some features may not work.")
67
+
68
+ # Schedule the first update; a BackgroundTasks() built here would never run
+ # outside a request context, so kick it off as an asyncio task instead
+ asyncio.create_task(update_analysis_task("startup"))
71
+
72
+ # Set up periodic update task
73
+ asyncio.create_task(periodic_update())
74
+
75
+
76
+ @app.on_event("shutdown")
77
+ async def shutdown_event():
78
+ """Clean up resources on shutdown."""
79
+ logger.info("Shutting down WesternFront API")
80
+ if twitter_service and twitter_service.nitter_bypass:
+ await twitter_service.nitter_bypass.close()
82
+
83
+
84
+ async def update_analysis_task(trigger: str = "scheduled") -> None:
85
+ """Task to update the conflict analysis."""
86
+ global latest_analysis, last_update_time
87
+
88
+ try:
89
+ logger.info(f"Starting analysis update ({trigger})")
90
+
91
+ # Get tweets related to India-Pakistan conflict
92
+ keywords = analysis_service.get_search_keywords()
93
+ tweets = await twitter_service.get_related_tweets(keywords, days_back=2)
94
+
95
+ if not tweets:
96
+ logger.warning("No relevant tweets found for analysis")
97
+ return
98
+
99
+ logger.info(f"Found {len(tweets)} relevant tweets for analysis")
100
+
101
+ # Analyze tweets
102
+ analysis = await analysis_service.analyze_tweets(tweets, trigger)
103
+
104
+ if analysis:
105
+ latest_analysis = analysis
106
+ last_update_time = datetime.now()
107
+ logger.info(f"Analysis updated successfully. Tension level: {analysis.tension_level}")
108
+ else:
109
+ logger.error("Failed to generate analysis")
110
+
111
+ except Exception as e:
112
+ logger.error(f"Error in update_analysis_task: {str(e)}")
113
+
114
+
115
+ async def periodic_update() -> None:
116
+ """Periodically update the analysis."""
117
+ update_interval = int(os.getenv("UPDATE_INTERVAL_MINUTES", 60))
118
+
119
+ while True:
120
+ try:
121
+ await asyncio.sleep(update_interval * 60) # Convert to seconds
122
+ await update_analysis_task("scheduled")
123
+ except Exception as e:
124
+ logger.error(f"Error in periodic_update: {str(e)}")
125
+ await asyncio.sleep(300) # Wait 5 minutes if there was an error
126
+
127
+
128
+ @app.get("/", response_model=Dict)
129
+ async def root():
130
+ """Root endpoint with basic information about the API."""
131
+ return {
132
+ "name": "WesternFront API",
133
+ "description": "AI-powered conflict tracker for India-Pakistan tensions",
134
+ "version": "1.0.0"
135
+ }
136
+
137
+
138
+ @app.get("/health", response_model=HealthCheck)
139
+ async def health_check():
140
+ """Health check endpoint."""
141
+ # nitter_bypass is set once TwitterService.initialize() has completed
+ twitter_initialized = twitter_service.nitter_bypass is not None
142
+ gemini_initialized = analysis_service.model is not None
143
+
144
+ return HealthCheck(
145
+ status="healthy",
146
+ version="1.0.0",
147
+ timestamp=datetime.now(),
148
+ last_update=last_update_time,
149
+ components_status={
150
+ "twitter_service": twitter_initialized,
151
+ "analysis_service": gemini_initialized
152
+ }
153
+ )
154
+
155
+
156
+ @app.get("/analysis", response_model=Optional[ConflictAnalysis])
157
+ async def get_latest_analysis():
158
+ """Get the latest conflict analysis."""
159
+ if not latest_analysis:
160
+ raise HTTPException(
161
+ status_code=status.HTTP_404_NOT_FOUND,
162
+ detail="No analysis available yet. Try triggering an update."
163
+ )
164
+ return latest_analysis
165
+
166
+
167
+ @app.post("/analysis/update", response_model=Dict)
168
+ async def trigger_update(
169
+ request: UpdateRequest,
170
+ background_tasks: BackgroundTasks
171
+ ):
172
+ """Trigger an analysis update."""
173
+ if request.force:
174
+ # Clear cache to force fresh tweets
175
+ twitter_service.tweet_cache.clear()
176
+
177
+ # Add update task to background tasks
178
+ background_tasks.add_task(update_analysis_task, "manual")
179
+
180
+ return {
181
+ "message": "Analysis update triggered",
182
+ "timestamp": datetime.now(),
183
+ "force_refresh": request.force
184
+ }
185
+
186
+
187
+ @app.get("/sources", response_model=List[NewsSource])
188
+ async def get_news_sources(
189
+ twitter: TwitterService = Depends(get_twitter_service)
190
+ ):
191
+ """Get the current list of news sources."""
192
+ return twitter.get_sources()
193
+
194
+
195
+ @app.post("/sources", response_model=Dict)
196
+ async def update_news_sources(
197
+ sources: List[NewsSource],
198
+ twitter: TwitterService = Depends(get_twitter_service)
199
+ ):
200
+ """Update the list of news sources."""
201
+ twitter.update_sources(sources)
202
+ return {
203
+ "message": "News sources updated",
204
+ "count": len(sources)
205
+ }
206
+
207
+
208
+ @app.get("/keywords", response_model=List[str])
209
+ async def get_search_keywords(
210
+ analysis: AnalysisService = Depends(get_analysis_service)
211
+ ):
212
+ """Get the current search keywords."""
213
+ return analysis.get_search_keywords()
214
+
215
+
216
+ @app.post("/keywords", response_model=Dict)
217
+ async def update_search_keywords(
218
+ keywords: List[str],
219
+ analysis: AnalysisService = Depends(get_analysis_service)
220
+ ):
221
+ """Update the search keywords."""
222
+ analysis.update_search_keywords(keywords)
223
+ return {
224
+ "message": "Search keywords updated",
225
+ "count": len(keywords)
226
+ }
227
+
228
+
229
+ @app.get("/tension-levels", response_model=List[str])
230
+ async def get_tension_levels():
231
+ """Get the available tension levels."""
232
+ return [level.value for level in TensionLevel]
233
+
234
+
235
+ if __name__ == "__main__":
236
+ import uvicorn
237
+ uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
docker-compose.yml ADDED
@@ -0,0 +1,20 @@
1
+ version: '3'
2
+
3
+ services:
4
+ westernfront-api:
5
+ build:
6
+ context: .
7
+ dockerfile: Dockerfile
8
+ ports:
9
+ - "8000:8000"
10
+ volumes:
11
+ - ./logs:/app/logs
12
+ env_file:
13
+ - .env
14
+ restart: unless-stopped
15
+ healthcheck:
16
+ test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
17
+ interval: 30s
18
+ timeout: 10s
19
+ retries: 3
20
+ start_period: 10s
models.py ADDED
@@ -0,0 +1,68 @@
1
+ from datetime import datetime
2
+ from enum import Enum
3
+ from typing import Dict, List, Optional
4
+
5
+ from pydantic import BaseModel, Field
6
+
7
+
8
+ class TensionLevel(str, Enum):
9
+ """Enum for tension levels between India and Pakistan."""
10
+ LOW = "Low"
11
+ MEDIUM = "Medium"
12
+ HIGH = "High"
13
+ CRITICAL = "Critical"
14
+
15
+
16
+ class NewsSource(BaseModel):
17
+ """Model for a news source."""
18
+ name: str
19
+ twitter_handle: str
20
+ country: str
21
+ reliability_score: float = Field(ge=0.0, le=1.0)
22
+ is_active: bool = True
23
+
24
+
25
+ class Tweet(BaseModel):
26
+ """Model for a tweet."""
27
+ id: str
28
+ text: str
29
+ author: str
30
+ created_at: datetime
31
+ engagement: Dict[str, int] = {"likes": 0, "retweets": 0, "replies": 0, "views": 0}
32
+ url: str
33
+
34
+
35
+ class KeyDevelopment(BaseModel):
36
+ """Model for a key development in the conflict."""
37
+ title: str
38
+ description: str
39
+ sources: List[str]
40
+ timestamp: Optional[datetime] = None
41
+
42
+
43
+ class ConflictAnalysis(BaseModel):
44
+ """Model for a conflict analysis."""
45
+ analysis_id: str
46
+ generated_at: datetime
47
+ latest_status: str # Added this field
48
+ situation_summary: str
49
+ key_developments: List[KeyDevelopment]
50
+ reliability_assessment: str
51
+ regional_implications: str
52
+ tension_level: TensionLevel
53
+ source_tweets: List[Tweet]
54
+ update_triggered_by: str
55
+
56
+
57
+ class UpdateRequest(BaseModel):
58
+ """Model for an update request."""
59
+ force: bool = False
60
+
61
+
62
+ class HealthCheck(BaseModel):
63
+ """Model for a health check response."""
64
+ status: str
65
+ version: str
66
+ timestamp: datetime
67
+ last_update: Optional[datetime] = None
68
+ components_status: Dict[str, bool]
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ fastapi==0.103.1
2
+ uvicorn[standard]==0.23.2
3
+ python-dotenv==1.0.0
4
+ loguru==0.7.0
5
+ google-generativeai==0.3.0
6
+ tenacity==8.2.2
7
+ cachetools==5.3.0
8
+ pydantic==2.3.0
9
+ httpx[http2]==0.24.1  # h2 extra needed: twitter_service creates clients with http2=True
10
+ beautifulsoup4==4.12.2
twitter_service.py ADDED
@@ -0,0 +1,945 @@
1
+ import asyncio
2
+ import json
3
+ import os
4
+ import re
5
+ import time
6
+ import random
7
+ from datetime import datetime, timedelta
8
+ from typing import Dict, List, Optional, Tuple
9
+ from urllib.parse import urlparse, quote
10
+
11
+ import httpx
12
+ from bs4 import BeautifulSoup
13
+ from cachetools import TTLCache
14
+ from fastapi import HTTPException
15
+ from loguru import logger
16
+
17
+ from models import NewsSource, Tweet
18
+
19
+
20
+ class FingerprintRandomizer:
21
+ """Randomizes browser fingerprints to evade detection"""
22
+
23
+ def __init__(self):
24
+ # Common screen resolutions
25
+ self.resolutions = [
26
+ (1920, 1080), (1366, 768), (1280, 720),
27
+ (1440, 900), (1536, 864), (2560, 1440),
28
+ (1680, 1050), (1920, 1200), (1024, 768)
29
+ ]
30
+
31
+ # Common color depths
32
+ self.color_depths = [24, 30, 32]
33
+
34
+ # Common platforms
35
+ self.platforms = [
36
+ "Win32", "MacIntel", "Linux x86_64",
37
+ "Linux armv8l", "iPhone", "iPad"
38
+ ]
39
+
40
+ # Browser variants
41
+ self.browsers = ["Chrome", "Firefox", "Safari", "Edge"]
42
+
43
+ # Common languages
44
+ self.languages = [
45
+ "en-US", "en-GB", "en-CA", "fr-FR", "de-DE",
46
+ "es-ES", "it-IT", "pt-BR", "ja-JP", "zh-CN"
47
+ ]
48
+
49
+ # Common timezone offsets
50
+ self.timezone_offsets = [-60, -120, -180, -240, 0, 60, 120, 180, 330, 480, 540]
51
+
52
+ def generate_headers(self):
53
+ """Generate randomized headers that mimic a real browser"""
54
+ browser = random.choice(self.browsers)
55
+ platform = random.choice(self.platforms)
56
+ language = random.choice(self.languages)
57
+
58
+ user_agent = self._generate_user_agent(browser, platform)
59
+
60
+ headers = {
61
+ "User-Agent": user_agent,
62
+ "Accept": self._generate_accept_header(browser),
63
+ "Accept-Language": f"{language},en;q=0.9",
64
+ "Accept-Encoding": "gzip, deflate, br",
65
+ "Connection": "keep-alive",
66
+ }
67
+
68
+ # Add browser-specific headers
69
+ if browser == "Chrome" or browser == "Edge":
70
+ headers["sec-ch-ua"] = f'"Google Chrome";v="{random.randint(90, 110)}", "Chromium";v="{random.randint(90, 110)}"'
71
+ headers["sec-ch-ua-mobile"] = "?0"
72
+ headers["sec-ch-ua-platform"] = f'"{platform}"'
73
+
74
+ # Randomize header order (matters for fingerprinting)
75
+ return dict(sorted(headers.items(), key=lambda x: random.random()))
76
+
77
+ def _generate_user_agent(self, browser, platform):
78
+ """Generate a realistic user agent string"""
79
+ if browser == "Chrome":
80
+ chrome_version = f"{random.randint(90, 110)}.0.{random.randint(1000, 9999)}.{random.randint(10, 999)}"
81
+ if "Win" in platform:
82
+ return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
83
+ elif "Mac" in platform:
84
+ return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{random.randint(11, 15)}_{random.randint(1, 7)}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
85
+ else:
86
+ return f"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_version} Safari/537.36"
87
+ elif browser == "Firefox":
88
+ ff_version = f"{random.randint(80, 100)}.0"
89
+ if "Win" in platform:
90
+ return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
91
+ elif "Mac" in platform:
92
+ return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.{random.randint(11, 15)}; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
93
+ else:
94
+ return f"Mozilla/5.0 (X11; Linux i686; rv:{ff_version}) Gecko/20100101 Firefox/{ff_version}"
95
+ elif browser == "Safari":
96
+ webkit_version = f"605.1.{random.randint(1, 15)}"
97
+ safari_version = f"{random.randint(13, 16)}.{random.randint(0, 1)}"
98
+ return f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{random.randint(11, 15)}_{random.randint(1, 7)}) AppleWebKit/{webkit_version} (KHTML, like Gecko) Version/{safari_version} Safari/{webkit_version}"
99
+ elif browser == "Edge":
100
+ edge_version = f"{random.randint(90, 110)}.0.{random.randint(1000, 9999)}.{random.randint(10, 999)}"
101
+ return f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{edge_version} Safari/537.36 Edg/{edge_version}"
102
+
103
+ def _generate_accept_header(self, browser):
104
+ """Generate browser-specific Accept header"""
105
+ if browser == "Chrome" or browser == "Edge":
106
+ return "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
107
+ elif browser == "Firefox":
108
+ return "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
109
+ elif browser == "Safari":
110
+ return "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
111
+
112
+
113
+ class CookieManager:
114
+ """Intelligently manages cookies to maintain sessions"""
115
+
116
+ def __init__(self):
117
+ self.cookies_by_domain = {}
118
+ self.cookie_jar_path = os.path.join(os.path.dirname(__file__), '.cookie_store')
119
+ os.makedirs(self.cookie_jar_path, exist_ok=True)
120
+ self.load_cookies()
121
+
122
+ def load_cookies(self):
123
+ """Load cookies from storage"""
124
+ try:
125
+ for filename in os.listdir(self.cookie_jar_path):
126
+ if filename.endswith('.json'):
127
+ domain = filename[:-5] # Remove .json
128
+ file_path = os.path.join(self.cookie_jar_path, filename)
129
+ with open(file_path, 'r') as f:
130
+ try:
131
+ cookie_data = json.load(f)
132
+ self.cookies_by_domain[domain] = cookie_data
133
+ except json.JSONDecodeError:
134
+ logger.warning(f"Invalid cookie file for {domain}, skipping")
135
+ except Exception as e:
136
+ logger.error(f"Error loading cookies: {e}")
137
+
138
+ def save_cookies(self):
139
+ """Save cookies to storage"""
140
+ for domain, cookies in self.cookies_by_domain.items():
141
+ file_path = os.path.join(self.cookie_jar_path, f"{domain}.json")
142
+ try:
143
+ with open(file_path, 'w') as f:
144
+ json.dump(cookies, f)
145
+ except Exception as e:
146
+ logger.error(f"Error saving cookies for {domain}: {e}")
147
+
148
+ def update_cookies(self, url, response_cookies):
149
+ """Update cookies from a response"""
150
+ domain = urlparse(url).netloc
151
+
152
+ if domain not in self.cookies_by_domain:
153
+ self.cookies_by_domain[domain] = {}
154
+
155
+ # Update with new cookies
156
+ for name, value in response_cookies.items():
157
+ self.cookies_by_domain[domain][name] = value
158
+
159
+ # Save updated cookies
160
+ self.save_cookies()
161
+
162
+ def get_cookies_for_url(self, url):
163
+ """Get cookies for a specific URL"""
164
+ domain = urlparse(url).netloc
165
+ return self.cookies_by_domain.get(domain, {})
166
+
167
+ def clear_cookies_for_domain(self, domain):
168
+ """Clear cookies for a specific domain"""
169
+ if domain in self.cookies_by_domain:
170
+ del self.cookies_by_domain[domain]
171
+ file_path = os.path.join(self.cookie_jar_path, f"{domain}.json")
172
+ if os.path.exists(file_path):
173
+ os.remove(file_path)
174
+
175
+
176
+ class NitterBypass:
177
+ """Advanced Nitter rate limit bypass system"""
178
+
179
+ def __init__(self, fingerprint_randomizer, cookie_manager):
180
+ # Expanded list of Nitter instances for better rotation
181
+ self.instances = [
182
+ "https://nitter.net",
183
+ "https://nitter.lacontrevoie.fr",
184
+ "https://nitter.1d4.us",
185
+ "https://nitter.poast.org",
186
+ "https://nitter.unixfox.eu",
187
+ "https://nitter.kavin.rocks",
188
+ "https://nitter.privacydev.net",
189
+ "https://nitter.projectsegfau.lt",
190
+ "https://nitter.pussthecat.org",
191
+ "https://nitter.42l.fr",
192
+ "https://nitter.fdn.fr",
193
+ "https://nitter.cz",
194
+ "https://bird.habedieehre.com",
195
+ "https://tweet.lambda.dance",
196
+ "https://nitter.cutelab.space",
197
+ "https://nitter.fly.dev",
198
+ "https://notabird.site",
199
+ "https://nitter.weiler.dev",
200
+ "https://nitter.sethforprivacy.com",
201
+ "https://nitter.mask.sh",
202
+ "https://nitter.space",
203
+ "https://nitter.hu",
204
+ "https://nitter.moomoo.me",
205
+ "https://nitter.grimneko.de",
206
+ ]
207
+
208
+ self.fingerprint_randomizer = fingerprint_randomizer
209
+ self.cookie_manager = cookie_manager
210
+
211
+ # Tracking usage statistics per instance
212
+ self.usage_counts = {instance: 0 for instance in self.instances}
213
+ self.success_counts = {instance: 0 for instance in self.instances}
214
+ self.failure_counts = {instance: 0 for instance in self.instances}
215
+ self.response_times = {instance: [] for instance in self.instances}
216
+
217
+ # Track banned instances with timeout
218
+ self.banned_instances = set()
219
+ self.banned_time = {}
220
+ self.ban_duration = 3600 # Default 1 hour ban time
221
+
222
+ # Client collection, one per instance
223
+ self.clients = {}
224
+
225
+ # Request flow control
226
+ self.last_request_time = 0
227
+ self.min_request_interval = 2.0
228
+ self.request_jitter = True # Add random jitter to requests
229
+
230
+ # Dynamic proxy rotation (if available)
231
+ self.proxies = self._load_proxies()
232
+ self.proxy_index = 0
233
+
234
+ def _load_proxies(self):
235
+ """Load proxy list if available"""
236
+ proxies = []
237
+ try:
238
+ proxy_file = os.path.join(os.path.dirname(__file__), 'proxies.txt')
239
+ if os.path.exists(proxy_file):
240
+ with open(proxy_file, 'r') as f:
241
+ for line in f:
242
+ line = line.strip()
243
+ if line and not line.startswith('#'):
244
+ proxies.append(line)
245
+ logger.info(f"Loaded {len(proxies)} proxies")
246
+ except Exception as e:
247
+ logger.error(f"Error loading proxies: {e}")
248
+ return proxies
249
+
250
+ def _get_next_proxy(self):
251
+ """Get next proxy in rotation"""
252
+ if not self.proxies:
253
+ return None
254
+
255
+ proxy = self.proxies[self.proxy_index]
256
+ self.proxy_index = (self.proxy_index + 1) % len(self.proxies)
257
+ return proxy
258
+
259
+ async def initialize(self):
260
+ """Initialize Nitter bypass system"""
261
+ # Create clients for each instance
262
+ for instance in self.instances:
263
+ await self._initialize_client(instance)
264
+
265
+ # Test instances to determine which are working
266
+ await self._test_instances()
267
+
268
+ async def _initialize_client(self, instance):
269
+ """Create an HTTP client for an instance"""
270
+ headers = self.fingerprint_randomizer.generate_headers()
271
+
272
+ # Get proxy if available
273
+ proxy = self._get_next_proxy()
274
+ proxies = {"all://": proxy} if proxy else None
275
+
276
+ # Create client with unique settings for this instance
277
+ self.clients[instance] = httpx.AsyncClient(
278
+ timeout=30.0,
279
+ follow_redirects=True,
280
+ headers=headers,
281
+ http2=True,
282
+ limits=httpx.Limits(max_connections=5, max_keepalive_connections=2),
283
+ proxies=proxies
284
+ )
285
+
286
+ # Initialize with cookies if we have any
287
+ domain = urlparse(instance).netloc
288
+ cookies = self.cookie_manager.get_cookies_for_url(instance)
289
+ if cookies:
290
+ for name, value in cookies.items():
291
+ self.clients[instance].cookies.set(name, value, domain=domain)
292
+
293
+ async def _test_instances(self):
294
+ """Test all instances to check availability"""
295
+ for instance in self.instances:
296
+ try:
297
+ start_time = time.time()
298
+ client = self.clients[instance]
299
+
300
+ # Add custom parameter to avoid caches
301
+ params = {"_": str(int(time.time()))}
302
+
303
+ response = await client.get(f"{instance}/", params=params, timeout=5.0)
304
+ end_time = time.time()
305
+
306
+ if response.status_code == 200:
307
+ logger.debug(f"Instance {instance} is available, response time: {end_time - start_time:.2f}s")
308
+
309
+ # Update cookies from response
310
+ self.cookie_manager.update_cookies(instance, dict(client.cookies))
311
+
312
+ # Track response time for prioritization
313
+ self.response_times[instance].append(end_time - start_time)
314
+ if len(self.response_times[instance]) > 5:
315
+ self.response_times[instance].pop(0) # Keep only last 5 measurements
316
+
317
+ else:
318
+ logger.warning(f"Instance {instance} returned status {response.status_code}")
319
+ if response.status_code in [429, 403, 503]:
320
+ self.banned_instances.add(instance)
321
+ self.banned_time[instance] = time.time()
322
+ except Exception as e:
323
+ logger.warning(f"Instance {instance} test failed: {e}")
324
+ self.banned_instances.add(instance)
325
+ self.banned_time[instance] = time.time()
326
+
327
+ # Add delay between tests
328
+ await asyncio.sleep(random.uniform(0.5, 1.5))
329
+
330
+ def _get_best_instance(self):
331
+ """Select the best instance based on health metrics"""
332
+ now = time.time()
333
+
334
+ # Unban instances that have served their time
335
+ for instance in list(self.banned_instances):
336
+ if instance in self.banned_time and now - self.banned_time[instance] > self.ban_duration:
337
+ self.banned_instances.remove(instance)
338
+ logger.info(f"Unbanned instance {instance} after timeout")
339
+
340
+ # Get available instances
341
+ available = [i for i in self.instances if i not in self.banned_instances]
342
+ if not available:
343
+ # If all are banned, try the least recently banned one
344
+ if self.banned_time:
345
+ instance = min(self.banned_time.items(), key=lambda x: x[1])[0]
346
+ logger.warning(f"All instances banned, trying least recent: {instance}")
347
+ return instance
348
+ else:
349
+ # Fallback to any instance
350
+ return random.choice(self.instances)
351
+
352
+ # Calculate a health score for each instance
353
+ health_scores = {}
354
+ for instance in available:
355
+ # Base score
356
+ score = 100
357
+
358
+ # Adjust for success rate
359
+ total_requests = self.success_counts[instance] + self.failure_counts[instance]
360
+ if total_requests > 0:
361
+ success_rate = self.success_counts[instance] / total_requests
362
+ score *= (0.5 + 0.5 * success_rate) # Weight success rate as 50% of score
363
+
364
+ # Adjust for response time
365
+ if self.response_times[instance]:
366
+ avg_response_time = sum(self.response_times[instance]) / len(self.response_times[instance])
367
+ # Faster responses get higher scores (up to 1.5x bonus for fast responses)
368
+ speed_factor = min(1.5, max(0.5, 1.0 / (avg_response_time / 2)))
369
+ score *= speed_factor
370
+
371
+ # Adjust for usage count (prefer less used instances)
372
+ usage_penalty = min(0.9, 0.5 + 0.5 / (1 + self.usage_counts[instance] / 5))
373
+ score *= usage_penalty
374
+
375
+ health_scores[instance] = score
376
+
377
+ # Select from top 3 instances with probability weighted by health score
378
+ top_instances = sorted(health_scores.items(), key=lambda x: x[1], reverse=True)[:3]
379
+ if not top_instances:
380
+ return random.choice(available)
381
+
382
+ # Extract instances and scores
383
+ instances = [i[0] for i in top_instances]
384
+ scores = [i[1] for i in top_instances]
385
+
386
+ # Normalize scores for weighted random selection
387
+ total_score = sum(scores)
388
+ if total_score > 0:
389
+ probabilities = [score / total_score for score in scores]
390
+ chosen = random.choices(instances, weights=probabilities, k=1)[0]
391
+ else:
392
+ chosen = random.choice(instances)
393
+
394
+ # Update usage count
395
+ self.usage_counts[chosen] += 1
396
+
397
+ return chosen
398
+
399
+ async def request(self, path, params=None):
400
+ """Make an intelligent request to a Nitter instance"""
401
+ if params is None:
402
+ params = {}
403
+
404
+ # Add random parameter to avoid caching
405
+ params["_nonce"] = str(random.randint(10000, 99999999))
406
+
407
+ # Rate limiting with jitter
408
+ now = time.time()
409
+ since_last = now - self.last_request_time
410
+ if since_last < self.min_request_interval:
411
+ if self.request_jitter:
412
+ # Add jitter to make request patterns less predictable
413
+ jitter = random.uniform(1.0, 3.0)
414
+ delay = self.min_request_interval - since_last + jitter
415
+ else:
416
+ delay = self.min_request_interval - since_last
417
+ await asyncio.sleep(delay)
418
+
419
+ # Get the best instance
420
+ instance = self._get_best_instance()
421
+ client = self.clients[instance]
422
+
423
+ # Update headers with new fingerprint to avoid detection
424
+ client.headers.update(self.fingerprint_randomizer.generate_headers())
425
+
426
+ # Update cookies
427
+ domain = urlparse(instance).netloc
428
+ cookies = self.cookie_manager.get_cookies_for_url(instance)
429
+ for name, value in cookies.items():
430
+ client.cookies.set(name, value, domain=domain)
431
+
432
+ # Update request timestamp
433
+ self.last_request_time = time.time()
434
+
435
+ url = f"{instance}{path}"
436
+
437
+ try:
438
+ # Make the request with timing
439
+ start_time = time.time()
440
+ response = await client.get(url, params=params)
441
+ end_time = time.time()
442
+ response_time = end_time - start_time
443
+
444
+ # Update cookies from response
445
+ if len(response.cookies) > 0:
446
+ self.cookie_manager.update_cookies(url, dict(response.cookies))
447
+
448
+ # Update response time tracking
449
+ self.response_times[instance].append(response_time)
450
+ if len(self.response_times[instance]) > 5:
451
+ self.response_times[instance].pop(0)
452
+
453
+ # Handle response based on status code
454
+ if response.status_code == 200:
455
+ # Success
456
+ self.success_counts[instance] += 1
457
+ return response
458
+ elif response.status_code in [429, 403, 503]:
459
+ # Rate limited or banned
460
+ logger.warning(f"Rate limit detected on {instance}: {response.status_code}")
461
+ self.failure_counts[instance] += 1
462
+ self.banned_instances.add(instance)
463
+ self.banned_time[instance] = time.time()
464
+
465
+ # Different ban durations based on response
466
+ if response.status_code == 429: # Rate limited
467
+ self.ban_duration = min(self.ban_duration * 2, 7200) # Max 2 hour ban, increasing
468
+ else: # Other error
469
+ self.ban_duration = 1800 # 30 minute ban
470
+
471
+ # Retry with a different instance
472
+ return await self.request(path, params)
473
+ else:
474
+ # Other error
475
+ logger.error(f"Error with {instance}: HTTP {response.status_code}")
476
+ self.failure_counts[instance] += 1
477
+
478
+ # Don't immediately ban for non-rate-limit errors
479
+ if self.failure_counts[instance] > 3: # After 3 failures, ban temporarily
480
+ self.banned_instances.add(instance)
481
+ self.banned_time[instance] = time.time()
482
+ self.ban_duration = 900 # 15 minute ban
483
+
484
+ # Retry with a different instance
485
+ return await self.request(path, params)
486
+
487
+ except httpx.HTTPError as e:
488
+ logger.error(f"HTTP error with {instance}: {str(e)}")
489
+ self.failure_counts[instance] += 1
490
+
491
+ # Ban instance after HTTP errors
492
+ self.banned_instances.add(instance)
493
+ self.banned_time[instance] = time.time()
494
+
495
+ # Retry with a different instance
496
+ return await self.request(path, params)
497
+
498
+ except Exception as e:
499
+ logger.error(f"Error with {instance}: {str(e)}")
500
+ self.failure_counts[instance] += 1
501
+
502
+ # Ban instance after errors
503
+ self.banned_instances.add(instance)
504
+ self.banned_time[instance] = time.time()
505
+
506
+ # Retry with a different instance
507
+ return await self.request(path, params)
508
+
509
+ async def close(self):
510
+ """Close all HTTP clients"""
511
+ for client in self.clients.values():
512
+ await client.aclose()
513
+
514
+
515
+ class TwitterService:
516
+ """Service for collecting tweets via web scraping using Nitter alternative frontends."""
517
+
518
+ def __init__(self):
519
+ self.cache_expiry = int(os.getenv("CACHE_EXPIRY_MINUTES", 120))
520
+
521
+ # Initialize advanced components for rate limit bypass
522
+ self.fingerprint_randomizer = FingerprintRandomizer()
523
+ self.cookie_manager = CookieManager()
524
+ self.nitter_bypass = None # Will be initialized later
525
+
526
+ # Enhanced cache with TTL and persistence
527
+ self.tweet_cache_dir = os.path.join(os.path.dirname(__file__), ".tweet_cache")
528
+ os.makedirs(self.tweet_cache_dir, exist_ok=True)
529
+ self.in_memory_cache = TTLCache(maxsize=100, ttl=self.cache_expiry * 60)
530
+
531
+ # Statistics and monitoring
532
+ self.stats = {
533
+ "requests": 0,
534
+ "cache_hits": 0,
535
+ "rate_limits": 0,
536
+ "errors": 0,
537
+ "success": 0
538
+ }
539
+ self.last_stats_reset = time.time()
540
+
541
+ # Default trusted news sources - focused on India-Pakistan relations
542
+ self.news_sources = [
543
+ NewsSource(name="Shiv Aroor", twitter_handle="ShivAroor", country="India", reliability_score=0.85),
544
+ NewsSource(name="Sidhant Sibal", twitter_handle="sidhant", country="India", reliability_score=0.85),
545
+ NewsSource(name="Indian Air Force", twitter_handle="IAF_MCC", country="India", reliability_score=0.95),
546
+ NewsSource(name="Indian Army", twitter_handle="adgpi", country="India", reliability_score=0.95),
547
+ NewsSource(name="Indian Defence Ministry", twitter_handle="SpokespersonMoD", country="India", reliability_score=0.95),
548
+ NewsSource(name="MIB India", twitter_handle="MIB_India", country="India", reliability_score=0.95),
549
+ NewsSource(name="Indian External Affairs Minister", twitter_handle="DrSJaishankar", country="India", reliability_score=0.95),
550
+ ]
551
+
552
+ async def initialize(self) -> bool:
553
+ """Initialize the Twitter service."""
554
+ try:
555
+ logger.info("Initializing Twitter service with advanced bypass techniques")
556
+
557
+ # Initialize the Nitter bypass engine
558
+ self.nitter_bypass = NitterBypass(self.fingerprint_randomizer, self.cookie_manager)
559
+ await self.nitter_bypass.initialize()
560
+
561
+ # Schedule background health checks for instances
562
+ asyncio.create_task(self._background_maintenance())
563
+
564
+ logger.info("Twitter service initialized successfully with bypass capabilities")
565
+ return True
566
+
567
+ except Exception as e:
568
+ logger.error(f"Failed to initialize Twitter service: {str(e)}")
569
+ return False
570
+
571
+ async def _background_maintenance(self):
572
+ """Run background maintenance tasks"""
573
+ while True:
574
+ try:
575
+ # Wait between maintenance cycles
576
+ await asyncio.sleep(900) # 15 minutes
577
+
578
+ # Log statistics
579
+ self._log_statistics()
580
+
581
+ # Clean up cache files
582
+ self._cleanup_expired_cache()
583
+
584
+ # Reset statistics periodically
585
+ if time.time() - self.last_stats_reset > 3600: # Reset every hour
586
+ self.stats = {key: 0 for key in self.stats}
587
+ self.last_stats_reset = time.time()
588
+
589
+ except Exception as e:
590
+ logger.error(f"Error in background maintenance: {str(e)}")
591
+
592
+ def _log_statistics(self):
593
+ """Log service statistics"""
594
+ total_requests = max(1, self.stats["requests"])
595
+ cache_hit_rate = self.stats["cache_hits"] / total_requests * 100
596
+ error_rate = (self.stats["errors"] + self.stats["rate_limits"]) / total_requests * 100
597
+
598
+ logger.info(f"TwitterService stats - Requests: {total_requests}, " +
599
+ f"Cache hits: {self.stats['cache_hits']} ({cache_hit_rate:.1f}%), " +
600
+ f"Rate limits: {self.stats['rate_limits']}, " +
601
+ f"Errors: {self.stats['errors']} ({error_rate:.1f}%)")
602
+
603
+ def _cleanup_expired_cache(self):
604
+ """Clean up expired cache files"""
605
+ now = time.time()
606
+ expiry_time = self.cache_expiry * 60
607
+
608
+ try:
609
+ for filename in os.listdir(self.tweet_cache_dir):
610
+ if not filename.endswith('.json'):
611
+ continue
612
+
613
+ file_path = os.path.join(self.tweet_cache_dir, filename)
614
+
615
+ try:
616
+ file_modified_time = os.path.getmtime(file_path)
617
+ if now - file_modified_time > expiry_time:
618
+ os.remove(file_path)
619
+ logger.debug(f"Removed expired cache file: {filename}")
620
+ except Exception as e:
621
+ logger.error(f"Error cleaning up cache file {filename}: {e}")
622
+ except Exception as e:
623
+ logger.error(f"Error during cache cleanup: {e}")
624
+
625
+     def _get_cache_path(self, key):
+         """Get filesystem path for a cache key"""
+         # Create a safe filename from the cache key
+         safe_key = re.sub(r'[^a-zA-Z0-9_-]', '_', key)
+         return os.path.join(self.tweet_cache_dir, f"{safe_key}.json")
+
+     def _get_from_cache(self, cache_key):
+         """Get tweets from cache (memory or disk)"""
+         # Check memory cache first
+         if cache_key in self.in_memory_cache:
+             self.stats["cache_hits"] += 1
+             return self.in_memory_cache[cache_key]
+
+         # Check disk cache
+         cache_path = self._get_cache_path(cache_key)
+         if os.path.exists(cache_path):
+             try:
+                 with open(cache_path, 'r') as f:
+                     cache_data = json.load(f)
+
+                 # Check if cache is still valid
+                 if time.time() - cache_data['timestamp'] < self.cache_expiry * 60:
+                     # Convert dictionaries back to Tweet objects
+                     tweets = []
+                     for tweet_dict in cache_data['tweets']:
+                         # Parse created_at back to datetime if it's stored as a string
+                         if 'created_at' in tweet_dict and isinstance(tweet_dict['created_at'], str):
+                             try:
+                                 tweet_dict['created_at'] = datetime.fromisoformat(tweet_dict['created_at'])
+                             except ValueError:
+                                 tweet_dict['created_at'] = datetime.now()
+
+                         tweets.append(Tweet(**tweet_dict))
+
+                     # Restore to memory cache and return
+                     self.in_memory_cache[cache_key] = tweets
+                     self.stats["cache_hits"] += 1
+                     return tweets
+                 else:
+                     # Cache expired, remove file
+                     os.remove(cache_path)
+             except Exception as e:
+                 logger.error(f"Error reading cache file {cache_path}: {e}")
+
+         return None
+
+     def _save_to_cache(self, cache_key, tweets):
+         """Save tweets to cache (memory and disk)"""
+         # Save to memory cache
+         self.in_memory_cache[cache_key] = tweets
+
+         # Convert tweets to dictionaries for JSON serialization
+         tweet_dicts = []
+         for tweet in tweets:
+             tweet_dicts.append({
+                 'id': tweet.id,
+                 'text': tweet.text,
+                 'author': tweet.author,
+                 'created_at': tweet.created_at.isoformat() if hasattr(tweet.created_at, 'isoformat') else str(tweet.created_at),
+                 'engagement': tweet.engagement,
+                 'url': tweet.url
+             })
+
+         # Save to disk cache
+         cache_path = self._get_cache_path(cache_key)
+         try:
+             with open(cache_path, 'w') as f:
+                 json.dump({
+                     'tweets': tweet_dicts,
+                     'timestamp': time.time()
+                 }, f)
+         except Exception as e:
+             logger.error(f"Error writing to cache file {cache_path}: {e}")
+
+     async def get_tweets_from_source(self, source: NewsSource, limit: int = 20, retries: int = 3) -> List[Tweet]:
+         """Get tweets from a specific Twitter source using advanced bypass techniques."""
+         cache_key = f"{source.twitter_handle}_{limit}"
+
+         # Check cache first
+         cached_tweets = self._get_from_cache(cache_key)
+         if cached_tweets:
+             logger.debug(f"Returning cached tweets for {source.twitter_handle}")
+             return cached_tweets
+
+         self.stats["requests"] += 1
+
+         # Extract tweets with retry logic
+         all_attempts = retries + 1
+         tweets = []
+
+         for attempt in range(all_attempts):
+             try:
+                 logger.info(f"Fetching tweets from {source.twitter_handle} (attempt {attempt + 1}/{all_attempts})")
+
+                 # Build path with randomization to avoid caching patterns
+                 path = f"/{source.twitter_handle}"
+                 params = {
+                     "f": "tweets",  # Filter to tweets only
+                     "r": str(random.randint(10000, 99999))  # Random param to bypass caches
+                 }
+
+                 # Get the response using our bypass system
+                 response = await self.nitter_bypass.request(path, params)
+
+                 if response.status_code == 200:
+                     # Success - extract tweets
+                     self.stats["success"] += 1
+
+                     # Parse the HTML
+                     soup = BeautifulSoup(response.text, "html.parser")
+
+                     # Find tweet containers
+                     tweet_containers = soup.select(".timeline-item")
+
+                     for container in tweet_containers[:limit]:
+                         try:
+                             # Extract tweet ID from the permalink
+                             permalink_element = container.select_one(".tweet-link")
+                             if not permalink_element:
+                                 continue
+
+                             permalink = permalink_element.get("href", "")
+                             tweet_id = permalink.split("/")[-1]
+
+                             # Extract tweet text
+                             text_element = container.select_one(".tweet-content")
+                             tweet_text = text_element.get_text().strip() if text_element else ""
+
+                             # Extract timestamp
+                             time_element = container.select_one(".tweet-date")
+                             timestamp = time_element.find("a").get("title") if time_element and time_element.find("a") else None
+
+                             if timestamp:
+                                 try:
+                                     created_at = datetime.strptime(timestamp, "%d/%m/%Y, %H:%M:%S")
+                                 except ValueError:
+                                     created_at = datetime.now()
+                             else:
+                                 created_at = datetime.now()
+
+                             # Extract engagement metrics
+                             stats_container = container.select_one(".tweet-stats")
+                             engagement = {"likes": 0, "retweets": 0, "replies": 0, "views": 0}
+
+                             if stats_container:
+                                 stats = stats_container.select(".icon-container")
+                                 for stat in stats:
+                                     stat_text = stat.get_text().strip()
+                                     if "retweet" in stat.get("class", []):
+                                         engagement["retweets"] = self._parse_count(stat_text)
+                                     elif "heart" in stat.get("class", []):
+                                         engagement["likes"] = self._parse_count(stat_text)
+                                     elif "comment" in stat.get("class", []):
+                                         engagement["replies"] = self._parse_count(stat_text)
+
+                             tweet_url = f"https://x.com/{source.twitter_handle}/status/{tweet_id}"
+
+                             tweets.append(
+                                 Tweet(
+                                     id=tweet_id,
+                                     text=tweet_text,
+                                     author=source.twitter_handle,
+                                     created_at=created_at,
+                                     engagement=engagement,
+                                     url=tweet_url
+                                 )
+                             )
+                         except Exception as e:
+                             logger.error(f"Error processing tweet from {source.twitter_handle}: {str(e)}")
+
+                     # Cache the results
+                     if tweets:
+                         self._save_to_cache(cache_key, tweets)
+                         logger.info(f"Fetched and cached {len(tweets)} tweets from {source.twitter_handle}")
+
+                     return tweets
+
+                 elif response.status_code == 429:
+                     # Rate limited
+                     self.stats["rate_limits"] += 1
+                     logger.warning(f"Rate limited (429) when fetching tweets from {source.twitter_handle}")
+
+                     if attempt < retries:
+                         backoff_time = min(30 * (2 ** attempt), 300)  # Exponential backoff, max 5 minutes
+                         logger.info(f"Retrying in {backoff_time}s...")
+                         await asyncio.sleep(backoff_time)
+                     else:
+                         logger.error(f"Failed to fetch tweets from {source.twitter_handle} after {retries} retries: HTTP 429")
+                         return []
+
+                 else:
+                     # Other error
+                     self.stats["errors"] += 1
+                     logger.error(f"Failed to fetch tweets from {source.twitter_handle}: HTTP {response.status_code}")
+
+                     if attempt < retries:
+                         await asyncio.sleep(5)
+                         continue
+                     else:
+                         return []
+
+             except Exception as e:
+                 self.stats["errors"] += 1
+                 logger.error(f"Error fetching tweets from {source.twitter_handle}: {str(e)}")
+
+                 if attempt < retries:
+                     await asyncio.sleep(5)
+                     continue
+
+         return []  # Return empty list if all retries failed
+
+     def _parse_count(self, count_text: str) -> int:
+         """Parse count text like '1.2K' into integer value."""
+         try:
+             count_text = count_text.strip()
+             if not count_text:
+                 return 0
+
+             if 'K' in count_text:
+                 return int(float(count_text.replace('K', '')) * 1000)
+             elif 'M' in count_text:
+                 return int(float(count_text.replace('M', '')) * 1000000)
+             else:
+                 return int(count_text)
+         except (ValueError, TypeError):
+             return 0
+
+     async def get_related_tweets(self, keywords: List[str], days_back: int = 2) -> List[Tweet]:
+         """
+         Get tweets related to specific keywords from trusted news sources only.
+         Uses intelligent batching and failover strategies.
+         """
+         all_tweets = []
+         cutoff_date = datetime.now() - timedelta(days=days_back)
+
+         # Process sources in smaller batches with smart ordering
+         active_sources = [source for source in self.news_sources if source.is_active]
+
+         # Sort sources by reliability score (prioritize higher scores)
+         active_sources.sort(key=lambda s: s.reliability_score, reverse=True)
+
+         # Dynamic batch size - larger when we have fewer sources to optimize throughput
+         source_count = len(active_sources)
+         batch_size = max(1, min(3, 10 // source_count if source_count > 0 else 3))
+
+         logger.info(f"Collecting tweets from {len(active_sources)} trusted news sources")
+
+         for i in range(0, len(active_sources), batch_size):
+             batch_sources = active_sources[i:i+batch_size]
+
+             # Process batch with smart concurrency
+             tasks = []
+             for source in batch_sources:
+                 # Adaptive limit based on source reliability
+                 fetch_limit = int(50 * min(1.5, source.reliability_score))
+                 tasks.append(self.get_tweets_from_source(source, limit=fetch_limit))
+
+             source_tweets_list = await asyncio.gather(*tasks)
+
+             # Process batch results
+             batch_tweets = []
+             for source_tweets in source_tweets_list:
+                 # Filter tweets by keywords and date
+                 for tweet in source_tweets:
+                     if (tweet.created_at >= cutoff_date and
+                             any(keyword.lower() in tweet.text.lower() for keyword in keywords)):
+                         batch_tweets.append(tweet)
+
+             all_tweets.extend(batch_tweets)
+
+             # Dynamic delay between batches based on results
+             # If we got fewer tweets than expected, slow down more
+             batch_delay = random.uniform(2.0, 5.0)
+             if len(batch_tweets) < batch_size * 3:  # Fewer than 3 tweets per source
+                 batch_delay += random.uniform(3.0, 7.0)  # Add extra delay
+
+             await asyncio.sleep(batch_delay)
+
+         # If we have very few results, try with more relaxed filtering
+         if len(all_tweets) < 5 and active_sources:
+             logger.info("Few relevant tweets found, trying more relaxed filtering")
+
+             # Take top 3 most reliable sources
+             key_sources = active_sources[:min(3, len(active_sources))]
+             tasks = [self.get_tweets_from_source(source, limit=100, retries=5) for source in key_sources]
+             more_tweets_list = await asyncio.gather(*tasks)
+
+             # Process with more relaxed keyword matching
+             for source_tweets in more_tweets_list:
+                 for tweet in source_tweets:
+                     # Use partial keyword matching
+                     if tweet.created_at >= cutoff_date:
+                         for keyword in keywords:
+                             # Split keyword into parts and check if any part matches
+                             keyword_parts = keyword.lower().split()
+                             if any(part in tweet.text.lower() for part in keyword_parts if len(part) > 3):
+                                 if tweet.id not in [t.id for t in all_tweets]:
+                                     all_tweets.append(tweet)
+                                 break
+
+         # Sort by recency
+         all_tweets.sort(key=lambda x: x.created_at, reverse=True)
+
+         logger.info(f"Found {len(all_tweets)} tweets from trusted sources related to keywords: {keywords}")
+         return all_tweets
+
+     def update_sources(self, sources: List[NewsSource]) -> None:
+         """Update the list of trusted news sources."""
+         self.news_sources = sources
+         # Clear cache when sources are updated
+         self.in_memory_cache.clear()
+         logger.info(f"Updated trusted news sources. New count: {len(sources)}")
+
+     def get_sources(self) -> List[NewsSource]:
+         """Get the current list of trusted news sources."""
+         return self.news_sources
+
+     async def close(self):
+         """Clean up resources."""
+         if self.nitter_bypass:
+             await self.nitter_bypass.close()
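
For orientation, here is a minimal, hypothetical usage sketch of the service above. It assumes the class is named `TwitterService` (as the statistics log message suggests), that it can be constructed without arguments, and that `initialize()`, `get_related_tweets()`, and `close()` are the public entry points; the constructor signature and import path are not confirmed by this diff.

```python
import asyncio

from twitter_service import TwitterService  # assumption: the class defined in this file


async def main():
    service = TwitterService()  # assumption: no-argument constructor

    # initialize() returns False (rather than raising) when setup fails
    if not await service.initialize():
        print("Twitter service failed to initialize")
        return

    try:
        # Keywords are illustrative; the app supplies its own tracked terms
        tweets = await service.get_related_tweets(
            keywords=["India Pakistan", "ceasefire", "LoC"],
            days_back=2,
        )
        for tweet in tweets[:5]:
            print(tweet.created_at, tweet.author, tweet.text[:80], tweet.url)
    finally:
        # Releases the underlying NitterBypass client
        await service.close()


if __name__ == "__main__":
    asyncio.run(main())
```
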
vercel.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "version": 2,
+   "builds": [
+     {
+       "src": "/app.py",
+       "use": "@vercel/python"
+     }
+   ],
+   "routes": [
+     {
+       "src": "/(.*)",
+       "dest": "/app.py"
+     }
+   ]
+ }
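
A brief note on this config: the single `builds` entry tells Vercel to build `app.py` with the `@vercel/python` runtime, and the catch-all `routes` rule forwards every incoming path to that file, leaving all URL routing to the FastAPI application itself. This assumes `app.py` exposes the FastAPI instance as a module-level variable that the Python runtime can discover.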