Update app.py
app.py
CHANGED
@@ -18,12 +18,10 @@ class WebScrapingTool:
     def __init__(self):
         self.client = None
         self.system_prompt = """You are a specialized web data extraction assistant. Your core purpose is to browse and analyze the content of web pages based on user instructions, and return structured or unstructured information from the provided URL. Your capabilities include:
-
 1. Navigating and reading web page content from a given URL.
 2. Extracting textual content including headings, paragraphs, lists, and metadata.
 3. Identifying and extracting HTML tables and presenting them in a clean, structured format.
 4. Creating new, custom tables based on user queries by processing, reorganizing, or filtering the content found on the source page.
-
 You must always follow these guidelines:
 - Accurately extract and summarize both structured (tables, lists) and unstructured (paragraphs, articles) content.
 - Clearly separate different types of data (e.g., summaries, tables, bullet points).
@@ -37,11 +35,8 @@ You must always follow these guidelines:
 - Include only the relevant columns as per the user request.
 - Sort, filter, and reorganize data accordingly.
 - Use clear and consistent headers.
-
 You must not hallucinate or infer data not present on the page. If content is missing, unclear, or restricted, say so explicitly.
-
 Always respond based on the actual content from the provided link. If the page fails to load or cannot be accessed, inform the user immediately.
-
 Your role is to act as an intelligent browser and data interpreter — able to read and reshape any web content to meet user needs."""
 
     def setup_client(self, api_key):
@@ -59,11 +54,11 @@ Your role is to act as an intelligent browser and data interpreter — able to r
         """Create a robust session with retry strategy and proper headers"""
         session = requests.Session()
 
-        # Define retry strategy
+        # Define retry strategy with fixed parameter name
         retry_strategy = Retry(
             total=3,
             status_forcelist=[429, 500, 502, 503, 504],
-            method_whitelist=["HEAD", "GET", "OPTIONS"],
+            allowed_methods=["HEAD", "GET", "OPTIONS"],  # Fixed: changed from method_whitelist
             backoff_factor=1
         )
 
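The rename above tracks urllib3, which deprecated `method_whitelist` in favor of `allowed_methods` in 1.26 and removed the old name in 2.0. The hunk shows only the `Retry` construction; as a minimal sketch of where such a strategy usually takes effect (the `HTTPAdapter` mount step is assumed here, since this commit does not show it):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session() -> requests.Session:
    """Sketch: a session whose adapters retry transient failures with backoff."""
    session = requests.Session()
    retry_strategy = Retry(
        total=3,                                     # at most 3 retries per request
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these status codes
        allowed_methods=["HEAD", "GET", "OPTIONS"],  # urllib3 >= 1.26 spelling
        backoff_factor=1,                            # exponential backoff between tries
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)   # adapter handles every http(s) request
    session.mount("https://", adapter)
    return session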
@@ -97,6 +92,7 @@ Your role is to act as an intelligent browser and data interpreter — able to r
 
         # Multiple timeout attempts with increasing duration
         timeout_attempts = [15, 30, 45]
+        response = None
 
         for timeout in timeout_attempts:
             try:
@@ -133,9 +129,16 @@ Your role is to act as an intelligent browser and data interpreter — able to r
                         break
                 except:
                     continue
+            except requests.exceptions.RequestException as e:
+                if timeout == timeout_attempts[-1]:  # Last attempt
+                    return {
+                        'success': False,
+                        'error': f"Request failed: {str(e)}"
+                    }
+                continue
 
         # Check if we got a response
-        if
+        if response is None:
             return {
                 'success': False,
                 'error': "Failed to establish connection after multiple attempts"
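Together with the `response = None` sentinel from the previous hunk, this gives the timeout loop a defined failure path: the final attempt's error is surfaced instead of being swallowed by the inner bare `except:`. A minimal standalone sketch of the pattern these two hunks build up (the fetch call itself is assumed, since the diff omits it):

import requests

def fetch_with_escalating_timeouts(session: requests.Session, url: str) -> dict:
    """Sketch: retry a GET with progressively longer timeouts."""
    timeout_attempts = [15, 30, 45]  # seconds, as in the diff
    response = None                  # sentinel: set only by a successful attempt

    for timeout in timeout_attempts:
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            break                                    # usable response; stop escalating
        except requests.exceptions.RequestException as e:
            if timeout == timeout_attempts[-1]:      # last attempt: surface the error
                return {'success': False, 'error': f"Request failed: {str(e)}"}
            response = None                          # discard the failed attempt
            continue

    if response is None:  # defensive: loop ended without a good response
        return {'success': False,
                'error': "Failed to establish connection after multiple attempts"}
    return {'success': True, 'response': response}

Callers then branch on `result['success']`, matching the error-dict convention the diff already uses.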
@@ -271,15 +274,12 @@ Your role is to act as an intelligent browser and data interpreter — able to r
         content_text = f"""
 WEBPAGE ANALYSIS REQUEST
 ========================
-
 URL: {scraped_data['url']}
 Title: {scraped_data['title']}
 Content Length: {scraped_data['content_length']} characters
 Tables Found: {len(scraped_data['tables'])}
-
 META DESCRIPTION:
 {scraped_data['meta_description']}
-
 MAIN CONTENT:
 {scraped_data['text']}
 """
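This hunk only compacts the analysis prompt by dropping blank separator lines. As an aside, a hypothetical helper (not part of this commit) could build the same block with `dict.get` defaults, so a partially scraped page cannot raise `KeyError`:

def build_analysis_prompt(scraped_data: dict) -> str:
    """Sketch: assemble the analysis prompt with defensive key access."""
    tables = scraped_data.get('tables', [])
    return f"""
WEBPAGE ANALYSIS REQUEST
========================
URL: {scraped_data.get('url', 'unknown')}
Title: {scraped_data.get('title', 'untitled')}
Content Length: {scraped_data.get('content_length', 0)} characters
Tables Found: {len(tables)}
META DESCRIPTION:
{scraped_data.get('meta_description', '')}
MAIN CONTENT:
{scraped_data.get('text', '')}
"""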
@@ -426,7 +426,7 @@ def create_interface():
     - E-commerce product pages
     - Financial data sites (Yahoo Finance, MarketWatch)
     - Research papers and academic sites
-
+
     ## 🧪 **Test Scenarios**
 
     ### **1. News & Media Sites**
@@ -577,7 +577,6 @@ def create_interface():
     - HttpBin (perfect for testing basic functionality)
 
     Start with the simpler tests and gradually move to more complex scenarios to fully evaluate your tool's capabilities!
-
     """)
 
     # Event handlers