Spaces: dolphinium/pc-ai-data-analyst-dup


dolphinium committed 27 days ago

Commit 2af2760 (1 parent: 6e9ade9)

suggestion to obligation on filter query part.
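
The rule this commit adds to the prompt (use every API-supplied field-value pair, joined with `AND`, as the `query_filter`) is deterministic, so it can also be computed outside the LLM, for example to cross-check the model's output. The following is a minimal sketch, not code from this repository: `build_query_filter` and `year_to_range` are hypothetical helpers, the `field_name`/`field_value` keys are taken from the `formatted_fields` line in the diff, and expanding a bare year into a full-year range is an assumption modelled on Example 1 (Example 2 in the prompt uses an open-ended `TO *` range instead).

```python
import re


def year_to_range(value: str) -> str:
    """Expand a bare year such as '2023' into a Solr date range.

    Assumption: the API sends years as quoted or unquoted 4-digit strings.
    Any other value is returned unchanged.
    """
    year = value.strip("'\" ")
    if re.fullmatch(r"\d{4}", year):
        return f"[{year}-01-01T00:00:00Z TO {year}-12-31T23:59:59Z]"
    return value


def build_query_filter(search_fields) -> str:
    """AND-join every field-value pair, as the MANDATORY DYNAMIC FILTERS rule asks."""
    clauses = []
    for field in search_fields:
        name = field["field_name"]
        value = field["field_value"]
        if name == "date":  # only the date field gets the ISO 8601 expansion
            value = year_to_range(value)
        clauses.append(f"{name}:{value}")
    return " AND ".join(clauses)


example_fields = [
    {"field_name": "drug_delivery_branch_s", "field_value": "(injection OR oral)"},
    {"field_name": "therapeutic_category_s", "field_value": "infections"},
]
print(build_query_filter(example_fields))
```

For the Example 3 input above, this yields the same string as the `query_filter` in "Correct JSON Output 3", which makes it usable as a sanity check on the plan the model returns.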


llm_prompts.py +51 -28

@@ -32,36 +32,37 @@ def get_analysis_plan_prompt(natural_language_query, chat_history, search_fields
         # The search_fields are now pre-mapped, so we can use them directly
         formatted_fields = "\n".join([f"  - {field['field_name']}: {field['field_value']}" for field in search_fields])
         dynamic_fields_prompt_section = f"""
----
-### DYNAMIC FIELD SUGGESTIONS (Use Critically)
-An external API has suggested the following field-value pairs based on your query.
-**These are only HINTS.** Do NOT use them blindly.
-Critically evaluate if they make sense. For example, a `molecule_name` associated with a `company_name` might be irrelevant or illogical.
-Use only what is logical for the query. Do not construct filters from fields/values that do not make sense.
-**Suggested Fields:**
 {formatted_fields}
 """
     return f"""
 You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
-Your most important job is to think like an analyst and choose a `analysis_dimension` that provides a meaningful, non-obvious breakdown of the data.
 ---
 ### CONTEXT & RULES
 1.  **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
-2.  **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
-3.  **Crucial Sorting Rules**:
     *   For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
     *   If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
     *   For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
-3.  On **Qualitative Data** Group Operation:
-    * We need to show user **standout examples** for each category chosen.
-    For example: if user asks for "USA approved drugs last 5 years" We need to show user standout examples for each year. In this context: standout means the news with the biggest deals in million for each year for example.
-4.  **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting. The JSON MUST include a `reasoning` object explaining your choices.
 ---
 ### HOW TO CHOOSE THE ANALYSIS DIMENSION AND MEASURE (ANALYTICAL STRATEGY)
@@ -99,17 +100,24 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
 ### EXAMPLES
 **User Query 1:** "What are the top 5 companies by total deal value in 2023?"
 **Correct JSON Output 1:**
 ```json
 {{
   "reasoning": {{
     "dimension_choice": "User explicitly asked for 'top 5 companies', so 'company_name' is the correct dimension.",
-    "measure_choice": "User explicitly asked for 'total deal value', so 'sum(total_deal_value_in_million)' is the correct measure."
   }},
   "analysis_dimension": "company_name",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
-  "query_filter": "date_year:2023 AND total_deal_value_in_million:[0 TO *]",
   "quantitative_request": {{
     "json.facet": {{
       "companies_by_deal_value": {{
@@ -126,25 +134,32 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
   "qualitative_request": {{
     "group": true,
     "group.field": "company_name",
-    "group.limit": 1,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
 }}
 ```
-**User Query 2:** "What are the most common news types for infections this year?"
 **Correct JSON Output 2:**
 ```json
 {{
   "reasoning": {{
-    "dimension_choice": "User asked for 'most common news types', so 'news_type' is the correct dimension.",
-    "measure_choice": "User asked for 'most common', which implies counting occurrences. Therefore, the measure is 'count'."
   }},
   "analysis_dimension": "news_type",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
-  "query_filter": "therapeutic_category_s:infections AND date_year:{datetime.datetime.now().year}",
   "quantitative_request": {{
     "json.facet": {{
       "news_by_type": {{
@@ -158,7 +173,7 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
   "qualitative_request": {{
     "group": true,
     "group.field": "news_type",
-    "group.limit": 1,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
@@ -167,18 +182,26 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
-**User Query 4 :** "Compare deal values for injection vs oral related to infection news."
 **Correct JSON Output 3:**
 ```json
 {{
   "reasoning": {{
-    "dimension_choice": "The user wants to compare deal values for 'injection' vs 'oral' drug delivery methods within 'infection' news.  The user has specified the comparison ('injection vs oral'), making 'route_branch' the appropriate analysis dimension to directly reflect this comparison. Using 'drug_delivery_branch' would be too granular for this high-level comparison.",
-    "measure_choice": "The user explicitly asks to compare 'deal values', so 'sum(total_deal_value_in_million)' is the correct measure."
   }},
   "analysis_dimension": "route_branch",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
-  "query_filter": "drug_delivery_branch_s:(injection OR oral) AND therapeutic_category_s:infections AND date_year:2025",
   "quantitative_request": {{
     "json.facet": {{
       "deal_values_by_route": {{
@@ -195,7 +218,7 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
   "qualitative_request": {{
     "group": true,
     "group.field": "route_branch",
-    "group.limit": 1,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
@@ -204,7 +227,7 @@ This is the most critical part of your task. A bad choice leads to a useless, bo
 ---
 ### YOUR TASK
-Convert the following user query into a single, raw JSON "Analysis Plan" object. Strictly follow all rules, especially the analytical strategy for choosing the dimension and measure. Your JSON output MUST include the `reasoning` field.
 **Current User Query:** `{natural_language_query}`
 """

         # The search_fields are now pre-mapped, so we can use them directly
         formatted_fields = "\n".join([f"  - {field['field_name']}: {field['field_value']}" for field in search_fields])
         dynamic_fields_prompt_section = f"""
+---
+### MANDATORY DYNAMIC FILTERS
+An external API has identified the following field-value pairs from the user query.
+**You MUST use ALL of these fields and values to construct the `query_filter`.**
+- Construct the `query_filter` by combining these key-value pairs using the 'AND' operator.
+- Do NOT add any other fields or conditions to the `query_filter`. This section is the definitive source for it.
+**Mandatory Fields for Query Filter:**
 {formatted_fields}
 """
     return f"""
 You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
+Your most important job is to think like an analyst and choose an `analysis_dimension` and `analysis_measure` that provide a meaningful, non-obvious breakdown of the data.
 ---
 ### CONTEXT & RULES
 1.  **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
+2.  **Query Filter Construction**: The `query_filter` MUST be built exclusively from the fields provided in the "MANDATORY DYNAMIC FILTERS" section, if present.
+3.  **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
+4.  **Crucial Sorting Rules**:
     *   For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
     *   If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
     *   For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
+5.  On **Qualitative Data** Group Operation:
+    * We need to show user **standout examples** for each category chosen.
+    For example: if user asks for "USA approved drugs last 5 years" We need to show user standout examples for each year. In this context: standout means the news with the biggest deals in million for each year for example.
+6.  **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting. The JSON MUST include a `reasoning` object explaining your choices.
 ---
 ### HOW TO CHOOSE THE ANALYSIS DIMENSION AND MEASURE (ANALYTICAL STRATEGY)
 ### EXAMPLES
 **User Query 1:** "What are the top 5 companies by total deal value in 2023?"
+**API Filter Input 1:**
+```
+### MANDATORY DYNAMIC FILTERS
+**Mandatory Fields for Query Filter:**
+  - date: '2023'
+```
 **Correct JSON Output 1:**
 ```json
 {{
   "reasoning": {{
     "dimension_choice": "User explicitly asked for 'top 5 companies', so 'company_name' is the correct dimension.",
+    "measure_choice": "User explicitly asked for 'total deal value', so 'sum(total_deal_value_in_million)' is the correct measure.",
+    "filter_choice": "The query filter was constructed from the mandatory field provided by the API: date (converted to an ISO 8601 range)."
   }},
   "analysis_dimension": "company_name",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
+  "query_filter": "date:[2023-01-01T00:00:00Z TO 2023-12-31T23:59:59Z]",
   "quantitative_request": {{
     "json.facet": {{
       "companies_by_deal_value": {{
   "qualitative_request": {{
     "group": true,
     "group.field": "company_name",
+    "group.limit": 2,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
 }}
 ```
+**User Query 2:** "What are the most common news types for infections in 2025?"
+**API Filter Input 2:**
+```
+### MANDATORY DYNAMIC FILTERS
+**Mandatory Fields for Query Filter:**
+  - therapeutic_category_s: infections
+  - date: '2025'
+```
 **Correct JSON Output 2:**
 ```json
 {{
   "reasoning": {{
+    "dimension_choice": "User asked for 'most common news types', so 'news_type' is the correct dimension. This is not redundant as the filter is on 'therapeutic_category'.",
+    "measure_choice": "User asked for 'most common', which implies counting occurrences. Therefore, the measure is 'count'.",
+    "filter_choice": "The query filter was constructed from the mandatory fields provided by the API: therapeutic_category_s and date(date is converted to ISO 8601 format)."
   }},
   "analysis_dimension": "news_type",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
+  "query_filter": "therapeutic_category_s:infections AND date:[2025-01-01T00:00:00Z TO *]",
   "quantitative_request": {{
     "json.facet": {{
       "news_by_type": {{
   "qualitative_request": {{
     "group": true,
     "group.field": "news_type",
+    "group.limit": 2,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
+**User Query 3:** "Compare deal values for injection vs oral related to infection news."
+**API Filter Input 3:**
+```
+### MANDATORY DYNAMIC FILTERS
+**Mandatory Fields for Query Filter:**
+  - drug_delivery_branch_s: (injection OR oral)
+  - therapeutic_category_s: infections
+```
 **Correct JSON Output 3:**
 ```json
 {{
   "reasoning": {{
+    "dimension_choice": "The user wants to compare 'injection' vs 'oral', making 'route_branch' the appropriate analysis dimension.",
+    "measure_choice": "The user explicitly asks to compare 'deal values', so 'sum(total_deal_value_in_million)' is the correct measure.",
+    "filter_choice": "The query filter was constructed directly from the mandatory fields provided by the API: drug_delivery_branch_s and therapeutic_category_s."
   }},
   "analysis_dimension": "route_branch",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
+  "query_filter": "drug_delivery_branch_s:(injection OR oral) AND therapeutic_category_s:infections",
   "quantitative_request": {{
     "json.facet": {{
       "deal_values_by_route": {{
   "qualitative_request": {{
     "group": true,
     "group.field": "route_branch",
+    "group.limit": 2,
     "group.sort": "sum(total_deal_value_in_million) desc",
     "sort": "total_deal_value_in_million desc"
   }}
 ---
 ### YOUR TASK
+Convert the following user query into a single, raw JSON "Analysis Plan" object. Strictly follow all rules, especially the rule for building the `query_filter` from the mandatory dynamic filters. Your JSON output MUST include the `reasoning` field.
 **Current User Query:** `{natural_language_query}`
 """