kaikaidai committed on
Commit 16c7d6b · verified · 1 Parent(s): fb2bb34

Synced repo using 'sync_with_huggingface' Github Action

Files changed (1)
  1. app.py +54 -49
app.py CHANGED
@@ -19,67 +19,67 @@ MAX_ITERATIONS = 3 # Maximum number of refinement iterations
 
 # Split the evaluation prompt into separate dimensions
 ACCURACY_PROMPT = """
-Evaluate the response on Accuracy: Is the response factually correct and free from hallucination or misinformation?
-
+Evaluate the response on Accuracy: Is the response factually correct, free from even minor inaccuracies, and demonstrating a deep, nuanced understanding of the subject matter?
+
 Scoring Rubric:
-Score 1: The response contains numerous factual errors or completely fabricated information.
-Score 2: The response contains major factual errors or significant hallucinations.
-Score 3: The response contains some factual inaccuracies, but they are not significant.
-Score 4: The response is factually sound with only minor inaccuracies.
-Score 5: The response is factually flawless and completely accurate.
-
+Score 1: The response contains any factual errors, no matter how minor, or shows any signs of hallucination.
+Score 2: The response is mostly accurate but lacks precision in technical details or contains slight oversimplifications.
+Score 3: The response is accurate and precise, but fails to address potential edge cases or exceptions.
+Score 4: The response is highly accurate, addresses edge cases, but doesn't demonstrate extraordinary depth of knowledge.
+Score 5: The response is impeccably accurate, demonstrates expert-level understanding, and provides insights beyond common knowledge.
+
 Provide:
-- A numeric score (1-5, where 5 is excellent)
-- A brief explanation justifying the score
-- Specific suggestions for improvement
+- A numeric score (1-5, where 5 is near impossible to achieve)
+- A detailed critique justifying the score, highlighting even minor inaccuracies
+- Specific suggestions for improvement, including additional facts or nuances that could have been included
 """
 
 RELEVANCE_PROMPT = """
-Evaluate the response on Relevance: Does the response directly answer the user's question effectively?
-
+Evaluate the response on Relevance: Does the response answer the user's question with laser-focused precision, anticipating and addressing all possible interpretations and implications?
+
 Scoring Rubric:
-Score 1: The response completely misses the point of the question.
-Score 2: The response addresses the general topic but fails to answer the specific question.
-Score 3: The response partially answers the question but misses key aspects.
-Score 4: The response answers the question well but could be more focused or complete.
-Score 5: The response perfectly addresses all aspects of the question.
-
+Score 1: The response fails to directly address the core question or misses any subtext or implicit aspects.
+Score 2: The response addresses the main question but overlooks subtle nuances or related concerns.
+Score 3: The response is relevant and comprehensive but fails to prioritize the most critical aspects of the question.
+Score 4: The response is highly relevant, prioritizes well, but doesn't explore all possible interpretations of the question.
+Score 5: The response demonstrates perfect relevance, addresses all explicit and implicit aspects, and provides valuable additional context.
+
 Provide:
-- A numeric score (1-5, where 5 is excellent)
-- A brief explanation justifying the score
-- Specific suggestions for improvement
+- A numeric score (1-5, where 5 is near impossible to achieve)
+- A detailed critique justifying the score, analyzing how well each part of the question was addressed
+- Specific suggestions for improvement, including unexplored angles or interpretations of the question
 """
 
 CLARITY_PROMPT = """
-Evaluate the response on Clarity: Is the response clearly structured and easily understandable?
-
+Evaluate the response on Clarity: Is the response structured with perfect logical flow, using precise language that leaves no room for misinterpretation?
+
 Scoring Rubric:
-Score 1: The response is extremely confusing and poorly structured.
-Score 2: The response is difficult to follow with major organizational issues.
-Score 3: The response is somewhat clear but has organizational or expression issues.
-Score 4: The response is well-structured with only minor clarity issues.
-Score 5: The response is exceptionally clear, well-organized, and easy to understand.
-
+Score 1: The response has any structural issues, unclear transitions, or imprecise language use.
+Score 2: The response is generally clear but contains minor ambiguities or could be more concise.
+Score 3: The response is well-structured and clear, but lacks optimal organization for the subject matter.
+Score 4: The response demonstrates excellent clarity and structure, but falls short of absolute perfection in precision.
+Score 5: The response exhibits flawless organization, crystal-clear explanations, and language so precise it could serve as a technical reference.
+
 Provide:
-- A numeric score (1-5, where 5 is excellent)
-- A brief explanation justifying the score
-- Specific suggestions for improvement
+- A numeric score (1-5, where 5 is near impossible to achieve)
+- A detailed critique justifying the score, analyzing sentence structure, word choice, and overall organization
+- Specific suggestions for improvement, including restructuring ideas or refining language for ultimate clarity
 """
 
 DEPTH_PROMPT = """
-Evaluate the response on Depth: Does the response provide sufficient detail, insight, or useful context?
-
+Evaluate the response on Depth: Does the response provide extraordinarily comprehensive coverage, offering cutting-edge insights and exploring the topic to its fullest extent?
+
 Scoring Rubric:
-Score 1: The response is extremely shallow with no meaningful detail or insight.
-Score 2: The response lacks significant depth and provides minimal useful information.
-Score 3: The response provides some depth but misses opportunities for insight or context.
-Score 4: The response offers good depth with useful details and context.
-Score 5: The response provides exceptional depth with comprehensive details, valuable insights, and rich context.
-
+Score 1: The response lacks depth, misses key concepts, or fails to go beyond surface-level information.
+Score 2: The response provides good coverage but doesn't delve into advanced concepts or implications.
+Score 3: The response offers solid depth with some advanced concepts, but doesn't push the boundaries of the topic.
+Score 4: The response provides excellent depth, touching on cutting-edge ideas, but falls short of exhaustive coverage.
+Score 5: The response demonstrates unparalleled depth, offering groundbreaking insights, and exhaustively covering all aspects including future implications.
+
 Provide:
-- A numeric score (1-5, where 5 is excellent)
-- A brief explanation justifying the score
-- Specific suggestions for improvement
+- A numeric score (1-5, where 5 is near impossible to achieve)
+- A detailed critique justifying the score, analyzing the breadth and depth of concepts covered
+- Specific suggestions for improvement, including additional advanced topics, interdisciplinary connections, or futuristic implications that could have been explored
 """
 
 # Initialize API keys from environment variables or Streamlit secrets
@@ -160,11 +160,16 @@ async def evaluate_with_atla_async(inputs: dict[str, str]) -> Tuple[float, Dict[
     clarity_task = evaluate_dimension(inputs["question"], inputs["response"], CLARITY_PROMPT)
     depth_task = evaluate_dimension(inputs["question"], inputs["response"], DEPTH_PROMPT)
 
-    # Run all evaluations concurrently
-    accuracy_score, accuracy_critique = await accuracy_task
-    relevance_score, relevance_critique = await relevance_task
-    clarity_score, clarity_critique = await clarity_task
-    depth_score, depth_critique = await depth_task
+    # Run all evaluations concurrently using asyncio.gather
+    accuracy_result, relevance_result, clarity_result, depth_result = await asyncio.gather(
+        accuracy_task, relevance_task, clarity_task, depth_task
+    )
+
+    # Unpack results
+    accuracy_score, accuracy_critique = accuracy_result
+    relevance_score, relevance_critique = relevance_result
+    clarity_score, clarity_critique = clarity_result
+    depth_score, depth_critique = depth_result
 
     # Calculate average score
     avg_score = (accuracy_score + relevance_score + clarity_score + depth_score) / 4
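For context on the second hunk: awaiting the four coroutines one after another runs the dimension evaluations sequentially, whereas asyncio.gather schedules them concurrently and returns their results in the order the awaitables were passed, so total latency drops to roughly that of the slowest call. The sketch below is a minimal, self-contained illustration of that pattern, not the app's actual code: evaluate_dimension here is a hypothetical stub (fixed score, asyncio.sleep standing in for the evaluator call), and the rubric strings are placeholders for the ACCURACY_PROMPT, RELEVANCE_PROMPT, CLARITY_PROMPT, and DEPTH_PROMPT constants defined above.

import asyncio
from typing import Tuple

async def evaluate_dimension(question: str, response: str, prompt: str) -> Tuple[float, str]:
    # Hypothetical stub: stands in for one evaluator/judge call per dimension.
    await asyncio.sleep(1)  # simulate the network latency of a single evaluation
    return 4.0, f"Critique for rubric: {prompt!r}"

async def evaluate_all(question: str, response: str) -> float:
    # Placeholders for the four rubric prompts defined at module level in app.py.
    prompts = ["Accuracy rubric", "Relevance rubric", "Clarity rubric", "Depth rubric"]
    # gather runs all four coroutines concurrently and preserves argument order.
    results = await asyncio.gather(
        *(evaluate_dimension(question, response, p) for p in prompts)
    )
    scores = [score for score, _critique in results]
    return sum(scores) / len(scores)  # average across the four dimensions

if __name__ == "__main__":
    avg = asyncio.run(evaluate_all("What is asyncio.gather?", "It runs awaitables concurrently."))
    print(f"Average score: {avg:.2f}")  # finishes in about 1 second rather than 4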