EXAMPLES = [ { "description": "🩺 Biomedical Answer Faithfulness", "inputs_task": [ {"name": "Question", "value": "Does timing of initial surfactant treatment make a difference in rates of chronic lung disease or mortality in premature infants?"}, {"name": "Context", "value": "To compare two treatment strategies in preterm infants with or at risk of respiratory distress syndrome: early surfactant administration (within one hour of birth) versus late surfactant administration, in a geographically defined population. The primary outcome was chronic lung disease (CLD) and mortality before/at 36 weeks. Secondary outcomes included: duration of mechanical ventilation and continuous positive airway pressure (CPAP), post-natal steroids for CLD and major neonatal morbidities. Premature infants born at 22-32 weeks' gestation between January 2006 and December 2009. Ten neonatal intensive care units (NICUs) in New South Wales (NSW) and Australian Capital Territory (ACT), Australia. Retrospective analysis of prospectively collected data from the regional NICU database in NSW and ACT. Of the 2170 infants who received surfactant, 1182 (54.5%) and 988 (45.5%) received early and late surfactant, respectively. The early surfactant group was less mature (27.1 ± 2.1 versus 29.4 ± 2.1 weeks) and had more CLD and mortality (40.2% versus 20.0%). The multivariable analysis showed early surfactant to be associated with less duration of ventilation, longer duration of CPAP and longer hospital stay but had little or no impact on CLD/mortality."} ], "output": {"name": "Answer", "value": "Yes. Early surfactant administration significantly reduces the rates of chronic lung disease and mortality among premature infants. This supports the routine use of early surfactant treatment over alternatives like CPAP, especially in very premature infants. Further large-scale studies are still needed to solidify these findings."}, "evaluation_criteria": "Evaluate whether the information provided in the answer is directly supported by the context given in the document, without any fabricated or hallucinated details.", "rubric": [ {"name": "0", "value": "The answer is not supported by the document. It contains inaccuracies, fabrications, or details that are not present in the document."}, {"name": "1", "value": "The answer is fully supported by the document. It is factually accurate and all details are directly derived from the document."}, ] }, { "description": "🧠 Agent Contextual Understanding", "inputs_task": [ {"name": "Customer query", "value": "I recently purchased your XYZ-3000 smart home security system, and I'm having multiple issues. First, the mobile app keeps crashing whenever I try to view the live camera feed. Second, the motion sensors seem overly sensitive and are triggering false alarms, even when it's just my cat walking around. Lastly, I can't figure out how to integrate the system with my existing smart lights. Can you please help me troubleshoot these problems and explain how to properly set up the integration? Also, I'm wondering if there are any upcoming firmware updates that might address these issues?"}, {"name": "Service context", "value": "The XYZ-3000 smart home security system is our latest product, featuring advanced AI-powered cameras, adjustable motion sensors, and seamless integration with popular smart home devices. For app crashes, users should ensure they have the latest version (v4.2.1) installed and try clearing the app cache. Motion sensor sensitivity can be adjusted in the app settings, with recommended levels based on pet ownership. The system supports integration with major smart light brands through the \"Connected Devices\" menu. Proper setup involves enabling the \"Smart Light Control\" feature and following the step-by-step pairing process.\n\nOur technical team is aware of some users experiencing app stability issues and is actively working on a fix. The next firmware update (v2.1.0) is scheduled for release in two weeks and will include improvements to app performance and motion detection algorithms. This update will be automatically pushed to all connected XYZ-3000 systems.\n\nFor immediate assistance, users can try the following troubleshooting steps:\n1. Restart the system by unplugging the main hub for 30 seconds\n2. Ensure all devices are within the recommended range of the central hub\n3. Check that the home's Wi-Fi network is stable and has sufficient bandwidth\n4. Verify that all components are running the latest firmware version\n\nIf issues persist, our 24/7 customer support team is available for remote diagnostics and personalized assistance."} ], "output": {"name": "Agent response", "value": "Thank you for reaching out about your XYZ-3000 smart home security system. I understand you're having some trouble with the mobile app and motion sensors. For the app crashing issue, have you tried updating to the latest version? That might help solve the problem. If updating doesn't work, try uninstalling and reinstalling the app, as this can often resolve persistent issues.\n\nRegarding the motion sensors, they can sometimes be a bit sensitive. You might want to check if there's a way to adjust the sensitivity in the app settings, but I'm not entirely sure how to do that. Look for options like \"sensor sensitivity\" or \"motion detection threshold\" in the device settings menu.\n\nAs for integrating the system with your smart lights, I'm afraid I don't have specific information on how to set that up. It's possible that not all smart light brands are compatible with our system due to differences in communication protocols or software integration capabilities.\n\nI'm not aware of any upcoming firmware updates at the moment, but you could keep an eye on our website for any announcements.\n\nIf you're still having issues, you might want to try restarting the system by unplugging it for a short while. Sometimes that can help resolve various problems. Specifically, unplug the main control unit for about 30 seconds, then plug it back in and allow the system to fully reboot.\n\nI hope this information is somewhat helpful. If you need further assistance, you may want to contact our technical support team for more detailed guidance on your specific issues."}, "evaluation_criteria": "Considering the full scope and all nuances of the customer's submitted query, along with the entirety of the provided service context, does the generated agent response thoroughly utilize and meaningfully synthesize all the most relevant contextual information to comprehensively resolve every aspect of the customer's multifaceted question, issue, or request with no key details missing or stated inaccurately?", "rubric": [ {"name": "1", "value": "The response neglects to utilize any of the provided service context and thoroughly fails to address even the most basic aspects of the customer's query in any meaningful way. It either contains no information relevant to the customer's submission or discusses topics and details that are completely unrelated to the core question, issue, or request requiring resolution."}, {"name": "2", "value": "The response attempts to utilize a small amount of the provided service context but still substantially fails to resolve most facets of the customer's query. Significant information from the context that is absolutely vital for addressing the core elements of the question, issue, or request is either missing altogether or presented inaccurately in the response."}, {"name": "3", "value": "The response utilizes some of the relevant information from the provided service context and partially addresses certain aspects of the customer's query, but critical details are missing or inaccurately conveyed. Significantly greater context utilization and integration is still necessary to provide a thorough and complete resolution to the entirety of the customer's submission."}, {"name": "4", "value": "The response utilizes most of the relevant service context and resolves the major elements of the customer's query quite well. However, a few minor details from the context are still missing that are needed to fully and thoroughly address some of the more nuanced aspects and subtopics of the question, issue or request."}, {"name": "5", "value": "The response skillfully utilizes all of the most relevant pieces of the provided service context to thoroughly and accurately resolve every aspect and nuance of the customer's originally submitted query. No key details from the relevant context are missing or misstated, and the response comprehensively addresses the entirety of the customer's question, issue, or request."} ] }, { "description": "🔍 Query Decomposition", "inputs_task": [ {"name": "Query", "value": "I placed an order for a custom-built gaming PC (order #AC-789012) two weeks ago, but the estimated delivery date has changed twice since then. Originally it was supposed to arrive yesterday, then it got pushed to next week, and now the tracking page shows 'Status: Processing' with no delivery estimate at all. I've tried calling customer service, but after waiting on hold for an hour, I was told to check the website. Can you please look into this and explain what's causing the delays, when I can realistically expect my order to arrive, and whether I'm eligible for any kind of compensation or expedited shipping given these repeated delays? I'm especially concerned because I need this computer for an upcoming gaming tournament I'm participating in next month."} ], "output": {"name": "Sub queries", "value": "1. What is the current status of order #AC-789012, and why has it changed from the original estimated delivery date?\n2. What specific factors are causing the delays in processing and shipping this custom-built gaming PC?\n3. Based on the current situation, when can the customer realistically expect the order to arrive?\n4. Given the repeated delays and changes in estimated delivery, what compensation options (if any) are available to the customer?\n5. Is expedited shipping an option at this point, and if so, how would it affect the delivery timeline?\n6. How can the urgency of this order be communicated and prioritized, considering the customer's upcoming gaming tournament next month?\n7. What steps has customer service already taken to address this issue, and what additional actions can be taken to resolve it?\n8. How can the customer receive more frequent and accurate updates about their order status going forward?"}, "evaluation_criteria": "Do the generated sub-queries provide sufficient breadth to cover all aspects of the main query?", "rubric": [ {"name": "1", "value": "The sub-queries lack breadth and fail to address multiple important aspects of the main query. They are either too narrow, focusing on only one or two dimensions of the question, or they diverge significantly from the main query's intent. Using these sub-queries alone would result in a severely limited exploration of the topic."}, {"name": "2", "value": "The sub-queries cover some aspects of the main query but lack comprehensive breadth. While they touch on several dimensions of the question, there are still noticeable gaps in coverage. Some important facets of the main query are either underrepresented or missing entirely. Answering these sub-queries would provide a partial, but not complete, exploration of the main topic."}, {"name": "3", "value": "The sub-queries demonstrate excellent breadth, effectively covering all major aspects of the main query. They break down the main question into a diverse set of dimensions, ensuring a comprehensive exploration of the topic. Each significant facet of the main query is represented in the sub-queries, allowing for a thorough and well-rounded investigation of the subject matter."} ] }, { "description": "📰 Article Clarity", "inputs_task": [], "output": {"name": "Article", "value": "# Artificial Intelligence Might Change Everything About Jobs, New Study Claims\n\nIn a world where technology seems to be taking over everything, a new study has come out that's got everyone talking. It's all about how artificial intelligence, or AI for short, could shake up the job market in a big way. The study, done by some researchers with really long names, looks at something called \"large language models\" or LLMs. These are like super smart computer programs that can understand and write text, kind of like a robot writer.\n\nSo, these researchers say that these LLMs could affect a whopping 80% of workers in the United States. That's a lot of people! They think that for most of these workers, at least 10% of what they do at work could be changed by AI. And for some people, it could be even more - like half of their job tasks! Can you imagine that? It's like having a robot colleague that can do half your work.\n\nBut here's the really weird part - it's not just the simple jobs that might be affected. The study says that people with high-paying jobs might actually be more at risk. That's pretty surprising, right? You'd think it would be the other way around. I guess even the bosses aren't safe from the robot invasion!\n\nThe researchers also talked about something called \"LLM-powered software.\" Basically, that's when you take these smart AI programs and use them to make other computer programs. They say this could make an even bigger difference. With this kind of software, almost half of all the tasks people do at work could be done faster or better. That's a lot of change!\n\nNow, the study doesn't say exactly when all this is going to happen. It's not like we're going to wake up tomorrow and find robots sitting at our desks. But it does seem like it could be a big deal in the future. The researchers think AI could be as important as electricity was when it was invented. Remember learning about electricity in school? It was a game-changer. Well, AI could be the same thing for jobs. It's going to change everything."}, "evaluation_criteria": "Does the article's writing quality in terms of clarity, conciseness, and ease of understanding communicate effectively the information to the reader?", "rubric": [ {"name": "0", "value": "The article's writing quality is poor to moderate in terms of clarity, conciseness, and ease of understanding. It may have confusing sentence structures, inappropriate vocabulary, lack of organization, or instances of unnecessary verbosity. The writing does not effectively communicate the information to the reader, making it difficult to comprehend the content without significant effort."}, {"name": "1", "value": "The article's writing quality is high in terms of clarity, conciseness, and ease of understanding. It features well-constructed sentences, appropriate vocabulary, logical organization, and efficient conveyance of information. The writing effectively communicates the information to the reader, allowing for easy comprehension and a smooth reading experience."}, ] }, { "description": "⚙️ Technical Terminology", "inputs_task": [ {"name": "Instruction", "value": "Could you explain the difference between a single-stage-to-orbit (SSTO) vehicle and a multi-stage rocket? Also, please elaborate on the advantages and disadvantages of each in terms of mass ratio, specific impulse, and payload capacity."}, {"name": "Reference response", "value": "Yes, of course. A single-stage-to-orbit (SSTO) vehicle and a multi-stage rocket are both types of space launch vehicles, but they have some key differences. A Single-Stage-To-Orbit (SSTO) vehicle is a spacecraft that reaches orbit from the surface of a body without jettisoning hardware, expending only propellants and fluids. The whole vehicle travels to the destination and returns. A multi-stage rocket, on the other hand, is a rocket that uses two or more stages, each of which contains its own engines and propellant. Each stage is jettisoned as its propellant is exhausted, with the next stage then taking over. In terms of mass ratio, SSTOs have the advantage of simplicity and potential reusability, as they don't jettison any stages. However, achieving a sufficient mass ratio (the total mass at lift-off divided by the mass of the spacecraft when empty) with current technology is a significant challenge, which leads to a disadvantage in terms of payload capacity. Multi-stage rockets, by contrast, can achieve higher mass ratios as each stage is jettisoned once its fuel is exhausted, reducing the mass that the remaining stages have to lift. This allows them to carry larger payloads. Specific impulse, which is a measure of the efficiency of a rocket engine, is another important factor. SSTOs require engines with high specific impulse to reach orbit without jettisoning hardware. Multi-stage rockets can use different types of engines in different stages, optimizing for the operating conditions at each stage of the flight. To summarize, while SSTOs have the advantage of simplicity and the potential for reusability, multi-stage rockets currently have the edge in terms of mass ratio, specific impulse, and payload capacity. However, the optimal choice would depend on the specific requirements of the mission."} ], "output": {"name": "Response", "value": "Single-stage-to-orbit (SSTO) vehicle is like a single train ride to your destination, no changeovers, you just sit and enjoy. Multi-stage rocket is like needing to change trains at several stations to reach your destination. If you ask me about advantages and disadvantages, well, the SSTO is comfortable, easy but might not take you as far as you want. Multi-stage is a bit complicated, but it will take you far. As for things like mass ratio, specific impulse and payload capacity, I think they are some technical stuff that measure how good these vehicles are. I guess SSTO scores less in these and multi-stage scores more."}, "evaluation_criteria": "To what extent can the model effectively interact with technical terminology or specialized industry language? This evaluation aspect measures the model's comprehension and application of specialized terms and industry-specific language, and its ability to deliver precise and insightful feedback.", "rubric": [ {"name": "1", "value": "The model fails to recognize or comprehend technical terminology or industry-oriented language, resulting in incorrect or meaningless responses."}, {"name": "2", "value": "The model sporadically identifies technical language or specialized industry terms, but has difficulty using them properly or providing precise feedback."}, {"name": "3", "value": "The model typically comprehends and correctly applies industry-specific language or technical terms, though there might be occasional inaccuracies or misinterpretations."}, {"name": "4", "value": "The model frequently comprehends and accurately applies industry-specific language or technical terms, with only minor mistakes or inconsistencies."}, {"name": "5", "value": "The model perfectly comprehends and applies industry-specific language or technical terms, offering precise, perceptive feedback that demonstrates a comprehensive understanding of the topic."}, ] }, { "description": "🫠 Response-level Hallucination", "inputs_task": [ {"name": "Question", "value": "Which genus of moth in the world's seventh-largest country contains only one species?"}, {"name": "Passage", "value": "Indogrammodes is a genus of moths of the Crambidae family. It contains only one species, Indogrammodes pectinicornalis, which is found in India.India, officially the Republic of India (\"Bhārat Gaṇarājya\"), is a country in South Asia. It is the seventh-largest country by area, the second-most populous country (with over 1.2 billion people), and the most populous democracy in the world."}, ], "output": {"name": "Answer", "value": "The Indogrammodes genus of moths found in India has only one species."}, "evaluation_criteria": "Evaluate whether the information provided in the answer is factually accurate and directly supported by the context given in the document, without any fabricated or hallucinated details.", "rubric": [ {"name": "0", "value": "The answer is not supported by the document. It contains inaccuracies, fabrications, or details that are not present in the document."}, {"name": "1", "value": "The answer is fully supported by the document. It is factually accurate and all details are directly derived from the document."}, ] } ] def get_examples(): return EXAMPLES