Elron committed on
Commit c160aec · verified · 1 parent: c0df44e

Upload folder using huggingface_hub

Files changed (13)
  1. README.md +81 -43
  2. evaluate_cli.py +4 -2
  3. hf_utils.py +22 -4
  4. inference.py +38 -262
  5. llm_as_judge_constants.py +7 -5
  6. loaders.py +246 -27
  7. metrics.py +165 -14
  8. operator.py +1 -2
  9. operators.py +38 -1
  10. settings_utils.py +1 -0
  11. struct_data_operators.py +67 -0
  12. utils.py +47 -0
  13. version.py +1 -1
README.md CHANGED
@@ -8,23 +8,9 @@ app_file: README.md
8
  pinned: false
9
  ---
10
  <div align="center">
11
- <img src="https://raw.githubusercontent.com/IBM/unitxt/main/assets/banner.png" alt="Image Description" width="100%" />
12
  </div>
13
 
14
- [![Button](https://img.shields.io/badge/Video-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/_static/video.mov)
15
- [![Button](https://img.shields.io/badge/Documentation-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/introduction.html)
16
- [![Button](https://img.shields.io/badge/Demo-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/demo.html)
17
- [![Button](https://img.shields.io/badge/Tutorial-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html)
18
- [![Button](https://img.shields.io/badge/Paper-pink?style=for-the-badge)](https://arxiv.org/abs/2401.14019)
19
- [![Button](https://img.shields.io/badge/Catalog-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/catalog/catalog.__dir__.html)
20
- [![Button](https://img.shields.io/badge/Contributors-pink?style=for-the-badge)](https://github.com/IBM/unitxt/blob/main/CONTRIBUTING.md)
21
- [![Button](https://img.shields.io/badge/PyPi-pink?style=for-the-badge)](https://pypi.org/project/unitxt/)
22
-
23
-
24
- In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.
25
-
26
- Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
27
-
28
  #
29
  [![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
30
  ![license](https://img.shields.io/github/license/ibm/unitxt)
@@ -34,34 +20,93 @@ In the dynamic landscape of generative NLP, traditional text processing pipeline
34
  ![Read the Docs](https://img.shields.io/readthedocs/unitxt)
35
  [![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
36
 
 
 
37
  #
38
 
39
- https://github.com/IBM/unitxt/assets/23455264/baef9131-39d4-4164-90b2-05da52919fdf
40
 
41
- ### 🦄 Currently on Unitxt Catalog
 
 
 
42
 
43
- ![Abstract Tasks](https://img.shields.io/badge/Abstract_Tasks-64-blue)
44
- ![Dataset Cards](https://img.shields.io/badge/Dataset_Cards-3174-blue)
45
- ![Templates](https://img.shields.io/badge/Templates-342-blue)
46
- ![Benchmarks](https://img.shields.io/badge/Benchmarks-6-blue)
47
- ![Metrics](https://img.shields.io/badge/Metrics-462-blue)
48
 
49
- ### 🦄 Run Unitxt Exploration Dashboard
50
 
51
- To launch unitxt graphical user interface first install unitxt with ui requirements:
 
52
  ```
53
- pip install unitxt[ui]
 
 
54
  ```
55
- Then launch the ui by running:
 
 
56
  ```
 
 
57
  unitxt-explore
58
  ```
59
 
60
- # 🦄 Example
61
 
62
- This is a simple example of running end-to-end evaluation in self contained python code over user data.
63
-
64
- See more examples in examples subdirectory.
65
 
66
  ```python
67
  # Import required components
@@ -114,17 +159,13 @@ print("Global Results:\n", results.global_scores.summary)
114
  print("Instance Results:\n", results.instance_scores.summary)
115
  ```
116
 
117
- # 🦄 Contributors
118
 
119
- Please install Unitxt from source by:
120
- ```bash
121
- git clone [email protected]:IBM/unitxt.git
122
- cd unitxt
123
- pip install -e ".[dev]"
124
- pre-commit install
125
- ```
126
 
127
- # 🦄 Citation
 
 
128
 
129
  If you use Unitxt in your research, please cite our paper:
130
 
@@ -153,8 +194,5 @@ If you use Unitxt in your research, please cite our paper:
153
  publisher = "Association for Computational Linguistics",
154
  url = "https://aclanthology.org/2024.naacl-demo.21",
155
  pages = "207--215",
156
- abstract = "In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt",
157
  }
158
- ```
159
-
160
- Unitxt emoji designed by [OpenMoji](https://openmoji.org/#) - the open-source emoji and icon project. License: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/#)
 
8
  pinned: false
9
  ---
10
  <div align="center">
11
+ <img src="https://www.unitxt.ai/en/latest/_static/banner.png" alt="Image Description" width="100%" />
12
  </div>
13
 
 
14
  #
15
  [![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
16
  ![license](https://img.shields.io/github/license/ibm/unitxt)
 
20
  ![Read the Docs](https://img.shields.io/readthedocs/unitxt)
21
  [![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
22
 
23
+ ### 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
24
+
25
  #
26
 
27
+ ## Why Unitxt?
28
 
29
+ - 🌐 **Comprehensive**: Evaluate text, tables, vision, speech, and code in one unified framework
30
+ - 💼 **Enterprise-Ready**: Battle-tested components with extensive catalog of benchmarks
31
+ - 🧠 **Model Agnostic**: Works with HuggingFace, OpenAI, WatsonX, and custom models
32
+ - 🔒 **Reproducible**: Shareable, modular components ensure consistent results
33
 
34
+ ## Quick Links
35
+ - 📖 [Documentation](https://www.unitxt.ai)
36
+ - 🚀 [Getting Started](https://www.unitxt.ai)
37
+ - 📁 [Browse Catalog](https://www.unitxt.ai/en/latest/catalog/catalog.__dir__.html)
 
38
 
39
+ # Installation
40
 
41
+ ```bash
42
+ pip install unitxt
43
  ```
44
+
45
+ # Quick Start
46
+
47
+ ## Command Line Evaluation
48
+ ```bash
49
+ # Simple evaluation
50
+ unitxt-evaluate \
51
+ --tasks "card=cards.mmlu_pro.engineering" \
52
+ --model cross_provider \
53
+ --model_args "model_name=llama-3-1-8b-instruct" \
54
+ --limit 10
55
+
56
+ # Multi-task evaluation
57
+ unitxt-evaluate \
58
+ --tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
59
+ --model cross_provider \
60
+ --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
61
+ --split test \
62
+ --limit 10 \
63
+ --output_path ./results/evaluate_cli \
64
+ --log_samples \
65
+ --apply_chat_template
66
+
67
+ # Benchmark evaluation
68
+ unitxt-evaluate \
69
+ --tasks "benchmarks.tool_calling" \
70
+ --model cross_provider \
71
+ --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
72
+ --split test \
73
+ --limit 10 \
74
+ --output_path ./results/evaluate_cli \
75
+ --log_samples \
76
+ --apply_chat_template
77
  ```
78
+
79
+ ## Loading as Dataset
80
+ Load thousands of datasets in chat API format, ready for any model:
81
+ ```python
82
+ from unitxt import load_dataset
83
+
84
+ dataset = load_dataset(
85
+ card="cards.gpqa.diamond",
86
+ split="test",
87
+ format="formats.chat_api",
88
+ )
89
  ```
90
+
91
+ ## 📊 Available on The Catalog
92
+
93
+ ![Tasks](https://img.shields.io/badge/Tasks-68-blue)
94
+ ![Datasets](https://img.shields.io/badge/Datasets-3254-blue)
95
+ ![Prompts](https://img.shields.io/badge/Prompts-357-blue)
96
+ ![Benchmarks](https://img.shields.io/badge/Benchmarks-11-blue)
97
+ ![Metrics](https://img.shields.io/badge/Metrics-584-blue)
98
+
99
+ ## 🚀 Interactive Dashboard
100
+
101
+ Launch the graphical user interface to explore datasets and benchmarks:
102
+ ```
103
+ pip install unitxt[ui]
104
  unitxt-explore
105
  ```
106
 
107
+ # Complete Python Example
108
 
109
+ Evaluate your own data with any model:
 
 
110
 
111
  ```python
112
  # Import required components
 
159
  print("Instance Results:\n", results.instance_scores.summary)
160
  ```
161
 
162
+ # Contributing
163
 
164
+ Read the [contributing guide](./CONTRIBUTING.md) for details on how to contribute to Unitxt.
 
 
165
 
166
+ #
167
+
168
+ # Citation
169
 
170
  If you use Unitxt in your research, please cite our paper:
171
 
 
194
  publisher = "Association for Computational Linguistics",
195
  url = "https://aclanthology.org/2024.naacl-demo.21",
196
  pages = "207--215",
 
197
  }
198
+ ```
 
 
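The rewritten README ends its Python quick start at loading the dataset. As a rough, hedged sketch of the remaining steps — running a model over the chat-format data and scoring it — the card, model name, and provider below are assumptions for illustration, not part of this commit:

```python
from unitxt import evaluate, load_dataset
from unitxt.inference import CrossProviderInferenceEngine

# Load a catalog dataset in chat API format, as in the README snippet above.
dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
)

# Assumed engine and provider; any unitxt inference engine callable on a dataset works.
model = CrossProviderInferenceEngine(
    model="llama-3-1-8b-instruct",
    provider="watsonx",
)

predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

print("Global Results:\n", results.global_scores.summary)
```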
evaluate_cli.py CHANGED
@@ -299,7 +299,9 @@ def cli_load_dataset(args: argparse.Namespace) -> HFDataset:
299
  )
300
 
301
  # this hack circumvents an issue with multi-level benchmarks (such as Bluebench's translation subset) that fail when wrapped with an additional Benchmark() object.
302
- if len(benchmark_subsets) == 1:
 
 
303
  source = next(iter(benchmark_subsets.values()))
304
  else:
305
  source = Benchmark(subsets=benchmark_subsets)
@@ -452,7 +454,7 @@ def initialize_inference_engine(
452
  )
453
 
454
  # Keep the actual model name for the results
455
- args.model = inference_model.engine.model
456
  else:
457
  # This case should not be reached due to argparse choices
458
  logger.error(
 
299
  )
300
 
301
  # this hack circumvents an issue with multi-level benchmarks (such as Bluebench's translation subset) that fail when wrapped with an additional Benchmark() object.
302
+ if len(benchmark_subsets) == 1 and isinstance(
303
+ next(iter(benchmark_subsets.values())), Benchmark
304
+ ):
305
  source = next(iter(benchmark_subsets.values()))
306
  else:
307
  source = Benchmark(subsets=benchmark_subsets)
 
454
  )
455
 
456
  # Keep the actual model name for the results
457
+ args.model = inference_model.get_engine_id()
458
  else:
459
  # This case should not be reached due to argparse choices
460
  logger.error(
hf_utils.py CHANGED
@@ -1,11 +1,30 @@
 
1
  from pathlib import Path
2
-
3
- from datasets.utils.py_utils import get_imports
4
 
5
  from .deprecation_utils import compare_versions
6
  from .file_utils import get_all_files_in_dir
7
 
8
 
 
 
 
9
  def get_missing_imports(file, exclude=None):
10
  if exclude is None:
11
  exclude = []
@@ -13,8 +32,7 @@ def get_missing_imports(file, exclude=None):
13
  python_files = get_all_files_in_dir(src_dir, file_extension=".py")
14
  # get only the file without the path and extension
15
  required_modules = [Path(p).stem for p in python_files]
16
- imports = get_imports(file)
17
- imported_modules = [i[1] for i in imports if i[0] == "internal"]
18
  return [
19
  i for i in required_modules if i not in imported_modules and i not in exclude
20
  ]
 
1
+ import re
2
  from pathlib import Path
3
+ from typing import List
 
4
 
5
  from .deprecation_utils import compare_versions
6
  from .file_utils import get_all_files_in_dir
7
 
8
 
9
+ def get_internal_imports(file_path: str) -> List[str]:
10
+ """Return a list of local (relative) modules directly imported in the given Python file."""
11
+ internal_imports = []
12
+ is_in_docstring = False
13
+ with open(file_path, encoding="utf-8") as f:
14
+ for line in f:
15
+ if line.count('"""') == 1 or line.count("'''") == 1:
16
+ is_in_docstring = not is_in_docstring
17
+ if is_in_docstring:
18
+ continue
19
+ # Match "import .module" or "from .module import ..."
20
+ match = re.match(r"^(?:import|from)\s+\.(\w+)", line)
21
+ if match:
22
+ module = match.group(1)
23
+ if module not in internal_imports:
24
+ internal_imports.append(module)
25
+ return internal_imports
26
+
27
+
28
  def get_missing_imports(file, exclude=None):
29
  if exclude is None:
30
  exclude = []
 
32
  python_files = get_all_files_in_dir(src_dir, file_extension=".py")
33
  # get only the file without the path and extension
34
  required_modules = [Path(p).stem for p in python_files]
35
+ imported_modules = get_internal_imports(file)
 
36
  return [
37
  i for i in required_modules if i not in imported_modules and i not in exclude
38
  ]
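As a quick, self-contained sanity check of the new regex-based `get_internal_imports` (the replacement for the removed `datasets.utils.py_utils.get_imports` dependency), here is a hedged sketch; the file contents are invented and the import path of `hf_utils` should be adjusted to wherever the module lives:

```python
import tempfile

from hf_utils import get_internal_imports  # assumed import path

# Relative imports inside a docstring are skipped; absolute imports are ignored.
source = "\n".join([
    '"""',
    "from .hidden import nothing  # inside the docstring, so not counted",
    '"""',
    "from .loaders import LoadHF",
    "from .operators import Shuffle",
    "from typing import List  # absolute import, not counted",
])

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name

print(get_internal_imports(path))  # expected: ['loaders', 'operators']
```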
inference.py CHANGED
@@ -1378,45 +1378,6 @@ class MockModeMixin(Artifact):
1378
  mock_mode: bool = False
1379
 
1380
 
1381
- class IbmGenAiInferenceEngineParamsMixin(Artifact):
1382
- beam_width: Optional[int] = None
1383
- decoding_method: Optional[Literal["greedy", "sample"]] = None
1384
- include_stop_sequence: Optional[bool] = None
1385
- length_penalty: Any = None
1386
- max_new_tokens: Optional[int] = None
1387
- min_new_tokens: Optional[int] = None
1388
- random_seed: Optional[int] = None
1389
- repetition_penalty: Optional[float] = None
1390
- return_options: Any = None
1391
- stop_sequences: Optional[List[str]] = None
1392
- temperature: Optional[float] = None
1393
- time_limit: Optional[int] = None
1394
- top_k: Optional[int] = None
1395
- top_p: Optional[float] = None
1396
- truncate_input_tokens: Optional[int] = None
1397
- typical_p: Optional[float] = None
1398
-
1399
-
1400
- @deprecation(version="2.0.0", alternative=IbmGenAiInferenceEngineParamsMixin)
1401
- class IbmGenAiInferenceEngineParams(Artifact):
1402
- beam_width: Optional[int] = None
1403
- decoding_method: Optional[Literal["greedy", "sample"]] = None
1404
- include_stop_sequence: Optional[bool] = None
1405
- length_penalty: Any = None
1406
- max_new_tokens: Optional[int] = None
1407
- min_new_tokens: Optional[int] = None
1408
- random_seed: Optional[int] = None
1409
- repetition_penalty: Optional[float] = None
1410
- return_options: Any = None
1411
- stop_sequences: Optional[List[str]] = None
1412
- temperature: Optional[float] = None
1413
- time_limit: Optional[int] = None
1414
- top_k: Optional[int] = None
1415
- top_p: Optional[float] = None
1416
- truncate_input_tokens: Optional[int] = None
1417
- typical_p: Optional[float] = None
1418
-
1419
-
1420
  class GenericInferenceEngine(
1421
  InferenceEngine, ArtifactFetcherMixin, LogProbInferenceEngine
1422
  ):
@@ -1430,7 +1391,7 @@ class GenericInferenceEngine(
1430
  "GenericInferenceEngine could not be initialized"
1431
  '\nThis is because the "UNITXT_INFERENCE_ENGINE" environment variable is not set and no default engine was provided.'
1432
  "\nFor example, you can fix it by setting"
1433
- "\nexport UNITXT_INFERENCE_ENGINE=engines.ibm_gen_ai.llama_3_70b_instruct"
1434
  "\nto your ~/.bashrc"
1435
  "\nor passing a similar required engine in the default argument"
1436
  )
@@ -1601,214 +1562,6 @@ class OptionSelectingByLogProbsInferenceEngine:
1601
  return dataset
1602
 
1603
 
1604
- class IbmGenAiInferenceEngine(
1605
- InferenceEngine,
1606
- IbmGenAiInferenceEngineParamsMixin,
1607
- PackageRequirementsMixin,
1608
- LogProbInferenceEngine,
1609
- OptionSelectingByLogProbsInferenceEngine,
1610
- ):
1611
- label: str = "ibm_genai"
1612
- model_name: str
1613
- _requirements_list = {
1614
- "ibm-generative-ai": "Install ibm-genai package using 'pip install --upgrade ibm-generative-ai"
1615
- }
1616
- data_classification_policy = ["public", "proprietary"]
1617
- parameters: Optional[IbmGenAiInferenceEngineParams] = None
1618
- rate_limit: int = 10
1619
-
1620
- def get_engine_id(self):
1621
- return get_model_and_label_id(self.model_name, self.label)
1622
-
1623
- @staticmethod
1624
- def _get_credentials():
1625
- from genai import Credentials
1626
-
1627
- api_key_env_var_name = "GENAI_KEY" # pragma: allowlist secret
1628
- api_key = os.environ.get(api_key_env_var_name)
1629
-
1630
- assert api_key is not None, (
1631
- f"Error while trying to run IbmGenAiInferenceEngine."
1632
- f" Please set the environment param '{api_key_env_var_name}'."
1633
- )
1634
-
1635
- return Credentials(api_key=api_key)
1636
-
1637
- def prepare_engine(self):
1638
- self.check_missing_requirements()
1639
-
1640
- from genai import Client
1641
- from genai.text.generation import CreateExecutionOptions
1642
-
1643
- credentials = self._get_credentials()
1644
- self.client = Client(credentials=credentials)
1645
-
1646
- self.execution_options = CreateExecutionOptions(
1647
- concurrency_limit=self.rate_limit
1648
- )
1649
-
1650
- self._set_inference_parameters()
1651
-
1652
- def _infer(
1653
- self,
1654
- dataset: Union[List[Dict[str, Any]], Dataset],
1655
- return_meta_data: bool = False,
1656
- ) -> Union[List[str], List[TextGenerationInferenceOutput]]:
1657
- from genai.schema import TextGenerationParameters, TextGenerationResult
1658
-
1659
- self.verify_not_chat_api(dataset)
1660
-
1661
- genai_params = TextGenerationParameters(
1662
- **self.to_dict([IbmGenAiInferenceEngineParamsMixin])
1663
- )
1664
-
1665
- responses = self.client.text.generation.create(
1666
- model_id=self.model_name,
1667
- inputs=[instance["source"] for instance in dataset],
1668
- parameters=genai_params,
1669
- execution_options=self.execution_options,
1670
- )
1671
-
1672
- results = []
1673
- for response in responses:
1674
- generation_result: TextGenerationResult = response.results[0]
1675
- result = self.get_return_object(
1676
- generation_result.generated_text, generation_result, return_meta_data
1677
- )
1678
- results.append(result)
1679
- return results
1680
-
1681
- def _infer_log_probs(
1682
- self,
1683
- dataset: Union[List[Dict[str, Any]], Dataset],
1684
- return_meta_data: bool = False,
1685
- ) -> Union[List[Dict], List[TextGenerationInferenceOutput]]:
1686
- from genai.schema import TextGenerationParameters, TextGenerationResult
1687
-
1688
- self.verify_not_chat_api(dataset)
1689
-
1690
- logprobs_return_options = {
1691
- "generated_tokens": True,
1692
- "input_text": False,
1693
- "input_tokens": False,
1694
- "token_logprobs": True,
1695
- "token_ranks": True,
1696
- "top_n_tokens": 5,
1697
- }
1698
- genai_params = self.to_dict(
1699
- [IbmGenAiInferenceEngineParamsMixin], keep_empty=False
1700
- )
1701
- genai_params = {**genai_params, "return_options": logprobs_return_options}
1702
- genai_params = TextGenerationParameters(**genai_params)
1703
- predictions = self.client.text.generation.create(
1704
- model_id=self.model_name,
1705
- inputs=[instance["source"] for instance in dataset],
1706
- parameters=genai_params,
1707
- execution_options=self.execution_options,
1708
- )
1709
-
1710
- predict_results = []
1711
- for prediction in predictions:
1712
- result: TextGenerationResult = prediction.results[0]
1713
- assert isinstance(
1714
- result.generated_tokens, list
1715
- ), "result.generated_tokens should be a list"
1716
-
1717
- predict_result = []
1718
- for base_token in result.generated_tokens:
1719
- res = {**base_token.__dict__, **base_token.model_extra}
1720
- res["top_tokens"] = [
1721
- {"logprob": top_token.logprob, "text": top_token.text}
1722
- for top_token in res["top_tokens"]
1723
- ]
1724
- predict_result.append(res)
1725
- final_results = self.get_return_object(
1726
- predict_result, result, return_meta_data
1727
- )
1728
- predict_results.append(final_results)
1729
- return predict_results
1730
-
1731
- def get_return_object(self, predict_result, result, return_meta_data):
1732
- if return_meta_data:
1733
- return TextGenerationInferenceOutput(
1734
- prediction=predict_result,
1735
- input_tokens=result.input_token_count,
1736
- output_tokens=result.generated_token_count,
1737
- model_name=self.model_name,
1738
- inference_type=self.label,
1739
- input_text=result.input_text,
1740
- seed=self.random_seed,
1741
- stop_reason=result.stop_reason,
1742
- )
1743
- return predict_result
1744
-
1745
- def get_model_details(self) -> Dict:
1746
- from genai import ApiClient
1747
- from genai.model import ModelService
1748
-
1749
- api_client = ApiClient(credentials=self._get_credentials())
1750
- model_info = (
1751
- ModelService(api_client=api_client).retrieve(id=self.model_name).result
1752
- )
1753
- return model_info.dict()
1754
-
1755
- def get_token_count(self, dataset):
1756
- texts = [instance["source"] for instance in dataset]
1757
- token_counts = list(
1758
- tqdm(
1759
- [
1760
- result.token_count
1761
- for response in self.client.text.tokenization.create(
1762
- model_id=self.model_name,
1763
- input=texts,
1764
- execution_options={"ordered": True},
1765
- )
1766
- for result in response.results
1767
- ],
1768
- desc="Tokenizing",
1769
- total=len(texts),
1770
- )
1771
- )
1772
- for i, token_count in enumerate(token_counts):
1773
- dataset[i]["token_count"] = token_count
1774
- return dataset
1775
-
1776
- def get_options_log_probs(self, dataset):
1777
- """Add to each instance in the data a "options_log_prob" field, which is a dict with str as key and a list of {text: str, logprob:float}."""
1778
- from genai.schema import TextGenerationParameters, TextGenerationReturnOptions
1779
-
1780
- texts = [x["source"] for x in dataset]
1781
-
1782
- responses = tqdm(
1783
- self.client.text.generation.create(
1784
- model_id=self.model_name,
1785
- inputs=texts,
1786
- execution_options={"ordered": True},
1787
- parameters=TextGenerationParameters(
1788
- max_new_tokens=1,
1789
- return_options=TextGenerationReturnOptions(
1790
- input_tokens=True, token_logprobs=True
1791
- ),
1792
- # random_seed=self.random_state
1793
- ),
1794
- ),
1795
- total=len(texts),
1796
- desc="Completions",
1797
- )
1798
-
1799
- scores = [
1800
- [
1801
- {"text": token.text, "logprob": token.logprob}
1802
- for token in response.results[0].input_tokens
1803
- ]
1804
- for response in responses
1805
- ]
1806
-
1807
- for instance, score in zip(dataset, scores):
1808
- instance["prediction"] = score[instance["task_data"]["token_count"] - 1 :]
1809
- return dataset
1810
-
1811
-
1812
  class CredentialsOpenAi(TypedDict, total=False):
1813
  api_key: str
1814
  api_url: str
@@ -2099,6 +1852,11 @@ class RITSInferenceEngine(
2099
  "meta-llama/Llama-3.1-8B-Instruct": "llama-3-1-8b-instruct",
2100
  "meta-llama/Llama-4-Scout-17B-16E-Instruct": "llama-4-scout-17b-16e-instruct",
2101
  "mistralai/Mistral-Small-3.1-24B-Instruct-2503": "mistral-small-3-1-24b-2503",
 
 
 
 
 
2102
  }
2103
 
2104
  def get_default_headers(self):
@@ -3467,16 +3225,18 @@ _supported_apis = Literal[
3467
  "open-ai",
3468
  "aws",
3469
  "ollama",
3470
- "bam",
3471
  "watsonx-sdk",
3472
  "rits",
3473
  "azure",
3474
  "vertex-ai",
3475
  "replicate",
 
3476
  ]
3477
 
3478
 
3479
- class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
 
 
3480
  """Inference engine capable of dynamically switching between multiple providers APIs.
3481
 
3482
  This class extends the InferenceEngine and OpenAiInferenceEngineParamsMixin
@@ -3516,7 +3276,11 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3516
  "granite-3-3-2b-instruct": "ibm/granite-3-3-2b-instruct",
3517
  "granite-3-3-8b-instruct": "ibm/granite-3-3-8b-instruct",
3518
  "granite-34b-code-instruct": "ibm/granite-34b-code-instruct",
 
3519
  "granite-guardian-3-8b": "ibm/granite-guardian-3-8b",
 
 
 
3520
  "granite-vision-3-2-2b": "ibm/granite-vision-3-2-2b",
3521
  "llama-3-1-8b-instruct": "meta-llama/llama-3-1-8b-instruct",
3522
  "llama-3-1-70b-instruct": "meta-llama/llama-3-1-70b-instruct",
@@ -3570,17 +3334,14 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3570
  "granite-3-3-2b-instruct": "granite3.3:2b",
3571
  "granite-3-3-8b-instruct": "granite3.3:8b",
3572
  },
3573
- "bam": {
3574
- "granite-3-8b-instruct": "ibm/granite-8b-instruct-preview-4k",
3575
- "llama-3-8b-instruct": "meta-llama/llama-3-8b-instruct",
3576
- "llama-3-2-1b-instruct": "meta-llama/llama-3-2-1b-instruct",
3577
- "flan-t5-xxl": "google/flan-t5-xxl",
3578
- },
3579
  "rits": {
3580
  "granite-3-8b-instruct": "ibm-granite/granite-3.0-8b-instruct",
3581
  "granite-3-1-8b-instruct": "ibm-granite/granite-3.1-8b-instruct",
3582
  "granite-3-2-8b-instruct": "ibm-granite/granite-3.2-8b-instruct",
3583
  "granite-3-3-8b-instruct": "ibm-granite/granite-3.3-8b-instruct",
 
 
 
3584
  "llama-3-1-8b-instruct": "meta-llama/Llama-3.1-8B-Instruct",
3585
  "llama-3-1-70b-instruct": "meta-llama/llama-3-1-70b-instruct",
3586
  "llama-3-1-405b-instruct": "meta-llama/llama-3-1-405b-instruct-fp8",
@@ -3595,9 +3356,9 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3595
  "mixtral-8x7b-instruct": "mistralai/mixtral-8x7B-instruct-v0.1",
3596
  "mixtral-8x7b-instruct-v01": "mistralai/mixtral-8x7B-instruct-v0.1",
3597
  "deepseek-v3": "deepseek-ai/DeepSeek-V3",
3598
- "granite-guardian-3-2-3b-a800m": "ibm-granite/granite-guardian-3.2-3b-a800m",
3599
- "granite-guardian-3-2-5b": "ibm-granite/granite-guardian-3.2-5b",
3600
  "phi-4": "microsoft/phi-4",
 
 
3601
  },
3602
  "open-ai": {
3603
  "o1-mini": "o1-mini",
@@ -3699,9 +3460,16 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3699
  "gpt-4-1": "replicate/openai/gpt-4.1",
3700
  },
3701
  "hf-local": {
3702
- "granite-3-3-8b-instruct": "ibm-granite/granite-3.3-8b-instruct",
3703
  "llama-3-3-8b-instruct": "meta-llama/Llama-3.3-8B-Instruct",
3704
  "SmolLM2-1.7B-Instruct": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
 
 
3705
  },
3706
  }
3707
  provider_model_map["watsonx"] = {
@@ -3714,7 +3482,6 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3714
  "together-ai": LiteLLMInferenceEngine,
3715
  "aws": LiteLLMInferenceEngine,
3716
  "ollama": OllamaInferenceEngine,
3717
- "bam": IbmGenAiInferenceEngine,
3718
  "watsonx-sdk": WMLInferenceEngineChat,
3719
  "rits": RITSInferenceEngine,
3720
  "azure": LiteLLMInferenceEngine,
@@ -3724,7 +3491,6 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3724
  }
3725
 
3726
  _provider_param_renaming = {
3727
- "bam": {"max_tokens": "max_new_tokens", "model": "model_name"},
3728
  "watsonx-sdk": {"model": "model_name"},
3729
  "rits": {"model": "model_name"},
3730
  "hf-local": {"model": "model_name", "max_tokens": "max_new_tokens"},
@@ -3737,7 +3503,6 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3737
  return self.provider if self.provider is not None else settings.default_provider
3738
 
3739
  def prepare_engine(self):
3740
- # print("provider", self.provider)
3741
  provider = self.get_provider_name()
3742
  if provider not in self._provider_to_base_class:
3743
  raise UnitxtError(
@@ -3783,6 +3548,17 @@ class CrossProviderInferenceEngine(InferenceEngine, StandardAPIParamsMixin):
3783
  return get_model_and_label_id(self.provider_model_map[api][self.model], api)
3784
  return get_model_and_label_id(self.model, api)
3785
 
 
 
3786
 
3787
  class HFOptionSelectingInferenceEngine(InferenceEngine, TorchDeviceMixin):
3788
  """HuggingFace based class for inference engines that calculate log probabilities.
 
1378
  mock_mode: bool = False
1379
 
1380
 
 
1381
  class GenericInferenceEngine(
1382
  InferenceEngine, ArtifactFetcherMixin, LogProbInferenceEngine
1383
  ):
 
1391
  "GenericInferenceEngine could not be initialized"
1392
  '\nThis is because the "UNITXT_INFERENCE_ENGINE" environment variable is not set and no default engine was provided.'
1393
  "\nFor example, you can fix it by setting"
1394
+ "\nexport UNITXT_INFERENCE_ENGINE=engines.ibm_wml.llama_3_70b_instruct"
1395
  "\nto your ~/.bashrc"
1396
  "\nor passing a similar required engine in the default argument"
1397
  )
 
1562
  return dataset
1563
 
1564
 
 
1565
  class CredentialsOpenAi(TypedDict, total=False):
1566
  api_key: str
1567
  api_url: str
 
1852
  "meta-llama/Llama-3.1-8B-Instruct": "llama-3-1-8b-instruct",
1853
  "meta-llama/Llama-4-Scout-17B-16E-Instruct": "llama-4-scout-17b-16e-instruct",
1854
  "mistralai/Mistral-Small-3.1-24B-Instruct-2503": "mistral-small-3-1-24b-2503",
1855
+ "ibm-granite/granite-guardian-3.2-3b-a800m": "granite-guardian-3-2-3b-a800m",
1856
+ "ibm-granite/granite-guardian-3.2-5b": "granite-guardian-3-2-5b-ris",
1857
+ "granite-guardian-3-2-5b-ris": "granite-guardian-3-3-8b",
1858
+ "openai/gpt-oss-20b": "gpt-oss-20b",
1859
+ "openai/gpt-oss-120b": "gpt-oss-120b",
1860
  }
1861
 
1862
  def get_default_headers(self):
 
3225
  "open-ai",
3226
  "aws",
3227
  "ollama",
 
3228
  "watsonx-sdk",
3229
  "rits",
3230
  "azure",
3231
  "vertex-ai",
3232
  "replicate",
3233
+ "hf-local",
3234
  ]
3235
 
3236
 
3237
+ class CrossProviderInferenceEngine(
3238
+ InferenceEngine, StandardAPIParamsMixin, LogProbInferenceEngine
3239
+ ):
3240
  """Inference engine capable of dynamically switching between multiple providers APIs.
3241
 
3242
  This class extends the InferenceEngine and OpenAiInferenceEngineParamsMixin
 
3276
  "granite-3-3-2b-instruct": "ibm/granite-3-3-2b-instruct",
3277
  "granite-3-3-8b-instruct": "ibm/granite-3-3-8b-instruct",
3278
  "granite-34b-code-instruct": "ibm/granite-34b-code-instruct",
3279
+ "granite-guardian-3-2b": "ibm/granite-guardian-3-2b",
3280
  "granite-guardian-3-8b": "ibm/granite-guardian-3-8b",
3281
+ "granite-guardian-3-1-2b": "ibm/granite-guardian-3-2b", # LifecycleWarning: Model 'ibm/granite-guardian-3-2b' is in deprecated state from 2025-07-09 until 2025-10-08. IDs of alternative models: ibm/granite-guardian-3-2-5b.
3282
+ "granite-guardian-3-1-8b": "ibm/granite-guardian-3-8b",
3283
+ "granite-guardian-3-2-5b": "ibm/granite-guardian-3-2-5b",
3284
  "granite-vision-3-2-2b": "ibm/granite-vision-3-2-2b",
3285
  "llama-3-1-8b-instruct": "meta-llama/llama-3-1-8b-instruct",
3286
  "llama-3-1-70b-instruct": "meta-llama/llama-3-1-70b-instruct",
 
3334
  "granite-3-3-2b-instruct": "granite3.3:2b",
3335
  "granite-3-3-8b-instruct": "granite3.3:8b",
3336
  },
 
 
3337
  "rits": {
3338
  "granite-3-8b-instruct": "ibm-granite/granite-3.0-8b-instruct",
3339
  "granite-3-1-8b-instruct": "ibm-granite/granite-3.1-8b-instruct",
3340
  "granite-3-2-8b-instruct": "ibm-granite/granite-3.2-8b-instruct",
3341
  "granite-3-3-8b-instruct": "ibm-granite/granite-3.3-8b-instruct",
3342
+ "granite-guardian-3-2-3b": "ibm-granite/granite-guardian-3.2-3b-a800m",
3343
+ "granite-guardian-3-2-5b": "ibm-granite/granite-guardian-3.2-5b",
3344
+ "granite-guardian-3-3-8b": "ibm-granite/granite-guardian-3.3-8b",
3345
  "llama-3-1-8b-instruct": "meta-llama/Llama-3.1-8B-Instruct",
3346
  "llama-3-1-70b-instruct": "meta-llama/llama-3-1-70b-instruct",
3347
  "llama-3-1-405b-instruct": "meta-llama/llama-3-1-405b-instruct-fp8",
 
3356
  "mixtral-8x7b-instruct": "mistralai/mixtral-8x7B-instruct-v0.1",
3357
  "mixtral-8x7b-instruct-v01": "mistralai/mixtral-8x7B-instruct-v0.1",
3358
  "deepseek-v3": "deepseek-ai/DeepSeek-V3",
 
 
3359
  "phi-4": "microsoft/phi-4",
3360
+ "gpt-oss-20b": "openai/gpt-oss-20b",
3361
+ "gpt-oss-120b": "openai/gpt-oss-120b",
3362
  },
3363
  "open-ai": {
3364
  "o1-mini": "o1-mini",
 
3460
  "gpt-4-1": "replicate/openai/gpt-4.1",
3461
  },
3462
  "hf-local": {
 
3463
  "llama-3-3-8b-instruct": "meta-llama/Llama-3.3-8B-Instruct",
3464
  "SmolLM2-1.7B-Instruct": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
3465
+ "granite-guardian-3-1-2b": "ibm-granite/granite-guardian-3.1-2b",
3466
+ "granite-guardian-3-1-8b": "ibm-granite/granite-guardian-3.1-8b",
3467
+ "granite-guardian-3-2-3b": "ibm-granite/granite-guardian-3.2-3b-a800m",
3468
+ "granite-guardian-3-2-5b": "ibm-granite/granite-guardian-3.2-5b",
3469
+ "granite-guardian-3-3-8b": "ibm-granite/granite-guardian-3.3-8b",
3470
+ "granite-3-3-2b-instruct": "ibm-granite/granite-3.3-2b-instruct",
3471
+ "granite-3-3-8b-instruct": "ibm-granite/granite-3.3-8b-instruct",
3472
+ "granite-4-0-tiny-preview": "ibm-granite/granite-4.0-tiny-preview",
3473
  },
3474
  }
3475
  provider_model_map["watsonx"] = {
 
3482
  "together-ai": LiteLLMInferenceEngine,
3483
  "aws": LiteLLMInferenceEngine,
3484
  "ollama": OllamaInferenceEngine,
 
3485
  "watsonx-sdk": WMLInferenceEngineChat,
3486
  "rits": RITSInferenceEngine,
3487
  "azure": LiteLLMInferenceEngine,
 
3491
  }
3492
 
3493
  _provider_param_renaming = {
 
3494
  "watsonx-sdk": {"model": "model_name"},
3495
  "rits": {"model": "model_name"},
3496
  "hf-local": {"model": "model_name", "max_tokens": "max_new_tokens"},
 
3503
  return self.provider if self.provider is not None else settings.default_provider
3504
 
3505
  def prepare_engine(self):
 
3506
  provider = self.get_provider_name()
3507
  if provider not in self._provider_to_base_class:
3508
  raise UnitxtError(
 
3548
  return get_model_and_label_id(self.provider_model_map[api][self.model], api)
3549
  return get_model_and_label_id(self.model, api)
3550
 
3551
+ def _infer_log_probs(
3552
+ self,
3553
+ dataset: Union[List[Dict[str, Any]], Dataset],
3554
+ return_meta_data: bool = False,
3555
+ ) -> Union[List[Dict], List[TextGenerationInferenceOutput]]:
3556
+ if not isinstance(self.engine, LogProbInferenceEngine):
3557
+ raise UnitxtError(
3558
+ f"The underlying inference engine of this instance of CrossProviderInferenceEngine ({self.engine.get_engine_id()}) must inherit from LogProbInferenceEngine and implement _infer_log_probs"
3559
+ )
3560
+ return self.engine._infer_log_probs(dataset, return_meta_data)
3561
+
3562
 
3563
  class HFOptionSelectingInferenceEngine(InferenceEngine, TorchDeviceMixin):
3564
  """HuggingFace based class for inference engines that calculate log probabilities.
llm_as_judge_constants.py CHANGED
@@ -32,7 +32,7 @@ class Criteria(Artifact):
32
  prediction_field: Optional[str] = None
33
  """The prediction field name this criteria expects and refers to, e.g. answer/model response/summary"""
34
 
35
- context_fields: Union[str, List[str], Dict[str, str]] = None
36
  """The context field names this criteria expects, i.e. [context]/[source article, user questions]"""
37
 
38
  @staticmethod
@@ -370,7 +370,7 @@ class DirectCriteriaCatalogEnum(Enum):
370
  name="conciseness",
371
  description="Is the response concise and to the point?",
372
  prediction_field="response",
373
- context_fields=[],
374
  options=[
375
  CriteriaOption(
376
  name="Yes",
@@ -1603,7 +1603,9 @@ Errors: Are there any errors in grammar, vocabulary, punctuation, or formatting
1603
  )
1604
 
1605
 
1606
- DIRECT_CRITERIA = [c.value for c in DirectCriteriaCatalogEnum]
 
 
1607
 
1608
 
1609
  class PairwiseCriteriaCatalogEnum(Enum):
@@ -1625,7 +1627,7 @@ class PairwiseCriteriaCatalogEnum(Enum):
1625
  name="factually_consistent",
1626
  description="A factually consistent response contains only statements that are entailed by the source document.",
1627
  prediction_field="response",
1628
- context_fields=[],
1629
  )
1630
 
1631
  INCLUSIVITY = Criteria(
@@ -1658,4 +1660,4 @@ class PairwiseCriteriaCatalogEnum(Enum):
1658
  )
1659
 
1660
 
1661
- PAIRWISE_CRITERIA = [c.value for c in PairwiseCriteriaCatalogEnum]
 
32
  prediction_field: Optional[str] = None
33
  """The prediction field name this criteria expects and refers to, e.g. answer/model response/summary"""
34
 
35
+ context_fields: Optional[Union[str, List[str], Dict[str, str]]] = None
36
  """The context field names this criteria expects, i.e. [context]/[source article, user questions]"""
37
 
38
  @staticmethod
 
370
  name="conciseness",
371
  description="Is the response concise and to the point?",
372
  prediction_field="response",
373
+ context_fields=["question"],
374
  options=[
375
  CriteriaOption(
376
  name="Yes",
 
1603
  )
1604
 
1605
 
1606
+ DIRECT_CRITERIA: List[CriteriaWithOptions] = [
1607
+ c.value for c in DirectCriteriaCatalogEnum
1608
+ ]
1609
 
1610
 
1611
  class PairwiseCriteriaCatalogEnum(Enum):
 
1627
  name="factually_consistent",
1628
  description="A factually consistent response contains only statements that are entailed by the source document.",
1629
  prediction_field="response",
1630
+ context_fields=["source document"],
1631
  )
1632
 
1633
  INCLUSIVITY = Criteria(
 
1660
  )
1661
 
1662
 
1663
+ PAIRWISE_CRITERIA: List[Criteria] = [c.value for c in PairwiseCriteriaCatalogEnum]
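With `context_fields` now typed as `Optional[...]` and the catalog entries above pointing at concrete context fields, a user-defined criterion can do the same. A rough sketch; the criterion, its options, and the `option_map` scores are invented for illustration and are not catalog entries:

```python
from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions

groundedness = CriteriaWithOptions(
    name="grounded_in_context",
    description="Is every claim in the response supported by the provided context?",
    prediction_field="response",
    context_fields=["context"],
    options=[
        CriteriaOption(name="Yes", description="All claims are supported by the context."),
        CriteriaOption(name="No", description="At least one claim is unsupported."),
    ],
    option_map={"Yes": 1.0, "No": 0.0},
)
```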
loaders.py CHANGED
@@ -24,6 +24,7 @@ Available Loaders Overview:
24
  - :class:`MultipleSourceLoader <unitxt.loaders.MultipleSourceLoader>` - Combines data from multiple different sources.
25
  - :class:`LoadFromDictionary <unitxt.loaders.LoadFromDictionary>` - Loads data from a user-defined Python dictionary.
26
  - :class:`LoadFromHFSpace <unitxt.loaders.LoadFromHFSpace>` - Downloads and loads data from HuggingFace Spaces.
 
27
 
28
 
29
 
@@ -52,6 +53,7 @@ from typing import (
52
  Union,
53
  )
54
 
 
55
  import pandas as pd
56
  import requests
57
  from datasets import (
@@ -62,6 +64,7 @@ from datasets import (
62
  )
63
  from datasets import load_dataset as _hf_load_dataset
64
  from huggingface_hub import HfApi
 
65
  from tqdm import tqdm
66
 
67
  from .dataclass import NonPositionalField
@@ -96,21 +99,19 @@ def hf_load_dataset(path: str, *args, **kwargs):
96
  ):
97
  if settings.hf_offline_datasets_path is not None:
98
  path = os.path.join(settings.hf_offline_datasets_path, path)
99
- try:
100
- return _hf_load_dataset(
101
- path,
102
- *args,
103
- **kwargs,
104
- verification_mode="no_checks",
105
- trust_remote_code=settings.allow_unverified_code,
106
- download_mode="force_redownload"
107
- if settings.disable_hf_datasets_cache
108
- else "reuse_dataset_if_exists",
109
- )
110
- except ValueError as e:
111
- if "trust_remote_code" in str(e):
112
- raise UnitxtUnverifiedCodeError(path) from e
113
- raise e # Re raise
114
 
115
 
116
  @retry_connection_with_exponential_backoff(backoff_factor=2)
@@ -119,13 +120,9 @@ def hf_get_dataset_splits(path: str, name: str, revision=None):
119
  return get_dataset_split_names(
120
  path=path,
121
  config_name=name,
122
- trust_remote_code=settings.allow_unverified_code,
123
  revision=revision,
124
  )
125
  except Exception as e:
126
- if "trust_remote_code" in str(e):
127
- raise UnitxtUnverifiedCodeError(path) from e
128
-
129
  if "Couldn't find cache" in str(e):
130
  raise FileNotFoundError(
131
  f"Dataset cache path={path}, name={name} was not found."
@@ -354,7 +351,7 @@ class LoadHF(LazyLoader):
354
  raise NotImplementedError() from None
355
 
356
  if not disable_memory_caching:
357
- self.__class__._loader_cache.max_size = settings.loader_cache_size
358
  self.__class__._loader_cache[dataset_id] = dataset
359
  self._already_logged_limited_loading = True
360
 
@@ -476,7 +473,7 @@ class LoadWithPandas(LazyLoader):
476
 
477
  dataset = dataframe.to_dict("records")
478
 
479
- self.__class__._loader_cache.max_size = settings.loader_cache_size
480
  self.__class__._loader_cache[dataset_id] = dataset
481
 
482
  for instance in self.__class__._loader_cache[dataset_id]:
@@ -499,7 +496,7 @@ class LoadWithPandas(LazyLoader):
499
 
500
 
501
  class LoadCSV(LoadWithPandas):
502
- """Loads data from CSV files.
503
 
504
  Supports streaming and can handle large files by loading them in chunks.
505
 
@@ -510,6 +507,7 @@ class LoadCSV(LoadWithPandas):
510
  streaming: Bool indicating if streaming should be used.
511
  sep: String specifying the separator used in the CSV files.
512
  indirect_read: Bool indicating whether to open a remote file with urllib first
 
513
 
514
  Example:
515
  Loading csv
@@ -517,15 +515,30 @@ class LoadCSV(LoadWithPandas):
517
  .. code-block:: python
518
 
519
  load_csv = LoadCSV(files={'train': 'path/to/train.csv'}, chunksize=100)
 
 
520
  """
521
 
522
  sep: str = ","
 
523
 
524
  def read_dataframe(self, file) -> pd.DataFrame:
525
  with error_context(
526
  stage="Raw Dataset Loading",
527
  help="https://www.unitxt.ai/en/latest/unitxt.loaders.html#module-unitxt.loaders",
528
  ):
 
 
 
 
529
  if self.indirect_read:
530
  # Open the URL with urllib first to mitigate HTTP errors that sometimes happen with the internal pandas implementation
531
  from urllib import request
@@ -535,12 +548,10 @@ class LoadCSV(LoadWithPandas):
535
  response,
536
  sep=self.sep,
537
  low_memory=self.streaming,
538
- **self.get_args(),
539
  )
540
 
541
- return pd.read_csv(
542
- file, sep=self.sep, low_memory=self.streaming, **self.get_args()
543
- )
544
 
545
 
546
  def read_file(source) -> bytes:
@@ -668,7 +679,7 @@ class LoadFromSklearn(LazyLoader):
668
  df = pd.DataFrame([split_data["data"], targets]).T
669
  df.columns = ["data", "target"]
670
  dataset = df.to_dict("records")
671
- self.__class__._loader_cache.max_size = settings.loader_cache_size
672
  self.__class__._loader_cache[dataset_id] = dataset
673
  for instance in self.__class__._loader_cache[dataset_id]:
674
  yield recursive_copy(instance)
@@ -1247,3 +1258,211 @@ class LoadFromAPI(Loader):
1247
  self.__class__._loader_cache.max_size = settings.loader_cache_size
1248
  self.__class__._loader_cache[str(self)] = iterables
1249
  return MultiStream.from_iterables(iterables, copying=True)
 
 
24
  - :class:`MultipleSourceLoader <unitxt.loaders.MultipleSourceLoader>` - Combines data from multiple different sources.
25
  - :class:`LoadFromDictionary <unitxt.loaders.LoadFromDictionary>` - Loads data from a user-defined Python dictionary.
26
  - :class:`LoadFromHFSpace <unitxt.loaders.LoadFromHFSpace>` - Downloads and loads data from HuggingFace Spaces.
27
+ - :class:`LoadIOB <unitxt.loaders.LoadIOB>` - Loads data from IOB format files for named entity recognition tasks.
28
 
29
 
30
 
 
53
  Union,
54
  )
55
 
56
+ import datasets
57
  import pandas as pd
58
  import requests
59
  from datasets import (
 
64
  )
65
  from datasets import load_dataset as _hf_load_dataset
66
  from huggingface_hub import HfApi
67
+ from packaging.version import Version
68
  from tqdm import tqdm
69
 
70
  from .dataclass import NonPositionalField
 
99
  ):
100
  if settings.hf_offline_datasets_path is not None:
101
  path = os.path.join(settings.hf_offline_datasets_path, path)
102
+
103
+ if settings.disable_hf_datasets_cache:
104
+ kwargs["download_mode"] = "force_redownload"
105
+
106
+ if Version(datasets.__version__) < Version("4.0.0"):
107
+ kwargs["trust_remote_code"] = True
108
+
109
+ return _hf_load_dataset(
110
+ path,
111
+ *args,
112
+ **kwargs,
113
+ verification_mode="no_checks",
114
+ )
 
 
115
 
116
 
117
  @retry_connection_with_exponential_backoff(backoff_factor=2)
 
120
  return get_dataset_split_names(
121
  path=path,
122
  config_name=name,
 
123
  revision=revision,
124
  )
125
  except Exception as e:
 
 
 
126
  if "Couldn't find cache" in str(e):
127
  raise FileNotFoundError(
128
  f"Dataset cache path={path}, name={name} was not found."
 
351
  raise NotImplementedError() from None
352
 
353
  if not disable_memory_caching:
354
+ self.__class__._loader_cache._max_size = settings.loader_cache_size
355
  self.__class__._loader_cache[dataset_id] = dataset
356
  self._already_logged_limited_loading = True
357
 
 
473
 
474
  dataset = dataframe.to_dict("records")
475
 
476
+ self.__class__._loader_cache._max_size = settings.loader_cache_size
477
  self.__class__._loader_cache[dataset_id] = dataset
478
 
479
  for instance in self.__class__._loader_cache[dataset_id]:
 
496
 
497
 
498
  class LoadCSV(LoadWithPandas):
499
+ r"""Loads data from CSV files.
500
 
501
  Supports streaming and can handle large files by loading them in chunks.
502
 
 
507
  streaming: Bool indicating if streaming should be used.
508
  sep: String specifying the separator used in the CSV files.
509
  indirect_read: Bool indicating whether to open a remote file with urllib first
510
+ column_names: Optional list of column names to use instead of header row.
511
 
512
  Example:
513
  Loading csv
 
515
  .. code-block:: python
516
 
517
  load_csv = LoadCSV(files={'train': 'path/to/train.csv'}, chunksize=100)
518
+
519
+ Loading TSV with custom column names
520
+
521
+ .. code-block:: python
522
+
523
+ load_csv = LoadCSV(
524
+ files={'train': 'path/to/train.tsv'},
525
+ sep='\t',
526
+ column_names=['id', 'question', 'table_name', 'answer']
527
+ )
528
  """
529
 
530
  sep: str = ","
531
+ column_names: Optional[List[str]] = None
532
 
533
  def read_dataframe(self, file) -> pd.DataFrame:
534
  with error_context(
535
  stage="Raw Dataset Loading",
536
  help="https://www.unitxt.ai/en/latest/unitxt.loaders.html#module-unitxt.loaders",
537
  ):
538
+ args = self.get_args()
539
+ if self.column_names is not None:
540
+ args["names"] = self.column_names
541
+ args["header"] = None # Don't use first row as header
542
  if self.indirect_read:
543
  # Open the URL with urllib first to mitigate HTTP errors that sometime happen with the internal pandas implementation
544
  from urllib import request
 
548
  response,
549
  sep=self.sep,
550
  low_memory=self.streaming,
551
+ **args,
552
  )
553
 
554
+ return pd.read_csv(file, sep=self.sep, low_memory=self.streaming, **args)
 
 
555
 
556
 
557
  def read_file(source) -> bytes:
 
679
  df = pd.DataFrame([split_data["data"], targets]).T
680
  df.columns = ["data", "target"]
681
  dataset = df.to_dict("records")
682
+ self.__class__._loader_cache._max_size = settings.loader_cache_size
683
  self.__class__._loader_cache[dataset_id] = dataset
684
  for instance in self.__class__._loader_cache[dataset_id]:
685
  yield recursive_copy(instance)
 
1258
  self.__class__._loader_cache.max_size = settings.loader_cache_size
1259
  self.__class__._loader_cache[str(self)] = iterables
1260
  return MultiStream.from_iterables(iterables, copying=True)
1261
+
1262
+
1263
+ class LoadIOB(LazyLoader):
1264
+ """Loads data from IOB format files.
1265
+
1266
+ This loader can parse IOB (Inside-Outside-Beginning) format files commonly used for
1267
+ named entity recognition tasks. It supports both local files and remote URLs,
1268
+ and can handle various IOB formats including CoNLL-U style files.
1269
+
1270
+ Args:
1271
+ files (Dict[str, str]):
1272
+ A dictionary mapping split names to file paths or URLs.
1273
+ column_names (tuple, optional):
1274
+ Column names for the IOB format. Defaults to ('id', 'token', 'tag', 'misc', 'annotator').
1275
+ fix_tags (bool, optional):
1276
+ Whether to apply tag fixing for OTH and B-O tags. Defaults to True.
1277
+ encoding (str, optional):
1278
+ File encoding. Defaults to 'utf-8'.
1279
+
1280
+ Example:
1281
+ Loading IOB files
1282
+
1283
+ .. code-block:: python
1284
+
1285
+ load_iob = LoadIOB(files={'train': 'path/to/train.iob2', 'test': 'path/to/test.iob2'})
1286
+ """
1287
+
1288
+ files: Dict[str, str]
1289
+ column_names: tuple = ("id", "token", "tag", "misc", "annotator")
1290
+ fix_tags: bool = True
1291
+ encoding: str = "utf-8"
1292
+
1293
+ _requirements_list: List[str] = ["conllu"]
1294
+
1295
+ def _maybe_set_classification_policy(self):
1296
+ self.set_default_data_classification(
1297
+ ["proprietary"], "when loading from local files"
1298
+ )
1299
+
1300
+ def get_splits(self) -> List[str]:
1301
+ return list(self.files.keys())
1302
+
1303
+ def split_generator(self, split: str) -> Generator:
1304
+ import conllu
1305
+
1306
+ dataset_id = str(self) + "_" + split
1307
+ dataset = self.__class__._loader_cache.get(dataset_id, None)
1308
+
1309
+ if dataset is None:
1310
+ if self.get_limit() is not None:
1311
+ self.log_limited_loading()
1312
+
1313
+ file_path = self.files[split]
1314
+ dataset = []
1315
+ id_counter = 0
1316
+
1317
+ try:
1318
+ # Handle remote URLs
1319
+ if file_path.startswith(("http://", "https://")):
1320
+ import io
1321
+ import urllib.request
1322
+
1323
+ with urllib.request.urlopen(file_path) as response:
1324
+ content = response.read().decode(self.encoding)
1325
+ # Use StringIO to create a file-like object
1326
+ content_file = io.StringIO(content)
1327
+ sentences = list(
1328
+ conllu.parse_incr(content_file, fields=self.column_names)
1329
+ )
1330
+ else:
1331
+ # Handle local files
1332
+ with open(file_path, encoding=self.encoding) as data_file:
1333
+ sentences = list(
1334
+ conllu.parse_incr(data_file, fields=self.column_names)
1335
+ )
1336
+
1337
+ limit = self.get_limit()
1338
+ processed_count = 0
1339
+
1340
+ for sent in sentences:
1341
+ if limit is not None and processed_count >= limit:
1342
+ break
1343
+
1344
+ # Get sentence ID
1345
+ if "sent_id" in sent.metadata:
1346
+ idx = sent.metadata["sent_id"]
1347
+ else:
1348
+ idx = id_counter
1349
+
1350
+ # Extract tokens and tags
1351
+ tokens = [token["token"] for token in sent]
1352
+ actual_tags = [token["tag"] for token in sent]
1353
+
1354
+ # Apply tag fixing if enabled
1355
+ if self.fix_tags:
1356
+ fixed_tags = []
1357
+ for actual_tag in actual_tags:
1358
+ if "OTH" in actual_tag or actual_tag == "B-O":
1359
+ actual_tag = "O"
1360
+ fixed_tags.append(actual_tag)
1361
+ else:
1362
+ fixed_tags = actual_tags
1363
+
1364
+ # Extract annotator info if available
1365
+ annotator = []
1366
+ for token in sent:
1367
+ if "annotator" in token and token["annotator"] is not None:
1368
+ annotator.append(token["annotator"])
1369
+ else:
1370
+ annotator.append("")
1371
+
1372
+ # Get text from metadata or reconstruct from tokens
1373
+ if "text" in sent.metadata:
1374
+ text = sent.metadata["text"]
1375
+ else:
1376
+ text = " ".join(tokens)
1377
+
1378
+ instance = {
1379
+ "idx": str(idx),
1380
+ "text": text,
1381
+ "tokens": tokens,
1382
+ "ner_tags": fixed_tags,
1383
+ "annotator": annotator,
1384
+ }
1385
+
1386
+ dataset.append(instance)
1387
+ processed_count += 1
1388
+ id_counter += 1
1389
+
1390
+ except Exception as e:
1391
+ with error_context(
1392
+ stage="Raw Dataset Loading",
1393
+ help="https://www.unitxt.ai/en/latest/unitxt.loaders.html#module-unitxt.loaders",
1394
+ ):
1395
+ raise UnitxtError(
1396
+ f"Failed to load IOB file {file_path}: {e!s}"
1397
+ ) from e
1398
+
1399
+ # Cache the dataset
1400
+ self.__class__._loader_cache.max_size = settings.loader_cache_size
1401
+ self.__class__._loader_cache[dataset_id] = dataset
1402
+
1403
+ # Yield instances from cached dataset
1404
+ for instance in dataset:
1405
+ yield recursive_copy(instance)
1406
+
1407
+
1408
+ class TURLColumnTypeAnnotationLoader(LazyLoader):
1409
+ data_classification_policy = ["public"]
1410
+ _requirements_list = ["huggingface_hub"]
1411
+
1412
+ def prepare(self):
1413
+ super().prepare()
1414
+ from huggingface_hub import hf_hub_download
1415
+
1416
+ self._download = hf_hub_download
1417
+
1418
+ def get_splits(self) -> List[str]:
1419
+ return ["train", "validation", "test"]
1420
+
1421
+ @staticmethod
1422
+ def _load_table(table_data):
1423
+ headers = table_data[5]
1424
+ cols = table_data[6]
1425
+ if not cols:
1426
+ return {"header": headers, "rows": []}
1427
+ row_count = max(x[-1][0][0] for x in cols)
1428
+ rows = []
1429
+ for i in range(row_count):
1430
+ row = []
1431
+ for col in cols:
1432
+ cell = next((c[1][1] for c in col if c[0][0] == i), "")
1433
+ row.append(cell)
1434
+ if any(row):
1435
+ rows.append(row)
1436
+ return {"header": headers, "rows": rows}
1437
+
1438
+ def split_generator(self, split: str) -> Generator[Dict[str, Any], None, None]:
1439
+ dataset_id = str(self) + "_" + split
1440
+ dataset = self.__class__._loader_cache.get(dataset_id, None)
1441
+ if split == "validation":
1442
+ split = "dev"
1443
+ if dataset is None:
1444
+ file_path = self._download(
1445
+ "stanford-crfm/helm-scenarios",
1446
+ filename=f"turl-column-type-annotation/{split}.table_col_type.json",
1447
+ repo_type="dataset",
1448
+ revision="main",
1449
+ )
1450
+ with open(file_path, encoding="utf-8") as f:
1451
+ data = json.load(f)
1452
+ dataset = []
1453
+ for table_data in data:
1454
+ table_content = self._load_table(table_data)
1455
+ for idx, colname in enumerate(table_data[5]):
1456
+ instance = {
1457
+ "page_title": table_data[1],
1458
+ "section_title": table_data[3],
1459
+ "table_caption": table_data[4],
1460
+ "table": table_content,
1461
+ "colname": colname,
1462
+ "annotations": table_data[7][idx],
1463
+ }
1464
+ dataset.append(instance)
1465
+ self.__class__._loader_cache[dataset_id] = dataset
1466
+
1467
+ for instance in self.__class__._loader_cache[dataset_id]:
1468
+ yield instance
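A hedged sketch of driving the new `LoadIOB` loader on its own; the file path is a placeholder, and in a real recipe the loader would sit in a `TaskCard`'s `loader` field. It assumes the usual `LazyLoader` call convention of producing a `MultiStream` keyed by split:

```python
from unitxt.loaders import LoadIOB

loader = LoadIOB(files={"train": "data/train.iob2"})  # placeholder path

multi_stream = loader()  # LazyLoader call -> MultiStream of splits
for instance in multi_stream["train"]:
    # Fields built by split_generator above.
    print(instance["idx"], instance["tokens"], instance["ner_tags"])
    break
```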
metrics.py CHANGED
@@ -8,7 +8,7 @@ import uuid
8
  import warnings
9
  from abc import ABC, abstractmethod
10
  from collections import Counter, defaultdict
11
- from dataclasses import asdict, field
12
  from dataclasses import fields as dataclasses_fields
13
  from enum import Enum
14
  from functools import lru_cache
@@ -21,6 +21,7 @@ from typing import (
21
  Literal,
22
  Optional,
23
  Tuple,
 
24
  TypeVar,
25
  Union,
26
  )
@@ -6160,23 +6161,174 @@ For MacOS: If error on 'mecab-config' show up during installation ], one should
6160
  """
6161
 
6162
 
6163
- class NormalizedSacrebleu(HuggingfaceMetric):
6164
- """Normalized SacreBLEU metric for machine translation evaluation.
 
 
 
 
6165
 
6166
- Range: [0, 1] (higher is better)
6167
- Character-level tokenization of BLEU score for improved cross-lingual evaluation.
6168
 
6169
- Reference: https://arxiv.org/abs/1804.08771
 
6170
  """
6171
 
6172
- hf_metric_name = "sacrebleu"
6173
- hf_main_score = "score"
6174
- prediction_type = str
6175
  main_score = "sacrebleu"
6176
- scale = 100.0
6177
- scaled_fields = ["sacrebleu", "precisions"]
6178
- hf_additional_input_fields_pass_one_value = ["tokenize"]
6179
  _requirements_list = ["sacrebleu"]
 
 
6180
 
6181
 
6182
  class CustomF1Fuzzy(CustomF1):
@@ -6599,7 +6751,6 @@ class GraniteGuardianBase(InstanceMetric):
6599
  """Return metric for different kinds of "risk" from the Granite-3.0 Guardian model."""
6600
 
6601
  reduction_map: Dict[str, List[str]] = None
6602
- prediction_type = float
6603
  main_score = None
6604
  reduction_map = {}
6605
  wml_model_name: str = "ibm/granite-guardian-3-8b"
@@ -6936,7 +7087,7 @@ class GraniteGuardianCustomRisk(GraniteGuardianBase):
6936
  return messages
6937
 
6938
 
6939
- RISK_TYPE_TO_CLASS: Dict[RiskType, GraniteGuardianBase] = {
6940
  RiskType.USER_MESSAGE: GraniteGuardianUserRisk,
6941
  RiskType.ASSISTANT_MESSAGE: GraniteGuardianAssistantRisk,
6942
  RiskType.RAG: GraniteGuardianRagRisk,
 
8
  import warnings
9
  from abc import ABC, abstractmethod
10
  from collections import Counter, defaultdict
11
+ from dataclasses import asdict, dataclass, field
12
  from dataclasses import fields as dataclasses_fields
13
  from enum import Enum
14
  from functools import lru_cache
 
21
  Literal,
22
  Optional,
23
  Tuple,
24
+ Type,
25
  TypeVar,
26
  Union,
27
  )
 
6161
  """
6162
 
6163
 
6164
+ @dataclass
6165
+ class SacreBleuStats:
6166
+ counts: List[int]
6167
+ totals: List[int]
6168
+ sys_len: int
6169
+ ref_len: int
6170
 
 
 
6171
 
6172
+ class NormalizedSacrebleu(
6173
+ MapReduceMetric[str, SacreBleuStats], PackageRequirementsMixin
6174
+ ):
6175
+ """SacreBLEU metric implementation using MapReduceMetric pattern.
6176
+
6177
+ This implementation uses the official sacrebleu library for tokenization
6178
+ and BLEU computation, while supporting the map-reduce pattern for proper
6179
+ corpus-level evaluation that matches the behavior of the HuggingFace version.
6180
+
6181
+ Range: [0, 1] (higher is better)
6182
+ Reference: Post, M. 2018. A Call for Clarity in Reporting BLEU Scores.
6183
  """
6184
 
 
 
 
6185
  main_score = "sacrebleu"
6186
+ ci_score_names = ["sacrebleu"]
6187
+ prediction_type = str
 
6188
  _requirements_list = ["sacrebleu"]
6189
+ language_to_tokenizer: Optional[Dict[str, str]] = None
6190
+ # Configuration parameters matching sacrebleu API
6191
+ tokenize: str = None
6192
+ lowercase: bool = False
6193
+ force: bool = False
6194
+ smooth_method: str = "exp"
6195
+ smooth_value: Optional[float] = None
6196
+ use_effective_order: bool = True # Recommended by sacrebleu for sentence-level BLEU
6197
+ max_ngram_order: int = 4
6198
+
6199
+ def prepare(self):
6200
+ super().prepare()
6201
+ from sacrebleu.metrics.bleu import BLEU
6202
+
6203
+ self.bleu_metric = BLEU(
6204
+ lowercase=self.lowercase,
6205
+ force=self.force,
6206
+ tokenize=self.tokenize,
6207
+ smooth_method=self.smooth_method,
6208
+ smooth_value=self.smooth_value,
6209
+ max_ngram_order=self.max_ngram_order,
6210
+ effective_order=self.use_effective_order,
6211
+ )
6212
+
6213
+ def _get_tokenizer_for_language(self, language: str) -> str:
6214
+ """Get appropriate tokenizer for a given language."""
6215
+ if self.language_to_tokenizer is None:
6216
+ raise ValueError("Please set language_to_tokenizer.")
6217
+ if language.lower() not in self.language_to_tokenizer:
6218
+ raise ValueError(
6219
+ f"Language {language} is not in language_to_tokenizer please add it."
6220
+ )
6221
+
6222
+ return self.language_to_tokenizer.get(language.lower())
6223
+
6224
+ @staticmethod
6225
+ @lru_cache(maxsize=10000)
6226
+ def get_bleu_metric(
6227
+ lowercase: bool = False,
6228
+ force: bool = False,
6229
+ tokenize: Optional[str] = None,
6230
+ smooth_method: str = "exp",
6231
+ smooth_value: Optional[float] = None,
6232
+ max_ngram_order: int = 4,
6233
+ effective_order: bool = False,
6234
+ ):
6235
+ from sacrebleu.metrics.bleu import BLEU
6236
+
6237
+ return BLEU(
6238
+ lowercase=lowercase,
6239
+ force=force,
6240
+ tokenize=tokenize,
6241
+ smooth_method=smooth_method,
6242
+ smooth_value=smooth_value,
6243
+ max_ngram_order=max_ngram_order,
6244
+ effective_order=effective_order,
6245
+ )
6246
+
6247
+ def map(
6248
+ self,
6249
+ prediction: str,
6250
+ references: List[str],
6251
+ task_data: Dict[str, Any],
6252
+ ) -> SacreBleuStats:
6253
+ """Map function: compute BLEU statistics for a single instance using sacrebleu."""
6254
+ if self.tokenize is None and "target_language" in task_data:
6255
+ target_lang = task_data["target_language"]
6256
+ tokenize_method = self._get_tokenizer_for_language(target_lang)
6257
+ else:
6258
+ tokenize_method = self.tokenize
6259
+
6260
+ instance_bleu_metric = self.get_bleu_metric(
6261
+ lowercase=self.lowercase,
6262
+ force=self.force,
6263
+ tokenize=tokenize_method,
6264
+ smooth_method=self.smooth_method,
6265
+ smooth_value=self.smooth_value,
6266
+ max_ngram_order=self.max_ngram_order,
6267
+ effective_order=self.use_effective_order,
6268
+ )
6269
+
6270
+ # Use the instance-specific metric to get per-instance statistics
6271
+ bleu_result = instance_bleu_metric.sentence_score(prediction, references)
6272
+
6273
+ return SacreBleuStats(
6274
+ counts=bleu_result.counts,
6275
+ totals=bleu_result.totals,
6276
+ sys_len=bleu_result.sys_len,
6277
+ ref_len=bleu_result.ref_len,
6278
+ )
6279
+
6280
+ def reduce(self, intermediates: List[SacreBleuStats]) -> Dict[str, Any]:
6281
+ """Reduce function: aggregate statistics and compute corpus BLEU using sacrebleu."""
6282
+ if not intermediates:
6283
+ return {
6284
+ "sacrebleu": 0.0,
6285
+ "counts": [0, 0, 0, 0],
6286
+ "totals": [0, 0, 0, 0],
6287
+ "precisions": [0.0, 0.0, 0.0, 0.0],
6288
+ "bp": 0.0,
6289
+ "sys_len": 0,
6290
+ "ref_len": 0,
6291
+ }
6292
+
6293
+ # Aggregate all the statistics across instances
6294
+ total_counts = [0] * self.max_ngram_order
6295
+ total_totals = [0] * self.max_ngram_order
6296
+ total_sys_len = 0
6297
+ total_ref_len = 0
6298
+
6299
+ for stats in intermediates:
6300
+ for i in range(min(len(stats.counts), self.max_ngram_order)):
6301
+ total_counts[i] += stats.counts[i]
6302
+ total_totals[i] += stats.totals[i]
6303
+ total_sys_len += stats.sys_len
6304
+ total_ref_len += stats.ref_len
6305
+
6306
+ # Use sacrebleu's compute_bleu static method to compute the final score from aggregated stats
6307
+ # This is the proper way to get corpus-level BLEU from individual statistics
6308
+ bleu_result = self.bleu_metric.compute_bleu(
6309
+ correct=total_counts,
6310
+ total=total_totals,
6311
+ sys_len=total_sys_len,
6312
+ ref_len=total_ref_len,
6313
+ smooth_method=self.smooth_method,
6314
+ smooth_value=self.smooth_value,
6315
+ effective_order=self.use_effective_order,
6316
+ max_ngram_order=self.max_ngram_order,
6317
+ )
6318
+
6319
+ return {
6320
+ "sacrebleu": round(
6321
+ bleu_result.score / 100.0, 2
6322
+ ), # Convert from 0-100 to 0-1 scale
6323
+ "counts": total_counts,
6324
+ "totals": total_totals,
6325
+ "precisions": [
6326
+ round(p / 100.0, 2) for p in bleu_result.precisions
6327
+ ], # Convert from 0-100 to 0-1 scale
6328
+ "bp": round(bleu_result.bp, 2),
6329
+ "sys_len": total_sys_len,
6330
+ "ref_len": total_ref_len,
6331
+ }
6332
 
6333
 
6334
  class CustomF1Fuzzy(CustomF1):
 
6751
  """Return metric for different kinds of "risk" from the Granite-3.0 Guardian model."""
6752
 
6753
  reduction_map: Dict[str, List[str]] = None
 
6754
  main_score = None
6755
  reduction_map = {}
6756
  wml_model_name: str = "ibm/granite-guardian-3-8b"
 
7087
  return messages
7088
 
7089
 
7090
+ RISK_TYPE_TO_CLASS: Dict[RiskType, Type[GraniteGuardianBase]] = {
7091
  RiskType.USER_MESSAGE: GraniteGuardianUserRisk,
7092
  RiskType.ASSISTANT_MESSAGE: GraniteGuardianAssistantRisk,
7093
  RiskType.RAG: GraniteGuardianRagRisk,
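For readers unfamiliar with the map-reduce flow used by the new NormalizedSacrebleu, the sketch below reproduces the core aggregation outside unitxt: per-sentence statistics from sacrebleu's `sentence_score` are summed, and a single corpus score is then computed with `compute_bleu`. The library calls follow the same names used in the diff; the example sentences are invented.

```python
# Standalone sketch of the aggregation performed by NormalizedSacrebleu.map/reduce.
# Requires the sacrebleu package; the example data is invented.
from sacrebleu.metrics.bleu import BLEU

bleu = BLEU(tokenize="13a", effective_order=True)

predictions = ["the cat sat on the mat", "hello world"]
references = [["the cat is on the mat"], ["hello there world"]]

counts, totals = [0] * 4, [0] * 4
sys_len = ref_len = 0
for pred, refs in zip(predictions, references):
    stats = bleu.sentence_score(pred, refs)  # per-instance n-gram statistics ("map")
    for i in range(4):
        counts[i] += stats.counts[i]
        totals[i] += stats.totals[i]
    sys_len += stats.sys_len
    ref_len += stats.ref_len

corpus = bleu.compute_bleu(                  # corpus-level score from the sums ("reduce")
    correct=counts,
    total=totals,
    sys_len=sys_len,
    ref_len=ref_len,
    smooth_method="exp",
    effective_order=True,
    max_ngram_order=4,
)
print(round(corpus.score / 100.0, 2))        # normalized to [0, 1], as in the diff
```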
operator.py CHANGED
@@ -2,13 +2,12 @@ from abc import abstractmethod
 from dataclasses import field
 from typing import Any, Dict, Generator, List, Optional, Union
 
-from pkg_resources import DistributionNotFound, VersionConflict, require
-
 from .artifact import Artifact
 from .dataclass import FinalField, InternalField, NonPositionalField
 from .error_utils import error_context
 from .settings_utils import get_constants
 from .stream import DynamicStream, EmptyStreamError, MultiStream, Stream
+from .utils import DistributionNotFound, VersionConflict, require
 
 constants = get_constants()
 
operators.py CHANGED
@@ -218,7 +218,7 @@ class MapInstanceValues(InstanceOperator):
         if val_as_str in mapper:
             return recursive_copy(mapper[val_as_str])
         if self.strict:
-            raise KeyError(
+            raise ValueError(
                 f"value '{val_as_str}', the string representation of the value in field '{key}', is not found in mapper '{mapper}'"
             )
         return val
@@ -2574,3 +2574,40 @@ class Fillna(FieldOperator):
         except TypeError:
             return value
         return value
+
+
+class ReadFile(FieldOperator):
+    """Reads file content from local path or URL.
+
+    This operator can read files from local filesystem paths or remote URLs.
+    The content is returned as a string.
+
+    Args:
+        encoding (str): Text encoding to use when reading the file. Defaults to 'utf-8'.
+
+    Example:
+        Reading a local file
+
+        .. code-block:: python
+
+            ReadFile(field="file_path", to_field="content")
+
+        Reading from URL
+
+        .. code-block:: python
+
+            ReadFile(field="url", to_field="content")
+    """
+
+    encoding: str = "utf-8"
+
+    def process_value(self, value: str) -> str:
+        """Read file content from local path or URL."""
+        if value.startswith(("http://", "https://")):
+            # Read from URL
+            response = requests.get(value)
+            response.raise_for_status()
+            return response.content.decode(self.encoding, errors="replace")
+        # Read from local file
+        with open(value, encoding=self.encoding) as f:
+            return f.read()
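The new ReadFile operator branches on the value prefix; the standalone function below mirrors those two code paths with plain `requests`/`open` calls (it is an illustration, not the unitxt operator itself).

```python
# Standalone illustration of the two code paths in ReadFile.process_value above.
import requests


def read_text(source: str, encoding: str = "utf-8") -> str:
    if source.startswith(("http://", "https://")):
        response = requests.get(source)         # remote file over HTTP(S)
        response.raise_for_status()
        return response.content.decode(encoding, errors="replace")
    with open(source, encoding=encoding) as f:  # local file
        return f.read()
```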
settings_utils.py CHANGED
@@ -224,6 +224,7 @@ if Settings.is_uninitilized():
     settings.hf_offline_models_path = None
     settings.inference_engine_cache_path = "./inference_engine_cache/"
     settings.max_connection_retries = 3
+    settings.max_templates_tests_for_card_test = 10
     settings.dataset_cache_default = (bool, False)
 
 if Constants.is_uninitilized():
struct_data_operators.py CHANGED
@@ -24,6 +24,8 @@ For key-value pairs, expected input format is:
 """
 
 import ast
+import csv
+import io
 import json
 import random
 from abc import ABC, abstractmethod
@@ -31,6 +33,7 @@ from typing import (
     Any,
     Dict,
     List,
+    Literal,
     Optional,
     Tuple,
 )
@@ -1118,3 +1121,67 @@ class JsonStrToDict(FieldOperator):
             )
             dict_value = {}
         return {str(k): str(v) for k, v in dict_value.items() if v is not None}
+
+
+class ParseCSV(FieldOperator):
+    r"""Parse CSV/TSV text content into table format.
+
+    This operator converts CSV or TSV text content into the standard table format
+    used by Unitxt with header and rows fields.
+
+    Args:
+        separator (str): Field separator character. Defaults to ','.
+        has_header (bool): Whether the first row contains column headers. Defaults to True.
+        skip_header (bool): Whether to skip the first row entirely. Defaults to False.
+
+    Example:
+        Parsing CSV content
+
+        .. code-block:: python
+
+            ParseCSV(field="csv_content", to_field="table", separator=",")
+
+        Parsing TSV content
+
+        .. code-block:: python

+            ParseCSV(field="tsv_content", to_field="table", separator="\t")
+    """
+
+    separator: str = ","
+    has_header: bool = True
+    skip_header: bool = False
+    dtype: Optional[Literal["str"]] = None
+    strip_cells: bool = False
+
+    def process_value(self, value: str) -> Dict[str, Any]:
+        csv_reader = csv.reader(
+            io.StringIO(value), delimiter=self.separator, quotechar='"'
+        )
+        rows = []
+        header = []
+        for idx, row in enumerate(csv_reader):
+            if idx == 0 and self.has_header:
+                header = row
+                if self.skip_header:
+                    continue
+            else:
+                rows.append(row)
+
+        if not self.has_header or self.skip_header:
+            header = [f"col_{i}" for i in range(len(rows[0]))]
+
+        if self.strip_cells:
+
+            def clean_cell(x):
+                if isinstance(x, str):
+                    return x.replace("\n", " ").strip()
+                return x
+
+            rows = [[clean_cell(cell) for cell in row] for row in rows]
+            header = [clean_cell(h) for h in header]
+
+        return {
+            "header": header,
+            "rows": rows,
+        }
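To make the table format that ParseCSV produces concrete, here is a small standalone run of the same csv-module parsing on an invented string (it re-implements the header/rows split rather than importing the operator):

```python
# Standalone sketch of the {"header": ..., "rows": ...} output built by ParseCSV above.
import csv
import io

text = "name,age\nAlice,30\nBob,25\n"
reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')

header, rows = [], []
for idx, row in enumerate(reader):
    if idx == 0:
        header = row   # first row holds the column names (has_header=True case)
    else:
        rows.append(row)

print({"header": header, "rows": rows})
# {'header': ['name', 'age'], 'rows': [['Alice', '30'], ['Bob', '25']]}
```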
utils.py CHANGED
@@ -9,9 +9,13 @@ import time
 from collections import OrderedDict
 from contextvars import ContextVar
 from functools import wraps
+from importlib.metadata import PackageNotFoundError
+from importlib.metadata import version as get_installed_version
 from typing import Any, Dict, Optional
 from urllib.error import HTTPError as UrllibHTTPError
 
+from packaging.requirements import Requirement
+from packaging.version import Version
 from requests.exceptions import ConnectionError, HTTPError
 from requests.exceptions import Timeout as TimeoutError
 
@@ -422,3 +426,46 @@ class LongString(str):
         if self._repr_str is not None:
             return self._repr_str
         return super().__repr__()
+
+
+class DistributionNotFound(Exception):
+    def __init__(self, requirement):
+        self.requirement = requirement
+        super().__init__(f"Distribution not found for requirement: {requirement}")
+
+
+class VersionConflict(Exception):
+    def __init__(self, dist, req):
+        self.dist = dist  # Distribution object, just emulate enough for your needs
+        self.req = req
+        super().__init__(f"Version conflict: {dist} does not satisfy {req}")
+
+
+class DistStub:
+    # Minimal stub to mimic pkg_resources.Distribution
+    def __init__(self, project_name, version):
+        self.project_name = project_name
+        self.version = version
+
+
+def require(requirements):
+    """Minimal drop-in replacement for pkg_resources.require.
+
+    Accepts a single requirement string or a list of them.
+    Raises DistributionNotFound or VersionConflict.
+    Returns nothing (side-effect only).
+    """
+    if isinstance(requirements, str):
+        requirements = [requirements]
+    for req_str in requirements:
+        req = Requirement(req_str)
+        if req.marker and not req.marker.evaluate():
+            continue  # skip, not needed for this environment
+        name = req.name
+        try:
+            ver = get_installed_version(name)
+        except PackageNotFoundError as e:
+            raise DistributionNotFound(req_str) from e
+        if req.specifier and not req.specifier.contains(Version(ver), prereleases=True):
+            dist = DistStub(name, ver)
+            raise VersionConflict(dist, req_str)
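The new `require()` helper replaces `pkg_resources` with `importlib.metadata` and `packaging`. A hedged usage sketch follows (the `unitxt.utils` import path is an assumption inferred from the relative imports in operator.py):

```python
# Sketch of calling the new require() helper; the import path is an assumption.
from unitxt.utils import DistributionNotFound, VersionConflict, require

try:
    require(["sacrebleu>=2.0"])  # accepts a single string or a list of requirement specs
except DistributionNotFound as err:
    print(err)                   # the package is not installed at all
except VersionConflict as err:
    print(err)                   # installed, but the version does not satisfy the spec
```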
version.py CHANGED
@@ -1 +1 @@
-version = "1.26.5"
+version = "1.26.6"