mariagrandury committed
Commit 9ec2e86 · Parent: 91f4a96

fix: use official fmti indicators csv

Files changed (4)
  1. fmti-indicators.csv +101 -0
  2. fmti_indicators.csv +0 -689
  3. fmti_indicators.pdf +0 -0
  4. pdf_parser.py +0 -74
fmti-indicators.csv ADDED
 @@ -0,0 +1,101 @@
+ Domain,Subdomain,Indicator,Definition,Notes,Reference_1,Reference_2,Link_1,Link_2
+ Upstream,Data,Data size,"For the data used in building the model, is the data size disclosed?","Data size should be reported in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). No form of decomposition into data phases is required.",Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science,Datasheets for Datasets,https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00041/43452/Data-Statements-for-Natural-Language-Processing,https://arxiv.org/abs/1803.09010
+ Upstream,Data,Data sources,"For all data used in building the model, are the data sources disclosed?","To receive this point, a meaningful decomposition of sources must be listed in an understandable way (e.g. named URLs/domains/databases/data providers). It does not suffice to say data is “sourced from the Internet"" or comes from ""licensed sources”.",Datasheets for Datasets,Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2010.13561
+ Upstream,Data,Data creators ,"For all data used in building the model, is there some characterization of the people who created the data?","While information about data creators may not be easily discernible for some data scraped from the web, the general sources (URLs/domains) should be listed, and, for other data that is bought, licensed, or collected, a reasonable attempt at characterizing the underlying people who provided the data is required to receive this point. The relevant properties of people can vary depending on context: for example, relevant properties could include demographic information like fraction of Black individuals contributing to the dataset, geographic information like fraction of European individuals contributing to the dataset, language information like fraction of L1 English speakers, or occupational information like the fraction of professional artists.",Datasheets for Datasets,Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2010.13561
+ Upstream,Data,Data source selection,Are the selection protocols for including and excluding data sources disclosed?,Selection protocols refer to procedures used to choose which datasets or subsets of datasets will be used to build a model. We will award this point even if the selection protocols are non-exhaustive.,Datasheets for Datasets,Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2010.13561
+ Upstream,Data,Data curation,"For all data sources, are the curation protocols for those data sources disclosed?","Curation protocols refer to steps taken to further modify data sources, such as procedures to manage, annotate, and organize data. The aims of curation might include improving the quality, relevance, and representativeness of the data. We will award this point if the developer reports that it does not perform any further curation beyond the data sources.",Datasheets for Datasets,Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2010.13561
+ Upstream,Data,Data augmentation,Are any steps the developer takes to augment its data sources disclosed?,Such steps might include augmenting data sources with synthetic data. We will award this point if the developer reports that it does not take any steps to augment its data.,Datasheets for Datasets,Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2010.13561
+ Upstream,Data,Harmful data filtration,"If data is filtered to remove harmful content, is there a description of the associated filter?",Such harmful content might relate to violence or child sexual abuse material. We will award this point if the developer reports that it does not perform any harmful data filtration.,Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus,"A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity",https://aclanthology.org/2021.emnlp-main.98/,https://arxiv.org/abs/2305.13169
+ Upstream,Data,Copyrighted data,"For all data used in building the model, is the associated copyright status disclosed?","To receive this point, the copyright status (e.g. copyrighted, public domain) must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point or if the disclosure is not comprehensive relative to legal copyright standards.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus",Machine Learning and Artificial Intelligence: Legal Concepts,https://arxiv.org/abs/2105.05241,https://genlaw.github.io/glossary.html#legal-concepts
+ Upstream,Data,Data license,"For all data used in building the model, is the associated license status disclosed?","To receive this point, the license status must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus",Machine Learning and Artificial Intelligence: Legal Concepts,https://arxiv.org/abs/2105.05241,https://genlaw.github.io/glossary.html#legal-concepts
+ Upstream,Data,Personal information in data,"For all data used in building the model, is the inclusion or exclusion of personal information in that data disclosed?","To receive this point, the disclosure of personal information must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point. Additionally, we will award this point if the developer reports the inclusion of personal information, independent of if and how they mitigate related privacy concerns.",Data Capitalism: Redefining the Logics of Surveillance and Privacy,What Does it Mean for a Language Model to Preserve Privacy?,https://journals.sagepub.com/doi/10.1177/0007650317718185,https://arxiv.org/abs/2202.05520
+ Upstream,Data labor,Use of human labor,Are the phases of the data pipeline where human labor is involved disclosed?,"Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the use of human labor in their data pipeline.",The future of crowd work,"AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere",https://dl.acm.org/doi/10.1145/2441776.2441923,https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots
+ Upstream,Data labor,Employment of data laborers,Is the organization that directly employs the people involved in data labor disclosed for each phase of the data pipeline?,"Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer provides the name of the organization that employs data laborers, even if other details about the employment relationship are not disclosed.",The future of crowd work,"AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere",https://dl.acm.org/doi/10.1145/2441776.2441923,https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots
+ Upstream,Data labor,Geographic distribution of data laborers,Is geographic information regarding the people involved in data labor disclosed for each phase of the data pipeline?,This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the geographic distribution of labor at the country-level.,Cleaning Up ChatGPT Takes Heavy Toll on Human Workers,Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass,https://www.wsj.com/articles/chatgpt-openai-content-abusive-sexually-explicit-harassment-kenya-workers-on-human-workers-cf191483,https://ghostwork.info/
+ Upstream,Data labor,Wages,Are the wages for people who perform data labor disclosed?,"This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model, and of all data that is created by or on behalf of the developer. We will award this point if the developer reports that it does not compensate workers.",The future of crowd work,"AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere",https://dl.acm.org/doi/10.1145/2441776.2441923,https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots
+ Upstream,Data labor,Instructions for creating data,Are the instructions given to people who perform data labor disclosed?,This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer makes a reasonable best-effort attempt to disclose instructions given to people who create data used to build the model for the bulk of the data phases involving human labor.,"Everyone wants to do the model work, not the data work",The future of crowd work,https://dl.acm.org/doi/10.1145/3411764.3445518,https://dl.acm.org/doi/10.1145/2441776.2441923
+ Upstream,Data labor,Labor protections,Are the labor protections for people who perform data labor disclosed?,"This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model. It is also inclusive of all data that is created by or on behalf of the developer. As an example, labor protections might include protocols to reduce the harm to workers' mental health stemming from exposure to violent content when annotating training data. We will award this point if the developer reports that it does not protect workers or if it does not use data laborers and therefore has no labor protections.","The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence",Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass,https://www.jstor.org/stable/j.ctv1ghv45t,https://ghostwork.info/
+ Upstream,Data labor,Third party partners,Are the third parties who were or are involved in the development of the model disclosed?,This indicator is inclusive of partnerships that go beyond data labor as there may be third party partners at various stages in the model development process. We will award this point if the developer reports that it was the sole entity involved in the development of the model.,"The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence",Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass,https://www.jstor.org/stable/j.ctv1ghv45t,https://ghostwork.info/
+ Upstream,Data access,Queryable external data access,Are external entities provided with queryable access to the data used to build the model?,"We will award this point for any reasonable mechanism for providing access: direct access to the data, an interface to query the data, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on which external entities are given access, insofar as these limits and restrictions are transparent and ensure a reasonable amount of external access. We may accept justifications for prohibiting queries of specific parts of the data.",Datasheets for Datasets,The ROOTS Search Tool: Data Transparency for LLMs,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2302.14035
+ Upstream,Data access,Direct external data access,Are external entities provided with direct access to the data used to build the model?,"We will award this point if external entities can directly access the data without any form of gating from the developer. With that said, we may award this point if the developer provides justifications for prohibiting access to specific parts of the data or to unauthorized external entities.",Datasheets for Datasets,The ROOTS Search Tool: Data Transparency for LLMs,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/2302.14035
+ Upstream,Compute,Compute usage,Is the compute required for building the model disclosed?,"Compute should be reported in appropriate units, which most often will be floating point operations (FLOPS). Compute should be reported to a precision of one significant figure (e.g. 5 x $10^{25}$ FLOPS). We will award this point even if there is no decomposition of the reported compute usage into compute phases, but it should be clear whether the reported compute usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate compute expenditure.",Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning,Energy and Policy Considerations for Deep Learning in NLP,https://arxiv.org/abs/2002.05651,https://arxiv.org/abs/1906.02243
+ Upstream,Compute,Development duration,Is the amount of time required to build the model disclosed?,"The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, training and subsequent evaluation and red teaming).",Compute Trends Across Three Eras of Machine Learning,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2202.05924,https://arxiv.org/abs/2203.15556
+ Upstream,Compute,Compute hardware,"For the primary hardware used to build the model, is the amount and type of hardware disclosed?","In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.",Compute Trends Across Three Eras of Machine Learning,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2202.05924,https://arxiv.org/abs/2203.15556
+ Upstream,Compute,Hardware owner,"For the primary hardware used in building the model, is the owner of the hardware disclosed?","For example, the hardware owner may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that hardware is owned by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.",Compute Trends Across Three Eras of Machine Learning,Training Compute-Optimal Large Language Models,https://arxiv.org/abs/2202.05924,https://arxiv.org/abs/2203.15556
+ Upstream,Compute,Energy usage,Is the amount of energy expended in building the model disclosed?,"Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh). Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage.",Quantifying the Carbon Emissions of Machine Learning,Carbon Emissions and Large Neural Network Training,https://arxiv.org/abs/1910.09700,https://arxiv.org/abs/2104.10350
+ Upstream,Compute,Carbon emissions,Is the amount of carbon emitted (associated with the energy used) in building the model disclosed?,"Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2). Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions.",Quantifying the Carbon Emissions of Machine Learning,Carbon Emissions and Large Neural Network Training,https://arxiv.org/abs/1910.09700,https://arxiv.org/abs/2104.10350
+ Upstream,Compute,Broader environmental impact,Are any broader environmental impacts from building the model besides carbon emissions disclosed?,"While the most direct environmental impact of building a foundation model is the energy used and, therefore, the potential carbon emissions, there may be other environmental impacts. For example, these may include the use of other resources such as water for cooling data centers or metals for producing specialized hardware. We recognize that there does not exist an authoritative or consensus list of broader environmental factors. For this reason, we will award this point if there is a meaningful, though potentially incomplete, discussion of broader environmental impact.",Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning,Energy and Policy Considerations for Deep Learning in NLP,https://arxiv.org/abs/2302.08476,https://arxiv.org/abs/1906.02243
+ Upstream,Methods,Model stages,Are all stages in the model development process disclosed?,"Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.",Model Cards for Model Reporting,Scaling Instruction-Finetuned Language Models,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2210.11416
+ Upstream,Methods,Model objectives,"For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?","We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).",Model Cards for Model Reporting,Scaling Instruction-Finetuned Language Models,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2210.11416
+ Upstream,Methods,Core frameworks,Are the core frameworks used for model development disclosed?,"Examples of core frameworks include Tensorflow, PyTorch, Jax, Hugging Face Transformers, Seqio, T5X, Keras, SciKit, and Triton. If there are significant internal frameworks, there should be some description of their function and/or a reasonably similar publicly-available analogue. We recognize that there does not exist an authoritative or consensus list of core frameworks. For this reason, we will award this point if there is a meaningful, though potentially incomplete, list of major frameworks for the first version of the index.",Model Cards for Model Reporting,Scaling Instruction-Finetuned Language Models,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2210.11416
+ Upstream,Methods,Additional dependencies,"Are any dependencies required to build the model disclosed besides data, compute, and code?","For example, if the model depends on an external search engine, programmable APIs, or tools, this should be disclosed. We recognize that there is not widespread consensus regarding what constitutes key dependencies beyond the data, compute, and code. We will award this point only if developers give a reasonable best-effort description of any additional dependencies or make clear that no additional dependencies are required.",Analyzing Leakage of Personally Identifiable Information in Language Models,ProPILE: Probing Privacy Leakage in Large Language Models,https://arxiv.org/abs/2302.00539,https://arxiv.org/abs/2307.01881
+ Upstream,Data Mitigations,Mitigations for privacy,Are any steps the developer takes to mitigate the presence of PII in the data disclosed?,"Such steps might include identifying personal information in the training data, filtering specific datasets to remove personal information, and reducing the likelihood that models will output personal information. We will award this point if the developer reports that it does not take steps to mitigate the presence of PII in the data.",Deduplicating Training Data Mitigates Privacy Risks in Language Models,Machine Learning and Artificial Intelligence: Legal Concepts,https://proceedings.mlr.press/v162/kandpal22a.html,https://genlaw.github.io/glossary.html#legal-concepts
+ Upstream,Data Mitigations,Mitigations for copyright,Are any steps the developer takes to mitigate the presence of copyrighted information in the data disclosed?,"Such steps might include identifying copyrighted data, filtering specific datasets to remove copyrighted data, and reducing the likelihood that models will output copyrighted information. We will award this point if the developer reports that it does not take steps to mitigate the presence of copyrighted information in the data.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus",Machine Learning and Artificial Intelligence: Legal Concepts,https://arxiv.org/abs/2105.05241,https://genlaw.github.io/glossary.html#legal-concepts
+ Model,Model basics,Input modality,Are the input modalities for the model disclosed?,"Input modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs.",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model basics,Output modality,Are the output modalities for the model disclosed?,"Output modalities refer to the types or formats of information that the model can produce as output. Examples of output modalities include text, image, audio, video, tables, graphs.",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model basics,Model components,Are all components of the model disclosed?,"Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module.",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model basics,Model size,"For all components of the model, is the associated model size disclosed?","This information should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder).",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model basics,Model architecture,Is the model architecture disclosed?,"Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently. We will award this point for any clear, though potentially incomplete, description of the model architecture.",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model basics,Centralized model documentation,Is key information about the model included in a centralized artifact such as a model card?,"We recognize that different developers may share this information through different types of documentation, such as a system card or several clearly interrelated documents. We will award this point for the disclosure of any such centralized artifact that provides key information typically included in a model card, though the artifact may be longer-form than a standard model card (e.g. a technical report).",Model Cards for Model Reporting,Interactive Model Cards: A Human-Centered Approach to Model Documentation,https://arxiv.org/abs/1810.03993,https://arxiv.org/abs/2205.02894
+ Model,Model access,External model access protocol,Is a protocol for granting external entities access to the model disclosed?,"A model access protocol refers to the steps, requirements, and considerations involved in granting authorized model access to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) a transparent decision on whether access has been granted within a specified, reasonable period of time.",The Gradient of Generative AI Release: Methods and Considerations,Structured access: an emerging paradigm for safe AI deployment,https://arxiv.org/abs/2302.04844,https://arxiv.org/abs/2201.05159
+ Model,Model access,Blackbox external model access,Is black box model access provided to external entities?,"Black box model access refers to the ability to query the model with inputs and receive outputs, potentially without further access. Examples of external entities that might be granted access include researchers, third-party auditors, and regulators. We will award this point for any reasonable access level: direct access to the model weights, an interface to query the model, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on the external entities that are permitted access, insofar as these limits and restrictions are transparent.",The Gradient of Generative AI Release: Methods and Considerations,Structured access: an emerging paradigm for safe AI deployment,https://arxiv.org/abs/2302.04844,https://arxiv.org/abs/2201.05159
+ Model,Model access,Full external model access,Is full model access provided to external entities?,"Full model access refers to the ability to access the model via the release of model weights. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via some high-level description of who has been granted access to the foundation model).",The Gradient of Generative AI Release: Methods and Considerations,Structured access: an emerging paradigm for safe AI deployment,https://arxiv.org/abs/2302.04844,https://arxiv.org/abs/2201.05159
+ Model,Capabilities,Capabilities description,Are the model's capabilities described?,"Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for any clear, but potentially incomplete, description of the multiple capabilities.",Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models,Holistic Evaluation of Language Models,https://arxiv.org/abs/2206.04615,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Capabilities,Capabilities demonstration,Are the model’s capabilities demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the model's capabilities that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for clear demonstrations of multiple capabilities.",Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models,Holistic Evaluation of Language Models,https://arxiv.org/abs/2206.04615,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Capabilities,Evaluation of capabilities,"Are the model’s capabilities rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?","Rigorous evaluations refer to precise quantifications of the model's behavior in relation to its capabilities. We recognize that capabilities may not perfectly align with evaluations, and that different developers may associate capabilities with evaluations differently. We will award this point for clear evaluations of multiple capabilities. For example, this may include evaluations of world knowledge, reasoning, state tracking or other such proficiencies. Or it may include the measurement of average performance (e.g. accuracy, F1) on benchmarks for specific tasks (e.g. text summarization, image captioning). We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's capabilities are presented as especially unusual such that standard evaluations will not suffice.",Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models,Holistic Evaluation of Language Models,https://arxiv.org/abs/2206.04615,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Capabilities,External reproducibility of capabilities evaluation,Are the evaluations of the model’s capabilities reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by an external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may be sufficient to score this point.",Leakage and the reproducibility crisis in machine-learning-based science,Holistic Evaluation of Language Models,https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389923001599%3Fshowall%3Dtrue,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Capabilities,Third party capabilities evaluation,Are the model’s capabilities evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the developer. We will award this point if (i) a third party has conducted an evaluation of model capabilities, (ii) the results of this evaluation are publicly available, and (iii) these results are disclosed or referred to in the developer’s materials.",Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,Holistic Evaluation of Language Models,https://dl.acm.org/doi/10.1145/3514094.3534181,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Limitations,Limitations description,Are the model's limitations disclosed?,"Limitations refer to the specific and distinctive functions that the model cannot perform (e.g. the model cannot answer questions about current events as it only contains data up to a certain time cutoff, the model is not very capable when it comes to a specific application). We recognize that different developers may use different terminology for limitations, or conceptualize limitations differently. We will award this point for any clear, but potentially incomplete, description of multiple limitations.",The Fallacy of AI Functionality,Holistic Evaluation of Language Models,https://dl.acm.org/doi/abs/10.1145/3531146.3533158,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Limitations,Limitations demonstration,Are the model’s limitations demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the limitations that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for limitations, or conceptualize the limitations differently. We will award this point for clear demonstrations of multiple limitations.",The Fallacy of AI Functionality,Holistic Evaluation of Language Models,https://dl.acm.org/doi/abs/10.1145/3531146.3533158,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Limitations,Third party evaluation of limitations,Can the model’s limitations be evaluated by third parties?,"By third parties, we mean entities that are significantly or fully independent of the model developers. In contrast to the third party evaluation indicators for capabilities and risks, we will award this point if third party evaluations are possible even if no third party has yet conducted them. Such evaluations are possible if, for example, the model is deployed via an API (or with open weights) and there are no restrictions on evaluating limitations (e.g. in the usage policy). ",Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,Holistic Evaluation of Language Models,https://dl.acm.org/doi/10.1145/3514094.3534181,https://openreview.net/forum?id=iO4LZibEqW
+ Model,Risks,Risks description,Are the model's risks disclosed?,"Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. This indicator requires disclosure of risks that may arise in the event of both (i) intentional (though possibly careless) use, such as bias or hallucinations and (ii) malicious use, such as fraud or disinformation. We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for any clear, but potentially incomplete, description of multiple risks.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Risks,Risks demonstration,Are the model’s risks demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the risks that are legible or understandable for the general public, without requiring specific technical expertise. This indicator requires demonstration of risks that may arise in the event of both (i) intentional (though possibly careless) use, such as biases or hallucinations and (ii) malicious use, such as fraud or disinformation. We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for clear demonstrations of multiple risks.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Risks,Unintentional harm evaluation,"Are the model’s risks related to unintentional harm rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?","Rigorous evaluations refer to precise quantifications of the model's behavior in relation to such risks. Unintentional harms include bias, toxicity, and issues relating to fairness. We recognize that unintended harms may not perfectly align with risk evaluations, and that different developers may associate risks with evaluations differently. We will award this point for clear evaluations of multiple such risks. We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's risks related to unintentional harm are presented as especially unusual or severe.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Risks,External reproducibility of unintentional harm evaluation,Are the evaluations of the model’s risks related to unintentional harm reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the developer for why it is not possible for the evaluation to be made reproducible may suffice.",Leakage and the reproducibility crisis in machine-learning-based science,Ethical and social risks of harm from Language Models,https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389923001599%3Fshowall%3Dtrue,https://arxiv.org/abs/2112.04359
+ Model,Risks,Intentional harm evaluation,"Are the model’s risks related to intentional harm rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?","Rigorous evaluations refer to precise quantifications of the model's behavior in relation to such risks. Intentional harms include fraud, disinformation, scams, cybersecurity attacks, designing weapons or pathogens, and uses of the model for illegal purposes. We recognize that intentional harms may not perfectly align with risk evaluations, and that different developers may associate risks with evaluations differently. We will award this point for clear evaluations of multiple such risks. We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model's risks related to intentional harm are presented as especially unusual or severe.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Risks,External reproducibility of intentional harm evaluation,Are the evaluations of the model’s risks related to intentional harm reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may suffice.",Leakage and the reproducibility crisis in machine-learning-based science,Ethical and social risks of harm from Language Models,https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389923001599%3Fshowall%3Dtrue,https://arxiv.org/abs/2112.04359
+ Model,Risks,Third party risks evaluation,Are the model’s risks evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the developer. A third party risk evaluation might involve the developer allowing a third party to choose a methodology for evaluating risks that differs from that of the developer. We will award this point if (i) a third party has conducted an evaluation of model risks, (ii) the results of this evaluation are publicly available, and (iii) these results are disclosed or referred to in the developer’s materials. If the results are not made public (but are disclosed to have been conducted) and/or the results are not discoverable in the developer’s materials, we will not award this point. We may accept a justification from either the third party or the developer for why part of the evaluation is not disclosed in relation to risks.",Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,Ethical and social risks of harm from Language Models,https://dl.acm.org/doi/10.1145/3514094.3534181,https://arxiv.org/abs/2112.04359
+ Model,Model Mitigations,Mitigations description,Are the model mitigations disclosed?,"By model mitigations, we refer to interventions implemented by the developer at the level of the model to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for any clear, but potentially incomplete, description of multiple mitigations associated with the model's risks. Alternatively, we will award this point if the developer reports that it does not mitigate risk.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Model Mitigations,Mitigations demonstration,Are the model mitigations demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the mitigations that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for clear demonstrations of multiple mitigations. We will also award this point if the developer reports that it does not mitigate the risks associated with the model.",Evaluating the Social Impact of Generative AI Systems in Systems and Society,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2306.05949,https://arxiv.org/abs/2112.04359
+ Model,Model Mitigations,Mitigations evaluation,"Are the model mitigations rigorously evaluated, with the results of these evaluations reported?",Rigorous evaluations refer to precise quantifications of the model's behavior in relation to the mitigations associated with its risks. We will award this point for clear evaluations of multiple mitigations.,Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation,Ethical and social risks of harm from Language Models,https://arxiv.org/abs/2310.06987,https://arxiv.org/abs/2112.04359
+ Model,Model Mitigations,External reproducibility of mitigations evaluation,Are the model mitigation evaluations reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. In the case of mitigations evaluations, this will usually involve details about a comparison to some baseline, which may be a different, unmitigated version of the model. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may suffice.",Leakage and the reproducibility crisis in machine-learning-based science,Ethical and social risks of harm from Language Models,https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389923001599%3Fshowall%3Dtrue,https://arxiv.org/abs/2112.04359
+ Model,Model Mitigations,Third party mitigations evaluation,Can the model mitigations be evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the model developers. This indicator assesses whether it is possible for third parties to assess mitigations, which is not restricted to the methods the developer uses to assess mitigations. In contrast to the third party evaluation indicators for capabilities and risks, we will award this point if third party evaluations are possible even if no third party has yet conducted them.",Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,Ethical and social risks of harm from Language Models,https://dl.acm.org/doi/10.1145/3514094.3534181,https://arxiv.org/abs/2112.04359
+ Model,Trustworthiness,Trustworthiness evaluation,"Is the trustworthiness of the model rigorously evaluated, with the results of these evaluations disclosed?","Rigorous evaluations refer to precise quantifications of the model's behavior in relation to its trustworthiness. For example, this may include evaluations of the model’s robustness or reliability, its uncertainty, calibration, or causality, or its interpretability or explainability. We recognize that trustworthiness may not perfectly align with evaluations, and that different developers may associate trustworthiness with evaluations differently. We will award this point for a clear evaluation of the trustworthiness of the model.",Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,https://arxiv.org/abs/2004.07213,https://arxiv.org/abs/2306.11698
+ Model,Trustworthiness,External reproducibility of trustworthiness evaluation,Are the trustworthiness evaluations reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by the external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer's results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of at least one evaluation. In the event that an evaluation is not reproducible, we may accept a justification by the model developer for why it is not possible for the evaluation to be made reproducible.",Leakage and the reproducibility crisis in machine-learning-based science,"Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-centered AI Systems",https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389923001599%3Fshowall%3Dtrue,https://dl.acm.org/doi/10.1145/3419764
+ Model,Inference,Inference duration evaluation,Is the time required for model inference disclosed for a clearly-specified task on a clearly-specified set of hardware?,"The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). We recognize that no established standard exists for the standardized reporting of inference evaluation. Therefore, we permit the developer to specify the task and hardware setup, as long as both are disclosed. For example, the specific task might be generating 100,000 tokens as 5,000 sequences of length 20 and the fixed set of hardware might be 8 NVIDIA A100s. The hardware in this evaluation need not be the hardware the developer uses for inference if it in fact does any inference itself.",MLPerf Inference Benchmark,Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs,https://arxiv.org/abs/1911.02549,https://arxiv.org/abs/2305.02440
+ Model,Inference,Inference compute evaluation,Is the compute usage for model inference disclosed for a clearly-specified task on a clearly-specified set of hardware?,"Compute usage for inference should be reported in FLOPS to a precision of one significant figure (e.g. 5 x $10^{25}$ FLOPS). We recognize that no established standard exists for the standardized reporting of inference evaluation. Therefore, we permit the developer to specify the task and hardware setup, as long as both are clear. For example, the specific task might be generating 100k tokens as 5k sequences of length 20 and the fixed set of hardware might be 8 NVIDIA A100s. The hardware in this evaluation need not be the hardware the developer uses for inference if it in fact does any inference itself.",MLPerf Inference Benchmark,Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs,https://arxiv.org/abs/1911.02549,https://arxiv.org/abs/2305.02440
+ Downstream,Distribution,Release decision-making,Is the developer’s protocol for deciding whether or not to release a model disclosed?,"We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award this point for any clear protocol that discusses the decision-making process, including if the protocol is more general to the developer rather than the specific foundation model under consideration.",The Gradient of Generative AI Release: Methods and Considerations,The Time Is Now to Develop Community Norms for the Release of Foundation Models,https://arxiv.org/abs/2302.04844,https://hai.stanford.edu/news/time-now-develop-community-norms-release-foundation-models
+ Downstream,Distribution,Release process,Is a description of the process of how the model was released disclosed?,"A description of the release process might include information about who received access to the model at what stage of the release of the model. For example, a developer might conduct a staged release where it releases the model to a select group at first and subsequently makes the model more widely available. We recognize that the release of a foundation model falls along a spectrum, with many different forms of release, and that different developers may conceptualize release differently. We will award this point for any detailed discussion of the release process, including if the discussion is more general to the developer rather than the specific foundation model under consideration.",The Gradient of Generative AI Release: Methods and Considerations,The Time Is Now to Develop Community Norms for the Release of Foundation Models,https://arxiv.org/abs/2302.04844,https://hai.stanford.edu/news/time-now-develop-community-norms-release-foundation-models
+ Downstream,Distribution,Distribution channels,Are all distribution channels disclosed?,"By distribution channel, we mean any pathway by which the model is made accessible to entities beyond the developer. We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. We will award this point if the developer discloses all of the distribution channels of which it is aware.",Understanding accountability in algorithmic supply chains,Thinking Upstream: Ethics and Policy Opportunities in AI Supply Chains,https://dl.acm.org/doi/10.1145/3593013.3594073,https://arxiv.org/abs/2303.07529
+ Downstream,Distribution,Products and services,Does the developer disclose whether any products and services offered by the developer are dependent on the model?,We recognize that a developer may provide many products and services that depend on a foundation model or internal derivatives of the model. We will award this point for a reasonable best-effort description of any ways the developer makes internal use of the model in its products or services.,Understanding accountability in algorithmic supply chains,On AI Deployment: AI supply chains (and why they matter),https://dl.acm.org/doi/10.1145/3593013.3594073,https://aipolicy.substack.com/p/supply-chains-2
+ Downstream,Distribution,Detection of machine-generated content,Are any mechanisms for detecting content generated by this model disclosed?,"Such a mechanism might include storing a copy of all outputs generated by the model to compare against, implementing a watermark when generating content using the model, or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it has no such mechanism.",A Watermark for Large Language Models,Robust Distortion-free Watermarks for Language Models,https://arxiv.org/abs/2301.10226,https://www.semanticscholar.org/paper/Robust-Distortion-free-Watermarks-for-Language-Kuditipudi-Thickstun/ccaff61e0c1e629d91d78f82a64b3cbc8f3f7023
+ Downstream,Distribution,Model License,Is a license for the model disclosed?,"In the event that licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use.","Stronger Together: on the Articulation of Ethical Charters, Legal Tools, and Technical Documentation in ML",An investigation of licensing of datasets for machine learning based on the GQM model,https://arxiv.org/abs/2305.18615,https://arxiv.org/abs/2303.13735
+ Downstream,Distribution,Terms of service,Are terms of service disclosed for each distribution channel?,We will award this point if there are terms-of-service that appear to apply to the bulk of the model’s distribution channels.,Terms-we-Serve-with: a feminist-inspired social imaginary for improved transparency and engagement in AI,Identifying Terms and Conditions Important to Consumers using Crowdsourcing,https://arxiv.org/abs/2206.02492,https://arxiv.org/abs/2111.12182
+ Downstream,Usage policy,Permitted and prohibited users,Is a description of who can and cannot use the model disclosed?,"Such restrictions may relate to countries (e.g. US-only), organizations (e.g. no competitors), industries (e.g. no weapons industry users) or other relevant factors. These restrictions on users are often contained in multiple policies; we group them here for simplicity. We will award this point for a clear description of permitted, restricted, and prohibited users of the model.",Best Practices for Deploying Language Models,Meta Platform Terms,https://txt.cohere.com/best-practices-for-deploying-language-models/,https://developers.facebook.com/terms/#datause
+ Downstream,Usage policy,"Permitted, restricted, and prohibited uses","Are permitted, restricted, and prohibited uses of the model disclosed?","We will award this point if at least two of the following three categories are disclosed: (i) permitted uses, (ii) restricted uses, and (iii) prohibited uses. By restricted uses, we mean uses that require a higher level of scrutiny (such as permission from or a separate contract with the developer) to be permitted. These uses are generally included in an acceptable use policy, model license, or usage policy.",Best Practices for Deploying Language Models,Meta Platform Terms,https://txt.cohere.com/best-practices-for-deploying-language-models/,https://developers.facebook.com/terms/#datause
+ Downstream,Usage policy,Usage policy enforcement,Is the enforcement protocol for the usage policy disclosed?,"By enforcement protocol, we refer to (i) mechanisms for identifying permitted and prohibited users, (ii) mechanisms for identifying permitted/restricted/prohibited uses, (iii) steps the developer takes to enforce its policies related to such uses, and (iv) the developer’s procedures for carrying out these steps. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not enforce its usage policy.",Best Practices for Deploying Language Models,Meta Platform Terms,https://txt.cohere.com/best-practices-for-deploying-language-models/,https://developers.facebook.com/terms/#datause
+ Downstream,Usage policy,Justification for enforcement action,Do users receive a justification when they are subject to an enforcement action for violating the usage policy?,"For example, does the developer disclose a protocol for telling users which part of the usage policy they violated, when they did so, and what specifically was violative? Enforcement actions refer to measures to limit a user’s ability to use the model, such as banning a user or restricting their ability to purchase tokens. We will award this point if the developer discloses that it gives justification for enforcement actions or, alternatively, if it discloses that it does not provide justification for enforcement actions or that it does not enforce its usage policy.",Best Practices for Deploying Language Models,Meta Platform Terms,https://txt.cohere.com/best-practices-for-deploying-language-models/,https://developers.facebook.com/terms/#datause
+ Downstream,Usage policy,Usage policy violation appeals mechanism,Is a mechanism for appealing potential usage policy violations disclosed?,"We will award this point if the developer provides a usage policy violation appeals mechanism, regardless of whether it is provided via a user interface or distribution channel.",Best Practices for Deploying Language Models,Meta Platform Terms,https://txt.cohere.com/best-practices-for-deploying-language-models/,https://developers.facebook.com/terms/#datause
+ Downstream,Model behavior policy,"Permitted, restricted, and prohibited model behaviors","Are model behaviors that are permitted, restricted, and prohibited disclosed?","We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from generating child sexual abuse material). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior.",I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models,"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!",https://arxiv.org/abs/2306.03423,https://arxiv.org/abs/2310.03693
+ Downstream,Model behavior policy,Model behavior policy enforcement,Is the enforcement protocol for the model behavior policy disclosed?,"By enforcement protocol, we refer to mechanisms for identifying whether model behavior is permitted or prohibited and actions that may arise in the event the model behavior policy is violated. For example, the developer may make updates to the model in response to issues with the model’s adherence to the model behavior policy. We will award this point if there is a clear description of the enforcement protocol, or if the developer reports that it does not enforce its model behavior policy or that it has no such restrictions on the model’s behavior.",Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!",https://arxiv.org/abs/2004.07213,https://arxiv.org/abs/2310.03693
+ Downstream,Model behavior policy,Interoperability of usage and model behavior policies,Is the way that the usage policy and the model behavior policy interoperate disclosed?,"For example, if a user attempts to use the model for a prohibited use such as spam, how does the model behavior policy apply if at all? We will also award this point if the developer reports that it does not impose any restrictions on its model's behavior in the event of usage policy violation.",I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models,"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!",https://arxiv.org/abs/2306.03423,https://arxiv.org/abs/2310.03693
+ Downstream,User Interface,User interaction with AI system,"For distribution channels with user-facing interfaces, are users notified (i) that they are interacting with an AI system, (ii) of the specific foundation model they are interacting with, and (iii) that outputs are machine-generated?","A user-facing interface refers to the means by which the user interacts with the foundation model, including how the user can observe outputs from the foundation model and other notifications. We will award this point if, for all distribution channels with user-facing interfaces, the user is provided adequate transparency as to the foundation model being distributed and the potential presence of any model outputs.",Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges,Towards Responsible AI: A Design Space Exploration of Human-Centered Artificial Intelligence User Interfaces to Investigate Fairness,https://dl.acm.org/doi/10.1145/3544548.3581278,https://arxiv.org/abs/2206.00474
+ Downstream,User Interface,Usage disclaimers,"For distribution channels with user-facing interfaces, are users provided with disclaimers involving model use?","A user-facing interface refers to the means by which the user interacts with the foundation model, including how the user can observe outputs from the foundation model and other notifications. Usage disclaimers could include information about what constitutes a usage policy violation or how users should interpret model outputs. We will award this point if, for all distribution channels with user-facing interfaces, the user is provided with usage disclaimers.",Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges,Towards Responsible AI: A Design Space Exploration of Human-Centered Artificial Intelligence User Interfaces to Investigate Fairness,https://dl.acm.org/doi/10.1145/3544548.3581278,https://arxiv.org/abs/2206.00474
+ Downstream,User data protection,User data protection policy,"Are the protocols for how the developer stores, accesses, and shares user data disclosed?",We will also award this point if the developer reports that it has no user data protection policy.,Privacy as Contextual Integrity,Redesigning Data Privacy: Reimagining Notice & Consent for human technology interaction,https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10/,https://www.weforum.org/reports/redesigning-data-privacy-reimagining-notice-consent-for-humantechnology-interaction/
+ Downstream,User data protection,Permitted and prohibited use of user data,Are permitted and prohibited uses of user data disclosed?,"Developers use user data for a range of purposes such as building future models, updating existing models, and evaluating both existing and future models. We will award this point if a developer discloses its policy on the use of user data from interactions associated with this model, including both permitted and prohibited uses. This may span different distribution channels if multiple channels supply user data to the developer. Alternatively, we will award this point if the developer reports it does not impose any limits on its use of user data.",Privacy as Contextual Integrity,Redesigning Data Privacy: Reimagining Notice & Consent for human technology interaction,https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10/,https://www.weforum.org/reports/redesigning-data-privacy-reimagining-notice-consent-for-humantechnology-interaction/
+ Downstream,User data protection,Usage data access protocol,Is a protocol for granting external entities access to usage data disclosed?,"Usage data refers to the data created through user interaction with the model, such as user inputs to the model and associated metadata such as the duration of the interaction. A usage data access protocol refers to the steps, requirements, and considerations involved in granting external entities access to usage data; this goes beyond stating the conditions under which related personal information may be shared with external entities. We will award this point for a clear description of the usage data access protocol or if the developer reports it does not share usage data with external entities.",How Cambridge Analytica Sparked the Great Privacy Awakening,Redesigning Data Privacy: Reimagining Notice & Consent for human technology interaction,https://www.wired.com/story/cambridge-analytica-facebook-privacy-awakening/,https://www.weforum.org/reports/redesigning-data-privacy-reimagining-notice-consent-for-humantechnology-interaction/
+ Downstream,Model Updates,Versioning protocol,Is there a disclosed version and versioning protocol for the model?,"By versioning, we mean that each instance of the model is uniquely identified and that the model is guaranteed to not change when referring to a fixed version number; alternatively, the version clearly indicating a specific instance of the model may be able to change by noting that it is the ""latest"" or an ""unstable"" version. We recognize that different developers may adopt different versioning practices that may differ from standard semantic versioning practices used elsewhere in software engineering.",How is ChatGPT's behavior changing over time?,Putting the Semantics into Semantic Versioning,https://arxiv.org/abs/2307.09009,https://arxiv.org/abs/2008.07069
+ Downstream,Model Updates,Change log,Is there a disclosed change log for the model?,"By change log, we mean a description associated with each change to the model (which should be indicated by a change in version number). We recognize that different developers may adopt different practices for change logs that may differ from practices used elsewhere in software engineering. We will award this point if the change log provides a clear description of changes that is legible to a technical audience.",How is ChatGPT's behavior changing over time?,Watch out for This Commit! A Study of Influential Software Changes,https://arxiv.org/abs/2307.09009,https://arxiv.org/abs/1606.03266
+ Downstream,Model Updates,Deprecation policy,Is there a disclosed deprecation policy for the developer?,"By deprecation policy, we refer to a description of what it means for a model to be deprecated and how users should respond to the deprecation (e.g. instructions to migrate to a newer version). We will award this point for a clear disclosure of a deprecation policy or if there is no risk of deprecation (e.g. if the developer openly releases model weights).",How is ChatGPT's behavior changing over time?,Automatic Android Deprecated-API Usage Update by Learning from Single Updated Example,https://arxiv.org/abs/2307.09009,https://arxiv.org/abs/2005.13220
+ Downstream,Feedback,Feedback mechanism,Is a feedback mechanism disclosed?,"By feedback mechanism, we refer to a means for external entities to report feedback or issues that arise in relation to the foundation model. Such entities may include but are not necessarily limited to users. We will award this point if the developer discloses a feedback mechanism that has been implemented.",Ecosystem Graphs: The Social Footprint of Foundation Models,Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985,https://dl.acm.org/doi/10.1145/3514094.3534181
+ Downstream,Feedback,Feedback summary,"Is a report or summary disclosed regarding the feedback the developer received or, alternatively, the way the developer responded to that feedback?","We recognize that there does not exist an authoritative or consensus standard for what is required in a feedback report. For this reason, we will award this point if there is a meaningful, though potentially vague or incomplete, summary of feedback received.",Achieving Transparency Report Privacy in Linear Time,Evaluating a Methodology for Increasing AI Transparency: A Case Study,https://arxiv.org/abs/2104.00137,https://arxiv.org/abs/2201.13224
+ Downstream,Feedback,Government inquiries,Is a summary of government inquiries related to the model received by the developer disclosed?,"Such government inquiries might include requests for user data, requests that certain content be banned, or requests for information about a developer’s business practices. We recognize that there does not exist an authoritative or consensus standard for what is required for such a summary of government inquiries. For this reason, we will award this point if (i) there is a meaningful, though potentially vague or incomplete, summary of government inquiries, or (ii) a summary of government inquiries related to user data.",Transparency Report: Government requests on the rise,Ecosystem Graphs: The Social Footprint of Foundation Models,https://blog.google/technology/safety-security/transparency-report-government-requests/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Monitoring mechanism,"For each distribution channel, is a monitoring mechanism for tracking model use disclosed?","By monitoring mechanism, we refer to a specific protocol for tracking model use that goes beyond an acknowledgement that usage data is collected. We will also award this point for a reasonable best-effort attempt to describe monitoring mechanisms, or if a developer discloses that a distribution channel is not monitored.",Progressive Disclosure: Designing for Effective Transparency,Ecosystem Graphs: The Social Footprint of Foundation Models,https://arxiv.org/abs/1811.02164,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Downstream applications,"Across all forms of downstream use, is the number of applications dependent on the foundation model disclosed?","We recognize that there does not exist an authoritative or consensus standard for what qualifies as an application. We will award this point if there is a meaningful estimate of the number of downstream applications, along with some description of what it means for an application to be dependent on the model.",Market concentration implications of foundation models: The Invisible Hand of ChatGPT,Ecosystem Graphs: The Social Footprint of Foundation Models,https://www.brookings.edu/articles/market-concentration-implications-of-foundation-models-the-invisible-hand-of-chatgpt/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Affected market sectors,"Across all downstream applications, is the fraction of applications corresponding to each market sector disclosed?","By market sector, we refer to an identifiable part of the economy. While established standards exist for describing market sectors, we recognize that developers may provide vague or informal characterizations of market impact. We will award this point if there is a meaningful, though potentially vague or incomplete, summary of affected market sectors.",Market concentration implications of foundation models: The Invisible Hand of ChatGPT,Ecosystem Graphs: The Social Footprint of Foundation Models,https://www.brookings.edu/articles/market-concentration-implications-of-foundation-models-the-invisible-hand-of-chatgpt/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Affected individuals,"Across all forms of downstream use, is the number of individuals affected by the foundation model disclosed?","By affected individuals, we principally mean the number of potential users of applications. We recognize that there does not exist an authoritative or consensus standard for what qualifies as an affected individual. We will award this point if there is a meaningful estimate of the number of affected individuals along with a clear description of what it means for an individual to be affected by the model.",Market concentration implications of foundation models: The Invisible Hand of ChatGPT,Ecosystem Graphs: The Social Footprint of Foundation Models,https://www.brookings.edu/articles/market-concentration-implications-of-foundation-models-the-invisible-hand-of-chatgpt/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Usage reports,Is a usage report that gives usage statistics describing the impact of the model on users disclosed?,"We recognize that there does not exist an authoritative or consensus standard for what is required in a usage report. Usage statistics might include, for example, a description of the major categories of harm that has been caused by use of the model. We will award this point if there is a meaningful, though potentially vague or incomplete, summary of usage statistics.",Expert explainer: Allocating accountability in AI supply chains,Ecosystem Graphs: The Social Footprint of Foundation Models,https://www.adalovelaceinstitute.org/resource/ai-supply-chains/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Geographic statistics,"Across all forms of downstream use, are statistics of model usage across geographies disclosed?","We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.",Expert explainer: Allocating accountability in AI supply chains,Ecosystem Graphs: The Social Footprint of Foundation Models,https://www.adalovelaceinstitute.org/resource/ai-supply-chains/,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Impact,Redress mechanism,Is any mechanism to provide redress to users for harm disclosed?,We will also award this point if the developer reports it does not have any such redress mechanism.,Computational Power and AI,Ecosystem Graphs: The Social Footprint of Foundation Models,https://ainowinstitute.org/publication/policy/compute-and-ai,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985
+ Downstream,Documentation for Deployers,Centralized documentation for downstream use,Is documentation for downstream use centralized in a centralized artifact?,"Centralized documentation for downstream use refers to an artifact, or closely-linked artifacts, that consolidate relevant information for making use of or repurposing the model. Examples of these kinds of artifacts include a website with dedicated documentation information, a github repository with dedicated documentation information, and an ecosystem card. We recognize that different developers may take different approaches to centralizing information. We will award this point if there is a clearly-identified artifact(s) that contains the majority of substantive information (e.g. capabilities, limitations, risks, evaluations, distribution channels, model license, usage policies, model behavior policies, feedback and redress mechanisms, dependencies).",Datasheets for Datasets,Model Cards for Model Reporting,https://arxiv.org/abs/1803.09010,https://arxiv.org/abs/1810.03993
+ Downstream,Documentation for Deployers,Documentation for responsible downstream use,Is documentation for responsible downstream use disclosed?,"Such documentation might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.",Ecosystem Graphs: The Social Footprint of Foundation Models,Expert explainer: Allocating accountability in AI supply chains,https://www.semanticscholar.org/paper/Ecosystem-Graphs%3A-The-Social-Footprint-of-Models-Bommasani-Soylu/8ed7c9ba7cdb33e816135381ca502ace649c7985,https://www.adalovelaceinstitute.org/resource/ai-supply-chains/
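For readers who want to work with the committed file rather than read it through the diff, a minimal, illustrative sketch (not part of this commit) of loading fmti-indicators.csv with pandas and slicing it by domain and subdomain is shown below. The column names used here are taken from the rows above, and pandas is an assumed dependency of the reader's environment, not something this commit adds.

```python
# Illustrative only: inspect the indicators shipped in fmti-indicators.csv.
# Assumes the header row defines columns named Domain, Subdomain, Indicator,
# Definition, Notes, Reference_1, Reference_2, Link_1, Link_2.
import pandas as pd

df = pd.read_csv("fmti-indicators.csv")

# One row per indicator: count how indicators are distributed across domains/subdomains.
print(df.groupby(["Domain", "Subdomain"])["Indicator"].count())

# Example slice: all Downstream "Usage policy" indicators with their definitions.
usage_policy = df[(df["Domain"] == "Downstream") & (df["Subdomain"] == "Usage policy")]
print(usage_policy[["Indicator", "Definition"]].to_string(index=False))
```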
fmti_indicators.csv DELETED
@@ -1,689 +0,0 @@
- Index,Category,Subcategory,Characteristic,Definition,Notes,References
- 1,Upstream,Data,Data size,"For the data used in building the model, is the data size disclosed?","Data size should be reported in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). No form of decomposition into data phases is required.","Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science, Datasheets for Datasets"
- 2,Upstream,Data,Data sources,"For all data used in building the model, are the data sources disclosed?","To receive this point, a meaningful decomposition of sources must be listed in an understandable way (e.g. named URLs/domains/databases/data providers). It does not suffice to say data is “sourced from the Internet"" or comes from ""licensed sources”.","Datasheets for Datasets, Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure"
- 3,Upstream,Data,Data creators,"For all data used in building the model, is there some characterization of the people who created the data?","While information about data creators may not be easily discernible for some data scraped from the web, the general sources (URLs/domains) should be listed, and, for other data that is bought, licensed, or collected, a reasonable attempt at characterizing the underlying people who provided the data is required to receive this point. The relevant properties of people can vary depending on context: for example, relevant properties could include demographic information like fraction of Black individuals contributing to the dataset, geographic information like fraction of European individuals contributing to the dataset, language information like fraction of L1 English speakers, or occupational information like the fraction of professional artists.","Datasheets for Datasets, Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure"
- 4,Upstream,Data,Data source selection,Are the selection protocols for including and excluding data sources disclosed?,"Selection protocols refer to procedures used to choose which datasets or subsets of datasets will be used to build a model. We will award this point even if the selection protocols are non-exhaustive.","Datasheets for Datasets, Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure"
- 5,Upstream,Data,Data curation,"For all data sources, are the curation protocols for those data sources disclosed?","Curation protocols refer to steps taken to further modify data sources, such as procedures to manage, annotate, and organize data. The aims of curation might include improving the quality, relevance, and representativeness of the data. We will award this point if the developer reports that it does not perform any further curation beyond the data sources.","Datasheets for Datasets, Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure"
- 6,Upstream,Data,Data augmentation,Are any steps the developer takes to augment its data sources disclosed?,"Such steps might include augmenting data sources with synthetic data. We will award this point if the developer reports that it does not take any steps to augment its data.","Datasheets for Datasets, Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure"
- 7,Upstream,Data,Harmful data filtration,"If data is filtered to remove harmful content, is there a description of the associated filter?","Such harmful content might relate to violence or child sexual abuse material. We will award this point if the developer reports that it does not perform any harmful data filtration.","Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, Toxicity"
- 8,Upstream,Data,Copyrighted data,"For all data used in building the model, is the associated copyright status disclosed?","To receive this point, the copyright status (e.g. copyrighted, public domain) must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point or if the disclosure is not comprehensive relative to legal copyright standards.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, Machine Learning and Artificial Intelligence: Legal Concepts"
- 9,Upstream,Data,Data license,"For all data used in building the model, is the associated license status disclosed?","To receive this point, the license status must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, Machine Learning and Artificial Intelligence: Legal Concepts"
- 10,Upstream,Data,Personal information in data,"For all data used in building the model, is the inclusion or exclusion of personal information in that data disclosed?","To receive this point, the disclosure of personal information must relate to some decomposition of the data. We will award this point if there is some meaningful decomposition of the data, even if the decomposition is insufficient to receive the Data Creators point. Additionally, we will award this point if the developer reports the inclusion of personal information, independent of if and how they mitigate related privacy concerns.","Data Capitalism: Redefining the Logics of Surveillance and Privacy, What Does it Mean for a Language Model to Preserve Privacy?"
- 11,Upstream,Data labor,Use of human labor,Are the phases of the data pipeline where human labor is involved disclosed?,"Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the use of human labor in their data pipeline.","The future of crowd work, AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere"
- 12,Upstream,Data labor,Employment of data laborers,"Is the organization that directly employs the people involved in data labor disclosed for each phase of the data pipeline?","Phases of the data pipeline that involve human labor include activities and tasks performed by people to collect, annotate, clean, or validate data. This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer provides the name of the organization that employs data laborers, even if other details about the employment relationship are not disclosed.","The future of crowd work, AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere"
- 13,Upstream,Data labor,Geographic distribution of data laborers,"Is geographic information regarding the people involved in data labor disclosed for each phase of the data pipeline?","This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer gives a reasonable best-effort description of the geographic distribution of labor at the country-level.","Cleaning Up ChatGPT Takes Heavy Toll on Human Workers, Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass"
- 14,Upstream,Data labor,Wages,Are the wages for people who perform data labor disclosed?,"This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model. We will award this point if the developer reports that it does not compensate workers. For all data that is created by or on behalf of the developer,","The future of crowd work, AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging — and not going anywhere"
- 15,Upstream,Data labor,Instructions for creating data,Are the instructions given to people who perform data labor disclosed?,"This indicator is inclusive of all data that is created by or on behalf of the developer. We will award this point if the developer makes a reasonable best-effort attempt to disclose instructions given to people who create data used to build the model for the bulk of the data phases involving human labor.","Everyone wants to do the model work, not the data work, The future of crowd work"
- 16,Upstream,Data labor,Labor protections,Are the labor protections for people who perform data labor disclosed?,"This indicator is inclusive of data labor at all points of the model development process, such as training data annotation or red teaming data used to control the model. It is also inclusive of all data that is created by or on behalf of the developer. As an example, labor protections might include protocols to reduce the harm to workers’ mental health stemming from exposure to violent content when annotating training data. We will award this point if the developer reports that it does not protect workers or if it does not use data laborers and therefore has no labor protections.","The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence, Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass"
- 17,Upstream,Data labor,Third party partners,"Are the third parties who were or are involved in the development of the model disclosed?","This indicator is inclusive of partnerships that go beyond data labor as there may be third party partners at various stages in the model development process. We will award this point if the developer reports that it was the sole entity involved in the development of the model.","The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence, Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass"
- 18,Upstream,Data access,Queryable external data access,"Are external entities provided with queryable access to the data used to build the model?","We will award this point for any reasonable mechanism for providing access: direct access to the data, an interface to query the data, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on which external entities are given access, insofar as these limits and restrictions are transparent and ensure a reasonable amount of external access. We may accept justifications for prohibiting queries of specific parts of the data.","Datasheets for Datasets, The ROOTS Search Tool: Data Transparency for LLMs"
- 19,Upstream,Data access,Direct external data access,"Are external entities provided with direct access to the data used to build the model?","We will award this point if external entities can directly access the data without any form of gating from the developer. With that said, we may award this point if the developer provides justifications for prohibiting access to specific parts of the data or to unauthorized external entities.","Datasheets for Datasets, The ROOTS Search Tool: Data Transparency for LLMs"
- 20,Upstream,Compute,Compute usage,Is the compute required for building the model disclosed?,"Compute should be reported in appropriate units, which most often will be floating point operations (FLOPS). Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPS). We will award this point even if there is no decomposition of the reported compute usage into compute phases, but it should be clear whether the reported compute usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate compute expenditure.","Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning, Energy and Policy Considerations for Deep Learning in NLP"
- 21,Upstream,Compute,Development duration,Is the amount of time required to build the model disclosed?,"The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, training and subsequent evaluation and red teaming).","Compute Trends Across Three Eras of Machine Learning, Training Compute-Optimal Large Language Models"
- 22,Upstream,Compute,Compute hardware,"For the primary hardware used to build the model, is the amount and type of hardware disclosed?","In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.","Compute Trends Across Three Eras of Machine Learning, Training Compute-Optimal Large Language Models"
- 23,Upstream,Compute,Hardware owner,"For the primary hardware used in building the model, is the owner of the hardware disclosed?","For example, the hardware owner may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that hardware is owned by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.","Compute Trends Across Three Eras of Machine Learning, Training Compute-Optimal Large Language Models"
- 24,Upstream,Compute,Energy usage,Is the amount of energy expended in building the model disclosed?,"Energy usage should be reported in appropriate units, which most often will be megawatt-hours (mWh). Energy usage should be reported to a precision of one significant figure (e.g. 500 mWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage.","Quantifying the Carbon Emissions of Machine Learning, Carbon Emissions and Large Neural Network Training"
- 25,Upstream,Compute,Carbon emissions,"Is the amount of carbon emitted (associated with the energy used) in building the model disclosed?","Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2). Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions.","Quantifying the Carbon Emissions of Machine Learning, Carbon Emissions and Large Neural Network Training"
- 26,Upstream,Compute,Broader environmental impact,"Are any broader environmental impacts from building the model besides carbon emissions disclosed?","While the most direct environmental impact of building a foundation model is the energy used and, therefore, the potential carbon emissions, there may be other environmental impacts. For example, these may include the use of other resources such as water for cooling data centers or metals for producing specialized hardware. We recognize that there does not exist an authoritative or consensus list of broader environmental factors. For this reason, we will award this point if there is a meaningful, though potentially incomplete, discussion of broader environmental impact.","Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning, Energy and Policy Considerations for Deep Learning in NLP"
- 27,Upstream,Methods,Model stages,Are all stages in the model development process disclosed?,"Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.","Model Cards for Model Reporting, Scaling Instruction-Finetuned Language Models"
- 28,Upstream,Methods,Model objectives,"For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?","We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).","Model Cards for Model Reporting, Scaling Instruction-Finetuned Language Models"
- 29,Upstream,Methods,Core frameworks,Are the core frameworks used for model development disclosed?,"Examples of core frameworks include Tensorflow, PyTorch, Jax, Hugging Face Transformers, Seqio, T5X, Keras, SciKit, and Triton. If there are significant internal frameworks, there should be some description of their function and/or a reasonably similar publicly-available analogue. We recognize that there does not exist an authoritative or consensus list of core frameworks. For this reason, we will award this point if there is a meaningful, though potentially incomplete, list of major frameworks for the first version of the index.","Model Cards for Model Reporting, Scaling Instruction-Finetuned Language Models"
- 30,Upstream,Methods,Additional dependencies,"Are any dependencies required to build the model disclosed besides data, compute, and code?","For example, if the model depends on an external search engine, programmable APIs, or tools, this should be disclosed. We recognize that there is not widespread consensus regarding what constitutes key dependencies beyond the data, compute, and code. We will award this point only if developers give a reasonable best-effort description of any additional dependencies or make clear that no additional dependencies are required.","Analyzing Leakage of Personally Identifiable Information in Language Models, ProPILE: Probing Privacy Leakage in Large Language Models"
- 31,Upstream,Data Mitigations,Mitigations for privacy,"Are any steps the developer takes to mitigate the presence of PII in the data disclosed?","Such steps might include identifying personal information in the training data, filtering specific datasets to remove personal information, and reducing the likelihood that models will output personal information. We will award this point if the developer reports that it does not take steps to mitigate the presence of PII in the data.","Deduplicating Training Data Mitigates Privacy Risks in Language Models, Machine Learning and Artificial Intelligence: Legal Concepts"
- 32,Upstream,Data Mitigations,Mitigations for copyright,"Are any steps the developer takes to mitigate the presence of copyrighted information in the data disclosed?","Such steps might include identifying copyrighted data, filtering specific datasets to remove copyrighted data, and reducing the likelihood that models will output copyrighted information. We will award this point if the developer reports that it does take steps to mitigate the presence of copyrighted information in the data.","Addressing ""Documentation Debt"" in Machine Learning Research: A Retrospective Datasheet for BookCorpus, Machine Learning and Artificial Intelligence: Legal Concepts"
- 33,Model,Model basics,Input modality,Are the input modalities for the model disclosed?,"Input modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs.","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 34,Model,Model basics,Output modality,Are the output modalities for the model disclosed?,"Output modalities refer to the types or formats of information that the model can accept as output. Examples of output modalities include text, image, audio, video, tables, graphs.","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 35,Model,Model basics,Model components,Are all components of the model disclosed?,"Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module.","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 36,Model,Model basics,Model size,"For all components of the model, is the associated model size disclosed?","This information should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder).","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 37,Model,Model basics,Model architecture,Is the model architecture disclosed?,"Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently. We will award this point for any clear, though potentially incomplete, description of the model architecture.","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 38,Model,Model basics,Centralized model documentation,"Is key information about the model included in a centralized artifact such as a model card?","We recognize that different developers may share this information through different types of documentation, such as a system card or several clearly interrelated documents. We will award this point for the disclosure of any such centralized artifact that provides key information typically included in a model card, though the artifact may be longer-form than a standard model card (e.g. a technical report).","Model Cards for Model Reporting, Interactive Model Cards: A Human-Centered Approach to Model Documentation"
- 39,Model,Model access,External model access protocol,Is a protocol for granting external entities access to the model disclosed?,"A model access protocol refers to the steps, requirements, and considerations involved in granting authorized model access to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) a transparent decision on whether access has been granted within a specified, reasonable period of time.","The Gradient of Generative AI Release: Methods and Considerations, Structured access: an emerging paradigm for safe AI deployment"
- 40,Model,Model access,Blackbox external model access,Is black box model access provided to external entities?,"Black box model access refers to the ability to query the model with inputs and receive outputs, potentially without further access. Examples of external entities that might be granted access include researchers, third-party auditors, and regulators. We will award this point for any reasonable access level: direct access to the model weights, an interface to query the model, a developer-mediated access program where developers can inspect requests, etc. Developers may receive this point even if there are rate-limits on the number of queries permitted to an external entity and restrictions on the external entities that are permitted access, insofar as these limits and restrictions are transparent.","The Gradient of Generative AI Release: Methods and Considerations, Structured access: an emerging paradigm for safe AI deployment"
- 41,Model,Model access,Full external model access,Is full model access provided to external entities?,"Full model access refers to the ability to access the model via the release of model weights. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via some high-level description of who has been granted access to the foundation model).","The Gradient of Generative AI Release: Methods and Considerations, Structured access: an emerging paradigm for safe AI deployment"
- 42,Model,Capabilities,Capabilities description,Are the model’s capabilities described?,"Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for any clear, but potentially incomplete, description of the multiple capabilities.","Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, Holistic Evaluation of Language Models"
- 43,Model,Capabilities,Capabilities demonstration,Are the model’s capabilities demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the model’s capabilities that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for clear demonstrations of multiple capabilities.","Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, Holistic Evaluation of Language Models"
- 44,Model,Capabilities,Evaluation of capabilities,"Are the model’s capabilities rigorously evaluated, with the results of these evaluations reported prior to or concurrent with the initial release of the model?","Rigorous evaluations refer to precise quantifications of the model’s behavior in relation to its capabilities. We recognize that capabilities may not perfectly align with evaluations, and that different developers may associate capabilities with evaluations differently. We will award this point for clear evaluations of multiple capabilities. For example, this may include evaluations of world knowledge, reasoning, state tracking or other such proficiencies. Or it may include the measurement of average performance (e.g. accuracy, F1) on benchmarks for specific tasks (e.g. text summarization, image captioning). We note that evaluations on standard broad-coverage benchmarks are likely to suffice for this indicator, though they may not if the model’s capabilities are presented as especially unusual such that standard evaluations will not suffice.","Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, Holistic Evaluation of Language Models"
- 45,Model,Capabilities,External reproducibility of capabilities evaluation,Are the evaluations of the model’s capabilities reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated data is either (i) publicly available or (ii) described sufficiently such that a reasonable facsimile can be constructed by an external entity. In addition, the evaluation protocol should be sufficiently described such that if the evaluation is reproduced, any discrepancies with the developer’s results can be resolved. We recognize that there does not exist an authoritative or consensus standard for what is required for an evaluation to be deemed externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently reproducible for the purposes of this index. We will award this point for reproducibility of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a justification by the model developer for why it is not possible for the evaluation to be made reproducible may be sufficient to score this point.","Leakage and the reproducibility crisis in machine-learning-based science, Holistic Evaluation of Language Models"
- 46,Model,Capabilities,Third party capabilities evaluation,Are the model’s capabilities evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the developer. We will award this point if (i) a third party has conducted an evaluation of model capabilities, (ii) the results of this evaluation are publicly available, and (iii) these results are disclosed or referred to in the developer’s materials.","Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance, Holistic Evaluation of Language Models"
- 47,Model,Limitations,Limitations description,Are the model’s limitations disclosed?,"Limitations refer to the specific and distinctive functions that the model cannot perform (e.g. the model cannot answer questions about current events as it only contains data up to a certain time cutoff, the model is not very capable when it comes to a specific application). We recognize that different developers may use different terminology for limitations, or conceptualize limitations differently. We will award this point for any clear, but potentially incomplete, description of multiple limitations.","The Fallacy of AI Functionality, Holistic Evaluation of Language Models"
- 48,Model,Limitations,Limitations demonstration,Are the model’s limitations demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the limitations that are legible or understandable for the general public, without requiring specific technical expertise. We recognize that different developers may use different terminology for limitations, or conceptualize the limitations differently. We will award this point for clear demonstrations of multiple limitations.","The Fallacy of AI Functionality, Holistic Evaluation of Language Models"
- 49,Model,Limitations,Third party evaluation of limitations,Can the model’s limitations be evaluated by third parties?,"By third parties, we mean entities that are significantly or fully independent of the
317
- model developers. In contrast to the third party evaluation indicators for capabilities and
318
- risks, we will award this point if third party evaluations are possible even if no third party
319
- has yet conducted them. Such evaluations are possible if, for example, the model is deployed
320
- via an API (or with open weights) and there are no restrictions on evaluating limitations
321
- (e.g. in the usage policy).","Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Gover-
322
- nance, Holistic Evaluation of Language Models"
323
- 50,Model,Risks,Risks description,Are the model’s risks disclosed?,"Risks refer to possible negative consequences or undesirable outcomes that can arise
324
- from the model’s deployment and usage. This indicator requires disclosure of risks that
325
- may arise in the event of both (i) intentional (though possibly careless) use, such as bias
326
- or hallucinations and (ii) malicious use, such as fraud or disinformation. We recognize
327
- that different developers may use different terminology for risks, or conceptualize risks
328
- differently. We will award this point for any clear, but potentially incomplete, description
329
- of multiple risks.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
330
- Ethical and social risks of harm from Language Models"
331
- 51,Model,Risks,Risks demonstration,Are the model’s risks demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the risks
332
- that are legible or understandable for the general public, without requiring specific technical
333
- expertise. This indicator requires demonstration of risks that may arise in the event of
334
- both (i) intentional (though possibly careless) use, such as biases or hallucinations and (ii)
335
- malicious use, such as fraud or disinformation. We recognize that different developers may
336
- use different terminology for risks, or conceptualize risks differently. We will award this
337
- point for clear demonstrations of multiple risks.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
338
- Ethical and social risks of harm from Language Models"
339
- 52,Model,Risks,Unintentional harm evaluation,"Are the model’s risks related to unintentional harm rigorously evaluated, with
- the results of these evaluations reported prior to or concurrent with the initial release of
- the model?","Rigorous evaluations refer to precise quantifications of the model’s behavior in
- relation to such risks. Unintentional harms include bias, toxicity, and issues relating to
- fairness. We recognize that unintended harms may not perfectly align with risk evaluations,
- and that different developers may associate risks with evaluations differently. We will award
- this point for clear evaluations of multiple such risks. We note that evaluations on standard
- broad-coverage benchmarks are likely to suffice for this indicator, though they may not
- if the model’s risks related to unintentional harm are presented as especially unusual or
- severe.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
- Ethical and social risks of harm from Language Models"
- 53,Model,Risks,External reproducibility of unintentional harm evaluation,"Are the evaluations of the model’s risks related to unintentional harm repro-
- ducible by external entities?","For an evaluation to be reproducible by an external entity, we mean that the associated
- data is either (i) publicly available or (ii) described sufficiently such that a reasonable
- facsimile can be constructed by the external entity. In addition, the evaluation protocol
- should be sufficiently described such that if the evaluation is reproduced, any discrepancies
- with the developer’s results can be resolved. We recognize that there does not exist an
- authoritative or consensus standard for what is required for an evaluation to be deemed
- externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently
- reproducible for the purposes of this index. We will award this point for reproducibility
- of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a
- justification by the developer for why it is not possible for the evaluation to be made
- reproducible may suffice.","Leakage and the reproducibility crisis in machine-learning-based science, Ethical
- and social risks of harm from Language Models"
- 54,Model,Risks,Intentional harm evaluation,"Are the model’s risks related to intentional harm rigorously evaluated, with the
- results of these evaluations reported prior to or concurrent with the initial release of the
- model?.","Rigorous evaluations refer to precise quantifications of the model’s behavior in
- relation to such risks. Intentional harms include fraud, disinformation, scams, cybersecurity
- attacks, designing weapons or pathogens, and uses of the model for illegal purposes. We
- recognize that unintentional harms may not perfectly align with risk evaluations, and that
- different developers may associate risks with evaluations differently. We will award this
- point for clear evaluations of multiple such risks. We note that evaluations on standard
- broad-coverage benchmarks are likely to suffice for this indicator, though they may not
- if the model’s risks related to unintentional harm are presented as especially unusual or
- severe.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
- Ethical and social risks of harm from Language Models"
- 55,Model,Risks,External reproducibility of intentional harm evaluation,"Are the evaluations of the model’s risks related to intentional harm reproducible
- by external entities?","For an evaluation to be reproducible by an external entity, we mean that the associated
- data is either (i) publicly available or (ii) described sufficiently such that a reasonable
- facsimile can be constructed by the external entity. In addition, the evaluation protocol
- should be sufficiently described such that if the evaluation is reproduced, any discrepancies
- with the developer’s results can be resolved. We recognize that there does not exist an
- authoritative or consensus standard for what is required for an evaluation to be deemed
- externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently
- reproducible for the purposes of this index. We will award this point for reproducibility
- of multiple disclosed evaluations. In the event that an evaluation is not reproducible, a
- justification by the model developer for why it is not possible for the evaluation to be made
- reproducible may suffice.","Leakage and the reproducibility crisis in machine-learning-based science, Ethical
- and social risks of harm from Language Models"
- 56,Model,Risks,Third party risks evaluation,Are the model’s risks evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the
- developer. A third party risk evaluation might involve the developer allowing a third party
- to choose a methodology for evaluating risks that differs from that of the developer. We
- will award this point if (i) a third party has conducted an evaluation of model risks, (ii)
- the results of this evaluation are publicly available, and (iii) these results are disclosed or
- referred to in the developer’s materials. If the results are not made public (but are disclosed
- to have been conducted) and/or the results are not discoverable in the developer’s materials,
- we will not award this point. We may accept a justification from either the third party or
- the developer for why part of the evaluation is not disclosed in relation to risks.","Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Gover-
- nance, Ethical and social risks of harm from Language Models"
- 57,Model,Model Mitigations,Mitigations description,Are the model mitigations disclosed?,"By model mitigations, we refer to interventions implemented by the developer at
- the level of the model to reduce the likelihood and/or the severity of the model’s risks.
- We recognize that different developers may use different terminology for mitigations, or
- conceptualize mitigations differently. We will award this point for any clear, but poten-
- tially incomplete, description of multiple mitigations associated with the model’s risks.
- Alternatively, we will award this point if the developer reports that it does not mitigate risk.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
- Ethical and social risks of harm from Language Models"
- 58,Model,Model Mitigations,Mitigations demonstration,Are the model mitigations demonstrated?,"Demonstrations refer to illustrative examples or other forms of showing the mitiga-
- tions that are legible or understandable for the general public, without requiring specific
- technical expertise. We recognize that different developers may use different terminology
- for mitigations, or conceptualize mitigations differently. We will award this point for clear
- demonstrations of multiple mitigations. We will also award this point if the developer
- reports that it does not mitigate the risks associated with the model.","Evaluating the Social Impact of Generative AI Systems in Systems and Society,
- Ethical and social risks of harm from Language Models"
- 59,Model,Model Mitigations,Mitigations evaluation,"Are the model mitigations rigorously evaluated, with the results of these evalu-
- ations reported?","Rigorous evaluations refer to precise quantifications of the model’s behavior in
- relation to the mitigations associated with its risks. We will award this point for clear
- evaluations of multiple mitigations.","Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation, Ethical
- and social risks of harm from Language Models"
- 60,Model,Model Mitigations,External reproducibility of mitigations evaluation,Are the model mitigation evaluations reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated
- data is either (i) publicly available or (ii) described sufficiently such that a reasonable
- facsimile can be constructed by the external entity. In addition, the evaluation protocol
- should be sufficiently described such that if the evaluation is reproduced, any discrepancies
- with the developer’s results can be resolved. In the case of mitigations evaluations, this
- will usually involve details about a comparison to some baseline, which may be a different,
- unmitigated version of the model. We recognize that there does not exist an authoritative
- or consensus standard for what is required for an evaluation to be deemed externally
- reproducible. We will award this point for reproducibility of multiple disclosed evaluations.
- In the event that an evaluation is not reproducible, a justification by the model developer
- for why it is not possible for the evaluation to be made reproducible may suffice.","Leakage and the reproducibility crisis in machine-learning-based science, Ethical
- and social risks of harm from Language Models"
- 61,Model,Model Mitigations,Third party mitigations evaluation,Can the model mitigations be evaluated by third parties?,"By third party, we mean entities that are significantly or fully independent of the
- model developers. This indicator assesses whether it is possible for third parties to assess
- mitigations, which is not restricted to the methods the developer uses to assess mitigations.
- In contrast to the third party evaluation indicators for capabilities and risks, we will award
- this point if third party evaluations are possible even if no third party has yet conducted
- them.","Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Gover-
- nance, Ethical and social risks of harm from Language Models"
- 62,Model,Trustworthiness,Trustworthiness evaluation,"Is the trustworthiness of the model rigorously evaluated, with the results of
- these evaluations disclosed?","Rigorous evaluations refer to precise quantifications of the model’s behavior in
- relation to its trustworthiness. For example, this may include evaluations of the model’s
- robustness or reliability, its uncertainty, calibration, or causality, or its interpretability or
- explainability. We recognize that trustworthiness may not perfectly align with evaluations,
- and that different developers may associate trustworthiness with evaluations differently.
- We will award this point for a clear evaluation of the trustworthiness of the model.","Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable
- Claims, DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models"
- 63,Model,Trustworthiness,External reproducibility of trustworthiness evaluation,Are the trustworthiness evaluations reproducible by external entities?,"For an evaluation to be reproducible by an external entity, we mean that the associated
- data is either (i) publicly available or (ii) described sufficiently such that a reasonable
- facsimile can be constructed by the external entity. In addition, the evaluation protocol
- should be sufficiently described such that if the evaluation is reproduced, any discrepancies
- with the developer’s results can be resolved. We recognize that there does not exist an
- authoritative or consensus standard for what is required for an evaluation to be deemed
- externally reproducible. Evaluations on standard benchmarks are assumed to be sufficiently
- reproducible for the purposes of this index. We will award this point for reproducibility of
- at least one evaluation. In the event that an evaluation is not reproducible, we may accept a
- justification by the model developer for why it is not possible for the evaluation to be made
- reproducible.","Leakage and the reproducibility crisis in machine-learning-based science, Bridg-
- ing the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy
- Human-centered AI Systems"
- 64,Model,Inference,Inference duration evaluation,"Is the time required for model inference disclosed for a clearly-specified task on
- a clearly-specified set of hardware?","The duration should be reported in seconds to a precision of one significant figure
- (e.g. 0.002 seconds). We recognize that no established standard exists for the standardized
- reporting of inference evaluation. Therefore, we permit the developer to specify the task
- and hardware setup, as long as both are disclosed. The hardware in this evaluation need
- not be the hardware the developer uses for inference if it in fact does any inference itself.
- For example, the specific task might be generating 100,000 tokens as 5,000 sequences of
- length 20 and the fixed set of hardware might be 8 NVIDIA A100s. The hardware in this
- evaluation need not be the hardware the developer uses for inference if it in fact does any
- inference itself.","MLPerf Inference Benchmark, Cheaply Evaluating Inference Efficiency Metrics
- for Autoregressive Transformer APIs"
- 65,Model,Inference,Inference compute evaluation,"Is the compute usage for model inference disclosed for a clearly-specified task
- on a clearly-specified set of hardware?","Compute usage for inference should be reported in FLOPS to a precision of one
- significant figure (e.g. 5 x 1025FLOPS). We recognize that no established standard exists for
- the standardized reporting of inference evaluation. Therefore, we permit the developer to
- specify the task and hardware setup, as long as both are clear. For example, the specific task
- might be generating 100k tokens as 5k sequences of length 20 and the fixed set of hardware
- might be 8 NVIDIA A100s. The hardware in this evaluation need not be the hardware the
- developer uses for inference if it in fact does any inference itself.","MLPerf Inference Benchmark, Cheaply Evaluating Inference Efficiency Metrics
- for Autoregressive Transformer APIs"
- 66,Downstream,Distribution,Release decision-making,"Is the developer’s protocol for deciding whether or not to release a model
- disclosed?","We recognize that the release of a foundation model falls along a spectrum, with many
- forms of partial release, and that different developers may conceptualize release differently.
- We will award this point for any clear protocol that discusses the decision-making process,
- including if the protocol is more general to the developer rather than the specific foundation
- model under consideration.","The Gradient of Generative AI Release: Methods and Considerations, The Time
- Is Now to Develop Community Norms for the Release of Foundation Models"
- 67,Downstream,Distribution,Release process,Is a description of the process of how the model was released disclosed?,"A description of the release process might include information about who received
- access to the model at what stage of the release of the model. For example, a developer
- might conduct a staged release where it releases the model to a select group at first and
- subsequently makes the model more widely available. We recognize that the release of a
- foundation model falls along a spectrum, with many different forms of release, and that
- different developers may conceptualize release differently. We will award this point for any
- detailed discussion of the release process, including if the discussion is more general to the
- developer rather than the specific foundation model under consideration.","The Gradient of Generative AI Release: Methods and Considerations, The Time
- Is Now to Develop Community Norms for the Release of Foundation Models"
- 68,Downstream,Distribution,Distribution channels,Are all distribution channels disclosed?,"By distribution channel, we mean any pathway by which the model is made accessible
- to entities beyond the developer. We recognize that distribution channels may arise without
- the knowledge of the model developer. For example, the weights of a model may be released
- through one distribution channel and then be distributed through other channels. We will
- award this point if the developer discloses all of the distribution channels of which it is
- aware.","Understanding accountability in algorithmic supply chains, Thinking Upstream:
- Ethics and Policy Opportunities in AI Supply Chains"
- 69,Downstream,Distribution,Products and services,"Does the developer disclose whether any products and services offered by the
- developer are dependent on the model?","We recognize that a developer may provide many products and services that depend
- on a foundation model or internal derivatives of the model. We will award this point for
- a reasonable best-effort description of any ways the developer makes internal use of the
- model in its products or services.","Understanding accountability in algorithmic supply chains, On AI Deployment:
- AI supply chains (and why they matter)"
- 70,Downstream,Distribution,Detection of machine-generated content,Are any mechanisms for detecting content generated by this model disclosed?,"Such a mechanism might include storing a copy of all outputs generated by the
- model to compare against, implementing a watermark when generating content using the
- model, or training a detector post-hoc to identify such content. We will award this point if
- any such mechanism is disclosed or if the developer reports that it has no such mechanism.","A Watermark for Large Language Models, Robust Distortion-free Watermarks
- for Language Models"
- 71,Downstream,Distribution,Model License,Is a license for the model disclosed?,"In the event that licenses are written more generally, it should be clear which assets
- they apply to. We recognize that different developers may adopt different business models
- and therefor have different types of model licenses. Examples of model licenses include
- responsible AI licenses, open-source licenses, and licenses that allow for commercial use.","Stronger Together: on the Articulation of Ethical Charters, Legal Tools, and
- Technical Documentation in ML, An investigation of licensing of datasets for machine
- learning based on the GQM model"
- 72,Downstream,Distribution,Terms of service,Are terms of service disclosed for each distribution channel?,"We will award this point if there are terms-of-service that appear to apply to the
- bulk of the model’s distribution channels.","Terms-we-Serve-with: a feminist-inspired social imaginary for improved trans-
- parency and engagement in AI, Identifying Terms and Conditions Important to Consumers
- using Crowdsourcing"
- 73,Downstream,Usage policy,Permitted and prohibited users,Is a description of who can and cannot use the model disclosed?,"Such restrictions may relate to countries (e.g. US-only), organizations (e.g. no competi-
- tors), industries (e.g. no weapons industry users) or other relevant factors. These restrictions
- on users are often contained in multiple policies; we group them here for simplicity. We
- will awarded this point for a clear description of permitted, restricted, and prohibited users
- of the model.","Best Practices for Deploying Language Models, Meta Platform Terms"
- 75,Downstream,Usage policy,Usage policy enforcement,Is the enforcement protocol for the usage policy disclosed?,"By enforcement protocol, we refer to (i) mechanisms for identifying permitted and
- prohibited users, (ii) mechanisms for identifying permitted/restricted/prohibited uses, (iii)
- steps the developer takes to enforce its policies related to such uses, and (iv) the developer’s
- procedures for carrying out these steps. We will award this point for a reasonable best-effort
- attempt to provide the bulk of this information, though one line indicating the developer
- reserves the right to terminate accounts is insufficient. Alternatively, we will award this
- point if the developer reports that it does not enforce its usage policy.","Best Practices for Deploying Language Models, Meta Platform Terms"
- 76,Downstream,Usage policy,Justification for enforcement action,"Do users receive a justification when they are subject to an enforcement action
- for violating the usage policy?","For example, does the developer disclose a protocol for telling users which part
- of the usage policy they violated, when they did so, and what specifically was violative?
- Enforcement actions refer to measures to limit a user’s ability to use the model, such as
- banning a user or restricting their ability to purchase tokens. We will award this point if
- the developer discloses that it gives justification for enforcement actions or, alternatively, if
- it discloses that it does not provide justification for enforcement actions or that it does not
- enforce its usage policy.","Best Practices for Deploying Language Models, Meta Platform Terms"
- 77,Downstream,Usage policy,Usage policy violation appeals mechanism,Is a mechanism for appealing potential usage policy violations disclosed?,"We will award this point if the developer provides a usage policy violation appeals
- mechanism, regardless of whether it is provided via a user interface or distribution channel.","Best Practices for Deploying Language Models, Meta Platform Terms"
- 79,Downstream,Model behavior policy,Model behavior policy enforcement,Is the enforcement protocol for the model behavior policy disclosed?,"By enforcement protocol, we refer to mechanisms for identifying whether model
- behavior is permitted or prohibited and actions that may arise in the event the model
- behavior policy is violated. For example, the developer may make updates to the model in
- response to issues with the model’s adherence to the model behavior policy. We will award
- this point if there is a clear description of the enforcement protocol, or if the developer
- reports that it does not enforce its model behavior policy or that it has no such restrictions
- on the model’s behavior.","Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable
- Claims, Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do
- Not Intend To!"
- 80,Downstream,Model behavior policy,"Interoperability of usage and model behavior
- policies","Is the way that the usage policy and the model behavior policy interoperate
- disclosed?","For example, if a user attempts to use the model for a prohibited use such as spam,
- how does the model behavior policy apply if at all? We will also award this point if the
- developer reports that it does not impose any restrictions on its model’s behavior in the
- event of usage policy violation.","I’m Afraid I Can’t Do That: Predicting Prompt Refusal in Black-Box Generative
- Language Models, Fine-tuning Aligned Language Models Compromises Safety, Even When
- Users Do Not Intend To!"
- 81,Downstream,User Interface,User interaction with AI system,"For distribution channels with user-facing interfaces, are users notified (i) that
- they are interacting with an AI system, (ii) of the specific foundation model they are
- interacting with, and (iii) that outputs are machine-generated?","A user-facing interface refers to the means by which the user interacts with the
- foundation model, including how the user can observe outputs from the foundation model
- and other notifications. We will award this point if, for all distribution channels with user-
- facing interfaces, the user is provided adequate transparency as to the foundation model
- being distributed and the potential presence of any model outputs.","Designing Responsible AI: Adaptations of UX Practice to Meet Responsible
- AI Challenges, Towards Responsible AI: A Design Space Exploration of Human-Centered
- Artificial Intelligence User Interfaces to Investigate Fairness"
- 82,Downstream,User Interface,Usage disclaimers,"For distribution channels with user-facing interfaces, are users provided with
- disclaimers involving model use?","A user-facing interface refers to the means by which the user interacts with the
- foundation model, including how the user can observe outputs from the foundation model
- and other notifications. Usage disclaimers could include information about what constitutes
- a usage policy violations or how users should interpret model outputs. We will award this
- point if, for all distribution channels with user-facing interfaces, the user is provided with
- usage disclaimers.","Designing Responsible AI: Adaptations of UX Practice to Meet Responsible
- AI Challenges, Towards Responsible AI: A Design Space Exploration of Human-Centered
- Artificial Intelligence User Interfaces to Investigate Fairness"
- 83,Downstream,User data protection,User data protection policy,"Are the protocols for how the developer stores, accesses, and shares user data
- disclosed?","We will also award this point if the developer reports that it has no user data
- protection policy.","Privacy as Contextual Integrity, Redesigning Data Privacy: Reimagining Notice
- Consent for human technology interaction"
- 84,Downstream,User data protection,Permitted and prohibited use of user data,Are permitted and prohibited uses of user data disclosed?,"Developers use user data for a range of purposes such as building future models,
- updating existing models, and evaluating both existing and future models. We will award this
- point if a developer discloses its policy on the use of user data from interactions associated
- with this model, including both permitted and prohibited uses. This may span different
- distribution channels if multiple channels supply user data to the developer. Alternatively,
- we will award this point if the developer reports it does not impose any limits on its use of
- user data.","Privacy as Contextual Integrity, Redesigning Data Privacy: Reimagining Notice
- Consent for human technology interaction"
- 85,Downstream,User data protection,Usage data access protocol,Is a protocol for granting external entities access to usage data disclosed?,"Usage data refers to the data created through user interaction with the model, such
- as user inputs to the model and associated metadata such as the duration of the interaction.
- A usage data access protocol refers to the steps, requirements, and considerations involved
- in granting external entities access to usage data; this goes beyond stating the conditions
- under which related personal information may be shared with external entities. We will
- award this point for a clear description of the usage data access protocol or if the developer
- reports it does not share usage data with external entities.","How Cambridge Analytica Sparked the Great Privacy Awakening, Redesigning
- Data Privacy: Reimagining Notice Consent for human technology interaction"
- 86,Downstream,Model Updates,Versioning protocol,Is there a disclosed version and versioning protocol for the model?,"By versioning, we mean that each instance of the model is uniquely identified and
- that the model is guaranteed to not change when referring to a fixed version number;
- alternatively, the version clearly indicating a specific instance of the model may be able to
- change by noting that it is the ""latest"" or an ""unstable"" version. We recognize that different
- developers may adopt different versioning practices that may differ from standard semantic
- versioning practices used elsewhere in software engineering.","How is ChatGPT’s behavior changing over time?, Putting the Semantics into
- Semantic Versioning"
- 87,Downstream,Model Updates,Change log,Is there a disclosed change log for the model?,"By change log, we mean a description associated with each change to the model
- (which should be indicated by a change in version number). We recognize that different
- developers may adopt different practices for change logs that may differ from practices
- used elsewhere in software engineering. We will award this point if the change log provides
- a clear description of changes that is legible to a technical audience.","How is ChatGPT’s behavior changing over time?, Watch out for This Commit!
- A Study of Influential Software Changes"
- 88,Downstream,Model Updates,Deprecation policy,Is there a disclosed deprecation policy for the developer?,"By deprecation policy, we refer to a description of what it means for a model to be
- deprecated and how users should respond to the deprecation (e.g. instructions to migrate to
- a newer version). We will award this point for a clear disclosure of a deprecation policy or
- if there is no risk of deprication (e.g. if the developer openly releases model weights).","How is ChatGPT’s behavior changing over time?, Automatic Android Deprecated-
- API Usage Update by Learning from Single Updated Example"
- 89,Downstream,Feedback,Feedback mechanism,Is a feedback mechanism disclosed?,"By feedback mechanism, we refer to a means for external entities to report feedback
- or issues that arise in relation to the foundation model. Such entities may include but are not
- necessarily limited to users. We will award this point if the developer discloses a feedback
- mechanism that has been implemented.","Ecosystem Graphs: The Social Footprint of Foundation Models, Outsider Over-
- sight: Designing a Third Party Audit Ecosystem for AI Governance"
- 90,Downstream,Feedback,Feedback summary,"Is a report or summary disclosed regarding the feedback the developer received
- or, alternatively, the way the developer responded to that feedback?","We recognize that there does not exist an authoritative or consensus standard for
- what is required in a feedback report. For this reason, we will award this point if there is a
- meaningful, though potentially vague or incomplete, summary of feedback received.","Achieving Transparency Report Privacy in Linear Time, Evaluating a Method-
- ology for Increasing AI Transparency: A Case Study"
- 91,Downstream,Feedback,Government inquiries,"Is a summary of government inquiries related to the model received by the
- developer disclosed?","Such government inquiries might include requests for user data, requests that certain
- content be banned, or requests for information about a developer’s business practices. We
- recognize that there does not exist an authoritative or consensus standard for what is
- required for such a summary of government inquiries. For this reason, we will award this
- point if (i) there is a meaningful, though potentially vague or incomplete, summary of
- government inquiries, or (ii) a summary of government inquiries related to user data.","Transparency Report: Government requests on the rise, Ecosystem Graphs: The
- Social Footprint of Foundation Models"
- 92,Downstream,Impact,Monitoring mechanism,"For each distribution channel, is a monitoring mechanism for tracking model
- use disclosed?","By monitoring mechanism, we refer to a specific protocol for tracking model use
- that goes beyond an acknowledgement that usage data is collected. We will also award
- this point for a reasonable best-effort attempt to describe monitoring mechanisms, or if a
- developer discloses that a distribution channel is not monitored.","Progressive Disclosure: Designing for Effective Transparency, Ecosystem Graphs:
- The Social Footprint of Foundation Models"
- 93,Downstream,Impact,Downstream applications,"Across all forms of downstream use, is the number of applications dependent
- on the foundation model disclosed?","We recognize that there does not exist an authoritative or consensus standard for
- what qualifies as an application. We will award this point if there is a meaningful estimate
- of the number of downstream applications, along with some description of what it means
- for an application to be dependent on the model.","Market concentration implications of foundation models: The Invisible Hand
- of ChatGPT, Ecosystem Graphs: The Social Footprint of Foundation Models"
- 94,Downstream,Impact,Affected market sectors,"Across all downstream applications, is the fraction of applications corresponding
- to each market sector disclosed?","By market sector, we refer to an identifiable part of the economy. While established
- standards exist for describing market sectors, we recognize that developers may provide
- vague or informal characterizations of market impact. We will award this point if there is a
- meaningful, though potentially vague or incomplete, summary of affected market sectors.","Market concentration implications of foundation models: The Invisible Hand
- of ChatGPT, Ecosystem Graphs: The Social Footprint of Foundation Models"
- 95,Downstream,Impact,Affected individuals,"Across all forms of downstream use, is the number of individuals affected by
- the foundation model disclosed?","By affected individuals, we principally mean the number of potential users of appli-
- cations. We recognize that there does not exist an authoritative or consensus standard for
- what qualifies as an affected individual. We will award this point if there is a meaningful
- estimate of the number of affected individuals along with a clear description of what it
- means for an individual to be affected by the model.","Market concentration implications of foundation models: The Invisible Hand
- of ChatGPT, Ecosystem Graphs: The Social Footprint of Foundation Models"
- 96,Downstream,Impact,Usage reports,"Is a usage report that gives usage statistics describing the impact of the model
- on users disclosed?","We recognize that there does not exist an authoritative or consensus standard for
- what is required in a usage report. Usage statistics might include, for example, a description
- of the major categories of harm that has been caused by use of the model. We will award
- this point if there is a meaningful, though potentially vague or incomplete, summary of
- usage statistics.","Expert explainer: Allocating accountability in AI supply chains, Ecosystem
- Graphs: The Social Footprint of Foundation Models"
- 97,Downstream,Impact,Geographic statistics,"Across all forms of downstream use, are statistics of model usage across geogra-
- phies disclosed?","We will award this point if there is a meaningful, though potentially incomplete or
- vague, disclosure of geographic usage statistics at the country-level.","Expert explainer: Allocating accountability in AI supply chains, Ecosystem
- Graphs: The Social Footprint of Foundation Models"
- 98,Downstream,Impact,Redress mechanism,Is any mechanism to provide redress to users for harm disclosed?,"We will also award this point if the developer reports it does not have any such
- redress mechanism.","Computational Power and AI, Ecosystem Graphs: The Social Footprint of
- Foundation Models"
- 99,Downstream,Documentation for Deployers,"Centralized documentation for down-
- stream use",Is documentation for downstream use centralized in a centralized artifact?,"Centralized documentation for downstream use refers to an artifact, or closely-linked
- artifacts, that consolidate relevant information for making use of or repurposing the model.
- Examples of these kinds of artifacts include a website with dedicated documentation infor-
- mation, a github repository with dedicated documentation information, and an ecosystem
- card. We recognize that different developers may take different approaches to centralizing
- information. We will award this point if there is a clearly-identified artifact(s) that contains
- the majority of substantive information (e.g. capabilities, limitations, risks, evaluations,
- distribution channels, model license, usage policies, model behavior policies, feedback and
- redress mechanisms, dependencies).","Datasheets for Datasets, Model Cards for Model Reporting"
- 100,Downstream,Documentation for Deployers,"Documentation for responsible down-
- stream use",Is documentation for responsible downstream use disclosed?,"Such documentation might include details on how to adjust API settings to promote
- responsible use, descriptions of how to implement mitigations, or guidelines for responsible
- use. We will also award this point if the developer states that it does not provide any such
- documentation. For example, the developer might state that the model is offered as is and
- downstream developers are accountable for using the model responsibly.","Ecosystem Graphs: The Social Footprint of Foundation Models, Expert explainer:
- Allocating accountability in AI supply chains"
 
fmti_indicators.pdf DELETED
Binary file (377 kB)
 
pdf_parser.py DELETED
@@ -1,74 +0,0 @@
- import csv
- import re
-
- from PyPDF2 import PdfReader
-
-
- def parse_pdf(pdf_path: str):
-     reader = PdfReader(pdf_path)
-     extracted_data = []
-
-     for page in reader.pages:
-         text = page.extract_text()
-         print(text)
-
-         # Regular expression pattern to capture the required fields:
-         # 1. A number followed by a period (the index).
-         # 2. A sequence of word characters, spaces, and hyphens followed by an arrow (the category).
-         # 3. Another sequence of word characters, spaces, and hyphens followed by an arrow (the subcategory).
-         # 4. Yet another sequence of word characters, spaces, and hyphens (the characteristic).
-         # 5. Text following "•Definition :" until the next newline (the definition).
-         # 6. Text following "•Notes :" until the references (the notes).
-         # 7. Text following "•References :" until the next number followed by a period or the end of the file (the references).
-         pattern = r"(\d+)\.\s+([\w\s-]+)→([\w\s-]+)→([\w\s-]+)\n•Definition :\s(.*?)\n•Notes :([\s\S]*?)•References :([\s\S]*?)(?=(\n\d+\.)|\Z)"
-
-         for match in re.finditer(pattern, text, re.DOTALL):
-             index = match.group(1)
-             category = match.group(2).strip()
-             subcategory = match.group(3).strip()
-             characteristic = match.group(4).strip()
-             definition = match.group(5).strip()
-             notes = match.group(6).strip()
-             references = match.group(7).strip()
-
-             extracted_data.append(
-                 [
-                     index,
-                     category,
-                     subcategory,
-                     characteristic,
-                     definition,
-                     notes,
-                     references,
-                 ]
-             )
-
-     try:
-         assert len(extracted_data) == 100, "The parser did not find 100 indicators"
-     except AssertionError:
-         indices = [int(item[0]) for item in extracted_data]
-         missing_indices = [i for i in range(1, 101) if i not in indices]
-         assert not missing_indices, f"Missing indices: {missing_indices}"
-
-     return extracted_data
-
-
- data = parse_pdf(pdf_path="fmti_indicators.pdf")
-
- with open("fmti_indicators.csv", "w", newline="", encoding="utf-8") as csv_file:
-     writer = csv.writer(csv_file)
-     writer.writerow(
-         [
-             "Index",
-             "Category",
-             "Subcategory",
-             "Characteristic",
-             "Definition",
-             "Notes",
-             "References",
-         ]
-     )
-     writer.writerows(data)
-
- csv_file.seek(0)
- lines = csv_file.readlines()
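For reference, the extraction step that the removed pdf_parser.py relied on can be exercised on its own. The snippet below is a minimal, purely illustrative sketch: the sample string is made up and only mimics the page layout the regex expects (an indicator number, arrow-separated category, subcategory, and characteristic headings, then "•Definition :", "•Notes :", and "•References :" blocks); it is not actual FMTI text.

import re

# Made-up fragment mimicking the layout the deleted parser expected; not real FMTI content.
sample_page_text = (
    "1. Upstream→Data→Data size\n"
    "•Definition : Is the data size disclosed?\n"
    "•Notes : Report size in appropriate units.\n"
    "•References : Example Reference A, Example Reference B"
)

# Same pattern as in the deleted script: groups 1-7 are index, category,
# subcategory, characteristic, definition, notes, and references.
pattern = r"(\d+)\.\s+([\w\s-]+)→([\w\s-]+)→([\w\s-]+)\n•Definition :\s(.*?)\n•Notes :([\s\S]*?)•References :([\s\S]*?)(?=(\n\d+\.)|\Z)"

for match in re.finditer(pattern, sample_page_text, re.DOTALL):
    print([match.group(i).strip() for i in range(1, 8)])
    # -> ['1', 'Upstream', 'Data', 'Data size', 'Is the data size disclosed?',
    #     'Report size in appropriate units.', 'Example Reference A, Example Reference B']

Whether the pattern fires on a real page depends on PyPDF2 reproducing the arrow and bullet characters and the exact "•Definition :" spacing in its extracted text, which is why the regex is brittle against layout variations in the source PDF.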