	---
	license: apache-2.0
	datasets:
	- allenai/real-toxicity-prompts
	language:
	- en
	metrics:
	- accuracy
	base_model: facebook/opt-350m
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- code
	- medical
	---
	# TrustyAI Detoxify Causal Language Model

	## Model Description
	The `TrustyAI Detoxify Causal Language Model` is a fine-tuned version of OPT-350m (`facebook/opt-350m`), adapted to reduce toxicity in generated text. It is designed to identify toxic phrases and replace or neutralize them at generation time, which makes it suitable for applications such as social media moderation and customer support.

	## Intended Use
	This model is intended for use in scenarios where toxic or harmful language is a concern. Some potential use cases include:
	- Social Media Moderation: Automatically detecting and neutralizing toxic comments in posts or messages.
	- Customer Support: Ensuring responses generated by AI-powered customer support tools are polite and non-offensive.
	- Online Gaming: Monitoring and filtering player communication to maintain a positive environment.
	- Community Management: Assisting moderators in identifying and managing toxic behavior in online communities.

	## Training Data
	The model was fine-tuned on a curated dataset covering various forms of toxic language, including hate speech, insults, and other harmful content (the card metadata lists `allenai/real-toxicity-prompts`). The dataset was preprocessed to balance positive, neutral, and negative examples, so that the model learns to neutralize toxicity without losing the original context of the input text.

	## Training Procedure
	The model was fine-tuned with Supervised Fine-Tuning (SFT), following the guidelines provided by Red Hat's TrustyAI project. Key steps included:
	- Model: Based on `opt-350m_CASUAL_LM`.
	- Dataset: Preprocessed and balanced for toxicity.
	- Hyperparameters:
	  - Learning rate: 2e-5
	  - Batch size: 32
	  - Number of epochs: 3
	- Optimization: AdamW
	- Tools: Hugging Face Transformers, PyTorch, Red Hat's TrustyAI framework.
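
The hyperparameters listed above can be collected into a small configuration object. The following is a minimal sketch, not the training script used for this model: the class `SFTHyperparameters`, its method names, and the example dataset size of 10,000 are hypothetical. The keys returned by `to_training_kwargs` mirror parameter names of Hugging Face `TrainingArguments`, assuming single-device training without gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTHyperparameters:
    """Hypothetical container for the values listed in this model card."""

    learning_rate: float = 2e-5
    batch_size: int = 32
    num_epochs: int = 3
    optimizer: str = "adamw_torch"  # AdamW, as named by the Transformers `optim` option

    def steps_per_epoch(self, num_examples: int) -> int:
        # The last, possibly smaller, batch still counts as one optimizer step.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        return self.steps_per_epoch(num_examples) * self.num_epochs

    def to_training_kwargs(self) -> dict:
        # Assumes one device and no gradient accumulation.
        return {
            "learning_rate": self.learning_rate,
            "per_device_train_batch_size": self.batch_size,
            "num_train_epochs": self.num_epochs,
            "optim": self.optimizer,
        }


hp = SFTHyperparameters()
# 10_000 is an illustrative dataset size, not the size of the actual training set.
print(hp.total_steps(10_000))
print(hp.to_training_kwargs())
```

With these values, the number of optimizer steps scales linearly with the dataset size, which is useful when choosing a warmup or logging interval.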

	## Evaluation Metrics
	The model was evaluated using the following metrics:
	- Accuracy: Measures how often the model correctly identifies and neutralizes toxic phrases.
	- F1-score: Balances precision and recall, providing a holistic view of the model's performance.
	- Precision: The proportion of identified toxic phrases that were indeed toxic.
	- Recall: The proportion of actual toxic phrases that the model correctly identified.
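
For reference, the four metrics above can be computed from binary labels as in the sketch below. It assumes that each example has been reduced to a label of `1` (toxic) or `0` (non-toxic); the function name `binary_metrics` and the sample labels are illustrative and are not evaluation results for this model.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = toxic)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-toxic, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-toxic, not flagged

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```

Returning `0.0` when a denominator is zero is one common convention; it avoids a division error when no example is flagged or no example is toxic.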

	## Limitations
	While this model performs well in many scenarios, it has some limitations:
	- Context Sensitivity: The model might struggle with complex contexts where the meaning of a phrase depends heavily on surrounding text.
	- Edge Cases: Certain types of subtle or context-dependent toxicity may not be adequately neutralized.
	- Bias: Despite efforts to balance the dataset, some biases may still exist, affecting model performance in underrepresented scenarios.

	## Ethical Considerations
	Given the sensitive nature of toxic language, ethical considerations are paramount. This model is designed to assist in reducing harm, but users should be aware that:
	- False Positives/Negatives: The model might incorrectly flag non-toxic language as toxic or miss actual toxic content.
	- Fairness: Continuous monitoring and updates are recommended to address any biases that may emerge over time.

	## Model Versions
	- Version 1.0: Initial release with base fine-tuning on a toxic-language dataset.

	## License
	This model is licensed under the [Apache 2.0 License](LICENSE).

	## How to Use
	To use this model with Hugging Face Transformers, you can load it as follows:

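	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
	model_name = "psabharwal/trustai"
	model = AutoModelForCausalLM.from_pretrained(model_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	input_text = "You are a worthless piece of junk."
	inputs = tokenizer(input_text, return_tensors="pt")

	# Set an explicit cap on the number of newly generated tokens.
	outputs = model.generate(**inputs, max_new_tokens=50)

	# Skip special tokens (such as BOS/EOS markers) when decoding.
	generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(generated_text)
	```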