---
title: "FAKE NEWS DETECTION"
emoji: "π"
colorFrom: "green"
colorTo: "blue"
sdk: "gradio"
sdk_version: "5.13.1"
app_file: "application.py"
pinned: false
---
# [Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation
## **Getting Started**
1. **Clone the repository:**
```bash
git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
```
2. **Set up the environment:**
Using a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```
3. **Install dependencies:**
- PyTorch: follow the installation instructions at https://pytorch.org/get-started/locally/
- Other packages:
```bash
pip install -r requirements.txt
```
4. **API keys** (optional):
- Obtain API keys for the corresponding models and insert them into the `SimLLM.py` file:
- ChatGPT: [OpenAI API](https://openai.com/index/openai-api/)
- Gemini: [Google Gemini API](https://ai.google.dev/gemini-api/docs/api-key)
- Other LLMs: [Together API](https://api.together.ai/)
5. **Run the project:**
- Text only:
```bash
python SimLLM.py
```
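For the optional API keys step, the README asks for the keys to be inserted into `SimLLM.py`. As an alternative to hard-coding them, the sketch below reads keys from environment variables. The variable names (`OPENAI_API_KEY`, `GEMINI_API_KEY`, `TOGETHER_API_KEY`) and the `load_api_keys` helper are assumptions made for illustration; `SimLLM.py` is not known to read them, so the script would need to be adapted to use this.

```python
import os

# Hypothetical environment variable names, chosen for illustration only.
# They are not defined by this project.
KEY_ENV_VARS = {
    "ChatGPT": "OPENAI_API_KEY",
    "Gemini": "GEMINI_API_KEY",
    "Together": "TOGETHER_API_KEY",
}


def load_api_keys(environ=os.environ):
    """Return a {provider: key} dict for every variable that is set and non-empty."""
    return {
        provider: environ[var]
        for provider, var in KEY_ENV_VARS.items()
        if environ.get(var)
    }


# Providers without a key are simply absent from the result.
api_keys = load_api_keys()
print("Keys found for:", sorted(api_keys))
```

Keeping keys out of the source file avoids committing them to version control by accident.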
### Parameters
- `LLMs`: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is `['ChatGPT', 'Yi', 'OpenChat']`.
- `train_indexes`: Indexes into `LLMs` of the models used for training. Default is `[0, 1, 2]`.
- `test_indexes`: Indexes into `LLMs` of the models used for testing. Default is `[0]`.
- `num_samples`: Number of samples. Default is 5000.
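
The parameters above could be parsed along the following lines. This is a minimal sketch with `argparse`, not the actual code in `SimLLM.py`: the `--num_samples` flag name, the `build_parser` and `select_models` helpers, and the reading of the indexes as positions in `LLMs` are assumptions.

```python
import argparse

# Model names as listed in this README.
AVAILABLE_LLMS = ["ChatGPT", "Yi", "OpenChat", "Gemini", "LLaMa", "Phi",
                  "Mixtral", "QWen", "OLMO", "WizardLM", "Vicuna"]


def build_parser():
    """Build a parser mirroring the documented parameters and defaults."""
    parser = argparse.ArgumentParser(description="SimLLM parameter sketch")
    parser.add_argument("--LLMs", nargs="+", choices=AVAILABLE_LLMS,
                        default=["ChatGPT", "Yi", "OpenChat"])
    parser.add_argument("--train_indexes", nargs="+", type=int,
                        default=[0, 1, 2])
    parser.add_argument("--test_indexes", nargs="+", type=int, default=[0])
    # Assumed flag name; the README documents only the parameter `num_samples`.
    parser.add_argument("--num_samples", type=int, default=5000)
    return parser


def select_models(llms, indexes):
    """Map indexes to model names, assuming they are positions in `llms`."""
    return [llms[i] for i in indexes]


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--LLMs", "ChatGPT", "Yi", "--train_indexes", "0", "1", "--test_indexes", "1"]
)
print("train on:", select_models(args.LLMs, args.train_indexes))
print("test on:", select_models(args.LLMs, args.test_indexes))
```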
### Examples
- Running with default parameters:
`python SimLLM.py`
- Running with customized parameters:
`python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0`
## Dataset
The `dataset.csv` file contains human-written texts and texts generated by 12 large language models:
ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
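
The column layout of `dataset.csv` is not described here, so a quick way to inspect it before running experiments is to list its header and count its rows. The sketch below uses only the standard library; `summarize_csv` is a hypothetical helper, and it assumes the file is UTF-8 encoded with a header row.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Count data rows; the header line is consumed by DictReader.
        row_count = sum(1 for _ in reader)
        columns = reader.fieldnames or []
    return list(columns), row_count
```

Calling `summarize_csv("dataset.csv")` from the repository root returns the column names and the number of rows, which helps when choosing a value for `num_samples`.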
## Citation
```bibtex
@inproceedings{nguyen2024SimLLM,
title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
booktitle={The Conference on Empirical Methods in Natural Language Processing},
year={2024}
}
```
## Acknowledgements
- BARTScore: [BARTScore GitHub Repository](https://github.com/neulab/BARTScore)