---
title: "FAKE NEWS DETECTION"
emoji: "πŸš€"
colorFrom: "green"
colorTo: "blue"
sdk: "gradio"
sdk_version: "5.13.1"
app_file: "application.py"
pinned: false
---

# [Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation

## **Getting Started**
1. **Clone the repository:**
  ```bash
  git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
  ```

2. **Set up the environment:**
Create and activate a virtual environment:
  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

3. **Install dependencies:**
  - Torch: https://pytorch.org/get-started/locally/
  - Other dependencies:
  ```bash
  pip install -r requirements.txt
  ```


4. **API Keys** (optional)
   - Obtain API keys for the corresponding models and insert them into the `SimLLM.py` file (a placeholder sketch follows this list):
     - ChatGPT: [OpenAI API](https://openai.com/index/openai-api/)
     - Gemini: [Google Gemini API](https://ai.google.dev/gemini-api/docs/api-key)
     - Other LLMs: [Together API](https://api.together.ai/)
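
   A minimal sketch of where the keys could go; the variable names below are illustrative placeholders, not the actual identifiers used in `SimLLM.py`:
   ```python
   # Hypothetical key placement inside SimLLM.py.
   # The real variable names may differ -- check the file before editing.
   OPENAI_API_KEY = "<your OpenAI key>"      # ChatGPT
   GEMINI_API_KEY = "<your Gemini key>"      # Gemini
   TOGETHER_API_KEY = "<your Together key>"  # Yi, OpenChat, and other LLMs
   ```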


5. **Run the project:**
  - Text only:
  ```bash
  python SimLLM.py
  ```

### Parameters

- `LLMs`: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is `['ChatGPT', 'Yi', 'OpenChat']`.
- `train_indexes`: List of LLM indexes for training. Default is `[0, 1, 2]`.
- `test_indexes`: List of LLM indexes for testing. Default is `[0]`.
- `num_samples`: Number of samples. Default is 5000.

### Examples

- Running with default parameters:
  `python SimLLM.py`

- Running with customized parameters:
  `python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0`
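
The flags above could be wired up with `argparse` roughly as follows. This is a sketch of the expected command-line interface with the defaults from the Parameters section, not the actual parser used in `SimLLM.py`:

```python
import argparse

# Sketch of a CLI matching the flags shown above; SimLLM.py may define
# its parser differently.
parser = argparse.ArgumentParser(
    description="SimLLM: detect LLM-generated sentences")
parser.add_argument("--LLMs", nargs="+", default=["ChatGPT", "Yi", "OpenChat"],
                    help="Large language models to use for re-generation")
parser.add_argument("--train_indexes", nargs="+", type=int, default=[0, 1, 2],
                    help="Indexes of LLMs used for training")
parser.add_argument("--test_indexes", nargs="+", type=int, default=[0],
                    help="Indexes of LLMs used for testing")
parser.add_argument("--num_samples", type=int, default=5000,
                    help="Number of samples")
args = parser.parse_args()
print(args)
```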

## Dataset

The `dataset.csv` file contains both human and generated texts from 12 large language models, including:
ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
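
A quick way to inspect the file with pandas before training; it only prints the shape, column names, and a few rows, so no assumptions about the header are needed:

```python
import pandas as pd

# Load the dataset shipped with the repository.
df = pd.read_csv("dataset.csv")

# Inspect its structure before relying on any particular columns.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```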

## Citation

```bibtex
@inproceedings{nguyen2024SimLLM,
  title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
  author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
  booktitle={The Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

## Acknowledgements

- BARTScore: [BARTScore GitHub Repository](https://github.com/neulab/BARTScore)