---
title: FAKE NEWS DETECTION
emoji: π
colorFrom: '#FF69B4'
colorTo: '#FF1493'
sdk: gradio
sdk_version: 5.8.0
app_file: application.py
pinned: false
---
[Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation
Getting Started
- Clone the repository:
git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
- Set up the environment using a virtual environment:
python -m venv .venv
source .venv/bin/activate
- Install dependencies:
- Torch: follow the installation instructions at https://pytorch.org/get-started/locally/
- Other packages:
pip install -r requirements.txt
API Keys (optional)
- Obtain API keys for the corresponding models and insert them into the SimLLM.py file:
- ChatGPT: OpenAI API
- Gemini: Google Gemini API
- Other LLMs: Together API
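The exact variable names inside SimLLM.py are not listed here, so the following is only a hypothetical sketch of how the keys could be wired up, reading them from environment variables rather than hard-coding them:

```python
# Hypothetical sketch only: the real variable names in SimLLM.py may differ.
import os

# Reading keys from the environment avoids committing secrets to the repository.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")      # ChatGPT (OpenAI API)
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")      # Gemini (Google Gemini API)
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY", "")  # other LLMs (Together API)
```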
Run the project:
- Text only:
python SimLLM.py
Parameters
- LLMs: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is ['ChatGPT', 'Yi', 'OpenChat'].
- train_indexes: List of LLM indexes for training. Default is [0, 1, 2].
- test_indexes: List of LLM indexes for testing. Default is [0].
- num_samples: Number of samples. Default is 5000.
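For reference, here is a minimal sketch, not the repository's actual code, of how these parameters could be declared with argparse, assuming the list-valued arguments take space-separated values:

```python
# Hedged sketch of the command-line interface described above;
# the actual parser in SimLLM.py may differ.
import argparse

parser = argparse.ArgumentParser(description="SimLLM parameters (sketch)")
parser.add_argument("--LLMs", nargs="+", default=["ChatGPT", "Yi", "OpenChat"],
                    help="Large language models to use")
parser.add_argument("--train_indexes", nargs="+", type=int, default=[0, 1, 2],
                    help="LLM indexes used for training")
parser.add_argument("--test_indexes", nargs="+", type=int, default=[0],
                    help="LLM indexes used for testing")
parser.add_argument("--num_samples", type=int, default=5000,
                    help="Number of samples to use")
args = parser.parse_args()
print(args)
```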
Examples
Running with default parameters:
python SimLLM.py
Running with customized parameters:
python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0
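Assuming the list-valued arguments accept space-separated values (as in the sketch above), running with several models at once might look like:
python SimLLM.py --LLMs ChatGPT Yi OpenChat --train_indexes 0 1 2 --test_indexes 0 --num_samples 5000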
Dataset
The dataset.csv file contains human-written texts alongside texts generated by 12 large language models:
ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
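The column layout of dataset.csv is not documented here, so as a quick way to inspect the data, the sketch below (assuming pandas is installed) only prints what the file actually contains:

```python
# Minimal inspection sketch for dataset.csv.
import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # actual column names in the file
print(df.head())            # first few rows
```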
Citation
@inproceedings{nguyen2024SimLLM,
  title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
  author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
  booktitle={The Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
Acknowledgements
- BARTScore: BARTScore GitHub Repository