---
title: "FAKE NEWS DETECTION"
emoji: "🚀"
colorFrom: "#FF69B4"
colorTo: "#FF1493"
sdk: "gradio"
sdk_version: "5.8.0"
app_file: "application.py"
pinned: false
---

# [Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation

## **Getting Started**

1. **Clone the repository:**

   ```bash
   git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
   ```

2. **Set up the environment:**

   Using a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```

3. **Install dependencies:**

   - Torch: https://pytorch.org/get-started/locally/
   - Others:

     ```bash
     pip install -r requirements.txt
     ```

4. **API Keys** (optional)

   - Obtain API keys for the corresponding models and insert them into the `SimLLM.py` file:
     - ChatGPT: [OpenAI API](https://openai.com/index/openai-api/)
     - Gemini: [Google Gemini API](https://ai.google.dev/gemini-api/docs/api-key)
     - Other LLMs: [Together API](https://api.together.ai/)

5. **Run the project:**

   - Text only:

     ```bash
     python SimLLM.py
     ```

### Parameters

- `LLMs`: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is `['ChatGPT', 'Yi', 'OpenChat']`.
- `train_indexes`: List of indexes into `LLMs` selecting the models used for training. Default is `[0, 1, 2]`.
- `test_indexes`: List of indexes into `LLMs` selecting the models used for testing. Default is `[0]`.
- `num_samples`: Number of samples. Default is 5000.

### Examples

- Running with default parameters:

  `python SimLLM.py`

- Running with customized parameters:

  `python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0`

## Dataset

The `dataset.csv` file contains both human-written texts and texts generated by 12 large language models: ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
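
To illustrate how the `LLMs`, `train_indexes`, and `test_indexes` parameters listed under Parameters relate to each other, here is a minimal sketch using `argparse`. It is not taken from `SimLLM.py`: the helper names `build_parser` and `resolve_models` are hypothetical, and it assumes the indexes are zero-based positions in the `LLMs` list that was passed in.

```python
import argparse

# Model names as listed in the Parameters section of this README.
AVAILABLE_LLMS = [
    "ChatGPT", "Yi", "OpenChat", "Gemini", "LLaMa", "Phi",
    "Mixtral", "QWen", "OLMO", "WizardLM", "Vicuna",
]


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the SimLLM.py options")
    parser.add_argument("--LLMs", nargs="+", choices=AVAILABLE_LLMS,
                        default=["ChatGPT", "Yi", "OpenChat"])
    parser.add_argument("--train_indexes", nargs="+", type=int, default=[0, 1, 2])
    parser.add_argument("--test_indexes", nargs="+", type=int, default=[0])
    parser.add_argument("--num_samples", type=int, default=5000)
    return parser


def resolve_models(argv=None):
    """Map the index lists onto model names (assumed zero-based positions in --LLMs)."""
    args = build_parser().parse_args(argv)
    for i in args.train_indexes + args.test_indexes:
        if not 0 <= i < len(args.LLMs):
            raise ValueError(f"index {i} is out of range for LLMs={args.LLMs}")
    train = [args.LLMs[i] for i in args.train_indexes]
    test = [args.LLMs[i] for i in args.test_indexes]
    return train, test, args.num_samples


# Same arguments as the customized example in this README.
print(resolve_models(["--LLMs", "ChatGPT", "--train_indexes", "0", "--test_indexes", "0"]))
```

In this sketch, passing `--LLMs ChatGPT` alone raises an error, because the default `train_indexes` of `[0, 1, 2]` would point past the end of a one-element list. This is one plausible reading of why the customized example sets all three options together.
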
## Citation

```bibtex
@inproceedings{nguyen2024SimLLM,
  title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
  author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
  booktitle={The Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```

## Acknowledgements

- BARTScore: [BARTScore GitHub Repository](https://github.com/neulab/BARTScore)