[TACL 2025] RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
📣 News
- [2025.08.01] 🎉 Our paper has been accepted to Transactions of the Association for Computational Linguistics (TACL 2025)!
🧐 Overview
RepreGuard is based on the hypothesis that the internal representation patterns of LLMs when processing human-written text (HWT) and LLM-generated text (LGT) are distinct and can be systematically modeled. By employing a surrogate model as an observer, we extract neural activation patterns and identify discriminative features. The resulting RepreScore enables robust classification between HWT and LGT with minimal training data (a conceptual sketch is given after the list below).
- Zero-shot detection: Only a small sample of LGT/HWT pairs is needed for threshold calibration.
- Strong OOD robustness: Outperforms all previous methods across different models, domains, text sizes, and attacks.
- Resource-efficient: Competitive performance even with smaller surrogate models.
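To make the pipeline concrete, here is a minimal sketch of the RepreScore idea, not the official implementation: the function names, the choice of layer, and the difference-of-means heuristic are illustrative assumptions (the paper's actual feature extraction and scoring may differ). It uses a surrogate model as an observer, derives a detection direction from the hidden-state statistics of a few LGT/HWT samples, and scores new texts by projecting their activations onto that direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # surrogate observer model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def mean_hidden_state(text: str, layer: int = -1) -> torch.Tensor:
    """Average one layer's hidden states over all tokens of `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).float()

def fit_direction(lgt_texts, hwt_texts) -> torch.Tensor:
    """Difference-of-means direction separating LGT from HWT activations
    (an illustrative stand-in for the paper's discriminative features)."""
    lgt = torch.stack([mean_hidden_state(t) for t in lgt_texts]).mean(dim=0)
    hwt = torch.stack([mean_hidden_state(t) for t in hwt_texts]).mean(dim=0)
    direction = lgt - hwt
    return direction / direction.norm()

def repre_score(text: str, direction: torch.Tensor) -> float:
    """Higher score = more LLM-like under this sketch."""
    return float(mean_hidden_state(text) @ direction)

def calibrate_threshold(lgt_texts, hwt_texts, direction) -> float:
    """Threshold from a handful of pairs: midpoint of the two mean scores."""
    lgt_scores = [repre_score(t, direction) for t in lgt_texts]
    hwt_scores = [repre_score(t, direction) for t in hwt_texts]
    return (sum(lgt_scores) / len(lgt_scores)
            + sum(hwt_scores) / len(hwt_scores)) / 2
```

A text is then flagged as LLM-generated when `repre_score(text, direction)` exceeds the calibrated threshold; only a small LGT/HWT sample is needed for both `fit_direction` and `calibrate_threshold`, matching the zero-shot setting above.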
⚙️ Datasets, Environment and Experimental Reproduction
Datasets
We use datasets including DetectRL, XSum, Writing Prompts, Yelp Review, and arXiv abstracts, covering diverse domains. The HWT/LGT pairs are generated by ChatGPT, Claude, Google-PaLM, and Llama-2-70b, as well as by the RAID generators (llama-chat, mistral-chat, mpt-chat, mistral, mpt, and gpt2) under both greedy and sampling decoding strategies, with and without repetition penalties.
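For reference, a minimal loader sketch is shown below. The field names `human_text` and `llm_text` are hypothetical, introduced only for illustration; check the JSON files under `datasets/` for the actual schema.

```python
import json

def load_pairs(path):
    """Load HWT/LGT pairs from a dataset JSON file.
    The keys below are assumed for illustration and may differ
    from the real schema shipped in datasets/."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    hwt = [r["human_text"] for r in records]  # human-written texts
    lgt = [r["llm_text"] for r in records]    # LLM-generated texts
    return hwt, lgt

hwt_texts, lgt_texts = load_pairs(
    "datasets/detectrl_dataset/main_dataset/"
    "detectrl_train_dataset_llm_type_ChatGPT.json"
)
```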
Environment
```bash
conda create -n repre_guard python=3.10
conda activate repre_guard
pip install -r requirements.txt
```
Running RepreGuard
```bash
python3 repreGuard_evaluation.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --train_data_path datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json \
    --test_data_paths datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_ChatGPT.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Google-PaLM.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Claude-instant.json,datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_Llama-2-70b.json \
    --ntrain 512 \
    --batch_size 8 \
    --rep_token 0.6 \
    --bootstrap_iter -1
```
Surrogate Model Selection
You can specify the surrogate LLM (e.g., Llama-3-8B, Phi-2, Gemma-2B-Instruct) via the `--model_name_or_path` argument, as in the example below.
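For example, to run the same evaluation with Phi-2 as the surrogate observer (assuming the Hugging Face model id `microsoft/phi-2`):

```bash
python3 repreGuard_evaluation.py \
    --model_name_or_path microsoft/phi-2 \
    --train_data_path datasets/detectrl_dataset/main_dataset/detectrl_train_dataset_llm_type_ChatGPT.json \
    --test_data_paths datasets/detectrl_dataset/main_dataset/detectrl_test_dataset_llm_type_ChatGPT.json \
    --ntrain 512 \
    --batch_size 8 \
    --rep_token 0.6 \
    --bootstrap_iter -1
```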
🔥 Overall Results
| Detector | ID | OOD | 16-shots | Text w/ Attack | Text w/ Various Size | Text w/ Various Sampling Methods |
| --- | --- | --- | --- | --- | --- | --- |
| RoBERTa | 84.85 | 82.26 | 43.90 | 65.97 | 46.81 | |
| Binoculars | 89.18 | 88.07 | 58.15 | 98.70 | 94.41 | |
| RepreGuard | 96.34 | 93.49 | 80.92 | 96.61 | 94.61 | |
See the table in the paper for more details. Attack: paraphrase & perturbation attacks. Various Size: text lengths from 100 to 400.
✏️ Citation
If you find our paper or code useful, please cite us and give us a ⭐!
```bibtex
@article{chen2025repreguard,
  author  = {Xin Chen and Junchao Wu and Shu Yang and Runzhe Zhan and Zeyu Wu and Ziyang Luo and Di Wang and Min Yang and Lidia S. Chao and Derek F. Wong},
  title   = {RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2025},
  url     = {https://github.com/NLP2CT/RepreGuard},
  note    = {Accepted at TACL 2025}
}
```