SCOPE: Towards Scalable Evaluation of Misguided Safety Refusal in LLMs

Yi Zeng¹*, Adam Nguyen¹*, Bo Li², Ruoxi Jia¹
¹Virginia Tech ²University of Chicago *Lead Authors

arXiv-Preprint, 2024

[arXiv] TBD [Project Page] [HuggingFace] [PyPI]

Notebook Demos

Explore our notebooks on various platforms like Jupyter Notebook/Lab, Google Colab, and VS Code Notebook.

Check out four demo notebooks below.

Jupyter Lite	Binder	Google Colab	GitHub Jupyter File

Quickstart

Installation (Under Development TBD)

To quickly use SCOPE in a notebook or Python code, install our pipeline with pip:

pip install SCOPE

from SCOPE import ScopePipeline

scope = ScopePipeline()

Further documentation Here
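Because the pip package is still marked as under development, the import above may fail on some setups. The following is a minimal sketch of guarding that import with the standard library; `optional_attr` is a hypothetical helper written for this example and is not part of SCOPE. Only the names `SCOPE` and `ScopePipeline` come from the snippet above.

```python
import importlib
import importlib.util
from typing import Any, Optional


def optional_attr(module_name: str, attr_name: str) -> Optional[Any]:
    """Return module_name.attr_name, or None if the module or attribute is missing.

    Hypothetical helper for this README example; not part of the SCOPE package.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return getattr(module, attr_name, None)


# "SCOPE" and "ScopePipeline" are the names used in the quickstart snippet.
ScopePipeline = optional_attr("SCOPE", "ScopePipeline")
if ScopePipeline is None:
    print("SCOPE is not installed; use the original research code instead.")
else:
    scope = ScopePipeline()
```

If the package is unavailable, the original code route described in the next section does not depend on it.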

Use our Original Code

To go step by step through our SCOPE process using the original code that generated our HuggingFace dataset, clone or download this repository:

Clone the Repository:

git clone git@github.com:reds-lab/SCOPE.git
cd SCOPE/Scope_Research_Code

Create a New Conda Environment and Activate It:

conda create -n SCOPE python=3.9
conda activate SCOPE

Install Dependencies Using pip:
```
pip install -r requirements.txt
```
Run SCOPE's Main Bash Script:
```
./setup.sh
```
Further Documentation:

For more detailed instructions and further documentation, please refer to the documentation folder inside the repository.

Introduction

TL;DR: SCOPE is a scalable pipeline that systematically generates test data for evaluating spuriously correlated safety refusals in foundation models.

A Quick Glance

Case Studies

Case Study 1

The adaptive nature of SCOPE enables dynamic use cases and functionalities beyond serving as a static benchmark. In this case study, we demonstrate that dynamically generated “Woke” data from SCOPE provides timely identification of safety mechanism-dependent incorrect refusals. We fine-tuned a helpfulness-focused model, Mistral-7B-v0.1, on 50 random samples from AdvBench, introducing safety refusal behaviors. The evaluation compared the model’s safety on AdvBench samples and its incorrect refusal rate on SCOPE data versus static benchmarks like XSTest.
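To make the two quantities in this comparison concrete, here is a minimal sketch of computing a refusal rate over model responses. It assumes a simple keyword match to detect refusals, which is a stand-in chosen for this example and not necessarily the judge used in the study. The marker list, function names, and toy responses are all hypothetical.

```python
from typing import Iterable

# Hypothetical refusal markers; a stand-in for whatever judge is actually used.
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "i am sorry",
    "i won't",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any known marker."""
    # Normalize case and curly apostrophes before matching.
    text = response.strip().lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty input)."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(is_refusal(r) for r in items) / len(items)


# Toy data: responses to harmful prompts (refusing is correct) ...
harmful_responses = [
    "I'm sorry, but I can't help with that.",
    "I cannot assist with this request.",
]
# ... and responses to benign prompts (refusing is incorrect).
benign_responses = [
    "I'm sorry, but I can't help with that.",
    "Sure! Here is a simple pancake recipe.",
    "Here are three tips for better sleep.",
    "To stop the process, run the kill command with its process ID.",
]

safety_refusal_rate = refusal_rate(harmful_responses)
incorrect_refusal_rate = refusal_rate(benign_responses)
print(f"safety refusal rate: {safety_refusal_rate:.2f}")
print(f"incorrect refusal rate: {incorrect_refusal_rate:.2f}")
```

A high rate on the harmful set together with a low rate on the benign set is the desired outcome. Keyword matching can miss or over-count refusals, so a stronger judge would be needed for real measurements.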

Case Study 2

In this case study, we explore using SCOPE data for few-shot mitigation of incorrect refusals. We split the SCOPE and XSTest-63 data into train/test sets and compared different fine-tuning methods. Our findings show that incorporating SCOPE samples effectively mitigates incorrect refusals while maintaining high safety refusal rates. Model 1, which used SCOPE data, demonstrated generalizable mitigation on unseen data, outperforming models trained with larger benign QA samples or XSTest samples. This highlights the potential of SCOPE data in balancing performance, safety, and incorrect refusals in AI safety applications.
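The data preparation described here can be sketched roughly as follows: a seeded train/test split, then a small number of benign samples mixed into a safety fine-tuning set. The split fraction, the value of `k`, the function names, and the toy prompts are assumptions made for this example. The actual configuration used in the case study is not stated in this README.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    samples: Sequence[str], train_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle with a fixed seed and split into disjoint train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_finetune_mix(
    safety_samples: Sequence[str],
    few_shot_samples: Sequence[str],
    k: int,
    seed: int = 0,
) -> List[str]:
    """Combine safety samples with up to k few-shot samples, then shuffle."""
    rng = random.Random(seed)
    pool = list(few_shot_samples)
    chosen = rng.sample(pool, min(k, len(pool)))
    mix = list(safety_samples) + chosen
    rng.shuffle(mix)
    return mix


# Toy stand-ins for SCOPE-style benign prompts and safety training prompts.
benign_prompts = [f"benign-prompt-{i}" for i in range(10)]
safety_prompts = ["harmful-prompt-a", "harmful-prompt-b"]

train_set, test_set = split_train_test(benign_prompts, train_fraction=0.5, seed=0)
finetune_mix = build_finetune_mix(safety_prompts, train_set, k=3, seed=0)
print(len(train_set), len(test_set), len(finetune_mix))
```

Keeping the test split untouched during fine-tuning is what allows mitigation to be checked on unseen data.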

Ethics and Disclosure

The development and application of SCOPE adhere to high ethical standards and principles of transparency. Our primary aim is to enhance AI system safety and reliability by addressing incorrect refusals and improving model alignment with human values. The pipeline employs red-teaming datasets like HEx-PHI and AdvBench to identify and correct spurious features causing misguided refusals in language models. All data used in experiments is sourced from publicly available benchmarks, ensuring the exclusion of private or sensitive data.

We acknowledge the potential misuse of our findings and have taken measures to ensure ethical conduct and responsibility. Our methodology and results are documented transparently, and our code and methods are available for peer review. We emphasize collaboration and open dialogue within the research community to refine and enhance our approaches.

We stress that this work should strengthen safety mechanisms rather than bypass them. Our evaluations aim to highlight the importance of context-aware AI systems that can accurately differentiate harmful from benign requests.

The SCOPE project has been ethically supervised, adhering to our institution's guidelines. We welcome feedback and collaboration to ensure impactful and responsibly managed contributions to AI safety.

License

The software is available under the MIT License.

Contact

If you have any questions, please open an issue or contact Adam Nguyen.

Special Thanks

Help us improve this readme. Any suggestions and contributions are welcome.
