Yi Zeng1,* ,
Adam Nguyen1,* ,
Bo Li2
Ruoxi Jia1 ,
1Virginia Tech 2University of Chicago
*Lead Authors
arXiv-Preprint, 2024
[arXiv] TBD [Project Page] [HuggingFace] [PyPI]
Explore our notebooks on various platforms like Jupyter Notebook/Lab, Google Colab, and VS Code Notebook.
Check out four demo notebooks below.
Jupyter Lite | Binder | Google Colab | Github Jupyter File |
---|---|---|---|
To quickly use SCOPE in a notebook or Python code, install our pipeline with pip
:
pip install SCOPE
from SCOPE import ScopePipeline
scope = ScopePipeline()
Further documentation Here
To go step by step through our SCOPE process using the original code that generated our HuggingFace dataset, clone or download this repository:
-
Clone the Repository:
git clone [email protected]:reds-lab/SCOPE.git cd SCOPE/SCOPE_Research_Code
-
Create a New Conda Environment and Activate It:
conda create -n SCOPE python=3.9 conda activate SCOPE
-
Install Dependencies Using
pip
:pip install -r requirements.txt
-
Run SCOPE's Main Bash Script:
./setup.sh
-
Further Documentation: Reference to additional documentation within the repository:
For more detailed instructions and further documentation, please refer to the documentation folder inside the repository.
TL;DR: SCOPE is a scalable pipeline that generates test data to evaluate the spurious correlated safety refusal of foundation models through a systematic approach.
The adaptive nature of SCOPE enables dynamic use cases and functionalities beyond serving as a static benchmark. In this case study, we demonstrate that dynamically generated “Woke” data from SCOPE provides timely identification of safety mechanism-dependent incorrect refusals. We fine-tuned a helpfulness-focused model, Mistral-7B-v0.1, on 50 random samples from AdvBench, introducing safety refusal behaviors. The evaluation compared the model’s safety on AdvBench samples and its incorrect refusal rate on SCOPE data versus static benchmarks like XSTest.
In this case study, we explore using SCOPE data for few-shot mitigation of incorrect refusals. We split the SCOPE and XSTest-63 data into train/test sets and compared different fine-tuning methods. Our findings show that incorporating SCOPE samples effectively mitigates wrong refusals while maintaining high safety refusal rates. Model 1, which used SCOPE data, demonstrated generalizable mitigation on unseen data, outperforming models trained with larger benign QA samples or XSTest samples. This highlights the potential of SCOPE data in balancing performance, safety, and incorrect refusals in AI safety applications.
The development and application of SCOPE adhere to high ethical standards and principles of transparency. Our primary aim is to enhance AI system safety and reliability by addressing incorrect refusals and improving model alignment with human values. The pipeline employs red-teaming datasets like HEx-PHI and AdvBench to identify and correct spurious features causing misguided refusals in language models. All data used in experiments is sourced from publicly available benchmarks, ensuring the exclusion of private or sensitive data.
We acknowledge the potential misuse of our findings and have taken measures to ensure ethical conduct and responsibility. Our methodology and results are documented transparently, and our code and methods are available for peer review. We emphasize collaboration and open dialogue within the research community to refine and enhance our approaches.
We stress that this work should strengthen safety mechanisms rather than bypass them. Our evaluations aim to highlight the importance of context-aware AI systems that can accurately differentiate harmful from benign requests.
The SCOPE project has been ethically supervised, adhering to our institution's guidelines. We welcome feedback and collaboration to ensure impactful and responsibly managed contributions to AI safety.
The software is available under the MIT License. ◊
If you have any questions, please open an issue or contact Adam Nguyen.
Help us improve this readme. Any suggestions and contributions are welcome.