Publications | Shuo Chen

Please also check Google Scholar for a comprehensive list.

2025

NeurIPS WS

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, and Jindong Gu

Reliable ML from Unreliable Data Workshop @ NeurIPS, 2025

OpenAI 🏆

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, and Jindong Gu

OpenAI Red-Teaming Challenge Honorable Mention Award, 2025

COLM

True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu

Conference on Language Modeling (COLM), 2025

EMNLP

METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, and Yunpu Ma

Conference on Empirical Methods in Natural Language Processing (EMNLP) Main, 2025

COLM

Supposedly Equivalent Facts That Aren’t? Entity Frequency in Pre-training Induces Asymmetry in LLMs

Yuan He, Bailan He, Zifeng Ding, Alisia Lupidi, Yuqicheng Zhu, Shuo Chen, Caiqi Zhang, Jiaoyan Chen, Yunpu Ma, and Volker Tresp

COLM, 2025
ACL

Multimodal pragmatic jailbreak on text-to-image models

Tong Liu, Zhixin Lai, Jiawen Wang, Gengyuan Zhang, Shuo Chen, Philip Torr, Vera Demberg, Volker Tresp, and Jindong Gu

ACL, 2025
WACV

Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?

Shuo Chen, Zhen Han, Bailan He, Mark Buckley, Philip Torr, Volker Tresp, and Jindong Gu

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

Abs

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, its research on MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven by the textual content whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method.

2024

EMNLP

Visual question decomposition on multimodal large language models

Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, and Jindong Gu

Conference on Empirical Methods in Natural Language Processing (EMNLP) Findings, 2024
COLM

Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images

Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, and Jindong Gu

In Conference on Language Modeling (COLM), 2024

Abs

Multimodal LLMs (MLLMs) with a great ability of text and image understanding have received great attention. To achieve better reasoning with MLLMs, Chain-of-Thought (CoT) reasoning has been widely explored, which further promotes MLLMs’ explainability by giving intermediate reasoning steps. Despite the strong power demonstrated by MLLMs in multimodal reasoning, recent studies show that MLLMs still suffer from adversarial images. This raises the following open questions: Does CoT also enhance the adversarial robustness of MLLMs? What do the intermediate reasoning steps of CoT entail under adversarial attacks? To answer these questions, we first generalize existing attacks to CoT-based inferences by attacking the two main components, i.e., rationale and answer. We find that CoT indeed improves MLLMs’ adversarial robustness against the existing attack methods by leveraging the multi-step reasoning process, but not substantially. Based on our findings, we further propose a novel attack method, termed as stop-reasoning attack, that attacks the model while bypassing the CoT reasoning process. Experiments on three MLLMs and two visual reasoning datasets verify the effectiveness of our proposed method. We show that stop-reasoning attack can result in misled predictions and outperform baseline attacks by a significant margin.
SET LLM @ ICLR

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, and Jindong Gu

In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024

Abs

Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://github.com/chenxshuo/RedTeamingGPT4V
arXiv

PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

Yilun Liu, Yunpu Ma, Shuo Chen, Zifeng Ding, Bailan He, Zhen Han, and Volker Tresp

arXiv preprint arXiv:2411.08212, 2024

Abs

The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8×7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.

2023

NeurIPS

Benchmarking robustness of adaptation methods on pre-trained vision-language models

Shuo Chen, Jindong Gu, Zhen Han, Yunpu Ma, Philip Torr, and Volker Tresp

In Conference on Neural Information Processing Systems (NeurIPS), 2023

Abs

Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains. As test samples in real-world applications usually differ from adaptation data, the robustness of these adaptation methods against distribution shifts is essential. In this study, we assess the robustness of 11 widely-used adaptation methods across 4 vision-language datasets under multimodal corruptions. Concretely, we introduce 7 benchmark datasets, including 96 visual and 87 textual corruptions, to investigate the robustness of different adaptation methods, the impact of available adaptation examples, and the influence of trainable parameter size during adaptation. Our analysis reveals that: 1) Adaptation methods are more sensitive to text corruptions than visual corruptions. 2) Full fine-tuning does not consistently provide the highest robustness; instead, adapters can achieve better robustness with comparable clean performance. 3) Contrary to expectations, our findings indicate that increasing the amount of adaptation data and the number of parameters does not guarantee enhanced robustness; instead, it results in even lower robustness. We hope this study could benefit future research in the development of robust multimodal adaptation methods. The benchmark, code, and dataset used in this study can be accessed at https://adarobustness.github.io.
arXiv

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr

arXiv preprint arXiv:2307.12980, 2023

Abs

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering makes it possible to perform predictions based solely on prompts without updating model parameters, and eases the application of large pre-trained models in real-world tasks. In past years, prompt engineering has been well-studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently a lack of a systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g. Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation models (e.g. Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.

2022

arXiv

Social Networks are Divulging Your Identity behind Crypto Addresses

Shuo Chen and Shaikh Muhammad Uzair Norman

arXiv preprint, 2022

Abs

Cryptocurrencies, such as Bitcoin and Ethereum, are becoming increasingly prevalent mainly due to their anonymity, decentralization, transparency, and security. However, the completely public ledger makes the trace and analysis of each account possible as long as the identity behind the public address is revealed. Theoretically, social networks could make that happen when addresses are posted on social network platforms using accounts containing personal information. To verify such a possibility, we have collected public data from two major platforms, i.e. Twitter and Reddit, aiming to find potential privacy leakage behind the ETH public address. In the end, an easy-to-use retrieval application is also built for a better illustration.