Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
-
PromptBridge: Cross-Model Prompt Transfer for Large Language Models. Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei, Yang Liu, and Yujia Bao. arXiv preprint arXiv:2512.01420, 2025.
Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
-
WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks. Jingbo Yang, Bairu Hou, Wei Wei, Shiyu Chang, and Yujia Bao. arXiv preprint arXiv:2510.06587, 2025.
Large language model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long-horizon navigation, large-scale information extraction, and reasoning under constraints. We present WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART (i) dynamically decomposes each objective into three focused subtasks: navigation, information extraction, and execution, so the model concentrates on one skill at a time, and (ii) continuously replans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration. Evaluated on WebChoreArena, WebDART lifts success rates by up to 13.7 percentage points over previous SOTA agents, while matching their performance on the easier WebArena suite and completing tasks with up to 14.7 fewer navigation steps.
-
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers. Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, and Eugene Siow. arXiv preprint arXiv:2508.20453, 2025.
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
-
SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models. Gyuhak Kim, Sumiran Singh Thakur, Su Min Park, Wei Wei, and Yujia Bao. arXiv preprint arXiv:2506.15021, 2025.
Supervised fine-tuning (SFT) has become an essential step in tailoring large language models (LLMs) to align with human expectations and specific downstream tasks. However, existing SFT methods typically treat each training instance as a uniform sequence, giving equal importance to all tokens regardless of their relevance. This overlooks the fact that only a subset of tokens often contains critical, task-specific information. To address this limitation, we introduce Supervised Fine-Tuning with Group Optimization (SFT-GO), a novel approach that treats groups of tokens differently based on their importance. SFT-GO groups tokens in each sample based on their importance values and optimizes the LLM using a weighted combination of the worst-group loss and the standard cross-entropy loss. This mechanism adaptively emphasizes the most challenging token groups and guides the model to better handle different group distributions, thereby improving overall learning dynamics. We provide a theoretical analysis of SFT-GO's convergence rate, demonstrating its efficiency. Empirically, we apply SFT-GO with three different token grouping strategies and show that models trained with SFT-GO consistently outperform baseline approaches across popular LLM benchmarks. These improvements hold across various datasets and base models, demonstrating the robustness and the effectiveness of our method.
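A minimal sketch of the objective this abstract describes: tokens are bucketed by an importance signal, and the final loss mixes the worst bucket's loss with standard cross-entropy. The grouping rule, mixing weight, and all names below are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def sft_go_style_loss(logits, labels, importance, lam=0.5, n_groups=2):
        """Weighted combination of worst-group loss and standard CE.
        logits: (B, T, V); labels: (B, T) with -100 marking padding;
        importance: (B, T) per-token importance scores (assumed given)."""
        ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                             reduction="none", ignore_index=-100)
        valid = labels.view(-1) != -100
        imp = importance.view(-1)
        # Illustrative grouping: quantile buckets over importance values.
        edges = torch.quantile(imp[valid],
                               torch.linspace(0, 1, n_groups + 1, device=imp.device))
        group_losses = []
        for g in range(n_groups):
            in_g = valid & (imp >= edges[g]) & (imp <= edges[g + 1])
            if in_g.any():
                group_losses.append(ce[in_g].mean())
        worst = torch.stack(group_losses).max()     # worst-group loss
        standard = ce[valid].mean()                 # standard cross-entropy
        return lam * worst + (1 - lam) * standard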
-
Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control. Alireza Rezazadeh, Zichao Li, Ange Lou, Yuying Zhao, Wei Wei, and Yujia Bao. arXiv preprint arXiv:2505.18279, 2025.
Complex tasks are increasingly delegated to ensembles of specialized LLM-based agents that reason, communicate, and coordinate actions, both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single-agent performance, most approaches assume a monolithic, single-user context, overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi-user, multi-agent environments with asymmetric, time-evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory, fragments visible only to their originating user; and (2) shared memory, selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user-agent-resource constraints and project existing memory fragments into filtered, transformed views. Write policies determine fragment retention and sharing, applying context-aware transformations to update the memory. Both policies may be conditioned on system-, agent-, and user-level information. Our framework enables safe, efficient, and interpretable cross-user knowledge sharing, with provable adherence to asymmetric, time-varying policies and full auditability of memory operations.
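As a rough illustration of the fragment provenance and read-policy mechanics described above (a sketch under assumed data shapes; the field names, permission map, and policy rule are invented for illustration, not the paper's API):

    import time
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Fragment:
        text: str
        user: str                 # originating user
        agents: tuple             # contributing agents (immutable provenance)
        resources: tuple          # resources touched when the fragment was written
        created_at: float = field(default_factory=time.time)
        shared: bool = False      # private vs. shared memory tier

    def readable(frag, user, agent, permissions):
        # Read policy: own private fragments are always visible; shared
        # fragments require the current user-agent pair to still hold access
        # to every resource in the fragment's provenance (retrospective check).
        if not frag.shared:
            return frag.user == user
        allowed = permissions.get((user, agent), set())
        return all(r in allowed for r in frag.resources)

    # permissions encodes a snapshot of the time-evolving bipartite access graph.
    permissions = {("alice", "planner"): {"crm_db", "email"}}
    frag = Fragment("Q3 pipeline summary", "bob", ("analyst",), ("crm_db",), shared=True)
    print(readable(frag, "alice", "planner", permissions))  # True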
-
Advertising in AI systems: Society must be vigilant. Menghua Wu and Yujia Bao. arXiv preprint arXiv:2505.18425, 2025.
AI systems have increasingly become our gateways to the Internet. We argue that just as advertising has driven the monetization of web search and social media, so too will commercial incentives shape the content served by AI. Unlike traditional media, however, the outputs of these systems are dynamic, personalized, and lack clear provenance, raising concerns for transparency and regulation. In this paper, we envision how commercial content could be delivered through generative AI-based systems. Based on the requirements of key stakeholders (advertisers, consumers, and platforms), we propose design principles for commercially influenced AI systems. We then outline high-level strategies for end users to identify and mitigate commercial biases in model outputs. Finally, we conclude with open questions and a call to action towards these goals.
-
Enhancing Retrieval Systems with Inference-Time Logical Reasoning. Felix Faltings, Wei Wei, and Yujia Bao. ACL 2025.
Traditional retrieval methods rely on transforming user queries into vector representations and retrieving documents based on cosine similarity within an embedding space. While efficient and scalable, this approach often fails to handle complex queries involving logical constructs such as negations, conjunctions, and disjunctions. In this paper, we propose a novel inference-time logical reasoning framework that explicitly incorporates logical reasoning into the retrieval process. Our method extracts logical reasoning structures from natural language queries and then composes the individual cosine similarity scores to formulate the final document scores. This approach enables the retrieval process to handle complex logical reasoning without compromising computational efficiency. Our results on both synthetic and real-world benchmarks demonstrate that the proposed method consistently outperforms traditional retrieval methods across different models and datasets, significantly improving retrieval performance for complex queries.
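A small sketch of score composition in this spirit: leaves score documents by cosine similarity, and logical nodes combine child scores with soft-logic operators. The min/max/complement operators and the query-tree encoding are illustrative assumptions, not the paper's exact composition rule; embed stands in for any sentence encoder.

    import numpy as np

    def compose(node, doc_vecs, embed):
        """node is ("TERM", text) or (op, [children]) with op in AND/OR/NOT."""
        op, args = node
        if op == "TERM":
            q = embed(args)
            sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
            return (sims + 1) / 2                    # map cosine from [-1, 1] to [0, 1]
        scores = [compose(c, doc_vecs, embed) for c in args]
        if op == "AND":
            return np.minimum.reduce(scores)         # conjunction: all clauses must match
        if op == "OR":
            return np.maximum.reduce(scores)         # disjunction: best clause wins
        if op == "NOT":
            return 1.0 - scores[0]                   # negation: complement the score

    # "unlearning papers that do not involve diffusion models"
    query = ("AND", [("TERM", "unlearning"), ("NOT", [("TERM", "diffusion models")])])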
-
KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. NeurIPS 2025.
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.
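A conceptual sketch of the positional-adjustment step: rotary position embeddings compose, so a KV cache computed with local positions can be shifted to its global offset by one extra rotation. This shows only the re-positioning idea; KVLink's trainable link tokens and training recipe are not reproduced here, and the half-split RoPE layout is an assumption.

    import torch

    def shift_rope(keys, delta, base=10000.0):
        """Re-rotate RoPE-encoded keys (B, H, L, D) by `delta` positions,
        using R(p + delta) = R(delta) R(p) on each 2-D rotation pair."""
        d = keys.shape[-1]
        inv_freq = base ** (-torch.arange(0, d, 2, dtype=keys.dtype) / d)
        cos, sin = (delta * inv_freq).cos(), (delta * inv_freq).sin()
        k1, k2 = keys[..., : d // 2], keys[..., d // 2 :]
        return torch.cat([k1 * cos - k2 * sin, k2 * cos + k1 * sin], dim=-1)

    def link_caches(doc_keys):
        """Concatenate independently precomputed per-document key caches,
        shifting each cache to its global position offset."""
        out, offset = [], 0
        for k in doc_keys:                       # each k: (B, H, L_i, D)
            out.append(shift_rope(k, offset))
            offset += k.shape[2]
        return torch.cat(out, dim=2)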
-
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. arXiv preprint arXiv:2502.12893, 2025.
Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks, using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline, dropping from 98% to below 2%, and in some instances the model's initially cautious tone even shifts to one willing to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.
-
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning. Yaxuan Wang, Chris Yuhao Liu, Quan Liu, Jinglong Pang, Wei Wei, Yujia Bao, and Yang Liu. arXiv preprint arXiv:2511.05784, 2025.
Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real-world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
-
Improving Data Efficiency via Curating LLM-Driven Rating Systems. Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, and Wei Wei. ICLR 2025.
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."
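A toy sketch of the transition-matrix correction step (decoding by posterior argmax is an illustrative choice; DS2's full curation also promotes diversity among selected samples, which this snippet omits):

    import numpy as np

    def correct_scores(observed, T, prior):
        """T[i, j] = P(rater outputs score j | true score i); prior[i] = P(true = i).
        Returns the posterior-argmax true score for each observed score."""
        joint = T * prior[:, None]                  # P(true = i, observed = j)
        posterior = joint / joint.sum(axis=0, keepdims=True)
        return np.array([posterior[:, j].argmax() for j in observed])

    T = np.array([[0.8, 0.2, 0.0],                  # true score 0
                  [0.3, 0.5, 0.2],                  # true score 1 (often misrated)
                  [0.0, 0.1, 0.9]])                 # true score 2
    prior = np.array([0.3, 0.4, 0.3])
    print(correct_scores([0, 1, 2], T, prior))      # corrected score per observation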
-
LLM Unlearning via Loss Adjustment with Only Forget Data. Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, and Wei Wei. ICLR 2025.
Unlearning in Large Language Models (LLMs) is essential for ensuring ethical and responsible AI use, especially in addressing privacy leaks, bias, safety, and evolving regulations. Existing approaches to LLM unlearning often rely on retain data or a reference LLM, yet they struggle to adequately balance unlearning performance with overall model utility. This challenge arises because leveraging explicit retain data or implicit knowledge of retain data from a reference LLM to fine-tune the model tends to blur the boundaries between the forgotten and retain data, as different queries often elicit similar responses. In this work, we propose eliminating the need for retain data or a reference LLM for response calibration in LLM unlearning. Recognizing that directly applying gradient ascent on the forget data often leads to optimization instability and poor performance, our method guides the LLM on what not to respond to, and importantly, how to respond, based on the forget data. Hence, we introduce Forget data only Loss AjustmenT (FLAT), a "flat" loss adjustment approach which addresses these issues by maximizing the f-divergence between the available template answer and the forget answer only w.r.t. the forget data. The variational form of the defined f-divergence theoretically provides a way of loss adjustment by assigning different importance weights for the learning w.r.t. template responses and the forgetting of responses subject to unlearning. Empirical results demonstrate that our approach not only achieves superior unlearning performance compared to existing methods but also minimizes the impact on the model's retained capabilities, ensuring high utility across diverse tasks, including copyrighted content unlearning on the Harry Potter dataset and MUSE benchmark, and entity unlearning on the TOFU dataset.
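For reference, the variational form invoked in this abstract is the standard f-divergence lower bound (the bound itself is textbook material; how FLAT instantiates the critic g over template and forget responses is specified in the paper):

    \[
    D_f(P \,\|\, Q) \;=\; \sup_{g}\ \mathbb{E}_{x \sim P}\big[g(x)\big]
        \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}(g(x))\big],
    \]

where f^* is the convex conjugate of f. Taking P over template answers and Q over forget answers, the two expectation terms act as the differently weighted learning and forgetting signals the abstract describes.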
-
From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. ICLR 2025.
Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree's depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model's context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.
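A compact sketch of the insertion logic suggested by this abstract: descend toward the most similar child, refresh aggregates along the path, and attach a new leaf when similarity falls below a threshold. The traversal rule, update rule, and threshold are illustrative; the actual MemTree also re-summarizes node content with an LLM.

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    class Node:
        def __init__(self, text, emb):
            self.text, self.emb, self.children = text, emb, []

    def insert(root, text, emb, threshold=0.6):
        node = root
        while node.children:
            sims = [cos(emb, c.emb) for c in node.children]
            best = int(np.argmax(sims))
            if sims[best] < threshold:               # nothing similar enough: stop here
                break
            node.emb = 0.9 * node.emb + 0.1 * emb    # refresh the aggregate on the path
            node = node.children[best]
        node.children.append(Node(text, emb))        # attach the new memory as a leaf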
-
Sample, estimate, aggregate: A recipe for causal discovery foundation models. Menghua Wu, Yujia Bao, Regina Barzilay, and Tommi Jaakkola. TMLR 2025.
Causal discovery, the task of inferring causal structure from data, has the potential to uncover mechanistic insights from biological experiments, especially those involving perturbations. However, causal discovery algorithms over larger sets of variables tend to be brittle against misspecification or when data are limited. For example, single-cell transcriptomics measures thousands of genes, but the nature of their relationships is not known, and there may be as few as tens of cells per intervention setting. To mitigate these challenges, we propose a foundation model-inspired approach: a supervised model trained on large-scale, synthetic data to predict causal graphs from summary statistics, such as the outputs of classical causal discovery algorithms run over subsets of variables and other statistical hints like inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of a discovery algorithm remain comparable across datasets. Theoretically, we show that the model architecture is well-specified, in the sense that it can recover a causal graph consistent with graphs over subsets. Empirically, we train the model to be robust to misspecification and distribution shift using diverse datasets. Experiments on biological and synthetic data confirm that this model generalizes well beyond its training set, runs on graphs with hundreds of variables in seconds, and can be easily adapted to different underlying data assumptions.
2024
-
Harnessing business and media insights with large language models. Yujia Bao, Ankit Parag Shah, Neeru Narang, Jonathan Rivers, Rajeev Maksey, Lan Guan, Louise N Barrere, Shelley Evenson, Rahul Basole, Connie Miao, and others. arXiv preprint arXiv:2406.06559, 2024.
This paper introduces Fortune Analytics Language Model (FALM). FALM empowers users with direct access to comprehensive business analysis, including market trends, company performance metrics, and expert insights. Unlike generic LLMs, FALM leverages a curated knowledge base built from professional journalism, enabling it to deliver precise and in-depth answers to intricate business questions. Users can further leverage natural language queries to directly visualize financial data, generating insightful charts and graphs to understand trends across diverse business sectors clearly. FALM fosters user trust and ensures output accuracy through three novel methods: 1) Time-aware reasoning guarantees accurate event registration and prioritizes recent updates. 2) Thematic trend analysis explicitly examines topic evolution over time, providing insights into emerging business landscapes. 3) Content referencing and task decomposition enhance answer fidelity and data visualization accuracy. We conduct both automated and human evaluations, demonstrating FALM's significant performance improvements over baseline methods while prioritizing responsible AI practices. These benchmarks establish FALM as a cutting-edge LLM in the business and media domains, with exceptional accuracy and trustworthiness.
-
Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words. Yujia Bao, Srinivasan Sivanandan, and Theofanis Karaletsos. ICLR 2024.
Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. Our code is available at https://github.com/insitro/ChannelViT.
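A minimal sketch of the per-channel tokenization described above: one shared single-channel patch projection, plus a learnable channel embedding added to each channel's tokens; passing a subset of channel ids mimics evaluation with partial channels. Sizes and names are illustrative; see the linked repository for the actual implementation.

    import torch
    import torch.nn as nn

    class ChannelPatchEmbed(nn.Module):
        def __init__(self, n_channels, dim=192, patch=16):
            super().__init__()
            self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            self.channel_emb = nn.Parameter(torch.zeros(n_channels, dim))

        def forward(self, x, channel_ids):           # x: (B, C_present, H, W)
            tokens = []
            for i, c in enumerate(channel_ids):      # tokens built per channel
                t = self.proj(x[:, i : i + 1])       # (B, dim, H/p, W/p)
                t = t.flatten(2).transpose(1, 2)     # (B, n_patches, dim)
                tokens.append(t + self.channel_emb[c])
            return torch.cat(tokens, dim=1)          # (B, C_present * n_patches, dim)

    emb = ChannelPatchEmbed(n_channels=5)
    out = emb(torch.randn(2, 3, 224, 224), channel_ids=[0, 2, 4])  # 3 of 5 channels
    print(out.shape)                                 # torch.Size([2, 588, 192])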
2023
-
Contextual Vision Transformers for Robust Representation Learning. Yujia Bao and Theofanis Karaletsos. arXiv preprint arXiv:2305.19402, 2023.
We introduce Contextual Vision Transformers (ContextViT), a method designed to generate robust image representations for datasets experiencing shifts in latent factors across various groups. Derived from the concept of in-context learning, ContextViT incorporates an additional context token to encapsulate group-specific information. This integration allows the model to adjust the image representation in accordance with the group-specific context. Specifically, for a given input image, ContextViT maps images with identical group membership into this context token, which is appended to the input image tokens. Additionally, we introduce a context inference network to predict such tokens on-the-fly, given a batch of samples from the group. This enables ContextViT to adapt to new testing distributions during inference time. We demonstrate the efficacy of ContextViT across a wide range of applications. In supervised fine-tuning, we show that augmenting pre-trained ViTs with our proposed context conditioning mechanism results in consistent improvements in out-of-distribution generalization on iWildCam and FMoW. We also investigate self-supervised representation learning with ContextViT. Our experiments on the Camelyon17 pathology imaging benchmark and the JUMP-CP microscopy imaging benchmark demonstrate that ContextViT excels in learning stable image featurizations amidst distribution shift, consistently outperforming its ViT counterpart.
2022
-
Learning to Split for Automatic Bias Detection. Yujia Bao and Regina Barzilay. arXiv preprint arXiv:2204.13749, 2022.
Classifiers are biased when trained on biased datasets. As a remedy, we propose Learning to Split (ls), an algorithm for automatic bias detection. Given a dataset with input-label pairs, ls learns to split this dataset so that predictors trained on the training split generalize poorly to the testing split. This performance gap provides a proxy for measuring the degree of bias in the learned features and can therefore be used to reduce biases. Identifying non-generalizable splits is challenging as we don't have any explicit annotations about how to split. In this work, we show that the prediction correctness of the testing example can be used as a source of weak supervision: generalization performance will drop if we move examples that are predicted correctly away from the testing split, leaving only those that are mispredicted. We evaluate our approach on Beer Review, Waterbirds, CelebA and MNLI. Empirical results show that ls is able to generate astonishingly challenging splits that correlate with human-identified biases. Moreover, we demonstrate that combining robust learning algorithms (such as group DRO) with splits identified by ls enables automatic de-biasing. Compared with previous state-of-the-art methods, ls substantially improves worst-group performance (by 23.4% on average) when the source of biases is unknown during training and validation.
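The weak-supervision idea in this abstract can be caricatured in a few lines: repeatedly pull correctly predicted examples out of the testing split, so the split concentrates on mispredictions and the generalization gap widens. This greedy loop is a simplification for illustration; ls itself learns a parametric splitter.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def greedy_split(X, y, n_rounds=10, seed=0):
        rng = np.random.default_rng(seed)
        in_train = rng.random(len(y)) < 0.5          # random initial split
        for _ in range(n_rounds):
            clf = LogisticRegression(max_iter=1000).fit(X[in_train], y[in_train])
            correct = clf.predict(X) == y
            moved = correct & ~in_train              # correctly predicted test examples
            if not moved.any():
                break
            in_train |= moved                        # keep only mispredictions in test
        return in_train                              # True: train split, False: test split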
-
Learning Stable Classifiers by Transferring Unstable Features. Yujia Bao, Shiyu Chang, and Regina Barzilay. In ICML 2022.
We study transfer learning in the presence of spurious correlations. We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task. However, we hypothesize that the unstable features in the source task and those in the target task are directly related. By explicitly informing the target classifier of the source task's unstable features, we can regularize the biases in the target task. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. On the target task, we cluster data from this representation, and achieve robustness by minimizing the worst-case risk across all clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task, outperforming the best baseline by 22.9% in absolute accuracy across 12 transfer settings. Our code is available at https://github.com/YujiaBao/Tofu.
2021
-
Predict then Interpolate: A Simple Algorithm to Learn Stable Classifiers. Yujia Bao, Shiyu Chang, and Regina Barzilay. In ICML 2021.
We propose Predict then Interpolate (PI), a simple algorithm for learning correlations that are stable across environments. The algorithm follows from the intuition that when using a classifier trained on one environment to make predictions on examples from another environment, its mistakes are informative as to which correlations are unstable. In this work, we prove that by interpolating the distributions of the correct predictions and the wrong predictions, we can uncover an oracle distribution where the unstable correlation vanishes. Since the oracle interpolation coefficients are not accessible, we use group distributionally robust optimization to minimize the worst-case risk across all such interpolations. We evaluate our method on both text classification and image classification. Empirical results demonstrate that our algorithm is able to learn robust classifiers (outperforms IRM by 23.85% on synthetic environments and 12.41% on natural environments). Our code and data are available at https://github.com/YujiaBao/Predict-then-Interpolate.
-
Disease spectrum of gastric cancer susceptibility genes. Sophia K McKinley, Preeti Singh, Kanhua Yin, Jin Wang, Jingan Zhou, Yujia Bao, Menghua Wu, Kush Pathak, John T Mullen, Danielle Braun, and Kevin S Hughes. Medical Oncology 2021.
Pathogenic variants in germline cancer susceptibility genes can increase the risk of a large number of diseases. Our study aims to assess the disease spectrum of gastric cancer susceptibility genes and to develop a comprehensive resource of gene-disease associations for clinicians. Twenty-seven potential germline gastric cancer susceptibility genes were identified from three review articles and from six commonly used genetic information resources. The diseases associated with each gene were evaluated via a semi-structured review of six genetic resources and an additional literature review using a natural language processing (NLP)-based procedure. Out of 27 candidate genes, 13 were identified as gastric cancer susceptibility genes (APC, ATM, BMPR1A, CDH1, CHEK2, EPCAM, MLH1, MSH2, MSH6, MUTYH-Biallelic, PALB2, SMAD4, and STK11). A total of 145 gene-disease associations (with 45 unique diseases) were found to be associated with these 13 genes. Other gastrointestinal cancers were prominent among identified associations, with 11 of 13 gastric cancer susceptibility genes also associated with colorectal cancer, eight genes associated with pancreatic cancer, and seven genes associated with small intestine cancer. Gastric cancer susceptibility genes are frequently associated with other diseases as well as gastric cancer, with potential implications for how carriers of these genes are screened and managed. Unfortunately, commonly used genetic resources provide heterogeneous information with regard to these genes and their associated diseases, highlighting the importance of developing guides for clinicians that integrate data across available resources and the medical literature.
-
Non-medullary thyroid cancer susceptibility genes: evidence and disease spectrum. Jingan Zhou, Preeti Singh, Kanhua Yin, Jin Wang, Yujia Bao, Menghua Wu, Kush Pathak, Sophia K McKinley, Danielle Braun, Carrie C Lubitz, and Kevin S Hughes. Annals of Surgical Oncology 2021.
Background: The prevalence of non-medullary thyroid cancer (NMTC) is increasing worldwide. Although most NMTCs grow slowly, conventional therapies are less effective in advanced tumors. Approximately 5-15% of NMTCs have a significant germline genetic component. Awareness of the NMTC susceptibility genes may lead to earlier diagnosis and better cancer prevention. Objective: The aim of this study was to provide the current panorama of susceptibility genes associated with NMTC and the spectrum of diseases associated with these genes. Methods: Twenty-five candidate genes were identified by searching for relevant studies in PubMed. Each candidate gene was carefully checked using six authoritative genetic resources: ClinGen, National Comprehensive Cancer Network guidelines, Online Mendelian Inheritance in Man, Genetics Home Reference, GeneCards, and Gene-NCBI, and a validated natural language processing (NLP)-based literature review protocol was used to further assess gene-disease associations where there was ambiguity. Results: Among 25 candidate genes, 10 (APC, DICER1, FOXE1, HABP2, NKX2-1, PRKAR1A, PTEN, SDHB, SDHD, and SRGAP1) were verified among the six genetic resources. Two additional genes, CHEK2 and SEC23B, were verified using the NLP protocol. Seventy-nine diseases were found to be associated with these 12 NMTC susceptibility genes. The following diseases were associated with more than one NMTC susceptibility gene: colorectal cancer, breast cancer, gastric cancer, kidney cancer, gastrointestinal stromal tumor, paraganglioma, pheochromocytoma, and benign skin conditions. Conclusion: Twelve genes predisposing to NMTC and their associated disease spectra were identified and verified. Clinicians should be aware that patients with certain pathogenic variants may require more aggressive surveillance beyond their thyroid cancer risk.
-
Disease spectrum of breast cancer susceptibility genes. Jin Wang, Preeti Singh, Kanhua Yin, Jingan Zhou, Yujia Bao, Menghua Wu, Kush Pathak, Sophia K McKinley, Danielle Braun, and Kevin S Hughes. Frontiers in Oncology 2021.
Background: Pathogenic variants in cancer susceptibility genes can increase the risk of a spectrum of diseases, which clinicians must manage for their patients. We evaluated the disease spectrum of breast cancer susceptibility genes (BCSGs) with the aim of developing a comprehensive resource of gene-disease associations for clinicians. Methods: Twelve genes (ATM, BARD1, BRCA1, BRCA2, CDH1, CHEK2, NF1, PALB2, PTEN, RECQL, STK11, and TP53), all of which have been conclusively established as BCSGs by the Clinical Genome Resource (ClinGen) and/or the NCCN guidelines, were investigated. The potential gene-disease associations for these 12 genes were verified and evaluated based on six genetic resources (ClinGen, NCCN, OMIM, Genetics Home Reference, GeneCards, and Gene-NCBI) and an additional literature review using a semiautomated natural language processing (NLP) abstract classification procedure. Results: Forty-two diseases were found to be associated with one or more of the 12 BCSGs for a total of 86 gene-disease associations, of which 90% (78/86) were verified by ClinGen and/or NCCN. Four gene-disease associations could not be verified by either ClinGen or NCCN but were verified by at least three of the other four genetic resources. Four gene-disease associations were verified by the NLP procedure alone. Conclusion: This study is unique in that it systematically investigates the reported disease spectrum of BCSGs by surveying multiple genetic resources and the literature with the aim of developing a single consolidated, comprehensive resource for clinicians. This innovative approach provides a general guide for evaluating gene-disease associations for BCSGs, potentially improving the clinical management of at-risk individuals.
2020
-
Few-shot Text Classification with Distributional Signatures. Yujia Bao*, Menghua Wu*, Shiyu Chang, and Regina Barzilay. In ICLR 2020.
In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns are transferable across learning tasks. However, directly applying this approach to text is challenging: words highly informative for one task may have little significance for another. Thus, rather than learning solely from words, our model also leverages their distributional signatures, which encode pertinent word occurrence patterns. Our model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. We demonstrate that our model consistently outperforms prototypical networks in both few-shot text classification and relation classification by a significant margin across six benchmark datasets (19.96% on average in 1-shot classification).
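A small sketch of the signature-to-attention mapping described above: per-word statistics are fed through a tiny network to produce attention scores used to pool word representations. The two statistics and layer sizes are illustrative placeholders for the paper's distributional signatures.

    import torch
    import torch.nn as nn

    class SignatureAttention(nn.Module):
        def __init__(self, n_stats=2, hidden=32):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(n_stats, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, word_embs, word_stats):
            # word_embs: (B, T, D); word_stats: (B, T, n_stats), e.g. idf-like features
            att = torch.softmax(self.mlp(word_stats).squeeze(-1), dim=-1)  # (B, T)
            return (att.unsqueeze(-1) * word_embs).sum(dim=1)              # (B, D)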
-
Natural language processing to facilitate breast cancer research and management. Kevin S. Hughes, Jingan Zhou, Yujia Bao, Preeti Singh, Jin Wang, and Kanhua Yin. The Breast Journal 2020.
The medical literature has been growing exponentially, and its size has become a barrier for physicians to locate and extract clinically useful information. Natural language processing (NLP), especially machine learning (ML)-based NLP, is a technology that potentially provides a promising solution. ML-based NLP relies on training a computational algorithm with a large number of annotated examples to allow the computer to "learn" and "predict" the meaning of human language. Although NLP has been widely applied in industry and business, most physicians are still not aware of the huge potential of this technology in medicine, and the implementation of NLP in breast cancer research and management is fairly limited. With a real-world successful project of identifying penetrance papers for breast and other cancer susceptibility genes, this review illustrates how to train and evaluate an NLP-based medical abstract classifier, incorporate it into a semiautomatic meta-analysis procedure, and validate the effectiveness of this procedure. Other implementations of NLP technology in breast cancer research, such as parsing pathology reports and mining electronic healthcare records, are also discussed. We hope this review will help breast cancer physicians and researchers to recognize, understand, and apply this technology to meet their own clinical or research needs.
2019
-
Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance. Zhengyi Deng*, Kanhua Yin*, Yujia Bao, Victor Diego Armengol, Cathy Wang, Ankur Tiwari, Regina Barzilay, Giovanni Parmigiani, Danielle Braun, and Kevin S. Hughes. JCO Clinical Cancer Informatics 2019.
PURPOSE: Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes, that is, penetrance, enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We have previously developed a natural language processing (NLP)-based abstract classifier which classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure. MATERIALS AND METHODS: We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene-cancer penetrance meta-analyses spanning 16 gene-cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage). RESULTS: Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts v 16,941 abstracts) in the amount of human review required. Overall coverage was 93% (132 of 142 studies identified) before reviewing references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies). CONCLUSION: We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies. NLP algorithms have promising potential for reducing human efforts in the literature review process.
-
Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes. Yujia Bao*, Zhengyi Deng*, Yan Wang, Heeyoon Kim, Victor Diego Armengol, Francisco Acevedo, Nofal Ouardaoui, Cathy Wang, Giovanni Parmigiani, Regina Barzilay, Danielle Braun, and Kevin S Hughes. JCO Clinical Cancer Informatics 2019.
PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations. METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated dataset for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) which learns a linear decision rule based on the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) which learns a complex nonlinear decision rule based on the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence. RESULTS: For penetrance classification, we annotated 3740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieves 88.93% accuracy (percentage of papers that were correctly classified) while the CNN model achieves 88.53% accuracy. For prevalence classification, we annotated 3753 paper titles and abstracts. The SVM model achieves 88.92% accuracy while the CNN model achieves 88.52% accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene-cancer associations and keep the knowledge bases for clinical decision support tools up to date.
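The SVM baseline described here is straightforward to reproduce in outline: a linear classifier over a bag-of-ngrams representation of titles and abstracts. The vectorizer settings and toy data are illustrative, not the study's exact configuration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams of title/abstract text
        LinearSVC(C=1.0),                      # linear decision rule
    )
    abstracts = ["Penetrance of BRCA1 mutations in ...", "Prevalence of CDH1 variants in ..."]
    labels = [1, 0]                            # 1 = relevant to penetrance
    clf.fit(abstracts, labels)
    print(clf.predict(["Cumulative cancer risk among mutation carriers ..."]))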
-
A Machine-Learning Based Drug Repurposing Approach Using Baseline Regularization. Zhaobin Kuang, Yujia Bao, James Thomson, Michael Caldwell, Peggy Peissig, Ron Stewart, Rebecca Willett, and David Page. Invited book chapter, In Silico Methods for Drug Repurposing: Methods and Protocols, Springer 2019.
We present the baseline regularization model for computational drug repurposing using electronic health records (EHRs). In EHRs, prescriptions of various drugs are recorded over time for various patients. At the same time, numeric physical measurements (e.g., fasting blood sugar level) are also recorded. Baseline regularization uses statistical relationships between the occurrences of prescriptions of some particular drugs and the increase or the decrease in the values of some particular numeric physical measurements to identify potential repurposing opportunities.
2018
-
Deriving Machine Attention from Human Rationales. Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay. In EMNLP 2018.
Attention-based models are successful when trained on large amounts of data. In this paper, we demonstrate that even in the low-resource scenario, attention can be learned effectively. To this end, we start with discrete human-annotated rationales and map them into continuous attention. Our central hypothesis is that this mapping is general across domains, and thus can be transferred from resource-rich domains to low-resource ones. Our model jointly learns a domain-invariant representation and induces the desired mapping between rationales and attention. Our empirical results validate this hypothesis and show that our approach delivers significant gains over state-of-the-art baselines, yielding over 15% average error reduction on benchmark datasets.
2017
-
Hawkes Process Modeling of Adverse Drug Reactions with Longitudinal Observational Data. Yujia Bao, Zhaobin Kuang, Peggy Peissig, David Page, and Rebecca Willett. In Machine Learning for Healthcare Conference 2017.
Adverse drug reaction (ADR) discovery is the task of identifying unexpected and negative events caused by pharmaceutical products. This paper describes a log-linear Hawkes process model for ADR discovery from longitudinal observational data such as electronic health records (EHRs). The proposed method leverages the irregular time-stamped events in EHRs to represent the time-varying effect of various drugs on the occurrence rate of adverse events. Experimental results on a large-scale cohort of real-world EHRs demonstrate that the proposed method outperforms a leading approach, multiple self-controlled case series (Simpson et al., 2013), in identifying benchmark ADRs defined by the Observational Medical Outcomes Partnership.
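The model's core quantity can be written down compactly: the occurrence rate of an adverse event is the exponential of a baseline plus decaying contributions from past prescriptions. The exponential kernel and parameter names below are illustrative; the paper specifies the exact parameterization.

    import numpy as np

    def intensity(t, times, drugs, w, b, tau=30.0):
        """Log-linear Hawkes intensity for one adverse event at time t.
        times/drugs: timestamps and drug ids of past prescriptions;
        w[d]: influence of drug d; b: baseline log-rate."""
        past = times < t
        excitation = sum(w[d] * np.exp(-(t - s) / tau)
                         for s, d in zip(times[past], drugs[past]))
        return np.exp(b + excitation)

    times = np.array([1.0, 5.0, 20.0])         # prescription times (days)
    drugs = np.array([0, 1, 0])                # drug id per prescription
    print(intensity(25.0, times, drugs, w=np.array([0.8, -0.3]), b=-2.0))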
2016
-
Rank Revealing Algorithms and their Applications. Yujia Bao. Bachelor Thesis, Shanghai Jiao Tong University, 2016.
As the era of big data arrives, rank-revealing QR (RRQR) factorization has more and more applications to rank-deficient problems such as subset selection, the least squares problem, and the total least squares problem. This thesis systematically studies RRQR algorithms. The main contributions are summarized as follows: 1. This thesis presents a systematic review of three kinds of widely used, high-performance RRQR algorithms. I extend some existing theorems and independently complete some parts of the theoretical analysis. 2. Based on the existing methods, I propose a new greedy strong RRQR algorithm for computing a strong RRQR factorization. The new algorithm greatly improves the time efficiency of the original algorithm. 3. I design a series of numerical experiments to show the computational characteristics of different kinds of algorithms. Theoretical analysis and numerical results show that both the original strong RRQR algorithm and the new greedy strong RRQR algorithm can guarantee a strong RRQR factorization, while the new algorithm is significantly faster than the original. For rank-deficient problems, RRQR factorization gives satisfactory computational accuracy while being much more efficient than the traditional method, which involves computing the SVD.
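The rank-revealing behavior the thesis studies can be seen with classical column-pivoted QR (Businger-Golub), which SciPy exposes directly; note this is the plain pivoted variant, not the strong RRQR algorithms analyzed in the thesis.

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 20))  # 50x20, rank 8
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    # The diagonal of R decays; small trailing entries expose the numerical rank.
    rank = int(np.sum(np.abs(np.diag(R)) > 1e-10 * abs(R[0, 0])))
    print(rank)         # 8
    print(piv[:rank])   # a well-conditioned column subset (subset selection)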