#84 Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy



  • Bargav Jayaraman
  • Chhavi Yadav
  • Joseph Near

[PDF] Submission (2MB) Sep 25, 2025, 10:45:20 UTC · dcce603030418c5bbf1a8e69dd5e920c528e16aaa6692059681857588b26031d

With the recent remarkable advancement of large language models (LLMs), there has been growing interest in applying them to domains with highly sensitive data that lie outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective: it assists LLMs by directly providing relevant information from external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is generating long, accurate answers within a moderate privacy budget. We address this by proposing an algorithm that spends privacy budget only on the tokens that require the sensitive information and uses the non-private LLM for the remaining tokens. Our extensive empirical evaluations show that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\varepsilon=10$ across different models and datasets.
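The abstract leaves the per-token mechanism implicit. As a rough illustration of the idea, here is a minimal Python sketch of one plausible instantiation: retrieved documents are split into disjoint partitions that each vote on the next token (so any single record influences at most one vote), agreement with the retrieval-free LLM is checked with a cheap noisy test, and a noisier private selection is paid for only on disagreement. The `llm` interface, the `eps_check`/`eps_select` split, the majority threshold, and the naive sequential composition of budgets are all assumptions of this sketch, not the authors' algorithm.

```python
import numpy as np
from collections import Counter

def noisy_vote_count(count: int, epsilon: float) -> float:
    """Laplace-noised count of partitions voting for a token.
    Sensitivity is 1: each record sits in exactly one partition."""
    return count + np.random.laplace(scale=1.0 / epsilon)

def noisy_argmax(tokens, counts, epsilon):
    """Report-noisy-max over the vote histogram via the Gumbel-max trick,
    equivalent to the exponential mechanism with utility = vote count."""
    noise = np.random.gumbel(scale=2.0 / epsilon, size=len(counts))
    return tokens[int(np.argmax(np.asarray(counts, dtype=float) + noise))]

def dp_rag_generate(llm, partitions, prompt, budget,
                    eps_check=0.1, eps_select=0.5, max_tokens=128):
    """Hypothetical DP-RAG decoding loop: privacy budget is spent only on
    tokens where the sensitive retrieved data is actually needed."""
    output, spent = [], 0.0
    threshold = len(partitions) / 2          # require a rough majority to agree
    for _ in range(max_tokens):
        prefix = prompt + "".join(output)
        public_tok = llm.next_token(prefix)  # retrieval-free: no privacy cost
        if spent + eps_check > budget:
            output.append(public_tok)        # budget exhausted: pure public decoding
        else:
            # One next-token vote per disjoint partition of the retrieved
            # documents (assumed llm interface taking a retrieval context).
            votes = Counter(llm.next_token(prefix, context=p) for p in partitions)
            spent += eps_check
            if noisy_vote_count(votes[public_tok], eps_check) >= threshold:
                output.append(public_tok)    # private consensus matches the public LLM
            elif spent + eps_select <= budget:
                spent += eps_select          # disagreement: pay for a private selection
                toks = list(votes)
                output.append(noisy_argmax(toks, [votes[t] for t in toks], eps_select))
            else:
                output.append(public_tok)
        if output[-1] == "<eos>":            # assumed end-of-sequence marker
            break
    return "".join(output), spent
```

This sketch charges the check on every token via naive composition; a sparse-vector-style test that makes agreeing tokens nearly free would track the abstract's claim of spending budget only on sensitive tokens more tightly, and is left out here for brevity.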

T. Koga, R. Wu, Z. Zhang, K. Chaudhuri

  • Privacy in machine learning
Check: Double-blind submission
Check: Prior Reviews
Check: Usage of LLMs
