SaTML 2026 Paper #84 Reviews and Comments
===========================================================================
Paper #84  Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy


Review #84A
===========================================================================

Overall merit
-------------
4. Accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper studies private Retrieval-Augmented Generation (RAG). In this setting, the large language model is assumed to be trained on public data and, during inference, accesses sensitive data via RAG in a differentially private manner. The authors introduce two mechanisms: DPVoteRAG and its sparse variant, DPSparseVoteRAG. The sparse variant is based on the sparse vector technique, combined with a Limited-Domain mechanism, which lets the model spend privacy budget only when a RAG-generated token differs from the most likely LLM token, thereby improving accuracy.

Strengths (Reasons to accept)
-----------------------------
I find the topic very timely and of strong interest to the privacy community. The proposed sparse algorithm is elegant, and the paper is well written with no obvious typos. To the best of my knowledge, this is the first work to introduce and study differentially private Retrieval-Augmented Generation. The authors conduct an extensive experimental evaluation across multiple datasets and language models, showing that, at least in the low-privacy regime, the proposed method outperforms an LLM without RAG support.

Weaknesses (Reasons to reject)
------------------------------
The sparse algorithm has already been used in other LLM-related contexts, for example in [2], and therefore offers limited novelty.

[2] Kareem Amin, Alex Bie, Weiwei Kong, Alexey Kurakin, Natalia Ponomareva, Umar Syed, Andreas Terzis, and Sergei Vassilvitskii. Private Prediction for Large-Scale Synthetic Text Generation.

The paper has been reviewed before, and some of those reviewers raised concerns about the use of ε = 10, which corresponds to a very weak level of privacy. I do not consider this a serious problem, however; it is a natural consequence of academic evaluation, where it is difficult to find auxiliary data that an LLM has not already seen during training yet still benefits from significantly when it is retrieved. I would expect that in a more challenging setting, where the auxiliary data is essential for answering the query, the method would still benefit from it even under a high-privacy regime. That said, it is necessary to keep looking for more challenging settings and less trivial data.

Comments for authors
--------------------
To further strengthen the paper's contribution, I encourage the authors to evaluate their method on the VaultGemma model available on Hugging Face, as an example of a fully private pipeline in which the model itself has also been trained with differential privacy. It would also be advisable to include a comparison with the concurrent recent work "Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)" by Junki Mori, Kazuya Kakizaki, Taiki Miyagawa, and Jun Sakuma, which studies a related approach to private RAG.
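To check that I understood the sparse mechanism, here is how I read one decoding step of the gating. This is my own minimal sketch, not the authors' code: the function name, the Laplace noise scales, and the threshold handling are assumptions in the spirit of the standard sparse vector technique, and the per-voter sensitivity of 1 relies on the paper's partition argument.

    import numpy as np

    def sparse_vote_step(rag_votes, public_token, eps_svt, threshold, rng):
        # rag_votes: tokens proposed by the m voters, each prompted with its own
        #            disjoint group of retrieved documents.
        # public_token: the token the LLM would emit with no retrieved documents.
        # Under the paper's partition argument, one individual's record changes at
        # most one voter, so the disagreement count has sensitivity 1.
        disagreement = sum(tok != public_token for tok in rag_votes)
        noisy_threshold = threshold + rng.laplace(scale=2.0 / eps_svt)
        noisy_count = disagreement + rng.laplace(scale=4.0 / eps_svt)
        if noisy_count >= noisy_threshold:
            return None          # above threshold: pay budget for a private vote (e.g., LimitedDomain)
        return public_token      # below threshold: emit the public token at no extra per-token cost

If this reading is inaccurate, explicit pseudocode for a single decoding step would help future readers.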
Review #84B
===========================================================================

Overall merit
-------------
1. Reject

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper studies retrieval-augmented generation (RAG) under privacy constraints. The setup is as follows: there is a private dataset D containing sensitive information. A user provides a prompt x. We then retrieve the top elements from D that help the LM answer x. The main idea is to use a sample-and-aggregate framework. Assume a retriever R that, given x and D, returns mk elements. These are partitioned into m groups of size k, and each group is fed to one of m voters (independent runs of the same LLM). The final answer is produced via a differentially private aggregation (LimitedDomain) over the voted tokens; a variant adds sparse-vector gating to spend privacy budget only when the RAG token differs from the non-RAG baseline.

Strengths (Reasons to accept)
-----------------------------
I think the problem is interesting and relevant, and the paper would be a nice contribution. However, I have a major concern about the privacy proof.

Weaknesses (Reasons to reject)
------------------------------
Main concern (privacy proof): The proof of Theorem 2 implicitly assumes a 1-stability property of the retriever that does not hold in general. In Algorithm 2, line 3, the method retrieves the top-$mk$ documents $D_x = R(x, D; mk)$ and $D_x' = R(x, D'; mk)$. The proof then states that $D_x$ and $D_x'$ "have at most one different document", and on that basis concludes that after the random partition, at most one subset changes and the token histogram differs by at most one count. This is generally false for top-k retrieval: even when only one individual's record changes, inserting a higher-scoring document into the top-$mk$ typically pushes some other document out. With retrievers whose scores depend on corpus-level statistics, many rankings can shift, so the difference between $D_x$ and $D_x'$ can be larger still. Consequently, more than one voter subset can change, and the histogram's sensitivity is greater than 1, invalidating the "one-token change" argument the proof relies on. Possible fixes: (i) assume and state a bounded retrieval stability and scale the noise/analysis to sensitivity $c$; (ii) make retrieval itself DP (e.g., exponential mechanism / report-noisy-max) and then compose with LimitedDomain.
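For concreteness, fix (ii) could look roughly like the following minimal sketch (my own illustration, not the paper's algorithm): documents are selected one at a time with the exponential mechanism, implemented via the Gumbel-max trick, and the private selection is then composed with the LimitedDomain aggregation. The even budget split across picks and the assumed bound on how much one record can move any score are placeholders; tighter accounting is surely possible.

    import numpy as np

    def dp_top_k_retrieval(scores, k, eps, sensitivity, rng):
        # scores[i]: relevance score of document i for the query; one individual's
        # record is assumed to change any single score by at most 'sensitivity'.
        scores = np.asarray(scores, dtype=float)
        eps_per_pick = eps / k              # naive even split of the retrieval budget
        chosen = []
        for _ in range(k):
            # Gumbel-max trick: argmax of scaled scores plus Gumbel(0,1) noise samples
            # from the exponential mechanism with utility 'scores'.
            noisy = (eps_per_pick / (2.0 * sensitivity)) * scores + rng.gumbel(size=scores.shape)
            noisy[chosen] = -np.inf         # never re-select an already chosen document
            chosen.append(int(np.argmax(noisy)))
        return chosen

    rng = np.random.default_rng(0)
    print(dp_top_k_retrieval(scores=[0.9, 0.8, 0.75, 0.4], k=2, eps=1.0, sensitivity=1.0, rng=rng))

With a step of this kind, the sensitivity of the downstream token histogram would no longer hinge on an unstated stability property of the retriever.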
Review #84C
===========================================================================

Overall merit
-------------
1. Reject

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper proposes a differentially private algorithm for RAG. It includes two algorithms, one utilizing SVT, and an empirical evaluation of utility and privacy over three datasets.

Strengths (Reasons to accept)
-----------------------------
+ Important and timely research question
+ The math seems correct
+ Allows an arbitrary RAG selector and ranker (though this is implicit)
+ Includes an empirical evaluation of utility and privacy risk

Weaknesses (Reasons to reject)
------------------------------
- The paper is not self-contained; central algorithms from previous work are not defined (not even in the appendix: LimitedDomain and S²MIA), and details of the evaluation protocol are missing (e.g., the number of records and queries for each dataset)
- The empirical results do not support the claims (see below for details)
- No clear comparison to other DP-ICL methods (e.g., [2])

Comments for authors
--------------------
1. My overall conclusion from Figure 3 (utility results for the best-performing hyperparameters) is that there is no need for the private data at all for a reasonable value of ε (say, up to and including 10). For the strongest LLM (Llama 3.1 8B, which is still a rather "weak" LLM in general), the gap between using RAG (orange) and not using RAG (blue) is rather minimal for the Trivia (≤3%) and ChatDoctor (<1.5%) datasets. This suggests that the LLM can answer these questions without access to the external dataset, so it is not clear whether the evaluation used in the paper can measure any advantage of using the private data. (On a side note, I think the choice of y-axis ticks is misleading when the panels are arranged together in the same figure, especially for ChatDoctor.) There are certainly important settings where the private data could have a great impact on RAG performance, but the evaluation in the paper, probably due to the choice of datasets, does not allow such an advantage to surface.

2. I greatly appreciate the inclusion of an empirical MIA assessment. However, I think the threat model is rather weak. My understanding is that the DP algorithm provides protection against any selector/ranker of records for the RAG. Therefore, the MIA should be executed against the worst-case selector, i.e., one that includes the true members in the selected examples under the IN condition. The paper does not describe the attack in detail, so it is impossible to know whether this is what was done.

3. I would appreciate a more detailed discussion comparing this method to other DP-ICL/inference approaches (e.g., [2]), and perhaps even their inclusion in the empirical evaluation.


Review #84D
===========================================================================

Overall merit
-------------
2. Weak reject

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper presents algorithms for protecting the privacy of documents used in retrieval-augmented large language models (LLMs). The baseline algorithm partitions the document collection into subsets, uses each subset as knowledge augmentation during token generation, and then privately selects the most frequent token. The improved algorithm strengthens the privacy guarantee through subsampling and the sparse vector technique.

Strengths (Reasons to accept)
-----------------------------
The sparse vector technique is elegant.

Weaknesses (Reasons to reject)
------------------------------
The primary concern lies in how the privacy budget is consumed across both tokens and prompts. Specifically, the privacy cost accumulates not only over the tokens generated for a single prompt but also across different prompts that access the same document. For example, if each query of a document uses a privacy parameter of $\epsilon_0 = 1$, then querying the same document $T$ times results in an overall privacy guarantee of $\epsilon \in O(\sqrt{T} \, \epsilon_0)$. This means that for a total privacy budget of $\epsilon = 10$, the document can support only roughly $T = 100$ queries before exhausting its budget and needing to be removed from the knowledge dataset. Such a constraint appears impractical in real-world LLM deployments, where documents may be queried an unbounded number of times, even when subsampling or sparse vector techniques are applied.
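To make the accounting concrete, here is a minimal back-of-the-envelope sketch (my own, not taken from the paper): it counts how many prompts can reuse the same document under the strong-composition bound before a total budget of $\epsilon = 10$ is exhausted, assuming an illustrative per-prompt cost $\epsilon_0$ and slack $\delta'$.

    import math

    def strong_composition_eps(eps0, T, delta_slack=1e-6):
        # Strong composition (Dwork & Roth): T adaptive eps0-DP queries are
        # (eps', T*delta0 + delta_slack)-DP with eps' as below.
        return math.sqrt(2 * T * math.log(1 / delta_slack)) * eps0 + T * eps0 * (math.exp(eps0) - 1)

    budget = 10.0      # total budget from the example above
    eps0 = 0.1         # assumed per-prompt cost; the paper's own accountant may differ
    T = 0
    while strong_composition_eps(eps0, T + 1) <= budget:
        T += 1
    print(T)           # on the order of two hundred prompts for these parameters

The exact number depends on the accountant and the per-prompt cost, but the point stands: the number of prompts that can draw on a given document is finite and modest, whereas real deployments need it to be effectively unbounded.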