This is Jessica. LLM-generated reviews are encroaching on review processes in computer science, such as the big machine learning conferences. But sometimes you don’t necessarily have to submit your paper to get an LLM review delivered to your inbox. Yesterday I received this email:
Hi there,
Congratulations on your recent preprint on arXiv, titled “Measure-Observe-Remeasure: An Interactive Paradigm for Differentially-Private Exploratory Analysis”. We are grateful for your hard work and dedication to the field, and we value your contributions!
We are part of a team from Northwestern University, Stanford University, and Cornell University, committed to providing research feedback to scholars with the assistance of advance AI models. We have followed your work closely and, upon a thorough examination, generated the suggestions below. These suggestions cover various aspects of your work, including the writing style, research design, and title. We hope they offer you fresh insights that may enhance the depth and impact of your research.
To view detailed comments regarding your research, please visit this link: https://feedback.kellogg.northwestern.edu/7SUGDR.html
Once again, congratulations on your achievement. We are certain that your work will have an impact on the future of your field and will inspire fellow researchers worldwide.
Should you have any questions, concerns, or suggestions, please do not hesitate to reach out to us at [email protected]. Your insights would be invaluable in helping us better support the community’s research development.
Best regards,
Feedback Team
The link they provide lists a bunch of critiques and associated recommendations for our paper, which proposes a constrained framework for querying data protected by differential privacy when you don’t know all your queries in advance. Many suggestions are to add more detail, ranging from reiterating definitions, to more description of the algorithm design and visualization interface, to adding more evaluations. Some propose including what would probably end up being another conference-length paper, like “Conduct a thorough comparative analysis with existing DP tools and techniques, highlighting where and why the MEASUREOBSERVEREMEASURE paradigm outperforms others.” A few suggestions seem off base, like “The criteria for evaluating participant performance in the user study lack depth.” We use a rational agent framework where measures are well defined in the context of statistical decision theory, far more so than in your average user study, so I’m not sure what it means to say they “lack depth.”
Overall, my appraisal is that many of the suggestions are reasonable, if high level. What really seems lacking though is awareness of the context in which papers like ours are produced. For one, most computer science conferences have strict page limits. For IEEE S&P (where this one is published) it’s 15 pages including references and appendices. Cramming a bunch more screenshots in might be helpful to some readers, but we’d be creating a different paper for different goals. There is also the dimension of knowing your audience. Different publication venues are associated with different implicit knowledge of what should be highlighted, what you can assume is understood, etc. In our case, reiterating what the privacy budget is seems unlikely to be necessary for the audience this venue draws. I’m all for the idea of making research papers more widely accessible, but trying to be as comprehensive or accessible as possible is just one subgoal of many that you’re trying to trade off when you write.
This is not to say the paper we posted is perfect, by the way. There are still many ways it could be improved post-publication. I just don’t find this particular advice very helpful at this point in time, and wish they hadn’t wasted the resources on it (including the time it took me to read through the suggestions). It seems better suited to providing feedback to newer researchers like students as they are drafting their papers.
My favorites are the title suggestions. The paper is called: “Measure-Observe-Remeasure: An Interactive Paradigm for Differentially-Private Exploratory Analysis.” Obviously we should have called it:
- Interactive Differential Privacy in Exploratory Analysis: The MEASURE-OBSERVE-REMEASURE Framework,
- Adapting Differential Privacy Budgets in Exploratory Analysis: The MEASURE-OBSERVE-REMEASURE Workflow,
- Maximizing Utility in Differential Privacy: A MEASURE-OBSERVE-REMEASURE Approach,
- Efficient ϵ Allocation Strategies in Exploratory Data Analysis through Interactive Differential Privacy,
- Interactive Visualization for Optimal Differential Privacy in Exploratory Analysis
P.S. Northwestern is my institution, but I didn’t see any information about who is behind this. I’m curious about their motivation, and what base model and version they are using for the feedback generation. Even better, share the full details on the pipeline so I can at least learn something about how it was produced.
I didn’t see anything definitive saying it was generated by an LLM, although much of the format and writing certainly seems like that. So, I am wondering how you are sure it was an LLM-generated review. Assuming it is, I don’t see this as a good use for AI. If the purpose is to check grammar, suggest titles, or other rote tasks, then I am fine with it. But I hope for reviews to be substantive and written by people well versed in the field (which sadly is not always the case). LLMs could provide some useful content and might even be able to fool me into thinking it was written by humans. But the probability of getting an insightful or meaningful review just seems too low to me. It is similar to people who advocate using LLMs to grade student work. I find it is the least meaningful parts of teaching that can be handed to an LLM – the roles of mentoring and directing I think should be done by humans. Even if the humans are “worse” than the LLM (just as human reviewers can be worse than an LLM), I still think this is a task for humans, not computers.
This strikes me as an example of what bothers me about AI – using it for inappropriate tasks.
The disclaimer in fine print said the grammar check was LLM generated, and the cover letter described it as “AI-assisted”. Those were true positive errors; it really did find a lot of run-on sentences. Disclaiming only part of it seems to imply the rest wasn’t LLM generated, but I can’t think of any reason why anyone would be doing something like this manually, especially when the result looks just like something LLM-generated.
My problem with the run-on sentences it claimed to find is that they are not actually run-on sentences. A run-on sentence is typically defined as a sentence where you have multiple independent clauses with no punctuation between them (e.g., ‘He was a great chatbot he made some suggestions’). Most of what it flagged are sentences with complex construction that could be split up, but are not grammatically incorrect.
Despite years of linguistics training, I didn’t realize what people meant by “run-on sentence” until just now. You’re right—your examples flagged as such were not run-on sentences. They might be dispreferred stylistically for packing too much information into subordinate clauses, but they’re not ungrammatical. The ungrammatical sentence it did catch was marked “sentence fragment,” but it’s something I would have called a “run-on sentence” before I learned what that phrase meant!
“But the probability of getting an insightful or meaningful review just seems too low to me. ”
Given the general quality of research work done by humans, I’d say the probability of a thorough, insightful review by another human is modest at best.
That’s not the point. I completely agree that some human reviews offer less insight than you’d get from AI. But if your criterion for using AI is whether it is “better” than some humans, or even the average human, then what role exactly do you think humans will play in the future? I want judgement to be done by humans, whether they are fallible or not. It is their poor judgement – and accountability for that judgement – that is the path to improving it. I’m fine with using AI for non-judgemental tasks, but not for things like “insights” in a manuscript review.
People talk a lot about “accountability” but unless I’m misunderstanding what that means, I think it’s synonymous with “ability to punish”.
like “the CEO was held accountable for the deaths of the patients who took the drug after it was discovered he had requested certain research be suppressed” is another way of saying “we put him in jail”
Is the ability to put a person in jail, or take away their source of income, or fine them or whatever the real measure of whether something/someone should be making decisions?
Does the fact that we can exact punishment actually make it any better when the decision making process fails?
What we really want is “good decisions” not “ability to jail people who screwed up”.
I don’t think accountability necessarily implies punishment, although it could. When I say someone is “accountable” I mean that if they make poor decisions, then they accept responsibility. It could mean they get punished through some official act, but it could simply be that they acknowledge (to themselves most importantly) that they erred. They “have skin in the game.” The problem with just wanting “good decisions” is that we may already be (or soon be) in a world where AIs can make better decisions than humans. Does that mean such decisions should be made by AIs? Could and should are different things.
If we think about all those philosophical queries (like the trolley problem, which I don’t care for), humans may make foolish decisions like devoting excessive resources to saving one person’s life. An AI may see the foolishness and say that those resources would save far more lives used in other ways. Which is a “better” decision? I believe there is no simple answer that will be appropriate for all circumstances. I want a human to make such a decision and feel responsible for it. Of course, a human could program a computer to reach the same decision and I’m fine with that as long as it is clear who programmed it that way. What I don’t want is for an anonymous AI to make such decisions – it reduces humans to passive recipients of a machine generated world, much like the humans floating around in Wall-E.
Dale:
I agree with you that accountability does not necessarily imply punishment.
Recall the concept from parenting of “natural consequences.”
Where’s the line of “trusting” the machine and not needing a person in the loop? Thermostats making decisions on temperature? Autopilot making decisions on altitude? Google providing search query results? Spell correction or grammar correction?
Bob
That’s why I used the word “judgement.” Admittedly, I don’t know how to define a clear line between where the machine can be trusted and where I want the human in the loop. But what I am feeling is that the line is more philosophical than technical: at the heart of it is something about what it means to be human (which I know philosophers have said much about and which I know little about). I don’t think a human must decide where to set the thermostat – actually that is a good example. I do think a human needs to decide how the thermostat should adjust to pre-determined settings such as day of week, time, outside weather conditions, etc. Having AI decide these things for a household without the human’s input seems wrong to me – I really don’t want Google or Microsoft deciding how those should be set. I already don’t like the many things that these companies decide on my behalf – and I don’t know who (or how) to hold someone responsible for the choices that are made on my behalf.
Now, I’m fairly sure that if we dig deeper there are many machine defaults that I am not even aware of and that serve my convenience – for example, many of the things about my automobile (which is not self-driving yet). But if I don’t like the way the manufacturer has set these things, then I can choose not to purchase that brand. It seems to me that many of the AI use cases go beyond what I can avoid by choosing. For example, if a company uses AI to screen job applicants, I may not even know that so I may not simply be able to choose to apply elsewhere. We don’t (yet) even have disclosure requirements for the use of AI.
I just don’t really understand the concept of “accountability” without some kind of punishment.
Person who is hit by bicyclist: “Damn you! I’m holding you accountable!!”
Bicyclist: “you do that!! bye!” (rides off)
What does it mean to hold a Bicyclist accountable for running you over if not to impose some kind of costs on the Bicyclist?
Don’t get me wrong. I’m 100% with you on not wanting some anonymous “other” to be deciding what the temperature should be in my house. I’ll set the thermostat level, or at least make choices about the algorithm, and then the computer can keep the temp at the level I agreed to.
This is why I hate Facebook and Twitter and YouTube, but I like Mastodon… On the “platforms” I see what *they* show me. On Mastodon, I see a consistent thing that I choose (everything certain people I followed say in chronological order). I’d be happy to have some other alternatives, and I kind of do, there are lists and filters and blocks and such, but basically all of it is choosable by *ME*.
Same for why I use ad-blockers and Firefox. I don’t want my browser sending data to Google, I don’t want ad people running code on my machine. I consider the web a hostile environment where every single website I go to is partially something I chose to see and partially a hacker compromising my computer security by remote-running software on my browser.
In any case, I understand having control, having a human express the goals and utilities, etc. What I don’t know is how “accountability” really helps unless it’s about imposing costs… Even then the costs imposed have to outweigh the benefits enough that the person second-guesses doing the bad thing.
Daniel
I think the problem may be definitional. Accountability should mean there are costs and benefits attached to decisions. Punishment, the way you have used it, sounds like it requires some form of legal penalty. For me, it need not be. Suffering humiliation can be a consequence of poor decision making. If you want to call that punishment, then I don’t see any disagreement. But if you are saying that accountability requires imprisonment or some other particular form of legal punishment, then I don’t agree. For people with consciences (assuming there are any left), punishment can take many forms.
Dale, sure I really just mean “negative consequences” which would cause the person to think twice about causing harms.
All this breaks down for the group of 1-10% of the population with considerable primary and/or secondary psychopathy traits.
Daniel
some of whom are running for national office (and you may have underestimated the percentage)
Dale, in the general population I think yeah probably 1-10% but in the population of politicians, lawyers, police officers, special ops soldiers, and university professors I think at least 25-40%.
Whenever AI is announced as outperforming humans at a certain task, I wonder if that’s a credit to AI or an indictment of the quality of work of the humans assigned to that task.
Hi, Adede. In my experience, it’s a combination of both. When I was at Bell Labs and worked on automatic call routing from speech input for a call center in the 1990s, we were competing with overworked (they were strictly timed on calls), undertrained (there was little training and huge turnover because the jobs were so awful because of the overwork) humans who had to try to route callers to several hundred finely discriminated destinations at a vast financial institution. This was something where even with the crude tools we had in the 1990s, we could outperform humans in answering the query “How can I direct your call?” Of course, if you wanted to say anything other than a destination, you got nothing other than examples of things you could try saying that might work in routing you to a destination. So nobody was calling it “AI” (plus, you’d get laughed out of the room if you said “AI” in the 1990s).
LLMs are a whole different beast. The toy we built in the 1990s is a joke compared to modern LLMs. I’m pretty sure a decently long context LLM could solve the call routing problem nearly flawlessly given the instruction materials for the call center employees.
I really dislike that this starts “We have followed your work closely…”, when there is zero evidence they did more than enter it into an automated system. And obviously they should identify themselves. Of course, they’re probably in computer science and never thought to talk to their universities’ IRBs to get permission for this kind of time-wasting spam broadcast.
My favorite part was the (literally fine print) disclaimer about the one task for which ChatGPT is clearly better than humans—grammar:
Seriously? What’s the intended use for the rest of their suggestions?
P.S. We’re pretty sure one of our NeurIPS reviews was LLM-generated.
One of the best use cases of LLM chatbots is to solve the problem of not having enough spam; the second best is obscuring spam as if it were something genuinely different from mindless spam.
That was a challenge, but I’m pretty sure the spam is from one or more of the following people:
Liang, Weixin and Zhang, Yuhui and Cao, Hancheng and Wang, Binglu and Ding, Daisy and Yang, Xinyu and Vodrahalli, Kailas and He, Siyu and Smith, Daniel and Yin, Yian and McFarland, Daniel and Zou, James.
They’re from Cornell, Northwestern, and Stanford, and previously wrote this paper:
Can large language models provide useful feedback on research papers? A large-scale empirical analysis.
https://arxiv.org/pdf/2310.01783
I found this paper through the lead author’s GitHub repo for doing something nearly identical to what was sent to Jessica:
https://github.com/Weixin-Liang/LLM-scientific-feedback
P.S. My strategy is always to quote something from the text to find the code that generated it. The search that worked this time was the Google query:
"significance and novelty" site:github.com
Since the GitHub repo is public, someone else entirely could in principle just be running their code. (Although it’s a little unclear why they would do that.)
That’s why I only said “pretty sure.” The code was modified from the GitHub. The email came with an announcement that the senders were from the same institutions, but that could’ve been a lie or even a coincidence.
I wrote off to some folks I know in NLP at Stanford to see if they really were responsible for this. I suspect I won’t get a response.
Wow, thanks! I do know a few of these people.
Which license was chosen? I wonder if they need to email you something if they decide to use your article as training data or whatever: https://info.arxiv.org/help/license/index.html
Good point. Jessica’s arXiv paper comes with this horrendous license from IEEE:
This means Jessica would have had to get permission from IEEE to post the paper legally on arXiv.
What strikes me is the spammy, perhaps AI-generated, language of the email. The tone is the same as those invitation emails that come from predatory conferences and publishers. If I received such an email – it was unsolicited, correct? – I would have just deleted it for fear of the link taking me somewhere useless at best and nefarious at worst. You would think that a bunch of academics would realize the importance of personal connection when wanting others to engage with your work. Otherwise, I’m thinking they send out thousands of these emails hoping a handful click through. Perhaps I’m naive, but I just wouldn’t think most academics would want to interact with other academics that way.
“… proposes a constrained framework for querying data protected by differential privacy when you don’t know all your queries in advice”
advice = advance? Seems obvious, but just making sure.
Oops yes in advance.