Contact Information

School of Information Management,
Wuhan University,
Wuhan, Hubei Province,
P.R. China, 430072

fuling@whu.edu.cn

Professor Mike Thelwall Gives a Lecture on the Application Potential and Risks of Large Language Models in Research Quality Evaluation

2025-06-04 16:43:13

On the afternoon of April 18th, Mike Thelwall, Professor at the University of Sheffield and winner of the Derek de Solla Price Medal (the highest honor in scientometrics and informetrics), delivered an academic lecture titled "How Effective are Large Language Models for Research Quality Evaluation?" in Meeting Room 412 of our school. He presented his team's latest research findings, revealing both the potential and the risks of applying large language models to research quality evaluation. The lecture was chaired by Professor Lin Zhang.


Mike Thelwall first introduced the potential of large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek for research quality evaluation. Working within the UK REF2021 framework, which takes originality, rigour, and scientific and social significance as its core criteria, he configured ChatGPT and Gemini to simulate the expert review process and tested their evaluations on 185,000 research papers. The results showed that although individual scores fluctuated between runs, averaging the scores over multiple runs substantially improved their stability and credibility, indicating that the models can partially capture the characteristics of research quality. In most disciplines, including library and information science, ChatGPT's evaluations outperformed traditional citation indicators.

Mike Thelwall then explored the risks of large language models in research quality evaluation. He asked ChatGPT to evaluate the fictional paper "Do squirrel surgeons generate more citation impact?", and it still awarded a four-star rating, failing to notice the absurdity of squirrels performing surgery and writing research papers. Yet when asked directly, "Can squirrels write papers?", the model firmly denied it, exposing a gap between the common-sense knowledge the model possesses and its failure to apply that knowledge during evaluation. In another study, he systematically examined ChatGPT's performance on retracted papers and found that the model almost completely ignored retraction information and still gave positive evaluations to erroneous or fabricated conclusions, posing a serious risk of hallucination. Presented with 217 retracted or questionable papers, the model failed to flag any retraction and gave medium-to-high ratings to most of them. When questioned about 61 retracted conclusions, nearly two-thirds of its answers affirmed the false content.

Mike Thelwall emphasized that although large language models bring new ideas to research quality evaluation, they must still be applied with caution. His studies found that ChatGPT issues no warning for retracted papers and can easily present erroneous conclusions as credible knowledge. Widespread adoption might also lead researchers to deliberately write in ways that cater to the models, raise copyright concerns over uploading papers, and suffer from unstable single-run scoring. He suggested that developers incorporate retraction detection into model training and review, and that users verify sources before citing LLM output and retain final expert review.

During the event, teachers and students interacted warmly in a lively atmosphere, and Mike Thelwall responded enthusiastically to their questions. The participants discussed topics such as the differences between large language models and traditional evaluation methods (such as peer review and citation indicators), as well as the ethics of applying large language models and the safeguards required.
