Large Language Models Based Stemming for Information Retrieval: Promises, Pitfalls and Failures

Cited by: 0
Authors
Wang, Shuai [1 ]
Zhuang, Shengyao [2 ]
Zuccon, Guido [1 ]
Affiliations
[1] Univ Queensland, Brisbane, Qld, Australia
[2] CSIRO, Brisbane, Qld, Australia
Funding
Australian Research Council;
Keywords
Large Language Model; Text Stemming; Text Pre-processing;
DOI
10.1145/3626772.3657949
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text stemming is a natural language processing technique used to reduce words to their base form, also known as the root form. In Information Retrieval (IR), stemming is used in keyword-based matching pipelines to normalise text before indexing and query processing, improving subsequent matching between document and query keywords. The use of stemming has been shown to often improve the effectiveness of keyword-matching models such as BM25. However, traditional stemming methods focus solely on individual terms and overlook the richness of contextual information. Recognising this gap, in this paper we investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability for context understanding. In this respect, we identify three avenues, each characterised by different trade-offs in terms of computational cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary of a collection, i.e., the set of unique words that appear in the collection (vocabulary stemming); (2) use LLMs to stem each document separately (contextual stemming); and (3) use LLMs to extract from each document entities that should not be stemmed, then use vocabulary stemming to stem the remaining terms (entity-based contextual stemming). Through a series of empirical experiments, we compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text. We find that while vocabulary stemming and contextual stemming fail to achieve higher effectiveness than traditional stemmers, entity-based contextual stemming can, under specific conditions, achieve higher effectiveness than using the Porter stemmer alone. Code and results are made available at https://github.com/ielab/SIGIR-2024-LLM-Stemming.
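To illustrate the shape of the "vocabulary stemming" avenue described in the abstract, the sketch below collects the collection's unique words, stems each exactly once, and applies the cached mapping at indexing time. The `toy_stem` suffix-stripping function is a hypothetical stand-in (purely for illustration) for the LLM-based or Porter stemmer used in the paper; it is not the authors' method.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a few common English suffixes.

    A real pipeline would call an LLM or a lexical stemmer (e.g. Porter) here.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_stem_map(documents: list[str]) -> dict[str, str]:
    """Stem the vocabulary (the set of unique words) exactly once per term.

    Stemming the vocabulary rather than every token keeps the number of
    stemmer calls proportional to vocabulary size, not collection size --
    the key cost trade-off of this avenue when each call is an LLM request.
    """
    vocab = {w for doc in documents for w in doc.lower().split()}
    return {w: toy_stem(w) for w in vocab}


docs = ["stemming reduces words", "words reduced to base forms"]
stem_map = build_stem_map(docs)
# Apply the cached mapping to normalise documents before indexing.
stemmed_docs = [" ".join(stem_map[w] for w in d.lower().split()) for d in docs]
```

Note that per-term stemming like this is exactly what cannot handle context-dependent cases (e.g. entities whose surface form should be preserved), which motivates the contextual and entity-based variants described above.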
Pages: 2492-2496
Number of pages: 5
Related Papers
50 records
  • [1] Promises and Pitfalls: Using Large Language Models to Generate Visualization Items
    Cui, Yuan
    Ge, Lily W.
    Ding, Yiren
    Harrison, Lane
    Yang, Fumeng
    Kay, Matthew
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2025, 31 (01) : 1094 - 1104
  • [2] INTEGRATING LARGE LANGUAGE MODELS INTO AN EXISTING REVIEW PROCESS: PROMISES AND PITFALLS
    Edwards, M.
    di Ruffano, L. Ferrante
    VALUE IN HEALTH, 2024, 27 (12)
  • [3] Stemming and Lemmatization for Information Retrieval Systems in Amazigh Language
    Samir, Amri
    Lahbib, Zenkouar
    BIG DATA, CLOUD AND APPLICATIONS, BDCA 2018, 2018, 872 : 222 - 233
  • [4] Using Large Language Models for Math Information Retrieval
    Mansouri, Behrooz
    Maarefdoust, Reihaneh
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2693 - 2697
  • [5] Probabilistic language models in cognitive neuroscience: Promises and pitfalls
    Armeni, Kristijan
    Willems, Roel M.
    Frank, Stefan L.
    NEUROSCIENCE AND BIOBEHAVIORAL REVIEWS, 2017, 83 : 579 - 588
  • [6] Lexicon-free stemming for Kazakh language information retrieval
    Tukeyev, Ualsher
    Turganbayeva, Aliya
    Abduali, Balzhan
    Rakhimova, Diana
    Amirova, Dina
    Karibayeva, Aidana
    2018 IEEE 12TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT), 2018, : 95 - 98
  • [7] Large Language Models and Future of Information Retrieval: Opportunities and Challenges
    Zhai, ChengXiang
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 481 - 490
  • [8] Language models for information retrieval
    Croft, WB
    19TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2003, : 3 - 7
  • [9] Steering Large Language Models for Cross-lingual Information Retrieval
    Guo, Ping
    Ren, Yubing
    Hu, Yue
    Cao, Yanan
    Li, Yunpeng
    Huang, Heyan
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 585 - 596
  • [10] Czech Information Retrieval with Syntax-based Language Models
    Strakova, Jana
    Pecina, Pavel
    LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010,