Extracting Predictive Models from Marked-Up Free-Text Documents at the Royal Botanic Gardens, Kew, London

Cited by: 0
Authors
Tucker, Allan [1]
Kirkup, Don [2]
Affiliations
[1] Brunel Univ, Dept Comp Sci, Uxbridge UB8 3PH, Middx, England
[2] Royal Bot Gardens Kew, Richmond, England
Keywords
LIFE;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we explore the combination of text mining, unsupervised learning and supervised learning to extract predictive models from a corpus of digitised historical floras. These documents deal with the nomenclature, geographical distribution, ecology and comparative morphology of the species of a region. Here we exploit the fact that portions of text in the floras are marked up as different types of trait and habitat. From these marked-up texts we infer models that predict habitat types from the traits of plant species. We also integrate plant taxonomy data to assist in the validation of our models. We show that by clustering the text describing habitat in different floras we can identify a number of important and distinct habitats that are associated with particular families of species, along with statistical significance scores for those associations. We also show that by using these discovered habitat types as labels for supervised learning we can predict them from a subset of traits identified using wrapper feature selection.
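The wrapper feature selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which classifier or search strategy was used, so the sketch assumes a 1-nearest-neighbour classifier scored by leave-one-out accuracy and a greedy forward search. The trait columns, habitat labels and data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def loo_accuracy(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the given feature columns."""
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the row being classified
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < best_dist:  # ties keep the earliest neighbour
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def forward_select(
    X: List[Row], y: List[int], max_features: int
) -> Tuple[List[int], float]:
    """Greedy forward wrapper selection: repeatedly add the feature that
    most improves the classifier's score, and stop when nothing improves it."""
    selected: List[int] = []
    best_score = 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        # Prefer the higher score; break ties by the lower column index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical trait matrix: columns are
# [succulent_leaves, floating_leaves, uninformative_trait].
X = [
    (1, 0, 0), (1, 0, 1),  # habitat cluster 0
    (0, 1, 0), (0, 1, 1),  # habitat cluster 1
    (0, 0, 0), (0, 0, 1),  # habitat cluster 2
]
# Hypothetical habitat-type labels, standing in for discovered clusters.
y = [0, 0, 1, 1, 2, 2]

selected, score = forward_select(X, y, max_features=3)
print(selected, score)
```

What makes this a wrapper method is that each candidate subset is scored by running the classifier itself, rather than by a classifier-independent statistic. On this toy data the search keeps the two informative columns and drops the third, because adding it lowers the leave-one-out accuracy.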
Pages: 309-320
Number of pages: 12