The detection and effect of social events on Wikipedia data-set for studying human preferences

Authors
Assuied, Julien [1]
Gandica, Yerali [2, 3]
Affiliations
[1] CY Tech Cergy Paris Univ, Cergy, France
[2] Univ Int Valencia VIU, Dept Math & Master Big Data, Valencia, Spain
[3] CY Cergy Paris Univ, CNRS, Lab Phys Theor & Modelisat, Cergy, France
Source
FRONTIERS IN BIG DATA | 2023 / Vol. 6
Keywords
human preferences; Wikipedia; outliers detection; possible bias; massive events;
DOI
10.3389/fdata.2023.1077318
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Several studies have used the Wikipedia (WP) data-set to analyse worldwide human preferences by language. However, those studies could suffer from bias related to exceptional social circumstances. Any massive event prompting exceptional editing activity on WP can be defined as a source of bias. In this article, we follow a procedure for detecting outliers. Our study is based on 12 languages and 13 different categories. Our methodology defines a parameter that is language-dependent instead of being externally fixed. We also study the presence of cyclic human behavior to evaluate apparent outliers. After our analysis, we found that the outliers in our data-set do not significantly affect the analysis of preferences by category among different WP languages. While investigating the possibility of bias related to exceptional social circumstances is always a safe measure before doing any analysis on Big Data, we found that, for the first ten years of the Wikipedia data-set, outliers do not significantly affect the use of the Wikipedia data-set as a digital footprint to analyse worldwide human preferences.
Pages: 6