Online Corpus Construction of English Text Collection, Data Cleaning, and Similarity Analysis

Times cited: 0
Author
Wang, Huanyu [1]
Affiliation
[1] Tangshan Normal Univ, Tangshan 063000, Peoples R China
Keywords
TECHNOLOGY; DISCOURSE
DOI
10.1155/2022/3105790
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Corpora are used to analyze and study the characteristics of a target language. In language education, they play an increasingly essential role because of their large capacity, authenticity, rapid and accurate retrieval, and quick and easy statistics. At present, many universities are trying to apply textbook corpora to English teaching. However, most existing corpora are poorly shared. In addition, they may be limited to a specific textbook, so their retrieval and analysis results lack wide coverage. It is therefore necessary to develop a set of English corpora that is highly relevant, well shared, and easy to use by fully integrating existing teaching resources according to the characteristics of university English courses. In recent years, corpus-assisted English language teaching has gained widespread attention and exploration as computers have become more widely available. A corpus-based teaching model can effectively eliminate various drawbacks of traditional vocabulary teaching. A corpus contains a large amount of authentic language material, and its authenticity and practicality help students master and use English vocabulary in real contexts. Moreover, the new model of corpus-assisted English vocabulary teaching can greatly increase independent learning and cooperative activities, which strengthens students' internal motivation for learning. This study begins with a brief introduction to the concept and characteristics of corpora and explains the advantages of applying corpora in foreign language teaching.
The study then analyzes the shortcomings of existing corpora in university English education from the perspective of the current development and application of English corpora, and clarifies the importance of building a corpus of university English teaching materials. Next, the system's operating environment and main development techniques are determined according to the specific requirements of a corpus for university English textbooks. The overall design and detailed design of the corpus and its management system are then carried out on the chosen technology platform. In addition, the structure of the database tables is analyzed, the basic components and operating procedures of the system are introduced, and the functional modules of the system are designed. The automatic word and sentence separation methods for the original corpus, the corpus entry process, the cross-distance search of the corpus, and the statistical analysis of the search results are discussed in detail. In conclusion, this study builds an online corpus based on English text collection and data cleaning techniques.
Pages: 8