Unsupervised identification of text reuse in early Chinese literature

Times cited: 13
Author
Sturgeon, Donald [1]
Affiliation
[1] Harvard Univ, Fairbank Ctr Chinese Studies, Room S126,CGIS South Bldg,1730 Cambridge St, Cambridge, MA 02138 USA
DOI
10.1093/llc/fqx024
Chinese Library Classification (CLC) number
C [Social Sciences, General]
Discipline classification code
03; 0303
Abstract
Text reuse in early Chinese transmitted texts is extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving the work of multiple authors and editors. In this study, a fully automated method of identifying and representing complex text reuse patterns is presented, and the results are evaluated by comparison to a manually compiled reference work. The resultant data are integrated into a widely used and publicly available online database system with browse, search, and visualization functionality. These same results are then aggregated to create a model of text reuse relationships at a corpus level, revealing patterns of systematic reuse among groups of texts. Lastly, the large number of reuse instances identified makes possible the analysis of frequently observed string substitutions, which are found to be strongly indicative of partial synonymy between strings.
Pages: 670-684
Number of pages: 15