Exact distribution of word counts in shuffled sequences

Cited by: 2
Authors
Rodland, EA [1]
Affiliations
[1] Univ Oslo, Rikshosp, Radiumhosp HF, Ctr Mol Biol & Neurosci, Inst Med Microbiol, N-0027 Oslo, Norway
Keywords
sequence shuffling; Markov chain; word count; exact distribution; hypergeometric distribution; generalised hypergeometric series; moment generating function; genome sequence analysis; directed graph; Euler path;
DOI
10.1239/aap/1143936143
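Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes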
020208 ; 070103 ; 0714 ;
Abstract
In DNA sequences, specific words may take on biological functions as marker or signalling sequences. These may often be identified by frequent-word analyses as being particularly abundant. Accurate statistics are needed to assess the statistical significance of these word frequencies. The set of shuffled sequences - letter sequences having the same k-word composition, for some choice of k, as the sequence being analysed - is considered the most appropriate sample space for analysing word counts. However, little is known about these word counts. Here we present exact formulae for word counts in shuffled sequences.
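The abstract's notion of a shuffled sequence (same k-word composition as the original) can be illustrated with a minimal sketch. It assumes that "k-word composition" means the multiset of overlapping length-k substrings; the function names `kword_counts` and `same_kword_composition` and the example sequences are hypothetical and not taken from the paper. The sketch only checks the sample-space condition and does not implement the paper's exact formulae.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count overlapping k-letter words in seq (assumed meaning of k-word composition)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def same_kword_composition(a, b, k):
    """Return True if a and b have equal length and identical k-word counts."""
    return len(a) == len(b) and kword_counts(a, k) == kword_counts(b, k)


# Hypothetical example: two sequences sharing the same 2-word composition.
original = "ACAGA"
shuffled = "AGACA"

print(same_kword_composition(original, shuffled, 2))  # same 2-word counts
print(kword_counts(original, 3)["CAG"], kword_counts(shuffled, 3)["CAG"])
```

In this toy case both sequences contain the 2-words AC, CA, AG and GA once each, yet the 3-word CAG occurs in one and not in the other, which is the kind of variation in word counts over the shuffled set that the abstract refers to.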
Pages: 116-133
Page count: 18