We investigate the sizes of the bootstrap versions of the generalized likelihood ratio (GLR) test of Fan et al. [2001. Ann. Statist. 29, 153-193] and of the nonparametric model specification test of Li [1994. Discussion Paper No. 1994-1997, University of Guelph] and Zheng [1996. J. Econometrics 75, 263-289], henceforth the J(n) test, for the drift function in some continuous time models. Fan and Zhang [2003. J. Amer. Statist. Assoc. 98, 118-134] argued that the bootstrap-based GLR test is a powerful method for testing the specification of several one-factor continuous time models. However, if the size of a nonparametric specification test is unstable over a range of bandwidth values, it is difficult to judge the power of the test. Our simulation study shows that, in some standard finite sample situations, the bootstrap-based GLR test does not provide stable sizes over a grid of bandwidth values when testing the drift function of some continuous time models, whereas the J(n) test usually does. Furthermore, we consider a wild bootstrap-based GLR test, inspired by the wild bootstrap approach used for the J(n) test. We conclude that this modified method does not substantially improve the stability of the sizes. (C) 2005 Elsevier B.V. All rights reserved.