Recently, vision Transformers (ViTs) have achieved competitive performance on many computer vision tasks. However, vision Transformers perform worse than Convolutional Neural Networks (CNNs) when trained from scratch on small datasets, which is commonly attributed to their lack of locality inductive bias. This impedes the application of vision Transformers to small datasets. In this work, we propose the Adaptive Hybrid Vision Transformer (AHVT) as a solution to boost the performance of vision Transformers on small-scale datasets. Specifically, along the spatial dimension, we exploit a Convolutional Overlapping Patch Embedding (COPE) layer to inject the desirable inductive bias into the model, forcing it to learn local token features. Along the channel dimension, we insert an adaptive channel feature aggregation block into the vanilla feed-forward network to calibrate channel responses. Meanwhile, we add several extra learnable "cardinality tokens" to the patch token sequence to capture cross-channel interactions. We present extensive experiments to validate the effectiveness of our method on five small/medium-scale datasets, including CIFAR-10/100, SVHN, Tiny-ImageNet, and ImageNet-1k. Our approach attains state-of-the-art performance on the four small datasets above when trained from scratch.
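To illustrate the overlapping patch embedding idea mentioned above, the following is a minimal PyTorch sketch. It is not the paper's exact configuration: the class name, embedding dimension, and kernel/stride/padding choices are assumptions made purely for illustration; the key point is that a convolution with stride smaller than its kernel size produces overlapping patches and thus injects locality into the token sequence.

```python
import torch
import torch.nn as nn

class ConvOverlappingPatchEmbed(nn.Module):
    """Sketch of a convolutional overlapping patch embedding (COPE-style) layer.

    Overlap comes from using a convolution whose stride is smaller than its
    kernel size, so neighbouring patches share pixels and local structure is
    preserved in the resulting tokens. Hyperparameters here are illustrative.
    """

    def __init__(self, in_chans=3, embed_dim=192, kernel_size=7, stride=4, padding=3):
        super().__init__()
        # stride < kernel_size => overlapping receptive fields between patches
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=kernel_size, stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, embed_dim, H', W')
        x = x.flatten(2).transpose(1, 2)       # (B, N, embed_dim) token sequence
        return self.norm(x)


# Example: a 32x32 CIFAR-style image becomes an 8x8 = 64-token sequence.
tokens = ConvOverlappingPatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)  # torch.Size([1, 64, 192])
```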