This paper describes simple tasks that mimic visual label inspection and uses them to compare the accuracy of humans with that of deep learning techniques. Using these tasks, we investigate the number of training samples required to obtain accuracy equal to or higher than that of human inspection. In our method, the letters printed on test labels are represented as symbols. The variation in the symbols is controlled by changing the rotation angle, the defect position, and the defect rate. Training samples consisting of images and defect bounding boxes are generated automatically. The experimental results show that on the order of several thousand training samples were needed to obtain accuracy equal to or higher than that of humans in these tasks. They also show that on the order of tens of thousands of training samples were needed when the defect rate of the symbols was low. © 2020 Japan Society for Precision Engineering. All rights reserved.
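
The automatic generation of training samples described above can be illustrated with a minimal sketch. It is not the authors' generator: the plus-shaped symbol, the function names `render_symbol` and `make_sample`, the 32-pixel image size, and the reading of "defect rate" as the fraction of symbol pixels removed are all assumptions made for illustration. The sketch only shows how one rotated symbol image and its defect bounding box could be produced from the three controlled parameters (rotation angle, defect position, defect rate).

```python
import math


def render_symbol(size, angle_deg):
    """Return the set of (row, col) pixels of a plus-shaped symbol.

    The plus shape is a hypothetical stand-in for a printed symbol;
    it is rotated by angle_deg about the image center.
    """
    center = (size - 1) / 2.0
    half_len = size * 0.35    # half length of each bar (assumed)
    half_width = size * 0.08  # half thickness of each bar (assumed)
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pixels = set()
    for row in range(size):
        for col in range(size):
            # Rotate the pixel back into the symbol's own frame.
            x, y = col - center, row - center
            u = cos_t * x + sin_t * y
            v = -sin_t * x + cos_t * y
            in_horizontal = abs(u) <= half_len and abs(v) <= half_width
            in_vertical = abs(v) <= half_len and abs(u) <= half_width
            if in_horizontal or in_vertical:
                pixels.add((row, col))
    return pixels


def make_sample(size=32, angle_deg=0.0, defect_center=(16, 16), defect_rate=0.1):
    """Generate one (image, bbox) training sample.

    Assumption: defect_rate is the fraction of symbol pixels removed,
    and the removed pixels are those closest to defect_center (row, col).
    bbox is (x_min, y_min, x_max, y_max) in pixels, or None if no defect.
    """
    symbol = render_symbol(size, angle_deg)
    n_remove = round(defect_rate * len(symbol))
    # Order symbol pixels by squared distance to the defect center.
    ordered = sorted(
        symbol,
        key=lambda p: (
            (p[0] - defect_center[0]) ** 2 + (p[1] - defect_center[1]) ** 2,
            p,
        ),
    )
    removed = ordered[:n_remove]
    remaining = symbol - set(removed)
    image = [
        [1 if (row, col) in remaining else 0 for col in range(size)]
        for row in range(size)
    ]
    if not removed:
        return image, None
    rows = [p[0] for p in removed]
    cols = [p[1] for p in removed]
    return image, (min(cols), min(rows), max(cols), max(rows))


if __name__ == "__main__":
    image, bbox = make_sample(angle_deg=30.0, defect_center=(10, 20), defect_rate=0.2)
    print("defect bounding box:", bbox)
```

Sweeping `angle_deg`, `defect_center`, and `defect_rate` over chosen ranges would yield a labeled dataset of arbitrary size, which is the property that makes a study of accuracy versus number of training samples practical.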