Entropy measures have been used in feature selection for decades and have shown competitive performance. In general, the problem is formulated as minimizing the conditional entropy of the class label given the selected features. However, the generalization ability of these entropy measures has been largely neglected in the literature. Specifically, the use of conditional entropy has two critical issues. First, the empirical conditional distribution of the class label may have low confidence and thus be unreliable. Second, there may not be enough training instances for the selected features, so it is highly likely that unseen examples will be encountered in the test set. To address these issues, a bi-objective optimization model with a modified entropy measure, called the Bayesian entropy, is proposed. This model considers both the optimized conditional entropy value and the confidence of that value. As a result, it produces multiple feature subsets with different trade-offs between the entropy value and its confidence. The experimental results demonstrate that solving the proposed optimization model with the new entropy measure dramatically reduces the number of features in much less time than existing algorithms, while achieving similar or even better classification accuracy on most test problems.
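To make the underlying objective concrete, the following is a minimal sketch of the standard empirical conditional entropy H(y | X_S) that this line of work minimizes; it is not the proposed Bayesian entropy, whose exact form is defined in the paper. It assumes categorical features and labels, and the function name and interface are illustrative only.

```python
import numpy as np
from collections import Counter

def conditional_entropy(X, y, feature_idx):
    """Empirical conditional entropy H(y | X[:, feature_idx]).

    Groups training instances by their joint value on the selected
    features, then averages the entropy of the class label within
    each group, weighted by the group's relative frequency.
    """
    n = len(y)
    groups = {}
    for row, label in zip(X[:, feature_idx], y):
        groups.setdefault(tuple(row), []).append(label)
    h = 0.0
    for labels in groups.values():
        p_group = len(labels) / n
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        h += p_group * -(p * np.log2(p)).sum()
    return h

# Illustrative toy data: feature 0 determines the label, feature 1 does not.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(conditional_entropy(X, y, [0]))  # 0.0
print(conditional_entropy(X, y, [1]))  # 1.0
```

Note that a feature-value combination observed only once yields zero within-group entropy regardless of how unreliable that single-sample estimate is; this is precisely the low-confidence issue that motivates treating the confidence of the entropy value as a second objective.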