The need for accurate speech recognition systems has grown in recent years, driven by the demand for speech-based interfaces in applications such as mobile devices and smart speakers. However, current solutions for Vietnamese speech recognition are limited in both accuracy and practicality. To address these limitations, we propose a novel framework for Vietnamese automatic speech recognition that leverages Whisper, a transformer-based model, together with a dataset we collected ourselves. Although Whisper achieves state-of-the-art performance on languages with large amounts of training data, it leaves much to be desired on languages with less data, such as Vietnamese. We therefore collected a Vietnamese dataset and used it to fine-tune Whisper before incorporating the model into our framework. The dataset does not need to be domain-specific, but our current dataset covers finance because our applications are in that domain. By implementing and evaluating the proposed framework, we demonstrate the feasibility of using Whisper for Vietnamese speech recognition, as confirmed by its improved accuracy compared to existing solutions. Our findings also indicate room for further improvement. Finally, we deployed the framework as a Streamlit app, which shows its potential for practical application in real-world settings.