A digital library is a multifaceted information hub that houses text, sound, images, video, and literature in digital form and supports users with retrieval, download, and document transfer services. In the big data era, a robust multimedia retrieval system is pivotal for enhancing digital library interactions and knowledge services. This paper examines how image processing, big data, and deep learning can be combined for digital library integration. By analysing the concept and structure of deep learning and its relation to semantic search, it identifies issues such as 'under-utilising cross-modal correlation' and 'insufficient multimedia resource organisation'. The study proposes a deep-learning-based cross-media semantic search framework for digital libraries and suggests optimisation strategies involving cross-modal correlation analysis and hierarchical knowledge inference. The implemented Pillar+Spring+Sleep method demonstrates an 11.53% improvement in overall search performance over the suboptimal index. The optimised scheme aims to refine and advance multimedia retrieval systems within digital libraries, especially in managing the vast yet imprecise media data of the big data era. © The Author(s) 2024.