Human Activity Recognition (HAR) plays a critical role in fields such as healthcare, sports, and human-computer interaction. However, achieving high accuracy and robustness remains a challenge, particularly with noisy sensor data from accelerometers and gyroscopes. This paper introduces HARCNN, a novel approach that uses Convolutional Neural Networks (CNNs) to extract hierarchical spatial and temporal features from raw sensor data and thereby improve activity recognition performance. The HARCNN model comprises 10 convolutional blocks, referred to as “ConvBlk.” Each block integrates a convolutional layer, a ReLU activation function, and a batch normalization layer. The outputs of three block pairs (ConvBlk_3 and ConvBlk_4, ConvBlk_6 and ConvBlk_7, and ConvBlk_9 and ConvBlk_10) are fused by depth concatenation. The concatenated outputs are then passed through a 2 × 2 max-pooling layer with a stride of 2 for further processing. The proposed HARCNN framework is evaluated using accuracy, precision, sensitivity, and F-score, which together reflect the model’s ability to correctly classify and differentiate between human activities. Its performance is compared with that of conventional pre-trained CNNs and other state-of-the-art techniques. By combining this feature extraction with optimized learning strategies, the proposed model achieves accuracies of 97.87%, 99.12%, 96.58%, and 98.51% on the UCI-HAR, KU-HAR, WISDM, and HMDB51 datasets, respectively. The comparison underscores the model’s robustness and shows reductions in false positives and false negatives, which are crucial for real-world applications where reliable predictions are essential. The experiments were conducted with window sizes of 50 ms, 100 ms, 200 ms, 500 ms, 1 s, and 2 s.
The results indicate that the proposed method maintains high accuracy and reliability across these window sizes, adapting to different temporal granularities without significant loss of performance. This robustness makes the method well suited for deployment in diverse HAR scenarios. The best results were obtained with a window size of 200 ms.
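
To make the block layout described above concrete, the following is a minimal shape-tracing sketch in plain Python, not the authors' implementation. It assumes details the abstract does not state: the ten blocks are chained sequentially, every convolution uses "same" padding with stride 1, every block uses the same filter count, and pooling is applied immediately after each concatenation. The names `Shape`, `trace_harcnn`, and `FUSED_PAIRS`, the filter count of 32, and the 64 × 64 input are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Shape:
    """Feature-map shape as (channels, height, width)."""
    channels: int
    height: int
    width: int


def conv_block(x: Shape, filters: int) -> Shape:
    # ConvBlk = convolution + ReLU + batch normalization.
    # ASSUMPTION: "same" padding and stride 1, so only the channel count changes.
    return Shape(filters, x.height, x.width)


def depth_concat(a: Shape, b: Shape) -> Shape:
    # Depth concatenation stacks feature maps along the channel axis,
    # which requires identical spatial dimensions.
    if (a.height, a.width) != (b.height, b.width):
        raise ValueError("depth concatenation needs matching spatial sizes")
    return Shape(a.channels + b.channels, a.height, a.width)


def max_pool_2x2(x: Shape) -> Shape:
    # 2 x 2 window, stride 2, no padding: each spatial dimension is halved
    # (rounded down); the channel count is unchanged.
    if x.height < 2 or x.width < 2:
        raise ValueError("input too small for 2 x 2 pooling")
    return Shape(x.channels, x.height // 2, x.width // 2)


# Later block index -> earlier block whose output it is fused with.
FUSED_PAIRS: Dict[int, int] = {4: 3, 7: 6, 10: 9}


def trace_harcnn(input_shape: Shape, filters: int = 32) -> List[Tuple[str, Shape]]:
    """Return (stage name, output shape) for each stage of the sketched network."""
    trace: List[Tuple[str, Shape]] = []
    block_outputs: Dict[int, Shape] = {}
    x = input_shape
    for i in range(1, 11):
        x = conv_block(x, filters)
        block_outputs[i] = x
        trace.append((f"ConvBlk_{i}", x))
        if i in FUSED_PAIRS:
            j = FUSED_PAIRS[i]
            x = depth_concat(block_outputs[j], x)
            trace.append((f"concat_{j}_{i}", x))
            # ASSUMPTION: pooling follows each concatenation directly.
            x = max_pool_2x2(x)
            trace.append((f"maxpool_after_{i}", x))
    return trace


if __name__ == "__main__":
    for name, shape in trace_harcnn(Shape(1, 64, 64)):
        print(f"{name:18s} {shape}")
```

Under these assumptions, each fusion doubles the channel count relative to a single block, and each pooling step halves the spatial resolution, so the three fusion stages reduce the spatial size by a factor of eight overall.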