Video-based human activity recognition (HAR) is an active and challenging research area in computer vision. Camera motion, irregular human motion, varying illumination conditions, complex backgrounds, and variations in the shape and size of human subjects across video clips of the same activity category all make recognition difficult. To overcome these challenges, we introduce a novel feature representation technique for human activity recognition based on the fusion of multiple features. This paper presents a robust, view-invariant feature descriptor that combines motion information with the local appearance of human objects for video-based activity recognition in realistic, multi-view environments. First, we combine Optical Flow (OF) and the Histogram of Oriented Gradients (HOG) to capture the dynamic pattern of motion. Next, we compute shape information by combining the Local Ternary Pattern (LTP) and Zernike Moment (ZM) descriptors. Finally, a feature fusion strategy integrates the motion and shape information into the final feature vector. Experiments on three publicly available video datasets, IXMAS, CASIA, and TV Human Interaction (TV-HI), yield classification accuracies of 98.25% on IXMAS, 92.21% on CASIA Single Person, 98.66% on CASIA Interaction, and 96.48% on TV-HI. The results are evaluated in terms of seven performance measures: accuracy, precision, recall, specificity, F-measure, Matthews correlation coefficient (MCC), and computation time. Comparisons with existing state-of-the-art methods demonstrate the effectiveness and usefulness of the proposed approach.
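To make the pipeline concrete, the following is a minimal sketch of the fused descriptor described above, assuming OpenCV, scikit-image, and mahotas as the underlying libraries. The specific parameter choices (Farneback optical flow, HOG cell size, LTP threshold, Zernike degree) and the plain concatenation-based fusion are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only: concrete parameters and concatenation fusion
# are assumptions, not the paper's exact method.
import cv2
import numpy as np
import mahotas
from skimage.feature import hog

def motion_features(prev_gray, curr_gray):
    """Motion cue: dense optical flow magnitude summarized with HOG."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # HOG over the flow-magnitude image captures the dynamic motion pattern.
    return hog(mag, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), feature_vector=True)

def ltp_histogram(gray, t=5):
    """Local Ternary Pattern, split into upper/lower binary pattern histograms."""
    h, w = gray.shape
    center = gray[1:h-1, 1:w-1].astype(np.int16)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros(center.shape, dtype=np.uint8)
    lower = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1+dy:h-1+dy, 1+dx:w-1+dx].astype(np.int16)
        upper |= (neigh >= center + t).astype(np.uint8) << bit
        lower |= (neigh <= center - t).astype(np.uint8) << bit
    hist_u, _ = np.histogram(upper, bins=256, range=(0, 256))
    hist_l, _ = np.histogram(lower, bins=256, range=(0, 256))
    return np.concatenate([hist_u, hist_l]).astype(np.float64)

def shape_features(gray):
    """Shape cue: LTP texture histogram plus Zernike moments."""
    ltp = ltp_histogram(gray)
    ltp /= (ltp.sum() + 1e-9)  # normalize the histogram
    radius = min(gray.shape) // 2
    zm = mahotas.features.zernike_moments(gray, radius, degree=8)
    return np.concatenate([ltp, zm])

def fused_descriptor(prev_gray, curr_gray):
    """Final vector: motion (OF+HOG) and shape (LTP+ZM) features concatenated."""
    return np.concatenate([motion_features(prev_gray, curr_gray),
                           shape_features(curr_gray)])
```

In this sketch the fusion step is simple early fusion (concatenating the motion and shape vectors before classification); any weighting or normalization scheme applied at that stage would follow the strategy detailed in the body of the paper.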