Early and accurate prediction of clinical outcomes holds great potential for patient prognostics and personalized treatment planning. Automated estimation of image-based clinical parameters, such as the total metabolic tumor volume (TMTV), could pave the way for predicting advanced clinical outcomes that are not explicitly available in the images, such as overall survival. We developed an automated framework that extracts tissue-wise multi-channel 2D projections from whole-body FDG-PET/CT volumes by separating tissues based on CT Hounsfield units, and uses a DenseNet-121 to estimate the TMTV from these projections. For transparency and interpretability, we propose an image registration-based cohort saliency analysis. The framework was applied to the autoPET cohort (501 scans covering lymphoma, lung cancer, and melanoma), and a single-channel method (baseline) was compared with a multi-channel method (proposed). Incorporating multiple channels improved TMTV prediction over the baseline model, with ΔMAE = -14.34 ml, ΔR² = 0.1584, and ΔICC = 0.1316 (p-value = 0.0098). To assess the saliency maps, the Pearson correlation coefficient (r) was computed between the ground truth (GT) tumor projections and the aggregated saliency maps. Statistical comparison via bootstrapping showed that the proposed model yielded significantly higher r than the baseline across all cancer types and both sexes, except for melanoma in females. This indicates that the aggregated saliency maps generated by the proposed model correspond more closely to the GT than those of the baseline model. Our approach offers a promising and interpretable framework for the automated prediction of TMTV, with further potential to predict advanced clinical outcomes.
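To illustrate the projection step, the following is a minimal sketch, not the implementation used in this work. It assumes that each channel is a maximum intensity projection of the PET volume restricted to one CT-defined tissue class, and that the CT and PET volumes are already resampled to the same grid. The Hounsfield unit windows in `HU_RANGES`, the function names `tissue_projections` and `pearson_r`, and the choice of projection axis are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical HU windows (assumption): the actual thresholds are not stated here.
HU_RANGES = {
    "air_lung": (-1024.0, -200.0),
    "soft_tissue": (-200.0, 200.0),
    "bone": (200.0, 3000.0),
}


def tissue_projections(ct_hu, pet, hu_ranges=HU_RANGES, axis=1):
    """Return one 2D PET projection per tissue class, stacked as channels.

    ct_hu and pet are 3D arrays on the same voxel grid (assumption).
    The projection is a maximum intensity projection (assumption).
    """
    if ct_hu.shape != pet.shape:
        raise ValueError("CT and PET volumes must share the same shape")
    channels = []
    for lo, hi in hu_ranges.values():
        # Keep PET uptake only where the CT value falls in this tissue window.
        mask = (ct_hu >= lo) & (ct_hu < hi)
        masked_pet = np.where(mask, pet, 0.0)
        # Collapse the volume to 2D along the chosen axis.
        channels.append(masked_pet.max(axis=axis))
    return np.stack(channels, axis=0)


def pearson_r(gt_projection, saliency_map):
    """Pearson correlation between two 2D maps, computed over all pixels."""
    return float(np.corrcoef(gt_projection.ravel(), saliency_map.ravel())[0, 1])
```

In this sketch, the multi-channel input to the network would be the stacked array returned by `tissue_projections`, while a single-channel baseline would correspond to projecting the unmasked PET volume. `pearson_r` mirrors the pixel-wise comparison between a GT tumor projection and an aggregated saliency map.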