Federated learning is a distributed paradigm that allows multiple parties to collaboratively train deep learning models without directly exchanging raw data. However, the inherently non-independent and identically distributed (non-i.i.d.) nature of data across clients significantly degrades the performance of the learned model. The primary goal of this study is to develop a robust federated learning algorithm that addresses feature shift in clients' samples, which can arise from a range of factors such as acquisition discrepancies in medical imaging. To reach this goal, we first propose federated feature augmentation (FEDFA(l)), a novel feature augmentation technique tailored for federated learning. FEDFA(l) is based on a crucial insight: each client's data distribution can be characterized by the first-/second-order statistics (i.e., mean and standard deviation) of its latent features, and it is feasible to manipulate these local statistics globally, i.e., based on information from the entire federation, so that each client gains a better sense of the global distribution across clients. Grounded in this insight, we propose to augment each local feature statistic by sampling from a normal distribution whose mean is the original statistic and whose variance defines the augmentation scope. Central to FEDFA(l) is the determination of a meaningful Gaussian variance, which is accomplished by taking into account not only the biased data of each individual client, but also the underlying feature statistics across all participating clients. Beyond the low-order statistics considered in FEDFA(l), we propose a federated feature alignment component (FEDFA(h)) that exploits higher-order feature statistics to gain a more detailed understanding of local feature distributions, and that explicitly aligns augmented features across clients to promote more consistent feature learning. Combining FEDFA(l) and FEDFA(h) yields our full approach, FEDFA+. FEDFA+ is non-parametric, incurs negligible additional communication cost, and can be seamlessly incorporated into popular CNN and Transformer architectures. We offer rigorous theoretical analysis as well as extensive empirical justification to demonstrate the effectiveness of the algorithm.
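To make the statistic-manipulation idea concrete, the following is a minimal PyTorch-style sketch of Gaussian augmentation of per-channel feature statistics, assuming 4D CNN feature maps. The module name `FeatureStatAugment` and the parameters `p` and `var_scale` are hypothetical illustrations; in particular, the paper's federation-informed estimation of the Gaussian variance (combining client-level and federation-wide statistic variability) is simplified here to a fixed scale.

```python
import torch
import torch.nn as nn

class FeatureStatAugment(nn.Module):
    """Sketch of feature-statistic augmentation in the spirit of FEDFA(l).

    Each sample's channel-wise mean/std are perturbed by Gaussian noise
    centred on the original statistics, then used to re-normalize the
    feature map. `var_scale` is a stand-in for the federation-informed
    variance estimate described in the paper.
    """

    def __init__(self, p: float = 0.5, var_scale: float = 1.0):
        super().__init__()
        self.p = p                  # probability of applying augmentation
        self.var_scale = var_scale  # hypothetical fixed augmentation scope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) latent feature map; augment only during training.
        if not self.training or torch.rand(1).item() > self.p:
            return x
        mu = x.mean(dim=(2, 3), keepdim=True)           # per-sample channel mean
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-6  # per-sample channel std
        # Sample new statistics from Gaussians whose means are the originals.
        new_mu = mu + torch.randn_like(mu) * self.var_scale
        new_sigma = sigma + torch.randn_like(sigma) * self.var_scale
        # Re-normalize with the original statistics, re-style with the new ones.
        return (x - mu) / sigma * new_sigma + new_mu
```

In use, such a module would be inserted between backbone layers (e.g., after each convolutional stage), which is consistent with the abstract's claim of a non-parametric component that adds no learnable weights and negligible communication cost.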