Actor-critic is a reinforcement learning method that learns a policy by collecting samples through online trial-and-error interaction with the environment, and it is an effective tool for solving sequential decision-making problems. However, the active learning paradigm of online interaction raises cost and safety concerns when collecting samples in some complex real-world environments. Offline reinforcement learning, as a data-driven reinforcement learning paradigm, emphasizes learning a policy from a static sample dataset without exploratory interaction with the environment. It has become a research hotspot in recent years and provides a feasible solution for real-world applications such as robotics, autonomous driving, and healthcare. At present, offline reinforcement learning methods face the challenge of distribution shift between the learned policy and the behavior policy, which produces extrapolation errors in the value function estimates for actions that are out-of-distribution (OOD) with respect to the static dataset. These extrapolation errors accumulate through the Bellman bootstrapping operation, leading to performance degradation or even non-convergence of offline reinforcement learning. To deal with the distribution shift problem, policy constraints or value function regularization are usually used to restrict the agent's access to OOD actions, which may result in overly conservative learning and hinder both the generalization of the value function network and the improvement of the policy. To this end, an offline deterministic actor-critic method based on uncertainty estimation (ODACUE) is proposed to balance the generalization and conservatism of value function learning by combining uncertainty estimation with OOD sampling. Firstly, for the deterministic policy, an uncertainty estimation operator is defined that estimates the Q value function differently for in-dataset and OOD actions: the in-dataset action value function is estimated via the Bellman bootstrapping operation and ensemble uncertainty estimation, whereas the OOD action value function is estimated from a pseudo-target constructed by ensemble uncertainty estimation and OOD sampling. The pessimism of the uncertainty estimation operator is analyzed theoretically with ξ-uncertainty estimation theory: with appropriately chosen parameters, the Q value function learned by the uncertainty estimation operator is a pessimistic estimate of the optimal Q value function. Then, by applying the uncertainty estimation operator to the deterministic actor-critic framework, the objective function of critic learning is constructed as a convex combination of the in-dataset and OOD action value functions, so that the conservative constraint and the generalization of value function learning are balanced via the convex combination coefficient. Moreover, when learning the in-dataset action value function, the uncertainty estimation operator is implemented by the critic target network; when learning the OOD action value function, OOD sampling is implemented by the actor main network and the uncertainty estimation operator is implemented by the critic main network. Finally, ODACUE and several state-of-the-art baseline algorithms are evaluated on the D4RL benchmark.
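As a concrete illustration of this critic update, the following is a minimal PyTorch-style sketch: an ensemble lower-confidence bound plays the role of the uncertainty estimation operator, the critic target ensemble produces the in-dataset Bellman target, the actor main network provides the OOD samples, the critic main ensemble provides the OOD pseudo-target, and the two squared-error terms are mixed by a convex combination coefficient. All names, shapes, and hyperparameter values (ensemble size, penalty coefficient beta, mixing coefficient lam, OOD noise scale) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed details): uncertainty-penalized critic loss that
# mixes in-dataset and OOD value targets with a convex combination coefficient.
import torch
import torch.nn as nn

def lcb(q_values: torch.Tensor, beta: float) -> torch.Tensor:
    """Ensemble lower-confidence bound: mean minus beta * std over the ensemble axis."""
    return q_values.mean(dim=0) - beta * q_values.std(dim=0)

def critic_loss(
    critics: nn.ModuleList,          # main critic ensemble, each maps (s, a) -> Q of shape (B, 1)
    target_critics: nn.ModuleList,   # target critic ensemble
    actor: nn.Module,                # deterministic actor (main network)
    batch: dict,                     # tensors "s", "a", "r", "s_next", "done"
    gamma: float = 0.99,
    beta: float = 1.0,               # uncertainty penalty coefficient (assumed value)
    lam: float = 0.75,               # convex combination coefficient (assumed value)
    ood_noise: float = 0.3,          # perturbation scale for OOD sampling (assumed value)
) -> torch.Tensor:
    s, a, r = batch["s"], batch["a"], batch["r"]
    s_next, done = batch["s_next"], batch["done"]

    with torch.no_grad():
        # In-dataset target: Bellman bootstrap through the critic *target* ensemble,
        # penalized by the ensemble standard deviation (uncertainty estimation operator).
        a_next = actor(s_next)
        q_next = torch.stack([qt(s_next, a_next) for qt in target_critics])  # (K, B, 1)
        y_in = r + gamma * (1.0 - done) * lcb(q_next, beta)

        # OOD sampling via the actor main network: perturb the policy action
        # (actions assumed to lie in [-1, 1]).
        a_ood = (actor(s) + ood_noise * torch.randn_like(a)).clamp(-1.0, 1.0)
        # OOD pseudo-target: uncertainty-penalized estimate from the critic *main* ensemble.
        q_ood_main = torch.stack([q(s, a_ood) for q in critics])
        y_ood = lcb(q_ood_main, beta)

    loss = 0.0
    for q in critics:
        in_term = (q(s, a) - y_in).pow(2).mean()
        ood_term = (q(s, a_ood) - y_ood).pow(2).mean()
        # Convex combination trades off conservatism (OOD term) and generalization.
        loss = loss + lam * in_term + (1.0 - lam) * ood_term
    return loss
```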
Experimental results show that, compared with the baseline algorithms, ODACUE achieves an overall performance improvement of at least 9.56% and at most 64.92% on 11 datasets of different quality levels. In addition, parameter analysis and ablation experiments further validate the stability and generalization ability of ODACUE. © 2024 Science Press. All rights reserved.