Optimization of Average Rewards of Time Nonhomogeneous Markov Chains

Cited by: 15
Authors
Cao, Xi-Ren [1,2,3]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Automat, Minist Educ, Dept Finance, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Automat, Minist Educ, Key Lab Syst Control & Informat Proc, Shanghai 200240, Peoples R China
[3] Hong Kong Univ Sci & Technol, Inst Adv Study, Kowloon, Hong Kong, Peoples R China
Keywords
Bias optimality; bias potential; confluencity; direct-comparison based optimization; HJB equation; performance potential; weak ergodicity; weak recurrence; DECISION-PROCESSES; BIAS OPTIMALITY;
DOI
10.1109/TAC.2015.2394951
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
We study the optimization of the average reward of discrete-time nonhomogeneous Markov chains, in which the state spaces, transition probabilities, and reward functions depend on time. The analysis encounters a few major difficulties: 1) notions crucial to homogeneous Markov chains, such as ergodicity, stationarity, periodicity, and connectivity, no longer apply; 2) the average-reward criterion is under-selective, i.e., it does not depend on the decisions in any finite period, so the problem is not amenable to dynamic programming; and 3) because of the under-selectivity, an optimal average-reward policy may not be the best in any finite period. These issues are resolved as follows: 1) we discover that a new notion, called "confluencity", is the basis for the optimization of the average reward of Markov chains; confluencity refers to the property that two independent sample paths of a Markov chain starting from any two different initial states will eventually meet; 2) we apply the direct-comparison based approach [3] to the average-reward optimization and obtain necessary and sufficient conditions for optimal policies; and 3) we study bias optimality, with the bias measuring the transient reward, and show that for the transient reward to be optimal, one additional condition based on bias potentials is required.
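The notion of confluencity lends itself to a quick numerical illustration. The following sketch is not from the paper: it simulates two independent sample paths of a small, hypothetical time-nonhomogeneous chain (the three-state space and the time-varying matrices P_t are illustrative assumptions) and records the first time the two paths occupy the same state.

```python
import numpy as np

def transition_matrix(t):
    """Hypothetical time-dependent transition matrix P_t on 3 states.
    The parameter a varies with t, so the chain is nonhomogeneous."""
    a = 0.5 + 0.4 * np.sin(t / 10.0)  # stays inside (0.1, 0.9)
    return np.array([
        [a,   1 - a, 0.0],
        [0.3, 0.4,   0.3],
        [0.0, 1 - a, a  ],
    ])

def first_meeting_time(x0, y0, horizon=10_000, rng=None):
    """Run two independent sample paths from states x0 and y0 and
    return the first time t at which they are in the same state
    (None if they do not meet within the horizon)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, y = x0, y0
    for t in range(horizon):
        if x == y:
            return t
        P = transition_matrix(t)
        x = rng.choice(3, p=P[x])  # the two paths use independent draws
        y = rng.choice(3, p=P[y])
    return None

# Confluencity: paths started from any two initial states eventually meet.
for x0, y0 in [(0, 2), (1, 2), (0, 1)]:
    print(f"start ({x0},{y0}) -> paths meet at t = {first_meeting_time(x0, y0)}")
```

If the meeting time is finite with probability one for every pair of initial states, the chain exhibits the meeting behavior that the paper's confluencity notion formalizes; the exact definition and its role in the optimality conditions are given in the paper itself.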
Pages: 1841-1856
Page count: 16