A stationary policy and an initial state in an MDP (Markov decision process) induce a stationary probability distribution of the reward. The problem analyzed here is generating the Pareto optima, in the sense of high mean and low variance of that stationary distribution. In the unichain case, Pareto optima can be computed either with policy improvement or with a linear program that has the same number of variables as, and one more constraint than, the formulation for gain-rate optimization. The same linear program suffices in the multichain case if the choice of ergodic class is itself a decision variable.
Institution: Univ Tsukuba, Grad Sch Syst & Informat Engn, Div Social Syst & Management, Tsukuba, Ibaraki 3058573, Japan