Risk-Sensitive Reinforcement Learning Algorithms With Generalized Average Criterion
-
Abstract: A new algorithm is proposed that potentially sacrifices the optimality of control policies in exchange for robustness. Robustness can become a critical property of a learning system when there is a mismatch between the theoretical model and the actual physical system, when the actual system is non-stationary, or when the availability of control actions changes over time. The main contribution is a set of approximation algorithms together with their convergence results. By replacing the usual optimality operator max (or min) with a generalized average operator, dynamic programming algorithms, one of the most important classes of reinforcement learning algorithms, are studied and their convergence is discussed, with the aim of improving the robustness of reinforcement learning algorithms. In addition, a more general risk-sensitive performance criterion is adopted, under which some standard results for dynamic-programming-based learning algorithms are found to no longer hold in full.
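The core idea, replacing the max operator in the dynamic programming backup with a generalized average, can be sketched as follows. This is an illustrative toy implementation, not the paper's exact algorithm: the log-sum-exp operator used here is one common instance of a generalized average (it approaches max as beta grows large and the arithmetic mean as beta approaches 0), and the MDP interface (`P`, `R`, `gamma`, `beta`) is assumed for the example.

```python
import numpy as np

def softmax_operator(q, beta):
    # Log-sum-exp "soft" maximum over action values: one instance of a
    # generalized average operator. Recovers max as beta -> +inf and the
    # plain mean as beta -> 0+. Shifting by m keeps exp() numerically stable.
    q = np.asarray(q, dtype=float)
    m = q.max()
    return m + np.log(np.mean(np.exp(beta * (q - m)))) / beta

def soft_value_iteration(P, R, gamma=0.9, beta=5.0, tol=1e-8):
    # P[a] is an (S, S) transition matrix for action a; R is an (S, A)
    # reward matrix (names assumed for this sketch). The usual max over
    # actions in the Bellman backup is replaced by the soft operator,
    # trading some optimality of the greedy policy for smoother updates.
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[a][s, s'] * V[s']
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = np.array([softmax_operator(Q[s], beta) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the soft operator is a non-expansion in the sup-norm, the iteration still contracts with modulus gamma, which is the kind of convergence property the paper establishes for its generalized-average algorithms.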
-
Key words:
- reinforcement learning /
- risk-sensitive /
- generalized average /
- algorithm /
- convergence