
Risk-Sensitive Reinforcement Learning Algorithms With Generalized Average Criterion

YIN Chang-ming, WANG Han-xing, ZHAO Fei

Citation: YIN Chang-ming, WANG Han-xing, ZHAO Fei. Risk-Sensitive Reinforcement Learning Algorithms With Generalized Average Criterion[J]. Applied Mathematics and Mechanics, 2007, 28(3): 369-378.


Funding: Supported by the National Natural Science Foundation of China (10471088; 60572126)
Details
    About the authors:

    YIN Chang-ming (1964- ), male, from Hunan Province, associate professor, Ph.D. (corresponding author. Tel: +86-731-5542939; E-mail: yinchm@csust.edu.cn).

  • CLC number: O23; TP182


  • Abstract: A new algorithm is proposed that gains robustness by potentially sacrificing the optimality of the control policy. Robustness becomes an important issue when there is a mismatch between the theoretical model and the actual physical system, when the real system is non-stationary, or when the availability of control actions varies over time. The main contribution is a set of approximation algorithms together with their convergence results. By replacing the optimality operator max (or min) with a generalized average operator, the dynamic programming algorithms, one of the most important classes of algorithms in reinforcement learning, are studied and their convergence is discussed, with the goal of improving the robustness of reinforcement learning algorithms. A more general risk-sensitive performance criterion is also adopted, and it turns out that the standard conclusions for dynamic-programming-based learning algorithms do not fully hold under this criterion.
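  The abstract does not spell out the exact form of the generalized average operator. Purely as an illustration, the minimal sketch below replaces the hard max in a value-iteration backup with an exponential (log-sum-exp) mean controlled by a parameter beta; the function names, the toy MDP data, and the choice of operator are assumptions made here for illustration, not the authors' algorithm. As beta grows, the operator approaches the ordinary max, so the standard backup is recovered as a limiting case.

```python
import numpy as np

def exp_mean(q_values, beta):
    """Generalized average over action values: an exponential (log-sum-exp) mean.

    As beta -> +inf this approaches max(q_values); as beta -> 0 it approaches the
    arithmetic mean. This is only one possible instantiation of a "generalized
    average" operator, chosen here for illustration.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()  # shift by the max for numerical stability of exp()
    return m + np.log(np.mean(np.exp(beta * (q - m)))) / beta

def soft_value_iteration(P, R, gamma=0.9, beta=5.0, tol=1e-8, max_iter=10000):
    """Value iteration in which the hard max over actions is replaced by exp_mean.

    P: transition probabilities with shape (S, A, S); R: rewards with shape (S, A).
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)  # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = np.array([exp_mean(Q[s], beta) for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new

# Toy 2-state, 2-action MDP (made-up numbers, just to exercise the code).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(soft_value_iteration(P, R, beta=5.0))    # generalized-average backup
print(soft_value_iteration(P, R, beta=500.0))  # effectively the standard max backup
```

  The exponential mean, like the hard max, is a non-expansion in the sup norm, so the discounted backup above remains a contraction and the iteration converges; convergence questions of this flavor, posed under a risk-sensitive criterion, are the kind the paper addresses for its generalized-average algorithms.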
Publication history
  • Received:  2006-02-20
  • Revised:  2007-01-16
  • Published:  2007-03-15
