A random skill discovery algorithm in continuous spaces

CLC number: TN911-34; TP18    Document code: A    Article ID: 1004-373X(2016)10-0014-04

LUAN Yonghong1,2, LIU Quan2,3, ZHANG Peng2

(1. Suzhou Institute of Industrial Technology, Suzhou 215104, China; 2. Institute of Computer Science and Technology, Soochow University, Suzhou 215006, China; 3. MOE Key Laboratory of Symbolic Computation and Knowledge Engineering, Jilin University, Changchun 130012, China)

Abstract: To address the curse of dimensionality that arises in large and continuous state spaces, where the number of states grows exponentially with the state dimension, an improved random skill discovery algorithm based on the Option hierarchical reinforcement learning framework is proposed. Random Options are defined to grow random skill trees, which together form a random skill tree ensemble. The overall task goal is divided into sub-goals, and low-order Option policies are learned for them, which limits the exponential growth of learning parameters as the agent's problem grows. Simulation experiments were carried out on the task of planning the shortest path between two points in a two-dimensional continuous grid space with obstacles. The results show that, because Options are defined randomly, the algorithm exhibits intermittent instability in its initial performance, but as the random skill tree ensemble grows it converges quickly to a near-optimal solution, effectively overcoming the difficulty of obtaining an optimal policy and the slow convergence caused by the curse of dimensionality.

Keywords: reinforcement learning; Option; continuous space; random skill discovery

0 Introduction

Reinforcement learning (RL) [1-2] is a framework in which an agent learns a policy mapping states to actions through direct interaction with the environment. Classical reinforcement learning algorithms try to find a single optimal policy over the whole domain, which works well in small-scale or discrete environments but runs into the curse of dimensionality in large-scale and continuous state spaces. To tackle this problem, researchers have proposed methods such as state clustering, search in restricted policy spaces, value function approximation, and hierarchical reinforcement learning [3]. The hierarchy in hierarchical reinforcement learning is built by adding an abstraction mechanism on top of ordinary reinforcement learning, that is, by using both the primitive actions of the base RL method and higher-level skill actions [3] (also called Options).
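
As a concrete illustration of the flat, non-hierarchical baseline discussed above (an editorial sketch, not code from this paper), the snippet below implements tabular Q-learning with an epsilon-greedy policy. The `env` object with `reset`, `step`, and `n_actions` is an assumed, hypothetical interface; the point is that the table indexed by (state, action) pairs is exactly what becomes intractable as the state dimension grows.

```python
# Minimal tabular Q-learning sketch (illustrative baseline, not the paper's algorithm).
# The Q-table indexed by (state, action) pairs is what blows up in large or
# continuous state spaces, motivating hierarchical methods such as Options.
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, a)] -> estimated return
    actions = list(range(env.n_actions))        # assumed discrete action interface

    for _ in range(episodes):
        s = env.reset()                         # assumed: returns a hashable state
        done = False
        while not done:
            # epsilon-greedy action selection over the current Q estimates
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)       # assumed: (next_state, reward, done)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```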

One of the main research goals of hierarchical reinforcement learning is the automatic discovery of hierarchical skills. Although many hierarchical reinforcement learning methods have been studied in recent years, most of them look for hierarchical skills in relatively small, discrete domains. For example, Simsek and Osentoski et al. find sub-goals by partitioning local state-transition graphs built from recent experience [4-5]. McGovern and Barto select sub-goals according to the frequency with which states occur [6]. Matthew proposed taking frequently visited states on successful paths as sub-goals, while Jong and Stone proposed selecting sub-goals based on the irrelevance of state variables [7]. However, these methods all target relatively small, discrete reinforcement learning domains. In 2009, Konidaris and Barto proposed a skill discovery method for continuous reinforcement learning spaces called skill chaining [8]. In 2010, Konidaris further proposed the CST algorithm, which segments each solution trajectory into skills using a change-point detection method [9]; this approach is limited to cases where the trajectories are not too long and can be acquired.

This paper presents a random skill discovery algorithm for continuous RL domains. Drawing on the adaptive and hierarchically optimal properties of Option-based hierarchical reinforcement learning, each high-level skill is defined as an Option, and these Options are defined randomly; the complexity of the method is proportional to the number of Options constructed for the learning domain. Although a randomly chosen Option may not be the most suitable one, the constructed Options form not just a single skill tree but an ensemble of skill trees, which compensates for this weakness.
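
To make the idea of randomly defined skills more concrete, the following is a hedged sketch of building a random skill tree over a continuous state space by recursive random splits, with an ensemble of such trees compensating for unlucky splits. The splitting rule, tree depth, and the `bounds` representation are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch of a "random skill tree": recursively split the continuous state
# space at randomly chosen dimensions and thresholds, and attach a low-order
# Option (sub-policy) to each leaf region. Names and parameters are illustrative.
import random

class SkillNode:
    def __init__(self, bounds, depth=0, max_depth=3):
        self.bounds = bounds              # list of (low, high) per state dimension
        self.left = self.right = None
        self.option = None                # low-order Option learned for a leaf region
        if depth < max_depth:
            self.dim = random.randrange(len(bounds))     # random split dimension
            lo, hi = bounds[self.dim]
            self.threshold = random.uniform(lo, hi)      # random split point
            left_b, right_b = list(bounds), list(bounds)
            left_b[self.dim] = (lo, self.threshold)
            right_b[self.dim] = (self.threshold, hi)
            self.left = SkillNode(left_b, depth + 1, max_depth)
            self.right = SkillNode(right_b, depth + 1, max_depth)

    def leaf_for(self, state):
        """Route a continuous state to the leaf (sub-goal region) that owns it."""
        if self.left is None:
            return self
        branch = self.left if state[self.dim] <= self.threshold else self.right
        return branch.leaf_for(state)

# An ensemble of independently generated random trees offsets any single tree's
# unlucky splits, in the spirit of the random skill tree set described above.
def random_skill_forest(bounds, n_trees=10, max_depth=3):
    return [SkillNode(bounds, max_depth=max_depth) for _ in range(n_trees)]
```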

1 Hierarchical reinforcement learning and the Option framework

The core idea of hierarchical reinforcement learning (HRL) is to introduce an abstraction mechanism that decomposes the overall learning task. In HRL methods, the agent can handle not only the given set of primitive actions but also higher-level skills.
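
The sketch below is our rendering of the standard Option formalism (not code from this paper): an Option is the usual triple of initiation set, internal policy, and termination condition, and it is executed as a temporally extended action. The `env.step` interface is an assumption carried over from the earlier sketch.

```python
# Schematic rendering of the standard Option triple (I, pi, beta): an initiation
# predicate, an internal policy, and a stochastic termination condition. While an
# Option runs, the agent follows its internal policy instead of choosing
# primitive actions step by step.
import random
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    initiation: Callable[[State], bool]     # I: may the option start in state s?
    policy: Callable[[State], Action]       # pi: primitive action chosen while it runs
    termination: Callable[[State], float]   # beta: probability of terminating in s

def run_option(env, option: Option, state: State, gamma: float = 0.95):
    """Execute an option to termination; return (next_state, discounted_reward, steps)."""
    total, discount, steps = 0.0, 1.0, 0
    done = False
    while not done:
        action = option.policy(state)
        state, reward, done = env.step(action)     # assumed env interface
        total += discount * reward
        discount *= gamma
        steps += 1
        if random.random() < option.termination(state):
            break
    return state, total, steps
```

At the level above, action values over Options can then be updated with the usual SMDP Q-learning rule, Q(s, o) <- Q(s, o) + alpha * [R + gamma^k * max_o' Q(s', o') - Q(s, o)], where R is the discounted reward accumulated while the Option ran for k steps.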

4 Conclusion

The experimental results show that the RSD algorithm can significantly improve performance on RL problems in continuous domains by using an ensemble of random skill trees and learning a low-order Option policy for each leaf. Compared with other skill discovery methods, the advantage of the RSD algorithm is that it uses the Option framework to handle continuous RL domains better, creating Options automatically without analyzing graphs or value functions built from training data. It therefore reduces the burden of searching for specific Options, which makes it better suited to large-scale or continuous state spaces and able to tackle more difficult domains.

References

[1] SUTTON R S, BARTO A G. Reinforcement learning: an introduction [M]. Cambridge, MA: MIT Press, 1998.

[2] KAELBLING L P, LITTMAN M L, MOORE A W. Reinforcement learning: a survey [EB/OL]. [1996-05-01]. http://www.cs.cmu.edu/afs/cs...vey.html.

[3] BARTO A G, MAHADEVAN S. Recent advances in hierarchical reinforcement learning [J]. Discrete Event Dynamic Systems, 2003, 13(4): 341-379.

[4] SIMSEK O, WOLFE A P, BARTO A G. Identifying useful subgoals in reinforcement learning by local graph partitioning [C]// Proceedings of the 22nd International Conference on Machine Learning. USA: ACM, 2005: 816-823.

[5] OSENTOSKI S, MAHADEVAN S. Learning state-action basis functions for hierarchical MDPs [C]// Proceedings of the 24th International Conference on Machine Learning. USA: ACM, 2007: 705-712.

[6] MCGOVERN A, BARTO A. Automatic discovery of subgoals in reinforcement learning using diverse density [C]// Proceedings of the 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 2001: 361-368.

[7] JONG N K, STONE P. State abstraction discovery from irrelevant state variables [C]// Proceedings of the 19th International Joint Conference on Artificial Intelligence. 2005: 752-757.

[8] KONIDARIS G, BARTO A G. Skill discovery in continuous reinforcement learning domains using skill chaining [C]// Advances in Neural Information Processing Systems 22. 2009: 1015-1023.

[9] KONIDARIS G, KUINDERSMA S, BARTO A G, et al. Constructing skill trees for reinforcement learning agents from demonstration trajectories [C]// Advances in Neural Information Processing Systems 23. 2010: 1162-1170.

[10] LIU Quan, YAN Qicui, FU Yuchen, et al. A hierarchical reinforcement learning method based on heuristic reward function [J]. Journal of Computer Research and Development, 2011, 48(12): 2352-2358.

[11] SHEN Jing, LIU Haibo, ZHANG Rubo, et al. Multi-robot hierarchical reinforcement learning based on semi-Markov games [J]. Journal of Shandong University (Engineering Science), 2010, 40(4): 1-7.

[12] KONIDARIS G, BARTO A. Efficient skill learning using abstraction selection [C]// Proceedings of the 21st International Joint Conference on Artificial Intelligence. Pasadena, CA, USA: [s.n.], 2009: 1107-1113.

[13] XIAO Ding, LI Yitong, SHI Chuan. Autonomic discovery of subgoals in hierarchical reinforcement learning [J]. The Journal of China Universities of Posts and Telecommunications, 2014, 21(5): 94-104.

[14] CHEN Chunlin, DONG Daoyi, LI Hanxiong, et al. Hybrid MDP based integrated hierarchical Q-learning [J]. Science China (Information Sciences), 2011, 54(11): 2279-2294.
