Paper Title
Regret-Aware Black-Box Optimization with Natural Gradients, Trust-Regions and Entropy Control
Paper Authors
Paper Abstract
Most successful stochastic black-box optimizers, such as CMA-ES, use rankings of the individual samples to obtain a new search distribution. Yet, the use of rankings also introduces several issues: the underlying optimization objective is often unclear, i.e., we do not optimize the expected fitness. Further, while these algorithms typically produce a high-quality mean estimate of the search distribution, the produced samples can have poor quality as these algorithms are ignorant of the regret. Lastly, noisy fitness function evaluations may result in solutions that are highly sub-optimal in expectation. In contrast, stochastic optimizers that are motivated by policy gradients, such as the Model-based Relative Entropy Stochastic Search (MORE) algorithm, directly optimize the expected fitness function without the use of rankings. MORE can be derived by applying natural policy gradients and compatible function approximation, and uses information-theoretic constraints to ensure the stability of the policy update. While MORE does not suffer from the listed limitations, it often cannot achieve state-of-the-art performance in comparison to ranking-based methods. We improve MORE by (i) decoupling the update of the mean and covariance of the search distribution, allowing for more aggressive updates of the mean while keeping the covariance update conservative, (ii) an improved entropy scheduling technique based on an evolution path, which results in faster convergence, and (iii) a simplified and more effective model learning approach in comparison to the original paper. We compare our algorithm to state-of-the-art black-box optimization algorithms on standard optimization tasks as well as on episodic RL tasks in robotics, where it is also crucial to achieve small regret. We obtain competitive results on benchmark functions and clearly outperform ranking-based methods in terms of regret on the RL tasks.
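To make the decoupled update concrete, the sketch below shows a MORE-style closed-form Gaussian update against a quadratic surrogate of the fitness, with separate trust-region multipliers for the mean and the covariance. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the fixed multipliers eta_mu and eta_sigma, and the least-squares surrogate fit are illustrative choices; in the actual algorithm the multipliers are obtained by minimizing a convex dual so that the per-component KL (and entropy) bounds hold with equality.

```python
# Minimal sketch (assumed, not the paper's code): MORE-style Gaussian update
# with a quadratic surrogate and *separate* multipliers for mean and covariance.
import numpy as np

def fit_quadratic_surrogate(X, y):
    """Least-squares fit of f(x) ~= x^T R x + r^T x + r0 with symmetric R."""
    n, d = X.shape
    feats = [np.ones(n)]                       # constant term r0
    feats += [X[:, i] for i in range(d)]       # linear terms r
    for i in range(d):                         # quadratic terms (upper triangle)
        for j in range(i, d):
            feats.append(X[:, i] * X[:, j])
    Phi = np.stack(feats, axis=1)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    r0, r, quad = w[0], w[1:1 + d], w[1 + d:]
    R = np.zeros((d, d))
    k = 0
    for i in range(d):
        for j in range(i, d):
            R[i, j] = R[j, i] = quad[k] if i == j else 0.5 * quad[k]
            k += 1
    return R, r, r0

def decoupled_update(mu, Sigma, R, r, eta_mu=5.0, eta_sigma=50.0):
    """Closed-form update; a small eta_mu gives aggressive mean steps,
    a large eta_sigma keeps the covariance update conservative."""
    Lam = np.linalg.inv(Sigma)   # old precision
    # Mean: maximize the surrogate under a KL trust region on the mean only.
    mu_new = np.linalg.solve(eta_mu * Lam - 2.0 * R, eta_mu * Lam @ mu + r)
    # Covariance: exponential-family interpolation with its own multiplier.
    Sigma_new = eta_sigma * np.linalg.inv(eta_sigma * Lam - 2.0 * R)
    return mu_new, Sigma_new

# Toy usage on a concave fitness f(x) = -||x - 1||^2 with noisy evaluations.
rng = np.random.default_rng(0)
mu, Sigma = np.zeros(2), np.eye(2)
for _ in range(20):
    X = rng.multivariate_normal(mu, Sigma, size=64)
    y = -np.sum((X - 1.0) ** 2, axis=1) + 0.1 * rng.standard_normal(64)
    R, r, _ = fit_quadratic_surrogate(X, y)
    mu, Sigma = decoupled_update(mu, Sigma, R, r)
print(mu)  # approaches the optimum at (1, 1)
```

Because the covariance multiplier is kept large, the search distribution contracts slowly while the mean moves quickly toward the surrogate optimum, which is the intent of the decoupled trust regions described in the abstract; the entropy scheduling and evolution-path mechanism of the paper are not reproduced here.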