Record

 
 
  Policy gradient methods for machine learning

Peters, J., Theodorou, E., & Schaal, S. (2007). Policy gradient methods for machine learning.


External references


Creators

Creators:
Peters, J. (1, 2), Author
Theodorou, E., Author
Schaal, S., Author
Affiliations:
(1) Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society, ou_1497795
(2) Dept. Empirical Inference, Max Planck Institute for Intelligent Systems, Max Planck Society, ou_1497647

Content

Keywords: -
Abstract: We present an in-depth survey of policy gradient methods as they are used in the machine learning community for optimizing parameterized, stochastic control policies in Markovian systems with respect to the expected reward. Despite having been developed separately in the reinforcement learning literature, policy gradient methods employ likelihood ratio gradient estimators, as also suggested in the stochastic simulation optimization community. It is well known that this approach to policy gradient estimation traditionally suffers from three drawbacks, i.e., large variance, a strong dependence on baseline functions, and an inefficient gradient descent. In this talk, we will present a series of recent results that tackle each of these problems. The variance of the gradient estimation can be reduced significantly through recently introduced techniques such as optimal baselines, compatible function approximations, and all-action gradients. However, as even the analytically obtainable policy gradients perform unnaturally slowly, the step from 'vanilla' policy gradient methods towards natural policy gradients was required in order to overcome the inefficiency of the gradient descent. This development resulted in the Natural Actor-Critic architecture, which can be shown to be very efficient when applied to motor primitive learning for robotics.
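
To connect the terms used in the abstract, the following is a minimal worked sketch of the likelihood ratio ("vanilla") policy gradient with a variance-reducing baseline, and of the natural gradient behind the Natural Actor-Critic. The notation (theta, pi_theta, tau, R(tau), b, F_theta) is generic illustration, not taken from the record; the single scalar optimal baseline shown is one common variant, per-parameter baselines being another.

% Likelihood ratio ("REINFORCE") estimator of the policy gradient for a
% parameterized stochastic policy \pi_\theta, trajectory \tau, return R(\tau),
% and a constant baseline b:
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\,\bigl(R(\tau) - b\bigr) \right],
\qquad
\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t).

% The baseline leaves the estimator unbiased because
% \mathbb{E}[\nabla_\theta \log p_\theta(\tau)] = 0; the variance-minimizing
% ("optimal") scalar baseline weights the return by the squared score:
b^{*} = \frac{\mathbb{E}\bigl[\|\nabla_\theta \log p_\theta(\tau)\|^{2}\, R(\tau)\bigr]}
             {\mathbb{E}\bigl[\|\nabla_\theta \log p_\theta(\tau)\|^{2}\bigr]}.

% Natural policy gradient: precondition the vanilla gradient with the inverse
% Fisher information matrix of the policy distribution:
\widetilde{\nabla}_\theta J(\theta) = F_\theta^{-1}\,\nabla_\theta J(\theta),
\qquad
F_\theta = \mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)^{\top}\bigr].

Preconditioning with the inverse Fisher matrix removes the estimator's dependence on the particular policy parameterization, which is the source of the slow "vanilla" gradient descent that the abstract refers to.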

Details

Language(s):
Date: 2007-07
Publication status: Published
Pages: -
Place, publisher, edition: -
Table of contents: -
Type of review: -
Degree type: -

Event


Decision


Project information


Source
