Released

Book Chapter

Policy Gradient Methods

MPS-Authors
/persons/resource/persons84135

Peters, J
Department Empirical Inference, Max Planck Institute for Biological Cybernetics, Max Planck Society;
Max Planck Institute for Biological Cybernetics, Max Planck Society;

Fulltext (public)
There are no public fulltexts stored in PuRe
Supplementary Material (public)
There is no public supplementary material available
Citation

Peters, J., & Bagnell, J. (2010). Policy Gradient Methods. In C. Sammut, & G. Webb (Eds.), Encyclopedia of Machine Learning (pp. 774-776). Berlin, Germany: Springer.


Cite as: https://hdl.handle.net/11858/00-001M-0000-0013-BD40-C
Abstract
A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by a variant of gradient descent. These methods belong to the class of policy search techniques, which maximize the expected return of a policy within a fixed policy class, in contrast with traditional value function approximation approaches that derive policies from a value function. Policy gradient approaches have several advantages: they allow domain knowledge to be incorporated straightforwardly into the policy parametrization, and an optimal policy is often more compactly represented than the corresponding value function; many such methods are guaranteed to converge to at least a locally optimal policy; and they naturally handle continuous states and actions, and often even imperfect state information. The countervailing drawbacks include difficulties in off-policy settings, potentially very slow convergence and high sample complexity, and convergence to local optima that are not globally optimal.
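As a purely illustrative sketch (not taken from the chapter itself), the snippet below shows the basic idea of directly ascending the expected return of a parametrized policy with a likelihood-ratio (REINFORCE-style) gradient estimate. The toy three-armed bandit, the softmax parametrization, the baseline, and all step sizes are assumptions made for this example.

```python
# Minimal policy gradient sketch (hypothetical toy example, not from the chapter).
# A softmax policy over the arms of a 3-armed bandit is improved by stochastic
# gradient ascent on the expected return, using the likelihood-ratio estimate
#   grad_theta log pi(a; theta) * (reward - baseline).

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # assumed expected reward of each arm
theta = np.zeros(3)                      # policy parameters: one preference per arm
baseline = 0.0                           # running reward average, reduces gradient variance
alpha = 0.1                              # step size (assumed)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # sample an action from the current policy
    reward = rng.normal(true_means[a], 0.1)  # stochastic reward from the environment

    # Gradient of log pi(a; theta) for a softmax policy: indicator(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    theta += alpha * grad_log_pi * (reward - baseline)  # stochastic gradient ascent step
    baseline += 0.05 * (reward - baseline)              # update the baseline

print("learned action probabilities:", softmax(theta))  # should concentrate on the best arm
```

Because the update only requires samples drawn from the current policy, this kind of estimator is naturally on-policy, which is one source of the off-policy difficulties and sample complexity mentioned in the abstract.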