The learning agent
The agent comprises (1) a policy and (2) a learning algorithm, i.e. everything that does not belong to the environment. The policy is a function approximator (e.g. a deep neural network, a polynomial, or a lookup table) that maps observations to actions. The learning algorithm iteratively updates the policy, using the agent’s actions, observations and rewards as inputs, with the objective of finding a policy that maximizes the cumulative reward over a large number of iterations. Policies can be learned via two kinds of components: actors and critics. The actor selects actions, while the critic evaluates (criticizes) them [ref]. RL agents can use a critic only (value-based agents), an actor only (policy-based agents), or both (actor-critic agents) to inform the agent’s policy. In general, value-based agents work best with discrete action spaces and can become computationally expensive for continuous actions; policy-based agents are simpler and better suited to continuous action spaces; actor-critic agents can handle both discrete and continuous action spaces.
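As a concrete illustration of the split between policy and learning algorithm, the sketch below implements a minimal value-based agent whose policy is a lookup table of Q-values updated by one-step Q-learning. It is only a sketch: the class and parameter names, the hyperparameter values, and the assumption of a small discrete observation and action space are illustrative choices, not part of this text.

```python
import random
from collections import defaultdict

class TabularQAgent:
    """Minimal value-based agent: the 'policy' is a lookup table of Q-values,
    and the 'learning algorithm' is the one-step Q-learning update."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.actions = actions          # discrete action set (assumed small)
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration rate
        self.q = defaultdict(float)     # lookup table: (observation, action) -> value

    def act(self, obs):
        """Policy: map an observation to an action (epsilon-greedy over Q-values)."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def learn(self, obs, action, reward, next_obs):
        """Learning algorithm: update the policy from one transition."""
        best_next = max(self.q[(next_obs, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(obs, action)] += self.alpha * (td_target - self.q[(obs, action)])
```

A training loop would repeatedly call `act`, step the environment, and pass the resulting transition to `learn`; an actor-critic agent would instead maintain a separately parameterized actor alongside such a critic.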
RL agents can be implemented with a variety of algorithms; some examples are listed below, and the sketch after the list contrasts the update rules of two of them (SARSA and Q-learning):
- State-action-reward-state-action (SARSA)
- Policy gradient (PG)
- Deep deterministic policy gradient (DDPG)
- Twin-delayed deep deterministic policy gradient (TD3)
- Proximal policy optimization (PPO)
- Asynchronous advantage actor-critic (A3C)
- Soft actor-critic (SAC)
- Q-learning
- Deep Q-network (DQN)
- Custom agents
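As an illustrative contrast between two of the listed algorithms, the sketch below shows the tabular update rules of SARSA (on-policy) and Q-learning (off-policy). The variable names (`q`, `alpha`, `gamma`, etc.) are assumptions made for the example; `q` is a table mapping (observation, action) pairs to values.

```python
def sarsa_update(q, obs, action, reward, next_obs, next_action, alpha, gamma):
    """SARSA bootstraps on the action the agent actually takes next (on-policy)."""
    td_target = reward + gamma * q[(next_obs, next_action)]
    q[(obs, action)] += alpha * (td_target - q[(obs, action)])

def q_learning_update(q, actions, obs, action, reward, next_obs, alpha, gamma):
    """Q-learning bootstraps on the greedy (maximizing) next action (off-policy)."""
    td_target = reward + gamma * max(q[(next_obs, a)] for a in actions)
    q[(obs, action)] += alpha * (td_target - q[(obs, action)])
```

The only difference is the bootstrap target: SARSA uses the value of the next action the policy actually selects, while Q-learning uses the value of the best available next action regardless of what the policy does.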