One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path.
In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the one above quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions.
One representation for a learner's environment and goals is a Markov decision process or MDP. MDPs allow us to represent actions that have probabilistic outcomes, and to plan for complicated, temporally-extended goals. An MDP consists of a set of states that the environment can be in, together with rules for how the environment can change state and for what the learner is supposed to do.
One way to approach a large MDP is to try to compute an approximation to its optimal state evaluation function, the function which tells us how much reward the learner can expect to achieve if the world is in a particular state. If the approximation is good enough, we can use a shallow search to find a good action from most states. Researchers have tried many different ways to approximate evaluation functions. This thesis aims for a middle ground between algorithms that don't scale well because they use an impoverished representation for the evaluation function, and algorithms that we can't analyze because they use too complicated a representation.
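To make the evaluation-function idea concrete, here is a minimal sketch of the standard tabular approach: exact value iteration to compute the state evaluation function, followed by a one-step greedy lookahead (a shallow search) to choose actions. This is the baseline the abstract is contrasting with, not the approximate, scalable methods the thesis itself develops; it reuses the toy `P`, `R`, and `GAMMA` from the sketch above.

```python
def value_iteration(P, R, gamma, iters=100):
    """Compute the optimal state evaluation function V for a small tabular MDP
    by repeatedly applying the Bellman optimality backup."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: R[s] + gamma * max(
                 sum(p * V[s2] for p, s2 in outcomes)
                 for outcomes in P[s].values())
             for s in P}
    return V

def greedy_action(P, R, gamma, V, s):
    """Shallow (one-step) search: pick the action whose expected backed-up
    value under V is largest."""
    return max(P[s],
               key=lambda a: R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))

V = value_iteration(P, R, GAMMA)
print(greedy_action(P, R, GAMMA, V, "coffee_pot"))  # -> "go"
```

In a large MDP the table `V` cannot be enumerated, which is exactly why the thesis studies approximate representations of the evaluation function and how much structure such a representation can have before the resulting algorithms become impossible to analyze.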