Graduation date: 2007
The thesis focuses on model-based approximation methods for reinforcement
learning, with large-scale applications such as combinatorial optimization problems.
First, the thesis proposes two new model-based methods to stabilize
value-function approximation for reinforcement learning. The first is the
BFBP algorithm, a batch reinforcement learning process that iterates between
the exploration and exploitation stages of learning. For the
exploitation stage, the thesis investigates the feasibility and performance of
using more efficient offline algorithms such as linear regression, regression trees,
and SVMs as value-function approximators. The thesis finds that with systematic
local search methods such as Limited Discrepancy Search and a good
initial heuristic, the algorithm often converges faster and to a better level of
performance than epsilon-greedy exploration methods.
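To make the batch exploration/exploitation loop concrete, the following is a minimal sketch of one such iteration. It is illustrative only: the environment interface (reset, step, sample_action), the Monte Carlo return targets, and the use of scikit-learn's LinearRegression are assumptions made for the sketch, not details taken from the thesis; a regression tree or SVM regressor could be substituted for the regressor, and the exploration rule could be replaced by a systematic local search such as Limited Discrepancy Search.

import numpy as np
from sklearn.linear_model import LinearRegression

def explore(env, n_episodes):
    # Exploration stage (sketch): roll out episodes and record
    # (state-feature, Monte Carlo return) pairs as regression targets.
    X, y = [], []
    for _ in range(n_episodes):
        s, done = env.reset(), False
        visited, rewards = [], []
        while not done:
            a = env.sample_action()              # placeholder exploration rule
            s_next, r, done = env.step(a)
            visited.append(s)
            rewards.append(r)
            s = s_next
        returns = np.cumsum(rewards[::-1])[::-1]  # undiscounted returns-to-go
        X.extend(visited)
        y.extend(returns)
    return np.array(X), np.array(y)

def bfbp(env, n_iterations=10, n_episodes=50):
    value_fn = None
    for _ in range(n_iterations):
        X, y = explore(env, n_episodes)           # exploration stage
        value_fn = LinearRegression().fit(X, y)   # exploitation stage: offline regression
    return value_fn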
The second method combines linear programming with the kernel trick to
find value-function approximators for reinforcement learning. One formulation
is based on SVM regression; the second is based on the Bellman equation; and
the third seeks only to ensure that good moves have an advantage over bad
moves. All formulations attempt to minimize the number of support vectors
while fitting the data. The advantage of the kernel methods is that they can
easily adjust the complexity of the function approximator to fit the complexity
of the value function.
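As an illustration of how such a formulation can keep the number of support vectors small, one plausible form of the Bellman-equation-based variant (a sketch of the general idea; the exact constraints, slack handling, and notation in the thesis may differ) represents the value function as a kernel expansion and minimizes an L1 penalty on the expansion coefficients, subject to approximate Bellman constraints on sampled transitions:

\begin{align*}
  \hat{V}(s) &= \sum_i \alpha_i \, K(s_i, s) + b,\\
  \min_{\alpha,\, b,\, \xi}\;\; & \sum_i |\alpha_i| \;+\; C \sum_j \xi_j\\
  \text{s.t.}\;\; & \bigl|\, \hat{V}(s_j) - \bigl[ r_j + \gamma\, \hat{V}(s_j') \bigr] \,\bigr| \;\le\; \epsilon + \xi_j, \qquad \xi_j \ge 0 \quad \text{for all sampled transitions } (s_j, r_j, s_j').
\end{align*}

After splitting each |α_i| into a difference of nonnegative variables, this is a linear program; the L1 term drives most coefficients to zero, so only a few kernel centers (the support vectors) remain, which is how the approximator's complexity can adapt to the complexity of the value function.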
The thesis also proposes a model-based policy gradient reinforcement learning
algorithm. In our approach, we learn the models P(s′|s, a) and R(s′|s, a),
and then use dynamic programming to compute the value of the policy directly
from the model. Unlike online sampling-based policy gradient algorithms, this
approach does not suffer from high variance, and it also converges faster.
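The following tabular sketch illustrates the general idea of computing a policy's value and gradient from a learned model by dynamic programming. It is an assumption-laden illustration, not the thesis algorithm: it uses a softmax policy parameterization, collapses the learned reward model R(s′|s, a) into expected rewards R(s, a), and evaluates the policy exactly with a linear solve.

import numpy as np

def softmax_policy(theta):
    # theta: (S, A) logits -> pi: (S, A) action probabilities.
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_value_and_gradient(P, R, theta, mu, gamma=0.95):
    # P: (S, A, S) learned transition probabilities, R: (S, A) expected rewards,
    # mu: (S,) start-state distribution. All assumed tabular for this sketch.
    S, A = R.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)                  # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)    # exact policy evaluation (DP)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)    # discounted state occupancy
    # Policy-gradient theorem with softmax parameterization:
    # dJ/dtheta[s,a] = d(s) * pi(a|s) * (Q(s,a) - V(s))
    grad = d[:, None] * pi * (Q - V[:, None])
    return V, grad

Given maximum-likelihood estimates of P and R from observed transitions, the resulting gradient can be used in any standard gradient-ascent procedure, with no sampling noise beyond what is already in the learned model.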
In summary, the thesis proposed model-based approximation algorithms
for both value-function-based and policy-gradient reinforcement learning, with
promising application results on multiple problem domains and job-shop scheduling
benchmarks.