Q-Learning policy doesn't agree with Value/Policy Iteration

I am playing with pymdptoolbox. It has a built-in forest management problem, which generates a transition matrix P and a reward matrix R for a given number of states (the default is 3). Running QLearning, PolicyIteration and ValueIteration to find the optimal policy is straightforward. However, when I make the problem slightly more complicated by raising the number of states above 4 (from 5 onwards), only PI and VI return the same policy, while QL fails to find the optimal policy. This is surprising and puzzling. Can anyone help me understand why QL in this package behaves this way?

Looking at the source of QL (which uses epsilon-greedy), it ties the probability of taking the greedy action to the iteration number, i.e. prob = 1 - (1/log(n+2)), and the learning rate is (1/math.sqrt(n+2)). Is there any specific reason for tying the probability and the learning rate to the iteration number instead of making them independent parameters? (The code itself can be modified easily, though.)
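To see how quickly these two quantities change, the schedules quoted above can be tabulated directly. This is a minimal sketch, assuming the formulas exactly as quoted, that prob is the chance of picking the greedy action, and that the iteration counter n starts at 1; the function names are illustrative and not part of pymdptoolbox.

```python
import math


def greedy_probability(n):
    """Chance of taking the greedy action at iteration n (assumes n >= 1)."""
    return 1.0 - 1.0 / math.log(n + 2)


def learning_rate(n):
    """Step size applied to the Q-value update at iteration n."""
    return 1.0 / math.sqrt(n + 2)


# Tabulate both schedules at a few iteration counts.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(greedy_probability(n), 3), round(learning_rate(n), 3))
```

Under these assumptions the greedy probability rises only logarithmically, while the learning rate shrinks with the square root of n, so late updates change the Q-values far less than early ones.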

My biggest puzzle is why QL fails to find the optimal policy for such a vanilla problem. Thanks.

from mdptoolbox.mdp import ValueIteration, QLearning, PolicyIteration
from mdptoolbox.example import forest

Gamma = 0.99

# Numbers of forest states to test
states = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 50, 70, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

compare_VI_QI_policy = []  # True or False for each number of states
compare_VI_PI_policy = []

for state in states:
    P, R = forest(state)

    VI = ValueIteration(P, R, Gamma)
    PI = PolicyIteration(P, R, Gamma)
    QL = QLearning(P, R, Gamma)

    # run VI
    VI.run()

    # run PI
    PI.run()

    # run QL
    QL.run()

    # Record whether each pair of solvers produced the same policy
    compare_VI_QI_policy.append(QL.policy == VI.policy)
    compare_VI_PI_policy.append(VI.policy == PI.policy)

print(compare_VI_QI_policy)
print(compare_VI_PI_policy)
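For experimenting with the exploration rate and learning rate as independent parameters, a standalone tabular Q-learning loop may be easier to reason about than the library class. The sketch below is not pymdptoolbox code: `build_chain_mdp`, `value_iteration` and `q_learning` are hypothetical helpers, and the toy chain MDP (wait or cut, with a 0.1 chance of fire) and its rewards are assumptions loosely modeled on the forest problem. Here `epsilon` is the probability of a random action, and both it and `alpha` are held constant; the restart every 100 steps is also an assumption.

```python
import random


def build_chain_mdp(n_states, fire_prob=0.1, r_wait=4.0, r_cut=2.0):
    """Toy MDP: P[a][s][s2] and R[s][a]; action 0 = wait, 1 = cut (assumed layout)."""
    P = [[[0.0] * n_states for _ in range(n_states)] for _ in range(2)]
    R = [[0.0, 0.0] for _ in range(n_states)]
    for s in range(n_states):
        older = min(s + 1, n_states - 1)
        P[0][s][0] += fire_prob            # a fire resets the forest
        P[0][s][older] += 1.0 - fire_prob  # otherwise the forest ages
        P[1][s][0] = 1.0                   # cutting always resets the forest
        if 0 < s < n_states - 1:
            R[s][1] = 1.0                  # small reward for cutting early
    R[n_states - 1][0] = r_wait            # reward for keeping the oldest forest
    R[n_states - 1][1] = r_cut             # reward for cutting the oldest forest
    return P, R


def value_iteration(P, R, gamma, tol=1e-10):
    """Return the greedy policy once successive value estimates differ by < tol."""
    n = len(R)
    V = [0.0] * n
    while True:
        Q = [[R[s][a] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
              for a in range(2)] for s in range(n)]
        new_V = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return [q.index(max(q)) for q in Q]
        V = new_V


def q_learning(P, R, gamma, n_iter, epsilon, alpha, seed=0, restart_every=100):
    """Tabular Q-learning with constant epsilon and alpha (independent of n)."""
    rng = random.Random(seed)
    n = len(R)
    Q = [[0.0, 0.0] for _ in range(n)]
    s = rng.randrange(n)
    for step in range(1, n_iter + 1):
        if step % restart_every == 0:
            s = rng.randrange(n)           # jump to a random state now and then
        if rng.random() < epsilon:
            a = rng.randrange(2)           # explore
        else:
            a = Q[s].index(max(Q[s]))      # exploit the current estimate
        s2 = rng.choices(range(n), weights=P[a][s])[0]
        Q[s][a] += alpha * (R[s][a] + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return [q.index(max(q)) for q in Q]


P, R = build_chain_mdp(5)
print("VI policy:", value_iteration(P, R, 0.99))
print("QL policy:", q_learning(P, R, 0.99, n_iter=20000, epsilon=0.1, alpha=0.1))
```

Whether the learned policy matches the value-iteration one depends on the seed, the number of iterations and how often each state-action pair is visited, so the comparison is worth repeating over several settings.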

edited Nov 20 at 5:35 by Aqueous Carlos

asked Nov 20 at 5:27 by Chenyang

python q-learning markov
