
Q learning implementation #22

Open
mohanasrujana wants to merge 1 commit into main from 21-qlearning-implementation

Conversation

@mohanasrujana
Collaborator

The MCTSNodeState class

Its legal_actions() method currently returns the full set of player moves (CALL, RAISE, FOLD) unconditionally, but we must refine this so that it alternates between “nature” actions (blinds, dealing the flop/turn/river) and valid player options, respecting bet sizes and turn order.
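One way to structure that refinement is to branch on whether the state is a chance node. A minimal sketch, assuming a flat `Action` enum and explicit state fields (`is_chance_node`, `street`, `player_stack`, `current_bet` are my own names, not the repo's API):

```python
from enum import Enum, auto

class Action(Enum):
    CALL = auto()
    RAISE = auto()
    FOLD = auto()
    # "Nature" actions: chance events, not player choices
    DEAL_FLOP = auto()
    DEAL_TURN = auto()
    DEAL_RIVER = auto()

# Streets indexed 0 = preflop, 1 = flop, 2 = turn (assumption)
_NATURE_BY_STREET = {0: Action.DEAL_FLOP, 1: Action.DEAL_TURN, 2: Action.DEAL_RIVER}

def legal_actions(is_chance_node, street, player_stack, current_bet):
    """Alternate between nature moves and valid player options."""
    if is_chance_node:
        # Nature's only move is to deal the next street
        return [_NATURE_BY_STREET[street]]
    actions = [Action.CALL, Action.FOLD]
    # RAISE is legal only if the player can put in more than the current bet
    if player_stack > current_bet:
        actions.append(Action.RAISE)
    return actions
```

The key point is that the method's return value depends on whose turn it is, so the tree interleaves chance nodes with decision nodes instead of always offering all three player moves.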

We should implement at least these three methods:

- `next_state(action)`, which returns the successor state produced by a chosen action
- `is_terminal()`, which recognizes when play has ended, either by an all-in showdown or by a fold
- `rollout()`, which simulates a random continuation of the hand to produce a reward for backpropagation.
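A minimal sketch of those three placeholders, assuming a history-based state with `folded`/`all_in` flags (all field names are mine, and the hand-length cutoff and ±1 reward are stand-ins for real pot and hand-strength logic):

```python
import random

class MCTSNodeState:
    """Sketch of the three placeholder methods; fields are assumptions."""

    def __init__(self, history=(), folded=False, all_in=False):
        self.history = tuple(history)
        self.folded = folded
        self.all_in = all_in

    def legal_actions(self):
        # Placeholder; see the refinement discussed above
        return ["CALL", "RAISE", "FOLD"]

    def next_state(self, action):
        # Real version would also update pot, stacks, and dealt cards
        return MCTSNodeState(self.history + (action,),
                             folded=(action == "FOLD"),
                             all_in=self.all_in)

    def is_terminal(self):
        # Hand ends on a fold or an all-in showdown; the length cutoff
        # is a stand-in for "all betting rounds complete"
        return self.folded or self.all_in or len(self.history) >= 4

    def rollout(self):
        # Random playout to a terminal state; returns a placeholder reward
        state = self
        while not state.is_terminal():
            state = state.next_state(random.choice(state.legal_actions()))
        return -1.0 if state.folded else 1.0  # stand-in for hand evaluation
```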

Each MCTSNode wraps one of these states and tracks per-action statistics: a Q-value estimate and a visit count. When expanding, a node picks one untried legal action, calls the state's transition function, and adds the resulting child to its children mapping. The best_child(c_param) method applies the classic UCT formula, balancing exploration and exploitation by combining each action's Q-value with an exploration bonus proportional to the square root of the log of parent visits over that action's visits. During backpropagation, we update Q-values with a temporal-difference style rule: increment by α times the difference between the observed reward plus discounted future value and the old Q-estimate, then propagate that update up to the root, incrementing visit counts along the way.
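The two formulas in that paragraph can be written as small pure functions; the names and default constants below are my own, not the repo's API:

```python
import math

def uct_score(q, n_action, n_parent, c_param=1.4):
    """Classic UCT: Q-value plus an exploration bonus."""
    if n_action == 0:
        return float("inf")  # untried actions are always explored first
    return q + c_param * math.sqrt(math.log(n_parent) / n_action)

def td_backup(q_old, reward, future_value, alpha=0.1, gamma=0.95):
    """TD-style update described above: Q += alpha * (r + gamma * V' - Q)."""
    return q_old + alpha * (reward + gamma * future_value - q_old)
```

best_child would then return the action maximizing `uct_score` over the node's children, and backpropagation would apply `td_backup` at each ancestor on the path to the root.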

At the top level, the MCTSTree class orchestrates repeated search iterations. It begins by seeding the root node with the forced nature actions for the small and big blinds. Each iteration proceeds by selecting the most promising leaf according to UCT, expanding it if it is not already terminal, running a random rollout to a terminal game outcome, and then backpropagating the resulting reward. After a suitable number of iterations, calling best_action() on the root returns the player decision that was explored most often, which in practice corresponds to the statistically strongest move.
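The four phases of that iteration could be shaped like the following sketch; it assumes node methods `best_child`, `expand`, and `backpropagate` as described above, and the exact signatures are guesses:

```python
def search(root, n_iterations=1000):
    """One possible shape of the MCTSTree loop (names are assumptions)."""
    for _ in range(n_iterations):
        node = root
        # 1. Selection: descend via UCT until we reach a leaf
        while node.children and not node.state.is_terminal():
            node = node.best_child(c_param=1.4)
        # 2. Expansion: add one untried action if the leaf is non-terminal
        if not node.state.is_terminal():
            node = node.expand()
        # 3. Simulation: random rollout from the new node to a terminal outcome
        reward = node.state.rollout()
        # 4. Backpropagation: push the reward up to the root
        node.backpropagate(reward)
    return root.best_action()  # most-visited action at the root
```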

To complete this agent, our next steps are clear. We must implement the placeholder methods in MCTSNodeState so that the tree can accurately reflect poker dynamics: dealing cards, managing bet rounds, and evaluating hand strength at showdown. We will also expose the exploration constant (c_param), learning rate (α), and discount factor (γ) as configurable hyperparameters to facilitate experimentation. Finally, we need a comprehensive suite of unit and integration tests to verify that the tree grows correctly, that simulated rollouts yield sensible rewards, and that the chosen actions align with expected strategies against baseline opponents. Once these elements are in place, the MCTS agent will provide our poker bot with a powerful, statistically grounded decision-making capability.
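One lightweight way to expose those hyperparameters is a config object; `MCTSConfig` and its defaults here are hypothetical, purely to illustrate the shape:

```python
from dataclasses import dataclass

@dataclass
class MCTSConfig:
    """Hypothetical container for the tunable hyperparameters above."""
    c_param: float = 1.4       # UCT exploration constant
    alpha: float = 0.1         # TD learning rate
    gamma: float = 0.95        # discount factor
    n_iterations: int = 1000   # search budget per decision
```

Passing a single config through MCTSTree keeps experiment sweeps (e.g. varying `c_param`) to a one-line change and makes test fixtures easy to parameterize.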

@mohanasrujana mohanasrujana linked an issue May 6, 2025 that may be closed by this pull request

