
Entropy-guided exploration for better tactical capabilities #780

Closed
@kk2796

Description


I originally posted this idea in the Google Forum and was encouraged to echo it here. That thread lays out an intuitive case that using entropy to guide exploration can yield highly efficient search strategies that cannot be realized with "success likelihood" and "visit counts" alone. For the skeptics: feel free to peruse the original thread (or, to see something a bit more concrete, skim down in this thread to COOLEST POINT OF ENTIRE POST IMO).

I'm not going to spend any more time arguing for this change; I think the strongest case for it arises from exploring how it could/should work. On that note - I've taken a bit of time to explore the math of this idea, and I'd like to lay out a plausible implementation. I do not claim this is an optimal implementation.

First, it's necessary to define entropy in the context of PUCT search. This is the context where the search is in state "s" and must choose an action "a" using probabilities that align with the various U(s,a) values of the available actions. The chosen action advances the playout to the node associated with that action. There is already a good amount of information calculated and pulled together to compute U(s,a) for action a:

  • Q(s,a) - the average of all value-head evaluations from prior visits to the node at a and its descendants
  • cPuct - a constant multiplier that increases the impact of the exploration term
  • P(s,a) - the a-priori policy-head probability of action a
  • N(s,b) - the total number of times state s has been visited (i.e., the sum of visit counts over all actions b).
  • N(s,a) - the number of times action a has already been chosen (always <= N(s,b)).

To this mix, I introduce the term E(s,a). E(s,a) is the entropy associated with the subtree rooted at the node of action a. The entropy of a subtree is the entropy of the distribution of probabilities with which a playout from that subtree's root reaches each particular leaf node (either an unvisited node or a terminal "mate" node).
An algorithm for computing E(s,a) for subtree T:

  1. Enumerate every possible leaf node that can be reached by a playout from the root node of T.
  2. For each playout, compute the step-wise probability of each discrete action (that is, compare the magnitude of U(s',a') at each step to the sum of U(s',*) over all of that step's sibling actions).
  3. Compute the product of all the playout's step-wise probabilities (from root to leaf) to obtain the final probability of the playout.
  4. Repeat 2-3 for all possible/legal playouts enumerated in (1).
  5. Compute the entropy of this final set of probabilities, using the information-theory formula for the entropy of N probabilities, with logarithms taken base N:
    Entropy({p1, p2, p3, ... pN}) = (-1) ⋅ sum[n in 1..N]{ pn ⋅ logN(pn) }
  6. If there is only one possible playout (e.g. the first visit to a node, or a position with only one legal response), the set is the single probability {1}, and entropy is defined to be 0 in this case (in fact, {1} is the only probability set where entropy = 0). A code sketch of these steps follows.
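For concreteness, here is a minimal Python sketch of steps 1-6. This is not lc0 code: the Node class and its `u` field are hypothetical stand-ins for the real tree structure and the U(s',a') computation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical stand-in for the real tree node; `u` is the U(s',a')
    # value of the edge leading into this node.
    u: float = 1.0
    children: list = field(default_factory=list)

def playout_probs(node, prob=1.0):
    """Steps 1-4: probability of every root-to-leaf playout under `node`.

    Each child is weighted by its U value relative to its siblings
    (step 2); step-wise probabilities multiply along the path (step 3).
    """
    if not node.children:                      # unvisited or terminal node
        return [prob]
    total_u = sum(c.u for c in node.children)
    probs = []
    for child in node.children:
        probs.extend(playout_probs(child, prob * child.u / total_u))
    return probs

def subtree_entropy(node):
    """Steps 5-6: base-N entropy of the playout distribution, in [0, 1]."""
    probs = playout_probs(node)
    n = len(probs)
    if n <= 1:                                 # a single forced playout
        return 0.0
    return -sum(p * math.log(p, n) for p in probs)
```

A forced line (a single child at every depth) yields one playout with probability 1, so subtree_entropy returns 0, matching step 6.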

For those not familiar with the math here: this entropy calculation always returns a value between 0 and 1. The more uniform the probabilities, the closer the entropy is to 1; the more the probability mass concentrates on a few outcomes, the closer it is to 0.
To illustrate, here are a few sets of 4 probabilities ("log4" indicates log base 4):

  • low entropy: {0.01, 0.01, 0.01, 0.97} = (-1)⋅(0.01⋅log4(0.01) + 0.01⋅log4(0.01) + 0.01⋅log4(0.01) + 0.97⋅log4(0.97)) = 0.12
  • moderate entropy: {0.2, 0.3, 0.4, 0.1} = (-1)⋅(0.2⋅log4(0.2) + 0.3⋅log4(0.3) + 0.4⋅log4(0.4) + 0.1⋅log4(0.1)) = 0.92
  • high entropy: {0.25, 0.25, 0.25, 0.25} = (-1)⋅(0.25⋅log4(0.25) + 0.25⋅log4(0.25) + 0.25⋅log4(0.25) + 0.25⋅log4(0.25)) = 1.00
    Note: the final example is perfectly uniform and thus achieves the maximum possible entropy of exactly 1.
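These figures are easy to reproduce; a few lines of Python applying the same base-N formula confirm them:

```python
import math

def entropy(probs):
    """Entropy with logarithms taken base N, as defined above."""
    n = len(probs)
    return -sum(p * math.log(p, n) for p in probs)

print(round(entropy([0.01, 0.01, 0.01, 0.97]), 2))   # 0.12
print(round(entropy([0.2, 0.3, 0.4, 0.1]), 2))       # 0.92
print(round(entropy([0.25, 0.25, 0.25, 0.25]), 2))   # 1.0
```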

Before I move on to the next topic (what to do with E(s,a)), I should mention that I'm not entirely clear on a low-cost method for tracking entropy under this definition. In fact, this may be the largest barrier to implementation. However, there are two reasons to think it may not be a problem:

  1. as subtrees grow, the newfound bias toward exploring areas of low entropy should push entropy values to climb faster than they otherwise would by chance
  2. the manner in which I plan to incorporate E(s,a) into the overall U(s,a) computation causes a subtree's entropy to become less influential as the number of visits to that subtree goes up.

So - what do I propose we do with E(s,a)? No reason to draw it out:
Old formula, without E(s,a):
U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 / (1 + N(s,a))
New formula, with E(s,a):
U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 / (1 + N(s,a)⋅E(s,a))
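As a minimal sketch of the change (the numbers here, Q = 0.1, cPuct = 1.5, P = 0.3, N(s,b) = 100, N(s,a) = 1, are made up for illustration), the edit is confined to the denominator:

```python
import math

def u_with_entropy(q, cpuct, prior, n_parent, n_child, e_child):
    """Proposed U(s,a); passing e_child = 1 recovers today's formula exactly."""
    return q + cpuct * prior * math.sqrt(n_parent) / (1.0 + n_child * e_child)

print(u_with_entropy(0.1, 1.5, 0.3, 100, 1, 1.0))   # 2.35 (uniform subtree)
print(u_with_entropy(0.1, 1.5, 0.3, 100, 1, 0.0))   # 4.6  (forced line)
```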

Pausing for a quick FAQ:
Q: So, "voila" you've come up with a better search formula than Deepmind?
A: As has been the case in many other regards, I suspect Deepmind team's sole concern was to find a simple solution that was "good enough" to carry their thesis. I also suspect the benefits of entropy-bias are far higher depending on sharpness of game... so while this might've made a huge difference if A0 has looked at entropy for chess, it might have been negligible against Go and Shogi.
Q: Did... did you just shove the new E(s,a) factor onto the end and hope it would work?
A: Maybe. But hear me out: this winds up being pretty slick!
Q: Are you seriously wasting GitHub developers' time with cheesy fake dialog? Shouldn't you get on with this?
A: Ah - yeah, good point.

So, as I alluded to in the FAQ - while this may seem arbitrary at first blush, it truly is cool how well it works:

  1. It's 100% consistent with current behavior for unexplored nodes: both the visit count N(s,a) and the entropy E(s,a) are zero, so the denominator is unchanged: 1 + N(s,a)⋅E(s,a) = 1 + 0⋅0 = 1 + N(s,a).

  2. It's almost identical to current behavior when the probability distributions are nearly uniform. In that case, E(s,a) is very close to 1, and E(s,a)⋅N(s,a) =~= N(s,a). In other words, the math only lets the behavior deviate from typical PUCT search in areas of low entropy, as the rough numbers below illustrate.
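To put rough numbers on point 2 (the counts here are made up): with N(s,a) = 5 visits and a nearly uniform subtree of E(s,a) = 0.95, the denominator is 1 + 5⋅0.95 = 5.75, barely distinguishable from today's 1 + 5 = 6. With a sharp subtree of E(s,a) = 0.2, it is 1 + 5⋅0.2 = 2.0, tripling the exploration term for that action.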

COOLEST POINT OF ENTIRE POST IMO (if you read nothing else, read this)
In one circumstance where E(s,a) significantly impacts U(s,a) and deviates from the A0/Leela result, the E(s,a)-based result is not just more sensible than the current formula; it is so obviously more correct that the entire case for E(s,a) could stand on this example alone. The circumstance I'm alluding to is a chess move corresponding to action a for which there is exactly one legal response, a'.

Under the current/Leela/A0 approach to exploration, while action a is unvisited, N(s,a) = 0 and the exploration term has the small denominator (1+0), giving an overall exploration factor of N(s,b)^0.5 ⋅ 1/(1+0) = N(s,b)^0.5. In other words, for as long as action a remains unvisited, the likelihood of this unexplored node being played climbs with the square root of N(s,b). After action a is visited, N(s,a) will be 1, and the exploration factor is abruptly cut roughly in half, from N(s,b)^0.5 to (1+N(s,b))^0.5 ⋅ 1/(1+1) =~= N(s,b)^0.5 / 2.

Now consider what happens when E(s,a) is added to the equation. The U(s,a) value for the unvisited node is the same as without entropy: N(s,b)^0.5 ⋅ 1/(1+0⋅0) = N(s,b)^0.5. However, after action a is visited, if we compute the entropy of the resulting subtree, we find that the playout from a's node was forced to follow a 100%-probability path to the single reply a'. In other words, the subtree with a single forced move has E(s,a) = 0 after the visit, exactly as it did while unvisited.

The effect is clear: if an unvisited node leads to a forced response, visiting that node once has no diminishing effect on how good a candidate it is for a follow-up visit. And IMO, nothing could be closer to what our human intuition tells us: it would be absurd to evaluate a position, see that it leaves just one legal response, and not bother to consider what that response would be.
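To attach numbers (assuming, say, N(s,b) = 100): under the current formula, one visit to a cuts the exploration factor from 100^0.5 = 10 to roughly 10/2 = 5. With the entropy term, the denominator stays at 1 + 1⋅0 = 1 and the factor remains 10 until the forced reply is actually expanded and the subtree's entropy rises above zero.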

This is more a continuation of the forced-response example above. Consider: what if, instead of there being just one legal response, there is only one move that doesn't lose instantly? For example, perhaps the initial action corresponds to a pawn attacking a queen, with only one escape square open to the queen. In this case, the response "move queen to safety" won't have a "pure" 100% policy, but it certainly may have a policy that looks like {0.92, 0.01, 0.02, 0.01, 0.03, 0.01}. For once-visited nodes, these policy values become the playout probabilities for the next visit; in this example, the entropy of those probabilities comes out to 0.22.

For those who just read the forced-response example, it should already be evident that this yields a weaker version of the same effect. Instead of the denominator doubling from (1+0) = 1 to (1+1) = 2 (halving the exploration term), the low entropy subdues its rise: the denominator only increases from 1.0 to 1.22, and a follow-up visit to this queen-attack node will be only slightly less likely than the original visit.
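Checking that figure (log base 6, since there are six candidate moves):

```python
import math

policy = [0.92, 0.01, 0.02, 0.01, 0.03, 0.01]    # queen-escape policy above
n = len(policy)                                  # 6 moves -> log base 6
e = -sum(p * math.log(p, n) for p in policy)
print(round(e, 2))          # 0.22
print(round(1 + 1 * e, 2))  # denominator after one visit: 1.22
```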

There's actually quite a bit more I'd like to add... For example, I have dreamed up a much simpler metric than entropy that would capture a lot of the same perks - but I'll stop here for a breath and leave those for a follow-up.

Thanks to those who read this all the way through!
