Hold on.
The first lesson in any probability theory class is the following.
Whenever you approach a problem, the first thing you should do is define a probability space. If the probability space is not defined, none of these concepts makes sense: there is no "probability", no "expectation", no "divergence", no "entropy", and no "Shannon entropy".
The probability space is not defined in the paper. If the choice of probability space is obvious, you can easily fill this gap and define it on the author's behalf.
They do define the probability space of 2^l possible hashes, which in the case of Bitcoin is 2^256.
This space is finite, so it is a discrete probability space. An elementary event in it is "hash h has been generated", where h ranges over all 2^l possible hashes. The probability of every elementary event is 2^-l, regardless of which hash h the event corresponds to.
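Just to make this concrete, here is a tiny Python sketch of my own (with a toy l = 4 instead of 256): every elementary event gets the same probability 2^-l, no matter how many leading zeros the hash has.

```python
# Toy model of the discrete probability space of l-bit hashes.
# l = 4 is an arbitrary illustrative choice; Bitcoin uses l = 256.
l = 4
omega = range(2 ** l)       # all possible hashes, the elementary events
p_elementary = 2 ** -l      # the same probability for every hash

# The probabilities sum to 1, and every hash, with or without leading
# zeros, has exactly the same probability 2^-l.
assert abs(sum(p_elementary for _ in omega) - 1.0) < 1e-12
for h in omega:
    print(f"h = {h:0{l}b}  P(h) = {p_elementary}")
```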
Now consider the event "a block with target T has been mined". It means a miner has calculated a hash h <= T. Any two hashes h1 and h2 that satisfy this inequality have the same probability, regardless of what they look like and how many leading zeros their binary representations have. They carry the same "information".
The probability P of mining a block on a single attempt in this setting is (T+1)/2^l. The average number of attempts until success is 1/P, that is, 2^l/(T+1). This number is called the "block difficulty" or "block weight". It is an unbiased estimate of the work performed by miners. That is what we need, and the story usually ends here.
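If you want to check the numbers, here is a quick simulation I would write (toy l = 16 and an arbitrary target T, so it runs fast): the per-attempt success probability comes out as (T+1)/2^l, and the empirical mean number of attempts per block comes out as 2^l/(T+1), i.e. the difficulty.

```python
import random

# Toy parameters, purely illustrative: l = 16-bit hashes, arbitrary target T.
l = 16
T = 1500
p = (T + 1) / 2 ** l             # probability that a single attempt succeeds
difficulty = 2 ** l / (T + 1)    # expected number of attempts per block, 1/p

def attempts_to_mine_one_block():
    """Draw uniform 'hashes' until one is <= T, counting the attempts."""
    attempts = 0
    while True:
        attempts += 1
        h = random.randrange(2 ** l)   # stand-in for a hash: uniform on {0, ..., 2^l - 1}
        if h <= T:
            return attempts

runs = 10_000
mean_attempts = sum(attempts_to_mine_one_block() for _ in range(runs)) / runs
print(f"per-attempt success probability P = {p:.6f}")
print(f"difficulty 2^l / (T + 1)          = {difficulty:.1f}")
print(f"empirical mean attempts per block = {mean_attempts:.1f}")
```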
The paper takes an extra step. It assumes that every hash carries its own "information" or "weight", and it attempts "to reduce the entropy", in other words, as I understand it, to reduce the standard deviation of the estimator around the estimated value. The fact is that when every hash gets the same weight, that deviation is already minimal.
I think that at some point in the paper, when the entropy was introduced, an unintentional switch to another probability space occurred. Once you start classifying hashes by their number of leading zeros, the part of the "information" carried by the remaining bits is thrown away. You get a new probability space, a new problem, and a new optimal unbiased estimator. However, that new solution might not be a solution to the original problem in the original probability space.
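To show what I mean by "the deviation is minimal", here is a sketch of my own, not the paper's scheme. Condition on a block being found, so the winning hash is uniform on {0, ..., T}. Any per-block weight w(h) whose expectation equals the difficulty is an unbiased estimate of the work behind that block. The constant weight w(h) = difficulty has zero spread, while a made-up weight based on leading zeros, rescaled to keep it unbiased, has the same mean but a strictly larger standard deviation.

```python
import random
import statistics

# Same toy parameters as above, purely illustrative.
l = 16
T = 1500
difficulty = 2 ** l / (T + 1)

def leading_zeros(h):
    """Number of leading zero bits in the l-bit representation of h."""
    return l - h.bit_length()

# Conditioned on a block being found, the winning hash is uniform on {0, ..., T}.
winners = [random.randrange(T + 1) for _ in range(100_000)]

# The usual constant weight: every winning hash counts as one difficulty.
const_weights = [difficulty] * len(winners)

# A made-up weight based on leading zeros, rescaled so that its exact
# expectation under the uniform-on-{0, ..., T} distribution equals the
# difficulty, i.e. it stays an unbiased estimate of the work per block.
exact_mean_raw = sum(2 ** leading_zeros(h) for h in range(T + 1)) / (T + 1)
scale = difficulty / exact_mean_raw
lz_weights = [scale * 2 ** leading_zeros(h) for h in winners]

print(f"constant weight:      mean = {statistics.mean(const_weights):.1f}, "
      f"stdev = {statistics.pstdev(const_weights):.1f}")
print(f"leading-zeros weight: mean = {statistics.mean(lz_weights):.1f}, "
      f"stdev = {statistics.pstdev(lz_weights):.1f}")
```

The rescaling step is what keeps the comparison fair: both weights target the same quantity, so the only difference you see is the extra spread you buy by letting the weight depend on the hash.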