Probably a lot of people know about the tricks in HAKMEM for counting the number of nonzero bits in a machine word. And that in practice it's probably best to avoid multiplication, and either repeatedly remove the lowest nonzero bit with $$x \mathrel{\&\!=} x-1$$ (if few nonzeros are expected, as in bitboard chess programs) or do a table lookup on longer blocks of bits.

So what follows is not likely to be especially useful.

If one repeatedly removes the lowest nonzero bit, in words of length $$W$$, the worst case time is $$O(W)$$. If one does table lookup with a table of size $$n$$, the time is $$O(W/\log n)$$. Which could be slow if you're working on a machine with such huge words and such limited memory that $$W$$ is significantly larger than $$\log n$$, but that's not an accurate description of the computers we generally use nowadays.

It's also possible to count bits by masking, shifting, and adding, as exhibited in the HAKMEM code: if $$M$$ is a mask with the binary representation $$...01010101$$, and we view an input word $$x$$ as a sequence of $$W$$ $$1$$-bit numbers packed together, then $$y=(x\mathbin{\&} M)+((x\gg 1)\mathbin{\&} M)$$ can be viewed as a sequence of $$W/2$$ $$2$$-bit numbers, each formed by adding two of the $$1$$-bit numbers. Similarly if $$N$$ is a mask with binary representation $$\dots 001100110011$$ then $$z=(y\mathbin{\&} N)+((y\gg 2)\mathbin{\&} N)$$ can be viewed as a sequence of $$W/4$$ $$4$$-bit numbers, each of which counts the number of nonzeros in the corresponding $$4$$-bit block of $$x$$. So by repeatedly shifting and masking, we can count the nonzeros in longer and longer blocks of the binary representation of the input, until we have a single block giving the number of nonzeros in the whole input. This method takes time $$O(\log W)$$, and it is also a contender for practicality as it uses only fast shift, add, and mask operations, and doesn't use up as much of the memory cache as the table lookup idea.

The HAKMEM trick is to realize that, as this doubling idea progresses, the numbers stored in each block are considerably smaller than the maximum value representable in the number of bits used by the block. In fact, all these numbers are at most $$W$$, because that's the number of nonzeros in the whole input. So when we reach blocks of $$k=\log W$$ bits, each block is capable of storing the whole answer, and we can add the numbers in each block without fear of overflow. We can add all blocks together in a single operation, $$x=(x*B)\gg (W-k)$$ if $$B$$ has the binary representation $$\dots ...0^{k-1}10^{k-1}100^{k-1}1$$, where the superscripts mean to repeat the digit that many times. The actual HAKMEM code uses a division rather than a multiplication but the idea is similar. So this method takes time $$O(\log\log W)$$ (if one assumes constant time per operation as before) since that's the number of shift-and-add stages needed to reach the block size from which we can do the multiplication. But it involves multiplication as well as addition, shifts, and masks, which may be a significantly slower operation on modern architectures.

What I realized recently is that the HAKMEM multiplication trick can be extended to all stages of the shift-and-add idea. At any stage of the algorithm, one has blocks of length $$2^i$$, that store binary numbers counting the bits in the corresponding blocks of the input; thus, these values are themselves at most $$2^i$$ in magnitude but are stored in blocks capable of holding numbers up to $$2^{2^i}-1$$ in magnitude. By a constant number of shifts, adds, masks, and multiplies, one can add up groups of $$2^{2^i-i}$$ blocks at a time without overflowing into the adjoining blocks, forming new blocks of length $$2^{2^i}$$. Repeating this process finds the population count in time $$O(\log^* W)$$, where $$\log^* x$$ is the height of the shortest tower of powers $$2^{2^{2^{\dots}}}$$ that equals or exceeds $$x$$.

Generally when one sees log* in an algorithm analysis it means it's a very small factor that should be treated almost as if it were a constant. So it's unusual to see a situation where a $$\log^*$$ is beaten by a bound of the form $$O( \mbox{input length} / \log)$$. Maybe that's the right moral for this story: be careful of constant factors and input size assumptions, because sometimes the asymptotics doesn't tell you the right answer.

A smart compiler could unroll the loop, and replace x & masks[i] with bitmasking a register against hard coded values.