A few things before we begin:
This all seems pretty simple, but it is worth sharing anyway as a primer on the subject (there may be a follow-up on the same subject)
I’ve done this as an exercise in deriving things from first principles (no internet, pick three textbooks and go away for a while)
I’ve primarily done this to solve a problem, and as with many problems, the solution involves touching computational grass, so my interest in this is limited to what can be applied directly
I work with hard-real-time hardware systems in my day-to-day, writing firmware, and such systems have a lot of constraints, namely:
There is not a lot of RAM
Many paths are bottlenecks to full system performance
Many paths have lower speed limits before quality-of-service issues cause functional problems
I’m interested in pulling a lot of data out of this system, but the above points are pretty annoying constraints. However, if we assume the following:
Measurements are independent from each other
Flows are (reasonably) deterministic
then we can spread the impact in at least one of two ways:
(1) Probabilistically measure all values at once along a certain flow
(2) Measure the values exactly and split those counters, running multiple versions along the same flow
This blog post is all about (1). (2) has shown itself to be very useful, but given what I do, that’s off limits.
Counters
A lot of the data I’d like to collect involves counting the frequency of events. This has a few interesting properties:
Although the values of the counters may be related, the act of measurement is independent
A counter is relatively small, but the number of counters is relatively large
Additionally, I’m only interested in order-of-magnitude differences in values.
Think of a counter as a function.
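A minimal sketch (in Python; the class and names are purely illustrative):

```python
# A plain counter: every event adds exactly one to the stored value,
# so the stored value is exactly the number of events seen so far.
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
```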
Obviously, it has a “derivative” of 1.
Consider another type of counter.
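Again as a minimal sketch (Python; the probability function p is deliberately left abstract):

```python
import random

# A probabilistic counter: an event only increments the stored value
# when an indicator variable fires, and the firing probability is some
# function p of the current value.
class ProbabilisticCounter:
    def __init__(self, p):
        self.p = p          # p(value) -> probability of incrementing
        self.value = 0

    def increment(self):
        if random.random() < self.p(self.value):
            self.value += 1
```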
This “derivative” depends on an indicator variable sampled from some probability distribution, which isn’t specified beyond the fact that it depends on the current value. The key difference here is that the stored value is no longer the number of times we tried to increment; it is whatever value the counter reaches after some number of trials. The relationship between the number of trials and the stored value is determined by the underlying distribution. Counters of this type are called Morris counters, and the algorithm is called the approximate counting algorithm.
Calculus is tempting, but let’s not use that for now. I’ve found the following two ways of reasoning about this to be effective: nested polynomials and Markov chains.
Nested Polynomials
Let’s define the following (for the case of log base 2):
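A trial is a “hit” (an increment) with probability 2^-c, where c is the current stored value, and a “miss” otherwise; the estimate of the true event count is 2^c - 1. (These are the standard base-2 Morris counter choices, and they are what all the sketches below assume.) In code:

```python
import random

# Standard base-2 Morris counter (assumed here): hit with probability
# 2**-c, and estimate the true number of events as 2**c - 1.
def morris_increment(c: int) -> int:
    return c + 1 if random.random() < 2.0 ** -c else c

def morris_estimate(c: int) -> float:
    return 2.0 ** c - 1
```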
Let’s also define a split, which is characterized by its total number of hits (H). Splits 3, 2, 1, and 0 are defined as follows (items in braces are to be interpreted as a sequence of events in the order listed):
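For three trials, for instance: split 3 is {hit, hit, hit}; split 2 is {hit, hit, miss}, {hit, miss, hit}, {miss, hit, hit}; split 1 is {hit, miss, miss}, {miss, hit, miss}, {miss, miss, hit}; and split 0 is {miss, miss, miss}. The same enumeration as an illustrative sketch in code:

```python
from itertools import product

# Group every hit/miss sequence of three trials by its total number of
# hits H, giving splits 3, 2, 1 and 0.
for H in (3, 2, 1, 0):
    split = [seq for seq in product(("hit", "miss"), repeat=3)
             if seq.count("hit") == H]
    print(f"split {H}: {split}")
```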
Each hit is an increment by one, so interpreting the splits as products (path probabilities) and a summation should provide us with an expected value. From a probability perspective, it does not matter where in the sequence the hits occur (the k-th hit always happens at the same stored value), so all of the “hit” factors can be pulled out to the front:
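For three trials, with the base-2 probabilities assumed above: P(split 3) = (1 · 1/2 · 1/4) = 1/8; P(split 2) = (1 · 1/2) · (3/4 + 1/2 + 0) = 5/8; P(split 1) = (1) · (1/2 · 1/2 + 0 + 0) = 1/4; and P(split 0) = 0. These sum to 1, and weighting each split’s probability by its number of hits gives an expected stored value of 3 · 1/8 + 2 · 5/8 + 1 · 1/4 = 15/8.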
The polynomial can be seen as a valid probability because the events are all disjoint from each other (each addition represents a unique path in the expansion, and all paths are disjoint by definition). This generalizes to nested summations over all possible splits for an arbitrary number of trials. I’ve done this for my own purposes, but the pattern should be self-evident.
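The same nested summation can be brute-forced by walking every disjoint hit/miss path, which makes for a handy sanity check (a sketch, again assuming the base-2 probabilities; exact rationals sidestep floating-point issues for small n):

```python
from fractions import Fraction
from itertools import product

# Expected stored value after n trials, summed over every disjoint
# hit/miss path, with a hit at value c having probability 2**-c.
def expected_value_brute(n: int) -> Fraction:
    total = Fraction(0)
    for path in product((True, False), repeat=n):   # True = hit, False = miss
        value, prob = 0, Fraction(1)
        for hit in path:
            p_hit = Fraction(1, 2 ** value)
            prob *= p_hit if hit else 1 - p_hit
            if hit:
                value += 1
        total += prob * value
    return total

print(expected_value_brute(3))   # 15/8, matching the worked example above
```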
Markov Chain
Another approach is with Markov chains. The transition matrix is defined as follows:
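For the base-2 counter assumed above, from state i the chain moves to state i + 1 with probability 2^-i and stays put otherwise. As a sketch:

```python
from fractions import Fraction

# (n + 1) x (n + 1) transition matrix over counter values 0..n:
# from state i, move to i + 1 with probability 2**-i, stay otherwise.
def transition_matrix(n: int):
    P = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        p_hit = Fraction(1, 2 ** i)
        if i < n:
            P[i][i + 1] = p_hit
            P[i][i] = 1 - p_hit
        else:
            # Value n is only reachable on the last trial, so an
            # absorbing top row keeps the matrix stochastic harmlessly.
            P[i][i] = Fraction(1)
    return P
```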
The initial state is the vector with only the first element set to 1 and all other values 0. We need to know up-front how many trials we will run, since that fixes some constants (the number of trials is denoted by “n”):
The matrix is of size (n + 1) x (n + 1): the counter starts at a known constant value and can be incremented at most n times, giving n + 1 reachable states
The transition matrix is applied n times, once per trial
Each element of the resulting vector is multiplied by its position (the counter value of that state) and the results are summed to produce the expected value.
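Putting those steps together, reusing transition_matrix from the sketch above:

```python
# Expected stored value after n trials, via the Markov chain.
def expected_value_markov(n: int) -> Fraction:
    P = transition_matrix(n)
    state = [Fraction(1)] + [Fraction(0)] * n   # all mass on value 0
    for _ in range(n):                          # one application per trial
        state = [sum(state[i] * P[i][j] for i in range(n + 1))
                 for j in range(n + 1)]
    # Weight each probability by its state's counter value and sum.
    return sum(i * p for i, p in enumerate(state))

print(expected_value_markov(3))   # 15/8 again, agreeing with the brute force
```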
Final Notes
Both of these approaches are different lenses onto fundamentally the same concept. The matrix approach encodes extra information, but in a structure that we all know and love (matrices). All of the algorithms listed have numerical stability issues that may be addressable with more thought; I’ve just thrown arbitrary-precision rational arithmetic libraries at the problem. I may have a follow-up blog post reasoning about the variance, but I need better performance on the numerics side (mostly accuracy with floating-point numbers) before I venture into that (I’ve done work on that front, but the numerics aren’t consistent with the theory, so I’m hesitant to publish it as-is).