This link was posted by jasondavies 18 hours ago on HN. It received 164 points and 49 comments.
This is beautiful. It sounds so simple and obvious that it's pretty amazing this algorithm did not exist yet.
reply
From HN comments:
This algorithm seems to resemble HyperLogLog (and all its variants), which is also cited in the research paper. Using the same insight of the estimation value of tracking whether we've hit a "run" of heads or tails, but flipping the idea on its head (heh), it leads to the simpler algorithm described, which is about discarding memorized values on the basis of runs of heads/tails. This also works especially well (that is, efficiently) in the streaming case, allowing you to keep something resembling a "counter" for the distinct elements, albeit with a error rate.
The benefit of HyperLogLog is that it behaves similarly to a hash set in some respects -- you can add items, count distinct them, and, importantly, merge two HLLs together (union), all the while keeping memory fixed to mere kilobytes even for billion-item sets. In distributed data stores, this is the trick behind Elasticsearch/OpenSearch cardinality agg, as well as behind Redis/Redict with its PFADD/PFMERGE/PFCOUNT.
I am not exactly sure how this CVM algorithm compares to HLL, but they got Knuth to review it, and they claim an undergrad can implement it easily, so it must be pretty good!
I hope one day SN will have this level of input in comments on topics that are unrelated to Bitcoin~~
reply