Probabilistic search of prefixes vs random+sequential

This script compares two ways of searching for a single element (identified by its RIPEMD-160 hash) within a list of numbers. The idea is to see which of the two methods is faster and more efficient at finding that element.

To ensure a fair comparison, several simulations are run, always using the same rules:

- First, the list is divided into "blocks".
- The blocks are analyzed in random order.
- Within that list, the number we want to find is randomly selected.
- The process is run multiple times to gather statistics: we count how many checks each method makes and how many times each method "wins" (i.e., finds the target with the fewest checks).

Two methods:

1. Sequential Search:This method is very straightforward. It goes through each block from beginning to end, checking one by one if the element's hash matches the one we are looking for. It's like reviewing each page of a book without skipping a single one. Although it's simple and serves as a reference, if the element is at the end, the method ends up doing a lot of checking.

2. Prefix-Based Search:This method does something more clever. Although it also goes block by block in random order, before checking the entire block, it first checks if the hash of any element begins with the same first three characters as the hash we're searching for. In hexadecimal, each character has 16 options, which means that with three characters, you have 16³, or 4,096 possible combinations.

Ideally, if the block had 4,096 elements, each combination would appear, on average, only once. Therefore, if it finds an element that has the correct prefix, it focuses on that part instead of checking the entire block. This saves a lot of checking.

In short, while the sequential method checks each element unconditionally, the prefix-based method filters the input and focuses only on where it's most likely to find the target. Thus, at the end of all the simulations, an "average success rate" is calculated (total number of checks divided by the number of times each method won) to see which method uses the least effort to find the desired element.

The goal is to demonstrate that prefix-based search, thanks to this intelligent filtering, is generally more efficient than traditional sequential search, since fewer checks are performed when it wins.

script

Code:

import hashlib
import random

# Configuration
TOTAL_SIZE = 100_000
RANGE_SIZE = 5_000
PREFIX_LENGTH = 3
SIMULATIONS = 500
SECP256K1_ORDER = int("0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)

print(f"""
=== Configuration ===
Total numbers: {TOTAL_SIZE:,}
Block size: {RANGE_SIZE:,}
Prefix: {PREFIX_LENGTH} characters (16^{PREFIX_LENGTH} combinations)
Simulations: {SIMULATIONS}
secp256k1 Order: {SECP256K1_ORDER}
""")

def generate_h160(data):
    
    h = hashlib.new('ripemd160', str(data).encode('utf-8'))
    return h.hexdigest()

def shuffled_range(n):
   
    arr = list(range(n + 1))
    random.shuffle(arr)
    return arr

def sequential_search(dataset, block, target_hash, order):
   
    checks = 0
    for idx in order:
        start = idx * block
        end = start + block
        for i in range(start, end):
            checks += 1
            if generate_h160(dataset[i]) == target_hash:
                return {"checks": checks, "found": True}
    return {"checks": checks, "found": False}

def precise_search(dataset, block, prefix, target_hash, order):
    
    prefix_hash = target_hash[:prefix]
    checks = 0
    ranges = []
    for idx in order:
        start = idx * block
        end = start + block
        found_prefix = False
        for i in range(start, end):
            checks += 1
            h = generate_h160(dataset[i])
            if h == target_hash:
                return {"checks": checks, "found": True}
            if not found_prefix and h.startswith(prefix_hash):
                found_prefix = True
                ranges.append({"start": i + 1, "end": end})
                break

    for r in ranges:
        for i in range(r["end"] - 1, r["start"] - 1, -1):
            checks += 1
            if generate_h160(dataset[i]) == target_hash:
                return {"checks": checks, "found": True}
    return {"checks": checks, "found": False}

def compare_methods():
    results = {
        "sequential": {"wins": 0, "total_checks": 0},
        "precise": {"wins": 0, "total_checks": 0},
        "ties": 0
    }

    for i in range(SIMULATIONS):
        max_range = SECP256K1_ORDER - TOTAL_SIZE - 1
        random_offset = random.randrange(max_range)
        R = 1 + random_offset

        dataset = [R + i for i in range(TOTAL_SIZE)]
        target_num = random.choice(dataset)
        target_hash = generate_h160(target_num)
        blocks = TOTAL_SIZE // RANGE_SIZE
        order = shuffled_range(blocks - 1)

        #print(f"\nSimulation {i + 1}:")
        #print(f"\nRange= {hex(R)}:{hex(R+TOTAL_SIZE)}")
        #print(f"\nTarget= {target_hash}")
        seq_result = sequential_search(dataset, RANGE_SIZE, target_hash, order)
        pre_result = precise_search(dataset, RANGE_SIZE, PREFIX_LENGTH, target_hash, order)

        # Suma de checks globales
        results["sequential"]["total_checks"] += seq_result["checks"]
        results["precise"]["total_checks"] += pre_result["checks"]

        if seq_result["checks"] < pre_result["checks"]:
            results["sequential"]["wins"] += 1
        elif seq_result["checks"] > pre_result["checks"]:
            results["precise"]["wins"] += 1
        else:
            results["ties"] += 1

        print(f"Checks: Sequential = {seq_result['checks']} | Prefix = {pre_result['checks']}")

    # Calcular avg success rate
    avg_success_rate_sequential = (results["sequential"]["total_checks"] / results["sequential"]["wins"]
                                   if results["sequential"]["wins"] > 0 else float('inf'))
    avg_success_rate_precise = (results["precise"]["total_checks"] / results["precise"]["wins"]
                                if results["precise"]["wins"] > 0 else float('inf'))

    print(f"""
=== FINAL RESULTS ===
Wins:
Sequential: {results['sequential']['wins']}
Prefix: {results['precise']['wins']}
Ties: {results['ties']}
Total Checks:
Sequential: {results['sequential']['total_checks']}
Prefix: {results['precise']['total_checks']}
Average Success Rates:
Total Avg / Wins
Sequential(1 victory for each): {avg_success_rate_sequential:.2f}
Prefix(1 1 victory for each): {avg_success_rate_precise:.2f}
""")


if __name__ == '__main__':
    compare_methods()

Results

Code:

=== FINAL RESULTS ===
Wins:
Sequential: 194
Prefix: 286
Ties: 20
Total Checks:
Sequential: 26611276
Prefix: 25342870
Average Success Rates:
Total Avg / Wins
Sequential(1 victory for each): 137171.53
Prefix(1 1 victory for each): 88611.43

Let's analyze what these results show:

In a total of 500 simulations, the sequential method won 194 times, while the prefix-based method won 286 times (with 20 ties). This already tells us that, in more than half of the cases, the prefix method found the target using fewer checks than the sequential method.

In addition, the total number of verifications performed by each method was measured:

Code:

Sequential: 26,611,276 verifications

Code:

Prefixes: 25,342,870 verifications

Now, the key factor is the so-called "average success rate", which in this case is calculated by dividing the total verifications by the number of wins. This gives:

Code:

Sequential: 137,171.53 verifications per win

Code:

Prefixes: 88,611.43 verifications per win

This means that, on average, each time the prefix method "wins" (i.e.finds the target with fewer verifications), significantly less computational effort was used than in the case of the sequential method.

Why is this? The trick lies in the relationship between prefix size and block size. When using a 3-character prefix in hexadecimal, there are 16³ = 4096 possible combinations. If the block fits this size (or is close enough), each prefix combination will appear on average only once in the block.

Thus, instead of having to exhaustively check each element (as the sequential method does), the prefix-based method "gets ahead" by searching for prefix matches. This allows it to focus its efforts only on the part of the block where an exact match is most likely to be found, thus avoiding unnecessary checks.

In short, these numbers tell us that the prefix-based method is more efficient because, thanks to this smart filter, it reduces the number of checks required to find the target.

Recommendation: Reproduce the test with the number of simulations you deem appropriate.

This is a self-moderated post meant to maintain a strict conversation, one that is respectful and based on arguments supported by verifiable evidence in practice.