Performance · 6 min read

Why my duplicate-detection was 100× slower than it should have been

I hit a funny performance cliff building musicmod, a self-hosted music player. Users can import their library from multiple sources, and inevitably end up with duplicates — same song, different rip, different filename. I wanted a "suggest duplicates" feature that doesn't rely on filename or tags (which lie), so I reached for acoustic fingerprinting: listen to the actual audio and compare.

Got it working. It was correct. It was also absurdly slow. Rematching a 1200-track library was projected to take hours. Here's why, and how one change (moving the inner loop into numpy) collapses it to ~20 seconds.

The mental model I started with (wrong)

When I hear "fingerprint comparison" my brain jumps to "hash comparison" — two fixed-size values, XOR them, done in nanoseconds. That's how perceptual image hashes (dHash, pHash) work: you get a 64-bit integer per image, and similarity is one popcount.
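
For contrast, that image-hash mental model in code, with two made-up 64-bit dHash values:

h1 = 0x3a5f19c2d4e6b071  # hypothetical dHash of an image
h2 = 0x3a5f19c2d4e6b873  # hypothetical dHash of a near-duplicate
score = 1.0 - (h1 ^ h2).bit_count() / 64  # one XOR, one popcount → ~0.97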

Audio fingerprints aren't like that.

What a chromaprint fingerprint actually is

Chromaprint (the library that fpcalc ships with) doesn't emit one hash per song. It emits a time series: every ~125 ms of audio produces one 32-bit integer describing the dominant frequencies at that moment. A 4-minute song ends up as ~1900 of these integers in a row — call it a "frequency fingerprint stream."

song_A: [0x3f12, 0x3f12, 0x3f13, 0x2f13, 0x2f03, 0x2e03, ...]  # ~1900 long
song_B: [0x3f11, 0x3f12, 0x3f13, 0x2f13, 0x2f13, 0x2e02, ...]  # ~1900 long

Why not a single hash? Because the whole point is to be robust to re-encoding. An MP3 at 128 kbps and a FLAC of the same source will produce nearly identical streams: most samples match bit-for-bit, with a scattering of flipped bits. A single hash would collapse that similarity information. Keeping the stream means we can still see "90% of bits match" even when the encodings differ.
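
To peek at a real stream, fpcalc can print the uncompressed fingerprint as plain integers. A minimal loader sketch, assuming a build that supports the -raw flag (the filename is a stand-in):

import subprocess

# One integer per ~125 ms frame, comma-separated on the FINGERPRINT= line.
out = subprocess.run(["fpcalc", "-raw", "song_a.flac"],
                     capture_output=True, text=True, check=True).stdout
fp_line = next(line for line in out.splitlines() if line.startswith("FINGERPRINT="))
stream = [int(s) for s in fp_line.removeprefix("FINGERPRINT=").split(",")]
print(len(stream))  # ~1900 for a 4-minute song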

Similarity = counting bit differences, across the stream

To compare two fingerprints, you XOR corresponding samples and count how many bits differ. Fewer bits different → more similar. Normalize by total bits → a 0..1 score.

def similarity(a, b):
    # XOR each pair of samples, count the differing bits (Hamming distance),
    # and normalize to a 0..1 score over the overlapping region.
    n = min(len(a), len(b))
    bits_diff = 0
    for i in range(n):
        bits_diff += (a[i] ^ b[i]).bit_count()  # int.bit_count: Python 3.10+
    return 1.0 - bits_diff / (n * 32)
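
A quick sanity check with synthetic streams shows the dynamic range (made-up data, just to exercise the function): unrelated fingerprints land near 0.5, because random bits agree about half the time, so real duplicates stand out far above that floor.

import random

random.seed(1)
fp_a = [random.getrandbits(32) for _ in range(1900)]
fp_b = [s ^ (1 << random.randrange(32)) for s in fp_a]  # "re-encode": one bit flipped per sample
fp_c = [random.getrandbits(32) for _ in range(1900)]    # unrelated song

print(similarity(fp_a, fp_b))  # ≈ 0.97
print(similarity(fp_a, fp_c))  # ≈ 0.50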

Plus one twist: one rip might have half a second of silence before the music starts; another might not. Without handling that, a[0] ("silence") gets compared to b[0] ("first beat") and you get nonsense. Solution: slide one stream against the other through a small offset window and take the best score.

ALIGN_WINDOW = 20  # ±2.5s at 8 samples/sec

def _hamming_score(a, b, offset):
    # Score the overlap after sliding b by `offset` samples relative to a.
    x, y = (a[offset:], b) if offset >= 0 else (a, b[-offset:])
    n = min(len(x), len(y))
    return 1.0 - sum((x[i] ^ y[i]).bit_count() for i in range(n)) / (n * 32)

def similarity(a, b):
    best = 0.0
    for offset in range(-ALIGN_WINDOW, ALIGN_WINDOW + 1):
        best = max(best, _hamming_score(a, b, offset))
    return best

This is correct. It is also dreadfully slow.

Why it's slow — the interpreter tax

Per pair, this code does:

  - ~2,000 XOR + popcount steps per offset (one per sample in the overlap)
  - × 41 offsets in the alignment window
  - ≈ 82,000 single-cycle bit operations

That's nothing for a C program. An XOR is one CPU cycle. A popcount is one instruction (POPCNT on any x86 from the last decade). 82,000 cycles @ 3 GHz ≈ 27 microseconds.

In CPython it's ~10 milliseconds.

Where do the other ~370× go? Interpreter overhead. Every iteration of the Python for loop does ~20 bytecode ops behind the scenes: load the list, compute the index, subscript it twice, call __xor__, call bit_count, add to the accumulator, store. Each bytecode is dispatched through the interpreter loop, and every intermediate result gets boxed as a freshly allocated Python int. The actual bit math is invisible under this pile of object management.
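
You don't have to take the bytecode count on faith; the dis module will show the pile for one iteration's worth of work (a sketch; exact opcodes vary across CPython versions):

import dis

def step(a, b, i, acc):
    # One loop iteration's work: two subscripts, an XOR, a method call, an add.
    return acc + (a[i] ^ b[i]).bit_count()

dis.dis(step)  # prints a dozen-plus LOAD/BINARY/CALL opcodes for one line of math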

Now multiply 10 ms/pair by how many pairs you need to compare:

  - 1200 tracks → 1200 × 1199 / 2 ≈ 719,400 unique pairs
  - 719,400 pairs × 10 ms ≈ 2 hours for a full rematch

Both numbers (the 10 ms per pair and the 2 hours total) are wildly disproportionate to the "a few cycles per compare" mental model I started with.
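
The same arithmetic in three lines, if you want to double-check it:

n = 1200
pairs = n * (n - 1) // 2      # 719,400 unique pairs
print(pairs * 10e-3 / 3600)   # ≈ 2.0 at 10 ms per pair → about 2 hours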

The fix: stop iterating in Python

import numpy as np

def _align(a: np.ndarray, b: np.ndarray, offset: int):
    # Trim both arrays to the overlapping region after sliding b by `offset`.
    x, y = (a[offset:], b) if offset >= 0 else (a, b[-offset:])
    n = min(x.size, y.size)
    return x[:n], y[:n]

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    best = 0.0
    for offset in range(-ALIGN_WINDOW, ALIGN_WINDOW + 1):
        x, y = _align(a, b, offset)
        diff = np.bitwise_xor(x, y)                           # one C call, whole array
        bits_diff = np.unpackbits(diff.view(np.uint8)).sum()  # one C call, counts set bits
        score = 1.0 - bits_diff / (x.size * 32)
        best = max(best, score)
    return best

Pack each fingerprint once as a numpy.uint32 array on load. Compare with np.bitwise_xor — one C call, operates on the whole array in a tight SIMD loop. Count bits with np.unpackbits().sum() — also one C call. (On NumPy 2.x there's a prettier np.bitwise_count if you want it.)
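
For completeness, the load-time packing, reusing the stream list from the fpcalc sketch above (a minimal version; adjust if your build prints signed integers):

fp = np.array(stream, dtype=np.uint32)  # pack once at load time, not per comparison

# On NumPy >= 2.0, the unpackbits line inside similarity() can become:
#   bits_diff = np.bitwise_count(diff).sum()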

Per pair: ~30 microseconds. ~300× faster than the Python version.

Full-library rematch over 1200 tracks: ~20 seconds instead of 2 hours.
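
If you want to reproduce the per-pair number, a minimal timing harness over synthetic fingerprints (timings will vary by machine):

import timeit
import numpy as np

rng = np.random.default_rng(0)
fp_a = rng.integers(0, 2**32, size=1900, dtype=np.uint32)
fp_b = rng.integers(0, 2**32, size=1900, dtype=np.uint32)

per_pair = timeit.timeit(lambda: similarity(fp_a, fp_b), number=200) / 200
print(f"{per_pair * 1e6:.0f} µs per pair")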

The moral

"Hash comparison is fast" is true for the math. It's sometimes very wrong for the code you wrote. If your hot loop has a for i in range(n) and is doing something that feels like a single CPU instruction per iteration, you're paying the interpreter tax 100× over for the privilege of expressing it in Python.

Three things I'll take with me:

  1. When something feels asymptotically wrong, profile before rearchitecting. I was reaching for "run it on a GPU worker in a batch job" before I realized the per-pair cost itself was the bug. The algorithm was fine; the implementation was 100× too slow.
  2. "Numpy it" is not premature optimization when the inner loop is interpreted. It's a correctness-of-impl fix dressed up as a performance fix.
  3. Audio fingerprints are streams, not hashes. The robustness-to-re-encoding property comes from not collapsing the time dimension. Comparing them means bitwise Hamming over the stream, with a tiny alignment wiggle to handle leading silence.

The broader lesson is the same one that keeps coming up in this kind of systems work: the machine is embarrassingly fast, and most of our slowness is self-inflicted by how we're asking it to do the work. Finding those moments is half the fun.