Performance · 6 min read
Why my duplicate-detection was 100× slower than it should have been
I hit a funny performance cliff building musicmod, a self-hosted music player. Users can import their library from multiple sources, and inevitably end up with duplicates — same song, different rip, different filename. I wanted a "suggest duplicates" feature that doesn't rely on filename or tags (which lie), so I reached for acoustic fingerprinting: listen to the actual audio and compare.
Got it working. It was correct. It was also absurdly slow. Rematching a 1200-track library was projected to take hours. Here's why, and how moving one inner loop to numpy collapses it to ~20 seconds.
The mental model I started with (wrong)
When I hear "fingerprint comparison" my brain jumps to "hash comparison" — two fixed-size values, XOR them, done in nanoseconds. That's how perceptual image hashes (dHash, pHash) work: you get a 64-bit integer per image, and similarity is one popcount.
Audio fingerprints aren't like that.
What a chromaprint fingerprint actually is
Chromaprint (the library behind the fpcalc tool) doesn't emit one hash per song. It emits a time series: every ~125 ms of audio produces one 32-bit integer describing the dominant frequencies at that moment. A 4-minute song ends up as ~1900 of these integers in a row — call it a "frequency fingerprint stream."
song_A: [0x3f12, 0x3f12, 0x3f13, 0x2f13, 0x2f03, 0x2e03, ...] # ~1900 long
song_B: [0x3f11, 0x3f12, 0x3f13, 0x2f13, 0x2f13, 0x2e02, ...] # ~1900 long
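These streams come straight out of fpcalc. A hedged sketch of fetching one (helper names are mine; assumes fpcalc is on PATH and supports the -raw flag, which Chromaprint's builds ship with):

```python
import subprocess

def parse_raw(output: str) -> list[int]:
    # Pull the comma-separated integers out of fpcalc's "FINGERPRINT=..." line;
    # mask to 32 bits in case a build prints them as signed values.
    for line in output.splitlines():
        if line.startswith("FINGERPRINT="):
            return [int(v) & 0xFFFFFFFF for v in line.split("=", 1)[1].split(",")]
    raise ValueError("no FINGERPRINT line in fpcalc output")

def raw_fingerprint(path: str) -> list[int]:
    # One fpcalc invocation per file; -raw asks for the integer stream
    # instead of the compressed base64 form.
    out = subprocess.run(["fpcalc", "-raw", path],
                         capture_output=True, text=True, check=True).stdout
    return parse_raw(out)
```

The exact output format may vary by Chromaprint version, so treat the parsing as a sketch, not gospel.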
Why not a single hash? Because the whole point is to be robust to re-encoding. An MP3 at 128 kbps and a FLAC of the same source will produce nearly-identical streams, bit-for-bit, with a scattering of samples flipped. A single hash would collapse that similarity information. Keeping the stream means we can still see "90% of bits match" even when the encodings differ.
Similarity = counting bit differences, across the stream
To compare two fingerprints, you XOR corresponding samples and count how many bits differ. Fewer bits different → more similar. Normalize by total bits → a 0..1 score.
def similarity(a, b):
    n = min(len(a), len(b))
    bits_diff = 0
    for i in range(n):
        bits_diff += (a[i] ^ b[i]).bit_count()
    return 1.0 - bits_diff / (n * 32)
Plus one twist: one rip might have half a second of silence before the music starts; another might not. Without handling that, a[0] ("silence") gets compared to b[0] ("first beat") and you get nonsense. Solution: slide one stream against the other through a small offset window and take the best score.
ALIGN_WINDOW = 20  # ±2.5s at 8 samples/sec

def similarity(a, b):
    best = 0.0
    for offset in range(-ALIGN_WINDOW, ALIGN_WINDOW + 1):
        best = max(best, _hamming_score(a, b, offset))
    return best
This is correct. It is also dreadfully slow.
Why it's slow — the interpreter tax
Per pair, this code does:
- ~2000 samples per stream
- × 41 alignment offsets
- = ~82,000 XOR + bit_count operations
That's nothing for a C program. An XOR is one CPU cycle. A popcount is one instruction (POPCNT on any x86 from the last decade). 82,000 cycles @ 3 GHz = 27 microseconds.
In CPython it's ~10 milliseconds.
Where do the other ~370× go? Interpreter overhead. Every iteration of the Python for loop does ~20 bytecode ops behind the scenes: load the list, compute the index, subscript it twice, call __xor__, call bit_count, add to the accumulator, store. Each bytecode is dispatched through an interpreter loop and involves allocating a Python int box for the intermediate. The actual bit math is invisible under this pile of object-management.
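The dis module makes the pile visible. This is my illustration, not the post's code; it disassembles one iteration's worth of work:

```python
import dis

def one_iteration(a, b, i, acc):
    # The body of the hot loop: two subscripts, an XOR, a method call, an add.
    return acc + (a[i] ^ b[i]).bit_count()

# Each of these bytecodes goes through the interpreter's dispatch loop,
# and every intermediate is a heap-allocated Python int.
dis.dis(one_iteration)
```

Run it and count the rows: a dozen-plus dispatches for what a C compiler turns into a handful of machine instructions, before you even add the for-loop's own FOR_ITER machinery.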
Now multiply 10 ms/pair by how many pairs you need to compare:
- Comparing 25 newly-scanned tracks against 1127 existing ones: ~28k pairs × 10 ms = 4.5 minutes. (Yes, measured. Every source scan stalls for minutes before new dupes appear in the UI.)
- Rematching the whole library after a threshold change: 1127² / 2 ≈ 634k pairs × 10 ms = ~2 hours.
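The scaling arithmetic behind those two bullets, spelled out (a throwaway sketch using the post's numbers):

```python
tracks = 1127
pair_time_s = 0.010                            # ~10 ms per pair, measured

scan_pairs = 25 * tracks                       # new tracks vs. existing library
rematch_pairs = tracks * (tracks - 1) // 2     # every unordered pair

scan_minutes = scan_pairs * pair_time_s / 60       # minutes per source scan
rematch_hours = rematch_pairs * pair_time_s / 3600 # hours per full rematch
```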
Both numbers are wildly disproportionate to the "a few cycles per compare" mental model I started with.
The fix: stop iterating in Python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    best = 0.0
    for offset in range(-ALIGN_WINDOW, ALIGN_WINDOW + 1):
        x, y = _align(a, b, offset)
        diff = np.bitwise_xor(x, y)
        bits_diff = np.unpackbits(diff.view(np.uint8)).sum()
        score = 1.0 - bits_diff / (x.size * 32)
        best = max(best, score)
    return best
Pack each fingerprint once as a numpy.uint32 array on load. Compare with np.bitwise_xor — one C call, operates on the whole array in a tight SIMD loop. Count bits with np.unpackbits().sum() — also one C call. (On NumPy 2.x there's a prettier np.bitwise_count if you want it.)
Per pair: ~30 microseconds. ~300× faster than the Python version.
Full-library rematch over 1200 tracks: ~20 seconds instead of 2 hours.
The moral
"Hash comparison is fast" is true for the math. It's sometimes very wrong for the code you wrote. If your hot loop has a for i in range(n) and is doing something that feels like a single CPU instruction per iteration, you're paying the interpreter tax 100× over for the privilege of expressing it in Python.
Three things I'll take with me:
- When something feels asymptotically wrong, profile before rearchitecting. I was reaching for "run it on a GPU worker in a batch job" before I realized the per-pair cost itself was the bug. The algorithm was fine; the implementation was 100× too slow.
- "Numpy it" is not premature optimization when the inner loop is interpreted. It's an implementation fix that happens to look like a performance fix.
- Audio fingerprints are streams, not hashes. The robustness-to-re-encoding property comes from not collapsing the time dimension. Comparing them means bitwise Hamming over the stream, with a tiny alignment wiggle to handle leading silence.
The broader lesson is the same one that keeps coming up in this kind of systems work: the machine is embarrassingly fast, and most of our slowness is self-inflicted by how we're asking it to do the work. Finding those moments is half the fun.