Fix for ssdeep/ppdeep hash mismatches #4
+587
−8
Hey Marcin,
First off, thanks for your dnstwist work and for making this simplified ssdeep library!
I was doing some research and testing to ensure parity between the `ssdeep` and `ppdeep` libraries for a project and stumbled on a subtle difference in hashes between the two. The result of that research is this PR, which consists of a Dockerfile to emulate the testing environment, the script used to find, generate, and test those edge cases, and the fix itself.

In full disclosure, Claude Opus 4.5 was used to build out the script and eventually the fix for the library. At this time, I'm not going to pretend I understand how any of it works under the hood. With that being said, here's what I found.
First I had a script created to test various simple string cases. It passed at first without any issues. I then imported both libraries into a different project and added code to notify me if the hashes were different. This larger corpus of string and binary data uncovered cases where the hashes would be off by 1-2 characters and thus kicked off my investigation as to why.
I leveraged AI to add functionality to the script to find, generate, and test the inputs that made the two libraries produce different hashes.
By looping those findings and prompting the AI to inspect both the ppdeep and ssdeep (this one: https://github.com/DinoTools/python-ssdeep) repos, I eventually got it to settle on the solution found in this PR. Below is the testing environment walkthrough, and the AI-generated summary of the bugs is at the end.
Testing Environment
Edit the Dockerfile to swap the ppdeep versions
Generate random byte strings of length 128, 256, etc.
Find file mismatches
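For anyone who wants to reproduce the mismatch search without the Dockerfile, the idea behind the steps above can be sketched roughly as follows. This is not the script from this PR: `find_mismatches` and its parameters are hypothetical names, and the two hash functions are passed in as plain callables so the sketch does not depend on either library being installed.

```python
import random


def find_mismatches(hash_a, hash_b, lengths=(128, 256, 512, 1024), trials=100, seed=0):
    """Feed the same random byte strings to two hash functions and
    collect every input on which they disagree.

    hash_a / hash_b: callables taking bytes and returning a hash.
    (Hypothetical helper for illustration, not the PR's script.)
    """
    rng = random.Random(seed)  # fixed seed so failing inputs are reproducible
    mismatches = []
    for length in lengths:
        for _ in range(trials):
            # Build a random byte string of the requested length.
            data = bytes(rng.getrandbits(8) for _ in range(length))
            a, b = hash_a(data), hash_b(data)
            if a != b:
                mismatches.append((data, a, b))
    return mismatches


# Possible usage, assuming both packages are installed and each exposes
# a hash() function that accepts bytes:
#
#   import ssdeep, ppdeep
#   for data, expected, got in find_mismatches(ssdeep.hash, ppdeep.hash):
#       print(len(data), expected, got)
```

Keeping the seed fixed means a mismatching input can be regenerated later and turned into a regression test.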
Summary of Bugs Found in ppdeep.py
Bug 1: Missing 32-bit Unsigned Integer Overflow Masking
Location: Roll hash computation in `_spamsum()` function
Problem: In C's ssdeep, the roll hash components (`h1`, `h2`, `h3`) are `uint32_t` (32-bit unsigned integers). When subtraction produces a "negative" result, C's unsigned arithmetic wraps around (e.g., `-1` becomes `0xFFFFFFFF`). Python integers have arbitrary precision with no overflow, causing different roll hash values and triggers at wrong byte positions.
Fix: Add `& 0xFFFFFFFF` masking to all roll hash operations.

Bug 2: Missing Final Character When `rh == 0`
Location: End-of-stream logic in `_spamsum()` function
Problem: ppdeep only appended the final hash character when `rh != 0`. But the C code has TWO paths:
- `h != 0`: use current hash value
- otherwise (`h == 0`): use the stored character, `digest[dindex]`

ppdeep completely ignored the second path.
Fix: Track the last character stored at each reset point (`last_char1`, `last_char2`) and use them when `rh == 0`.

Bug 3: Duplicate Final Character
Location: Reset point logic in `_spamsum()` function
Problem: After storing a character at a reset point and appending it to the hash string, the stored character wasn't cleared. This caused it to be appended again at the end of processing.
Fix: Clear `last_char1` and `last_char2` after successfully appending them.

Root Cause
These bugs stem from ppdeep being a pure-Python port that didn't fully replicate C's low-level behavior:
- `digest[dindex]` (stored character) vs computed hash
- `halfdigest` behavior that clears after `dindex` exceeds half length
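To make Bugs 1 and 2 a bit more concrete, here is a minimal sketch of what emulating the C behavior in Python can look like. It is modeled on the rolling hash in the reference C implementation and is not the code from this PR: `MASK32`, `RollState`, and `final_char` are hypothetical names chosen for illustration.

```python
MASK32 = 0xFFFFFFFF   # keeps a Python int inside the uint32_t range
ROLLING_WINDOW = 7    # window size used by the reference rolling hash
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


class RollState:
    """Rolling hash whose components wrap like C uint32_t values."""

    def __init__(self):
        self.window = [0] * ROLLING_WINDOW
        self.h1 = self.h2 = self.h3 = 0
        self.n = 0

    def update(self, c):
        # Every assignment is masked so results wrap modulo 2**32,
        # the way unsigned arithmetic does in C.
        self.h2 = (self.h2 - self.h1 + ROLLING_WINDOW * c) & MASK32
        self.h1 = (self.h1 + c - self.window[self.n]) & MASK32
        self.window[self.n] = c
        self.n = (self.n + 1) % ROLLING_WINDOW
        # Without the mask, h3 would grow by 5 bits per input byte.
        self.h3 = ((self.h3 << 5) ^ c) & MASK32

    def sum(self):
        return (self.h1 + self.h2 + self.h3) & MASK32


def final_char(roll_sum, block_hash, stored_char):
    """End-of-stream choice with both paths described under Bug 2.

    stored_char stands in for the character remembered at the last
    reset point; it may be "" when nothing is pending (see Bug 3).
    """
    if roll_sum != 0:
        return B64[block_hash % 64]   # path 1: current hash value
    return stored_char                # path 2: stored character


# Wraparound as described under Bug 1: -1 becomes 0xFFFFFFFF.
assert (0 - 1) & MASK32 == 0xFFFFFFFF
```

Masking on every assignment, rather than only when the sum is read, keeps the intermediate values identical to what a C debugger would show, which makes side-by-side comparison with the reference implementation easier.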