@derpadoo derpadoo commented Feb 9, 2026

Hey Marcin,

First off, thanks for your dnstwist work and for making this simplified ssdeep library!

I was researching and testing parity between the ssdeep and ppdeep libraries for a project and stumbled on a subtle difference in the hashes they produce. The result is this PR, which consists of a Dockerfile to reproduce the testing environment, the script used to find, generate, and test the edge cases, and the fix itself.

In full disclosure, Claude Opus 4.5 was used to build out the script and eventually the fix for the library. At this time, I'm not going to pretend I understand how any of it works under the hood. With that being said, here's what I found.

First, I had a script created to test various simple string cases. It passed without any issues. I then imported both libraries into a different project and added code to notify me whenever the hashes differed. This larger corpus of string and binary data uncovered cases where the hashes were off by 1-2 characters, which kicked off my investigation into why.

I then used AI to extend the script so it could find inputs for which the 2 libraries produce different hashes, by:

  1. generating random byte strings
  2. scanning the files in the container

I fed those findings back to the AI and prompted it to inspect both the ppdeep and ssdeep (this one: https://github.com/DinoTools/python-ssdeep) repos, and it settled on the solution found in this PR. The testing environment walkthrough is below, and the AI-generated summary of the bugs is at the end.

Testing Environment

# Choose which version of ppdeep to use in the Dockerfile.

# 1) Current 20251115 ppdeep version
# RUN pip install ppdeep==20251115

# 2) Updated ppdeep PR
# COPY ./ppdeep.py .
# COPY ./setup.py .
# COPY ./README.md .
# RUN python setup.py install

# Build the container.
docker build -t ppdeep .

# Drop into a shell
docker run --rm -it ppdeep bash

# Run the comparison script.
python compare_ssdeep_ppdeep.py

Edit the Dockerfile to swap the ppdeep versions

# Re-build the container.
docker build -t ppdeep .

# Drop into a shell
docker run --rm -it ppdeep bash

# Run the comparison script.
python compare_ssdeep_ppdeep.py

Generate random byte strings of length 128, 256, etc.

python compare_ssdeep_ppdeep.py --random-test 128 --num-test 10000

python compare_ssdeep_ppdeep.py --random-test 256 --num-test 10000
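As a rough idea of what a random test like this might do, here is a minimal sketch. It is not the code from compare_ssdeep_ppdeep.py: the function name `find_random_mismatches` and its parameters are hypothetical, and the two hash functions are passed in as arguments so the sketch does not depend on either library being installed.

```python
import random


def find_random_mismatches(hash_a, hash_b, length, num_tests, seed=0):
    """Return the random inputs for which the two hash functions disagree.

    hash_a and hash_b are any callables taking bytes and returning a digest;
    the name and signature of this helper are illustrative only.
    """
    rng = random.Random(seed)  # seeded so a mismatch can be reproduced
    mismatches = []
    for _ in range(num_tests):
        # Build one random byte string of the requested length.
        data = bytes(rng.getrandbits(8) for _ in range(length))
        if hash_a(data) != hash_b(data):
            mismatches.append(data)
    return mismatches
```

With both libraries installed, and assuming each exposes a `hash()` function that accepts bytes, a call could look like `find_random_mismatches(ssdeep.hash, ppdeep.hash, 128, 10000)`.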

Find file mismatches

python compare_ssdeep_ppdeep.py --find-mismatch --start-path /
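A file scan of this kind could be sketched roughly as follows. This is an assumption about the general approach, not the script's actual implementation: `find_file_mismatches` is a hypothetical name, and the hash functions are again supplied by the caller.

```python
import os


def find_file_mismatches(hash_a, hash_b, start_path):
    """Return paths under start_path whose contents hash differently.

    hash_a and hash_b are callables taking bytes; this helper is a sketch
    and its name is illustrative only.
    """
    mismatches = []
    for root, _dirs, files in os.walk(start_path):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file: ignore it and move on
            if hash_a(data) != hash_b(data):
                mismatches.append(path)
    return mismatches
```

Reading each file fully into memory keeps the sketch short; a scan starting at / would probably want a size limit as well.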

Summary of Bugs Found in ppdeep.py

Bug 1: Missing 32-bit Unsigned Integer Overflow Masking

Location: Roll hash computation in _spamsum() function

Problem: In the C implementation of ssdeep, the roll hash components (h1, h2, h3) are uint32_t (32-bit unsigned integers). When subtraction produces a "negative" result, C's unsigned arithmetic wraps around (e.g., -1 becomes 0xFFFFFFFF). Python integers have arbitrary precision and never wrap, so an unmasked roll hash can take different values and trigger reset points at the wrong byte positions.

Fix: Add & 0xFFFFFFFF masking to all roll hash operations:

roll_h2 = (roll_h2 - roll_h1 + (ROLL_WINDOW * b)) & 0xFFFFFFFF
roll_h1 = (roll_h1 + b - roll_win[roll_n]) & 0xFFFFFFFF
roll_h3 = ((roll_h3 << 5) ^ b) & 0xFFFFFFFF
rh = (roll_h1 + roll_h2 + roll_h3) & 0xFFFFFFFF
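To illustrate the effect of the masking in isolation, here is a small self-contained sketch of a rolling hash computed with and without the 32-bit wrap. The update order mirrors the lines above, but `wrap_u32` and `rolling_hashes` are hypothetical helpers written for this example, not functions from ppdeep.

```python
ROLL_WINDOW = 7
MASK32 = 0xFFFFFFFF


def wrap_u32(x):
    """Emulate C uint32_t wrap-around for a Python int."""
    return x & MASK32


def rolling_hashes(data, masked=True):
    """Return the roll hash after each byte, optionally without masking."""
    fix = wrap_u32 if masked else (lambda x: x)
    h1 = h2 = h3 = 0
    window = [0] * ROLL_WINDOW
    n = 0
    out = []
    for b in data:
        h2 = fix(h2 - h1 + ROLL_WINDOW * b)
        h1 = fix(h1 + b - window[n])
        window[n] = b
        n = (n + 1) % ROLL_WINDOW
        h3 = fix((h3 << 5) ^ b)
        out.append(fix(h1 + h2 + h3))
    return out
```

The masked and unmasked values always agree modulo 2**32, but they stop being equal once the unmasked value outgrows 32 bits. That matters if the reset test takes the roll hash modulo a block size that is not a power of two (an assumption about the surrounding code): for example, 2**32 + 2 and 2 are the same uint32_t value, yet they differ modulo 3.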

Bug 2: Missing Final Character When rh == 0

Location: End-of-stream logic in _spamsum() function

Problem: ppdeep only appended the final hash character when rh != 0. The C code, however, has two paths (h there is the roll hash, rh in ppdeep):

  1. If h != 0: use the current hash value
  2. Else if a stored character exists: use the character saved at the last reset point

ppdeep ignored the second path entirely.

Fix: Track the last character stored at each reset point (last_char1, last_char2) and use them when rh == 0:

if rh != 0:
    hash_string1 += B64[block_hash1]
    hash_string2 += B64[block_hash2]
else:
    if last_char1:
        hash_string1 += last_char1
    if last_char2:
        hash_string2 += last_char2
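The same decision can be written as a small pure function, which makes the two paths easy to test on their own. This is only a sketch: `final_char` is a hypothetical helper, and reducing the block hash with `% 64` to index the base64 alphabet is an assumption about how the character is derived.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def final_char(rh, block_hash, last_char):
    """Pick the trailing digest character at end of stream.

    rh         -- roll hash at end of input
    block_hash -- current block hash (reduced mod 64 here; an assumption)
    last_char  -- character saved at the last reset point, or "" if none
    """
    if rh != 0:
        # Path 1: data remains after the last reset point.
        return B64[block_hash % 64]
    # Path 2: fall back to the stored character (may be empty).
    return last_char
```

For example, `final_char(0, 0, "x")` returns the stored character `"x"`, while `final_char(5, 0, "x")` ignores it and returns `"A"`.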

Bug 3: Duplicate Final Character

Location: Reset point logic in _spamsum() function

Problem: After storing a character at a reset point and appending it to the hash string, the stored character wasn't cleared. This caused it to be appended again at the end of processing.

Fix: Clear last_char1 and last_char2 after successfully appending them:

if len(hash_string1) < (SPAMSUM_LENGTH - 1):
    hash_string1 += last_char1
    last_char1 = str()  # Clear after appending
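A minimal sketch of the reset-point bookkeeping described here, under the assumption that the stored character should survive only when the digest is already full. `on_reset_point` is a hypothetical helper, not a function in ppdeep, and `SPAMSUM_LENGTH = 64` is assumed to match the library's constant.

```python
SPAMSUM_LENGTH = 64  # assumed to match the library's constant


def on_reset_point(hash_string, new_char):
    """Apply one reset point and return (hash_string, last_char).

    The new character is remembered first. If there is still room in the
    digest it is appended and the remembered copy is cleared, so it cannot
    be appended a second time at end of stream.
    """
    last_char = new_char
    if len(hash_string) < SPAMSUM_LENGTH - 1:
        hash_string += last_char
        last_char = ""  # clear after appending
    return hash_string, last_char
```

When the digest is full, the character is kept in `last_char` instead, which is the case the end-of-stream fallback relies on.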

Root Cause

These bugs stem from ppdeep being a pure-Python port that didn't fully replicate C's low-level behavior:

  • C's implicit integer overflow semantics
  • C's separate tracking of digest[dindex] (stored character) vs computed hash
  • C's halfdigest behavior that clears after dindex exceeds half length
