Fix handling of global discussions files from planet.osm.org #388
Merged
joto merged 3 commits into osmcode:master on Jul 3, 2025
… as osm.bz2 files of all changesets with discussions (e.g. discussions-250519.osm.bz2 from planet.osm.org)
…for comment management to prevent corrupted memory access when the comments grow large enough to trigger a buffer reallocation (e.g. changeset with id 62278154).
Member
When I process discussions-250519.osm.bz2 I do get an error message, but your changes don't fix it for me. What tool did you run and what error message did you get?
Contributor
Author
@joto The code (simplified from my project and tested):

#include <cstdio>
#include <cstring>
#include <functional>
#include <iostream>
#include <memory>
#include <string>

#include "osmium/handler.hpp"
#include "osmium/visitor.hpp"
#include "osmium/io/any_input.hpp"
#include "osmium/osm/entity_bits.hpp"
#include "osmium/util/progress_bar.hpp"

class FastChangesetHandler : public osmium::handler::Handler {
private:
    // volatile keeps the compiler from optimizing the bookkeeping away
    volatile long long counter = 0;
    volatile long long sum_ids = 0;
    volatile long long sum_changes = 0;
    volatile long long total_comments = 0;
    volatile long long total_comment_length = 0;
    std::function<void()> progress_callback;

public:
    void changeset(const osmium::Changeset& changeset) {
        counter++;

        // Do actual work to prevent optimization
        sum_ids += changeset.id();
        sum_changes += changeset.num_changes();

        // Process discussions to prevent optimization
        long long comment_count = 0;
        for (const auto& comment : changeset.discussion()) {
            // Prevent optimization by accessing comment properties
            total_comment_length += std::strlen(comment.user());
            total_comment_length += std::strlen(comment.text());
            comment_count++;

            // Safety check to prevent segfault as mentioned in original code
            if (comment_count >= changeset.num_comments()) {
                break;
            }
        }
        total_comments += comment_count;

        // Update progress bar
        if (progress_callback) {
            progress_callback();
        }
    }

    void attach_progress_callback(std::function<void()> callback) {
        progress_callback = std::move(callback);
    }

    void print_summary() {
        printf("\nFinal Summary:\n");
        printf("==============\n");
        printf("Processed changesets: %lld\n", counter);
        printf("Sum of changeset IDs: %lld\n", sum_ids);
        printf("Sum of changes: %lld\n", sum_changes);
        printf("Total discussion comments: %lld\n", total_comments);
        printf("Total comment text length: %lld characters\n", total_comment_length);
        if (counter > 0) {
            printf("Average comments per changeset: %.2f\n",
                   static_cast<double>(total_comments) / counter);
            printf("Average comment length: %.2f characters\n",
                   total_comments > 0 ? static_cast<double>(total_comment_length) / total_comments : 0.0);
        }
    }
};

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <osm-file>\n", argv[0]);
        return 1;
    }

    std::string filename = argv[1];

    try {
        // Disable stdio synchronization for faster operation
        std::ios_base::sync_with_stdio(false);

        // Open the OSM file
        osmium::io::File infile(filename);

        // Create reader - only read changesets
        osmium::io::Reader reader(infile, osmium::osm_entity_bits::changeset);

        // Setup progress bar
        osmium::ProgressBar progress_bar{reader.file_size(), true};

        // Create our fast handler
        auto handler = std::make_unique<FastChangesetHandler>();

        // Refresh the progress bar once every 100 changesets
        long long processed = 0;
        handler->attach_progress_callback([&processed, &progress_bar, &reader]() {
            if (++processed % 100 == 0) {
                progress_bar.update(reader.offset());
            }
        });

        printf("Processing changesets from: %s\n", filename.c_str());

        // Apply the handler to iterate through changesets
        osmium::apply(reader, *handler);

        progress_bar.done();
        reader.close();

        // Print final summary
        handler->print_summary();
    } catch (const std::exception& e) {
        fprintf(stderr, "Error: %s\n", e.what());
        return 1;
    }

    return 0;
}

With the current version of libosmium:
After patching bzip2_compression.hpp:
After patching osm_object_builder.hpp the file can be processed correctly.
joto requested changes on Jul 3, 2025
Member
I have now been able to verify this. The bzip2 fixes are fine, but I have some comments for the ChangesetDiscussionBuilder.
Contributor
Author
@joto Thanks for your comments! The code has been updated accordingly.
Member
Thanks @yarray for finding and fixing this.
This PR fixes critical issues when processing the global discussions files distributed by planet.osm.org, which contain changesets with comments. libosmium handles most of the changesets in these files, but two issues occur for some of them.
Problem
Global discussions files (changesets with comments) from planet.osm.org, such as discussions-250519.osm.bz2, were not processed successfully due to two issues: the bzip2 decompressor did not handle the concatenated streams in the file, and ChangesetDiscussionBuilder accessed invalid memory after a buffer reallocation.
The tested input file is https://planet.osm.org/planet/2025/discussions-250519.osm.bz2. The first issue occurred at about 21% of the processing. The second occurred at the 9th comment of changeset 62278154 on my computer, after resolving the first issue. Both osmium::apply and osmium::io::make_input_iterator_range show the same behavior.
Solution
1. Fix bzip2 multi-stream support
include/osmium/io/bzip2_compression.hpp: handle concatenated streams
2. Improve memory-safety of ChangesetDiscussionBuilder
include/osmium/builder/osm_object_builder.hpp: get_comment_ptr() method that safely recalculates pointers after buffer reallocations
Impact
This fix enables reliable processing of planet.osm.org global discussions files.
Testing
Tested with all current unit tests to ensure the fix does not break anything.
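For readers unfamiliar with the second issue, the idea behind recalculating pointers after a buffer reallocation can be sketched in plain C++. This is only an illustration under stated assumptions, not the code in osm_object_builder.hpp: GrowableBuffer, CommentBuilderSketch and the record layout (a 32-bit length followed by the text) are hypothetical stand-ins. Only the name get_comment_ptr() is borrowed from the PR. The builder stores an offset into the buffer instead of a raw pointer and turns it back into a pointer each time it is needed, so a reallocation triggered by a long comment cannot leave it pointing at freed memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical growable byte buffer standing in for the library's buffer.
class GrowableBuffer {
public:
    // Appends `size` zero bytes and returns the offset of the first one.
    // The underlying storage may move, so pointers obtained before this
    // call must not be used afterwards.
    std::size_t append(std::size_t size) {
        const std::size_t offset = m_data.size();
        m_data.resize(offset + size);
        return offset;
    }

    unsigned char* ptr(std::size_t offset) { return m_data.data() + offset; }
    std::size_t size() const { return m_data.size(); }

private:
    std::vector<unsigned char> m_data;
};

// Hypothetical builder: each comment is stored as a 32-bit text length
// followed by the text bytes.
class CommentBuilderSketch {
public:
    explicit CommentBuilderSketch(GrowableBuffer& buffer) : m_buffer(buffer) {}

    // Starts a new comment and remembers where its length header lives.
    void add_comment() {
        m_comment_offset = m_buffer.append(sizeof(std::uint32_t));
    }

    // Appends text to the current comment and updates its length header.
    void add_text(const std::string& text) {
        // This call may reallocate and move the whole buffer.
        const std::size_t text_offset = m_buffer.append(text.size());
        std::memcpy(m_buffer.ptr(text_offset), text.data(), text.size());

        // The header address is looked up only after the buffer has grown.
        std::uint32_t length = current_length();
        length += static_cast<std::uint32_t>(text.size());
        std::memcpy(get_comment_ptr(), &length, sizeof(length));
    }

    // Reads the length header of the current comment.
    std::uint32_t current_length() {
        std::uint32_t length = 0;
        std::memcpy(&length, get_comment_ptr(), sizeof(length));
        return length;
    }

private:
    // Recomputes the header address from the stored offset on every use,
    // so it stays valid even if the buffer has moved.
    unsigned char* get_comment_ptr() { return m_buffer.ptr(m_comment_offset); }

    GrowableBuffer& m_buffer;
    std::size_t m_comment_offset = 0;
};
```

A variant that cached the result of ptr() in add_comment() and wrote through it in add_text() would have undefined behavior as soon as append() moved the storage, which matches the kind of failure described above for comments large enough to trigger a reallocation.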