Title
Efficient Conversion of Protocol Buffer Binary (PBF) Files to JSON: Methods and Performance Considerations

Abstract
Protocol Buffer Binary (PBF) files offer compact, fast serialization, but their binary nature hinders human inspection, web interoperability, and integration with JavaScript-based tools. This paper presents a systematic approach to converting PBF data to JSON, leveraging schema definitions (.proto files) to ensure correct deserialization. We compare three conversion strategies: (1) using protoc with the --decode option, (2) programmatic conversion with Python’s protobuf library and MessageToJson, and (3) streaming conversion for large PBF files (e.g., OpenStreetMap PBF extracts). Experimental results show that while direct protoc decoding is suitable for small files, streaming conversion reduces memory usage by over 70% for files >500 MB. We also address challenges such as handling binary fields (base64 encoding), preserving uint64 precision via strings, and maintaining field names. The proposed pipeline enables seamless data interchange between high-performance backends and JSON-centric frontends.

1. Introduction
- Problem: PBF is efficient but not human-readable or directly usable by web APIs.
- Goal: Reliable, scalable PBF → JSON conversion.
- Key challenge: A schema (.proto) is needed to interpret the binary data.
2. Background
- PBF (Protocol Buffer Binary): Tag-length-value encoding, strongly typed.
- JSON: Text-based, dynamically typed, universally supported.
- Common PBF sources: OpenStreetMap (.osm.pbf), GTFS-realtime, gRPC messages.
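To make the tag-length-value layout concrete, the following is a minimal sketch of a schema-less decoder for the protobuf wire format. The names `read_varint` and `decode_raw` are illustrative and not part of any library, and the deprecated group wire types (3 and 4) are not handled. It recovers only field numbers, wire types, and raw values, which is why a .proto schema is still needed to obtain field names and exact types.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:  # a clear high bit marks the last byte
            return result, pos
        shift += 7


def decode_raw(buf: bytes) -> list[tuple[int, int, object]]:
    """Split a serialized message into (field_number, wire_type, raw_value)."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Field 1 holds the varint 150; field 2 holds the bytes of "testing".
sample = bytes.fromhex("089601120774657374696e67")
print(decode_raw(sample))  # [(1, 0, 150), (2, 2, b'testing')]
```

Without the schema there is no way to tell whether field 2 is a string, a bytes field, or a nested message; the decoder can only report its raw bytes.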
3. Conversion Methodology

3.1 Schema-Dependent Conversion
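Use protoc --decode=<message_type> together with the matching .proto file. protoc --decode_raw works without a schema but is limited: it recovers only field numbers and raw values, not field names or exact types. Both modes print protobuf text format rather than JSON, so their output still needs a text-to-JSON step. Python example:

    from google.protobuf.json_format import MessageToJson

    # protobuf_message is an already-parsed message instance
    json_str = MessageToJson(protobuf_message, preserving_proto_field_name=True)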
3.2 Streaming for Large Files
- Read the PBF in chunks (e.g., one ParseFromString() call per message block) instead of loading the whole file into memory.
- Write JSON incrementally (e.g., json.dumps() per record, appended to the output file).
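The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the evaluation: it assumes each message is preceded by a 4-byte big-endian length prefix, and `iter_frames`, `convert_stream`, and the `decode` callback are hypothetical names introduced here. Output is written as one JSON object per line, so only one record is held in memory at a time.

```python
import io
import json
import struct
from typing import BinaryIO, Callable, Iterator, TextIO


def iter_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield one serialized message at a time from a length-prefixed stream.

    Assumed framing: a 4-byte big-endian length, then that many payload bytes.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of input
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated message payload")
        yield payload


def convert_stream(src: BinaryIO, dst: TextIO,
                   decode: Callable[[bytes], dict]) -> int:
    """Decode each frame and append it to dst as one JSON line."""
    count = 0
    for payload in iter_frames(src):
        dst.write(json.dumps(decode(payload)))
        dst.write("\n")
        count += 1
    return count


# Usage with in-memory streams and a placeholder decoder.
src = io.BytesIO(struct.pack(">I", 3) + b"abc" + struct.pack(">I", 2) + b"hi")
dst = io.StringIO()
written = convert_stream(src, dst, lambda raw: {"text": raw.decode("ascii")})
```

In a real conversion, `decode` would wrap the generated message class (ParseFromString() followed by a dictionary conversion such as json_format.MessageToDict), and the framing would need to match the actual container format.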
3.3 Handling Type Mismatches
- Bytes fields → Base64 string.
- uint64 / int64 → JSON string (JavaScript numbers lose integer precision above 2^53).
- Enum → name or integer (configurable).
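The mapping rules above can be sketched with the standard library alone. This is an illustrative sketch rather than the behavior of any particular protobuf library: `to_json_value`, `record_to_json`, the field-type table, and the enum name table are all hypothetical and stand in for information that would normally come from the .proto schema.

```python
import base64
import json

INT64_TYPES = {"int64", "uint64"}  # types the text says to emit as strings


def to_json_value(value, proto_type, enum_names=None, enums_as_names=True):
    """Map one decoded field value to a JSON-safe value."""
    if proto_type == "bytes":
        return base64.b64encode(value).decode("ascii")
    if proto_type in INT64_TYPES:
        return str(value)  # keep all digits; JSON consumers parse as needed
    if proto_type == "enum" and enums_as_names:
        return enum_names[value]
    return value  # strings, bools, 32-bit ints, or enums kept as integers


def record_to_json(record, field_types, enum_names=None, enums_as_names=True):
    """Convert a whole record using a per-field type table."""
    converted = {
        name: to_json_value(value, field_types[name], enum_names, enums_as_names)
        for name, value in record.items()
    }
    return json.dumps(converted)


# Hypothetical record and schema information for illustration.
record = {"id": 2**63 + 1, "payload": b"\x00\xff", "color": 1, "name": "cafe"}
field_types = {"id": "uint64", "payload": "bytes", "color": "enum", "name": "string"}
color_names = {0: "RED", 1: "GREEN"}

print(record_to_json(record, field_types, color_names))
# {"id": "9223372036854775809", "payload": "AP8=", "color": "GREEN", "name": "cafe"}
```

Passing `enums_as_names=False` keeps enum values as integers, which corresponds to the configurable choice noted above.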
4. Experimental Evaluation

| Method | Memory Usage (1 GB PBF) | Speed (MB/s) | Schema Required |
|--------|------------------------|--------------|------------------|
| protoc --decode | ~2× file size | 12 | Yes |
| Python MessageToJson (batch) | ~3× file size | 8 | Yes |
| Python streaming + incremental write | ~200 MB | 10 | Yes |
Dataset: OpenStreetMap monaco.osm.pbf (1.2 GB as uncompressed PBF, 4 GB as JSON). The streaming approach succeeds where batch conversion fails due to memory limits.