Chapter 4: Encoding and Evolution
How data moves across process boundaries, why binary formats matter, and how schema evolution keeps systems compatible over time.
Applications use data in two different forms:
- In-memory representation: objects, structs, arrays, hash maps, trees.
- Byte-level representation: a self-contained sequence of bytes for disk/network.
Translating from the in-memory form to bytes is called encoding (or serialization); the reverse, performed on read, is decoding (or deserialization).
I. Why Encoding Exists
flowchart LR
appData[In-memory Objects] --> encode[Encoder]
encode --> wire[Bytes on Disk or Network]
wire --> decode[Decoder]
decode --> appData2[In-memory Objects]
Pointers are process-local, so data crossing process boundaries must be encoded into portable bytes.
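As a minimal sketch of the round trip in the diagram above, the snippet below uses JSON as the byte format. The `profile` record and the `encode`/`decode` helper names are illustrative assumptions, not part of any particular framework.

```python
import json

# Hypothetical in-memory record; the field names are illustrative only.
profile = {"userName": "Ada", "interests": ["chess", "math"]}

def encode(obj: dict) -> bytes:
    """Turn an in-memory object into a self-contained byte sequence."""
    return json.dumps(obj).encode("utf-8")

def decode(data: bytes) -> dict:
    """Rebuild an in-memory object from bytes read off disk or the network."""
    return json.loads(data.decode("utf-8"))

wire = encode(profile)      # portable bytes, no pointers involved
restored = decode(wire)     # a new object with equal contents

print(type(wire).__name__, restored == profile)
```

The restored object is equal to the original but is a separate object in memory, which is exactly what a receiving process would hold.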
II. Language-Specific Encodings and Problems
Language-coupled formats are convenient but risky:
- Interoperability pain across polyglot services.
- Security risks from arbitrary class instantiation during decode.
- Weak schema evolution support in many native serializers.
- Performance overhead (e.g., classic Java serialization bloat).
III. Text Formats: JSON, XML, CSV
Number ambiguity
- XML/CSV cannot reliably distinguish numbers from numeric strings without schema.
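- JSON distinguishes numbers from strings, but not integers from floating-point values, and it does not specify a precision. Integers above 2^53 cannot be represented exactly by parsers that store every number as an IEEE 754 double (as JavaScript does).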
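The precision issue can be reproduced in a few lines. This sketch uses Python's standard `json` module, and simulates a double-only parser with the `parse_int=float` hook; the chosen value `2**53 + 1` is just a convenient example.

```python
import json

big = 2**53 + 1  # 9007199254740993, not exactly representable as a double

# Python's json module keeps integers exact by default.
exact = json.loads(str(big))

# A parser that maps every JSON number to a double behaves like
# parse_int=float and silently rounds the value.
as_double = json.loads(str(big), parse_int=float)

print(exact == big)            # True
print(int(as_double) == big)   # False: rounded to 2**53

# The int-vs-float distinction depends only on how the number is written.
print(type(json.loads("1")).__name__, type(json.loads("1.0")).__name__)
```

This is one reason some APIs transmit large identifiers as strings in addition to, or instead of, JSON numbers.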
Binary data handling
JSON/XML are text-oriented; raw bytes are usually Base64-encoded (size overhead).
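A quick way to see the Base64 overhead is to encode some arbitrary bytes and compare lengths. The `blob` key below is a hypothetical field name chosen for the sketch.

```python
import base64
import json

raw = bytes(range(256)) * 3            # 768 arbitrary bytes
encoded = base64.b64encode(raw)        # 4 output characters per 3 input bytes
document = json.dumps({"blob": encoded.decode("ascii")})

print(len(raw), len(encoded))          # 768 1024

# The receiver reverses the process to get the original bytes back.
restored = base64.b64decode(json.loads(document)["blob"])
print(restored == raw)                 # True
```

The encoded form is about one third larger than the raw bytes, before counting the surrounding JSON syntax.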
CSV schema ambiguity
CSV has no built-in schema; producer/consumer changes need careful coordination.
IV. Why Binary Encodings Matter
At scale, format choice impacts:
- Network bandwidth
- Storage footprint
- CPU parsing cost
- Latency
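JSON is more compact than XML, but still verbose compared to binary formats: every record repeats its field names as text, and numbers are written as decimal digits.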
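To make the size difference concrete, the sketch below compares one small record in three encodings. The Thrift and Protobuf byte strings are hand-assembled copies of the payloads worked through later in this chapter, not output from the real libraries.

```python
import json

record = {"productId": 256, "name": "Cup", "inStock": True}

# Compact JSON: no spaces after separators.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Hand-assembled bytes for the same record (see the byte-level sections).
thrift_bytes = bytes.fromhex("08000100000100 0b000200000003437570 02000301 00")
protobuf_bytes = bytes.fromhex("088002 1203437570 1801")

for label, data in [("JSON", json_bytes),
                    ("Thrift", thrift_bytes),
                    ("Protobuf", protobuf_bytes)]:
    print(f"{label:<10}{len(data):>3} bytes")
```

Even for a three-field record, the binary encodings are a fraction of the JSON size, mostly because field names are replaced by small numeric tags.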
V. Thrift and Protocol Buffers
- Protocol Buffers (Google)
- Thrift (Facebook)
Both are schema-driven and use generated code for multiple languages.
JSON:
Thrift:
struct Person {
1: required string userName,
2: optional i64 favoriteNumber,
3: optional list<string> interests
}
Protobuf:
message Person {
required string user_name = 1;
optional int64 favorite_number = 2;
repeated string interests = 3;
}
VI. Thrift BinaryProtocol: Byte-Level View
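JSON payload:
{ "productId": 256, "name": "Cup", "inStock": true }
Schema:
struct Product {
1: i32 productId,
2: string name,
3: bool inStock
}
Encoded bytes:
08 00 01 00 00 01 00 0b 00 02 00 00 00 03 43 75 70 02 00 03 01 00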
Parsing pattern:
[Field Type] + [Field ID] + [Value] ... until STOP marker
flowchart LR
B1[08 00 01 00 00 01 00] --> F1[Field1 i32 productId=256]
B2[0b 00 02 00 00 00 03 43 75 70] --> F2[Field2 string name=Cup]
B3[02 00 03 01] --> F3[Field3 bool inStock=true]
B4[00] --> EndMarker[STOP]
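The parsing pattern above can be sketched as a small loop. This is a simplified decoder that only handles the three types in this payload (bool, i32, string) in a flat struct; it is not the Apache Thrift library, and the function name `decode_struct` is an assumption made for the example.

```python
import struct

# Thrift BinaryProtocol type codes used in this payload.
T_STOP, T_BOOL, T_I32, T_STRING = 0x00, 0x02, 0x08, 0x0B

def decode_struct(data: bytes) -> dict:
    """Return {field_id: value} for a flat struct of bool/i32/string fields."""
    fields, pos = {}, 0
    while True:
        ftype = data[pos]                      # 1-byte field type
        pos += 1
        if ftype == T_STOP:
            return fields
        (field_id,) = struct.unpack_from(">h", data, pos)  # big-endian i16
        pos += 2
        if ftype == T_I32:
            (value,) = struct.unpack_from(">i", data, pos)  # 4-byte integer
            pos += 4
        elif ftype == T_STRING:
            (length,) = struct.unpack_from(">i", data, pos)  # 4-byte length
            pos += 4
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        elif ftype == T_BOOL:
            value = data[pos] != 0             # 1-byte boolean
            pos += 1
        else:
            raise ValueError(f"unsupported type code {ftype:#04x}")
        fields[field_id] = value

payload = bytes.fromhex("08000100000100" "0b000200000003437570" "02000301" "00")
print(decode_struct(payload))   # {1: 256, 2: 'Cup', 3: True}
```

The decoder returns field ids rather than names because names never appear on the wire; mapping id 1 to productId is the job of the schema.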
VII. Schema Evolution and Compatibility
Field IDs (tags) are the wire contract.
- Field names can change.
- Tag numbers must remain stable.
- Type changes require caution.
Forward compatibility
Old readers skip unknown tags written by new writers.
Real-life example (mobile app rollout):
Your backend adds a new optional field user_tier = 4 (FREE, PRO) to a Protobuf UserProfile response. Users on older app versions (old reader) don't know tag 4, so they ignore it and still render name/email correctly.
Backward compatibility
New readers can read old data if newly added fields are optional/defaulted.
Real-life example (event streaming with mixed producers):
You deploy a new consumer that expects an optional discount_code field in OrderCreated events. Some services are still publishing the old event schema without that field. The new consumer reads those old events and safely treats discount_code as empty/default.
flowchart TD
NewWriter[New Writer with optional fields] --> OldReader[Old Reader]
OldReader --> IgnoreUnknown[Unknown tags ignored]
OldWriter[Old Writer] --> NewReader[New Reader]
NewReader --> FillDefaults[Missing fields defaulted]
Rules:
- Never recycle old tag numbers.
- New fields should be optional/defaulted.
- Remove only optional fields, and never reuse their tags.
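Both directions of compatibility can be illustrated with a toy reader. In this sketch the wire message is modeled as a plain dict from tag number to value rather than real Protobuf bytes, and the schemas, the `read_record` helper, and the "FREE" default are assumptions made for the example.

```python
def read_record(wire: dict, schema: dict) -> dict:
    """wire maps tag -> value; schema maps tag -> (field name, default)."""
    record = {}
    for tag, (name, default) in schema.items():
        record[name] = wire.get(tag, default)   # missing field -> default
    return record                               # tags not in schema are skipped

OLD_SCHEMA = {1: ("name", ""), 2: ("email", "")}
NEW_SCHEMA = {**OLD_SCHEMA, 4: ("user_tier", "FREE")}

new_message = {1: "Ada", 2: "ada@example.com", 4: "PRO"}   # from a new writer
old_message = {1: "Bob", 2: "bob@example.com"}             # from an old writer

# Forward compatibility: the old reader ignores unknown tag 4.
forward = read_record(new_message, OLD_SCHEMA)
# Backward compatibility: the new reader fills in the default for tag 4.
backward = read_record(old_message, NEW_SCHEMA)

print(forward)
print(backward)
```

Because the reader looks fields up by tag, renaming a field in the schema changes nothing on the wire, while reusing a tag for a different meaning would silently misread old data.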
VIII. Data Type Changes and Risk
Example: int32 -> int64
- New readers can generally read old int32.
- Old readers may truncate large int64 values.
So type changes must be rolled out carefully.
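The truncation risk can be simulated by keeping only the low 32 bits of a value, which is what happens when a wider integer is stored into an int32 variable. The helper name `as_int32` is hypothetical.

```python
import struct

def as_int32(value: int) -> int:
    """Keep only the low 32 bits, as a reader storing into an int32 would."""
    low_four_bytes = struct.pack("<q", value)[:4]   # little-endian int64, low half
    return struct.unpack("<i", low_four_bytes)[0]

small = 123_456
large = 2**31 + 5          # fits in int64, but not in int32

print(as_int32(small))     # 123456 (unchanged)
print(as_int32(large))     # -2147483643 (silently corrupted)
```

Small values survive the round trip, so the bug only appears once real data outgrows the old range, which is why such changes need a staged rollout.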
IX. Repeated Fields in Protocol Buffers
The same tag can appear multiple times and is decoded as a list.
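A small sketch shows how a decoder can collect repeated occurrences of one tag into a list. It is deliberately limited: it assumes single-byte tags and lengths (values under 128) and only handles length-delimited string fields, and `decode_strings` is a name invented for this example.

```python
def decode_strings(data: bytes) -> dict:
    """Collect length-delimited string fields, grouping repeats by field number."""
    fields, pos = {}, 0
    while pos < len(data):
        tag = data[pos]
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError("only length-delimited fields are handled here")
        length = data[pos + 1]
        value = data[pos + 2 : pos + 2 + length].decode("utf-8")
        # Each occurrence of the same field number appends to its list.
        fields.setdefault(field_number, []).append(value)
        pos += 2 + length
    return fields

# Field 3 (interests) written twice: tag 0x1a = (3 << 3) | 2.
payload = b"\x1a\x03art" + b"\x1a\x05chess"
print(decode_strings(payload))   # {3: ['art', 'chess']}
```

Since a single value and a list of values look the same on the wire apart from the number of occurrences, no list-length prefix or separate array type is needed.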
Protocol Buffers: Byte-Level Mini Lab
Protobuf on the wire looks different from Thrift. Here is the mental model.
How Protobuf differs from Thrift (one sentence)
- Thrift BinaryProtocol sends: [type byte] + [field id] + [value] (field names never appear; the type is one byte and the field id is a separate two-byte integer).
- Protobuf sends: [tag] + [value], where the tag (a single byte for field numbers 1 to 15) already encodes both which field and how to read the value.
So in Protobuf you do not see a separate “string type” byte like Thrift’s 0b. You infer meaning from the wire type embedded in the tag.
What is a “tag” byte?
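Each field in the binary stream starts with a tag built from your .proto schema. The tag is itself a varint (explained below); for field numbers 1 to 15 it fits in a single byte, which is the case throughout this example: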
| Piece | Meaning |
|---|---|
| field_number | The number you wrote in .proto (= 1, = 2, = 3) — same idea as Thrift field id |
| wire_type | Tells the parser how many bytes to read next and how to interpret them |
Example: tag 08 (hex) = decimal 8 = binary 0000 1000
- Lower 3 bits (000) → wire type 0 (varint)
- Upper bits (00001) → field 1
So 08 means: “Field 1 is next, and its value is encoded as a varint.”
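The bit arithmetic can be checked with two tiny helpers. The names `split_tag` and `make_tag` are made up for this sketch, and it only covers tags that fit in one byte.

```python
def split_tag(tag: int) -> tuple:
    """Return (field_number, wire_type) for a single-byte tag."""
    return tag >> 3, tag & 0x07

def make_tag(field_number: int, wire_type: int) -> int:
    """Combine a field number and wire type into a tag value."""
    return (field_number << 3) | wire_type

print(split_tag(0x08))        # (1, 0): field 1, varint
print(split_tag(0x12))        # (2, 2): field 2, length-delimited
print(hex(make_tag(3, 0)))    # 0x18
```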
What is a “wire type”?
Wire type is not “int vs string” in the Thrift sense. It is the encoding shape on the wire:
| Wire type | Name | How the parser reads the value |
|---|---|---|
| 0 | Varint | Read 1+ bytes until a varint ends; decode as integer/bool/enum |
| 1 | 64-bit | Read exactly 8 bytes (fixed64) |
| 2 | Length-delimited | Read a varint length, then read that many raw bytes (strings, bytes, embedded messages) |
| 5 | 32-bit | Read exactly 4 bytes (fixed32) |
Your schema (.proto) still says int32, string, bool — the wire type only tells the parser the byte pattern to consume. The generated code then maps that to the correct Go/Java/C++ type.
What is a varint?
A varint (variable-length integer) uses 1–10 bytes for an integer:
- Each byte uses 7 bits for data.
- The high bit (0x80) means “more bytes follow.”
- Small numbers use fewer bytes (efficient on the wire).
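Why 256 becomes 80 02 (not 00 00 01 00 like Thrift):
- 256 needs more than 7 bits, so it is split into 7-bit groups, least significant group first.
- First byte: the low 7 bits are 0 and the continuation bit is set: 0x80 | 0 → 80.
- Second byte: the remaining bits are 256 >> 7 = 2, with no continuation bit → 02.
- Decoder: (0x80 & 0x7F) | (0x02 << 7) = 0 + 256 = 256.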
Booleans in Protobuf always use wire type 0 with value 01 (true) or 00 (false). That is still a varint, not a dedicated bool type byte followed by a value byte like Thrift’s 02 + 01.
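The varint rules above translate into a short encoder and decoder. This sketch covers non-negative integers only (negative int32/int64 values and zigzag-encoded sint types are out of scope), and the function names are assumptions for the example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte: high bit clear
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: this was the last byte
            return result, pos
        shift += 7

print(encode_varint(256).hex(" "))              # 80 02
print(decode_varint(bytes.fromhex("8002")))     # (256, 2)
print(encode_varint(1).hex())                   # 01
```

Returning the next position alongside the value lets a caller keep reading the following field from the same buffer.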
How strings work (wire type 2)
For string name = 2:
- Tag byte: field 2, wire type 2 → hex 12 (because (2 << 3) | 2 = 18 = 0x12).
- Length as varint: 03 = “next 3 bytes are payload.”
- Payload: raw UTF-8 bytes 43 75 70 = "Cup".
No separate “string type” byte — only tag + length + bytes.
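The tag + length + bytes layout can be reproduced with a few lines. This sketch assumes the tag and the length each fit in one byte (field number at most 15, payload at most 127 bytes); `encode_string_field` is a hypothetical helper, not a Protobuf library call.

```python
def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as tag + length + UTF-8 payload."""
    payload = text.encode("utf-8")
    if field_number > 15 or len(payload) > 127:
        raise ValueError("sketch handles single-byte tag and length only")
    tag = (field_number << 3) | 2            # wire type 2: length-delimited
    return bytes([tag, len(payload)]) + payload

encoded = encode_string_field(2, "Cup")
print(encoded.hex(" "))   # 12 03 43 75 70
```

Note that the length counts UTF-8 bytes, not characters, so non-ASCII text produces a length larger than the character count.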
Our example schema → wire layout
message Product {
int32 productId = 1; // varint -> tag 08, then varint value
string name = 2; // string -> tag 12, then length + bytes
bool inStock = 3; // bool -> tag 18, then 01 or 00
}
flowchart LR
subgraph stream [Protobuf byte stream]
T1[08 tag field1 varint]
V1[80 02 value 256]
T2[12 tag field2 length-delimited]
L2[03 length]
S2[43 75 70 Cup]
T3[18 tag field3 varint]
V3[01 true]
end
T1 --> V1 --> T2 --> L2 --> S2 --> T3 --> V3
Full payload for { productId: 256, name: "Cup", inStock: true }:
| Bytes | Meaning |
|---|---|
| 08 | Field 1, wire type 0 (varint) → productId |
| 80 02 | Varint value 256 |
| 12 | Field 2, wire type 2 (length-delimited) → name |
| 03 | Length 3 |
| 43 75 70 | UTF-8 "Cup" |
| 18 | Field 3, wire type 0 (varint) → inStock |
| 01 | true |
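The table can be turned into a small decoder for this specific message. This is a sketch, not the official Protobuf runtime: it supports only wire types 0 and 2, and the `SCHEMA` dict stands in for what generated code would know about the Product message.

```python
def read_varint(data: bytes, pos: int) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

# field number -> (name, Python type), mirroring the Product schema above.
SCHEMA = {1: ("productId", int), 2: ("name", str), 3: ("inStock", bool)}

def decode_product(data: bytes) -> dict:
    message, pos = {}, 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                       # varint
            raw, pos = read_varint(data, pos)
        elif wire_type == 2:                     # length-delimited
            length, pos = read_varint(data, pos)
            raw = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number not in SCHEMA:
            continue                             # unknown field: skip it
        name, kind = SCHEMA[field_number]
        message[name] = raw.decode("utf-8") if kind is str else kind(raw)
    return message

payload = bytes.fromhex("08 80 02 12 03 43 75 70 18 01")
print(decode_product(payload))
# {'productId': 256, 'name': 'Cup', 'inStock': True}
```

The wire type alone is enough to skip a field the reader does not recognize, which is what makes the forward-compatibility behavior described earlier possible.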
X. REST vs SOAP (Quick Contrast)
- REST: architectural style over HTTP.
- SOAP: strict XML-based protocol.
Specs:
- REST commonly uses OpenAPI/Swagger.
- SOAP commonly uses WSDL.
Last Updated: May 28, 2026
End Note: Encoding is a long-term contract between independently evolving systems. Teams that treat schema evolution as a first-class concern avoid brittle migrations, silent corruption, and expensive rewrites.