Zero-Copy Strings in Apache DataFusion: How StringViewArray Boosted Performance by 8%
How StringViewArray cut the copy tax on string operations and lifted ClickBench performance by 8%.
What if you could perform complex string operations like substr, trim, and split_part without copying a single byte of string data?
In Apache DataFusion, the introduction of StringViewArray (and its sibling BinaryViewArray) did exactly that. By moving away from the traditional offset-based string storage to a “view-based” architecture, DataFusion achieved an 8% performance improvement across the entire ClickBench suite just by enabling it for Parquet reads.
The Problem: The “Copy Tax” of Traditional Strings
Traditionally, Arrow strings (StringArray) use two buffers:
- An offsets buffer (e.g.,
[0, 5, 11]) - A values buffer (e.g.,
"HelloWorld")
This design is simple but has a hidden “copy tax.” If you want to take a substring or trim a string, you almost always have to allocate a new values buffer and copy the bytes into it. In a high-performance query engine processing billions of rows, these copies add up to massive CPU and memory pressure.
The Setup: StringViewArray
StringViewArray implements the “German Style” string optimization. Instead of offsets, each string is represented by a 16-byte “view”:
- Length (4 bytes): The length of the string.
- Prefix (4 bytes): The first 4 bytes of the string (inlined!).
- Buffer Index (4 bytes): Which data buffer contains the string.
- Offset (4 bytes): Where the string starts in that buffer.
For strings $\le$ 12 bytes, the entire string is inlined directly into the 16-byte view, requiring zero additional memory lookups.
The Key Result: ClickBench & Beyond
When DataFusion enabled StringViewArray by default for Parquet, the results were immediate:
| Metric | Traditional StringArray | StringViewArray | Improvement |
|---|---|---|---|
| ClickBench Suite | Baseline | -8% Runtime | 8% Faster |
starts_with (Scalar) | Baseline | ~1.5x Faster | Inlined Prefix |
substr / split_part | Heavy Copies | Zero-Copy | O(1) vs O(N) |
Why It Happens: The Magic of Zero-Copy
The real power of StringViewArray is that it allows “virtual” transformations.
1. Zero-Copy Substrings
When you call substr(col, 1, 5), DataFusion doesn’t copy the bytes. It simply creates a new view pointing to the same underlying buffer but with a different offset and length.
// Conceptual zero-copy view creation
let sub_view = make_view(
substr.as_bytes(),
original_view.buffer_index,
original_view.offset + start_offset
);
2. Fast Path Comparison
Because the first 4 bytes are inlined in the view, DataFusion can often determine that two strings are not equal without ever touching the actual string data in memory. This is a massive win for GROUP BY and JOIN operations.
3. Efficient Aggregation
Specialized structures like ArrowBytesViewMap allow COUNT DISTINCT and GROUP BY to intern unique strings without copying them into the hash map.
pub struct ArrowBytesViewMap<V> {
/// Views for all stored values (zero-copy!)
views: Vec<u128>,
/// Completed buffers containing string data
completed: Vec<Buffer>,
// ...
}
Evidence: Real-World Performance
In the DataFusion codebase, we see this optimization applied across the board:
starts_with: Uses the inlined 4-byte prefix to fast-reject non-matches.trim: Adjusts the offset and length in the view.split_part: Returns views pointing into the original data buffers.
Practical Takeaways
- Inlining is King: For short strings ($\le$ 12 bytes),
StringViewArrayeliminates memory fragmentation and cache misses. - Avoid the Copy Tax: By using views, complex string manipulations become metadata-only operations.
- Metadata-First Processing: Inlining prefixes allows the CPU to skip expensive memory accesses during comparisons.
Run It Yourself
You can see these optimizations in action in the Apache DataFusion repository. Check out the string benchmarks:
# Run the starts_with benchmark
cargo bench --bench starts_with
Conclusion
The shift to StringViewArray represents a fundamental shift in how DataFusion handles variable-length data. By treating strings as views rather than owned buffers, DataFusion has eliminated one of the most persistent bottlenecks in analytical processing.
Are you using Arrow’s View types in your Rust projects? The performance gains are hard to ignore.