edgarluque.com/content/blog/zstd-streaming-in-rust.md
2022-01-06 10:55:51 +01:00

+++
title = "Parsing compressed files efficiently with Rust"
description = "Sometimes you need a bit of a stream."
date = 2022-01-06

[taxonomies]
categories = ["rust"]
+++

I recently wanted to build a tool that plots the number of concurrent players per day on the open-source game DDraceNetwork (DDNet for short).

DDNet hosts an HTTP "master server", which the game client queries for information about the game servers it can join. Thankfully, they keep the master server status of previous days available online.

Each .tar.zstd file contains one JSON file for every 5 seconds of the day, from 00:00 to 23:59, with information about all servers and the players on them at that moment.

The problem

Compressed, these files take only about 8 MB each, but they compress extremely well: decompressed, they take about 7 GB.

So if we don't want to use a lot of disk space or memory, we need to parse the data in a streaming way.
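To make "streaming" concrete, here is a minimal sketch that uses only the standard library. The function name `count_newlines` and the 8 KiB buffer size are illustrative choices, not part of the tool. It scans any `Read` source through a fixed-size buffer, so memory use stays constant no matter how large the input is.

```rust
use std::io::{self, Read};

/// Counts newline bytes in `reader` while holding only one 8 KiB chunk in memory.
/// (Hypothetical helper for illustration.)
fn count_newlines<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut chunk = [0u8; 8 * 1024];
    let mut count = 0u64;
    loop {
        // `read` reports how many bytes it wrote into `chunk`; 0 means end of input.
        let n = match reader.read(&mut chunk) {
            Ok(0) => return Ok(count),
            Ok(n) => n,
            // An interrupted read is not a failure, so just try again.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        count += chunk[..n].iter().filter(|&&b| b == b'\n').count() as u64;
    }
}
```

The same loop works whether the reader is a file, a network response, or a decompressor, which is the property the rest of this post relies on.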

The libraries

We will use the following libraries to achieve this:

  • tar: To read the entries of the tar archive.
  • zstd: To decompress the files.
  • ureq: To get the archives.

Fetching the data

use std::io::Read; // brings `take` and `read_to_end` into scope

let resp = ureq::get("https://ddnet.tw/stats/master/2022-01-04.tar.zstd").call()?;

// Read the content length from the header, falling back to 0 if it is missing or malformed.
let len: usize = resp
    .header("Content-Length")
    .and_then(|value| value.parse().ok())
    .unwrap_or(0);

// Pre-allocate the vector, never reserving more than the read limit below.
let mut bytes_compressed: Vec<u8> = Vec::with_capacity(len.min(15 * 1024 * 1024));

// Read the whole body, up to a maximum of 15 MiB.
resp.into_reader()
    .take(15 * 1024 * 1024)
    .read_to_end(&mut bytes_compressed)?;
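One caveat with `take`: it stops silently at the limit, so an oversized download would be truncated rather than rejected. A possible way to detect that, sketched here with only the standard library (the helper name `read_capped` is hypothetical), is to allow one byte more than the cap and treat reaching it as an error.

```rust
use std::io::{self, Read};

/// Reads `reader` to the end, failing if it yields more than `max_len` bytes.
/// (Hypothetical helper; not part of ureq or the original tool.)
fn read_capped<R: Read>(reader: R, max_len: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so "exactly at the cap" and "over the cap" differ.
    reader.take(max_len.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "response body exceeds the size limit",
        ));
    }
    Ok(buf)
}
```

With the download above, this could stand in for the `take` and `read_to_end` pair, for example `read_capped(resp.into_reader(), 15 * 1024 * 1024)?`.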

Processing the data

In Rust, I/O operations are modeled around two traits, Read and Write. Thanks to this, it's really ergonomic to use both libraries (tar and zstd) together.
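As a rough illustration of why this composes so well, the sketch below defines a small adapter that wraps any `Read` and is itself a `Read`. `CountingReader` is a made-up type for this example. Conceptually, the zstd decoder and the tar archive follow the same wrapping pattern, just with real work in the middle.

```rust
use std::io::{self, Read};

/// Wraps any reader and counts the bytes that pass through it.
/// (Made-up adapter for illustration only.)
struct CountingReader<R> {
    inner: R,
    bytes_read: u64,
}

impl<R: Read> CountingReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, bytes_read: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Delegate to the wrapped reader, then record how much it produced.
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}
```

Because `CountingReader<R>` is a `Read`, it can be handed to anything that accepts a reader, such as a `BufReader`, without that code knowing what sits underneath.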

First, since Vec<u8> doesn't implement Read, we need to wrap it in a Cursor, which does.

use std::io::Cursor;
let buffer = Cursor::new(bytes_compressed);
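To show what the cursor adds on top of the vector, here is a small std-only sketch (the helper `split_head` is hypothetical). The cursor owns the buffer and remembers the read position, so consecutive reads continue where the previous one stopped.

```rust
use std::io::{self, Cursor, Read};

/// Splits an owned buffer into its first `n` bytes and the rest, reading
/// through the `Read` trait. (Hypothetical helper for illustration.)
fn split_head(data: Vec<u8>, n: usize) -> io::Result<(Vec<u8>, Vec<u8>)> {
    // The cursor owns the Vec and tracks the current read position.
    let mut cursor = Cursor::new(data);

    // Fails with `UnexpectedEof` if `data` is shorter than `n` bytes.
    let mut head = vec![0u8; n];
    cursor.read_exact(&mut head)?;

    // The second read continues where the first one stopped.
    let mut rest = Vec::new();
    cursor.read_to_end(&mut rest)?;
    Ok((head, rest))
}
```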

Good, now we can pass this buffer to the zstd Decoder, which accepts anything that implements Read. It also wraps the reader in a BufReader for buffered reading.

// The type of decoder is Decoder<BufReader<Cursor<Vec<u8>>>>
let decoder = zstd::stream::Decoder::new(buffer)?;

Now we need to pass this decoder to tar to get its entries:

let mut archive = tar::Archive::new(decoder);

// Loop over the entries; each one is read from the decoder as we reach it.
for entry in archive.entries()? {
    let entry = entry?;
    let path = entry.path()?;
    let filename = path.file_name().expect("entry path has a file name");
    // process each entry
}
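To see why a tar archive can be walked over a plain `Read` stream, it helps to look at the format: each entry is a 512-byte header followed by the file data, padded to a multiple of 512 bytes. The sketch below is not how the tar crate is implemented. It is a simplified, std-only illustration (the names `list_tar_entries` and `field_str` are made up) that ignores checksums, entry types, and long-name extensions.

```rust
use std::io::{self, Read};

const BLOCK: usize = 512;

/// Lists `(name, size)` for each entry of an uncompressed tar stream.
/// Simplified sketch: ignores checksums, long-name extensions, and entry types.
fn list_tar_entries<R: Read>(mut reader: R) -> io::Result<Vec<(String, u64)>> {
    let mut entries = Vec::new();
    let mut header = [0u8; BLOCK];
    loop {
        // End of stream (or a truncated header) or an all-zero block ends the listing.
        match reader.read_exact(&mut header) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        if header.iter().all(|&b| b == 0) {
            break;
        }
        // Bytes 0..100 hold the NUL-padded file name.
        let name = field_str(&header[0..100]);
        // Bytes 124..136 hold the size as ASCII octal digits.
        let size = u64::from_str_radix(field_str(&header[124..136]).trim(), 8)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        entries.push((name, size));
        // Skip the data, which is padded up to a multiple of 512 bytes.
        let padded = (size + BLOCK as u64 - 1) / BLOCK as u64 * BLOCK as u64;
        io::copy(&mut reader.by_ref().take(padded), &mut io::sink())?;
    }
    Ok(entries)
}

/// Decodes a NUL-terminated header field as text.
fn field_str(field: &[u8]) -> String {
    let end = field.iter().position(|&b| b == 0).unwrap_or(field.len());
    String::from_utf8_lossy(&field[..end]).into_owned()
}
```

Nothing here needs to seek backwards or know the total length up front, which is why the archive can sit directly on top of a decompressor.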

Each entry implements Read too. In our case each entry is a JSON file, so we could parse it directly from the reader, for example using serde and simd_json:

let data: ServerList = simd_json::from_reader(entry).expect("parse json");

This way, we parse each file efficiently while using very little memory, thanks to the streaming nature of these operations: the roughly 7 GB of decompressed data never has to exist in memory or on disk all at once.

This all fits really well thanks to the design of Read and Write.

The tool

Here is the source code of the tool: https://github.com/edg-l/teemasterparser

And an image of the result: