+++
title = "Parsing compressed files efficiently with Rust"
description = "Sometimes you need a bit of a stream."
date = 2022-01-06
[taxonomies]
categories = ["rust"]
+++
I recently wanted to build a tool that plots the number of concurrent players each day on the open-source game [DDraceNetwork](https://ddnet.tw/) (DDNet for short).

DDNet hosts an HTTP "master server", which the game client uses to fetch information about the game servers it can join.
Thankfully, they keep the master server status of [previous days](https://ddnet.tw/stats/master/) online.

Each `.tar.zstd` file contains one JSON file for every 5 seconds from 00:00 to 23:59, with information about all servers and the players on them at that moment.

## The problem
These files take only about 8 MB each while compressed, but they compress very well: once decompressed, they take about 7 GB.

So if we don't want to use a lot of disk space or memory, we need to parse the data in a streaming fashion.

## The libraries
We will use the following libraries to achieve this:
- [tar](https://lib.rs/crates/tar): To read the entries of the tar archive.
- [zstd](https://lib.rs/crates/zstd): To decompress the files.
- [ureq](https://lib.rs/crates/ureq): To get the archives.
## Fetching the data
With ureq we can fetch the data easily:
```rust
// Request the archive for a single day.
let resp = ureq::get("https://ddnet.tw/stats/master/2022-01-04.tar.zstd").call()?;
```
## Processing the data
In Rust, I/O operations are modeled around two traits: [Read](https://doc.rust-lang.org/std/io/trait.Read.html) and [Write](https://doc.rust-lang.org/std/io/trait.Write.html).
Thanks to this, it is really ergonomic to use both libraries (tar and zstd) together.
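
To illustrate why the `Read` trait makes this kind of composition ergonomic, here is a minimal sketch that uses only the standard library. `CountingReader` and `read_counted` are hypothetical names made up for this example (they are not part of tar, zstd, or ureq); the point is only that an adapter can wrap any reader and be a reader itself, which is the pattern a decoder or an archive reader relies on.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical adapter: wraps any reader and counts the bytes passing through.
struct CountingReader<R> {
    inner: R,
    bytes_read: u64,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

/// Reads everything from `source` through two stacked adapters and returns
/// the text together with the number of bytes that went through.
fn read_counted<R: Read>(source: R) -> io::Result<(String, u64)> {
    // Each layer only knows that the layer below implements `Read`.
    let mut reader = CountingReader {
        inner: BufReader::new(source),
        bytes_read: 0,
    };
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok((text, reader.bytes_read))
}

fn main() -> io::Result<()> {
    // A byte slice implements `Read`, so it can stand in for a network response.
    let (text, count) = read_counted(&b"hello stream"[..])?;
    println!("read {} bytes: {}", count, text);
    Ok(())
}
```

Because `read_counted` is generic over `R: Read`, the same function would accept a file, a socket, or another adapter without any changes.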

Now we convert the response into a reader and pass it to the zstd [Decoder](https://docs.rs/zstd/0.9.0+zstd.1.5.0/zstd/stream/read/struct.Decoder.html), which accepts anything that implements [Read](https://doc.rust-lang.org/std/io/trait.Read.html).
The decoder also wraps the reader in a [BufReader](https://doc.rust-lang.org/nightly/std/io/struct.BufReader.html) for buffered reading.
```rust
let decoder = zstd::stream::Decoder::new(resp.into_reader())?;
```
Now we need to pass this `decoder` to tar to get its entries:
```rust
let mut archive = tar::Archive::new(decoder);
// Loop over the entries
for entry in archive.entries()? {
    // Each item is an io::Result, because reading the next entry can fail.
    let entry = entry.unwrap();
    let path = entry.path().unwrap();
    let filename = path.file_name().expect("entry path has a file name");
    // process each entry
}
```
Here `entry` implements Read too. In our case each entry is a JSON file, so we could parse it directly, for example using `serde` and `simd_json`:
```rust
let data: ServerList = simd_json::from_reader(entry).expect("parse json");
```
This way, we parse each file efficiently while keeping memory usage low, because the archive is decompressed and read as a stream instead of being unpacked to disk or loaded into memory as a whole.

This all fits really well thanks to the design of [Read](https://doc.rust-lang.org/std/io/trait.Read.html) and [Write](https://doc.rust-lang.org/std/io/trait.Write.html).
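
The bounded-memory behaviour of a streaming pipeline can be sketched with the standard library alone. In this example, `CountingSink` and `stream_len` are hypothetical names: the sink stands in for a consumer that processes data as it arrives, and `io::repeat(..).take(..)` stands in for a large decompressed stream. Neither is taken from the tool itself.

```rust
use std::io::{self, Read, Write};

/// Hypothetical sink: consumes bytes as they arrive and keeps only a running total.
struct CountingSink {
    total: u64,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.total += buf.len() as u64;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Streams `len` generated bytes into the sink and returns how many arrived.
fn stream_len(len: u64) -> io::Result<u64> {
    // `repeat` produces bytes lazily, so the source never exists in memory as a whole.
    let mut source = io::repeat(b'x').take(len);
    let mut sink = CountingSink { total: 0 };
    // `io::copy` moves the data through a small intermediate buffer, chunk by chunk.
    io::copy(&mut source, &mut sink)?;
    Ok(sink.total)
}

fn main() -> io::Result<()> {
    // 64 MiB flows through without ever being allocated at once.
    let total = stream_len(64 * 1024 * 1024)?;
    println!("streamed {} bytes", total);
    Ok(())
}
```

The amount of data processed grows with `len`, but the memory in use stays roughly constant, which is the property that makes a 7 GB archive manageable.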
## The tool
Here is the source code of the tool: <https://github.com/edg-l/teemasterparser>

And an image of the result:

<img src="https://github.com/edg-l/teemasterparser/raw/master/example.svg" width="100%">

[Discussion on reddit.](https://www.reddit.com/r/rust/comments/rxav4e/parsing_compressed_files_efficiently_with_rust/)