I recently attended SIGMOD and chatted with people in the data management and graph processing communities.
These conversations and other interactions led me to come up with a few research ideas.
I lack the time to actively pursue them, but I list them below for future reference.
If you are working on something similar or would like to collaborate, please drop me a line!
Single-node LDBC Datagen: I have long thought about developing a single-node variant of the LDBC Datagen....
Cards and apps in the Netherlands
I moved to the Netherlands during the summer of 2020 — about five years ago. Over time, I’ve learned about a bunch of useful cards and apps that make everyday life easier. Here’s a brief collection of them.
Cards
Albert Heijn Bonuskaart: When shopping at the Albert Heijn supermarket, you need to scan the bonus card; otherwise, the discounts are not applied. A few years ago, it was quite easy to get an anonymous card (just ask for one and don’t register it; the card still works as intended)....
Graph news – May 2025
A lot has happened in the graph space so far this year. Here’s a quick summary with a few comments.
On the DB-Engines ranking, graph databases have continued their rebound and have been on a growth trajectory for the last 5 months. In 2021, Gartner predicted that “by 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision making across the organization”....
Regular expressions for catching typical writing errors
Some regular expressions for catching typical writing errors – spelling issues (US spelling), repeated words, etc.:
ag --md " a [aeioAEIO]"
ag --md " an [bcdfghjklmnpqrstvxzwyBCDGHJKPQRTVWXYZ]"
ag --md -i "\s\b(\w+) +\1\b"
ag --md -i "\s\b(\w+ +\w+) +\1\b"
ag --md '\w+isation\b'
ag --md 'e\.g\. '
ag --md 'i\.e\. '
...
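The single repeated-word pattern from the commands above can also be tried out in a script. The following is a minimal sketch in Python, assuming that Python's `re` module treats this pattern the same way `ag` does; the function name `find_repeated_words` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as the single repeated-word `ag` command: a whitespace
# character, a word, one or more spaces, and then the same word again.
# re.IGNORECASE mirrors the -i flag.
REPEATED_WORD = re.compile(r"\s\b(\w+) +\1\b", re.IGNORECASE)


def find_repeated_words(text):
    """Return each word that is immediately followed by itself in `text`."""
    return [match.group(1) for match in REPEATED_WORD.finditer(text)]


# Hypothetical sample text: "is is" and "the the" are repeats,
# while "the theory" is not, thanks to the trailing \b.
sample = "This is is a test of the the pattern, but the theory holds."
print(find_repeated_words(sample))
```

Because the pattern starts with `\s`, a repeated word at the very beginning of the text is not reported; this matches the behaviour of the original expression.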
Databases conferences in 2025
In 2025, both top-tier database conferences will be in Europe: SIGMOD in Berlin (June 22–27) and VLDB in London (September 1–5). There are quite a few papers and satellite events I am looking forward to – I listed them below.
SIGMOD 2025
The papers presented at SIGMOD are listed on the website: research track, industry track.
Update: the detailed programme is out!
Keynotes:
How to Build a Brain by Christos H....
Generating TPCx-BB data sets
TPCx-BB (née BigBench) is a Big Data benchmark. To generate the TPCx-BB data sets, download the TPCX-BB_Tools_vX.Y.Z.zip package from the TPC website.
To run the generator, you’ll need a Java 8-compatible Java Virtual Machine. To obtain this, you can, e.g., install the Zulu JVM via SDKMAN!. Then, you can run the generator as follows:
java -cp pdgf.jar pdgf.Controller
To list the available commands, run:
help
To start the data generation, run:...
Cloudflare R2 command line snippet
I am a big fan of Cloudflare R2, an object storage service that provides egress-free downloads.
R2 is compatible with the AWS S3 API, so you can use the AWS CLI tool – with a few caveats. These include:
You need to add the --endpoint-url https://<account_id>.r2.cloudflarestorage.com argument for every call.
When copying to R2, you need to pass the --checksum-algorithm CRC32 argument.
I often store multiple AWS configurations, which requires passing an additional argument: --profile <your_r2_profile_name>....
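To avoid retyping these arguments, they can be assembled in a small helper. The following Python sketch only builds the argument list for an `aws s3 cp` call and does not run it. The function name `r2_cp_command`, the account ID `abc123`, the bucket `my-bucket`, and the profile name `r2` are placeholders and not part of the original snippet.

```python
import shlex


def r2_cp_command(source, destination, account_id, profile=None):
    """Build the argument list for an `aws s3 cp` call that targets R2."""
    command = [
        "aws", "s3", "cp", source, destination,
        # R2 needs its own endpoint on every call.
        "--endpoint-url", f"https://{account_id}.r2.cloudflarestorage.com",
        # Required when copying to R2, per the caveats listed above.
        "--checksum-algorithm", "CRC32",
    ]
    if profile is not None:
        # Only needed when several AWS configurations are stored.
        command += ["--profile", profile]
    return command


# Placeholder values for illustration only.
command = r2_cp_command(
    "data.csv", "s3://my-bucket/data.csv", account_id="abc123", profile="r2"
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where the AWS CLI is installed and the profile is configured.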
DuckDB vs. coreutils
A few months ago, I wrote a post on the DuckDB blog where I explained how DuckDB’s SQL can express operations that developers typically implement with UNIX commands. Then earlier this week, I published a light-hearted social media post about DuckDB beating the UNIX wc -l command for counting the lines in a CSV file by a significant margin (1.2 vs. 2.9 seconds).
This post received a lot of feedback, with the criticism centering on two points:...
“Data Science at the Command Line” book in DuckDB
Today I solved the exercises in Chapter 5 of the Data Science at the Command Line book using the DuckDB command line client. This page documents my solutions.
Prerequisites
Clone the https://github.com/jeroenjanssens/dsutils repository and add it to the PATH.
To get the results for the reference solutions, you also need csvkit, which contains the csvlook, csvcut, csvsql, etc. CLI tools.
brew install csvkit
DuckDB Solutions
In the following, I give the DuckDB solutions for each exercise....
Installing tidyverse on macOS
The tidyverse R package cannot be installed on macOS because one of its dependencies, ragg, fails to compile with the following error:
clang++ -std=gnu++17 -I"/opt/homebrew/Cellar/r/4.4.1/lib/R/include" -DNDEBUG -I./agg/include -I/opt/homebrew/opt/freetype/include/freetype2 -I/opt/homebrew/opt/libpng/include/libpng16 -I/opt/homebrew/Cellar/libtiff/4.6.0/include -I/opt/homebrew/opt/zstd/include -I/opt/homebrew/Cellar/xz/5.6.2/include -I/opt/homebrew/Cellar/jpeg-turbo/3.0.3/include -I'/opt/homebrew/lib/R/4.4/site-library/systemfonts/include' -I'/opt/homebrew/lib/R/4.4/site-library/textshaping/include' -I/opt/homebrew/opt/gettext/include -I/opt/homebrew/opt/readline/include -I/opt/homebrew/opt/xz/include -I/opt/homebrew/include -fPIC -g -O2 -Wall -pedantic -fdiagnostics-color=always -c agg/src/agg_vcgen_stroke.cpp -o agg/src/agg_vcgen_stroke.o
agg/src/agg_font_freetype.cpp:116:18: warning: variable 'len' set but not used [-Wunused-but-set-variable]
unsigned len = 0;
^
agg/src/agg_font_freetype.cpp:182:35: error: assigning to 'char *' from 'unsigned char *' converts between pointers to integer types where one is of the unique plain 'char' type and the other is not
tags = outline....