Regular Expressions for Catching Typical Writing Errors
Some regular expressions for catching typical writing errors (spelling issues, repeated words, etc.):
ag --md " a [aeioAEIOU]"
ag --md -i " an [bcdfghjklmnpqrstvxzwy]"
ag --md -i "\s\b(\w+) +\1\b"
ag --md -i "\s\b(\w+ +\w+) +\1\b"
ag --md '\w+isation\b'
ag --md 'e\.g\. '
ag --md 'i\.e\. '
...
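To try the repeated-word pattern without ag, here is a minimal sketch that wraps it in a shell function. It assumes a grep whose extended mode understands `\w`, `\b`, and backreferences (GNU grep does; other implementations may not). The function name `find_repeated_words` is made up for this illustration.

```shell
# Hypothetical helper: print the lines from stdin that contain a repeated word,
# each prefixed with its line number.
# Assumes GNU grep: \b, \w, and backreferences are extensions to POSIX ERE.
find_repeated_words() {
    grep -n -i -E '\b(\w+) +\1\b'
}

# Usage: find_repeated_words < draft.md
```

The function exits with a non-zero status when nothing matches, which is the normal grep convention.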
Database conferences in 2025
In 2025, both top-tier database conferences will be in Europe: SIGMOD in Berlin (June 22–27) and VLDB in London (September 1–5). There are quite a few papers and satellite events I am looking forward to – I listed them below.
SIGMOD 2025
Graphs
The GRADES-NDA 2025 workshop on Friday
Revisiting Graph Analytics Benchmarks by Lingkai Meng et al. The authors analyze the LDBC Graphalytics benchmark and propose several improvements to both the graph generator and the benchmark suite....
Generating TPCx-BB data sets
TPCx-BB (née BigBench) is a Big Data benchmark. To generate the TPCx-BB data sets, download the TPCX-BB_Tools_vX.Y.Z.zip package from the TPC website.
To run the generator, you’ll need a Java 8-compatible Java Virtual Machine. To obtain this, you can, e.g., install the Zulu JVM via SDKMAN!. Then, you can run the generator as follows:
java -cp pdgf.jar pdgf.Controller
To list the available commands, run:
help
To start the data generation, run:...
Cloudflare R2 command line snippet
I am a big fan of Cloudflare R2, an object storage service that provides egress-free downloads.
R2 is compatible with the AWS S3 API, so you can use the AWS CLI tool – with a few caveats. These include:
You need to add the --endpoint-url https://<account_id>.r2.cloudflarestorage.com argument for every call.
When copying to R2, you need to pass the --checksum-algorithm CRC32 argument.
I often store multiple AWS configurations, which requires passing an additional argument: --profile <your_r2_profile_name>....
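As a rough sketch, the arguments that every call needs can be bundled into a small shell function. The function name `r2` and the environment variables `R2_ACCOUNT_ID` and `R2_PROFILE` are assumptions made for this example, not part of the AWS CLI or of R2.

```shell
# Hypothetical wrapper around the AWS CLI for Cloudflare R2.
# R2_ACCOUNT_ID and R2_PROFILE are assumed environment variables, e.g.:
#   export R2_ACCOUNT_ID="<account_id>"
#   export R2_PROFILE="<your_r2_profile_name>"
r2() {
    aws "$@" \
        --endpoint-url "https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com" \
        --profile "${R2_PROFILE}"
}

# Usage (bucket and file names are placeholders):
#   r2 s3 ls
#   r2 s3 cp data.csv s3://my-bucket/ --checksum-algorithm CRC32
```

The checksum argument is left to the caller because it only applies when copying to R2.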
DuckDB vs. coreutils
A few months ago, I wrote a post on the DuckDB blog where I explained how DuckDB’s SQL can express operations that developers typically implement with UNIX commands. Then earlier this week, I published a light-hearted social media post about DuckDB beating the UNIX wc -l command for counting the lines in a CSV file by a significant margin (1.2 vs. 2.9 seconds).
This post received a lot of feedback, with the criticism centered on two points:...
Data Science at the Command Line Book in DuckDB
Today I solved the exercises in Chapter 5 of the Data Science at the Command Line book using the DuckDB command line client. This page documents my solutions.
Prerequisites
Clone the https://github.com/jeroenjanssens/dsutils repository and add it to the PATH.
To get the results for the reference solutions, you also need csvkit, which contains the csvlook, csvcut, csvsql, etc. CLI tools.
brew install csvkit
DuckDB Solutions
In the following, I give the DuckDB solutions for each exercise....
Installing tidyverse on macOS
The tidyverse R package cannot be installed on macOS because one of its dependencies, ragg, fails to compile with the following error:
clang++ -std=gnu++17 -I"/opt/homebrew/Cellar/r/4.4.1/lib/R/include" -DNDEBUG -I./agg/include -I/opt/homebrew/opt/freetype/include/freetype2 -I/opt/homebrew/opt/libpng/include/libpng16 -I/opt/homebrew/Cellar/libtiff/4.6.0/include -I/opt/homebrew/opt/zstd/include -I/opt/homebrew/Cellar/xz/5.6.2/include -I/opt/homebrew/Cellar/jpeg-turbo/3.0.3/include -I'/opt/homebrew/lib/R/4.4/site-library/systemfonts/include' -I'/opt/homebrew/lib/R/4.4/site-library/textshaping/include' -I/opt/homebrew/opt/gettext/include -I/opt/homebrew/opt/readline/include -I/opt/homebrew/opt/xz/include -I/opt/homebrew/include -fPIC -g -O2 -Wall -pedantic -fdiagnostics-color=always -c agg/src/agg_vcgen_stroke.cpp -o agg/src/agg_vcgen_stroke.o
agg/src/agg_font_freetype.cpp:116:18: warning: variable 'len' set but not used [-Wunused-but-set-variable]
unsigned len = 0;
^
agg/src/agg_font_freetype.cpp:182:35: error: assigning to 'char *' from 'unsigned char *' converts between pointers to integer types where one is of the unique plain 'char' type and the other is not
tags = outline....
Setting up a MacBook for Presentations
Overview
A recurring task in my day job is to organize technical conferences (most recently, DuckCon #4 and #5), and to run the event from my laptop. To this end, I configure my laptop to ensure the best experience for both speakers and attendees.
Most of the events I organize are free, so there is a limited budget available. Additionally, there is a limited amount of time to prepare. For example, it is not possible to conduct rehearsals with speakers....
macOS command line tricks
Make git beep upon failed push
Motivation: When I issue a git push command, I immediately navigate away from the terminal. Therefore, if the command fails due to the remote rejecting it after a second, I do not see this and assume that the push was successful.
To avoid this, we’ll configure the shell so that when git push fails, it emits a short beep. To do so, follow these steps:...
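One way to get this behavior, sketched here under assumptions and not as the definitive setup, is to shadow git with a shell function that rings the terminal bell when a push fails. The helper name `beep_on_failure` is hypothetical, and the beep is only audible if the terminal has its bell enabled.

```shell
# Hypothetical helper: run a command and ring the terminal bell if it fails.
beep_on_failure() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        printf '\a'   # BEL character; audible only if the terminal bell is enabled
    fi
    return "$rc"
}

# Shadow git so that only `git push` gets the beep; other subcommands pass through.
git() {
    if [ "$1" = "push" ]; then
        beep_on_failure command git "$@"
    else
        command git "$@"
    fi
}
```

Placing these definitions in the shell startup file (for example `~/.zshrc`) would make them available in every new terminal session. The exit status of git is preserved, so scripts that check it keep working.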
DuckDB workshop
Setup
DuckDB installation site
duckman: DuckDB Version Manager
railway.ipynb Jupyter notebook
Weather data set
Source: Visual Crossing Weather
wget https://blobs.duckdb.org/data/amsterdam-weather.csv
Railway data set
Source: Rijden de Treinen
wget https://blobs.duckdb.org/nl-railway/stations-2022-01.csv
wget https://blobs.duckdb.org/nl-railway/tariff-distances-2022-01.csv
wget https://blobs.duckdb.org/nl-railway/services-2019.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2020.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2021.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2022.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2023.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2024-01.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2024-02.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2024-03.csv.gz
wget https://blobs.duckdb.org/nl-railway/services-2024-04.csv.gz
VS Code hotkey
Define keyboard shortcut Ctrl + Enter for executing selection in terminal
To use the same editor (VS Code) and the same keyboard shortcut (Ctrl + Enter) to run the active line / selected piece of code (for CLI) or the active cell (for Python notebooks), add the CLI hotkey as follows:...