Today I solved the exercises in Chapter 5 of the Data Science at the Command Line book using the DuckDB command line client. This page documents my solutions.

Prerequisites

Clone the https://github.com/jeroenjanssens/dsutils repository and add it to your PATH.
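
For example, to make the tools available in the current shell session (my sketch, assuming you clone into the current directory):

git clone https://github.com/jeroenjanssens/dsutils.git
export PATH="$PATH:$(pwd)/dsutils"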

To get the results for the reference solutions, you also need csvkit, which provides the csvlook, csvcut, csvsql, and other CLI tools.

brew install csvkit

DuckDB Solutions

In the following, I give the DuckDB solution for each exercise. If the command is part of a pipeline (e.g., when the input is read from standard input), DuckDB is called as ... | duckdb -c "⟨SQL query⟩". Otherwise, the solution is presented as a standalone SQL script.
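
For example, a minimal instance of the pipeline pattern (with a made-up two-column input):

printf 'a,b\n1,2\n' | duckdb -c "select a + b as sum from read_csv('/dev/stdin')"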

5.2 Transformations, Transformations Everywhere

seq 100 |\
  python ../ch04/fizzbuzz.py |\
  tee fb.seq | trim
grep -E "fizz|buzz" fb.seq |
sort | uniq -c | sort -nr > fb.cnt
bat -A fb.cnt
< fb.cnt awk 'BEGIN { print "value,count" } { print $2","$1 }' > fb.csv
create table fb as from read_csv('fb.seq', header = false);
create table counts as
    select column0 as value, count(*) as count
    from fb
    where column0 similar to '^(fizz|buzz).*'
    group by all
    order by count desc;
copy counts to 'fb.csv';
bat fb.csv
csvlook fb.csv
rush plot -x value -y count --geom col --height 2 fb.csv > fb.png
display fb.png

Instead of rush, I used YouPlot (brew install youplot):

cat fb.csv | uplot bar -d, -H
                              count
            ┌                                        ┐
       fizz ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 27.0
       buzz ┤■■■■■■■■■■■■■■■■■■ 14.0
   fizzbuzz ┤■■■■■■■■ 6.0
            └                                        ┘

5.3 Plain Text

5.3.1 Filtering Lines

seq -f "Line %g" 10 | tee lines
< lines head -n 3
< lines sed -n '1,3p'
< lines awk 'NR <= 3'
copy (from read_csv('lines', header = false) limit 3)
to '/dev/stdout' (header false);
< lines tail -n 3
copy (
    select column0
    from (
        select column0, row_number() over () as rownum
        from read_csv('lines', header = false)
        order by rownum desc
        limit 3
    )
    order by rownum asc
) to '/dev/stdout' (header false);

(Okay, < lines tail -n 3 is a bit more concise and probably a lot faster too.)
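
A single-pass DuckDB alternative is possible with the qualify clause (my sketch; it assumes the CSV reader emits rows in file order, which holds in practice but is not guaranteed):

copy (
    from read_csv('lines', header = false)
    qualify row_number() over () > count(*) over () - 3
) to '/dev/stdout' (header false);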

< lines tail -n +4
< lines sed '1,3d'
< lines sed -n '1,3!p'
copy (from read_csv('lines', header = false) offset 3)
to '/dev/stdout' (header false);
< lines head -n -3
create or replace table lines as from read_csv('lines', header = false);
copy (from lines where rowid < (select count(*) from lines) - 3)
to '/dev/stdout' (header false);
< lines sed -n '4,6p'
< lines awk '(NR>=4) && (NR<=6)'
< lines head -n 6 | tail -n 3
copy (from lines where rowid+1 between 4 and 6)
to '/dev/stdout' (header false);
< lines sed -n '1~2p'
< lines awk 'NR%2'
copy (from lines where (rowid+1) % 2 = 1)
to '/dev/stdout' (header false);
< lines sed -n '0~2p'
< lines awk '(NR+1)%2'
copy (from lines where (rowid+1) % 2 = 0)
to '/dev/stdout' (header false);
< alice.txt grep -i chapter
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -E '^CHAPTER (.*)\. The'
from read_csv('alice.txt', header = false)
where column0 similar to 'CHAPTER (.*)\. The.*';
< alice.txt grep '^CHAPTER (.*)\. The'
-- This example is only there to show that this regex does not have any results
-- with vanilla (non-extended mode) grep, so there's no DuckDB solution for this
< alice.txt grep -Ev '^\s$' | wc -l
select count(*) from read_csv('alice.txt', header = false)
where column0 not similar to '^\s$';
RANDOM=1337
seq -f "Line %g" 1000 | sample -r 1%
seq -f "Line %g" 1000 | sample -r 1% -d 1000 -s 5 | ts
# the DuckDB solution below does not fix the random seed
seq -f "Line %g" 1000 | \
    duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli)"
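
That said, the sample clause accepts a seed as a second argument; as far as I know, percentage samples with a fixed seed are only deterministic when DuckDB runs single-threaded:

seq -f "Line %g" 1000 | \
    duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli, 1337)"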

5.3.2 Extracting Values

grep -i chapter alice.txt | cut -d ' ' -f 3-
select list_reduce(column0.split(' ')[3:], (x, y) -> x || ' ' || y) as val
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p' alice.txt | trim 3
select column0.regexp_replace('CHAPTER ([IVXLCDM]{1,})\. (.*)$', '\2') as val
from read_csv('alice.txt', header = false)
where column0.regexp_matches('CHAPTER ([IVXLCDM]{1,})\. (.*)$');
grep -i chapter alice.txt | cut -c 9-
select column0[9:] as title
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -oE '\w{2,}' | trim
select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
from read_text('alice.txt');
< alice.txt tr '[:upper:]' '[:lower:]' |\
grep -oE '\w{2,}' |\
grep -E '^a.*e$' |\
sort | uniq | sort -nr | trim
select distinct word
from (
    select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
    from read_text('alice.txt')
)
where word like 'a%e'
order by word desc;

5.3.3 Replacing and Deleting Values

echo 'hello world!' | tr ' ' '_'
select replace('hello world!', ' ', '_');
echo 'hello world!' | tr ' !' '_?'
select replace('hello world!', ' ', '_').replace('!', '?');
echo 'hello world!' | tr -d ' !'
select regexp_replace('hello world!', '[ !]', '', 'g');
echo 'hello world!' | tr -d -c '[a-z]'
select regexp_replace('hello world!', '[^a-z]', '', 'g');
echo 'hello world!' | tr '[a-z]' '[A-Z]'
select upper('hello world!');
echo 'hello world!' | tr '[:lower:]' '[:upper:]'
echo 'hello world!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[:lower:]' '[:upper:]'
echo 'hallo wêreld!' | sed 's/[[:lower:]]*/\U&/g'
echo 'helló világ' | tr '[:lower:]' '[:upper:]'
echo 'helló világ' | sed 's/[[:lower:]]*/\U&/g'
select upper('helló világ');
echo ' hello     world!' |\
sed -re 's/hello/bye/' |\
sed -re 's/\s+/ /g' |\
sed -re 's/\s+//'
select (' hello     world!')
    .replace('hello', 'bye')
    .regexp_replace('\s+', ' ', 'g')
    .regexp_replace('\s+', '');
sed -re 's/hello/bye/;s/\s+/ /g;s/\s+//'
-- This sed shorthand is not applicable to DuckDB

5.4 CSV

5.4.1 Bodies and Headers and Columns, Oh My!

echo -e "value\n7\n2\n5\n3" | body sort -n
echo -e "value\n7\n2\n5\n3" | \
    duckdb -c "copy (from read_csv('/dev/stdin') order by value) to '/dev/stdout'"
seq 5 | header -a count
seq 5 | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
seq 5 | header -a count | wc -l
seq 5 \
    | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'" \
    | wc -l
seq 5 | header -a count | body wc -l
seq 5 \
    | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout' (header false)" \
    | wc -l \
    | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
< tips.csv header
copy (from 'tips.csv' limit 0) to '/dev/stdout';
< iris.csv header -d | trim
copy (from read_csv('iris.csv', header = true)) to '/dev/stdout' (header false);
seq 5 | header -a line | body wc -l | header -r count
seq 5 | header -a line | header -e "tr '[a-z]' '[A-Z]'"
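
I did not find exact DuckDB one-liners for these two, but rough equivalents are possible (my sketches):

seq 5 | duckdb -c "copy (select count(*) as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
seq 5 | duckdb -c "copy (select column0 as \"LINE\" from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
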
< tips.csv cols -c day body "tr '[a-z]' '[A-Z]'" | head -n 5 | csvlook
# `csvlook` turns this into a date in 1 AD (0001), so just omit `csvlook`
# (this error is present both in the printed book and on the website)
select * replace (upper(day) as day) from 'tips.csv';
seq 5 | header -a val | csvsql --query "SELECT SUM(val) AS sum FROM stdin"
seq 5 | header -a val | duckdb -c "select sum(val) as sum from read_csv('/dev/stdin')"

5.4.3 Extracting and Reordering Columns

< iris.csv csvcut -c sepal_length,petal_length,sepal_width,petal_width | csvlook
select sepal_length,petal_length,sepal_width,petal_width from 'iris.csv';
< iris.csv csvcut -C species | csvlook
select * exclude species from 'iris.csv';
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
csvcut -c $(seq 1 2 9 | paste -sd,)
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
# hack for odd numbers < 10
duckdb -c "select #1, #3, #5, #7, #9 from read_csv('/dev/stdin')"
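
Instead of hard-coding positions, DuckDB's columns expression can select columns whose names match a regular expression; the following sketch only works because the column names here are single letters:

echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select columns('[acegi]') from read_csv('/dev/stdin')"
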
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | cut -d, -f 5,1,3
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select #1, #3, #5 from read_csv('/dev/stdin')"
< iris.csv csvsql --query "SELECT sepal_length, petal_length, "\
"sepal_width, petal_width FROM stdin" | head -n 5 | csvlook
select sepal_length, petal_length, sepal_width, petal_width
from read_csv('iris.csv')
limit 4;

5.4.4 Filtering Rows

seq 5 | sed -n '3,5p'
seq 5 | header -a count | body sed -n '3,5p'
seq 5 | duckdb -c "from read_csv('/dev/stdin', header = true) offset 1 limit 3"
csvgrep -c size -i -r "[1-4]" tips.csv
from 'tips.csv' where not size < 5;
< tips.csv awk -F, 'NR==1 || ($1 > 40.0) && ($5 ~ /^S/)'
csvsql --query "SELECT * FROM tips WHERE bill > 40 AND day LIKE 'S%'" tips.csv
from 'tips.csv' where bill > 40 and day in ('Sat', 'Sun');

5.4.5 Merging Columns

csvlook -I names.csv
< names.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' |\
csvlook -I
< names.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' |\
csvlook -I
< names.csv |\
cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born |\
csvcut -c id,full_name,born |\
csvlook -I
< names.csv csvsql --query "SELECT id, first_name || ' ' || last_name "\
"AS full_name, born FROM stdin" | csvlook -I
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names.csv');
cat names-comma.csv
< names-comma.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' | tail -n 1
< names-comma.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' | tail -n 1
< names-comma.csv | cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born | csvcut -c id,full_name,born | tail -n 1
< names-comma.csv csvsql --query "SELECT id, first_name || ' ' || last_name AS full_name, born FROM stdin" | tail -n 1
< names-comma.csv rush run -t 'unite(df, full_name, first_name, last_name, sep = " ")' - | tail -n 1
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names-comma.csv');

5.4.6 Combining Multiple CSV Files

< tips.csv csvcut -c bill,tip | tee bills.csv | head -n 3 | csvlook
< tips.csv csvcut -c day,time | tee datetime.csv |\
head -n 3 | csvlook -I
< tips.csv csvcut -c sex,smoker,size | tee customers.csv |\
head -n 3 | csvlook
copy (select bill, tip from 'tips.csv') to 'bills.csv';
copy (select day, time from 'tips.csv') to 'datetime.csv';
copy (select sex, smoker, size from 'tips.csv') to 'customers.csv';
paste -d, {bills,customers,datetime}.csv | head -n 3 | csvlook -I
from 'bills.csv'
positional join 'customers.csv'
positional join 'datetime.csv';
csvlook irismeta.csv
csvjoin -c species iris.csv irismeta.csv | csvcut -c sepal_length,sepal_width,species,usda_id | sed -n '1p;49,54p' | csvlook
csvsql --query 'SELECT i.sepal_length, i.sepal_width, i.species, m.usda_id FROM iris i JOIN irismeta m ON (i.species = m.species)' iris.csv irismeta.csv | sed -n '1p;49,54p' | csvlook
select i.sepal_length, i.sepal_width, i.species, m.usda_id
from 'iris.csv' i
natural join 'irismeta.csv' m;

5.5 Working with XML/HTML and JSON

sed -e 's/"gender":/"sex":/g' users.json | jq | trim
copy (
    select
        to_json(list(struct_pack(
            sex := result.gender,
            name := result.name,
            location := result.location,
            email := result.email,
            login := result.login,
            dob := result.dob,
            registered := result.registered,
            phone := result.phone,
            cell := result.cell,
            id := result.id,
            picture := result.picture,
            nat := result.nat
        ))) as results,
        info
    from
    (
        select unnest(results) as result, info
        from 'users.json'
    )
    group by info
)
to 'users2.json'
(format json, array false);

(Okay, this is not too practical. Just use sed to patch JSON fields.)

I skipped the XML/HTML exercises; the reference commands are shown below without DuckDB solutions.

curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' > wiki.html
< wiki.html trim
grep wikitable -A 21 wiki.html
< wiki.html pup 'table.wikitable tbody' | tee table.html | trim
< table.html xml2json > table.json
jq . table.json | trim 20
< table.json jq -r '.tbody.tr[1:][] | [.td[]["$t"]] | @csv' | header -a rank,country,border,surface,ratio > countries.csv
csvlook --max-column-width 28 countries.csv