Today I solved the exercises in Chapter 5 of the Data Science at the Command Line book using the DuckDB command line client. This page documents my solutions.
Prerequisites
Clone the https://github.com/jeroenjanssens/dsutils repository and add it to the PATH.
To reproduce the reference solutions, you also need csvkit, which provides CLI tools such as csvlook, csvcut, and csvsql:
brew install csvkit
DuckDB Solutions
In the following, I give the DuckDB solution for each exercise.
If the command is part of a pipeline (e.g., when the input is read from the standard input), DuckDB is invoked as ... | duckdb -c "⟨SQL query⟩".
Otherwise, the solution is presented as a plain SQL script.
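For instance, a toy pipeline of this form (not one of the exercises, just a sketch of the pattern):
seq 3 | duckdb -c "select 2 * column0 as doubled from read_csv('/dev/stdin', header = false)"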
5.2 Transformations, Transformations Everywhere
seq 100 |\
python ../ch04/fizzbuzz.py |\
tee fb.seq | trim
grep -E "fizz|buzz" fb.seq | # <1>
sort | uniq -c | sort -nr > fb.cnt # <2>
bat -A fb.cnt
< fb.cnt awk 'BEGIN { print "value,count" } { print $2","$1 }' > fb.csv
create table fb as from read_csv('fb.seq', header = false);
create table counts as
select column0 as value, count(*) as count
from fb
where column0 similar to '^(fizz|buzz).*'
group by all
order by count desc;
copy counts to 'fb.csv';
bat fb.csv
csvlook fb.csv
rush plot -x value -y count --geom col --height 2 fb.csv > fb.png
display fb.png
Instead of rush, I used YouPlot (brew install youplot):
cat fb.csv | uplot bar -d, -H
                              count
           ┌                                        ┐
      fizz ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 27.0
      buzz ┤■■■■■■■■■■■■■■■■■■ 14.0
  fizzbuzz ┤■■■■■■■■ 6.0
           └                                        ┘
5.3 Plain Text
5.3.1 Filtering Lines
seq -f "Line %g" 10 | tee lines
< lines head -n 3
< lines sed -n '1,3p'
< lines awk 'NR <= 3'
copy (from read_csv('lines', header = false) limit 3)
to '/dev/stdout' (header false);
< lines tail -n 3
copy (
select column0
from (
select column0, row_number() over () as rownum
from read_csv('lines', header = false)
order by rownum desc
limit 3
)
order by rownum asc
) to '/dev/stdout' (header false);
(Okay, < lines tail -n 3 is a bit more concise and probably a lot faster too.)
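(Alternatively, reusing the rowid trick from the head -n -3 solution below avoids the window function entirely:)
create or replace table lines as from read_csv('lines', header = false);
copy (from lines where rowid >= (select count(*) from lines) - 3)
to '/dev/stdout' (header false);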
< lines tail -n +4
< lines sed '1,3d'
< lines sed -n '1,3!p'
copy (from read_csv('lines', header = false) offset 3)
to '/dev/stdout' (header false);
< lines head -n -3
create or replace table lines as from read_csv('lines', header = false);
copy (from lines where rowid < (select count(*) from lines) - 3)
to '/dev/stdout' (header false);
< lines sed -n '4,6p'
< lines awk '(NR>=4) && (NR<=6)'
< lines head -n 6 | tail -n 3
copy (from lines where rowid+1 between 4 and 6)
to '/dev/stdout' (header false);
< lines sed -n '1~2p'
< lines awk 'NR%2'
copy (from lines where (rowid+1) % 2 = 1)
to '/dev/stdout' (header false);
< lines sed -n '0~2p'
< lines awk '(NR+1)%2'
copy (from lines where (rowid+1) % 2 = 0)
to '/dev/stdout' (header false);
< alice.txt grep -i chapter
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -E '^CHAPTER (.*)\. The'
from read_csv('alice.txt', header = false)
where column0 similar to 'CHAPTER (.*)\. The.*';
< alice.txt grep '^CHAPTER (.*)\. The'
-- This example is only there to show that this regex does not have any results
-- with vanilla (non-extended mode) grep, so there's no DuckDB solution for this
< alice.txt grep -Ev '^\s$' | wc -l
select count(*) from read_csv('alice.txt', header = false)
where column0 not similar to '^\s$';
RANDOM=1337
seq -f "Line %g" 1000 | sample -r 1%
seq -f "Line %g" 1000 | sample -r 1% -d 1000 -s 5 | ts
# the DuckDB solution does not have a fixed random seed
seq -f "Line %g" 1000 | \
duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli)"
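(If reproducibility matters, DuckDB's sample clause accepts a seed as a second argument, so a fixed-seed variant along these lines should work:)
seq -f "Line %g" 1000 | \
duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli, 1337)"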
5.3.2 Extracting Values
grep -i chapter alice.txt | cut -d ' ' -f 3-
select list_reduce(column0.split(' ')[3:], (x, y) -> x || ' ' || y) as val
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
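The list_reduce trick can likely be replaced with array_to_string, which joins the list elements with a separator:
select array_to_string(column0.split(' ')[3:], ' ') as val
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';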
sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p' alice.txt | trim 3
select column0.regexp_replace('CHAPTER ([IVXLCDM]{1,})\. (.*)$', '\2') as val
from read_csv('alice.txt', header = false)
where column0.regexp_matches('CHAPTER ([IVXLCDM]{1,})\. (.*)$');
grep -i chapter alice.txt | cut -c 9-
select column0[9:] as title
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -oE '\w{2,}' | trim
select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
from read_text('alice.txt');
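(As a small extension the exercise doesn't ask for, the unnested words are one group by away from a frequency table:)
select word, count(*) as count
from (
    select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
    from read_text('alice.txt')
)
group by word
order by count desc
limit 10;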
< alice.txt tr '[:upper:]' '[:lower:]' |\
grep -oE '\w{2,}' |\
grep -E '^a.*e$' |\
sort | uniq | sort -nr | trim
select distinct word
from (
select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
from read_text('alice.txt')
)
where word like 'a%e'
order by word desc;
5.3.3 Replacing and Deleting Values
echo 'hello world!' | tr ' ' '_'
select replace('hello world!', ' ', '_');
echo 'hello world!' | tr ' !' '_?'
select replace('hello world!', ' ', '_').replace('!', '?');
echo 'hello world!' | tr -d ' !'
select regexp_replace('hello world!', '[ !]', '', 'g');
echo 'hello world!' | tr -d -c '[a-z]'
select regexp_replace('hello world!', '[^a-z]', '', 'g');
echo 'hello world!' | tr '[a-z]' '[A-Z]'
select upper('hello world!');
echo 'hello world!' | tr '[:lower:]' '[:upper:]'
echo 'hello world!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[:lower:]' '[:upper:]'
echo 'hallo wêreld!' | sed 's/[[:lower:]]*/\U&/g'
echo 'helló világ' | tr '[:lower:]' '[:upper:]'
echo 'helló világ' | sed 's/[[:lower:]]*/\U&/g'
select upper('helló világ');
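(DuckDB's upper() is Unicode-aware, so the accented example from above works as well:)
select upper('hallo wêreld!');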
echo ' hello world!' |\
sed -re 's/hello/bye/' |\
sed -re 's/\s+/ /g' |\
sed -re 's/\s+//'
select (' hello world!')
.replace('hello', 'bye')
.regexp_replace('\s+', ' ', 'g')
.regexp_replace('\s+', '');
sed -re 's/hello/bye/;s/\s+/ /g;s/\s+//'
-- This sed shorthand is not applicable to DuckDB
5.4 CSV
5.4.1 Bodies and Headers and Columns, Oh My!
echo -e "value\n7\n2\n5\n3" | body sort -n
echo -e "value\n7\n2\n5\n3" | \
duckdb -c "copy (from read_csv('/dev/stdin') order by value) to '/dev/stdout'"
seq 5 | header -a count
seq 5 | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
seq 5 | header -a count | wc -l
seq 5 \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'" \
| wc -l
seq 5 | header -a count | body wc -l
seq 5 \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout' (header false)" \
| wc -l \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
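The wc -l round-trip mimics the book's pipeline step by step, but counting directly in a single DuckDB call gives the same output:
seq 5 | duckdb -c "copy (select count(*) as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"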
< tips.csv header
copy (from 'tips.csv' limit 0) to '/dev/stdout';
< iris.csv header -d | trim
copy (from read_csv('iris.csv', header = true)) to '/dev/stdout' (header false);
seq 5 | header -a line | body wc -l | header -r count
seq 5 | header -a line | header -e "tr '[a-z]' '[A-Z]'"
< tips.csv cols -c day body "tr '[a-z]' '[A-Z]'" | head -n 5 | csvlook
# `csvlook` turns this into a date in 1 AD (0001), so just omit `csvlook`
# (this error is present both in the printed book and on the website)
select * replace (upper(day) as day) from 'tips.csv';
seq 5 | header -a val | csvsql --query "SELECT SUM(val) AS sum FROM stdin"
seq 5 | header -a val | duckdb -c "select sum(val) as sum from read_csv('/dev/stdin')"
5.4.3 Extracting and Reordering Columns
< iris.csv csvcut -c sepal_length,petal_length,sepal_width,petal_width | csvlook
select sepal_length,petal_length,sepal_width,petal_width from 'iris.csv';
< iris.csv csvcut -C species | csvlook
select * exclude species from 'iris.csv';
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
csvcut -c $(seq 1 2 9 | paste -sd,)
# hack for odd numbers < 10
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select #1, #3, #5, #7, #9 from read_csv('/dev/stdin')"
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | cut -d, -f 5,1,3
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select #1, #3, #5 from read_csv('/dev/stdin')"
< iris.csv csvsql --query "SELECT sepal_length, petal_length, "\
"sepal_width, petal_width FROM stdin" | head -n 5 | csvlook
select sepal_length, petal_length, sepal_width, petal_width
from read_csv('iris.csv')
limit 4;
5.4.4 Filtering Rows
seq 5 | sed -n '3,5p'
seq 5 | header -a count | body sed -n '3,5p'
seq 5 | duckdb -c "from read_csv('/dev/stdin', header = true) offset 1 limit 3"
csvgrep -c size -i -r "[1-4]" tips.csv
from 'tips.csv' where not size < 5;
< tips.csv awk -F, 'NR==1 || ($1 > 40.0) && ($5 ~ /^S/)'
csvsql --query "SELECT * FROM tips WHERE bill > 40 AND day LIKE 'S%'" tips.csv
from 'tips.csv' where bill > 40 and day in ('Sat', 'Sun');
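(A more literal transliteration of the csvsql query's LIKE condition also works:)
from 'tips.csv' where bill > 40 and day like 'S%';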
5.4.5 Merging Columns
csvlook -I names.csv
< names.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' |\
csvlook -I
< names.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' |\
csvlook -I
< names.csv |\
cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born |\
csvcut -c id,full_name,born |\
csvlook -I
< names.csv csvsql --query "SELECT id, first_name || ' ' || last_name "\
"AS full_name, born FROM stdin" | csvlook -I
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names.csv');
cat names-comma.csv
< names-comma.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' | tail -n 1
< names-comma.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' | tail -n 1
< names-comma.csv | cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born | csvcut -c id,full_name,born | tail -n 1
< names-comma.csv csvsql --query "SELECT id, first_name || ' ' || last_name AS full_name, born FROM stdin" | tail -n 1
< names-comma.csv rush run -t 'unite(df, full_name, first_name, last_name, sep = " ")' - | tail -n 1
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names-comma.csv');
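(The catch in this variant is that names-comma.csv contains a last name with an embedded comma inside a quoted field: the sed, awk, and cols solutions mangle it, while proper CSV parsers such as csvsql, rush, and DuckDB's read_csv handle it correctly.)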
5.4.6 Combining Multiple CSV Files
< tips.csv csvcut -c bill,tip | tee bills.csv | head -n 3 | csvlook
< tips.csv csvcut -c day,time | tee datetime.csv |\
head -n 3 | csvlook -I
< tips.csv csvcut -c sex,smoker,size | tee customers.csv |\
head -n 3 | csvlook
copy (select bill, tip from 'tips.csv') to 'bills.csv';
copy (select day, time from 'tips.csv') to 'datetime.csv';
copy (select sex, smoker, size from 'tips.csv') to 'customers.csv';
paste -d, {bills,customers,datetime}.csv | head -n 3 | csvlook -I
from 'bills.csv'
positional join 'customers.csv'
positional join 'datetime.csv';
csvlook irismeta.csv
csvjoin -c species iris.csv irismeta.csv | csvcut -c sepal_length,sepal_width,species,usda_id | sed -n '1p;49,54p' | csvlook
csvsql --query 'SELECT i.sepal_length, i.sepal_width, i.species, m.usda_id FROM iris i JOIN irismeta m ON (i.species = m.species)' iris.csv irismeta.csv | sed -n '1p;49,54p' | csvlook
select i.sepal_length, i.sepal_width, i.species, m.usda_id
from 'iris.csv' i
natural join 'irismeta.csv' m;
5.5 Working with XML/HTML and JSON
sed -e 's/"gender":/"sex":/g' users.json | jq | trim
copy (
select
to_json(list(struct_pack(
sex := result.gender,
name := result.name,
location := result.location,
email := result.email,
login := result.login,
dob := result.dob,
registered := result.registered,
phone := result.phone,
cell := result.cell,
id := result.id,
picture := result.picture,
nat := result.nat
))) as results,
info
from
(
select unnest(results) as result, info
from 'users.json'
)
group by info
)
to 'users2.json'
(format json, array false);
(Okay, this is not too practical. Just use sed to patch JSON fields.)
I skipped the XML/HTML exercises below: I only include the book's reference solutions, without DuckDB equivalents.
curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' > wiki.html
< wiki.html trim
grep wikitable -A 21 wiki.html
< wiki.html pup 'table.wikitable tbody' | tee table.html | trim
< table.html xml2json > table.json
jq . table.json | trim 20
< table.json jq -r '.tbody.tr[1:][] | [.td[]["$t"]] | @csv' | header -a rank,country,border,surface,ratio > countries.csv
csvlook --max-column-width 28 countries.csv
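(Even though I skipped the DuckDB side of the scraping, the resulting countries.csv is queryable as usual; a quick sketch using the header added above:)
select country, ratio
from 'countries.csv'
limit 5;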