Today I solved the exercises in Chapter 5 of the Data Science at the Command Line book using the DuckDB command line client. This page documents my solutions.
Prerequisites
Clone the https://github.com/jeroenjanssens/dsutils repository and add it to the PATH.
To reproduce the reference solutions, you also need csvkit, which provides CLI tools such as csvlook, csvcut, and csvsql:
brew install csvkit
DuckDB Solutions
In the following, I give the DuckDB solution for each exercise.
If the command is part of a pipeline (e.g., when the input is read from the standard input), DuckDB is invoked as ... | duckdb -c "⟨SQL query⟩".
Otherwise, the solution is presented as a plain SQL script.
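For instance, a toy pipeline of this form (not one of the exercises, just a sketch of the pattern):
seq 3 | duckdb -c "select 2 * column0 as doubled from read_csv('/dev/stdin', header = false)"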
5.2 Transformations, Transformations Everywhere
seq 100 |\
python ../ch04/fizzbuzz.py |\
tee fb.seq | trim
grep -E "fizz|buzz" fb.seq | # <1>
sort | uniq -c | sort -nr > fb.cnt # <2>
bat -A fb.cnt
< fb.cnt awk 'BEGIN { print "value,count" } { print $2","$1 }' > fb.csv
create table fb as from read_csv('fb.seq', header = false);
create table counts as
select column0 as value, count(*) as count
from fb
where column0 similar to '^(fizz|buzz).*'
group by all
order by count desc;
copy counts to 'fb.csv';
bat fb.csv
csvlook fb.csv
rush plot -x value -y count --geom col --height 2 fb.csv > fb.png
display fb.png
Instead of rush, I used YouPlot (brew install youplot):
cat fb.csv | uplot bar -d, -H
                              count
           ┌                                        ┐
      fizz ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 27.0
      buzz ┤■■■■■■■■■■■■■■■■■■ 14.0
  fizzbuzz ┤■■■■■■■■ 6.0
           └                                        ┘
5.3 Plain Text
5.3.1 Filtering Lines
seq -f "Line %g" 10 | tee lines
< lines head -n 3
< lines sed -n '1,3p'
< lines awk 'NR <= 3'
copy (from read_csv('lines', header = false) limit 3)
to '/dev/stdout' (header false);
< lines tail -n 3
copy (
select column0
from (
select column0, row_number() over () as rownum
from read_csv('lines', header = false)
order by rownum desc
limit 3
)
order by rownum asc
) to '/dev/stdout' (header false);
(Okay, < lines tail -n 3 is a bit more concise and probably a lot faster too.)
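(Alternatively, reusing the rowid trick from the head -n -3 solution below avoids the window function entirely:)
create or replace table lines as from read_csv('lines', header = false);
copy (from lines where rowid >= (select count(*) from lines) - 3)
to '/dev/stdout' (header false);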
< lines tail -n +4
< lines sed '1,3d'
< lines sed -n '1,3!p'
copy (from read_csv('lines', header = false) offset 3)
to '/dev/stdout' (header false);
< lines head -n -3
create or replace table lines as from read_csv('lines', header = false);
copy (from lines where rowid < (select count(*) from lines) - 3)
to '/dev/stdout' (header false);
< lines sed -n '4,6p'
< lines awk '(NR>=4) && (NR<=6)'
< lines head -n 6 | tail -n 3
copy (from lines where rowid+1 between 4 and 6)
to '/dev/stdout' (header false);
< lines sed -n '1~2p'
< lines awk 'NR%2'
copy (from lines where (rowid+1) % 2 = 1)
to '/dev/stdout' (header false);
< lines sed -n '0~2p'
< lines awk '(NR+1)%2'
copy (from lines where (rowid+1) % 2 = 0)
to '/dev/stdout' (header false);
< alice.txt grep -i chapter
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -E '^CHAPTER (.*)\. The'
from read_csv('alice.txt', header = false)
where column0 similar to 'CHAPTER (.*)\. The.*';
< alice.txt grep '^CHAPTER (.*)\. The'
-- This example is only there to show that this regex does not have any results
-- with vanilla (non-extended mode) grep, so there's no DuckDB solution for this
< alice.txt grep -Ev '^\s$' | wc -l
select count(*) from read_csv('alice.txt', header = false)
where column0 not similar to '^\s$';
RANDOM=1337
seq -f "Line %g" 1000 | sample -r 1%
seq -f "Line %g" 1000 | sample -r 1% -d 1000 -s 5 | ts
# the DuckDB solution does not have a fixed random seed
seq -f "Line %g" 1000 | \
duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli)"
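(If reproducibility matters, DuckDB's sample clause accepts a seed as a second argument, so a fixed-seed variant along these lines should work:)
seq -f "Line %g" 1000 | \
duckdb -c "from read_csv('/dev/stdin', header = false) using sample 1% (bernoulli, 1337)"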
5.3.2 Extracting Values
grep -i chapter alice.txt | cut -d ' ' -f 3-
select list_reduce(column0.split(' ')[3:], (x, y) -> x || ' ' || y) as val
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
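The list_reduce trick can likely be replaced with array_to_string, which joins the list elements with a separator:
select array_to_string(column0.split(' ')[3:], ' ') as val
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';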
sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p' alice.txt | trim 3
select column0.regexp_replace('CHAPTER ([IVXLCDM]{1,})\. (.*)$', '\2') as val
from read_csv('alice.txt', header = false)
where column0.regexp_matches('CHAPTER ([IVXLCDM]{1,})\. (.*)$');
grep -i chapter alice.txt | cut -c 9-
select column0[9:] as title
from read_csv('alice.txt', header = false)
where column0 ilike '%chapter%';
< alice.txt grep -oE '\w{2,}' | trim
select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
from read_text('alice.txt');
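(As a small extension the exercise doesn't ask for, the unnested words are one group by away from a frequency table:)
select word, count(*) as count
from (
    select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
    from read_text('alice.txt')
)
group by word
order by count desc
limit 10;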
< alice.txt tr '[:upper:]' '[:lower:]' |\
grep -oE '\w{2,}' |\
grep -E '^a.*e$' |\
sort | uniq | sort -nr | trim
select distinct word
from (
select unnest(content.lower().regexp_extract_all('\w{2,}')) as word
from read_text('alice.txt')
)
where word like 'a%e'
order by word desc;
5.3.3 Replacing and Deleting Values
echo 'hello world!' | tr ' ' '_'
select replace('hello world!', ' ', '_');
echo 'hello world!' | tr ' !' '_?'
select replace('hello world!', ' ', '_').replace('!', '?');
echo 'hello world!' | tr -d ' !'
select regexp_replace('hello world!', '[ !]', '', 'g');
echo 'hello world!' | tr -d -c '[a-z]'
select regexp_replace('hello world!', '[^a-z]', '', 'g');
echo 'hello world!' | tr '[a-z]' '[A-Z]'
select upper('hello world!');
echo 'hello world!' | tr '[:lower:]' '[:upper:]'
echo 'hello world!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[a-z]' '[A-Z]'
echo 'hallo wêreld!' | tr '[:lower:]' '[:upper:]'
echo 'hallo wêreld!' | sed 's/[[:lower:]]*/\U&/g'
echo 'helló világ' | tr '[:lower:]' '[:upper:]'
echo 'helló világ' | sed 's/[[:lower:]]*/\U&/g'
select upper('helló világ');
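(DuckDB's upper() is Unicode-aware, so the accented example from above works as well:)
select upper('hallo wêreld!');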
echo ' hello world!' |\
sed -re 's/hello/bye/' |\
sed -re 's/\s+/ /g' |\
sed -re 's/\s+//'
select (' hello world!')
.replace('hello', 'bye')
.regexp_replace('\s+', ' ', 'g')
.regexp_replace('\s+', '');
sed -re 's/hello/bye/;s/\s+/ /g;s/\s+//'
-- This sed shorthand is not applicable to DuckDB
5.4 CSV
5.4.1 Bodies and Headers and Columns, Oh My!
echo -e "value\n7\n2\n5\n3" | body sort -n
echo -e "value\n7\n2\n5\n3" | \
duckdb -c "copy (from read_csv('/dev/stdin') order by value) to '/dev/stdout'"
seq 5 | header -a count
seq 5 | duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
seq 5 | header -a count | wc -l
seq 5 \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'" \
| wc -l
seq 5 | header -a count | body wc -l
seq 5 \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout' (header false)" \
| wc -l \
| duckdb -c "copy (select column0 as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"
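The wc -l round-trip mimics the book's pipeline step by step, but counting directly in a single DuckDB call gives the same output:
seq 5 | duckdb -c "copy (select count(*) as count from read_csv('/dev/stdin', header = false)) to '/dev/stdout'"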
< tips.csv header
copy (from 'tips.csv' limit 0) to '/dev/stdout';
< iris.csv header -d | trim
copy (from read_csv('iris.csv', header = true)) to '/dev/stdout' (header false);
seq 5 | header -a line | body wc -l | header -r count
seq 5 | header -a line | header -e "tr '[a-z]' '[A-Z]'"
< tips.csv cols -c day body "tr '[a-z]' '[A-Z]'" | head -n 5 | csvlook
# `csvlook` turns this into a date in 1 AD (0001), so just omit `csvlook`
# (this error is present both in the printed book and on the website)
select * replace (upper(day) as day) from 'tips.csv';
seq 5 | header -a val | csvsql --query "SELECT SUM(val) AS sum FROM stdin"
seq 5 | header -a val | duckdb -c "select sum(val) as sum from read_csv('/dev/stdin')"
5.4.3 Extracting and Reordering Columns
< iris.csv csvcut -c sepal_length,petal_length,sepal_width,petal_width | csvlook
select sepal_length,petal_length,sepal_width,petal_width from 'iris.csv';
< iris.csv csvcut -C species | csvlook
select * exclude species from 'iris.csv';
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
csvcut -c $(seq 1 2 9 | paste -sd,)
# hack for odd numbers < 10
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select #1, #3, #5, #7, #9 from read_csv('/dev/stdin')"
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | cut -d, -f 5,1,3
echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |\
duckdb -c "select #1, #3, #5 from read_csv('/dev/stdin')"
< iris.csv csvsql --query "SELECT sepal_length, petal_length, "\
"sepal_width, petal_width FROM stdin" | head -n 5 | csvlook
select sepal_length, petal_length, sepal_width, petal_width
from read_csv('iris.csv')
limit 4;
5.4.4 Filtering Rows
seq 5 | sed -n '3,5p'
seq 5 | header -a count | body sed -n '3,5p'
seq 5 | duckdb -c "from read_csv('/dev/stdin', header = true) offset 1 limit 3"
csvgrep -c size -i -r "[1-4]" tips.csv
from 'tips.csv' where not size < 5;
< tips.csv awk -F, 'NR==1 || ($1 > 40.0) && ($5 ~ /^S/)'
csvsql --query "SELECT * FROM tips WHERE bill > 40 AND day LIKE 'S%'" tips.csv
from 'tips.csv' where bill > 40 and day in ('Sat', 'Sun');
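(A more literal transliteration of the csvsql query's LIKE condition also works:)
from 'tips.csv' where bill > 40 and day like 'S%';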
5.4.5 Merging Columns
csvlook -I names.csv
< names.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' |\
csvlook -I
< names.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' |\
csvlook -I
< names.csv |\
cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born |\
csvcut -c id,full_name,born |\
csvlook -I
< names.csv csvsql --query "SELECT id, first_name || ' ' || last_name "\
"AS full_name, born FROM stdin" | csvlook -I
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names.csv');
cat names-comma.csv
< names-comma.csv sed -re '1s/.*/id,full_name,born/g;2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' | tail -n 1
< names-comma.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"} {if(NR > 1) {print $1,$3" "$2,$4}}' | tail -n 1
< names-comma.csv | cols -c first_name,last_name tr \",\" \" \" |\
header -r full_name,id,born | csvcut -c id,full_name,born | tail -n 1
< names-comma.csv csvsql --query "SELECT id, first_name || ' ' || last_name AS full_name, born FROM stdin" | tail -n 1
< names-comma.csv rush run -t 'unite(df, full_name, first_name, last_name, sep = " ")' - | tail -n 1
select id, first_name || ' ' || last_name as full_name, born
from read_csv('names-comma.csv');
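(The catch in this variant is that names-comma.csv contains a last name with an embedded comma inside a quoted field: the sed, awk, and cols solutions mangle it, while proper CSV parsers such as csvsql, rush, and DuckDB's read_csv handle it correctly.)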
5.4.6 Combining Multiple CSV Files
< tips.csv csvcut -c bill,tip | tee bills.csv | head -n 3 | csvlook
< tips.csv csvcut -c day,time | tee datetime.csv |\
head -n 3 | csvlook -I
< tips.csv csvcut -c sex,smoker,size | tee customers.csv |\
head -n 3 | csvlook
copy (select bill, tip from 'tips.csv') to 'bills.csv';
copy (select day, time from 'tips.csv') to 'datetime.csv';
copy (select sex, smoker, size from 'tips.csv') to 'customers.csv';
paste -d, {bills,customers,datetime}.csv | head -n 3 | csvlook -I
from 'bills.csv'
positional join 'customers.csv'
positional join 'datetime.csv';
csvlook irismeta.csv
csvjoin -c species iris.csv irismeta.csv | csvcut -c sepal_length,sepal_width,species,usda_id | sed -n '1p;49,54p' | csvlook
csvsql --query 'SELECT i.sepal_length, i.sepal_width, i.species, m.usda_id FROM iris i JOIN irismeta m ON (i.species = m.species)' iris.csv irismeta.csv | sed -n '1p;49,54p' | csvlook
select i.sepal_length, i.sepal_width, i.species, m.usda_id
from 'iris.csv' i
natural join 'irismeta.csv' m;
5.5 Working with XML/HTML and JSON
sed -e 's/"gender":/"sex":/g' users.json | jq | trim
copy (
select
to_json(list(struct_pack(
sex := result.gender,
name := result.name,
location := result.location,
email := result.email,
login := result.login,
dob := result.dob,
registered := result.registered,
phone := result.phone,
cell := result.cell,
id := result.id,
picture := result.picture,
nat := result.nat
))) as results,
info
from
(
select unnest(results) as result, info
from 'users.json'
)
group by info
)
to 'users2.json'
(format json, array false);
(Okay, this is not too practical. Just use sed to patch JSON fields.)
I skipped the XML/HTML exercises below: I only include the book's reference solutions, without DuckDB equivalents.
curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' > wiki.html
< wiki.html trim
grep wikitable -A 21 wiki.html
< wiki.html pup 'table.wikitable tbody' | tee table.html | trim
< table.html xml2json > table.json
jq . table.json | trim 20
< table.json jq -r '.tbody.tr[1:][] | [.td[]["$t"]] | @csv' | header -a rank,country,border,surface,ratio > countries.csv
csvlook --max-column-width 28 countries.csv
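(Even though I skipped the DuckDB side of the scraping, the resulting countries.csv is queryable as usual; a quick sketch using the header added above:)
select country, ratio
from 'countries.csv'
limit 5;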