<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Alán's blog</title><link>https://quasimorphic.com/</link><description>Recent content on Alán's blog</description><generator>Hugo</generator><language>en-uk</language><lastBuildDate>Sun, 12 Apr 2026 12:40:00 -0400</lastBuildDate><atom:link href="https://quasimorphic.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Exploring the MBTA public dataset using DuckDB</title><link>https://quasimorphic.com/archive/duckdb_mbta_explore/</link><pubDate>Sun, 12 Apr 2026 12:40:00 -0400</pubDate><guid>https://quasimorphic.com/archive/duckdb_mbta_explore/</guid><description>&lt;p>To showcase the real-life usefulness of &lt;a href="https://duckdb.org/">DuckDB&lt;/a> (and SQL-adjacent Domain Specific Languages in general) I decided to use the public &lt;a href="https://mbta-massdot.opendata.arcgis.com/">datasets&lt;/a> made available by the Massachusetts Bay Transportation Authority (MBTA). I have lived in Boston for a couple of years and wanted to test whether my intuition about the busy lines and stations lined up with their data.&lt;/p>
&lt;p>There are multiple available (tabular) datasets:&lt;/p>
&lt;ul>
&lt;li>Ridership by Trip, Route line and stop&lt;/li>
&lt;li>Monthly ridership by mode and line&lt;/li>
&lt;li>Gated station entries&lt;/li>
&lt;li>Passenger surveys&lt;/li>
&lt;/ul>
&lt;!--listend-->
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>.maxrows &lt;span style="color:#ae81ff">11&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>INSTALL httpfs;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LOAD&lt;/span> httpfs;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> monthly_ridership &lt;span style="color:#66d9ef">AS&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> read_csv(&lt;span style="color:#e6db74">&amp;#39;https://hub.arcgis.com/api/v3/datasets/a2d15ddd86b34867a31cd4b8e0a83932_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;&lt;/span>));
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">column_name&lt;/span>, column_type &lt;span style="color:#66d9ef">FROM&lt;/span> (&lt;span style="color:#66d9ef">DESCRIBE&lt;/span> monthly_ridership);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌─────────────────────────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_name │ column_type │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────────────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ service_date │ TIMESTAMP WITH TIME ZONE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ mode │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ total_monthly_weekday_ridership │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ average_monthly_weekday_ridersh │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ countofdates_weekday │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ total_monthly_ridership │ DOUBLE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ average_monthly_ridership │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ countofdates │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ObjectId │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────────────────────┴──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└────────────────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We first loaded the &lt;code>httpfs&lt;/code> extension to pull the data directly from the MBTA open data portal. The &lt;code>.maxrows 11&lt;/code> dot command at the top only limits how many rows the CLI prints. I am using a local database file, but this should also work without writing to a file thanks to the &lt;a href="https://duckdb.org/docs/stable/connect/overview">in-memory&lt;/a> database capabilities of DuckDB.&lt;/p>
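&lt;p>As a rough sketch of the in-memory route: when the CLI is started without a file argument the tables live in memory, and a file can be attached afterwards to persist them. The file name &lt;code>mbta.duckdb&lt;/code> and the alias &lt;code>mbta&lt;/code> are placeholders chosen for illustration, not the names used for this post.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Assumption: the CLI was started without a file, so monthly_ridership is in memory.
-- Attach a database file (hypothetical name) to persist results.
ATTACH &amp;#39;mbta.duckdb&amp;#39; AS mbta;

-- Copy the in-memory table into the attached file.
CREATE OR REPLACE TABLE mbta.monthly_ridership AS
SELECT * FROM monthly_ridership;

-- List the attached databases to confirm where things live.
SHOW DATABASES;
&lt;/code>&lt;/pre>&lt;/div>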
&lt;p>Then we created a new table &lt;code>monthly_ridership&lt;/code> from a subquery (a complete &lt;code>SELECT&lt;/code> statement surrounded by parentheses). This downloads the CSV and stores it as a table in the database (in memory, or in the file that backs the database).&lt;/p>
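&lt;p>Before materialising a remote file it can help to peek at it. A minimal sketch, reusing the same URL: &lt;code>read_csv&lt;/code> can be queried like any other table, and the parentheses around the &lt;code>SELECT&lt;/code> in the statement above are optional. I would expect each call to fetch the file again, so the table remains the better choice for repeated queries.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Preview a few rows of the remote CSV without creating a table.
SELECT *
FROM read_csv(&amp;#39;https://hub.arcgis.com/api/v3/datasets/a2d15ddd86b34867a31cd4b8e0a83932_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;)
LIMIT 5;

-- Count the rows the same way, again without storing anything.
SELECT COUNT(*) AS num_rows
FROM read_csv(&amp;#39;https://hub.arcgis.com/api/v3/datasets/a2d15ddd86b34867a31cd4b8e0a83932_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;);
&lt;/code>&lt;/pre>&lt;/div>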
&lt;p>Lastly, we describe the table, keeping only the parts of the description I find informative: the column names and data types. Here the ones we care about are either &lt;code>total_monthly_weekday_ridership&lt;/code> or &lt;code>average_monthly_weekday_ridership&lt;/code> (which appears truncated as &lt;code>average_monthly_weekday_ridersh&lt;/code> in the table) alongside &lt;code>route_or_line&lt;/code>.&lt;/p>
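&lt;p>The same information is also exposed through the standard &lt;code>information_schema&lt;/code> views, which makes it easy to filter on the column name. A small sketch; the &lt;code>LIKE&lt;/code> pattern is my own choice and is written to match the truncated column as well.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Alternative to DESCRIBE: query the catalog view and keep only ridership columns.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = &amp;#39;monthly_ridership&amp;#39;
  AND column_name LIKE &amp;#39;%ridersh%&amp;#39;
ORDER BY ordinal_position;
&lt;/code>&lt;/pre>&lt;/div>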
&lt;p>We will thus group by route or line to see the average ridership per route.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, &lt;span style="color:#66d9ef">CAST&lt;/span>(MEAN(total_monthly_weekday_ridership) &lt;span style="color:#66d9ef">AS&lt;/span> INTEGER) &lt;span style="color:#66d9ef">AS&lt;/span> mean_monthly_weekday_ridership &lt;span style="color:#66d9ef">FROM&lt;/span> monthly_ridership &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line &lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> mean_monthly_weekday_ridership &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬────────────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ mean_monthly_weekday_ridership │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼────────────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Bus │ 7395469 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Red Line │ 5326706 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Orange Line │ 4410355 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Green Line │ 3893135 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Commuter Rail │ 2684943 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Blue Line │ 1392658 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Silver Line │ 724622 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ The RIDE │ 138221 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F1 │ 66248 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F3 │ 23922 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F4 │ 19975 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┴────────────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 11 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We can see here all 5 lines of the metro system, in addition to commuter rail, buses, The RIDE (a door-to-door service for folks unable to ride the fixed routes) and several boat routes that cross the Boston Harbour.&lt;/p>
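&lt;p>The figures above are means of monthly totals. A per-weekday figure can be sketched from the same table, assuming that &lt;code>countofdates_weekday&lt;/code> holds the number of weekdays counted in each month (the column name suggests so, but I have not verified it against the dataset documentation).&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">-- Assumption: countofdates_weekday = number of weekdays counted in that month.
SELECT
    route_or_line,
    -- NULLIF guards against a zero day count.
    CAST(SUM(total_monthly_weekday_ridership)
         / NULLIF(SUM(countofdates_weekday), 0) AS INTEGER) AS riders_per_weekday
FROM monthly_ridership
GROUP BY route_or_line
ORDER BY riders_per_weekday DESC;
&lt;/code>&lt;/pre>&lt;/div>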
&lt;p>The gated station entries dataset gives entry counts for individual stations. We first fetch the table and print the schema.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> gated_entries &lt;span style="color:#66d9ef">AS&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> read_csv(&lt;span style="color:#e6db74">&amp;#39;https://hub.arcgis.com/api/v3/datasets/001c177f07594e7c99f193dde32284c9_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;&lt;/span>));
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">column_name&lt;/span>, column_type &lt;span style="color:#66d9ef">FROM&lt;/span> (&lt;span style="color:#66d9ef">DESCRIBE&lt;/span> gated_entries);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_name │ column_type │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ service_date │ TIMESTAMP WITH TIME ZONE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ time_period │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ stop_id │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ gated_entries │ DOUBLE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ObjectId │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────┴──────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The schema makes sense in general. We care the most about &lt;code>station_name&lt;/code>, &lt;code>route_or_line&lt;/code> and &lt;code>gated_entries&lt;/code>. Before aggregating, it is worth checking the time span the data covers.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">MIN&lt;/span>(service_date) &lt;span style="color:#66d9ef">AS&lt;/span> start_date, &lt;span style="color:#66d9ef">MAX&lt;/span>(service_date) &lt;span style="color:#66d9ef">AS&lt;/span> end_date &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌──────────────────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ start_date │ end_date │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ timestamp with time zone │ timestamp with time zone │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────────────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 2024-08-25 00:00:00-04 │ 2026-02-28 00:00:00-05 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└──────────────────────────┴──────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So the dataset covers about a year and a half, from August 2024 to February 2026. We can perform &amp;ldquo;quality control&amp;rdquo; to check if there are stations with very few records.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> days_recorded,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#f92672">*&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> num_stations,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FIRST&lt;/span>(route_or_line) &lt;span style="color:#66d9ef">AS&lt;/span> example_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, route_or_line,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name, route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────┬──────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ days_recorded │ num_stations │ example_line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ int64 │ int64 │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────┼──────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 553 │ 26 │ Red Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 552 │ 3 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 551 │ 4 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 550 │ 2 │ Blue Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 549 │ 4 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 548 │ 3 │ Blue Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 536 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 526 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 523 │ 3 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 518 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 512 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┴──────────────┴──────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 22 rows (11 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└─────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>My first impression is that the Green Line has fewer records than the others. If you have lived in Boston this should make sense: part of the Green Line runs like a tram/light rail, so I would expect the logistics of data collection to be trickier at above-ground stations.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, ROUND(&lt;span style="color:#66d9ef">AVG&lt;/span>(days_recorded),&lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> avg_days_per_station
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, station_name, &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line, station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> avg_days_per_station &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ avg_days_per_station │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ double │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Silver Line │ 551.7 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Orange Line │ 549.7 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Red Line │ 549.0 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Blue Line │ 546.3 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Green Line │ 538.3 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Mattapan Line │ 538.0 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────┴──────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Are the stations above ground the ones with fewer records?&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span> route_or_line &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;Green Line&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LIMIT&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌─────────────────┬───────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ days_recorded │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ int64 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────┼───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ 512 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ 518 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Medford/Tufts │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Gilman Square │ 526 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Copley │ 536 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boylston │ 537 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Kenmore │ 538 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Arlington │ 541 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────┴───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└─────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Indeed, all the stations with fewer than 530 days recorded are part of the &lt;a href="https://en.wikipedia.org/wiki/Green_Line_Extension">Green Line extension&lt;/a>, opened in 2022. We know that they were running at the time this dataset was collected, so it is a bit surprising that they have the most missing data (even if less than 10%). I am curious about the stations that people use the most. Let&amp;rsquo;s rank the stations by their total gated entries.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, route_or_line, &lt;span style="color:#66d9ef">CAST&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(gated_entries) &lt;span style="color:#66d9ef">AS&lt;/span> INT) &lt;span style="color:#66d9ef">AS&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name, route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> gated_entries &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────────┬───────────────┬───────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ route_or_line │ gated_entries │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┼───────────────┼───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Harvard │ Red Line │ 5359023 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Back Bay │ Orange Line │ 4772945 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Copley │ Green Line │ 4184767 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ North Station │ Orange Line │ 4099727 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Central │ Red Line │ 4094223 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Kendall/MIT │ Red Line │ 4054482 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Government Center │ Blue Line │ 258449 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ Green Line │ 186641 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ Green Line │ 174971 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ Green Line │ 154011 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ Green Line │ 92067 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┴───────────────┴───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 78 rows (11 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are two issues with this naive aggregation. First, some stations are part of multiple routes, so each of them is split across several rows. We thus should remove the &lt;code>route_or_line&lt;/code> grouping. Second, the raw sums are skewed if not all stations have records covering the same time span. Since the dataset has a &lt;code>service_date&lt;/code> column, we can normalise by the number of distinct dates each station appears in to get a fairer average daily figure.&lt;/p>
&lt;p>We thus adjust our query to make these changes: aggregate data from different lines and normalise it by the number of days recorded for each station.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> LIST(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> route_or_line) &lt;span style="color:#66d9ef">AS&lt;/span> lines,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CAST&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(gated_entries) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> INT) &lt;span style="color:#66d9ef">AS&lt;/span> avg_daily_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> avg_daily_entries &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────────┬───────────────────────────┬───────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ lines │ avg_daily_entries │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar[] │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┼───────────────────────────┼───────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ North Station │ [Orange Line, Green Line] │ 11315 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ South Station │ [Red Line, Silver Line] │ 10342 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Harvard │ [Red Line] │ 9691 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Downtown Crossing │ [Red Line, Orange Line] │ 9492 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Back Bay │ [Orange Line] │ 8774 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Park Street │ [Green Line, Red Line] │ 8017 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Medford/Tufts │ [Green Line] │ 540 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Suffolk Downs │ [Blue Line] │ 483 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ [Green Line] │ 365 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ [Green Line] │ 335 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ [Green Line] │ 297 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ [Green Line] │ 176 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┴───────────────────────────┴───────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 71 rows (12 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────────────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This better matches my intuitive impression from being in those areas. Harvard, despite being a major transit hub for buses and the T, is outranked by North and South Station once all lines are considered. Park Street jumped from 19th previously all the way up to sixth. The new Green Line stations in Somerville and Medford still rank at the bottom of daily entries. There is a stark difference between the most and least &amp;ldquo;popular&amp;rdquo; stations. For every person recording their entry at East Somerville there are about 64 at North Station. East Somerville (and all the other new Green Line stations) do not enforce checks on ticket purchases. The same can be said for stations west of Kenmore, but those are located in an area with a much denser population, since multiple universities are based there.&lt;/p>
&lt;p>Overall, I really like DuckDB for its flexibility and speed when crunching data. It is fast and simple, and it integrates well with notebook-like workflows and Command-Line Interface (CLI) usage. The translation from questions to queries to tables is seamless. I believe it is a tool worth mastering, since it works as a Swiss Army knife for everyday data processing.&lt;/p>
&lt;p>There are some caveats worth mentioning. For instance, it is unclear to me how they differentiate the Red and the Green Line entries in Park Street, since it is a two-layered station with the Green Line on top, and one can only access the Red Line from the Green Line. While it seemed negligible for these questions, the frequency and method of data acquisition differ across stations. There may be a weekday vs weekend bias that we are not accounting for. That said, I&amp;rsquo;m glad that my intuition about the usage of stations and lines matches the data.&lt;/p>
&lt;p>Meta conclusion: I also used this post to test literate programming to incrementally build a data crunching workflow. In this case I coupled it with an org-mode notebook (that is how I generate this blog) to explore a public dataset for fun. When wrapping up, I had a couple of lingering questions and a notion of the necessary query, but not enough time. I used an agent to add those in the middle of the analysis, which I evaluated via org-babel in a quick feedback loop. It worked shockingly well. Turns out this is quite similar to the recently released &lt;a href="https://github.com/marimo-team/marimo-pair">marimo-pair&lt;/a>, an extension for reproducible data analysis notebooks using agents. I want to further explore the potential of having an agent as an interface for data analysis, where I still review and check that the code fulfills its intended purpose. In the end the goal is a reproducible artifact that gives us new insights on the data we are processing, and I think this approach facilitates rapid and reproducible data exploration and analysis.&lt;/p></description></item>/<item><title>Set up email hosting and a personal website on personal domain</title><link>https://quasimorphic.com/archive/setup_email_website_personal_domain/</link><pubDate>Mon, 24 Nov 2025 19:38:00 -0500</pubDate><guid>https://quasimorphic.com/archive/setup_email_website_personal_domain/</guid><description>&lt;p>At some point I was struggling to get access to my Gmail account. Since I usually block unwanted scripts from running on my computer, Google likes to flag my Gmail login attempts as suspicious activity. This would be fine if it didn&amp;rsquo;t also make one of my only alternative identification methods an SMS. If I were to lose my phone or number I could be locked out of my account. Most accounts I have assume I have access to this email, so a good chunk of modern life depends on it, and the prospect of becoming unable to log in seemed quite realistic.
&lt;a href="https://www.migadu.com/blog/gmail/">This&lt;/a> post (by an email-hosting company) builds a case against Gmail on privacy grounds. It was finally time to get my own domain and take control of my email. I also started this blog this year, and it is pleasing to give it a nice &lt;code>.com&lt;/code> home.&lt;/p>
&lt;h2 id="yer-a-domain-harry">&amp;lsquo;Yer a domain, Harry&lt;/h2>
&lt;p>Buy a domain from a domain provider. Common options are Cloudflare, GoDaddy and Namecheap. I used Cloudflare due to mixed comments online about customer service on the others. Also, Namecheap &lt;a href="https://news.ycombinator.com/item?id=44134152">may&lt;/a> pre-purchase domains that were looked up on domain search engines. If you are paranoid like me you can check availability with &lt;a href="https://tracker.debian.org/pkg/whois">whois&lt;/a> instead (though it is being &lt;a href="https://www.icann.org/en/announcements/details/icann-update-launching-rdap-sunsetting-whois-27-01-2025-en">sunset&lt;/a>, so I&amp;rsquo;d suggest looking for an alternative). Once you find a domain name that you like and that is available, just pay for it; you can register it for up to 10 years at a time.&lt;/p>
&lt;h2 id="buy-email-hosting-services-and-link-them-to-the-domain">Buy email hosting services and link them to the domain&lt;/h2>
&lt;p>After some research (mostly Hacker News and Reddit comments) I narrowed down the options to either &lt;a href="https://migadu.com/">Migadu&lt;/a> or &lt;a href="https://mxroute.com/">MXroute&lt;/a>. &lt;code>Migadu&lt;/code> had a free trial, so I gave it a shot, but in the end I went for &lt;code>MXroute&lt;/code> due to a promo they had at the time.&lt;/p>
&lt;p>Make an email account. On Migadu it should be straightforward. In the case of &lt;code>MXroute&lt;/code>, you have to pay at this point. The URL for management should be &lt;code>&amp;lt;SERVER&amp;gt;.mxrouting.net:2222&lt;/code>, where &lt;code>&amp;lt;SERVER&amp;gt;&lt;/code> is in the confirmation email. Once you have access to your email hosting server you can link it to your domain.&lt;/p>
&lt;h3 id="link-email-service-to-domain">Link email service to domain&lt;/h3>
&lt;p>The first step once you have an account is to add DNS (Domain Name System) records to the domain. These are instructions stored on DNS servers that describe how to handle requests for the domain, mostly by linking names to IP addresses.&lt;/p>
&lt;h4 id="migadu">Migadu&lt;/h4>
&lt;ul>
&lt;li>Export the BIND records into a file and download it&lt;/li>
&lt;li>Import this file into Cloudflare&amp;rsquo;s DNS records&lt;/li>
&lt;li>If the &lt;code>DKIM&lt;/code> and &lt;code>ARC&lt;/code> keys fail, make sure to switch off the &lt;code>proxy&lt;/code> toggle on their records&lt;/li>
&lt;/ul>
&lt;h4 id="mxroute">MXroute&lt;/h4>
&lt;p>Mostly follow the &lt;a href="https://gist.github.com/afermg/69df5d99ddc7e1201b471aa6eb564e51">instructions&lt;/a> sent upon registration. The process is basically the same as for Migadu, but without the import/export convenience. For instance, one of the MX records (with different fields separated by commas) is &lt;code>MX, &amp;lt;domain.com&amp;gt;, &amp;lt;server&amp;gt;.mxrouting.net&lt;/code>. The ones I had to copy over from the email were:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>TXT records: &lt;code>spf1&lt;/code> and &lt;code>DKIM1&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>MX records: &lt;code>&amp;lt;SERVER&amp;gt;.mxrouting.net&lt;/code> and &lt;code>&amp;lt;SERVER&amp;gt;-relay.mxrouting.net&lt;/code>&lt;/p>
&lt;p>Copy them to Cloudflare, specifically under DNS records:&lt;/p>
&lt;/li>
&lt;/ul>
&lt;!--listend-->
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Cloudflare&amp;#39;s domain home -&amp;gt; DNS Records -&amp;gt; Add record
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="test-the-setup">Test the setup&lt;/h3>
&lt;ul>
&lt;li>Log in to your account on &lt;code>webmail.migadu.com&lt;/code> or &lt;code>&amp;lt;SERVER&amp;gt;.mxrouting.net/roundcube&lt;/code>&lt;/li>
&lt;li>Create an account for your first user&lt;/li>
&lt;li>Send an email to yourself to validate that it works&lt;/li>
&lt;/ul>
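&lt;p>Beyond sending a test email, it can help to confirm that the new records are visible from outside. The sketch below is an assumption rather than a step from either provider: it relies on the &lt;code>dig&lt;/code> tool being installed, and &lt;code>check_mail_dns&lt;/code> is a hypothetical helper name. It prints the MX and TXT records currently published for a domain, so they can be compared against the values from the registration email. The DKIM key will not show up in this listing, because it lives under a selector subdomain whose name is given in the provider&amp;rsquo;s instructions.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span># Hypothetical helper (not part of MXroute or Migadu): list the mail-related
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span># DNS records that resolvers currently return for a domain.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>check_mail_dns() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    local domain="$1" rtype
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    for rtype in MX TXT; do
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        echo "== ${rtype} records for ${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        dig +short "${rtype}" "${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    done
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Call it as &lt;code>check_mail_dns&lt;/code> followed by your domain. An empty section probably means the record has not been saved or has not propagated yet.&lt;/p>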
&lt;h3 id="optional-set-subdomain-for-server-access">Optional: Set subdomain for server access&lt;/h3>
&lt;p>To access my email from my phone&amp;rsquo;s Thunderbird App I had to set a subdomain. Only &lt;code>MXroute&lt;/code> required this.&lt;/p>
&lt;ul>
&lt;li>From the control panel provided by &lt;code>MXroute&lt;/code>, copy the verification key
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>&amp;lt;SERVER&amp;gt;.mxrouting.net -&amp;gt; (sidebar) Account manager -&amp;gt; DNS record
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>and add it as a TXT record on Cloudflare&lt;/li>
&lt;li>Add &lt;code>mail.&amp;lt;domain.com&amp;gt;&lt;/code> and/or &lt;code>webmail.&amp;lt;domain.com&amp;gt;&lt;/code> subdomain(s) on Cloudflare&lt;/li>
&lt;/ul>
&lt;h2 id="link-github-pages-website-to-domain">Link GitHub Pages website to domain&lt;/h2>
&lt;p>If we have one, we can also link our website to the domain. Some companies like to use &lt;code>blog.&amp;lt;domain.com&amp;gt;&lt;/code> for blogs and keep &lt;code>&amp;lt;domain.com&amp;gt;&lt;/code> for their landing page, but since I am almost certainly a person I skipped the subdomain approach (and used the so-called apex domain). In my case I am using GitHub to host my website, so I followed their &lt;a href="https://docs.github.com/en/pages/configuring-a-custom-domain-for-your-github-pages-site/managing-a-custom-domain-for-your-github-pages-site">instructions&lt;/a>:&lt;/p>
&lt;ol>
&lt;li>Go to DNS records
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Cloudflare&amp;#39;s domain home -&amp;gt; DNS Records -&amp;gt; Add record
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>Add &lt;code>A&lt;/code> records to Cloudflare (e.g., &lt;code>A,&amp;lt;domain.com&amp;gt;,&amp;lt;185.ipv4.github.address&amp;gt;&lt;/code>)&lt;/li>
&lt;li>Add a &lt;code>CNAME,www,&amp;lt;user&amp;gt;.github.io&lt;/code> record. This step wasn&amp;rsquo;t specified in the GitHub docs but I found it to be necessary for some reason.&lt;/li>
&lt;li>Add Cloudflare rules to redirect HTTP to HTTPS
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>Cloudflare&amp;#39;s domain home -&amp;gt; Cloudflare rules -&amp;gt; Templates -&amp;gt; Redirect http to HTTPS -&amp;gt; Deploy
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>Add the custom domain to GitHub Pages:
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>GitHub repo -&amp;gt; settings -&amp;gt; pages on side panel -&amp;gt; Custom domain
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;/li>
&lt;li>Also on GitHub, tick the &lt;code>Enforce HTTPS&lt;/code> checkbox. If the rest was properly configured it should work without a hitch.&lt;/li>
&lt;/ol>
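&lt;p>To see what the outside world gets once these steps are done, a small check like the one below can help. It is a sketch under assumptions: &lt;code>check_site_dns&lt;/code> is a hypothetical helper, and it needs &lt;code>dig&lt;/code> and &lt;code>curl&lt;/code> installed. It prints the addresses the apex domain and the &lt;code>www&lt;/code> subdomain resolve to, followed by the response headers over plain HTTP, where a redirect status and a &lt;code>Location&lt;/code> header starting with &lt;code>https&lt;/code> would suggest the redirect rule is active. With the Cloudflare proxy switched on, the addresses shown are likely to be Cloudflare&amp;rsquo;s own rather than GitHub&amp;rsquo;s.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span># Hypothetical helper: show how the website currently resolves and responds.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>check_site_dns() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    local domain="$1"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    echo "== A records for ${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    dig +short A "${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    echo "== www.${domain} resolves to"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    dig +short "www.${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    echo "== Response headers over plain HTTP"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    curl -sI "http://${domain}"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>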
&lt;h2 id="conclusions">Conclusions&lt;/h2>
&lt;p>I like that I have control over my email, and I could even give my family and friends their own email addresses (if any of them actually wants one). The first time I did all this it took me around three hours, mostly because GitHub could not find my DNS records. Since I have my domain now I can probably do more fun stuff, such as self-hosting tools to share with friends and family.&lt;/p></description></item>/<item><title>Use dired-do-shell to explore the parquet schema from Emacs</title><link>https://quasimorphic.com/archive/emacs_dired_do_shell_duckdb/</link><pubDate>Thu, 23 Oct 2025 15:21:00 -0400</pubDate><guid>https://quasimorphic.com/archive/emacs_dired_do_shell_duckdb/</guid><description>&lt;p>I use the &lt;code>dired-do-shell-command&lt;/code> command in Emacs to run CLI commands from within its file manager &lt;a href="https://www.gnu.org/software/emacs/manual/html_node/emacs/Dired.html">dired&lt;/a>. This workflow makes it easy to perform batch operations on files that would be annoying otherwise. The trouble arose when trying to use the &lt;code>duckdb&lt;/code> CLI to print the schema of a parquet file, as the notation for wildcards in Emacs (&lt;code>*&lt;/code> and &lt;code>?&lt;/code>) conflicts with duckdb&amp;rsquo;s usage of the former. Thus running the following after &lt;code>M-x dired-do-shell-command&lt;/code> (bound to &lt;code>!&lt;/code> in &lt;code>dired-mode&lt;/code>) did not work:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>duckdb -c &lt;span style="color:#e6db74">&amp;#34;DESCRIBE * FROM read_parquet(&amp;#39;?&amp;#39;)&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>I am trying to get Emacs to replace &lt;code>?&lt;/code> with the selected file(s) in dired and retain &lt;code>*&lt;/code> as a normal duckdb wildcard (for which it means &amp;ldquo;get all columns&amp;rdquo;), but Emacs interprets the star as a wildcard and not the question mark. Thus Emacs yapped about it:&lt;/p>
&lt;blockquote>
&lt;p>dired-do-shell-command: You can not combine ‘*’ and ‘?’ substitution marks&lt;/p>&lt;/blockquote>
&lt;p>The trick is to surround &lt;code>?&lt;/code> with backquotes (&lt;code>`&lt;/code>). A bare &lt;code>?&lt;/code> is only substituted when it has whitespace around it, and those spaces would end up in the file name that &lt;code>read_parquet&lt;/code> ingests. To fix the &amp;ldquo;naked star&amp;rdquo; problem (which Emacs will try to substitute) we can wrap it in the &lt;code>COLUMNS()&lt;/code> expression: the lack of spaces around &lt;code>*&lt;/code> prevents Emacs from substituting its value. Thus the following command in the Emacs command line does the trick:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>duckdb -c &lt;span style="color:#e6db74">&amp;#34;DESCRIBE COLUMNS(*) FROM read_parquet(&amp;#39;`?`&amp;#39;)&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is a bit verbose, but I can easily reuse it from my command history to explore the schemas of parquet files without leaving the dired buffer.&lt;/p></description></item>/<item><title>Calculate the cumulative sum of a column using DuckDB</title><link>https://quasimorphic.com/archive/duckdb_cumsum/</link><pubDate>Wed, 22 Oct 2025 20:34:00 -0400</pubDate><guid>https://quasimorphic.com/archive/duckdb_cumsum/</guid><description>&lt;p>DuckDB, the (tabular) data exploration tool I use, supports window operations. I recently discovered that it can also perform cumulative sums in a very efficient manner.&lt;/p>
&lt;p>Let us generate a toy dataset where we want to calculate the sum of one column relative to the order of another one.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e">-- seeding for reproducibility, creating a table to hide output
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> seed &lt;span style="color:#66d9ef">AS&lt;/span> &lt;span style="color:#66d9ef">SELECT&lt;/span> SETSEED(&lt;span style="color:#ae81ff">0&lt;/span>.&lt;span style="color:#ae81ff">1&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- Create a mock dataset with two integer columns
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> my_table &lt;span style="color:#66d9ef">AS&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">#&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#66d9ef">AS&lt;/span> column_1,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CAST&lt;/span>(FLOOR(RANDOM() &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> INT) &lt;span style="color:#66d9ef">AS&lt;/span> column_2
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> generate_series(&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">10&lt;/span>); &lt;span style="color:#75715e">-- This generates 10 rows
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> my_table;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">-- We write it to a csv for future use
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">COPY&lt;/span> my_table &lt;span style="color:#66d9ef">TO&lt;/span> my_table.csv;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌──────────┬──────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_1 │ column_2 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ int64 │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────┼──────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 1 │ 27 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 2 │ 45 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 3 │ 2 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 4 │ 84 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 5 │ 84 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 6 │ 26 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 7 │ 18 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 8 │ 65 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 9 │ 97 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 │ 11 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────┴──────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└─────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To calculate the cumulative sum we can use the &lt;code>OVER&lt;/code> clause, which sums &lt;code>column_2&lt;/code> in the order defined by &lt;code>column_1&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span>, &lt;span style="color:#66d9ef">sum&lt;/span>(column_2) OVER (&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">by&lt;/span> column_1) &lt;span style="color:#66d9ef">AS&lt;/span> cumulative_sum &lt;span style="color:#66d9ef">FROM&lt;/span> my_table
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌──────────┬──────────┬────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_1 │ column_2 │ cumulative_sum │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ int64 │ int32 │ int128 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────┼──────────┼────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 1 │ 27 │ 27 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 2 │ 45 │ 72 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 3 │ 2 │ 74 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 4 │ 84 │ 158 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 5 │ 84 │ 242 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 6 │ 26 │ 268 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 7 │ 18 │ 286 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 8 │ 65 │ 351 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 9 │ 97 │ 448 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 │ 11 │ 459 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────┴──────────┴────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└──────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The cumulative sum can be pretty handy to get a general notion of a distribution. As a bonus tip, I&amp;rsquo;ll show how to use duckdb in a one-liner to plot the data
directly in a terminal by using &lt;a href="https://gnuplotting.org/">gnuplot&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>duckdb -csv -c &lt;span style="color:#e6db74">&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> SELECT *, sum(column_2) OVER (ORDER by column_1) AS cumulative_sum
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> FROM read_csv(&amp;#39;my_table.csv&amp;#39;);&amp;#34;&lt;/span> |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> gnuplot -e &lt;span style="color:#e6db74">&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set terminal dumb;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set datafile separator &amp;#39;,&amp;#39;;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set style data histograms;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set style fill solid 1.00 border -1;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set xlabel &amp;#39;Column 1&amp;#39;;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set ylabel &amp;#39;CSum&amp;#39;;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> set title &amp;#39;Cumulative Sum of values&amp;#39;;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> plot &amp;#39;-&amp;#39; using 3:xtic(1);&amp;#34;&lt;/span> |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tr -d &lt;span style="color:#e6db74">&amp;#39;\014&amp;#39;&lt;/span> &lt;span style="color:#75715e"># Remove a pesky ^L at the top&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Cumulative Sum of values
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 500 +-----------------------------------------------------------------+
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | + + + + + + + + + ++ |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 450 |-+ &amp;#39;-&amp;#39; using 3:xtic+-+ +-||--+-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 400 |-+ |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 350 |-+ ++ |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | || |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 300 |-+ +-+ || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 250 |-+ ++ |#| || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CSum | +-+ || |#| || |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 200 |-+ |#| || |#| || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | |#| || |#| || |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 150 |-+ ++ |#| || |#| || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | || |#| || |#| || |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 100 |-+ +-+ || |#| || |#| || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 50 |-+ ++ |#| || |#| || |#| || |#| || +-|
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> | +-+ || |#| || |#| || |#| || |#| || |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 0 +-----------------------------------------------------------------+
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 1 2 3 4 5 6 7 8 9 10
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Column 1
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We get a cute ASCII plot! Since that is a bit long for a &amp;ldquo;one-liner&amp;rdquo;, I&amp;rsquo;ll go through the commands:&lt;/p>
&lt;ul>
&lt;li>Run a &lt;code>duckdb&lt;/code> command (&lt;code>-c&lt;/code>) that reads the previously saved table. The &lt;code>-csv&lt;/code> flag at the start converts the output to CSV.&lt;/li>
&lt;li>Run gnuplot with certain specifications:
&lt;ul>
&lt;li>The &lt;code>-e&lt;/code> flag allows passing a series of commands without an interactive session.&lt;/li>
&lt;li>&lt;code>set terminal dumb&lt;/code>: Sends the plot as plain text to stdout.&lt;/li>
&lt;li>&lt;code>set datafile separator &amp;quot;,&amp;quot;&lt;/code>: The input is a CSV file.&lt;/li>
&lt;li>&lt;code>set style data histograms&lt;/code>: Changes the plotting style into a barplot.&lt;/li>
&lt;li>&lt;code>set style fill solid ...&lt;/code>: Visual adjustments to the bars for clarity.&lt;/li>
&lt;li>&lt;code>set xlabel ...&lt;/code>: Adds the x-axis label. Similar for &lt;code>ylabel&lt;/code> and &lt;code>title&lt;/code>.&lt;/li>
&lt;li>&lt;code>plot '-' using 3:xtic(1)&lt;/code>: Reads the data from stdin and plots column 3 (&lt;code>cumulative_sum&lt;/code>) on the y-axis, using the first column (&lt;code>column_1&lt;/code>) as the x-axis tick labels.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lastly, use the &lt;code>tr&lt;/code> command line tool to remove a &lt;code>^L&lt;/code> that appeared at the start of the output and was bothering me too much.&lt;/li>
&lt;/ul>
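&lt;p>As a sanity check on the &lt;code>cumulative_sum&lt;/code> column, the same running total can be reproduced without DuckDB. This is only a sketch: &lt;code>cumsum_csv&lt;/code> is a hypothetical helper, and it assumes that &lt;code>my_table.csv&lt;/code> starts with a header line, keeps the values in its second column, and is already sorted by the first column.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span># Hypothetical helper: running total of the second CSV column.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span># The first line is treated as a header and skipped.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cumsum_csv() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    awk -F, 'NR == 1 { next } { total += $2; print $1 "," $2 "," total }' "$1"
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Running &lt;code>cumsum_csv my_table.csv&lt;/code> should print one line per row with the running total as the third field, which can be compared against the table produced by the window query.&lt;/p>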
&lt;p>While there are many other ways to wrangle tables, such as via pandas or polars in Python, I find duckdb to be a powerful tool for exploratory analyses and data wrangling (often from within Python). It is flexible enough to be used by itself, via bindings in another language, or directly on the command line. Lastly, I showed that when used as a Command Line Interface (CLI) duckdb synergises with other tools for data visualisation from the comfort(?) of the terminal.&lt;/p></description></item>/<item><title>Run multiple python scripts in the background</title><link>https://quasimorphic.com/archive/screen-batch-model-deployment/</link><pubDate>Tue, 26 Aug 2025 14:29:00 -0400</pubDate><guid>https://quasimorphic.com/archive/screen-batch-model-deployment/</guid><description>&lt;p>To solve a multitude of challenges I have faced when processing high throughput microscopy data, I have developed &lt;a href="https://github.com/afermg/nahual">Nahual&lt;/a>, a tool that allows me to move data across multiple Python environments that deploy deep learning models in the background. I usually keep these models &amp;ldquo;listening&amp;rdquo; in the background for the main analysis pipeline (&lt;a href="https://github.com/afermg/aliby">aliby&lt;/a>) to send them data to process. To be able to monitor what&amp;rsquo;s going on inside of these scripts I use &lt;a href="https://www.gnu.org/software/screen/">GNU screen&lt;/a>, which allows me to detach and reattach into these sessions whenever I need to. At some point I had to reboot my server and had to rerun all of these in independent screens. This rudimentary shell script did the job:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>cd cellpose
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>screen -d -S cellpose1 -m bash -c &lt;span style="color:#e6db74">&amp;#39;nix develop . --command bash -c &amp;#34;python server.py ipc:///tmp/cellpose1.ipc&amp;#34;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>screen -d -S cellpose2 -m bash -c &lt;span style="color:#e6db74">&amp;#39;nix develop . --command bash -c &amp;#34;python server.py ipc:///tmp/cellpose2.ipc&amp;#34;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd ../trackastra
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>screen -d -S trackastra -m bash -c &lt;span style="color:#e6db74">&amp;#39;nix develop . --command bash -c &amp;#34;python server.py ipc:///tmp/trackastra.ipc&amp;#34;&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd ..
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Basically, each screen session enters my Nix environment and deploys the model (in this case, my &lt;a href="https://github.com/afermg/nahual">fork&lt;/a> of cellpose with Nix dependency management) while detached. This executes a &lt;code>server.py&lt;/code> file within the Nix environment, which runs in a loop waiting to receive data and process it. Automatically deploying to multiple screens spares me from repeating the usual steps (go to the folder -&amp;gt; run screen -&amp;gt; enter the Nix environment -&amp;gt; run the Python server -&amp;gt; detach the screen session). I just add more models if I want further deployments, put it all in a bash script and call it a day.&lt;/p>
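&lt;p>The repetition in the script above can be folded into a loop. The following is a sketch, not the script I actually run: &lt;code>deploy_models&lt;/code> and the &lt;code>DRY_RUN&lt;/code> variable are hypothetical names, and each argument is assumed to be a &lt;code>folder:session&lt;/code> pair where the session name is reused for the socket file, as in the original commands.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span># Hypothetical helper: start one detached screen session per folder:session pair.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span># With DRY_RUN set to a non-empty value the screen command is printed, not run.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>deploy_models() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    local pair dir name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    for pair in "$@"; do
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        dir="${pair%%:*}"    # folder that holds the Nix flake and server.py
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        name="${pair##*:}"   # screen session name, reused for the socket file
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            cd "$dir" || exit 1
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            ${DRY_RUN:+echo} screen -d -S "$name" -m bash -c \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>                "nix develop . --command bash -c \"python server.py ipc:///tmp/$name.ipc\""
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    done
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Calling &lt;code>deploy_models cellpose:cellpose1 cellpose:cellpose2 trackastra:trackastra&lt;/code> from the parent folder should reproduce the three sessions, and prefixing the call with &lt;code>DRY_RUN=1&lt;/code> prints the commands for inspection first. The subshell keeps each &lt;code>cd&lt;/code> local, so there is no need to step back out of the folder afterwards.&lt;/p>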
&lt;p>To access any of these screens for inspection I just use the name indicated after the &lt;code>-S&lt;/code> flag (e.g., &lt;code>screen -r cellpose1&lt;/code>). This way I can check if any issue crops up in the main analysis script or pipeline.&lt;/p></description></item>/<item><title>Simple progress indicators with awk</title><link>https://quasimorphic.com/archive/awk-simple-progress-indicator/</link><pubDate>Tue, 19 Aug 2025 18:58:00 -0400</pubDate><guid>https://quasimorphic.com/archive/awk-simple-progress-indicator/</guid><description>&lt;p>I wanted a simple way to see the progress of a data processing pipeline, and the internal progress bar tools were messed up by threading. I thus decided to use the number of output files in each folder as an indicator of progress. In my case the output of &lt;code>tree .&lt;/code> looks like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── steps
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ├── A01_001
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   ├── segment_nuclei
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── 0000.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── 0001.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── ...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   └── 0019.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   ├── tile
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── 0000.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── 0001.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> │   │   ├── ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>I can get the info I need by counting the total number of files and the occurrences of the &lt;code>A01_001&lt;/code> -&amp;gt; &lt;code>P24_005&lt;/code> range (these are fields of view from a microscopy experiment). Using this simple &lt;code>find&lt;/code> command we get all the files in the current folder.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>find . -type f
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>which results in this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>./steps:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./steps/A01_003/tile/0007.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./steps/A01_003/tile/0009.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./steps/A01_003/tile/0018.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./steps/A01_003/tile/0016.npz
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We could use &lt;code>wc -l&lt;/code> to get the number of files per directory, but we want a bunch of progress bars to get a better sense of change over time. For this I use &lt;code>awk&lt;/code>, my Swiss-army knife for text processing, and I write a short script that counts, &lt;a href="https://stackoverflow.com/a/2458455">sorts&lt;/a> and &lt;a href="https://stackoverflow.com/a/68371463">prints&lt;/a> the number of occurrences as a number of dots. I also added a conditional to only track a folder after more than one file has been produced, for pipelines that save one file before actually running in full.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-awk" data-lang="awk">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># progress_bar.awk&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#66d9ef">match&lt;/span>(&lt;span style="color:#f92672">$&lt;/span>&lt;span style="color:#ae81ff">0&lt;/span>,&lt;span style="color:#e6db74">&amp;#34;([A-P][0-9]{2}_[0-9]{3})&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">capture&lt;/span>)){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">count&lt;/span>[&lt;span style="color:#a6e22e">capture&lt;/span>[&lt;span style="color:#ae81ff">1&lt;/span>]] &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>END{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">n&lt;/span>&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">asorti&lt;/span>(&lt;span style="color:#a6e22e">count&lt;/span>, &lt;span style="color:#a6e22e">sorted&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> (&lt;span style="color:#a6e22e">i&lt;/span>&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>; &lt;span style="color:#a6e22e">i&lt;/span>&lt;span style="color:#f92672">&amp;lt;=&lt;/span>&lt;span style="color:#a6e22e">n&lt;/span>; &lt;span style="color:#a6e22e">i&lt;/span>&lt;span style="color:#f92672">++&lt;/span>){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">nfiles&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#a6e22e">count&lt;/span>[&lt;span style="color:#a6e22e">sorted&lt;/span>[&lt;span style="color:#a6e22e">i&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#a6e22e">nfiles&lt;/span> &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">s&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">sprintf&lt;/span>(&lt;span style="color:#a6e22e">key&lt;/span> &lt;span style="color:#e6db74">&amp;#34;%*s&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">nfiles&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">gsub&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;.&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;.&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">s&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">print&lt;/span> &lt;span style="color:#a6e22e">sorted&lt;/span>[&lt;span style="color:#a6e22e">i&lt;/span>] &lt;span style="color:#e6db74">&amp;#34; &amp;#34;&lt;/span> &lt;span style="color:#a6e22e">s&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Running the &lt;code>find&lt;/code> command and the &lt;code>awk&lt;/code> script (&lt;code>find . -type f | awk -f progress_bar.awk&lt;/code>) yields the following snapshot of the processing progress:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>A01_001 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A01_002 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A01_003 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A01_004 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A01_005 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A02_001 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A02_002 ...............................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A02_003 .................................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A02_004 ..........................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A02_005 ........................................
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A03_001 ..............................................
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Thus the last thing to do is to use &lt;code>watch&lt;/code> to automatically refresh the status:&lt;/p>
&lt;p>&lt;code>watch -dc --interval 1 'find . -type f | awk -f progress_bar.awk | tac'&lt;/code>&lt;/p>
&lt;p>The &lt;code>watch&lt;/code> flag &lt;code>-d&lt;/code> highlights the changes between refreshes and &lt;code>-c&lt;/code> enables interpreting ANSI colours; in my terminal this makes the highlighted changes stay visible longer, but YMMV. Finally, &lt;code>tac&lt;/code> makes sure that the last lines are displayed at the top. I like to run this command in another terminal or in a &lt;code>screen&lt;/code> terminal multiplexer session. When the number of rows becomes too high it may be useful to find a heuristic to remove uninformative lines.&lt;/p></description></item>/<item><title>Update figure numbering</title><link>https://quasimorphic.com/archive/awk-update-figure-numbering/</link><pubDate>Thu, 14 Aug 2025 15:42:00 -0400</pubDate><guid>https://quasimorphic.com/archive/awk-update-figure-numbering/</guid><description>&lt;p>I was editing some markdown and had to insert a new figure in the middle. The problem is that this document already has explicit figure numbering (e.g., &amp;ldquo;Figure 5&amp;rdquo;), so changing tens of figures felt dull. I like to run small (GNU) &lt;code>awk&lt;/code> scripts for this type of task.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-awk" data-lang="awk">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># update_figures.awk&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#66d9ef">match&lt;/span>(&lt;span style="color:#f92672">$&lt;/span>&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;Figure ([0-9]+)&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">num&lt;/span>)){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#a6e22e">num&lt;/span>[&lt;span style="color:#ae81ff">1&lt;/span>] &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#a6e22e">after&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">gsub&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Figure ([0-9]+)&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;Figure &amp;#34;&lt;/span> &lt;span style="color:#a6e22e">num&lt;/span>[&lt;span style="color:#ae81ff">1&lt;/span>] &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#a6e22e">increase_by&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> };
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">print&lt;/span> &lt;span style="color:#f92672">$&lt;/span>&lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This changes Figure &lt;code>X&lt;/code> into Figure &lt;code>X&lt;/code> + &lt;code>increase_by&lt;/code> for every figure whose number is greater than the variable &lt;code>after&lt;/code>. We can run it as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>awk -v after&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4&lt;/span> -v increase_by&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span> -f update_figures.awk input_file.md
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To edit the file in place with GNU awk (4.1 or later), add &lt;code>-i inplace&lt;/code>.&lt;/p></description></item>/<item><title>Recursive search and replace</title><link>https://quasimorphic.com/archive/recursive-search-replace/</link><pubDate>Tue, 12 Aug 2025 13:07:00 -0400</pubDate><guid>https://quasimorphic.com/archive/recursive-search-replace/</guid><description>&lt;p>I needed to replace all occurrences of one pattern with another, in a case where I knew there were no ambiguous matches. This uses &lt;code>ripgrep&lt;/code>, &lt;code>xargs&lt;/code> and &lt;code>GNU sed&lt;/code>. &lt;a href="https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#search-and-replace">source&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>rg old_pattern --files-with-matches | xargs sed -i &lt;span style="color:#e6db74">&amp;#39;s/old_pattern/new_pattern/g&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item>/<item><title>A workflow for bioimaging and data exploration</title><link>https://quasimorphic.com/archive/marimo-for-bioimage-exploration/</link><pubDate>Wed, 30 Jul 2025 13:05:00 -0400</pubDate><guid>https://quasimorphic.com/archive/marimo-for-bioimage-exploration/</guid><description>&lt;p>One of the common challenges when analysing large bioimaging datasets is bringing it all together in one place. I usually use tools like &lt;a href="https://duckdb.org/">DuckDB&lt;/a> for database querying and &lt;a href="https://github.com/cytomining/copairs">copairs&lt;/a> for selecting statistically significant subsets of the data. For one of my recent projects I built a &lt;a href="https://github.com/marimo-team/marimo">marimo&lt;/a> interface that explores the results of large-scale image-based profiling (~2TB images, ~2GB feature profiles), performs dimensionality reduction of the data, and finally retrieves the corresponding images. This, I think, is the ideal workflow: one where you can be nimble and pull up the images alongside statistical analyses to interpret the data structure in its biological context. The code is not yet available to the public, but you can find the demo &lt;a href="https://drive.google.com/file/d/1t2ygATiJ2r0GPkeEwdw6FqHZxoOwQmzW/view">here&lt;/a>.&lt;/p></description></item>/<item><title>Github code review on existing code base</title><link>https://quasimorphic.com/archive/github-review-existing-code/</link><pubDate>Tue, 26 Nov 2024 13:06:00 -0500</pubDate><guid>https://quasimorphic.com/archive/github-review-existing-code/</guid><description>&lt;p>Create an empty branch with one empty commit&lt;/p>
&lt;ol>
&lt;li>Create new branch &lt;code>git checkout --orphan review-1-target&lt;/code>&lt;/li>
&lt;li>Reset &lt;code>git reset .&lt;/code>&lt;/li>
&lt;li>Clean branch &lt;code>git clean -df&lt;/code>&lt;/li>
&lt;li>Add empty commit &lt;code>git commit --allow-empty -m 'Empty commit'&lt;/code>&lt;/li>
&lt;/ol>
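&lt;p>To try these four steps safely, they can be replayed in a throwaway repository. This is a rough sketch, not part of the original recipe: the temporary directory, the &lt;code>script.py&lt;/code> file, the &lt;code>demo&lt;/code> identity and the &lt;code>make_empty_branch&lt;/code> helper are all made up for illustration.&lt;/p>

```shell
#!/usr/bin/env bash
# Sketch: replay the empty-branch steps in a scratch repository.
# The temp dir, script.py, the demo identity and make_empty_branch
# are hypothetical and only exist for this illustration.
set -eu

# Throwaway identity so commits work without any global git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

make_empty_branch() {
    local branch="$1"
    git checkout -q --orphan "$branch"   # new branch with no history
    git reset -q .                        # unstage the files carried over
    git clean -q -df                      # delete the now-untracked files
    git commit -q --allow-empty -m 'Empty commit'
}

# Scratch repository with a single commit standing in for the existing code
repo="$(mktemp -d)"
cd "$repo"
git init -q
echo 'print("hello")' > script.py
git add script.py
git commit -q -m 'Existing code'

make_empty_branch review-1-target

# The new branch should hold exactly one commit and no tracked files
git log --oneline
git ls-files
```

&lt;p>If everything went well, &lt;code>git log&lt;/code> lists only the empty commit and &lt;code>git ls-files&lt;/code> prints nothing.&lt;/p>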
&lt;p>Rebase a branch to put this commit at the root&lt;/p>
&lt;ol>
&lt;li>Push to your fork &lt;code>git push -u origin review-1-target&lt;/code>&lt;/li>
&lt;li>Move to branch to review &lt;code>git checkout origin/main&lt;/code>&lt;/li>
&lt;li>Spin-off branch from here &lt;code>git checkout -b review-1&lt;/code>&lt;/li>
&lt;li>Rebase to empty branch &lt;code>git rebase -i review-1-target&lt;/code>, the empty commit must be at the start&lt;/li>
&lt;li>Push &lt;code>git push -u origin review-1&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>That should make a pull request possible, providing the code review tooling.
&lt;a href="https://thib.me/recipe-code-reviews-for-existing-code-with-github">source&lt;/a>&lt;/p></description></item>/</channel></rss>