<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>opendata on Alán's blog</title><link>https://quasimorphic.com/tags/opendata/</link><description>Recent content in opendata on Alán's blog</description><generator>Hugo</generator><language>en-uk</language><lastBuildDate>Sun, 12 Apr 2026 12:40:00 -0400</lastBuildDate><atom:link href="https://quasimorphic.com/tags/opendata/index.xml" rel="self" type="application/rss+xml"/><item><title>Exploring the MBTA public dataset using DuckDB</title><link>https://quasimorphic.com/archive/duckdb_mbta_explore/</link><pubDate>Sun, 12 Apr 2026 12:40:00 -0400</pubDate><guid>https://quasimorphic.com/archive/duckdb_mbta_explore/</guid><description>&lt;p>To showcase the real-life usefulness of &lt;a href="https://duckdb.org/">DuckDB&lt;/a> (and SQL-adjacent domain-specific languages in general), I decided to use the public &lt;a href="https://mbta-massdot.opendata.arcgis.com/">datasets&lt;/a> made available by the Massachusetts Bay Transportation Authority (MBTA). I have lived in Boston for a couple of years and wanted to test whether my intuition about the busy lines and stations lined up with their data.&lt;/p>
&lt;p>There are multiple available (tabular) datasets:&lt;/p>
&lt;ul>
&lt;li>Ridership by trip, route/line and stop&lt;/li>
&lt;li>Monthly ridership by mode and line&lt;/li>
&lt;li>Gated station entries&lt;/li>
&lt;li>Passenger surveys&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>.maxrows &lt;span style="color:#ae81ff">11&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>INSTALL httpfs;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LOAD&lt;/span> httpfs;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> monthly_ridership &lt;span style="color:#66d9ef">AS&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> read_csv(&lt;span style="color:#e6db74">&amp;#39;https://hub.arcgis.com/api/v3/datasets/a2d15ddd86b34867a31cd4b8e0a83932_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;&lt;/span>));
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">column_name&lt;/span>, column_type &lt;span style="color:#66d9ef">FROM&lt;/span> (&lt;span style="color:#66d9ef">DESCRIBE&lt;/span> monthly_ridership);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌─────────────────────────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_name │ column_type │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────────────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ service_date │ TIMESTAMP WITH TIME ZONE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ mode │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ total_monthly_weekday_ridership │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ average_monthly_weekday_ridersh │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ countofdates_weekday │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ total_monthly_ridership │ DOUBLE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ average_monthly_ridership │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ countofdates │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ObjectId │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────────────────────┴──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└────────────────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We first loaded the &lt;code>httpfs&lt;/code> extension so that DuckDB can pull the data directly from their website. I am using a local database file, but this should also work without writing to disk thanks to the &lt;a href="https://duckdb.org/docs/stable/connect/overview">in-memory&lt;/a> database capabilities of DuckDB.&lt;/p>
&lt;p>Then we created a new table &lt;code>monthly_ridership&lt;/code> from a subquery (a complete &lt;code>SELECT&lt;/code> statement wrapped in parentheses). This downloads the CSV and stores it as a table in the database (in memory, or in the file that backs the database).&lt;/p>
&lt;p>Lastly, we described the table, keeping only the parts of the description I find informative: the column names and their data types. The columns we care about here are &lt;code>total_monthly_weekday_ridership&lt;/code> or &lt;code>average_monthly_weekday_ridersh&lt;/code> (the name is truncated in the table itself, as the schema above shows), alongside &lt;code>route_or_line&lt;/code>.&lt;/p>
&lt;p>We will thus group by &lt;code>route_or_line&lt;/code> and average the monthly weekday totals for each route.&lt;/p>
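&lt;p>As a side note, the same column listing can probably be obtained from the standard &lt;code>information_schema&lt;/code> views, which is handy when you want to copy exact column names into later queries. This is a minimal sketch, assuming the table was created as &lt;code>monthly_ridership&lt;/code> in the current database.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>-- Sketch: list column names and types through information_schema,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>-- in the order in which they were defined.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SELECT column_name, data_type
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>FROM information_schema.columns
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>WHERE table_name = &amp;#39;monthly_ridership&amp;#39;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ORDER BY ordinal_position;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>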
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, &lt;span style="color:#66d9ef">CAST&lt;/span>(MEAN(total_monthly_weekday_ridership) &lt;span style="color:#66d9ef">AS&lt;/span> INTEGER) &lt;span style="color:#66d9ef">AS&lt;/span> mean_monthly_weekday_ridership &lt;span style="color:#66d9ef">FROM&lt;/span> monthly_ridership &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line &lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> mean_monthly_weekday_ridership &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬────────────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ mean_monthly_weekday_ridership │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼────────────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Bus │ 7395469 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Red Line │ 5326706 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Orange Line │ 4410355 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Green Line │ 3893135 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Commuter Rail │ 2684943 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Blue Line │ 1392658 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Silver Line │ 724622 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ The RIDE │ 138221 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F1 │ 66248 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F3 │ 23922 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boat-F4 │ 19975 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┴────────────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 11 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We can see here all 5 lines of the metro system, in addition to commuter rail, buses, The RIDE (a door-to-door service for folks unable to ride the fixed routes) and several boat routes that cross the Boston Harbour.&lt;/p>
&lt;p>The gated station entries dataset counts entries per station, line and time period. We first fetch the table and print the schema.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">OR&lt;/span> &lt;span style="color:#66d9ef">REPLACE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> gated_entries &lt;span style="color:#66d9ef">AS&lt;/span> (&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#66d9ef">FROM&lt;/span> read_csv(&lt;span style="color:#e6db74">&amp;#39;https://hub.arcgis.com/api/v3/datasets/001c177f07594e7c99f193dde32284c9_0/downloads/data?format=csv&amp;amp;spatialRefId=4326&amp;amp;where=1%3D1&amp;#39;&lt;/span>));
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">column_name&lt;/span>, column_type &lt;span style="color:#66d9ef">FROM&lt;/span> (&lt;span style="color:#66d9ef">DESCRIBE&lt;/span> gated_entries);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ column_name │ column_type │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ service_date │ TIMESTAMP WITH TIME ZONE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ time_period │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ stop_id │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ VARCHAR │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ gated_entries │ DOUBLE │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ObjectId │ BIGINT │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────┴──────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The schema makes sense overall; we care most about &lt;code>station_name&lt;/code>, &lt;code>route_or_line&lt;/code> and &lt;code>gated_entries&lt;/code>. Before aggregating, it is worth checking the time span the data covers.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> &lt;span style="color:#66d9ef">MIN&lt;/span>(service_date) &lt;span style="color:#66d9ef">AS&lt;/span> start_date, &lt;span style="color:#66d9ef">MAX&lt;/span>(service_date) &lt;span style="color:#66d9ef">AS&lt;/span> end_date &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌──────────────────────────┬──────────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ start_date │ end_date │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ timestamp with time zone │ timestamp with time zone │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├──────────────────────────┼──────────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 2024-08-25 00:00:00-04 │ 2026-02-28 00:00:00-05 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└──────────────────────────┴──────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So the dataset covers about a year and a half, from August 2024 to February 2026. We can perform &amp;ldquo;quality control&amp;rdquo; to check if there are stations with very few records.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> days_recorded,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#f92672">*&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> num_stations,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FIRST&lt;/span>(route_or_line) &lt;span style="color:#66d9ef">AS&lt;/span> example_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, route_or_line,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name, route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────┬──────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ days_recorded │ num_stations │ example_line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ int64 │ int64 │ varchar │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────┼──────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 553 │ 26 │ Red Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 552 │ 3 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 551 │ 4 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 550 │ 2 │ Blue Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 549 │ 4 │ Orange Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 548 │ 3 │ Blue Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 536 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 526 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 523 │ 3 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 518 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 512 │ 1 │ Green Line │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┴──────────────┴──────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 22 rows (11 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└─────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>My first impression is that the Green Line has fewer records than the other lines. If you have lived in Boston this should make sense: part of the Green Line runs like a tram/light rail, so I would expect the logistics of data collection to be trickier at above-ground stations.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, ROUND(&lt;span style="color:#66d9ef">AVG&lt;/span>(days_recorded),&lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#66d9ef">AS&lt;/span> avg_days_per_station
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">SELECT&lt;/span> route_or_line, station_name, &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line, station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> avg_days_per_station &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────┬──────────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ route_or_line │ avg_days_per_station │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ double │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────┼──────────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Silver Line │ 551.7 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Orange Line │ 549.7 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Red Line │ 549.0 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Blue Line │ 546.3 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Green Line │ 538.3 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Mattapan Line │ 538.0 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────┴──────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Are the stations above ground the ones with fewer records?&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">WHERE&lt;/span> route_or_line &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;Green Line&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> days_recorded
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">LIMIT&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌─────────────────┬───────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ days_recorded │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ int64 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────┼───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ 512 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ 518 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Medford/Tufts │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ 523 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Gilman Square │ 526 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Copley │ 536 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Boylston │ 537 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Kenmore │ 538 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Arlington │ 541 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├─────────────────┴───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 10 rows 2 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└─────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Indeed, all the stations with fewer than 530 days recorded are part of the &lt;a href="https://en.wikipedia.org/wiki/Green_Line_Extension">Green Line extension&lt;/a>, opened in 2022. We know that they were running while this dataset was collected, so it is a bit surprising that they have the most missing data (even if less than 10%). I am also curious about the stations that people use the most. Let&amp;rsquo;s rank the stations by their total gated entries.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span> station_name, route_or_line, &lt;span style="color:#66d9ef">CAST&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(gated_entries) &lt;span style="color:#66d9ef">AS&lt;/span> INT) &lt;span style="color:#66d9ef">AS&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name, route_or_line
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> gated_entries &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────────┬───────────────┬───────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ route_or_line │ gated_entries │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┼───────────────┼───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Harvard │ Red Line │ 5359023 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Back Bay │ Orange Line │ 4772945 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Copley │ Green Line │ 4184767 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ North Station │ Orange Line │ 4099727 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Central │ Red Line │ 4094223 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Kendall/MIT │ Red Line │ 4054482 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Government Center │ Blue Line │ 258449 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ Green Line │ 186641 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ Green Line │ 174971 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ Green Line │ 154011 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ Green Line │ 92067 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┴───────────────┴───────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 78 rows (11 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are two issues with this naive aggregation. First, some stations are part of multiple routes, so we should remove the &lt;code>route_or_line&lt;/code> grouping. Second, the raw sums are skewed if not all stations have records covering the same time span. Since the dataset has a &lt;code>service_date&lt;/code> column, we can normalise by the number of distinct dates each station appears in to get a fairer average daily figure.&lt;/p>
&lt;p>We thus adjust our query to make these changes: aggregate the data across lines and normalise it by the number of days recorded for each station.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">SELECT&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> LIST(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> route_or_line) &lt;span style="color:#66d9ef">AS&lt;/span> lines,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">CAST&lt;/span>(&lt;span style="color:#66d9ef">SUM&lt;/span>(gated_entries) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#66d9ef">COUNT&lt;/span>(&lt;span style="color:#66d9ef">DISTINCT&lt;/span> service_date) &lt;span style="color:#66d9ef">AS&lt;/span> INT) &lt;span style="color:#66d9ef">AS&lt;/span> avg_daily_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">FROM&lt;/span> gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">GROUP&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">ORDER&lt;/span> &lt;span style="color:#66d9ef">BY&lt;/span> avg_daily_entries &lt;span style="color:#66d9ef">DESC&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-text" data-lang="text">&lt;span style="display:flex;">&lt;span>┌───────────────────┬───────────────────────────┬───────────────────┐
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ station_name │ lines │ avg_daily_entries │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ varchar │ varchar[] │ int32 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┼───────────────────────────┼───────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ North Station │ [Orange Line, Green Line] │ 11315 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ South Station │ [Red Line, Silver Line] │ 10342 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Harvard │ [Red Line] │ 9691 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Downtown Crossing │ [Red Line, Orange Line] │ 9492 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Back Bay │ [Orange Line] │ 8774 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Park Street │ [Green Line, Red Line] │ 8017 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ · │ · │ · │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Medford/Tufts │ [Green Line] │ 540 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Suffolk Downs │ [Blue Line] │ 483 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Union Square │ [Green Line] │ 365 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Ball Square │ [Green Line] │ 335 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ Magoun Square │ [Green Line] │ 297 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ East Somerville │ [Green Line] │ 176 │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├───────────────────┴───────────────────────────┴───────────────────┤
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ 71 rows (12 shown) 3 columns │
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└───────────────────────────────────────────────────────────────────┘
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This better matches my impression from being in those areas. Harvard, despite being a major transit hub for buses and the T, is outranked by North Station and South Station once all lines are considered. Park Street jumped from 19th place all the way up to sixth. The new Green Line stations in Somerville and Medford still rank at the bottom for daily entries. There is a stark difference between the most and least &amp;ldquo;popular&amp;rdquo; stations: for every person recording their entry at East Somerville there are about 64 at North Station (11315 / 176). East Somerville (like the other Green Line Extension stations) does not enforce checks on ticket purchases. The same can be said for stations west of Kenmore, but those are located in a much more densely populated area, since multiple universities are based there.&lt;/p>
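&lt;p>Rather than reading ranks and ratios off the truncated output, both can be computed directly. The following is a minimal sketch that reuses the normalised query from above; the names &lt;code>per_station&lt;/code>, &lt;code>daily_avg&lt;/code>, &lt;code>station_rank&lt;/code> and &lt;code>busiest_to_this_ratio&lt;/code> are made up for illustration.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>-- Sketch: rank stations by average daily entries and compare each
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>-- one against the busiest station. All aliases are hypothetical.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>WITH per_station AS (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    SELECT
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        SUM(gated_entries) / COUNT(DISTINCT service_date) AS daily_avg
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    FROM gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    GROUP BY station_name
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SELECT
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    CAST(daily_avg AS INT) AS avg_daily_entries,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    -- Position of the station when sorted from busiest to quietest.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    RANK() OVER (ORDER BY daily_avg DESC) AS station_rank,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    -- How many entries the busiest station records per entry here.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    ROUND(MAX(daily_avg) OVER () / daily_avg, 1) AS busiest_to_this_ratio
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>FROM per_station
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ORDER BY station_rank;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The ratio is computed on the unrounded averages, so it may differ slightly from a figure derived by hand from the rounded table.&lt;/p>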
&lt;p>Overall, I really like DuckDB for its flexibility and speed in data crunching. It is fast and simple, and it integrates well with notebook-like workflows and command-line interface (CLI) usage. The translation from questions to queries to tables is seamless. I believe it is a tool worth mastering, since it provides a Swiss Army knife for everyday data processing.&lt;/p>
&lt;p>There are some caveats worth mentioning. For instance, it is unclear to me how they differentiate the Red Line and Green Line entries at Park Street, since it is a two-level station with the Green Line on top, and one can only reach the Red Line through the Green Line level. While the effect seemed negligible for these questions, the frequency and the data acquisition differ across stations. There may also be a weekday vs weekend bias that we are not accounting for. That said, I&amp;rsquo;m glad that my intuition about the usage of stations and lines matches the data.&lt;/p>
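&lt;p>The weekday vs weekend question could be probed with a query along these lines. It is only a sketch: the names &lt;code>daily&lt;/code>, &lt;code>service_day&lt;/code>, &lt;code>entries&lt;/code> and &lt;code>day_type&lt;/code> are made up for illustration, and casting &lt;code>service_date&lt;/code> to a date relies on the session time zone, which I assume matches the Eastern offsets shown in the output above.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>-- Sketch: average daily gated entries per station, split by day type.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>-- Assumes the session time zone matches the one the data was recorded in.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>WITH daily AS (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    -- One row per station and calendar day, summing all lines and time periods.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    SELECT
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        CAST(service_date AS DATE) AS service_day,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        SUM(gated_entries) AS entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    FROM gated_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    GROUP BY station_name, service_day
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>SELECT
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    station_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    -- isodow numbers Monday as 1 and Sunday as 7.
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    CASE WHEN isodow(service_day) IN (6, 7) THEN &amp;#39;weekend&amp;#39; ELSE &amp;#39;weekday&amp;#39; END AS day_type,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    CAST(AVG(entries) AS INT) AS avg_daily_entries
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>FROM daily
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GROUP BY station_name, day_type
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ORDER BY station_name, day_type;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If the weekday and weekend averages diverge strongly for some stations, the single average used above would hide that difference.&lt;/p>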
&lt;p>Meta conclusion: I also used this post to test literate programming as a way to incrementally build a data-crunching workflow. In this case I coupled it with an org-mode notebook (that is how I generate this blog) to explore a public dataset for fun. When wrapping up, I had a couple of lingering questions and a notion of the necessary queries, but not enough time. I used an agent to add those in the middle of the analysis, and evaluated its additions via org-babel in a quick feedback loop. It worked shockingly well. It turns out this is quite similar to the recently released &lt;a href="https://github.com/marimo-team/marimo-pair">marimo-pair&lt;/a>, an extension for reproducible data analysis notebooks using agents. I want to further explore the potential of having an agent as an interface for data analysis, where I still review and check that the code fulfils its intended purpose. In the end the goal is a reproducible artifact that gives us new insights into the data we are processing, and I think this approach facilitates rapid and reproducible data exploration and analysis.&lt;/p></description></item></channel></rss>