
MITLibraries / timdex-dataset-api / 15144600754
Coverage: 95% (main: 93%)

Build:
LAST BUILD BRANCH: USE-306-handle-missing-metadata-or-embeddings
DEFAULT BRANCH: main
Ran 20 May 2025 06:03PM UTC
Jobs: 1
Files: 6
Run time: 1min

20 May 2025 05:55PM UTC coverage: 94.565% (-1.1%) from 95.652%

Build 15144600754 (push, via github)
Committer: ghukill
Commit f584a1dc: TIMDEXRunManager for producing ETL run metadata

Why these changes are being introduced:

One of the challenges presented by the architecture of the TIMDEX parquet dataset is getting
quick and easy metadata about ETL "runs" in the dataset.  The year/month/day partitioning
structure is very efficient for accessing a run if you know the date, since only a few
parquet files need to be scanned, but it is not geared towards quickly isolating the runs (parquet files)
associated with a given source.
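To make the pruning point concrete, here is a minimal, hypothetical pyarrow sketch (the dataset path,
hive-style partition layout, and the column/source names are assumptions for illustration, not the
library's actual layout): a date filter can be answered by touching only the matching partition
directories, while a source filter has to consider every parquet file.

```python
# Illustration only: path, partition flavor, and column/source names are assumed.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "dataset/",           # hypothetical dataset root
    format="parquet",
    partitioning="hive",  # e.g. .../year=2025/month=5/day=20/<file>.parquet
)

# Partition-column filter: pyarrow prunes down to the few files under that date.
by_date = dataset.to_table(
    filter=(ds.field("year") == 2025) & (ds.field("month") == 5) & (ds.field("day") == 20)
)

# Non-partition-column filter: no pruning is possible, so every parquet file
# in the dataset is a candidate and must be opened and scanned.
by_source = dataset.to_table(filter=ds.field("source") == "alma")
```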

Having metadata about runs provides a map to efficiently access meaningful subsets of data.
One example would be fully refreshing a source in OpenSearch.  To do so, you'd want to access
all runs for a given source since, and including, the last run_type=full run.  Those runs
represent the current state of the source in TIMDEX.
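As a sketch of that selection logic (the column names source, run_type, and run_timestamp are
assumed for the example and are not confirmed names from this library), given a dataframe of run
metadata it might look like:

```python
# Sketch only: column names are assumed, not the library's actual schema.
import pandas as pd

def runs_for_full_refresh(runs_df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return all runs for `source` since, and including, its last run_type=full run."""
    source_runs = runs_df[runs_df["source"] == source].sort_values("run_timestamp")
    full_runs = source_runs[source_runs["run_type"] == "full"]
    if full_runs.empty:
        return source_runs  # no full run yet; every run for the source is relevant
    last_full_ts = full_runs["run_timestamp"].max()
    return source_runs[source_runs["run_timestamp"] >= last_full_ts]
```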

Unfortunately, this is not terribly efficient to perform naively with pyarrow or DuckDB, where
potentially thousands of parquet files are touched.  Similar to how Apache Iceberg (a table format
built on parquet files) works, we need some metadata about each "run" in the dataset that
correlates it to its parquet file(s).
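For context, the naive query this paragraph warns against might look like the following DuckDB-in-Python
sketch (the dataset path, glob, and source value are hypothetical): it answers the question, but only by
opening every parquet file under the dataset root.

```python
# Naive illustration: DuckDB has to touch every parquet file in the dataset
# to find the runs for one source. Path and source value are assumed.
import duckdb

runs = duckdb.sql(
    """
    SELECT DISTINCT run_id, run_type
    FROM read_parquet('dataset/**/*.parquet', hive_partitioning = true)
    WHERE source = 'alma'
    """
).df()
```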

How this addresses that need:

A new class, TIMDEXRunManager, provides this functionality.  This class produces a pandas
dataframe of metadata about all runs in the dataset, including the explicit parquet
filepath each run is associated with, in a highly efficient and parallelized way.

This is achieved by the following steps, sketched in code after the list:
1. Getting a list of all parquet files from the dataset.
2. Reading the *first* row from each file, which contains metadata about the run that
produced the file.
3. Aggregating the results and grouping by "run_id".
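
A minimal sketch of that list/read/aggregate approach, assuming locally readable file paths and
assumed metadata column names (source, run_type, run_timestamp); this is not the actual
TIMDEXRunManager implementation:

```python
# Sketch of the approach described above; not the library's real implementation.
# Metadata column names are assumed, except run_id and run_type, which the
# commit message mentions.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def collect_run_metadata(dataset_path: str) -> pd.DataFrame:
    # 1. Get a list of all parquet files in the dataset.
    files = ds.dataset(dataset_path, format="parquet").files

    # 2. Read only the *first* row of each file, which carries the run metadata.
    def first_row(path: str) -> dict:
        batch = next(pq.ParquetFile(path).iter_batches(batch_size=1))
        row = batch.to_pylist()[0]
        row["filepath"] = path
        return row

    with ThreadPoolExecutor(max_workers=32) as pool:
        rows = list(pool.map(first_row, files))

    # 3. Aggregate the results and group by run_id, collecting each run's file(s).
    df = pd.DataFrame(rows)
    return df.groupby("run_id", as_index=False).agg(
        source=("source", "first"),
        run_type=("run_type", "first"),
        run_timestamp=("run_timestamp", "first"),
        files=("filepath", list),
    )
```

In this sketch the cost scales with the number of parquet files rather than the number of records,
which is what makes a per-file "first row" read practical for building Iceberg-style run metadata.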

The result is a dataframe that provides a precise map of run metadata to parquet files in
the dataset.  With those parquet files identified, further functionality for this library is
unblocked, like "replaying" the runs for a given source in chronological order
to refresh it in OpenSearch.

Side effects of this change:
* None.  No change... (continued)

63 of 69 new or added lines in 1 file covered. (91.3%)

261 of 276 relevant lines covered (94.57%)

0.95 hits per line

New Missed Lines in Diff

Lines | Coverage ∆ | File
6 | 91.3 | timdex_dataset_api/run.py

Jobs

ID | Job ID | Ran | Files | Coverage
1 | 15144600754.1 | 20 May 2025 06:03PM UTC | 6 | 94.57