
MITLibraries / timdex-dataset-api / 15144600754
Coverage: 95% (main: 93%)

Build:
LAST BUILD BRANCH: USE-306-handle-missing-metadata-or-embeddings
DEFAULT BRANCH: main
Ran 20 May 2025 06:03PM UTC
Jobs: 1
Files: 6
Run time: 1min

20 May 2025 05:55PM UTC coverage: 94.565% (-1.1%) from 95.652%

Build 15144600754 (push, via github)
Committer: ghukill
Commit f584a1dc: TIMDEXRunManager for producing ETL run metadata

Why these changes are being introduced:

One of the challenges presented by the architecture of the TIMDEX parquet dataset is getting
quick and easy metadata about ETL "runs" in the dataset.  The year/month/day partitioning
structure is very efficient for accessing a run if you know the date, since only a few
parquet files need to be scanned, but it is not geared towards quickly isolating the runs (parquet files)
associated with a given source.
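To make the pruning point concrete, here is a minimal, hypothetical pyarrow sketch (the dataset path,
hive-style partition layout, and the column/source names are assumptions for illustration, not the
library's actual layout): a date filter can be answered by touching only the matching partition
directories, while a source filter has to consider every parquet file.

```python
# Illustration only: path, partition flavor, and column/source names are assumed.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "dataset/",           # hypothetical dataset root
    format="parquet",
    partitioning="hive",  # e.g. .../year=2025/month=5/day=20/<file>.parquet
)

# Partition-column filter: pyarrow prunes down to the few files under that date.
by_date = dataset.to_table(
    filter=(ds.field("year") == 2025) & (ds.field("month") == 5) & (ds.field("day") == 20)
)

# Non-partition-column filter: no pruning is possible, so every parquet file
# in the dataset is a candidate and must be opened and scanned.
by_source = dataset.to_table(filter=ds.field("source") == "alma")
```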

Having metadata about runs provides a map to efficiently access meaningful subsets of data.
One example would be fully refreshing a source in OpenSearch.  To do so, you'd want to access
all runs for a given source since, and including, the last run_type=full run.  Those runs
represent the current state of the source in TIMDEX.
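As a sketch of that selection logic (the column names source, run_type, and run_timestamp are
assumed for the example and are not confirmed names from this library), given a dataframe of run
metadata it might look like:

```python
# Sketch only: column names are assumed, not the library's actual schema.
import pandas as pd

def runs_for_full_refresh(runs_df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Return all runs for `source` since, and including, its last run_type=full run."""
    source_runs = runs_df[runs_df["source"] == source].sort_values("run_timestamp")
    full_runs = source_runs[source_runs["run_type"] == "full"]
    if full_runs.empty:
        return source_runs  # no full run yet; every run for the source is relevant
    last_full_ts = full_runs["run_timestamp"].max()
    return source_runs[source_runs["run_timestamp"] >= last_full_ts]
```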

Unfortunately, this is not terribly efficient to perform naively with pyarrow or DuckDB, where
potentially thousands of parquet files are touched.  Similar to how Apache Iceberg (a table format
built on parquet files) works, we need some metadata about each "run" in the dataset that
correlates it to its parquet file(s).
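For context, the naive query this paragraph warns against might look like the following DuckDB-in-Python
sketch (the dataset path, glob, and source value are hypothetical): it answers the question, but only by
opening every parquet file under the dataset root.

```python
# Naive illustration: DuckDB has to touch every parquet file in the dataset
# to find the runs for one source. Path and source value are assumed.
import duckdb

runs = duckdb.sql(
    """
    SELECT DISTINCT run_id, run_type
    FROM read_parquet('dataset/**/*.parquet', hive_partitioning = true)
    WHERE source = 'alma'
    """
).df()
```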

How this addresses that need:

A new class, TIMDEXRunManager, provides this functionality.  This class produces a pandas
dataframe of metadata about all runs in the dataset, including the explicit parquet
filepath each run is associated with, in a highly efficient and parallelized way.

This is achieved by the following steps, sketched in code after the list:
1. Getting a list of all parquet files from the dataset.
2. Reading the *first* row from each file, which contains metadata about the run that
produced the file.
3. Aggregating the results and grouping by "run_id".
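
A minimal sketch of that list/read/aggregate approach, assuming locally readable file paths and
assumed metadata column names (source, run_type, run_timestamp); this is not the actual
TIMDEXRunManager implementation:

```python
# Sketch of the approach described above; not the library's real implementation.
# Metadata column names are assumed, except run_id and run_type, which the
# commit message mentions.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def collect_run_metadata(dataset_path: str) -> pd.DataFrame:
    # 1. Get a list of all parquet files in the dataset.
    files = ds.dataset(dataset_path, format="parquet").files

    # 2. Read only the *first* row of each file, which carries the run metadata.
    def first_row(path: str) -> dict:
        batch = next(pq.ParquetFile(path).iter_batches(batch_size=1))
        row = batch.to_pylist()[0]
        row["filepath"] = path
        return row

    with ThreadPoolExecutor(max_workers=32) as pool:
        rows = list(pool.map(first_row, files))

    # 3. Aggregate the results and group by run_id, collecting each run's file(s).
    df = pd.DataFrame(rows)
    return df.groupby("run_id", as_index=False).agg(
        source=("source", "first"),
        run_type=("run_type", "first"),
        run_timestamp=("run_timestamp", "first"),
        files=("filepath", list),
    )
```

In this sketch the cost scales with the number of parquet files rather than the number of records,
which is what makes a per-file "first row" read practical for building Iceberg-style run metadata.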

The result is a dataframe that provides a precise map of run metadata to parquet files in
the dataset.  With those parquet files identified, further functionality for this library is
unblocked, like "replaying" the runs for a given source in chronological order
to refresh it in OpenSearch.

Side effects of this change:
* None.  No change... (continued)

63 of 69 new or added lines in 1 file covered. (91.3%)

261 of 276 relevant lines covered (94.57%)

0.95 hits per line

New Missed Lines in Diff

Lines | Coverage ∆ | File
6 | 91.3 | timdex_dataset_api/run.py

Jobs

ID | Job ID | Ran | Files | Coverage
1 | 15144600754.1 | 20 May 2025 06:03PM UTC | 6 | 94.57