MITLibraries / timdex-dataset-api / 12794648113
97%
main: 93%

Build:
LAST BUILD BRANCH: USE-306-handle-missing-metadata-or-embeddings
DEFAULT BRANCH: main
Ran 15 Jan 2025 06:26PM UTC
Jobs 1
Files 5
Run time 1min
15 Jan 2025 06:17PM UTC coverage: 97.423% (+0.03%) from 97.396%
12794648113

push

github

ghukill
Add run_record_offset column to dataset

Why these changes are being introduced:

Bulk reading and writing from the TIMDEX dataset is a primary responsibility,
but occasional random access (e.g. locating a single record row) will also be
helpful, for instance when inspecting the original source record behind a
problematic record.

Each TIMDEX JSON record in OpenSearch will contain a "provenance" object that will
include things like run_date, run_id, and now run_record_offset.  This offset
allows for quicker (time) and more efficient (data read) retrieval of a single
record given information in the TIMDEX provenance object.

How this addresses that need:

Parquet files have metadata embedded that describe what values can be found
in subsets of the file, but this is only helpful when the min/max values
in that metadata can inform query engines if a desired record may be
present.  Unfortunately, the timdex_record_id is a) not lexicographically
sortable (at least not easily), and b) not ordered during write.

By adding this offset, effectively an incrementing counter as records are
yielded for writing, we have a value that is pre-sorted and provides nice
ranges in the parquet file metadata.  Query engines can utilize this to
dramatically improve random access reads.  By including this offset integer
in the TIMDEX record "provenance" section we close the loop and provide
enough information in the OpenSearch record to efficiently retrieve it
from the parquet dataset.
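
The reasoning above can be sketched in plain Python, without any parquet
library. This is a simplified model and not the project's implementation: the
RowGroup class, the helper functions, and the "source:NNNNN" record ids are
hypothetical stand-ins, and real query engines apply the same min/max test to
the statistics stored in parquet row group metadata. It compares how many row
groups must be scanned when looking up one record by its unordered
timdex_record_id versus by its run_record_offset.

```python
import random
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Stand-in for one parquet row group: its rows plus min/max stats."""

    rows: list

    def stats(self, column):
        values = [row[column] for row in self.rows]
        return min(values), max(values)


def write_row_groups(record_ids, rows_per_group):
    """Attach an incrementing offset as records are yielded, then chunk them."""
    rows = [
        {"timdex_record_id": record_id, "run_record_offset": offset}
        for offset, record_id in enumerate(record_ids)
    ]
    return [
        RowGroup(rows[i : i + rows_per_group])
        for i in range(0, len(rows), rows_per_group)
    ]


def candidate_groups(row_groups, column, value):
    """Keep only row groups whose min/max range could contain the value."""
    kept = []
    for group in row_groups:
        low, high = group.stats(column)
        if low <= value <= high:
            kept.append(group)
    return kept


# Hypothetical record ids, shuffled to mimic records that are unordered at write.
rng = random.Random(0)
record_ids = [f"source:{n:05d}" for n in range(1000)]
rng.shuffle(record_ids)

row_groups = write_row_groups(record_ids, rows_per_group=100)
target = row_groups[4].rows[42]  # the record we want to find again

by_id = candidate_groups(
    row_groups, "timdex_record_id", target["timdex_record_id"]
)
by_offset = candidate_groups(
    row_groups, "run_record_offset", target["run_record_offset"]
)

print(f"scan by timdex_record_id:  {len(by_id)} of {len(row_groups)} row groups")
print(f"scan by run_record_offset: {len(by_offset)} of {len(row_groups)} row groups")
```

Because the offset ranges of the row groups never overlap, the offset lookup
narrows to exactly one row group. The shuffled ids give each row group a wide,
overlapping min/max range, so the id lookup typically cannot rule out most of
them.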

Side effects of this change:
* Dataset will now include a new column 'run_record_offset'

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-465

5 of 5 new or added lines in 3 files covered. (100.0%)

189 of 194 relevant lines covered (97.42%)

0.97 hits per line

Jobs
ID  Job ID         Ran                      Files  Coverage
1   12794648113.1  15 Jan 2025 06:26PM UTC  5      97.42