MITLibraries / timdex-dataset-api / 15619230762
95%
main: 93%

Build:
LAST BUILD BRANCH: USE-306-handle-missing-metadata-or-embeddings
DEFAULT BRANCH: main
Ran 12 Jun 2025 07:31PM UTC
Jobs 1
Files 7
Run time 186min
12 Jun 2025 07:22PM UTC coverage: 95.143% (-0.4%) from 95.541%
15619230762

push

github

ghukill
Yield current records by utilizing dataset fragments

Why these changes are being introduced:

A bug was discovered where yielding current records could throw
an error when the unfiltered and filtered batches were combined
via zip(..., strict=True).  It was learned that despite reading the
same parquet files, the number of batches *can* differ if filtering is
applied.  As such, we cannot just zip together the two batch iterators.

How this addresses that need:

What we can do is "backfill" seen timdex_record_ids every time we get
to the end of a parquet file, and we can safely identify parquet files
via a dataset fragment,
https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset8FragmentE.

The new approach is as follows:
- yield filtered records, noting the fragment index
- if we encounter a new fragment, we pause and backfill
  timdex_record_ids from unfiltered batches that are part of the previous fragment
- we continue yielding records from the next fragment, confident that every
  record in the previous fragment (the more recent versions) has been marked
  as seen, even if it was filtered out
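
The steps above can be sketched in plain Python.  This is not the code
from this commit: fragments are modeled as in-memory lists ordered
newest first, and the "action" field, the keep predicate, and the
function name are assumptions made for illustration only.

```python
from collections.abc import Callable, Iterator

Record = dict  # hypothetical shape: {"timdex_record_id": str, "action": str}


def yield_current_records(
    fragments: list[list[Record]],
    keep: Callable[[Record], bool],
) -> Iterator[Record]:
    """Yield the newest version of each record that passes `keep`.

    Assumes `fragments` holds one list per parquet file, newest first.
    """
    seen: set[str] = set()
    next_to_backfill = 0  # index of the oldest fragment not yet backfilled

    # Stand-in for the filtered read: (fragment index, record) pairs.
    filtered = (
        (index, record)
        for index, fragment in enumerate(fragments)
        for record in fragment
        if keep(record)
    )

    for index, record in filtered:
        # On reaching a new fragment, backfill ids from the UNFILTERED
        # contents of every fragment we have moved past.
        while next_to_backfill < index:
            seen.update(r["timdex_record_id"] for r in fragments[next_to_backfill])
            next_to_backfill += 1

        record_id = record["timdex_record_id"]
        if record_id not in seen:
            seen.add(record_id)
            yield record


# Usage: the newest version of alma:1 is a delete and is filtered out, so the
# older "index" version in the second fragment must not be yielded either.
fragments = [
    [
        {"timdex_record_id": "alma:1", "action": "delete"},
        {"timdex_record_id": "alma:2", "action": "index"},
    ],
    [
        {"timdex_record_id": "alma:1", "action": "index"},
        {"timdex_record_id": "alma:3", "action": "index"},
    ],
]
current_ids = [
    r["timdex_record_id"]
    for r in yield_current_records(fragments, lambda r: r["action"] == "index")
]
# current_ids == ["alma:2", "alma:3"]
```

Backfilling from the unfiltered contents is what keeps the stale
alma:1 version out of the results, and it does not depend on the
filtered and unfiltered reads producing the same number of batches.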

Side effects of this change:
* None, other than bug fixes for large current record pulls such as Alma

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-504

41 of 45 new or added lines in 3 files covered. (91.11%)

333 of 350 relevant lines covered (95.14%)

0.95 hits per line

New Missed Lines in Diff

Lines | Coverage | ∆ | File
4 | 96.8 | -1.11% | timdex_dataset_api/dataset.py
Jobs
ID | Job ID | Ran | Files | Coverage
1 | 15619230762.1 | 12 Jun 2025 07:31PM UTC | 7 | 95.14
  • Github Actions Build #15619230762
  • 12e78322 on github
  • Prev Build on main (#15424451546)
  • Next Build on TIMX-504-dataset-fragments-vs-batches (#15619734785)