MITLibraries / timdex-dataset-api / 15169783189
95%
main: 94%

LAST BUILD BRANCH: USE-142-dataset-embedding-imports
DEFAULT BRANCH: main
Ran 21 May 2025 06:28PM UTC
Jobs 1
Files 6
Run time 1min
21 May 2025 06:19PM UTC coverage: 94.921% (+0.4%) from 94.565%
15169783189

Pull #144

github

ghukill
TIMDEXDataset capable of yielding current records only

Why these changes are being introduced:

With TIMDEXDataset capable of limiting to only parquet files
associated with current runs, the next logical step is providing
the ability to yield only the current version of a record.

This would support a "full refresh" of a TIMDEX source where an
application like TIM could yield only current records for a given
source and index those to Opensearch.

How this addresses that need:

When TIMDEXDataset is loaded with current_records=True, the private
attribute TIMDEXDataset._dedupe_on_read is set to True, informing
any read methods to dedupe during yielding.  Because all read
methods rely on TIMDEXDataset.read_batches_iter() at the lowest level,
the deduping logic is required only there.

Because the ordering of the parquet files is already handled by
the load method, the read methods can be confident they are always
seeing the most recent version of a record first, and thus can
just maintain a "seen" list of records as they are encountered.  This keeps
the deduplication effectively instant and memory safe; no large
in-memory reordering or deduplication is required.
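
The "seen" list approach described above can be sketched roughly as
follows.  This is a minimal illustration, not the actual code in
timdex_dataset_api/dataset.py: the function name dedupe_batches, the
"record_id" key, and the use of plain lists of dicts in place of
parquet record batches are all assumptions made for the example.  It
relies on the same precondition stated above, that batches arrive
ordered most recent first.

```python
from collections.abc import Iterable, Iterator


def dedupe_batches(
    batches: Iterable[list[dict]],
    id_key: str = "record_id",  # hypothetical column name, for illustration only
) -> Iterator[list[dict]]:
    """Yield batches containing only the first-seen version of each record.

    Assumes batches are ordered most recent first, so the first time a
    record ID appears it is the current version of that record.
    """
    seen: set[str] = set()
    for batch in batches:
        current = []
        for record in batch:
            record_id = record[id_key]
            if record_id in seen:
                # An earlier (more recent) batch already yielded this record.
                continue
            seen.add(record_id)
            current.append(record)
        if current:
            yield current


# Two batches, most recent run first; record "a" appears in both.
batches = [
    [
        {"record_id": "a", "run_date": "2025-05-21"},
        {"record_id": "b", "run_date": "2025-05-21"},
    ],
    [
        {"record_id": "a", "run_date": "2025-05-01"},
        {"record_id": "c", "run_date": "2025-05-01"},
    ],
]
deduped = list(dedupe_batches(batches))
```

Only the set of record IDs is held in memory, and each batch is
filtered as it streams through, so no sorting or buffering of the
full dataset is needed.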

Side effects of this change:
* Applications like TIM now have the option of yielding only current
records for a source, or all sources, supporting new functionality
like fully reindexing a source in Opensearch from parquet dataset
data alone.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-494
Pull Request #144: TIMX 494 - yield deduped, most recent records

44 of 45 new or added lines in 3 files covered. (97.78%)

2 existing lines in 1 file now uncovered.

299 of 315 relevant lines covered (94.92%)

0.95 hits per line

New Missed Lines in Diff

Lines  Coverage  ∆      File
1      97.37     -0.1%  timdex_dataset_api/dataset.py

Uncovered Existing Lines

Lines  Coverage  ∆     File
2      92.11     0.8%  timdex_dataset_api/run.py

Jobs
ID  Job ID         Ran                      Files  Coverage
1   15169783189.1  21 May 2025 06:28PM UTC  6      94.92