MITLibraries / transmogrifier / 12694200524
99%
main: 99%

LAST BUILD BRANCH: v3.8
DEFAULT BRANCH: main
Ran 09 Jan 2025 04:28PM UTC
Jobs 1
Files 18
Run time 108min
09 Jan 2025 04:17PM UTC coverage: 98.757%. First build
push

github

ghukill
Parse bs4 in isolated thread

Why these changes are being introduced:

A bug was discovered where the application did not exit cleanly
for XML sources when the number of transformed records required
two or more batches of writes to the parquet dataset.  The cause was
a memory fault triggered by the combination of three factors:

1. parsing XML strings into BeautifulSoup4 Tags
2. running Transmogrifier via 'pipenv run transform...'
3. writing transformed records to a parquet dataset in more than
one batch of writes

Under the hood, BeautifulSoup4 uses lxml to parse the XML string
it receives.  Somehow, through a combination of these three factors,
there were memory pointers left dangling when the application
attempted to tear itself down.  This resulted in C code throwing
an abort() call, which bubbled up.
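
As a rough illustration of how an abort() in C code surfaces to the
caller, the sketch below runs a child Python process that calls
os.abort() and inspects its return code.  This is not Transmogrifier
code: the helper name 'run_child' and the child snippets are
hypothetical stand-ins for the failing teardown, and the negative
return code convention applies to POSIX systems only.

```python
import signal
import subprocess
import sys


def run_child(code):
    """Run a snippet in a child interpreter and return its exit status.

    Hypothetical helper for illustration; not part of Transmogrifier.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # hide any abort message from the child
        check=False,                # we want the status, not an exception
    )
    return completed.returncode


# A child that exits normally reports status 0.
clean_status = run_child("pass")

# A child that calls abort() is killed by SIGABRT; on POSIX, subprocess
# reports that as the negative signal number.
aborted_status = run_child("import os; os.abort()")

print("clean:", clean_status)
print("aborted:", aborted_status)
```

A wrapper such as 'pipenv run' or a CI step that checks the exit
status would treat the second case as a failure even if all of the
work completed before teardown.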

It was confirmed that no single factor was enough on its own:
non-XML sources work fine, removing bs4 parsing works fine,
even changing bs4's parser to 'html.parser' works fine,
and running the transformation without 'pipenv run ...' works fine.

Some documentation and forum posts suggest that bs4 is somewhat
messy with memory management, and this combination of factors
appears to trigger that.

How this addresses that need:

When producing a bs4 'source_record' for transformation, the
bs4 parsing is handled in a short-lived, isolated thread.  By doing
so, we do not see any memory management issues when the application
exits.  It is less than ideal to spawn a thread just to parse a
bs4 Tag, but the performance hit is minimal.
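
A minimal sketch of the isolated-thread idea, under stated
assumptions: the function names 'parse_in_isolated_thread' and
'stand_in_parser' are hypothetical, and the standard library's
xml.etree.ElementTree stands in for the bs4 parse so the example
runs without third-party packages.  It is not the actual
Transmogrifier implementation.

```python
import threading
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def parse_in_isolated_thread(parse_func, xml_string):
    """Run parse_func(xml_string) in a short-lived worker thread.

    The executor is created for this single call and shut down when the
    'with' block exits, so the worker thread does not outlive the parse.
    Exceptions raised by parse_func are re-raised by future.result().
    """
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(parse_func, xml_string)
        return future.result()


def stand_in_parser(xml_string):
    # Assumption: a stand-in for the bs4 parse described in the commit.
    # Also returns the worker thread's name so the caller can see that
    # parsing happened off the calling thread.
    element = ET.fromstring(xml_string)
    return element, threading.current_thread().name


element, worker_name = parse_in_isolated_thread(
    stand_in_parser, "<record><title>Example</title></record>"
)
print(element.find("title").text, "parsed in", worker_name)
```

Creating one executor per record keeps the sketch simple and matches
the "short-lived" description above, at the cost of a thread start
for every parse.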

Furthermore, there is active discussion of removing BeautifulSoup4
altogether from Transmogrifier, so this patch is seen as temporary
until that time.

Side effects of this change:
* XML sources large enough to trigger multiple batches of writes
to the output parquet dataset will no longer exit with an error code.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-454

10 of 10 new or added lines in 1 file covered. (100.0%)

1748 of 1770 relevant lines covered (98.76%)

0.99 hits per line

Jobs
ID | Job ID | Ran | Files | Coverage
1 | 12694200524.1 | 09 Jan 2025 04:28PM UTC | 0 | 98.76
GitHub Action Run
Source Files on build 12694200524
Detailed source file information is not available for this build.
  • Back to Repo
  • Github Actions Build #12694200524
  • 6772c18c on github