MITLibraries / transmogrifier
main: 99%

LAST BUILD BRANCH: v3.8
DEFAULT BRANCH: main
Repo Added 30 Mar 2022 05:29PM UTC
Files 20
LAST BUILD ON BRANCH TIMX-454-memory-error-on-exit

09 Jan 2025 04:17PM UTC coverage: 98.757%. First build
12694200524

push

github

ghukill
Parse bs4 in isolated thread

Why these changes are being introduced:

A bug was discovered where the application did not exit cleanly
for XML sources when the number of records transformed resulted
in 2+ batches of writes to the parquet dataset.  The cause of this
memory fault was the combination of three factors:

1. parsing XML strings into BeautifulSoup4 Tags
2. running Transmogrifier via 'pipenv run transform...'
3. writing transformed records to a parquet dataset in more than
one batch of writes

Under the hood, BeautifulSoup4 uses lxml to parse the XML string
it receives.  Somehow, through the combination of these three factors,
memory pointers were left dangling when the application
attempted to tear itself down.  This resulted in C code calling
abort(), which bubbled up as an error exit code.

It was confirmed that no single one of these factors was enough on
its own.  Non-XML sources work fine, removing bs4 parsing works fine,
even changing bs4's parser to 'html.parser' works fine,
and running the transformation without 'pipenv run ...' works fine.

Some documentation and forum posts suggest that bs4
is somewhat messy with memory management, and it appears this
combination of factors triggers that.

How this addresses that need:

When producing a bs4 'source_record' for transformation, the
bs4 parsing is handled in a short-lived, isolated thread.  By doing
so, we do not see any memory management issues when the application
exits.  It is less than ideal to spawn a thread just to parse a
bs4 Tag, but the performance hit is minimal.
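
A minimal sketch of this approach, not the actual Transmogrifier code:
the helper name parse_in_isolated_thread is hypothetical, and the
standard library's xml.etree.ElementTree stands in for bs4 so the
example runs without third-party packages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar
import xml.etree.ElementTree as ET

T = TypeVar("T")


def parse_in_isolated_thread(parser: Callable[[str], T], xml_string: str) -> T:
    """Run parser(xml_string) in a short-lived worker thread (hypothetical helper)."""
    # The executor owns a single worker thread, which is shut down
    # as soon as the "with" block exits.
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(parser, xml_string).result()


# ET.fromstring is a stand-in parser. With bs4 and lxml installed, the
# parser argument could instead be: lambda s: BeautifulSoup(s, "xml")
record = parse_in_isolated_thread(
    ET.fromstring, "<record><title>Example</title></record>"
)
print(record.find("title").text)
```

The parsed object is returned to the calling thread, so the rest of the
transformation is unchanged; only the parse call itself runs in the
worker thread.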

Furthermore, there is active discussion of removing BeautifulSoup4
altogether from Transmogrifier, so this patch is seen as temporary
until that time.

Side effects of this change:
* XML sources large enough to trigger multiple batches of writes
to the output parquet dataset will no longer exit with error codes.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-454

10 of 10 new or added lines in 1 file covered. (100.0%)

1748 of 1770 relevant lines covered (98.76%)

0.99 hits per line

Source Files on TIMX-454-memory-error-on-exit
Detailed source file information is not available for this build.

Recent builds

Build: 12694200524
Branch: TIMX-454-memory-error-on-exit
Commit: Parse bs4 in isolated thread Why these changes are being introduced: A bug was discovered where the application did not exit cleanly for XML sources when the number of records transformed resulted in 2+ batches of writing to the parquet dataset....
Type: push
Ran: 09 Jan 2025 04:28PM UTC
Committer: ghukill
Via: github
Coverage: 98.76