idanmoradarthas / DataScienceUtils / 21079409936 / 8
100%
master: 100%

Ran 16 Jan 2026 08:14PM UTC

Files 7

Run time 1s

Committed 16 Jan 2026 08:09PM UTC coverage: 99.7% (-0.3%) from 100.0%

Job # 21079409936.8

Build Type

push

github

Commit Message

Refactor `append_tags_to_frame` to use `MultiLabelBinarizer` (#87)

* refactor: Use MultiLabelBinarizer in append_tags_to_frame

Replaced `CountVectorizer` with `MultiLabelBinarizer` in the `append_tags_to_frame` function.

The previous implementation used `CountVectorizer` with a custom tokenizer, which was not the most suitable tool for the task. `MultiLabelBinarizer` is a more direct and efficient choice for creating a binary matrix from pre-tokenized tags.

The refactoring includes manual implementations for the `min_df` and `max_features` parameters to ensure that the function's behavior remains identical to the original implementation.

- Replaced `CountVectorizer` with `MultiLabelBinarizer`.
- Manually implemented `min_df` filtering based on document frequency.
- Manually implemented `max_features` selection of the most frequent tags.
- Ensured that the function signature and all existing tests remain unchanged and pass.
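
The approach described above can be sketched as follows. This is a minimal illustration, not the code from the commit: `binarize_tags` is a hypothetical name, the input is assumed to be a list of already-tokenized tag lists, `min_df` is assumed to be an integer document count, and the tie-breaking rule for `max_features` (alphabetical) is an assumption that may differ from the project's behavior.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def binarize_tags(tag_lists, min_df=1, max_features=None):
    """Hypothetical helper: one binary column per tag, filtered by document frequency."""
    mlb = MultiLabelBinarizer()
    matrix = mlb.fit_transform(tag_lists)  # shape: (n_documents, n_distinct_tags)
    frame = pd.DataFrame(matrix, columns=mlb.classes_)

    # Document frequency: the number of rows in which each tag appears.
    doc_freq = frame.sum(axis=0)

    # Manual min_df: keep tags that appear in at least `min_df` documents.
    kept = [tag for tag in frame.columns if doc_freq[tag] >= min_df]

    # Manual max_features: keep the most frequent tags.
    # Ties are broken alphabetically (an assumption made for determinism).
    if max_features is not None:
        ranked = sorted(kept, key=lambda tag: (-doc_freq[tag], tag))
        kept = sorted(ranked[:max_features])

    # If nothing survives the filters, this is a frame with zero columns.
    return frame[kept]
```

For example, `binarize_tags([["a", "b"], ["a", "c"], ["a", "b", "d"]], min_df=2)` keeps only the columns `a` and `b`, because `c` and `d` each occur in a single document.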

* refactor: Use MultiLabelBinarizer in append_tags_to_frame

Replaced `CountVectorizer` with `MultiLabelBinarizer` in the `append_tags_to_frame` function and incorporated feedback from the code review.

Changes based on code review feedback:
- Added a clearer comment to the sorting logic for `max_features`.
- Deduplicated code by creating a `_prepare_tags` helper function.
- Added a new test case covering the edge case where no tags are left after filtering.

- Replaced `CountVectorizer` with `MultiLabelBinarizer`.
- Manually implemented `min_df` filtering based on document frequency.
- Manually implemented... (continued)

Run Details

665 of 667 relevant lines covered (99.7%)

1.0 hits per line

Source Files on job windows-latest-python-3.12 - 21079409936.8