20401440535

Committed 20 Dec 2025 11:19PM UTC coverage: 89.38% (+0.01%) from 89.367%

Build # 20401440535

Build Type

push

github

Committed by

bramp

Commit Message

Standardize RuleBasedClassifier effect finding and resolve PartCount conflicts

- Set default effects_margin to 2.0 in RuleBasedClassifier.
- Removed effects_max_area_ratio and effects_target_types from base class to simplify API.
- Implemented local filtering in PartCountClassifier and PartNumberClassifier to only consume Drawing effects, preventing conflicts with PartImage Image blocks.
- Opted out non-content classifiers (Divider, ProgressBarBar, etc.) from automatic effect finding to maintain behavior.
- Cleaned up ProgressBarIndicatorClassifier to use standard base class properties.
- Updated golden files for pages 072 and 176 to reflect minor bbox changes from refined PartCount behavior.
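
The effect-finding changes listed above can be illustrated with a minimal sketch. The classes below (`Block`, `Drawing`, `Image`, `BaseClassifier`, `DividerLike`, `PartCountLike`) are hypothetical stand-ins, not the project's real types, and `_find_effects` only imitates the role of `find_contained_effects`. The sketch assumes that opting out means returning `None` from `effects_margin`, and that "local filtering" means keeping only `Drawing` blocks from the default result.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a page block."""

    id: int


class Drawing(Block):
    """Hypothetical stand-in for a vector drawing block."""


class Image(Block):
    """Hypothetical stand-in for a raster image block."""


class BaseClassifier:
    """Stand-in for the effect-finding hooks of the base class."""

    @property
    def effects_margin(self) -> float | None:
        return 2.0  # Default stated in the commit message

    def _find_effects(self, nearby: Sequence[Block]) -> list[Block]:
        # Assumption: the real helper does geometric containment checks;
        # here every nearby block counts as an effect.
        return list(nearby)

    def additional_source_blocks(self, nearby: Sequence[Block]) -> list[Block]:
        if self.effects_margin is None:
            return []  # Effect finding is disabled
        return self._find_effects(nearby)


class DividerLike(BaseClassifier):
    """Opts out of automatic effect finding."""

    @property
    def effects_margin(self) -> float | None:
        return None


class PartCountLike(BaseClassifier):
    """Consumes only Drawing effects, leaving Image blocks for others."""

    def additional_source_blocks(self, nearby: Sequence[Block]) -> list[Block]:
        effects = super().additional_source_blocks(nearby)
        return [b for b in effects if isinstance(b, Drawing)]
```

Under these assumptions, a `PartCountLike` candidate near one `Drawing` and one `Image` claims only the `Drawing`, so the `Image` stays available to an image-oriented classifier.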

Run Details

37 of 39 new or added lines in 10 files covered. (94.87%)

26 existing lines in 7 files now uncovered.

13701 of 15329 relevant lines covered (89.38%)

0.89 hits per line
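
The percentages above can be reproduced from the raw line counts. This is a small sketch that assumes the report computes coverage as covered lines divided by relevant lines, rounded to two decimals; `coverage_percent` is a hypothetical helper, not part of any coverage tool.

```python
def coverage_percent(covered: int, relevant: int) -> float:
    """Return covered/relevant as a percentage rounded to two decimals."""
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    return round(100.0 * covered / relevant, 2)


overall = coverage_percent(13701, 15329)  # Whole-project figure
new_lines = coverage_percent(37, 39)      # New or added lines in this build
```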

Source File

98.75

/src/build_a_long/pdf_extract/classifier/rule_based_classifier.py

"""
Rule-based classifier implementation.
"""

from __future__ import annotations

import logging
from abc import abstractmethod
from collections.abc import Sequence

from build_a_long.pdf_extract.classifier.block_filter import (
    find_contained_effects,
)
from build_a_long.pdf_extract.classifier.candidate import Candidate
from build_a_long.pdf_extract.classifier.classification_result import (
    ClassificationResult,
)
from build_a_long.pdf_extract.classifier.label_classifier import (
    LabelClassifier,
)
from build_a_long.pdf_extract.classifier.rules import Rule, RuleContext
from build_a_long.pdf_extract.classifier.score import Score, Weight
from build_a_long.pdf_extract.extractor.bbox import BBox
from build_a_long.pdf_extract.extractor.page_blocks import Blocks

log = logging.getLogger(__name__)


class RuleScore(Score):
    """Generic score based on rules."""

    components: dict[str, float]
    total_score: float

    def score(self) -> Weight:
        return self.total_score

    def get(self, rule_name: str, default: float = 0.0) -> float:
        """Get the score for a specific rule name."""
        return self.components.get(rule_name, default)


class StepNumberScore(RuleScore):
    """Score for step number candidates that includes the parsed step value.

    This avoids re-parsing the step number from source blocks when the value
    is needed later (e.g., for building StepNumber elements or sorting).
    """

    step_value: int
    """The parsed step number value (e.g., 1, 2, 3, 42)."""


class RuleBasedClassifier(LabelClassifier):
    """Base class for classifiers that use a list of rules to score candidates.

    This class provides a declarative way to create classifiers using rules.
    Instead of implementing custom scoring logic, subclasses declare a list
    of Rule objects that are evaluated for each block.

    How It Works
    ------------

    1. Define your rules in the `rules` property
    2. Rules are evaluated sequentially for each block
    3. Each rule returns a score (0.0 to 1.0) or None (skipped)
    4. Required rules with score 0.0 cause immediate rejection
    5. The final score is the weighted average of all applicable rules
    6. Blocks meeting `min_score` threshold become candidates

    Scoring Calculation
    -------------------

    .. code-block:: python

        final_score = sum(rule.score * rule.weight) / sum(rule.weight)

    - If any required rule scores 0.0, the block is rejected immediately
    - Rules can return None to be skipped (not counted in the average)
    - If no rule returns a score, the final score is 0.0
    - Per-rule scores are stored in score_details (as `components`) for debugging

    Best Practices
    --------------

    **Rule Design:**
    - Use Filter rules to eliminate invalid blocks early
    - Use Score rules to rate blocks on intrinsic properties
    - Set `required=True` for rules that MUST pass (e.g., type filters)
    - Use weights to emphasize important scoring factors

    **Score Object:**
    - Override `_create_score()` to return custom Score subclass
    - Use custom scores to store parsed values (e.g., step numbers)
    - Always inherit from `Score` abstract base class

    **Source Blocks:**
    - Override `_get_additional_source_blocks()` to include related blocks
    - Default implementation finds visual effects (shadows, outlines)
    - Override `effects_margin` to change the search margin (default 2.0)
    - Return None from `effects_margin` to disable automatic effect finding

    Example Implementation
    ----------------------

    .. code-block:: python

        class MyClassifier(RuleBasedClassifier):
            output = "my_label"
            requires = frozenset()  # Or frozenset({"dependency"})

            @property
            def min_score(self) -> float:
                return 0.6  # Require 60% confidence

            @property
            def rules(self) -> Sequence[Rule]:
                return [
                    # Filter: Only accept Text blocks
                    IsInstanceFilter((Text,)),

                    # Required rule: Must be in top half of page
                    PositionScore(
                        scale=LinearScale({0.0: 1.0, 0.5: 0.0}),
                        weight=1.0,
                        required=True,
                    ),

                    # Optional scoring: Prefer larger text
                    FontSizeScore(
                        target_size=24.0,
                        weight=0.5,
                    ),
                ]

            # Optional: Custom score with parsed data
            def _create_score(
                self, components: dict[str, float], total_score: float,
                source_blocks: Sequence[Blocks]
            ) -> RuleScore:
                # Parse and store additional info from primary block
                value = self._parse_value(source_blocks[0])
                return MyCustomScore(
                    components=components,
                    total_score=total_score,
                    parsed_value=value,
                )

            def build(self, candidate, result) -> MyElement:
                score = candidate.score_details
                assert isinstance(score, MyCustomScore)
                return MyElement(
                    bbox=candidate.bbox,
                    value=score.parsed_value,
                )

    Built-in Hooks
    --------------

    Override these methods to customize behavior:

    - `_should_accept(score)`: Custom acceptance logic beyond min_score
    - `_create_score()`: Return custom Score subclass with additional data
    - `_get_additional_source_blocks()`: Include related blocks (shadows, etc.)

    Visual Effects Support
    ----------------------

    RuleBasedClassifier automatically includes nearby Drawing/Image blocks
    as visual effects (outlines, shadows). The search is controlled by
    `effects_margin`, which defaults to 2.0. Override it to change the
    margin, or return None to opt out of automatic effect finding:

    .. code-block:: python

        @property
        def effects_margin(self) -> float | None:
            return 2.0  # Include blocks within 2 units (the default)

    This ensures that when a candidate wins, all associated visual effects
    are consumed together, preventing other classifiers from incorrectly
    using shadow/outline blocks.

    See Also
    --------
    - Classifier: Main orchestrator with comprehensive best practices
    - Rule: Base class for scoring rules
    - rules module: Available rule implementations
    """

    @property
    @abstractmethod
    def rules(self) -> Sequence[Rule]:
        """Get the list of rules for this classifier."""
        pass

    @property
    def min_score(self) -> float:
        """Minimum score threshold for acceptance. Defaults to 0.0."""
        return 0.0

    @property
    def effects_margin(self) -> float | None:
        """Margin to expand block bbox to find visual effects (outlines, shadows).

        If None, no automatic effect finding is performed.
        Defaults to 2.0.
        """
        return 2.0

    def _create_score(
        self,
        components: dict[str, float],
        total_score: float,
        source_blocks: Sequence[Blocks],
    ) -> RuleScore:
        """Create the score object for a candidate.

        Subclasses can override this to return a more specific score type
        that contains additional information (e.g., parsed values, cluster
        validation results).

        Args:
            components: Dictionary of rule name to score
            total_score: The weighted total score from rules
            source_blocks: All blocks that will be part of the candidate.
                The primary block (that passed the rules) is source_blocks[0].
                Additional blocks from _get_additional_source_blocks() follow.

        Returns:
            A RuleScore (or subclass) instance

        TODO: Consider adding a cluster_rules property if multiple classifiers
        need to validate/score complete clusters. This would allow expressing
        cluster validation (e.g., count >= 3, cluster bbox aspect ratio) as
        declarative rules instead of imperative code in _create_score().
        For now, the imperative approach is simpler for the few classifiers
        that need it (e.g., LoosePartSymbolClassifier).
        """
        return RuleScore(components=components, total_score=total_score)

    def _score(self, result: ClassificationResult) -> None:
        """Score blocks using rules."""
        context = RuleContext(result.page_data, self.config, result)
        rules = self.rules

        for block in result.page_data.blocks:
            components: dict[str, float] = {}
            weighted_sum = 0.0
            total_weight = 0.0
            failed = False

            for rule in rules:
                score = rule.calculate(block, context)

                # If rule returns None, it's skipped (not applicable)
                if score is None:
                    continue

                # If required rule fails (score 0), fail the block immediately
                if rule.required and score == 0.0:
                    failed = True
                    # log.debug(
                    #    "[%s] block_id=%s failed required rule '%s'",
                    #    self.output,
                    #    block.id,
                    #    rule.name,
                    # )
                    break

                rule_weight = rule.weight  # Using direct weight from Rule instance

                weighted_sum += score * rule_weight
                total_weight += rule_weight
                components[rule.name] = score

            if failed:
                continue

            # Calculate final score from rules
            final_score = weighted_sum / total_weight if total_weight > 0 else 0.0

            # Build source blocks list, deduplicating as we go
            seen_ids: set[int] = {block.id}
            source_blocks: list[Blocks] = [block]

            # Add any classifier-specific additional source blocks
            for b in self._get_additional_source_blocks(block, result):
                if b.id not in seen_ids:
                    seen_ids.add(b.id)
                    source_blocks.append(b)

            # Create score object (subclasses can override _create_score)
            # This can validate the complete cluster and adjust the score
            score_details = self._create_score(components, final_score, source_blocks)

            # Get actual score (may differ from final_score after validation)
            actual_score = score_details.score()

            # Check classifier-specific acceptance logic on the actual score
            if not self._should_accept(actual_score):
                log.debug(
                    "[%s] block_id=%s "
                    "rejected: score=%.3f < min_score=%.3f components=%s",
                    self.output,
                    block.id,
                    actual_score,
                    self.min_score,
                    components,
                )
                continue

            log.debug(
                "[%s] block_id=%s cluster accepted: score=%.3f components=%s",
                self.output,
                block.id,
                actual_score,
                components,
            )

            # Compute bbox as the union of all source blocks
            # This ensures the candidate bbox matches the source_blocks union,
            # required by validation (assert_element_bbox_matches_source_and_children)
            candidate_bbox = BBox.union_all([b.bbox for b in source_blocks])

            # Create candidate
            candidate = Candidate(
                bbox=candidate_bbox,
                label=self.output,
                score=actual_score,
                score_details=score_details,
                source_blocks=source_blocks,
            )
            result.add_candidate(candidate)

    def _get_additional_source_blocks(
        self, block: Blocks, result: ClassificationResult
    ) -> Sequence[Blocks]:
        """Get additional source blocks to include with the candidate.

        Subclasses can override this to include related blocks (e.g.,
        overlapping drawings, drop shadows) in the candidate's source_blocks.
        These blocks will be marked as removed if the candidate wins.

        The default implementation automatically includes Drawing/Image blocks
        that appear to be visual effects (outlines, shadows) by calling
        find_contained_effects if self.effects_margin is not None.
        """
        margin = self.effects_margin
        if margin is not None:
            return find_contained_effects(
                block,
                result.page_data.blocks,
                margin=margin,
            )
        return []

    def _should_accept(self, score: float) -> bool:
        """Determine if a score is high enough to be a candidate.

        Subclasses can override this.
        """
        return score >= self.min_score
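
The scoring loop in `_score` can be checked by hand with a standalone sketch. The function below is a simplified illustration, not the project's API: `weighted_score` is a hypothetical helper that takes `(score, weight, required)` triples in place of `Rule` objects and returns `None` where the real loop would skip the block.

```python
from __future__ import annotations


def weighted_score(
    results: list[tuple[float | None, float, bool]],
) -> float | None:
    """Combine (score, weight, required) triples into one score.

    Returns None when a required rule scores 0.0 (block rejected).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score, weight, required in results:
        if score is None:
            continue  # Rule not applicable: excluded from the average
        if required and score == 0.0:
            return None  # Required rule failed: reject immediately
        weighted_sum += score * weight
        total_weight += weight
    # With no applicable rules the score falls back to 0.0
    return weighted_sum / total_weight if total_weight > 0 else 0.0


# One required rule at 1.0 (weight 1.0), one optional rule at 0.5
# (weight 0.5), and one skipped rule: (1.0 + 0.25) / 1.5
accepted = weighted_score([(1.0, 1.0, True), (0.5, 0.5, False), (None, 2.0, False)])

# A required rule scoring 0.0 rejects the block regardless of other rules
rejected = weighted_score([(0.0, 1.0, True), (1.0, 1.0, False)])

# Every rule skipped: the score is 0.0, not a rejection
all_skipped = weighted_score([(None, 1.0, False)])
```

One consequence worth noting: because `min_score` defaults to 0.0 and `_should_accept` uses `>=`, a block whose rules are all skipped still becomes a candidate unless the subclass raises `min_score` or marks a filter rule as required.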
