20400711546

Committed 20 Dec 2025 10:09PM UTC coverage: 89.367% (+0.006%) from 89.361%

Build # 20400711546

Build Type

push

github

Committed by

bramp

Commit Message

docs: Add comprehensive Classifier best practices documentation

- Add detailed docstrings to Classifier and RuleBasedClassifier classes
  covering all aspects of writing robust classifiers
- Document scoring phase: API access rules, Score object design,
  intrinsic vs relationship-based scoring
- Document build phase: source block rules, exception handling,
  construction patterns
- Document build_all(): when to use for global coordination
- Add complete code examples for atomic and composite patterns
- Fix DESIGN.md contradiction about Score objects storing candidates
- Update README.md and DESIGN.md to reference class docstrings as
  single source of truth
- Add recommendations to use RuleBasedClassifier for atomic classifiers

This consolidates documentation to reduce duplication and provides
clear guidelines for both humans and AI agents writing new classifiers.

Run Details

13708 of 15339 relevant lines covered (89.37%)

0.89 hits per line
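
The run-level figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and reading "hits per line" as total hits divided by relevant lines (with each covered line hit once) is an assumption, not something the report states.

```python
# Figures copied from the run details above.
covered_lines = 13708
relevant_lines = 15339

# Coverage percentage, as reported to two and three decimal places.
coverage_pct = 100 * covered_lines / relevant_lines
print(f"{coverage_pct:.2f}%")
print(f"{coverage_pct:.3f}%")

# Assumption: every covered line was executed exactly once, so hits per
# relevant line equals the covered fraction.
hits_per_line = covered_lines / relevant_lines
print(f"{hits_per_line:.2f} hits per line")
```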

Source File

97.56

/src/build_a_long/pdf_extract/classifier/rule_based_classifier.py

"""
Rule-based classifier implementation.
"""

from __future__ import annotations

import logging
from abc import abstractmethod
from collections.abc import Sequence

from build_a_long.pdf_extract.classifier.block_filter import (
    find_contained_effects,
)
from build_a_long.pdf_extract.classifier.candidate import Candidate
from build_a_long.pdf_extract.classifier.classification_result import (
    ClassificationResult,
)
from build_a_long.pdf_extract.classifier.label_classifier import (
    LabelClassifier,
)
from build_a_long.pdf_extract.classifier.rules import Rule, RuleContext
from build_a_long.pdf_extract.classifier.score import Score, Weight
from build_a_long.pdf_extract.extractor.bbox import BBox
from build_a_long.pdf_extract.extractor.page_blocks import Block, Blocks

log = logging.getLogger(__name__)


class RuleScore(Score):
    """Generic score based on rules."""

    components: dict[str, float]
    total_score: float

    def score(self) -> Weight:
        return self.total_score

    def get(self, rule_name: str, default: float = 0.0) -> float:
        """Get the score for a specific rule name."""
        return self.components.get(rule_name, default)


class StepNumberScore(RuleScore):
    """Score for step number candidates that includes the parsed step value.

    This avoids re-parsing the step number from source blocks when the value
    is needed later (e.g., for building StepNumber elements or sorting).
    """

    step_value: int
    """The parsed step number value (e.g., 1, 2, 3, 42)."""


class RuleBasedClassifier(LabelClassifier):
    """Base class for classifiers that use a list of rules to score candidates.

    This class provides a declarative way to create classifiers using rules.
    Instead of implementing custom scoring logic, subclasses declare a list
    of Rule objects that are evaluated for each block.

    How It Works
    ------------

    1. Define your rules in the `rules` property
    2. Rules are evaluated sequentially for each block
    3. Each rule returns a score (0.0 to 1.0) or None (skipped)
    4. Required rules with score 0.0 cause immediate rejection
    5. Final score is weighted average of all applicable rules
    6. Blocks meeting `min_score` threshold become candidates

    Scoring Calculation
    -------------------

    .. code-block:: python

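        # Sums run over the rules that returned a score (not None)
        final_score = sum(score * rule.weight) / sum(rule.weight)

    - If any required rule scores 0.0, the block is rejected immediately
    - Rules can return None to be skipped (not counted in the average)
    - If no rule contributes any weight, the final score is 0.0
    - Per-rule scores are stored in `score_details.components` for debugging;
      weights are not stored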

    Best Practices
    --------------

    **Rule Design:**

    - Use Filter rules to eliminate invalid blocks early
    - Use Score rules to rate blocks on intrinsic properties
    - Set `required=True` for rules that MUST pass (e.g., type filters)
    - Use weights to emphasize important scoring factors

    **Score Object:**

    - Override `_create_score()` to return a custom Score subclass
    - Use custom scores to store parsed values (e.g., step numbers)
    - Always inherit from the `Score` abstract base class

    **Source Blocks:**

    - Override `_get_additional_source_blocks()` to include related blocks
    - Default implementation finds visual effects (shadows, outlines)
    - Set `effects_margin` to automatically include nearby drawings/images

    Example Implementation
    ----------------------

    .. code-block:: python

        class MyClassifier(RuleBasedClassifier):
            output = "my_label"
            requires = frozenset()  # Or frozenset({"dependency"})

            @property
            def min_score(self) -> float:
                return 0.6  # Require 60% confidence

            @property
            def rules(self) -> Sequence[Rule]:
                return [
                    # Filter: Only accept Text blocks
                    IsInstanceFilter((Text,)),

                    # Required rule: Must be in top half of page
                    PositionScore(
                        scale=LinearScale({0.0: 1.0, 0.5: 0.0}),
                        weight=1.0,
                        required=True,
                    ),

                    # Optional scoring: Prefer larger text
                    FontSizeScore(
                        target_size=24.0,
                        weight=0.5,
                    ),
                ]

            # Optional: Custom score with parsed data
            def _create_score(
                self, block: Block, components: dict[str, float], total_score: float
            ) -> RuleScore:
                # Parse and store additional info
                value = self._parse_value(block)
                return MyCustomScore(
                    components=components,
                    total_score=total_score,
                    parsed_value=value,
                )

            def build(self, candidate, result) -> MyElement:
                score = candidate.score_details
                assert isinstance(score, MyCustomScore)
                return MyElement(
                    bbox=candidate.bbox,
                    value=score.parsed_value,
                )

    Built-in Hooks
    --------------

    Override these methods to customize behavior:

    - `_should_accept(score)`: Custom acceptance logic beyond min_score
    - `_create_score()`: Return custom Score subclass with additional data
    - `_get_additional_source_blocks()`: Include related blocks (shadows, etc.)

    Visual Effects Support
    ----------------------

    RuleBasedClassifier automatically includes nearby Drawing/Image blocks
    as visual effects (outlines, shadows) if `effects_margin` is set:

    .. code-block:: python

        @property
        def effects_margin(self) -> float | None:
            return 2.0  # Include blocks within 2 units

        @property
        def effects_max_area_ratio(self) -> float | None:
            return 5.0  # Effect can be at most 5x the primary block area

    This ensures that when a candidate wins, all associated visual effects
    are consumed together, preventing other classifiers from incorrectly
    using shadow/outline blocks.

    See Also
    --------
    - Classifier: Main orchestrator with comprehensive best practices
    - Rule: Base class for scoring rules
    - rules module: Available rule implementations
    """

    @property
    @abstractmethod
    def rules(self) -> Sequence[Rule]:
        """Get the list of rules for this classifier."""
        pass

    @property
    def min_score(self) -> float:
        """Minimum score threshold for acceptance. Defaults to 0.0."""
        return 0.0

    @property
    def effects_margin(self) -> float | None:
        """Margin to expand block bbox to find visual effects (outlines, shadows).

        If None, no automatic effect finding is performed.
        Defaults to None.
        """
        return None

    # TODO Do we need effects_max_area_ratio ?
    @property
    def effects_max_area_ratio(self) -> float | None:
        """Maximum ratio of effect block area to primary block area.

        Used to avoid consuming unrelated large blocks as effects.
        Defaults to None (no ratio check).
        """
        return None

    def _create_score(
        self,
        block: Block,
        components: dict[str, float],
        total_score: float,
    ) -> RuleScore:
        """Create the score object for a candidate.

        Subclasses can override this to return a more specific score type
        that contains additional information (e.g., parsed values).

        Args:
            block: The block being scored
            components: Dictionary of rule name to score
            total_score: The weighted total score

        Returns:
            A RuleScore (or subclass) instance
        """
        return RuleScore(components=components, total_score=total_score)

    def _score(self, result: ClassificationResult) -> None:
        """Score blocks using rules."""
        context = RuleContext(result.page_data, self.config, result)
        rules = self.rules

        for block in result.page_data.blocks:
            components: dict[str, float] = {}
            weighted_sum = 0.0
            total_weight = 0.0
            failed = False

            for rule in rules:
                score = rule.calculate(block, context)

                # If rule returns None, it's skipped (not applicable)
                if score is None:
                    continue

                # If required rule fails (score 0), fail the block immediately
                if rule.required and score == 0.0:
                    failed = True
                    # log.debug(
                    #    "[%s] block_id=%s failed required rule '%s'",
                    #    self.output,
                    #    block.id,
                    #    rule.name,
                    # )
                    break

                rule_weight = rule.weight  # Using direct weight from Rule instance

                weighted_sum += score * rule_weight
                total_weight += rule_weight
                components[rule.name] = score

            if failed:
                continue

            # Calculate final score
            final_score = weighted_sum / total_weight if total_weight > 0 else 0.0

            # Check classifier-specific acceptance logic
            if not self._should_accept(final_score):
                log.debug(
                    "[%s] block_id=%s "
                    "rejected: score=%.3f < min_score=%.3f components=%s",
                    self.output,
                    block.id,
                    final_score,
                    self.min_score,
                    components,
                )
                continue

            log.debug(
                "[%s] block_id=%s accepted: score=%.3f components=%s",
                self.output,
                block.id,
                final_score,
                components,
            )

            # Build source blocks list, deduplicating as we go
            seen_ids: set[int] = {block.id}
            source_blocks: list[Blocks] = [block]

            # Add any classifier-specific additional source blocks
            for b in self._get_additional_source_blocks(block, result):
                if b.id not in seen_ids:
                    seen_ids.add(b.id)
                    source_blocks.append(b)

            # Create score object (subclasses can override _create_score)
            score_details = self._create_score(block, components, final_score)

            # Compute bbox as the union of all source blocks
            # This ensures the candidate bbox matches the source_blocks union,
            # required by validation (assert_element_bbox_matches_source_and_children)
            candidate_bbox = BBox.union_all([b.bbox for b in source_blocks])

            # Create candidate
            candidate = Candidate(
                bbox=candidate_bbox,
                label=self.output,
                score=final_score,
                score_details=score_details,
                source_blocks=source_blocks,
            )
            result.add_candidate(candidate)

    def _get_additional_source_blocks(
        self, block: Blocks, result: ClassificationResult
    ) -> Sequence[Blocks]:
        """Get additional source blocks to include with the candidate.

        Subclasses can override this to include related blocks (e.g.,
        overlapping drawings, drop shadows) in the candidate's source_blocks.
        These blocks will be marked as removed if the candidate wins.

        The default implementation automatically includes Drawing/Image blocks
        that appear to be visual effects (outlines, shadows) by calling
        find_contained_effects if self.effects_margin is not None.
        """
        margin = self.effects_margin
        if margin is not None:
            return find_contained_effects(
                block,
                result.page_data.blocks,
                margin=margin,
                max_area_ratio=self.effects_max_area_ratio,
            )
        return []

    def _should_accept(self, score: float) -> bool:
        """Determine if a score is high enough to be a candidate.

        Subclasses can override this.
        """
        return score >= self.min_score
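
The scoring loop in `_score` can be tried in isolation with a minimal sketch. Everything here is a hypothetical stand-in and not part of the `build_a_long` package: `SketchRule`, `weighted_score`, and the dict-shaped blocks are assumptions made for illustration. Only the control flow (skip `None`, reject on a required 0.0, weighted average, 0.0 when no weight accumulates) follows the listing above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical stand-in for a Rule: name, weight, scoring function."""

    name: str
    weight: float
    calculate: Callable[[dict], Optional[float]]
    required: bool = False


def weighted_score(
    block: dict, rules: list[SketchRule]
) -> Optional[tuple[float, dict[str, float]]]:
    """Return (final_score, components), or None if a required rule scored 0.0."""
    components: dict[str, float] = {}
    weighted_sum = 0.0
    total_weight = 0.0
    for rule in rules:
        score = rule.calculate(block)
        if score is None:  # rule not applicable: left out of the average
            continue
        if rule.required and score == 0.0:  # hard failure: reject the block
            return None
        weighted_sum += score * rule.weight
        total_weight += rule.weight
        components[rule.name] = score
    # Same fallback as the listing: no accumulated weight means a score of 0.0.
    final = weighted_sum / total_weight if total_weight > 0 else 0.0
    return final, components


# Illustrative rules: a required type check and an optional font-size score.
rules = [
    SketchRule(
        "is_text", 1.0, lambda b: 1.0 if b["kind"] == "text" else 0.0, required=True
    ),
    SketchRule(
        "font_size", 0.5, lambda b: min(b["size"] / 24.0, 1.0) if "size" in b else None
    ),
]

print(weighted_score({"kind": "text", "size": 12.0}, rules))   # both rules apply
print(weighted_score({"kind": "text"}, rules))                 # font_size skipped
print(weighted_score({"kind": "image", "size": 24.0}, rules))  # required rule fails
```

In the first call the result is (1.0 * 1.0 + 0.5 * 0.5) / 1.5, in the second the skipped rule drops out of both sums, and in the third the required rule's 0.0 rejects the block before any average is taken. A caller would then compare the final score against a threshold, as `_should_accept` does with `min_score`.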

Uncovered lines in this file: `return self.total_score` (in `RuleScore.score`) and `pass` (in the abstract `rules` property). All other executable lines were hit once.
