bramp / build-along / 20400711546

20 Dec 2025 10:09PM UTC coverage: 89.367% (+0.006%) from 89.361%
push

github

bramp
docs: Add comprehensive Classifier best practices documentation

- Add detailed docstrings to Classifier and RuleBasedClassifier classes
  covering all aspects of writing robust classifiers
- Document scoring phase: API access rules, Score object design,
  intrinsic vs relationship-based scoring
- Document build phase: source block rules, exception handling,
  construction patterns
- Document build_all(): when to use for global coordination
- Add complete code examples for atomic and composite patterns
- Fix DESIGN.md contradiction about Score objects storing candidates
- Update README.md and DESIGN.md to reference class docstrings as
  single source of truth
- Add recommendations to use RuleBasedClassifier for atomic classifiers

This consolidates documentation to reduce duplication and provides
clear guidelines for both humans and AI agents writing new classifiers.

13708 of 15339 relevant lines covered (89.37%)

0.89 hits per line

Source File
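/src/build_a_long/pdf_extract/classifier/rule_based_classifier.py (97.56% covered)
Uncovered lines: the body of RuleScore.score (source line 36) and the pass statement in the abstract rules property (source line 194)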
"""
Rule-based classifier implementation.
"""

from __future__ import annotations

import logging
from abc import abstractmethod
from collections.abc import Sequence

from build_a_long.pdf_extract.classifier.block_filter import (
    find_contained_effects,
)
from build_a_long.pdf_extract.classifier.candidate import Candidate
from build_a_long.pdf_extract.classifier.classification_result import (
    ClassificationResult,
)
from build_a_long.pdf_extract.classifier.label_classifier import (
    LabelClassifier,
)
from build_a_long.pdf_extract.classifier.rules import Rule, RuleContext
from build_a_long.pdf_extract.classifier.score import Score, Weight
from build_a_long.pdf_extract.extractor.bbox import BBox
from build_a_long.pdf_extract.extractor.page_blocks import Block, Blocks

log = logging.getLogger(__name__)


class RuleScore(Score):
    """Generic score based on rules."""

    components: dict[str, float]
    total_score: float

    def score(self) -> Weight:
        return self.total_score

    def get(self, rule_name: str, default: float = 0.0) -> float:
        """Get the score for a specific rule name."""
        return self.components.get(rule_name, default)


class StepNumberScore(RuleScore):
    """Score for step number candidates that includes the parsed step value.

    This avoids re-parsing the step number from source blocks when the value
    is needed later (e.g., for building StepNumber elements or sorting).
    """

    step_value: int
    """The parsed step number value (e.g., 1, 2, 3, 42)."""


class RuleBasedClassifier(LabelClassifier):
    """Base class for classifiers that use a list of rules to score candidates.

    This class provides a declarative way to create classifiers using rules.
    Instead of implementing custom scoring logic, subclasses declare a list
    of Rule objects that are evaluated for each block.

    How It Works
    ------------

    1. Define your rules in the `rules` property
    2. Rules are evaluated sequentially for each block
    3. Each rule returns a score (0.0 to 1.0) or None (skipped)
    4. Required rules with score 0.0 cause immediate rejection
    5. Final score is the weighted average of all applicable rules
    6. Blocks meeting the `min_score` threshold become candidates

    Scoring Calculation
    -------------------

    .. code-block:: python

        final_score = sum(rule.score * rule.weight) / sum(rule.weight)

    - If any required rule scores 0.0, the block is rejected immediately
    - Rules can return None to be skipped (not counted in the average);
      if every rule is skipped, the final score is 0.0
    - All rule scores are stored by rule name in the score_details
      (`components`) for debugging; weights are not stored

    Best Practices
    --------------

    **Rule Design:**
    - Use Filter rules to eliminate invalid blocks early
    - Use Score rules to rate blocks on intrinsic properties
    - Set `required=True` for rules that MUST pass (e.g., type filters)
    - Use weights to emphasize important scoring factors

    **Score Object:**
    - Override `_create_score()` to return a custom Score subclass
    - Use custom scores to store parsed values (e.g., step numbers)
    - Always inherit from the `Score` abstract base class

    **Source Blocks:**
    - Override `_get_additional_source_blocks()` to include related blocks
    - Default implementation finds visual effects (shadows, outlines)
    - Set `effects_margin` to automatically include nearby drawings/images

    Example Implementation
    ----------------------

    .. code-block:: python

        class MyClassifier(RuleBasedClassifier):
            output = "my_label"
            requires = frozenset()  # Or frozenset({"dependency"})

            @property
            def min_score(self) -> float:
                return 0.6  # Require 60% confidence

            @property
            def rules(self) -> Sequence[Rule]:
                return [
                    # Filter: Only accept Text blocks
                    IsInstanceFilter((Text,)),

                    # Required rule: Must be in top half of page
                    PositionScore(
                        scale=LinearScale({0.0: 1.0, 0.5: 0.0}),
                        weight=1.0,
                        required=True,
                    ),

                    # Optional scoring: Prefer larger text
                    FontSizeScore(
                        target_size=24.0,
                        weight=0.5,
                    ),
                ]

            # Optional: Custom score with parsed data
            def _create_score(
                self, block: Block, components: dict[str, float], total_score: float
            ) -> RuleScore:
                # Parse and store additional info
                value = self._parse_value(block)
                return MyCustomScore(
                    components=components,
                    total_score=total_score,
                    parsed_value=value,
                )

            def build(self, candidate, result) -> MyElement:
                score = candidate.score_details
                assert isinstance(score, MyCustomScore)
                return MyElement(
                    bbox=candidate.bbox,
                    value=score.parsed_value,
                )

    Built-in Hooks
    --------------

    Override these methods to customize behavior:

    - `_should_accept(score)`: Custom acceptance logic beyond min_score
    - `_create_score()`: Return custom Score subclass with additional data
    - `_get_additional_source_blocks()`: Include related blocks (shadows, etc.)

    Visual Effects Support
    ----------------------

    RuleBasedClassifier automatically includes nearby Drawing/Image blocks
    as visual effects (outlines, shadows) if `effects_margin` is set:

    .. code-block:: python

        @property
        def effects_margin(self) -> float | None:
            return 2.0  # Include blocks within 2 units

        @property
        def effects_max_area_ratio(self) -> float | None:
            return 5.0  # Effect can be at most 5x the primary block area

    This ensures that when a candidate wins, all associated visual effects
    are consumed together, preventing other classifiers from incorrectly
    using shadow/outline blocks.

    See Also
    --------
    - Classifier: Main orchestrator with comprehensive best practices
    - Rule: Base class for scoring rules
    - rules module: Available rule implementations
    """

    @property
    @abstractmethod
    def rules(self) -> Sequence[Rule]:
        """Get the list of rules for this classifier."""
        pass

    @property
    def min_score(self) -> float:
        """Minimum score threshold for acceptance. Defaults to 0.0."""
        return 0.0

    @property
    def effects_margin(self) -> float | None:
        """Margin to expand block bbox to find visual effects (outlines, shadows).

        If None, no automatic effect finding is performed.
        Defaults to None.
        """
        return None

    # TODO Do we need effects_max_area_ratio ?
    @property
    def effects_max_area_ratio(self) -> float | None:
        """Maximum ratio of effect block area to primary block area.

        Used to avoid consuming unrelated large blocks as effects.
        Defaults to None (no ratio check).
        """
        return None

    def _create_score(
        self,
        block: Block,
        components: dict[str, float],
        total_score: float,
    ) -> RuleScore:
        """Create the score object for a candidate.

        Subclasses can override this to return a more specific score type
        that contains additional information (e.g., parsed values).

        Args:
            block: The block being scored
            components: Dictionary of rule name to score
            total_score: The weighted total score

        Returns:
            A RuleScore (or subclass) instance
        """
        return RuleScore(components=components, total_score=total_score)

    def _score(self, result: ClassificationResult) -> None:
        """Score blocks using rules."""
        context = RuleContext(result.page_data, self.config, result)
        rules = self.rules

        for block in result.page_data.blocks:
            components = {}
            weighted_sum = 0.0
            total_weight = 0.0
            failed = False

            for rule in rules:
                score = rule.calculate(block, context)

                # If rule returns None, it's skipped (not applicable)
                if score is None:
                    continue

                # If required rule fails (score 0), fail the block immediately
                if rule.required and score == 0.0:
                    failed = True
                    # log.debug(
                    #    "[%s] block_id=%s failed required rule '%s'",
                    #    self.output,
                    #    block.id,
                    #    rule.name,
                    # )
                    break

                rule_weight = rule.weight  # Using direct weight from Rule instance

                weighted_sum += score * rule_weight
                total_weight += rule_weight
                components[rule.name] = score

            if failed:
                continue

            # Calculate final score
            final_score = weighted_sum / total_weight if total_weight > 0 else 0.0

            # Check classifier-specific acceptance logic
            if not self._should_accept(final_score):
                log.debug(
                    "[%s] block_id=%s "
                    "rejected: score=%.3f < min_score=%.3f components=%s",
                    self.output,
                    block.id,
                    final_score,
                    self.min_score,
                    components,
                )
                continue

            log.debug(
                "[%s] block_id=%s accepted: score=%.3f components=%s",
                self.output,
                block.id,
                final_score,
                components,
            )

            # Build source blocks list, deduplicating as we go
            seen_ids: set[int] = {block.id}
            source_blocks: list[Blocks] = [block]

            # Add any classifier-specific additional source blocks
            for b in self._get_additional_source_blocks(block, result):
                if b.id not in seen_ids:
                    seen_ids.add(b.id)
                    source_blocks.append(b)

            # Create score object (subclasses can override _create_score)
            score_details = self._create_score(block, components, final_score)

            # Compute bbox as the union of all source blocks
            # This ensures the candidate bbox matches the source_blocks union,
            # required by validation (assert_element_bbox_matches_source_and_children)
            candidate_bbox = BBox.union_all([b.bbox for b in source_blocks])

            # Create candidate
            candidate = Candidate(
                bbox=candidate_bbox,
                label=self.output,
                score=final_score,
                score_details=score_details,
                source_blocks=source_blocks,
            )
            result.add_candidate(candidate)

    def _get_additional_source_blocks(
        self, block: Blocks, result: ClassificationResult
    ) -> Sequence[Blocks]:
        """Get additional source blocks to include with the candidate.

        Subclasses can override this to include related blocks (e.g.,
        overlapping drawings, drop shadows) in the candidate's source_blocks.
        These blocks will be marked as removed if the candidate wins.

        The default implementation automatically includes Drawing/Image blocks
        that appear to be visual effects (outlines, shadows) by calling
        find_contained_effects if self.effects_margin is not None.
        """
        margin = self.effects_margin
        if margin is not None:
            return find_contained_effects(
                block,
                result.page_data.blocks,
                margin=margin,
                max_area_ratio=self.effects_max_area_ratio,
            )
        return []

    def _should_accept(self, score: float) -> bool:
        """Determine if a score is high enough to be a candidate.

        Subclasses can override this.
        """
        return score >= self.min_score
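
The per-block arithmetic in _score above (skip None, reject on a required 0.0, then take the weighted average) can be tried in isolation. The sketch below is only an illustration: RuleResult and combine are hypothetical names that stand in for already-evaluated rules, and they are not part of the build_a_long package.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RuleResult:
    """Hypothetical stand-in for one evaluated rule (not the project's Rule)."""

    name: str
    score: float | None  # None means the rule did not apply to this block
    weight: float = 1.0
    required: bool = False


def combine(results: list[RuleResult]) -> tuple[float, dict[str, float]] | None:
    """Return (final_score, components), or None if a required rule scored 0.0."""
    components: dict[str, float] = {}
    weighted_sum = 0.0
    total_weight = 0.0
    for r in results:
        if r.score is None:
            continue  # skipped rules do not count toward the average
        if r.required and r.score == 0.0:
            return None  # a failed required rule rejects the block outright
        weighted_sum += r.score * r.weight
        total_weight += r.weight
        components[r.name] = r.score
    # Same fallback as the listing: no applicable rule gives 0.0
    final_score = weighted_sum / total_weight if total_weight > 0 else 0.0
    return final_score, components
```

For instance, a required rule scoring 1.0 (weight 1.0), an optional rule scoring 0.5 (weight 1.0), and a skipped rule of weight 0.5 average to (1.0 + 0.5) / 2.0 = 0.75; the skipped rule's weight does not enter the denominator.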
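
One consequence of the defaults in the listing above: min_score is 0.0 and _should_accept tests score >= self.min_score, so a block whose final score is 0.0 (for example, because every rule returned None) is still accepted. The sketch below mirrors just those two members to show the behaviour and one possible override. ThresholdDefaults, StrictlyPositive, and SixtyPercent are hypothetical names used for illustration only.

```python
class ThresholdDefaults:
    """Hypothetical stand-in for the two members used by the acceptance check."""

    @property
    def min_score(self) -> float:
        return 0.0  # same default as the listing

    def _should_accept(self, score: float) -> bool:
        return score >= self.min_score


class StrictlyPositive(ThresholdDefaults):
    """Hypothetical override: also reject blocks whose score is exactly 0.0."""

    def _should_accept(self, score: float) -> bool:
        return score > 0.0 and super()._should_accept(score)


class SixtyPercent(ThresholdDefaults):
    """Hypothetical subclass that only raises the threshold."""

    @property
    def min_score(self) -> float:
        return 0.6
```

Raising min_score is the simpler route when a plain threshold is enough; overriding _should_accept is only needed for conditions that a single threshold cannot express.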
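
The last part of _score above deduplicates source blocks by id and sets the candidate bbox to the union of their boxes. The sketch below reproduces that step with stand-in types. Box, FakeBlock, and collect_sources are hypothetical, and the union is assumed to be the smallest axis-aligned box enclosing all inputs; the listing does not show how BBox.union_all is actually implemented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for BBox."""

    x0: float
    y0: float
    x1: float
    y1: float

    @staticmethod
    def union_all(boxes: list[Box]) -> Box:
        # Assumption: the union is the smallest box enclosing every input.
        return Box(
            min(b.x0 for b in boxes),
            min(b.y0 for b in boxes),
            max(b.x1 for b in boxes),
            max(b.y1 for b in boxes),
        )


@dataclass(frozen=True)
class FakeBlock:
    """Hypothetical stand-in for a page block with an id and a bbox."""

    id: int
    bbox: Box


def collect_sources(primary: FakeBlock, extras: list[FakeBlock]) -> list[FakeBlock]:
    """Keep the primary block first, then each extra block once, in order."""
    seen_ids = {primary.id}
    sources = [primary]
    for b in extras:
        if b.id not in seen_ids:
            seen_ids.add(b.id)
            sources.append(b)
    return sources
```

Seeding seen_ids with the primary block's id matters: if the effect search returns the primary block itself, it is not added a second time.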