bramp / build-along / 20401440535

20 Dec 2025 11:19PM UTC coverage: 89.38% (+0.01%) from 89.367%
20401440535

push

github

bramp
Standardize RuleBasedClassifier effect finding and resolve PartCount conflicts

- Set default effects_margin to 2.0 in RuleBasedClassifier.
- Removed effects_max_area_ratio and effects_target_types from base class to simplify API.
- Implemented local filtering in PartCountClassifier and PartNumberClassifier to only consume Drawing effects, preventing conflicts with PartImage Image blocks.
- Opted out non-content classifiers (Divider, ProgressBarBar, etc.) from automatic effect finding to preserve their existing behavior.
- Cleaned up ProgressBarIndicatorClassifier to use standard base class properties.
- Updated golden files for pages 072 and 176 to reflect minor bbox changes from refined PartCount behavior.
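
The local filtering described above might look roughly like the sketch below. The PartCountClassifier and PartNumberClassifier code is not shown on this page, so `Block`, `Drawing`, `Image`, and `drawing_effects_only` are hypothetical stand-ins, not the project's actual types or helpers.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Stand-in for a page block; the real project types are assumed, not shown."""

    id: int


class Drawing(Block):
    """Hypothetical vector-drawing block (e.g. an outline or shadow)."""


class Image(Block):
    """Hypothetical raster-image block (e.g. a part image)."""


def drawing_effects_only(effects: Sequence[Block]) -> list[Block]:
    """Keep only Drawing blocks so Image blocks stay available to other classifiers."""
    return [b for b in effects if isinstance(b, Drawing)]


# Effects found near a part count: two drawings and one image.
found = [Drawing(id=1), Image(id=2), Drawing(id=3)]
kept_ids = [b.id for b in drawing_effects_only(found)]
```

In this sketch only the two drawings are kept, so the image block is not consumed and can still be claimed by a PartImage candidate.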

37 of 39 new or added lines in 10 files covered. (94.87%)

26 existing lines in 7 files now uncovered.

13701 of 15329 relevant lines covered (89.38%)

0.89 hits per line

Source File

98.75
/src/build_a_long/pdf_extract/classifier/rule_based_classifier.py
"""
Rule-based classifier implementation.
"""

from __future__ import annotations

import logging
from abc import abstractmethod
from collections.abc import Sequence

from build_a_long.pdf_extract.classifier.block_filter import (
    find_contained_effects,
)
from build_a_long.pdf_extract.classifier.candidate import Candidate
from build_a_long.pdf_extract.classifier.classification_result import (
    ClassificationResult,
)
from build_a_long.pdf_extract.classifier.label_classifier import (
    LabelClassifier,
)
from build_a_long.pdf_extract.classifier.rules import Rule, RuleContext
from build_a_long.pdf_extract.classifier.score import Score, Weight
from build_a_long.pdf_extract.extractor.bbox import BBox
from build_a_long.pdf_extract.extractor.page_blocks import Blocks

log = logging.getLogger(__name__)


class RuleScore(Score):
    """Generic score based on rules."""

    components: dict[str, float]
    total_score: float

    def score(self) -> Weight:
        return self.total_score

    def get(self, rule_name: str, default: float = 0.0) -> float:
        """Get the score for a specific rule name."""
        return self.components.get(rule_name, default)


class StepNumberScore(RuleScore):
    """Score for step number candidates that includes the parsed step value.

    This avoids re-parsing the step number from source blocks when the value
    is needed later (e.g., for building StepNumber elements or sorting).
    """

    step_value: int
    """The parsed step number value (e.g., 1, 2, 3, 42)."""


class RuleBasedClassifier(LabelClassifier):
    """Base class for classifiers that use a list of rules to score candidates.

    This class provides a declarative way to create classifiers using rules.
    Instead of implementing custom scoring logic, subclasses declare a list
    of Rule objects that are evaluated for each block.

    How It Works
    ------------

    1. Define your rules in the `rules` property
    2. Rules are evaluated sequentially for each block
    3. Each rule returns a score (0.0 to 1.0) or None (skipped)
    4. Required rules with score 0.0 cause immediate rejection
    5. Final score is the weighted average of all applicable rules
    6. Blocks meeting the `min_score` threshold become candidates

    Scoring Calculation
    -------------------

    .. code-block:: python

        final_score = sum(rule.score * rule.weight) / sum(rule.weight)

    - If any required rule scores 0.0, the block is rejected immediately
    - Rules can return None to be skipped (not counted in average)
    - If no rule applies (total weight is 0), the final score is 0.0
    - Per-rule scores are stored in the score_details for debugging

    Best Practices
    --------------

    **Rule Design:**
    - Use Filter rules to eliminate invalid blocks early
    - Use Score rules to rate blocks on intrinsic properties
    - Set `required=True` for rules that MUST pass (e.g., type filters)
    - Use weights to emphasize important scoring factors

    **Score Object:**
    - Override `_create_score()` to return a custom score subclass
    - Use custom scores to store parsed values (e.g., step numbers)
    - Inherit from `RuleScore` (itself a `Score` subclass), which is the
      declared return type of `_create_score()`

    **Source Blocks:**
    - Override `_get_additional_source_blocks()` to include related blocks
    - Default implementation finds visual effects (shadows, outlines)
    - `effects_margin` (default 2.0) controls how far to look for nearby
      drawings/images; return None to opt out

    Example Implementation
    ----------------------

    .. code-block:: python

        class MyClassifier(RuleBasedClassifier):
            output = "my_label"
            requires = frozenset()  # Or frozenset({"dependency"})

            @property
            def min_score(self) -> float:
                return 0.6  # Require 60% confidence

            @property
            def rules(self) -> Sequence[Rule]:
                return [
                    # Filter: Only accept Text blocks
                    IsInstanceFilter((Text,)),

                    # Required rule: Must be in top half of page
                    PositionScore(
                        scale=LinearScale({0.0: 1.0, 0.5: 0.0}),
                        weight=1.0,
                        required=True,
                    ),

                    # Optional scoring: Prefer larger text
                    FontSizeScore(
                        target_size=24.0,
                        weight=0.5,
                    ),
                ]

            # Optional: Custom score with parsed data
            def _create_score(
                self, components: dict[str, float], total: float,
                source_blocks: Sequence[Blocks]
            ) -> RuleScore:
                # Parse and store additional info from primary block
                value = self._parse_value(source_blocks[0])
                return MyCustomScore(
                    components=components,
                    total_score=total,
                    parsed_value=value,
                )

            def build(self, candidate, result) -> MyElement:
                score = candidate.score_details
                assert isinstance(score, MyCustomScore)
                return MyElement(
                    bbox=candidate.bbox,
                    value=score.parsed_value,
                )

    Built-in Hooks
    --------------

    Override these methods to customize behavior:

    - `_should_accept(score)`: Custom acceptance logic beyond min_score
    - `_create_score()`: Return custom Score subclass with additional data
    - `_get_additional_source_blocks()`: Include related blocks (shadows, etc.)

    Visual Effects Support
    ----------------------

    RuleBasedClassifier automatically includes nearby Drawing/Image blocks
    as visual effects (outlines, shadows) unless `effects_margin` is None:

    .. code-block:: python

        @property
        def effects_margin(self) -> float | None:
            return 2.0  # Include blocks within 2 units (the default)

    Return None from `effects_margin` to disable automatic effect finding.

    This ensures that when a candidate wins, all associated visual effects
    are consumed together, preventing other classifiers from incorrectly
    using shadow/outline blocks.

    See Also
    --------
    - Classifier: Main orchestrator with comprehensive best practices
    - Rule: Base class for scoring rules
    - rules module: Available rule implementations
    """

    @property
    @abstractmethod
    def rules(self) -> Sequence[Rule]:
        """Get the list of rules for this classifier."""
        pass  # Reported as uncovered in this build

    @property
    def min_score(self) -> float:
        """Minimum score threshold for acceptance. Defaults to 0.0."""
        return 0.0

    @property
    def effects_margin(self) -> float | None:
        """Margin to expand block bbox to find visual effects (outlines, shadows).

        If None, no automatic effect finding is performed.
        Defaults to 2.0.
        """
        return 2.0

    def _create_score(
        self,
        components: dict[str, float],
        total_score: float,
        source_blocks: Sequence[Blocks],
    ) -> RuleScore:
        """Create the score object for a candidate.

        Subclasses can override this to return a more specific score type
        that contains additional information (e.g., parsed values, cluster
        validation results).

        Args:
            components: Dictionary of rule name to score
            total_score: The weighted total score from rules
            source_blocks: All blocks that will be part of the candidate.
                The primary block (that passed the rules) is source_blocks[0].
                Additional blocks from _get_additional_source_blocks() follow.

        Returns:
            A RuleScore (or subclass) instance

        TODO: Consider adding a cluster_rules property if multiple classifiers
        need to validate/score complete clusters. This would allow expressing
        cluster validation (e.g., count >= 3, cluster bbox aspect ratio) as
        declarative rules instead of imperative code in _create_score().
        For now, the imperative approach is simpler for the few classifiers
        that need it (e.g., LoosePartSymbolClassifier).
        """
        return RuleScore(components=components, total_score=total_score)

    def _score(self, result: ClassificationResult) -> None:
        """Score blocks using rules."""
        context = RuleContext(result.page_data, self.config, result)
        rules = self.rules

        for block in result.page_data.blocks:
            components = {}
            weighted_sum = 0.0
            total_weight = 0.0
            failed = False

            for rule in rules:
                score = rule.calculate(block, context)

                # If rule returns None, it's skipped (not applicable)
                if score is None:
                    continue

                # If required rule fails (score 0), fail the block immediately
                if rule.required and score == 0.0:
                    failed = True
                    # log.debug(
                    #    "[%s] block_id=%s failed required rule '%s'",
                    #    self.output,
                    #    block.id,
                    #    rule.name,
                    # )
                    break

                rule_weight = rule.weight  # Using direct weight from Rule instance

                weighted_sum += score * rule_weight
                total_weight += rule_weight
                components[rule.name] = score

            if failed:
                continue

            # Calculate final score from rules
            final_score = weighted_sum / total_weight if total_weight > 0 else 0.0

            # Build source blocks list, deduplicating as we go
            seen_ids: set[int] = {block.id}
            source_blocks: list[Blocks] = [block]

            # Add any classifier-specific additional source blocks
            for b in self._get_additional_source_blocks(block, result):
                if b.id not in seen_ids:
                    seen_ids.add(b.id)
                    source_blocks.append(b)

            # Create score object (subclasses can override _create_score)
            # This can validate the complete cluster and adjust the score
            score_details = self._create_score(components, final_score, source_blocks)

            # Get actual score (may differ from final_score after validation)
            actual_score = score_details.score()

            # Check classifier-specific acceptance logic on the actual score
            if not self._should_accept(actual_score):
                log.debug(
                    "[%s] block_id=%s "
                    "rejected: score=%.3f < min_score=%.3f components=%s",
                    self.output,
                    block.id,
                    actual_score,
                    self.min_score,
                    components,
                )
                continue

            log.debug(
                "[%s] block_id=%s cluster accepted: score=%.3f components=%s",
                self.output,
                block.id,
                actual_score,
                components,
            )

            # Compute bbox as the union of all source blocks
            # This ensures the candidate bbox matches the source_blocks union,
            # required by validation (assert_element_bbox_matches_source_and_children)
            candidate_bbox = BBox.union_all([b.bbox for b in source_blocks])

            # Create candidate
            candidate = Candidate(
                bbox=candidate_bbox,
                label=self.output,
                score=actual_score,
                score_details=score_details,
                source_blocks=source_blocks,
            )
            result.add_candidate(candidate)

    def _get_additional_source_blocks(
        self, block: Blocks, result: ClassificationResult
    ) -> Sequence[Blocks]:
        """Get additional source blocks to include with the candidate.

        Subclasses can override this to include related blocks (e.g.,
        overlapping drawings, drop shadows) in the candidate's source_blocks.
        These blocks will be marked as removed if the candidate wins.

        The default implementation automatically includes Drawing/Image blocks
        that appear to be visual effects (outlines, shadows) by calling
        find_contained_effects if self.effects_margin is not None.
        """
        margin = self.effects_margin
        if margin is not None:
            return find_contained_effects(
                block,
                result.page_data.blocks,
                margin=margin,
            )
        return []

    def _should_accept(self, score: float) -> bool:
        """Determine if a score is high enough to be a candidate.

        Subclasses can override this.
        """
        return score >= self.min_score
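
The scoring loop in `_score` can be illustrated in isolation with a minimal sketch. It assumes each rule result is reduced to a plain `(score, weight, required)` tuple; `weighted_score` is a hypothetical helper for illustration only, not part of the module above.

```python
from __future__ import annotations


def weighted_score(
    results: list[tuple[float | None, float, bool]],
) -> float | None:
    """Combine (score, weight, required) triples; None means the block is rejected."""
    weighted_sum = 0.0
    total_weight = 0.0
    for score, weight, required in results:
        if score is None:
            continue  # Rule not applicable: left out of the average
        if required and score == 0.0:
            return None  # A failed required rule rejects the block
        weighted_sum += score * weight
        total_weight += weight
    # With no applicable rules the score falls back to 0.0
    return weighted_sum / total_weight if total_weight > 0 else 0.0


# One required rule passes, one optional rule scores 0.5, one rule is skipped.
accepted = weighted_score([(1.0, 1.0, True), (0.5, 0.5, False), (None, 2.0, False)])
# A required rule scoring 0.0 rejects the block regardless of other rules.
rejected = weighted_score([(0.0, 1.0, True), (1.0, 1.0, False)])
```

Here `accepted` is 1.25 / 1.5 (about 0.833), because the skipped rule's weight of 2.0 is excluded from the denominator, and `rejected` is None.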