IBM / unitxt / 14254422405

03 Apr 2025 11:05PM UTC coverage: 80.304% (+0.07%) from 80.233%
Build 14254422405 · push · github · web-flow

Add retry policy for huggingface assets downloads (#1711)

* Add retry policy for huggingface assets connection

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Enhance retry mechanism to include FileNotFoundError and set max connection retries in test settings

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Nested Exceptions

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Add FileNotFoundError to retry exceptions in exponential backoff decorator

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Update default retry setting to use max_connection_retries from settings

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Add exponential backoff retry decorator to prepare methods in GraniteDocumentsFormat and HFSystemFormat

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Remove download_config from hf_evaluate_load and hf_load_dataset; set max_connection_retries in test settings

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Add unit tests for retry decorator with exponential backoff

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Fix loader cache size assignment and handle None dataset case in LoadHF class

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Add exponential backoff retry to get_splits method and improve error handling for missing datasets

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Add error handling for None dataset case in LoadHF class and improve retry logic in utils

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Re-raise ValueError in hf_load_dataset for improved error handling

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Update GitHub Actions workflow to include constraints file and adjust package installations

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Remove spacy dependency from pyproject.toml

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor error handling in LoadHF class to raise NotImplementedError for None dataset and im... (continued)
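
The retry policy described in the commit message above can be sketched as a decorator. This is a minimal illustration and not the unitxt implementation: the name `retry_with_backoff`, its parameters, the delay formula, and the injectable `sleep` argument are all assumptions. Only the ideas of exponential backoff, a maximum retry count, and retrying on `FileNotFoundError` in addition to connection errors come from the commit message.

```python
import functools
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    max_retries: int = 3,
    backoff_factor: float = 1.0,
    retry_exceptions: Tuple[Type[BaseException], ...] = (
        ConnectionError,
        FileNotFoundError,
    ),
    sleep: Callable[[float], None] = time.sleep,
):
    """Hypothetical decorator: retry a call, doubling the delay each time."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retry_exceptions:
                    # Out of attempts: let the last error propagate.
                    if attempt == max_retries - 1:
                        raise
                    # Assumed delay schedule: factor, 2*factor, 4*factor, ...
                    sleep(backoff_factor * (2**attempt))

        return wrapper

    return decorator
```

Passing `sleep` as a parameter is a design choice for testability: a unit test can record the requested delays instead of actually waiting.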

1579 of 1957 branches covered (80.68%)

Branch coverage included in aggregate %.

9878 of 12310 relevant lines covered (80.24%)

0.8 hits per line
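
The percentages above can be reproduced from the raw counts. The sketch below assumes the aggregate is computed as covered lines plus covered branches, divided by relevant lines plus total branches; the page states only that branch coverage is included in the aggregate, so that formula is an assumption, although it does match the reported 80.304%.

```python
# Counts taken from the build summary above.
covered_lines, relevant_lines = 9878, 12310
covered_branches, total_branches = 1579, 1957


def pct(covered: int, total: int) -> float:
    """Return coverage as a percentage."""
    return 100.0 * covered / total


line_pct = pct(covered_lines, relevant_lines)
branch_pct = pct(covered_branches, total_branches)
# Assumed aggregate formula: lines and branches pooled together.
aggregate_pct = pct(
    covered_lines + covered_branches, relevant_lines + total_branches
)

print(f"lines:     {line_pct:.2f}%")       # lines:     80.24%
print(f"branches:  {branch_pct:.2f}%")     # branches:  80.68%
print(f"aggregate: {aggregate_pct:.3f}%")  # aggregate: 80.304%
```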

Source File: src/unitxt/string_operators.py (86.9% covered)

# Lines marked "not covered" had no hits in this build.
import os
import re
from typing import (
    Any,
    Dict,
    List,
    Optional,
)

from .operators import FieldOperator, InstanceOperator
from .settings_utils import get_settings
from .utils import retry_connection_with_exponential_backoff

settings = get_settings()

class Split(FieldOperator):
    by: str

    def process_value(self, value: str) -> List[str]:
        return value.split(self.by)


class RegexSplit(FieldOperator):
    by: str

    def process_value(self, value: str) -> List[str]:
        return re.split(self.by, value)


class TokensSplit(FieldOperator):
    model: str
    _requirements_list = ["transformers"]

    def prepare(self):
        super().prepare()
        from transformers import AutoTokenizer
        path = self.model
        if settings.hf_offline_models_path is not None:
            path = os.path.join(settings.hf_offline_models_path, path)  # not covered
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def process_value(self, value: str) -> List[str]:
        return self.tokenizer.tokenize(value)


class TokensSlice(FieldOperator):
    model: str
    start: Optional[int] = None
    stop: Optional[int] = None
    step: Optional[int] = None

    _requirements_list = ["transformers"]

    @retry_connection_with_exponential_backoff(backoff_factor=2)
    def prepare(self):
        super().prepare()
        from transformers import AutoTokenizer
        path = self.model
        if settings.hf_offline_models_path is not None:
            path = os.path.join(settings.hf_offline_models_path, path)  # not covered
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def process_value(self, value: str) -> str:
        encoded = self.tokenizer.encode(value)
        slicer = slice(self.start, self.stop, self.step)
        sliced = encoded[slicer]
        return self.tokenizer.decode(sliced)


class Join(FieldOperator):
    by: str

    def process_value(self, value: List[str]) -> str:
        return self.by.join(value)


class FormatText(InstanceOperator):
    to_field: str
    text: str

    def process(
        self, instance: Dict[str, Any], stream_name: Optional[str] = None
    ) -> Dict[str, Any]:
        instance[self.to_field] = self.text.format(**instance)  # not covered
        return instance  # not covered


class Strip(FieldOperator):
    def process_value(self, value: str) -> str:
        return value.strip()  # not covered


class Replace(FieldOperator):
    old: str
    new: str

    def process_value(self, value: str) -> str:
        return value.replace(self.old, self.new)  # not covered


class MapReplace(FieldOperator):
    mapping: Dict[str, str]

    def process_value(self, value: Any) -> Any:
        for key, val in self.mapping.items():  # not covered
            value = value.replace(key, val)  # not covered
        return value  # not covered


class RegexReplace(FieldOperator):
    pattern: str  # A regex pattern
    replacement: str  # The replacement string or template

    def prepare(self):
        super().prepare()
        self.pattern = re.compile(self.pattern)

    def process_value(self, value: Any) -> Any:
        if isinstance(value, str):
            return re.sub(self.pattern, self.replacement, value)
        return value  # If not a string, return the value as is (not covered)
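
Most of the uncovered lines in the listing above are short string transformations. The sketch below reproduces that logic as plain functions so the behavior a test would need to assert is easy to see. The functions `format_text`, `map_replace`, and `regex_replace` are hypothetical stand-ins written for illustration; they are not the unitxt operator classes and do not use the unitxt API.

```python
import re
from typing import Any, Dict


def format_text(instance: Dict[str, Any], to_field: str, text: str) -> Dict[str, Any]:
    """Mirror of the FormatText logic: fill the template from the instance."""
    instance[to_field] = text.format(**instance)
    return instance


def map_replace(value: str, mapping: Dict[str, str]) -> str:
    """Mirror of the MapReplace logic: apply replacements in mapping order."""
    for key, val in mapping.items():
        value = value.replace(key, val)
    return value


def regex_replace(value: Any, pattern: str, replacement: str) -> Any:
    """Mirror of the RegexReplace logic: non-strings pass through unchanged."""
    if isinstance(value, str):
        return re.sub(re.compile(pattern), replacement, value)
    return value


print(format_text({"name": "Ada"}, "greeting", "Hello {name}"))
# {'name': 'Ada', 'greeting': 'Hello Ada'}
print(map_replace("a-b", {"a": "b", "b": "c"}))  # c-c
print(regex_replace("a1b22", r"\d+", "#"))       # a#b#
print(regex_replace(42, r"\d+", "#"))            # 42
```

The `map_replace` output shows that replacements are applied sequentially, so an earlier substitution can be rewritten by a later one. A test for the uncovered loop would probably want to pin down that ordering.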