• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

openvax / pyensembl / 25808026971

13 May 2026 03:10PM UTC coverage: 85.335% (+0.4%) from 84.971%
25808026971

Pull #355

github

iskandr
Preserve FASTA-header versions in SequenceData (#351)

`_parse_header_id` no longer strips ENS .N version suffixes — the
versioned form is preserved as the canonical FASTA identifier, with
the version treated as authoritative source-of-truth for the
sequence's identity. Also added pipe-delimited (GENCODE) header
handling: the leading field before `|` is now the identifier rather
than the whole pipe-packed blob with everything-after-the-first-dot
silently dropped.

`SequenceData` now maintains:
  - `_fasta_dictionary` keyed on whichever form the header carried
    (versioned when available, bare otherwise)
  - `_stripped_index: bare_ens_id -> versioned_id`, so bare-ID
    lookups still resolve against a versioned FASTA (GENCODE case)
  - `_versions: id -> int`, exposed via the new `fasta_version(id)`
    accessor for downstream sanity-checking of FASTA / GTF alignment

`lookup_sequence_with_version_fallback` resolves both directions —
versioned caller against bare FASTA (existing Ensembl path) and
bare caller against versioned FASTA (new GENCODE path) — without
the runtime `.rsplit` on every miss the old helper did. Defensive
fallback for callers that hold a `SequenceData` without the new
attributes (e.g. unpickled old object).

Bumped the on-disk pickle filename to `<fasta>.v2.pickle` so stale
caches from the previous bare-key layout get ignored instead of
loaded.

Closes #351. Companion to gtfparse#67 (attribute_aliases +
cast_version_columns), which together close out the GENCODE
compatibility regression originally reported in #335.
Pull Request #355: Preserve FASTA-header versions in SequenceData (closes #351)

47 of 48 new or added lines in 3 files covered. (97.92%)

14 existing lines in 3 files now uncovered.

1641 of 1923 relevant lines covered (85.34%)

3.41 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

86.89
/pyensembl/sequence_data.py
1
# Licensed under the Apache License, Version 2.0 (the "License");
2
# you may not use this file except in compliance with the License.
3
# You may obtain a copy of the License at
4
#
5
#     http://www.apache.org/licenses/LICENSE-2.0
6
#
7
# Unless required by applicable law or agreed to in writing, software
8
# distributed under the License is distributed on an "AS IS" BASIS,
9
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
# See the License for the specific language governing permissions and
11
# limitations under the License.
12

13
from os import remove
4✔
14
from os.path import dirname, exists, abspath, split, join
4✔
15

16
import datacache
4✔
17
import logging
4✔
18
from collections import Counter
4✔
19
import pickle
4✔
20
from .common import load_pickle, dump_pickle
4✔
21
from .fasta import _split_ens_version, parse_fasta_dictionary
4✔
22

23

24
logger = logging.getLogger(__name__)
4✔
25

26

27
# Bump the FASTA-pickle filename suffix when the on-disk layout of
28
# `_fasta_dictionary` changes so stale caches get rebuilt instead of
29
# silently loaded under the new code path. v2 keys versioned ENS IDs.
30
FASTA_PICKLE_SCHEMA_VERSION = "v2"
4✔
31

32

33
def lookup_sequence_with_version_fallback(sequence_data, identifier):
4✔
34
    """
35
    Look up ``identifier`` in ``sequence_data``, tolerating ENS ``.N``
36
    version-suffix mismatches in either direction.
37

38
    Both Ensembl and GENCODE work with versioned IDs (e.g.
39
    ``ENSP00000123456.3``); the two formats just split that information
40
    differently. Ensembl GTFs keep the bare ID in ``protein_id`` /
41
    ``transcript_id`` and put the version in a separate ``*_version``
42
    attribute, while GENCODE GTFs embed the version directly in the ID
43
    attribute. Whichever convention the FASTA on disk used, the
44
    sequence dict now keys on the form that header carried — so either
45
    a versioned or a bare caller-supplied ID may need a fallback.
46

47
    Resolution order:
48
      1. literal lookup of ``identifier``
49
      2. if ``identifier`` is a versioned ENS ID, strip ``.N`` and retry
50
         (handles caller-supplied versioned ID against a bare FASTA)
51
      3. consult ``sequence_data._stripped_index`` for a versioned alias
52
         of a bare caller-supplied ID (handles bare ID against a
53
         versioned FASTA, the GENCODE case)
54
      4. None
55

56
    Non-Ensembl IDs (e.g. TAIR ``AT1G01010.1`` where ``.N`` is an
57
    isoform suffix, not a version) skip step (2): stripping there is
58
    semantically wrong.
59
    """
60
    if not identifier:
4✔
61
        return None
4✔
62
    sequence = sequence_data.get(identifier)
4✔
63
    if sequence is not None:
4✔
64
        return sequence
4✔
65
    if identifier.startswith("ENS") and "." in identifier:
4✔
66
        bare, version = _split_ens_version(identifier)
4✔
67
        if version is not None:
4✔
68
            sequence = sequence_data.get(bare)
4✔
69
            if sequence is not None:
4✔
70
                return sequence
4✔
71
    stripped_index = getattr(sequence_data, "_stripped_index", None)
4✔
72
    if stripped_index:
4✔
73
        versioned = stripped_index.get(identifier)
4✔
74
        if versioned is not None:
4✔
75
            return sequence_data.get(versioned)
4✔
76
    return None
4✔
77

78

79
class SequenceData(object):
4✔
80
    """
81
    Container for reference nucleotide and amino acid sequenes.
82
    """
83

84
    def __init__(self, fasta_paths, cache_directory_path=None):
4✔
85
        if type(fasta_paths) is str:
4✔
UNCOV
86
            fasta_paths = [fasta_paths]
×
87

88
        self.fasta_paths = [abspath(path) for path in fasta_paths]
4✔
89
        self.fasta_directory_paths = [split(path)[0] for path in self.fasta_paths]
4✔
90
        self.fasta_filenames = [split(path)[1] for path in self.fasta_paths]
4✔
91
        if cache_directory_path:
4✔
92
            self.cache_directory_paths = [cache_directory_path] * len(self.fasta_paths)
4✔
93
        else:
UNCOV
94
            self.cache_directory_paths = self.fasta_directory_paths
×
95
        for path in self.fasta_paths:
4✔
96
            if not exists(path):
4✔
UNCOV
97
                raise ValueError("Couldn't find FASTA file %s" % (path,))
×
98
        self.fasta_dictionary_filenames = [
4✔
99
            "%s.%s.pickle" % (filename, FASTA_PICKLE_SCHEMA_VERSION)
100
            for filename in self.fasta_filenames
101
        ]
102
        self.fasta_dictionary_pickle_paths = [
4✔
103
            join(cache_path, filename)
104
            for cache_path, filename in zip(
105
                self.cache_directory_paths, self.fasta_dictionary_filenames
106
            )
107
        ]
108
        self._init_lazy_fields()
4✔
109

110
    def _init_lazy_fields(self):
4✔
111
        self._fasta_dictionary = None
4✔
112
        self._fasta_keys = None
4✔
113
        # Maps a bare ENS ID -> the versioned form actually keyed in
114
        # _fasta_dictionary, so callers passing the unversioned form can
115
        # still resolve to a versioned FASTA entry (GENCODE case).
116
        self._stripped_index = None
4✔
117
        # Maps a sequence identifier -> the integer version suffix the
118
        # FASTA header carried, or absent for headers without a version.
119
        # Lets callers sanity-check FASTA / GTF alignment.
120
        self._versions = None
4✔
121

122
    def clear_cache(self):
4✔
123
        self._init_lazy_fields()
4✔
124
        for path in self.fasta_dictionary_pickle_paths:
4✔
125
            if exists(path):
4✔
126
                remove(path)
4✔
127

128
    def __str__(self):
4✔
129
        return "SequenceData(fasta_paths=%s)" % (self.fasta_paths,)
×
130

131
    def __repr__(self):
4✔
UNCOV
132
        return str(self)
×
133

134
    def __contains__(self, sequence_id):
4✔
135
        if self._fasta_keys is None:
×
UNCOV
136
            self._fasta_keys = set(self.fasta_dictionary.keys())
×
UNCOV
137
        return sequence_id in self._fasta_keys
×
138

139
    def __eq__(self, other):
4✔
140
        # test to see if self.fasta_paths and other.fasta_paths contain
141
        # the same list of paths, regardless of order
UNCOV
142
        return (other.__class__ is SequenceData) and Counter(
×
143
            self.fasta_paths
144
        ) == Counter(other.fasta_paths)
145

146
    def __hash__(self):
4✔
UNCOV
147
        return hash(self.fasta_paths)
×
148

149
    def _add_to_fasta_dictionary(self, fasta_dictionary_tmp):
4✔
150
        for identifier, sequence in fasta_dictionary_tmp.items():
4✔
151
            if identifier in self._fasta_dictionary:
4✔
152
                logger.warn(
×
153
                    "Sequence identifier %s is duplicated in your FASTA files!"
154
                    % identifier
155
                )
156
                continue
×
157
            self._fasta_dictionary[identifier] = sequence
4✔
158
            bare, version = _split_ens_version(identifier)
4✔
159
            if version is not None:
4✔
160
                if bare in self._stripped_index:
4✔
161
                    # Two FASTA entries collide on the same bare ENS ID with
162
                    # different versions. The newer of the two is kept as
163
                    # the canonical alias since version numbers grow
164
                    # monotonically.
165
                    existing = self._stripped_index[bare]
4✔
166
                    _, existing_version = _split_ens_version(existing)
4✔
167
                    if existing_version is None or version > existing_version:
4✔
168
                        self._stripped_index[bare] = identifier
4✔
169
                else:
170
                    self._stripped_index[bare] = identifier
4✔
171
                self._versions[identifier] = version
4✔
172

173
    def _load_or_create_fasta_dictionary_pickle(self):
4✔
174
        self._fasta_dictionary = dict()
4✔
175
        self._stripped_index = dict()
4✔
176
        self._versions = dict()
4✔
177
        for fasta_path, pickle_path in zip(
4✔
178
            self.fasta_paths, self.fasta_dictionary_pickle_paths
179
        ):
180
            if exists(pickle_path):
4✔
181
                # try loading the cached file
182
                # but we'll fall back on recreating it if loading fails
183
                try:
4✔
184
                    fasta_dictionary_tmp = load_pickle(pickle_path)
4✔
185
                    self._add_to_fasta_dictionary(fasta_dictionary_tmp)
4✔
186
                    logger.info("Loaded sequence dictionary from %s", pickle_path)
4✔
187
                    continue
4✔
UNCOV
188
                except (pickle.UnpicklingError, AttributeError):
×
189
                    # catch either an UnpicklingError or an AttributeError
190
                    # resulting from pickled objects refering to classes
191
                    # that no longer exists
UNCOV
192
                    logger.warn(
×
193
                        "Failed to load %s, attempting to read FASTA directly",
194
                        pickle_path,
195
                    )
196
            logger.info("Parsing sequences from FASTA file at %s", fasta_path)
4✔
197

198
            fasta_dictionary_tmp = parse_fasta_dictionary(fasta_path)
4✔
199
            self._add_to_fasta_dictionary(fasta_dictionary_tmp)
4✔
200
            logger.info("Saving sequence dictionary to %s", pickle_path)
4✔
201
            datacache.ensure_dir(dirname(pickle_path))
4✔
202
            dump_pickle(fasta_dictionary_tmp, pickle_path)
4✔
203

204
    def index(self, overwrite=False):
4✔
205
        if overwrite:
4✔
UNCOV
206
            self.clear_cache()
×
207
        self._load_or_create_fasta_dictionary_pickle()
4✔
208

209
    @property
4✔
210
    def fasta_dictionary(self):
4✔
211
        if not self._fasta_dictionary:
4✔
212
            self._load_or_create_fasta_dictionary_pickle()
4✔
213
        return self._fasta_dictionary
4✔
214

215
    def get(self, sequence_id):
4✔
216
        """Get sequence associated with given ID or return None if missing"""
217
        return self.fasta_dictionary.get(sequence_id)
4✔
218

219
    def fasta_version(self, sequence_id):
4✔
220
        """
221
        Return the integer FASTA-header version for ``sequence_id``, or
222
        ``None`` if the header didn't carry a version (e.g. older Ensembl
223
        releases) or the ID isn't in the FASTA.
224

225
        Accepts either the versioned or bare form of an ENS ID — the
226
        bare form is resolved through ``_stripped_index``.
227
        """
228
        if not sequence_id:
4✔
NEW
229
            return None
×
230
        # ensure lazy load
231
        _ = self.fasta_dictionary
4✔
232
        version = self._versions.get(sequence_id)
4✔
233
        if version is not None:
4✔
234
            return version
4✔
235
        versioned = self._stripped_index.get(sequence_id)
4✔
236
        if versioned is not None:
4✔
237
            return self._versions.get(versioned)
4✔
238
        return None
4✔
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2026 Coveralls, Inc