• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

systemsdk / phpcpd / #4

30 Dec 2024 04:42PM UTC coverage: 75.789%. Remained the same
#4

push

DKravtsov
phpcpd 8.0.0 release. Made codebase refactoring. Updated requirements php 8.3, updated composer dependencies, updated tests to the PHPUnit 11. Updated dev environment to the php 8.4, Phing 3.0, added code quality tools: ecs, phpstan.

271 of 379 new or added lines in 20 files covered. (71.5%)

65 existing lines in 13 files now uncovered.

648 of 855 relevant lines covered (75.79%)

2.18 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

98.48
/src/Detector/Strategy/SuffixTree/SuffixTree.php
1
<?php
2

3
declare(strict_types=1);
4

5
namespace Systemsdk\PhpCPD\Detector\Strategy\SuffixTree;
6

7
/**
8
 * Efficient linear time constructible suffix tree using Ukkonen's online
9
 * construction algorithm (E. Ukkonen: "On-line construction of suffix trees").
10
 * Most of the comments reference this paper and it might be hard to follow
11
 * without knowing at least the basics of it.
12
 * <p>
13
 * We use some conventions which are slightly different from the paper however:
14
 * <ul>
15
 * <li>The names of the variables are different, but we give a translation into
16
 * Ukkonen's names.</li>
17
 * <li>Many variables are made "global" by realizing them as fields. This way we
18
 * can easily deal with those tuple return values without constructing extra
19
 * classes.</li>
20
 * <li>String indices start at 0 (not at 1).</li>
21
 * <li>Substrings are marked by the first index and the index after the last one
22
 * (just as in C++ STL) instead of the first and the last index (i.e. intervals
23
 * are right-open instead of closed). This makes it more intuitive to express
24
 * the empty string (i.e. (i,i) instead of (i,i-1)).</li>
25
 * </ul>
26
 * <p>
27
 * Everything but the construction itself is protected to simplify increasing
28
 * its functionality by subclassing but without introducing new method calls.
29
 *
30
 * @author Benjamin Hummel
31
 * @author $Author: kinnen $
32
 *
33
 * @version $Revision: 41751 $
34
 *
35
 * @ConQAT.Rating GREEN Hash: 4B2EF0606B3085A6831764ED042FF20D
36
 */
37
class SuffixTree
38
{
39
    /**
40
     * Infinity in this context.
41
     *
42
     * @var int
43
     */
44
    protected $INFTY;
45

46
    /**
47
     * The word we are working on.
48
     *
49
     * @var AbstractToken[]
50
     */
51
    protected $word;
52

53
    /**
54
     * The number of nodes created so far.
55
     *
56
     * @var int
57
     */
58
    protected $numNodes = 0;
59

60
    /**
61
     * For each node this holds the index of the first character of
62
     * {@link #word} labeling the transition <b>to</b> this node. This
63
     * corresponds to the <em>k</em> for a transition used in Ukkonen's paper.
64
     *
65
     * @var int[]
66
     */
67
    protected $nodeWordBegin;
68

69
    /**
70
     * For each node this holds the index of the one after the last character of
71
     * {@link #word} labeling the transition <b>to</b> this node. This
72
     * corresponds to the <em>p</em> for a transition used in Ukkonen's paper.
73
     *
74
     * @var int[]
75
     */
76
    protected $nodeWordEnd;
77

78
    /** For each node its suffix link (called function <em>f</em> by Ukkonen).
79
     * @var int[] */
80
    protected $suffixLink;
81

82
    /**
83
     * The next node function realized as a hash table. This corresponds to the
84
     * <em>g</em> function used in Ukkonen's paper.
85
     *
86
     * @var SuffixTreeHashTable
87
     */
88
    protected $nextNode;
89

90
    /**
91
     * An array giving for each node the index where the first child will be
92
     * stored (or -1 if it has no children). It is initially empty and will be
93
     * filled "on demand" using
94
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
95
     * (int[], int[], int[])}.
96
     *
97
     * @var int[]
98
     */
99
    protected $nodeChildFirst = [];
100

101
    /**
102
     * This array gives the next index of the child list or -1 if this is the
103
     * last one. It is initially empty and will be filled "on demand" using
104
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
105
     * (int[], int[], int[])}.
106
     *
107
     * @var int[]
108
     */
109
    protected $nodeChildNext = [];
110

111
    /**
112
     * This array stores the actual name (=number) of the mode in the child
113
     * list. It is initially empty and will be filled "on demand" using
114
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
115
     * (int[], int[], int[])}.
116
     *
117
     * @var int[]
118
     */
119
    protected $nodeChildNode = [];
120

121
    /**
122
     * The node we are currently at as a "global" variable (as it is always
123
     * passed unchanged). This is called <i>s</i> in Ukkonen's paper.
124
     *
125
     * @var int
126
     */
127
    private $currentNode = 0;
128

129
    /**
130
     * Beginning of the word part of the reference pair. This is kept "global"
131
     * (in constrast to the end) as this is passed unchanged to all functions.
132
     * Ukkonen calls this <em>k</em>.
133
     *
134
     * @var int
135
     */
136
    private $refWordBegin = 0;
137

138
    /**
139
     * This is the new (or old) explicit state as returned by
140
     * {@link #testAndSplit(int, Object)}. Ukkonen calls this <em>r</em>.
141
     *
142
     * @var int
143
     */
144
    private $explicitNode = 0;
145

146
    /**
147
     * Create a new suffix tree from a given word. The word given as parameter
148
     * is used internally and should not be modified anymore, so copy it before
149
     * if required.
150
     *
151
     * @param AbstractToken[] $word
152
     */
153
    public function __construct($word)
154
    {
155
        $this->word = $word;
1✔
156
        $size = count($word);
1✔
157
        $this->INFTY = $size;
1✔
158

159
        $expectedNodes = 2 * $size;
1✔
160
        $this->nodeWordBegin = array_fill(0, $expectedNodes, 0);
1✔
161
        $this->nodeWordEnd = array_fill(0, $expectedNodes, 0);
1✔
162
        $this->suffixLink = array_fill(0, $expectedNodes, 0);
1✔
163
        $this->nextNode = new SuffixTreeHashTable($expectedNodes);
1✔
164

165
        $this->createRootNode();
1✔
166

167
        for ($i = 0; $i < $size; $i++) {
1✔
168
            $this->update($i);
1✔
169
            $this->canonize($i + 1);
1✔
170
        }
171
    }
172

173
    /**
174
     * This method makes sure the child lists are filled (required for
175
     * traversing the tree).
176
     */
177
    protected function ensureChildLists(): void
178
    {
179
        if ($this->nodeChildFirst == null || count($this->nodeChildFirst) < $this->numNodes) {
1✔
180
            $this->nodeChildFirst = array_fill(0, $this->numNodes, 0);
1✔
181
            $this->nodeChildNext = array_fill(0, $this->numNodes, 0);
1✔
182
            $this->nodeChildNode = array_fill(0, $this->numNodes, 0);
1✔
183
            $this->nextNode->extractChildLists($this->nodeChildFirst, $this->nodeChildNext, $this->nodeChildNode);
1✔
184
        }
185
    }
186

187
    /**
188
     * Creates the root node.
189
     */
190
    private function createRootNode(): void
191
    {
192
        $this->numNodes = 1;
1✔
193
        $this->nodeWordBegin[0] = 0;
1✔
194
        $this->nodeWordEnd[0] = 0;
1✔
195
        $this->suffixLink[0] = -1;
1✔
196
    }
197

198
    /**
199
     * The <em>update</em> function as defined in Ukkonen's paper. This inserts
200
     * the character at charPos into the tree. It works on the canonical
201
     * reference pair ({@link #currentNode}, ({@link #refWordBegin}, charPos)).
202
     */
203
    private function update(int $charPos): void
204
    {
205
        $lastNode = 0;
1✔
206

207
        while (!$this->testAndSplit($charPos, $this->word[$charPos])) {
1✔
208
            $newNode = $this->numNodes++;
1✔
209
            $this->nodeWordBegin[$newNode] = $charPos;
1✔
210
            $this->nodeWordEnd[$newNode] = $this->INFTY;
1✔
211
            $this->nextNode->put($this->explicitNode, $this->word[$charPos], $newNode);
1✔
212

213
            if ($lastNode != 0) {
1✔
214
                $this->suffixLink[$lastNode] = $this->explicitNode;
1✔
215
            }
216
            $lastNode = $this->explicitNode;
1✔
217
            $this->currentNode = $this->suffixLink[$this->currentNode];
1✔
218
            $this->canonize($charPos);
1✔
219
        }
220

221
        if ($lastNode != 0) {
1✔
222
            $this->suffixLink[$lastNode] = $this->currentNode;
1✔
223
        }
224
    }
225

226
    /**
227
     * The <em>test-and-split</em> function as defined in Ukkonen's paper. This
228
     * checks whether the state given by the canonical reference pair (
229
     * {@link #currentNode}, ({@link #refWordBegin}, refWordEnd)) is the end
230
     * point (by checking whether a transition for the
231
     * <code>nextCharacter</code> exists). Additionally the state is made
232
     * explicit if it not already is and this is not the end-point. It returns
233
     * true if the end-point was reached. The newly created (or reached)
234
     * explicit node is returned in the "global" variable.
235
     */
236
    private function testAndSplit(int $refWordEnd, AbstractToken $nextCharacter): bool
237
    {
238
        if ($this->currentNode < 0) {
1✔
239
            // trap state is always end state
UNCOV
240
            return true;
×
241
        }
242

243
        if ($refWordEnd <= $this->refWordBegin) {
1✔
244
            if ($this->nextNode->get($this->currentNode, $nextCharacter) < 0) {
1✔
245
                $this->explicitNode = $this->currentNode;
1✔
246

247
                return false;
1✔
248
            }
249

250
            return true;
1✔
251
        }
252

253
        $next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);
1✔
254

255
        if ($nextCharacter->equals($this->word[$this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin])) {
1✔
256
            return true;
1✔
257
        }
258

259
        // not an end-point and not explicit, so make it explicit.
260
        $this->explicitNode = $this->numNodes++;
1✔
261
        $this->nodeWordBegin[$this->explicitNode] = $this->nodeWordBegin[$next];
1✔
262
        $this->nodeWordEnd[$this->explicitNode] = $this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin;
1✔
263
        $this->nextNode->put($this->currentNode, $this->word[$this->refWordBegin], $this->explicitNode);
1✔
264

265
        $this->nodeWordBegin[$next] += $refWordEnd - $this->refWordBegin;
1✔
266
        $this->nextNode->put($this->explicitNode, $this->word[$this->nodeWordBegin[$next]], $next);
1✔
267

268
        return false;
1✔
269
    }
270

271
    /**
272
     * The <em>canonize</em> function as defined in Ukkonen's paper. Changes the
273
     * reference pair (currentNode, (refWordBegin, refWordEnd)) into a canonical
274
     * reference pair. It works on the "global" variables {@link #currentNode}
275
     * and {@link #refWordBegin} and the parameter, writing the result back to
276
     * the globals.
277
     *
278
     * @param int $refWordEnd one after the end index for the word of the reference pair
279
     */
280
    private function canonize(int $refWordEnd): void
281
    {
282
        if ($this->currentNode === -1) {
1✔
283
            // explicitly handle trap state
284
            $this->currentNode = 0;
1✔
285
            $this->refWordBegin++;
1✔
286
        }
287

288
        if ($refWordEnd <= $this->refWordBegin) {
1✔
289
            // empty word, so already canonical
290
            return;
1✔
291
        }
292

293
        $next = $this->nextNode->get(
1✔
294
            $this->currentNode,
1✔
295
            $this->word[$this->refWordBegin]
1✔
296
        );
1✔
297

298
        while ($refWordEnd - $this->refWordBegin >= $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next]) {
1✔
299
            $this->refWordBegin += $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next];
1✔
300
            $this->currentNode = $next;
1✔
301

302
            if ($refWordEnd > $this->refWordBegin) {
1✔
303
                $next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);
1✔
304
            } else {
305
                break;
1✔
306
            }
307
        }
308
    }
309
}
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2026 Coveralls, Inc