systemsdk / phpcpd / #4

Committed 30 Dec 2024 04:42PM UTC coverage: 75.789%. Remained the same

Build # #4

Build Type

push

Committed by DKravtsov

Commit Message

phpcpd 8.0.0 release. Made codebase refactoring. Updated requirements php 8.3, updated composer dependencies, updated tests to the PHPUnit 11. Updated dev environment to the php 8.4, Phing 3.0, added code quality tools: ecs, phpstan.

Run Details

271 of 379 new or added lines in 20 files covered. (71.5%)

65 existing lines in 13 files now uncovered.

648 of 855 relevant lines covered (75.79%)

2.18 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

98.48

/src/Detector/Strategy/SuffixTree/SuffixTree.php

<?php

declare(strict_types=1);

namespace Systemsdk\PhpCPD\Detector\Strategy\SuffixTree;

/**
 * Efficient linear time constructible suffix tree using Ukkonen's online
 * construction algorithm (E. Ukkonen: "On-line construction of suffix trees").
 * Most of the comments reference this paper and it might be hard to follow
 * without knowing at least the basics of it.
 * <p>
 * We use some conventions which are slightly different from the paper however:
 * <ul>
 * <li>The names of the variables are different, but we give a translation into
 * Ukkonen's names.</li>
 * <li>Many variables are made "global" by realizing them as fields. This way we
 * can easily deal with those tuple return values without constructing extra
 * classes.</li>
 * <li>String indices start at 0 (not at 1).</li>
 * <li>Substrings are marked by the first index and the index after the last one
 * (just as in C++ STL) instead of the first and the last index (i.e. intervals
 * are right-open instead of closed). This makes it more intuitive to express
 * the empty string (i.e. (i,i) instead of (i,i-1)).</li>
 * </ul>
 * <p>
 * Everything but the construction itself is protected to simplify increasing
 * its functionality by subclassing but without introducing new method calls.
 *
 * @author Benjamin Hummel
 * @author $Author: kinnen $
 *
 * @version $Revision: 41751 $
 *
 * @ConQAT.Rating GREEN Hash: 4B2EF0606B3085A6831764ED042FF20D
 */
class SuffixTree
{
    /**
     * Infinity in this context.
     *
     * @var int
     */
    protected $INFTY;

    /**
     * The word we are working on.
     *
     * @var AbstractToken[]
     */
    protected $word;

    /**
     * The number of nodes created so far.
     *
     * @var int
     */
    protected $numNodes = 0;

    /**
     * For each node this holds the index of the first character of
     * {@link #word} labeling the transition <b>to</b> this node. This
     * corresponds to the <em>k</em> for a transition used in Ukkonen's paper.
     *
     * @var int[]
     */
    protected $nodeWordBegin;

    /**
     * For each node this holds the index of the one after the last character of
     * {@link #word} labeling the transition <b>to</b> this node. This
     * corresponds to the <em>p</em> for a transition used in Ukkonen's paper.
     *
     * @var int[]
     */
    protected $nodeWordEnd;

    /** For each node its suffix link (called function <em>f</em> by Ukkonen).
     * @var int[] */
    protected $suffixLink;

    /**
     * The next node function realized as a hash table. This corresponds to the
     * <em>g</em> function used in Ukkonen's paper.
     *
     * @var SuffixTreeHashTable
     */
    protected $nextNode;

    /**
     * An array giving for each node the index where the first child will be
     * stored (or -1 if it has no children). It is initially empty and will be
     * filled "on demand" using
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
     * (int[], int[], int[])}.
     *
     * @var int[]
     */
    protected $nodeChildFirst = [];

    /**
     * This array gives the next index of the child list or -1 if this is the
     * last one. It is initially empty and will be filled "on demand" using
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
     * (int[], int[], int[])}.
     *
     * @var int[]
     */
    protected $nodeChildNext = [];

    /**
     * This array stores the actual name (=number) of the mode in the child
     * list. It is initially empty and will be filled "on demand" using
     * {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
     * (int[], int[], int[])}.
     *
     * @var int[]
     */
    protected $nodeChildNode = [];

    /**
     * The node we are currently at as a "global" variable (as it is always
     * passed unchanged). This is called <i>s</i> in Ukkonen's paper.
     *
     * @var int
     */
    private $currentNode = 0;

    /**
     * Beginning of the word part of the reference pair. This is kept "global"
     * (in constrast to the end) as this is passed unchanged to all functions.
     * Ukkonen calls this <em>k</em>.
     *
     * @var int
     */
    private $refWordBegin = 0;

    /**
     * This is the new (or old) explicit state as returned by
     * {@link #testAndSplit(int, Object)}. Ukkonen calls this <em>r</em>.
     *
     * @var int
     */
    private $explicitNode = 0;

    /**
     * Create a new suffix tree from a given word. The word given as parameter
     * is used internally and should not be modified anymore, so copy it before
     * if required.
     *
     * @param AbstractToken[] $word
     */
    public function __construct($word)
    {
        $this->word = $word;
        $size = count($word);
        $this->INFTY = $size;

        $expectedNodes = 2 * $size;
        $this->nodeWordBegin = array_fill(0, $expectedNodes, 0);
        $this->nodeWordEnd = array_fill(0, $expectedNodes, 0);
        $this->suffixLink = array_fill(0, $expectedNodes, 0);
        $this->nextNode = new SuffixTreeHashTable($expectedNodes);

        $this->createRootNode();

        for ($i = 0; $i < $size; $i++) {
            $this->update($i);
            $this->canonize($i + 1);
        }
    }

    /**
     * This method makes sure the child lists are filled (required for
     * traversing the tree).
     */
    protected function ensureChildLists(): void
    {
        if ($this->nodeChildFirst == null || count($this->nodeChildFirst) < $this->numNodes) {
            $this->nodeChildFirst = array_fill(0, $this->numNodes, 0);
            $this->nodeChildNext = array_fill(0, $this->numNodes, 0);
            $this->nodeChildNode = array_fill(0, $this->numNodes, 0);
            $this->nextNode->extractChildLists($this->nodeChildFirst, $this->nodeChildNext, $this->nodeChildNode);
        }
    }

    /**
     * Creates the root node.
     */
    private function createRootNode(): void
    {
        $this->numNodes = 1;
        $this->nodeWordBegin[0] = 0;
        $this->nodeWordEnd[0] = 0;
        $this->suffixLink[0] = -1;
    }

    /**
     * The <em>update</em> function as defined in Ukkonen's paper. This inserts
     * the character at charPos into the tree. It works on the canonical
     * reference pair ({@link #currentNode}, ({@link #refWordBegin}, charPos)).
     */
    private function update(int $charPos): void
    {
        $lastNode = 0;

        while (!$this->testAndSplit($charPos, $this->word[$charPos])) {
            $newNode = $this->numNodes++;
            $this->nodeWordBegin[$newNode] = $charPos;
            $this->nodeWordEnd[$newNode] = $this->INFTY;
            $this->nextNode->put($this->explicitNode, $this->word[$charPos], $newNode);

            if ($lastNode != 0) {
                $this->suffixLink[$lastNode] = $this->explicitNode;
            }
            $lastNode = $this->explicitNode;
            $this->currentNode = $this->suffixLink[$this->currentNode];
            $this->canonize($charPos);
        }

        if ($lastNode != 0) {
            $this->suffixLink[$lastNode] = $this->currentNode;
        }
    }

    /**
     * The <em>test-and-split</em> function as defined in Ukkonen's paper. This
     * checks whether the state given by the canonical reference pair (
     * {@link #currentNode}, ({@link #refWordBegin}, refWordEnd)) is the end
     * point (by checking whether a transition for the
     * <code>nextCharacter</code> exists). Additionally the state is made
     * explicit if it not already is and this is not the end-point. It returns
     * true if the end-point was reached. The newly created (or reached)
     * explicit node is returned in the "global" variable.
     */
    private function testAndSplit(int $refWordEnd, AbstractToken $nextCharacter): bool
    {
        if ($this->currentNode < 0) {
            // trap state is always end state
            return true;
        }

        if ($refWordEnd <= $this->refWordBegin) {
            if ($this->nextNode->get($this->currentNode, $nextCharacter) < 0) {
                $this->explicitNode = $this->currentNode;

                return false;
            }

            return true;
        }

        $next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);

        if ($nextCharacter->equals($this->word[$this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin])) {
            return true;
        }

        // not an end-point and not explicit, so make it explicit.
        $this->explicitNode = $this->numNodes++;
        $this->nodeWordBegin[$this->explicitNode] = $this->nodeWordBegin[$next];
        $this->nodeWordEnd[$this->explicitNode] = $this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin;
        $this->nextNode->put($this->currentNode, $this->word[$this->refWordBegin], $this->explicitNode);

        $this->nodeWordBegin[$next] += $refWordEnd - $this->refWordBegin;
        $this->nextNode->put($this->explicitNode, $this->word[$this->nodeWordBegin[$next]], $next);

        return false;
    }

    /**
     * The <em>canonize</em> function as defined in Ukkonen's paper. Changes the
     * reference pair (currentNode, (refWordBegin, refWordEnd)) into a canonical
     * reference pair. It works on the "global" variables {@link #currentNode}
     * and {@link #refWordBegin} and the parameter, writing the result back to
     * the globals.
     *
     * @param int $refWordEnd one after the end index for the word of the reference pair
     */
    private function canonize(int $refWordEnd): void
    {
        if ($this->currentNode === -1) {
            // explicitly handle trap state
            $this->currentNode = 0;
            $this->refWordBegin++;
        }

        if ($refWordEnd <= $this->refWordBegin) {
            // empty word, so already canonical
            return;
        }

        $next = $this->nextNode->get(
            $this->currentNode,
            $this->word[$this->refWordBegin]
        );

        while ($refWordEnd - $this->refWordBegin >= $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next]) {
            $this->refWordBegin += $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next];
            $this->currentNode = $next;

            if ($refWordEnd > $this->refWordBegin) {
                $next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);
            } else {
                break;
            }
        }
    }
}

1	<?php
2
3	declare(strict_types=1);
4
5	namespace Systemsdk\PhpCPD\Detector\Strategy\SuffixTree;
6
7	/**
8	* Efficient linear time constructible suffix tree using Ukkonen's online
9	* construction algorithm (E. Ukkonen: "On-line construction of suffix trees").
10	* Most of the comments reference this paper and it might be hard to follow
11	* without knowing at least the basics of it.
12	* <p>
13	* We use some conventions which are slightly different from the paper however:
14	* <ul>
15	* <li>The names of the variables are different, but we give a translation into
16	* Ukkonen's names.</li>
17	* <li>Many variables are made "global" by realizing them as fields. This way we
18	* can easily deal with those tuple return values without constructing extra
19	* classes.</li>
20	* <li>String indices start at 0 (not at 1).</li>
21	* <li>Substrings are marked by the first index and the index after the last one
22	* (just as in C++ STL) instead of the first and the last index (i.e. intervals
23	* are right-open instead of closed). This makes it more intuitive to express
24	* the empty string (i.e. (i,i) instead of (i,i-1)).</li>
25	* </ul>
26	* <p>
27	* Everything but the construction itself is protected to simplify increasing
28	* its functionality by subclassing but without introducing new method calls.
29	*
30	* @author Benjamin Hummel
31	* @author $Author: kinnen $
32	*
33	* @version $Revision: 41751 $
34	*
35	* @ConQAT.Rating GREEN Hash: 4B2EF0606B3085A6831764ED042FF20D
36	*/
37	class SuffixTree
38	{
39	/**
40	* Infinity in this context.
41	*
42	* @var int
43	*/
44	protected $INFTY;
45
46	/**
47	* The word we are working on.
48	*
49	* @var AbstractToken[]
50	*/
51	protected $word;
52
53	/**
54	* The number of nodes created so far.
55	*
56	* @var int
57	*/
58	protected $numNodes = 0;
59
60	/**
61	* For each node this holds the index of the first character of
62	* {@link #word} labeling the transition <b>to</b> this node. This
63	* corresponds to the <em>k</em> for a transition used in Ukkonen's paper.
64	*
65	* @var int[]
66	*/
67	protected $nodeWordBegin;
68
69	/**
70	* For each node this holds the index of the one after the last character of
71	* {@link #word} labeling the transition <b>to</b> this node. This
72	* corresponds to the <em>p</em> for a transition used in Ukkonen's paper.
73	*
74	* @var int[]
75	*/
76	protected $nodeWordEnd;
77
78	/** For each node its suffix link (called function <em>f</em> by Ukkonen).
79	* @var int[] */
80	protected $suffixLink;
81
82	/**
83	* The next node function realized as a hash table. This corresponds to the
84	* <em>g</em> function used in Ukkonen's paper.
85	*
86	* @var SuffixTreeHashTable
87	*/
88	protected $nextNode;
89
90	/**
91	* An array giving for each node the index where the first child will be
92	* stored (or -1 if it has no children). It is initially empty and will be
93	* filled "on demand" using
94	* {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
95	* (int[], int[], int[])}.
96	*
97	* @var int[]
98	*/
99	protected $nodeChildFirst = [];
100
101	/**
102	* This array gives the next index of the child list or -1 if this is the
103	* last one. It is initially empty and will be filled "on demand" using
104	* {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
105	* (int[], int[], int[])}.
106	*
107	* @var int[]
108	*/
109	protected $nodeChildNext = [];
110
111	/**
112	* This array stores the actual name (=number) of the mode in the child
113	* list. It is initially empty and will be filled "on demand" using
114	* {@link org.conqat.engine.code_clones.detection.suffixtree.SuffixTreeHashTable#extractChildLists
115	* (int[], int[], int[])}.
116	*
117	* @var int[]
118	*/
119	protected $nodeChildNode = [];
120
121	/**
122	* The node we are currently at as a "global" variable (as it is always
123	* passed unchanged). This is called <i>s</i> in Ukkonen's paper.
124	*
125	* @var int
126	*/
127	private $currentNode = 0;
128
129	/**
130	* Beginning of the word part of the reference pair. This is kept "global"
131	* (in constrast to the end) as this is passed unchanged to all functions.
132	* Ukkonen calls this <em>k</em>.
133	*
134	* @var int
135	*/
136	private $refWordBegin = 0;
137
138	/**
139	* This is the new (or old) explicit state as returned by
140	* {@link #testAndSplit(int, Object)}. Ukkonen calls this <em>r</em>.
141	*
142	* @var int
143	*/
144	private $explicitNode = 0;
145
146	/**
147	* Create a new suffix tree from a given word. The word given as parameter
148	* is used internally and should not be modified anymore, so copy it before
149	* if required.
150	*
151	* @param AbstractToken[] $word
152	*/
153	public function __construct($word)
154	{
155	$this->word = $word;	1✔
156	$size = count($word);	1✔
157	$this->INFTY = $size;	1✔
158
159	$expectedNodes = 2 * $size;	1✔
160	$this->nodeWordBegin = array_fill(0, $expectedNodes, 0);	1✔
161	$this->nodeWordEnd = array_fill(0, $expectedNodes, 0);	1✔
162	$this->suffixLink = array_fill(0, $expectedNodes, 0);	1✔
163	$this->nextNode = new SuffixTreeHashTable($expectedNodes);	1✔
164
165	$this->createRootNode();	1✔
166
167	for ($i = 0; $i < $size; $i++) {	1✔
168	$this->update($i);	1✔
169	$this->canonize($i + 1);	1✔
170	}
171	}
172
173	/**
174	* This method makes sure the child lists are filled (required for
175	* traversing the tree).
176	*/
177	protected function ensureChildLists(): void
178	{
179	if ($this->nodeChildFirst == null \|\| count($this->nodeChildFirst) < $this->numNodes) {	1✔
180	$this->nodeChildFirst = array_fill(0, $this->numNodes, 0);	1✔
181	$this->nodeChildNext = array_fill(0, $this->numNodes, 0);	1✔
182	$this->nodeChildNode = array_fill(0, $this->numNodes, 0);	1✔
183	$this->nextNode->extractChildLists($this->nodeChildFirst, $this->nodeChildNext, $this->nodeChildNode);	1✔
184	}
185	}
186
187	/**
188	* Creates the root node.
189	*/
190	private function createRootNode(): void
191	{
192	$this->numNodes = 1;	1✔
193	$this->nodeWordBegin[0] = 0;	1✔
194	$this->nodeWordEnd[0] = 0;	1✔
195	$this->suffixLink[0] = -1;	1✔
196	}
197
198	/**
199	* The <em>update</em> function as defined in Ukkonen's paper. This inserts
200	* the character at charPos into the tree. It works on the canonical
201	* reference pair ({@link #currentNode}, ({@link #refWordBegin}, charPos)).
202	*/
203	private function update(int $charPos): void
204	{
205	$lastNode = 0;	1✔
206
207	while (!$this->testAndSplit($charPos, $this->word[$charPos])) {	1✔
208	$newNode = $this->numNodes++;	1✔
209	$this->nodeWordBegin[$newNode] = $charPos;	1✔
210	$this->nodeWordEnd[$newNode] = $this->INFTY;	1✔
211	$this->nextNode->put($this->explicitNode, $this->word[$charPos], $newNode);	1✔
212
213	if ($lastNode != 0) {	1✔
214	$this->suffixLink[$lastNode] = $this->explicitNode;	1✔
215	}
216	$lastNode = $this->explicitNode;	1✔
217	$this->currentNode = $this->suffixLink[$this->currentNode];	1✔
218	$this->canonize($charPos);	1✔
219	}
220
221	if ($lastNode != 0) {	1✔
222	$this->suffixLink[$lastNode] = $this->currentNode;	1✔
223	}
224	}
225
226	/**
227	* The <em>test-and-split</em> function as defined in Ukkonen's paper. This
228	* checks whether the state given by the canonical reference pair (
229	* {@link #currentNode}, ({@link #refWordBegin}, refWordEnd)) is the end
230	* point (by checking whether a transition for the
231	* <code>nextCharacter</code> exists). Additionally the state is made
232	* explicit if it not already is and this is not the end-point. It returns
233	* true if the end-point was reached. The newly created (or reached)
234	* explicit node is returned in the "global" variable.
235	*/
236	private function testAndSplit(int $refWordEnd, AbstractToken $nextCharacter): bool
237	{
238	if ($this->currentNode < 0) {	1✔
239	// trap state is always end state
UNCOV 240	return true;	×
241	}
242
243	if ($refWordEnd <= $this->refWordBegin) {	1✔
244	if ($this->nextNode->get($this->currentNode, $nextCharacter) < 0) {	1✔
245	$this->explicitNode = $this->currentNode;	1✔
246
247	return false;	1✔
248	}
249
250	return true;	1✔
251	}
252
253	$next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);	1✔
254
255	if ($nextCharacter->equals($this->word[$this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin])) {	1✔
256	return true;	1✔
257	}
258
259	// not an end-point and not explicit, so make it explicit.
260	$this->explicitNode = $this->numNodes++;	1✔
261	$this->nodeWordBegin[$this->explicitNode] = $this->nodeWordBegin[$next];	1✔
262	$this->nodeWordEnd[$this->explicitNode] = $this->nodeWordBegin[$next] + $refWordEnd - $this->refWordBegin;	1✔
263	$this->nextNode->put($this->currentNode, $this->word[$this->refWordBegin], $this->explicitNode);	1✔
264
265	$this->nodeWordBegin[$next] += $refWordEnd - $this->refWordBegin;	1✔
266	$this->nextNode->put($this->explicitNode, $this->word[$this->nodeWordBegin[$next]], $next);	1✔
267
268	return false;	1✔
269	}
270
271	/**
272	* The <em>canonize</em> function as defined in Ukkonen's paper. Changes the
273	* reference pair (currentNode, (refWordBegin, refWordEnd)) into a canonical
274	* reference pair. It works on the "global" variables {@link #currentNode}
275	* and {@link #refWordBegin} and the parameter, writing the result back to
276	* the globals.
277	*
278	* @param int $refWordEnd one after the end index for the word of the reference pair
279	*/
280	private function canonize(int $refWordEnd): void
281	{
282	if ($this->currentNode === -1) {	1✔
283	// explicitly handle trap state
284	$this->currentNode = 0;	1✔
285	$this->refWordBegin++;	1✔
286	}
287
288	if ($refWordEnd <= $this->refWordBegin) {	1✔
289	// empty word, so already canonical
290	return;	1✔
291	}
292
293	$next = $this->nextNode->get(	1✔
294	$this->currentNode,	1✔
295	$this->word[$this->refWordBegin]	1✔
296	);	1✔
297
298	while ($refWordEnd - $this->refWordBegin >= $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next]) {	1✔
299	$this->refWordBegin += $this->nodeWordEnd[$next] - $this->nodeWordBegin[$next];	1✔
300	$this->currentNode = $next;	1✔
301
302	if ($refWordEnd > $this->refWordBegin) {	1✔
303	$next = $this->nextNode->get($this->currentNode, $this->word[$this->refWordBegin]);	1✔
304	} else {
305	break;	1✔
306	}
307	}
308	}
309	}

systemsdk / phpcpd / #4

Source File Press 'n' to go to next uncovered line, 'b' for previous

Source File
Press 'n' to go to next uncovered line, 'b' for previous