• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

athalhammer / danker / 9561243736

18 Jun 2024 08:01AM UTC coverage: 93.103% (+0.06%) from 93.043%
9561243736

push

github

athalhammer
added danker install for gh actions

108 of 116 relevant lines covered (93.1%)

2.79 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

95.54
/danker/danker.py
1
#!/usr/bin/env python3
2

3
#    danker - PageRank on Wikipedia/Wikidata
4
#    Copyright (C) 2017  Andreas Thalhammer
5
#
6
#    This program is free software: you can redistribute it and/or modify
7
#    it under the terms of the GNU General Public License as published by
8
#    the Free Software Foundation, either version 3 of the License, or
9
#    (at your option) any later version.
10
#
11
#    This program is distributed in the hope that it will be useful,
12
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
13
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
#    GNU General Public License for more details.
15
#
16
#    You should have received a copy of the GNU General Public License
17
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.
18

19
"""
20
`danker <https://github.com/athalhammer/danker>`_ is a light-weight
21
package/module to compute PageRank on very large graphs with limited
22
hardware resources. The input format are edge lists of the following format::
23

24
    A   B
25
    B   C
26
    C   D
27

28
The nodes can be denoted as strings or integers. However, depending on the
29
size of the graph and the amount of available memory you may have to index
30
string node as integers and map back after computation::
31

32
    # link file
33
    1   2
34
    2   3
35
    3   4
36

37
    # index file
38
    1   A
39
    2   B
40
    3   C
41
    4   D
42

43
The computation can then be run with 'int_only mode' which consumes less
44
memory. A pre-requisite for danker is that the input file is sorted. For
45
this, you can use the Linux `sort <https://linux.die.net/man/1/sort>`_
46
command. If you want to use the :func:`danker_smallmem` option you need
47
two copies of the same link file: one sorted by the left column and one
48
sorted by the right column::
49

50
   # Sort by left column
51
   sort --key=1,1 -o output-left link-file
52

53
   # Sort by right column
54
   sort --key=2,2 -o output-right link-file
55

56
The :func:`init` function is used to initialize the PageRank computation.
57
The following code shows a minimal example for computing PageRank with the
58
:func:`danker_bigmem` option (right-sorted file not needed)::
59

60
    import danker
61
    start_value, iterations, damping = 0.1, 40, 0.85
62
    pr_dict = danker.init("output-left", start_value, False, False)
63
    pr_out = danker.danker_bigmem(pr_dict, iterations, damping)
64
    result_loc = (iterations % 2) + 1
65
    for i in pr_out:
66
        print(i, pr_out[i][result_loc], sep='\\t')
67

68

69
The following code shows a minimal example for computing PageRank with the
70
:func:`danker_smallmem` option::
71

72
    import danker
73
    start_value, iterations, damping = 0.1, 40, 0.85
74
    pr_dict = danker.init("output-left", start_value, True, False)
75
    pr_out = danker.danker_smallmem(pr_dict, "output-right", iterations, damping, start_value)
76
    result_loc = (iterations % 2) + 1
77
    for i in pr_out:
78
        print(i, pr_out[i][result_loc], sep='\\t')
79

80
"""
81
import sys
3✔
82
import time
3✔
83
import datetime
3✔
84
import argparse
3✔
85
from importlib.metadata import version
3✔
86

87
# import memory_profiler
88

89

90
class InputNotSortedException(Exception):
3✔
91
    """
92
    Custom exception thrown in case the input file is not correctly sorted.
93

94
    :param file_name: The name of the file that is not correctly sorted.
95
    :param line1: The line that should be first (but is not).
96
    :param line2: The line that should be second (but is not).
97
    """
98

99
    _MESSAGE = (
3✔
100
        'Input file "{0}" is not correctly sorted. "{1}" after "{2}". If you think the sorting is'
101
        ' correct please check if you forgot the -i for "integers only" flag.'
102
    )
103

104
    def __init__(self, file_name, line1, line2):
3✔
105
        message = self._MESSAGE.format(file_name, line1, line2)
3✔
106
        Exception.__init__(self, message)
3✔
107

108

109
def _conv_int(string, int_only):
3✔
110
    """
111
    Helper function to optimize memory usage.
112
    """
113
    if int_only:
3✔
114
        return int(string)
3✔
115
    return string
3✔
116

117

118
def _get_std_list(smallmem, start_value):
3✔
119
    """
120
    Helper function to return standard list for smallmem vs bigmem.
121
    """
122
    if smallmem:
3✔
123
        return [0, start_value, start_value, False]
3✔
124
    return [0, start_value, start_value, []]
3✔
125

126

127
def init(left_sorted, start_value, smallmem, int_only):
3✔
128
    """
129
    This function creates the data structure for PageRank computation by
130
    indexing every node. Main indexing steps include setting the starting
131
    value as well as counting the number of outgoing links for each node.
132

133
    :param left_sorted: A tab-separated link file that is sorted by the
134
                        left column.
135
    :param start_value: The PageRank starting value.
136
    :param smallmem: This value is interpreted as a boolean that indicates
137
                     whether the indexing should be done for
138
                     :func:`danker_smallmem` (file iteration) or
139
                     :func:`danker_bigmem` (in-memory). Default is "False".
140
    :param int_only: Boolean flag in case the graph contains only integer nodes.
141
    :returns: Dictionary with each key referencing a node. The value is a
142
              list with the following contents - depending on the smallmem
143
              parameter and the intended use:
144

145
              * :func:`danker_bigmem` [link_count:int, start_value:float,
146
                start_value:float, linked_pages:list]
147

148
              * :func:`danker_smallmem` [link_cout:int, start_value:float,
149
                start_value:float, touched_in_1st_iteration:boolean]
150
    """
151
    dictionary = {}
3✔
152
    previous = None
3✔
153
    current_count = 1
3✔
154
    with open(left_sorted, encoding="utf-8") as ls_file:
3✔
155
        for line in ls_file:
3✔
156
            current = _conv_int(line.split("\t")[0].strip(), int_only)
3✔
157
            receiver = _conv_int(line.split("\t")[1].strip(), int_only)
3✔
158

159
            # take care of inlinks
160
            if not smallmem:
3✔
161
                data = dictionary.setdefault(
3✔
162
                    receiver, _get_std_list(smallmem, start_value)
163
                )
164
                data[3].append(current)
3✔
165

166
            # take care of counts
167
            if current == previous:
3✔
168
                # increase counter
169
                current_count = current_count + 1
3✔
170
            else:
171
                if previous:
3✔
172
                    # make sure input is correctly sorted
173
                    if current < previous:
3✔
174
                        raise InputNotSortedException(left_sorted, current, previous)
3✔
175
                    # store previous Q-ID and reset counter
176
                    prev = dictionary.get(
3✔
177
                        previous, _get_std_list(smallmem, start_value)
178
                    )
179
                    dictionary[previous] = [current_count] + prev[1:]
3✔
180
                    current_count = 1
3✔
181
            previous = current
3✔
182

183
        # take care of last item
184
        if previous:
3✔
185
            prev = dictionary.get(previous, _get_std_list(smallmem, start_value))
3✔
186
            dictionary[previous] = [current_count] + prev[1:]
3✔
187
    return dictionary
3✔
188

189

190
# @profile
191
def danker_smallmem(
3✔
192
    dictionary, right_sorted, iterations, damping, start_value, int_only
193
):
194
    """
195
    Compute PageRank with right sorted file.
196

197
    :param dictionary: Python dictionary created with :func:`init`
198
                       (smallmem set to True).
199
    :param right_sorted: The same tab-separated link file that was used for
200
                         :func:`init` sorted by the right column.
201
    :param iterations: The number of PageRank iterations.
202
    :param damping: The PageRank damping factor.
203
    :param start_value: The PageRank starting value (same as was used for
204
                        :func:`init`).
205
    :param int_only: Boolean flag in case the graph contains only integer nodes.
206
    :return: The same dictionary that was created by :func:`init`. The keys
207
             are the nodes of the graph. The output score is located at
208
             the ``(iterations % 2) + 1`` position of the respecive list
209
             (that is the value of the key).
210
    """
211
    for iteration in range(0, iterations):
3✔
212
        print(str(iteration + 1) + ".", end="", flush=True, file=sys.stderr)
3✔
213
        previous = None
3✔
214

215
        # positions for i and i+1 result values (alternating with iterations).
216
        i_location = (iteration % 2) + 1
3✔
217
        i_plus_1_location = ((iteration + 1) % 2) + 1
3✔
218

219
        with open(right_sorted, encoding="utf-8") as rs_file:
3✔
220
            for line in rs_file:
3✔
221
                current = _conv_int(line.split("\t")[1].strip(), int_only)
3✔
222
                if previous != current:
3✔
223
                    if previous and current < previous:
3✔
224
                        raise InputNotSortedException(right_sorted, current, previous)
3✔
225

226
                    # reset dank
227
                    dank = 1 - damping
3✔
228
                    # initialize in case of no outgoing links
229
                    dictionary.setdefault(current, _get_std_list(True, start_value))
3✔
230

231
                in_link = _conv_int(line.split("\t")[0].strip(), int_only)
3✔
232
                in_dank = dictionary.get(in_link)
3✔
233
                dank = dank + (damping * in_dank[i_location] / in_dank[0])
3✔
234
                dictionary[current][i_plus_1_location] = dank
3✔
235
                previous = current
3✔
236

237
                # record current node as 'touched' (it has incoming links)
238
                if iteration == 0:
3✔
239
                    dictionary[current][3] = True
3✔
240

241
            # fix 'untouched' nodes (they do not have incoming links)
242
            if iteration == 0:
3✔
243
                for k in dictionary.keys():
3✔
244
                    if not dictionary[k][3]:
3✔
245
                        dictionary[k][i_plus_1_location] = 1 - damping
3✔
246
                        dictionary[k][i_location] = 1 - damping
3✔
247
    print("", file=sys.stderr)
3✔
248
    return dictionary
3✔
249

250

251
# @profile
252
def danker_bigmem(dictionary, iterations, damping):
3✔
253
    """
254
    Compute PageRank with big memory option.
255

256
    :param dictionary: Python dictionary created with :func:`init`
257
                       (smallmem set to False).
258
    :param iterations: The number of PageRank iterations.
259
    :param damping: The PageRank damping factor.
260
    :return: The same dictionary that was created by :func:`init`. The keys
261
             are the nodes of the graph. The output score is located at
262
             the ``(iterations % 2) + 1`` position of the respecive list
263
             (that is the value of the key).
264
    """
265
    for iteration in range(0, iterations):
3✔
266
        # positions for i and i+1 result values (alternating with iterations).
267
        i_location = (iteration % 2) + 1
3✔
268
        i_plus_1_location = ((iteration + 1) % 2) + 1
3✔
269

270
        print(str(iteration + 1) + ".", end="", flush=True, file=sys.stderr)
3✔
271
        for j in dictionary:
3✔
272
            current = dictionary.get(j)
3✔
273
            dank = 1 - damping
3✔
274
            for k in current[3]:
3✔
275
                in_dank = dictionary.get(k)
3✔
276
                dank = dank + (damping * in_dank[i_location] / in_dank[0])
3✔
277
            dictionary[j][i_plus_1_location] = dank
3✔
278
    print("", file=sys.stderr)
3✔
279
    return dictionary
3✔
280

281

282
# @profile
283
def _main():
3✔
284
    """
285
    Execute main program.
286
    """
287
    parser = argparse.ArgumentParser(
3✔
288
        prog="python -m danker",
289
        description="danker"
290
        + " - Compute PageRank on large graphs with "
291
        + "off-the-shelf hardware.",
292
    )
293
    parser.add_argument(
3✔
294
        "left_sorted",
295
        type=str,
296
        help="A two-column, " + "tab-separated file sorted by the left column.",
297
    )
298
    parser.add_argument(
3✔
299
        "-r",
300
        "--right_sorted",
301
        type=str,
302
        help="The same " + "file as left_sorted but sorted by the right column.",
303
    )
304
    parser.add_argument(
3✔
305
        "damping", type=float, help="PageRank damping factor" + "(between 0 and 1)."
306
    )
307
    parser.add_argument(
3✔
308
        "iterations", type=int, help="Number of PageRank " + "iterations (>=0)."
309
    )
310
    parser.add_argument("start_value", type=float, help="PageRank starting value (>0).")
3✔
311
    parser.add_argument(
3✔
312
        "-p",
313
        "--output_precision",
314
        type=int,
315
        help="Number of places after the decimal point.",
316
        default=17,
317
    )
318
    parser.add_argument(
3✔
319
        "-i", "--int_only", action="store_true", help="All nodes are integers (flag)"
320
    )
321
    args = parser.parse_args()
3✔
322
    param_out = f"[iterations ({args.iterations}), damping ({args.damping}), start value({args.start_value})]"
3✔
323
    if (
3✔
324
        args.iterations < 0
325
        or args.damping > 1
326
        or args.damping < 0
327
        or args.start_value <= 0
328
    ):
329
        print(
×
330
            "ERROR: Provided PageRank parameters\n\t"
331
            f"{param_out}\n\tout of allowed range.\n\n",
332
            file=sys.stderr,
333
        )
334
        parser.print_help(sys.stderr)
×
335
        sys.exit(1)
×
336
    print(
3✔
337
        f"danker ({version('danker')}): starting computation of PageRank on '{args.left_sorted}' with parameters\n\t"
338
        f"{param_out} ({datetime.datetime.now()})",
339
        file=sys.stderr,
340
    )
341
    start = time.time()
3✔
342
    dictionary = init(
3✔
343
        args.left_sorted, args.start_value, args.right_sorted, args.int_only
344
    )
345
    result_position = (args.iterations % 2) + 1
3✔
346
    output_form = "{0}\t{1:." + str(args.output_precision) + "f}"
3✔
347

348
    if args.right_sorted:
3✔
349
        danker_smallmem(
×
350
            dictionary,
351
            args.right_sorted,
352
            args.iterations,
353
            args.damping,
354
            args.start_value,
355
            args.int_only,
356
        )
357
    else:
358
        danker_bigmem(dictionary, args.iterations, args.damping)
3✔
359

360
    print(
3✔
361
        f"danker ({version('danker')}): PageRank computation took "
362
        f"{time.time() - start:.2f} seconds ({datetime.datetime.now()}).",
363
        file=sys.stderr,
364
    )
365

366
    for qid, value in dictionary.items():
3✔
367
        print(output_form.format(qid, value[result_position]))
3✔
368

369

370
if __name__ == "__main__":
3✔
371
    _main()
×
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2025 Coveralls, Inc