13
82%
dev: 82%

Ran 25 May 2026 05:04PM UTC

Files 325

Run time 10s

Committed 25 May 2026 05:01PM UTC

Coverage 18.449% (first build)

Job # 26411444620.13

Build Type push (github)

Committed by web-flow

Commit Message

Benchmark graph context tools on task 394 (#2009)

* docs: benchmark graph context tools

Compare CodeGraph, code-review-graph, Graphify, and baseline on task #394 to guide optional NeoKai integration priority.

* docs: correct graph benchmark findings

Address review feedback by fixing the task #394 answer key, MCP tool counts, and Graphify runtime notes.

* docs: add ast-grep benchmark comparison

Benchmark ast-grep as a structural search baseline alongside the graph context tools for task #394.

* docs: add unseeded graph benchmark round

* docs: add plain unseeded GLM baseline

* docs: add mixed graph benchmark round

* test: add graph tool benchmark as agent session integration test

Proper benchmark using NeoKai daemon sessions with MCP tool servers
attached, not raw Python HTTP calls. 12 test cases (describe.skip by
default): baseline GLM, 4 unseeded tool cases, 4 mixed discovery cases,
plus mixed baseline. Outputs JSON results to /tmp/.

Run: cd packages/daemon && GLM_API_KEY=xxx bun test tests/online/benchmark/benchmark-graph-tools.test.ts

* docs: add agent session benchmark results and fix GLM-SDK compatibility

Run graph tool benchmark through real NeoKai daemon sessions with MCP
servers attached. Key findings:

- GLM-5.x tool_use responses incompatible with Claude Agent SDK context-fetcher
- GLM-4.7 works for text-only and single-tool MCP sessions
- Mixed multi-tool sessions hang due to same SDK incompatibility
- GLM-4.7 did not voluntarily invoke MCP tools in any test case
- All 4 completed tests (baseline, CodeGraph, CRG, ast-grep) produced
  text-only plans with zero tool calls

Restructure benchmark: drop mixed round, keep unseeded tests only,
add text-only baseline prompt, increase timeouts, build indexes before
daemon start to avoid transport PONG timeout.

* fix: address benchmark PR review feedback

- Use BENCHMARK_PROMPT_UNSEDED for MCP cases (not TEXT_ONLY) so tools
  are not suppressed
- Record real commit SHA via git rev-parse... (continued)

Coverage Stats

16903 of 91618 relevant lines covered (18.45%)

9.71 hits per line
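
The percentage above follows directly from the two line counts. As a quick check, here is a minimal sketch that recomputes it. The helper name `coveragePercent` is hypothetical and is not part of Coveralls or NeoKai, and returning 0 for an empty denominator is an assumption made only for this illustration.

```typescript
// Hypothetical helper: line coverage as a percentage of relevant lines.
// Assumption: a report with no relevant lines is treated as 0% covered.
function coveragePercent(covered: number, relevant: number): number {
  if (relevant <= 0) return 0;
  return (covered / relevant) * 100;
}

// Figures taken from the Coverage Stats section of this job.
const coveredLines = 16903;
const relevantLines = 91618;

const pct = coveragePercent(coveredLines, relevantLines);

// Three decimals matches the build header; two decimals matches the stats line.
console.log(pct.toFixed(3)); // 18.449
console.log(pct.toFixed(2)); // 18.45
```

Both roundings agree with the figures reported for this job (18.449% in the build header, 18.45% in the stats line).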

lsm / neokai / 26411444620 / 13

Source Files on job daemon-online-mcp - 26411444620.13
