All posts
|9 min read

We Indexed the Entire Linux Kernel in 90 Seconds. Here's What We Learned.

We ran our code analyzer on the world's largest open source project — 93,000 files, 30 years of history, 15,000+ contributors. Real benchmarks, real numbers, real lessons.

Most developer tools show you a demo on a todo app and call it a day. We wanted to know what actually happens when you throw the largest open-source codebase in existence at a code analyzer.

So we did. Here's what happened when we pointed CodeCortex at the Linux kernel — 93,011 files, 52,000 of them C, maintained by 15,000+ contributors over three decades.

The test setup

Three codebases, chosen to cover the full spectrum:

CodeCortex (itself)  —  44 files,     TypeScript
OpenClaw             —  5,698 files,  7 languages
Linux kernel         —  93,011 files, 7 languages

Same machine (MacBook Pro M-series), same tool, same config. No cherry-picking. We measured symbols extracted, time taken, memory used, and whether anything crashed.

Raw extraction numbers

                     CodeCortex     OpenClaw        Linux kernel
─────────────────────────────────────────────────────────────────
Files parsed:        44             5,698           64,708
Symbols extracted:   976            129,000         5,300,000
Import edges:        131            15,996          73,369
Call edges:          2,489          278,222         567,324
Modules detected:    7              94              82
Init time:           ~3s            ~45s            ~90s
Output size:         340KB          90MB            150MB
Crashed:             no             no              no

5.3 million symbols. Every function, struct, enum, typedef, and macro across 64,708 C, Python, Rust, Ruby, Bash, and Objective-C files — indexed in 90 seconds.

For perspective: a new developer joining a Linux subsystem team typically spends weeks building a mental model. This takes a minute and a half.

The WASM disaster (and why we rewrote it)

We didn't start with these numbers. Our first version used WebAssembly-based tree-sitter. It worked fine on small repos. Then we hit the Linux kernel.

WASM:    573,726 symbols → Aborted() — process killed, no stack trace
Native:  5,300,000 symbols → completed, no issues

WASM hit the V8 heap limit on large C files and crashed with an uncatchable error. No try/catch, no error handler — just a dead process. We spent a week migrating to native N-API tree-sitter bindings. The result: 9.3x more symbols extracted on the same file set, zero crashes, and lower memory usage.

Lesson: if your tool might process large codebases, test with large codebases. Unit tests on 10-file repos won't catch this.

The token comparison nobody asked for (but should have)

We measured what happens when an AI agent needs to understand a real codebase. Test subject: a 16-file TypeScript API server (3,421 lines) — the kind of project most developers actually work on.

Without any knowledge layer:

Reading config files:           5,280 tokens
Reading entry points:           5,476 tokens
Reading API layer:              3,766 tokens
Reading server logic:           3,264 tokens
Reading indexers (3 files):     7,613 tokens
Reading scoring engine:         4,142 tokens
Reading database layer:         1,572 tokens
Reading config modules:         841 tokens
Reading SDK:                    5,826 tokens
─────────────────────────────────────────────
Total to "understand" project:  37,780 tokens
Tool calls required:            25-35
Time before first useful edit:  5-10 minutes

With pre-processed knowledge:

Project manifest:               180 tokens
Architecture understanding:     850 tokens
Data flow + entry points:       620 tokens
Dependency graph:               1,200 tokens
Symbol index:                   980 tokens
Temporal analysis:              470 tokens
─────────────────────────────────────────────
Total to "understand" project:  4,300 tokens
Tool calls required:            3-5
Time before first useful edit:  <1 minute

88.6% fewer tokens. 7x fewer tool calls. And the 4,300-token version actually contains more information — it includes dependency relationships, historical coupling data, and bug patterns that the raw scan can never surface no matter how many files you read.

The thing raw scanning literally cannot do

This is the part that surprised us most during testing. Some knowledge only exists in git history, and no amount of source code reading will reveal it.

From the test codebase's git log:

routes.ts ↔ worker.ts:   75% co-change rate — ZERO imports between them
routes.ts ↔ migrate.ts:  58% co-change rate
compute.ts ↔ worker.ts:  58% co-change rate

routes.ts and worker.ts are modified together in 75% of all commits. They're the most coupled files in the entire codebase. And there's not a single import connecting them.

Why? They're parallel implementations of the same API for different runtimes. Change the response format in one, the other silently serves stale data. An AI agent editing routes.ts without knowing about worker.ts will create a bug 3 out of 4 times.

No static analyzer catches this. No file reader catches this. Only git history analysis reveals it.

Bug archaeology: the lessons your code already learned

The git history also surfaced these — real bugs from the test codebase, distilled into one-line warnings:

worker.ts:   "neon() HTTP driver can't do col = $1 OR $1 IS NULL — use explicit branching"
worker.ts:   "role field must exist in ALL 3 API surfaces (worker, routes, MCP)"
compute.ts:  "batch inserts required — 1 row/query is 500x too slow on Neon free tier"

Each of these caused a production outage. Each was fixed. And each is a trap waiting for the next developer (or AI agent) who doesn't know the history.

With temporal analysis, the agent gets these warnings before writing a line of code. Without it, the agent discovers them the hard way — by reintroducing the same bug.

The full efficiency table

                           Without         With            Change
─────────────────────────────────────────────────────────────────
Tokens to understand:      37,800          4,300           8.8x fewer
Tool calls needed:         25-35           3-5             7x fewer
Time to first edit:        5-10 min        <1 min          5-10x faster
Hidden deps surfaced:      0               100%            impossible → instant
Bug lessons available:     0               4 warnings      impossible → instant
Risk ranking:              none            full hotspot map impossible → instant
Cross-session memory:      none            persistent      reset → preserved

What this means in practice

The quantitative gains matter — 88% fewer tokens, 7x fewer tool calls. But the qualitative shift is bigger. An AI agent with a knowledge layer operates fundamentally differently:

  • It doesn't explore — it already knows the architecture
  • It doesn't guess at dependencies — it queries the graph
  • It doesn't miss coupled files — temporal analysis flags them
  • It doesn't repeat past mistakes — bug archaeology warns it

The Linux kernel test proved the tool doesn't break at scale. The token comparison proved it's not a marginal improvement. And the temporal analysis proved it surfaces knowledge that raw scanning can never provide.

Run it yourself

bash
npm install -g codecortex-ai cd your-project codecortex init

Check the .codecortex/ directory. That's the knowledge your AI agent has been reconstructing from scratch every single session.