Common Crawl harmonic centrality is the new metric for AI optimization
Written by James Berry • Last updated February 2, 2026
AI models are trained on the entire web. Finally, we have a new metric that measures your domain authority in this dataset (and SEOs can use it to optimize for AI visibility!)
Most large language models are trained on a dataset called Common Crawl. According to Mozilla Foundation research, 64% of the 47 LLMs analyzed used at least one filtered version of Common Crawl. For GPT-3, over 80% of training tokens came from filtered Common Crawl archives.
But Common Crawl does not crawl every website equally. It prioritizes certain domains over others using a metric called Harmonic Centrality. This is the hidden layer of AI optimization that most people do not know about. You can have excellent content, perfect freshness, and clear structure. But if your HC rank is poor, Common Crawl captures less of your site. AI models train on less of your content. Your AI share of voice suffers as a result.
What Is Common Crawl?
Common Crawl is a nonprofit foundation that has been archiving the web since 2007. Every month, it crawls billions of publicly accessible web pages and stores them in a massive archive measured in petabytes.
This archive is not just for researchers studying the web. It has become the foundation of modern AI. OpenAI, Google, Anthropic, Meta, and Amazon have all used Common Crawl data to train their models. When you ask ChatGPT a question and it knows the answer from its training data, that knowledge likely came from Common Crawl.
The numbers show the scale.
| Common Crawl Stats | Numbers |
|---|---|
| Domains per monthly release | 94 to 163 million |
| Total domain records (all periods) | 607+ million |
| Archive size | Petabytes |
| Update frequency | Monthly |
With hundreds of millions of domains to cover, Common Crawl cannot crawl everything equally. It must decide which sites get crawled more frequently and more thoroughly. That decision uses a metric called Harmonic Centrality.

How Common Crawl Decides What To Crawl
Common Crawl publishes WebGraph data alongside its main crawl archives. This dataset maps the link relationships between domains across the entire web. From this map, Common Crawl calculates two authority metrics for every domain.
Harmonic Centrality measures connectivity. Think of it as mapping the shortest paths between your domain and every other domain on the web. Domains with strong Harmonic Centrality sit at the intersection of many short paths. They are highly reachable and highly connected. The metric rewards domains that function as network hubs rather than isolated endpoints.
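To make the definition concrete, here is a minimal sketch of harmonic centrality on a toy link graph. A domain's score is the sum of 1/distance over every other domain that can reach it by following links, so short incoming paths from many domains produce a high score. The domain names and the `harmonic_centrality` helper are hypothetical, and this is not Common Crawl's own pipeline, which works at a far larger scale.

```python
from collections import deque


def harmonic_centrality(graph):
    """Return {node: sum of 1/d(other, node)} over all nodes that can reach node."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Reverse the edges so a BFS from `node` walks incoming links backwards.
    reverse = {n: [] for n in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    scores = {}
    for node in nodes:
        dist = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for prev in reverse[current]:
                if prev not in dist:
                    dist[prev] = dist[current] + 1
                    queue.append(prev)
        # Domains that cannot reach `node` contribute nothing (1/infinity).
        scores[node] = sum(1.0 / d for d in dist.values() if d > 0)
    return scores


# Hypothetical toy link graph: domain -> domains it links to.
toy_graph = {
    "blog.example": ["hub.example"],
    "shop.example": ["hub.example"],
    "news.example": ["hub.example", "blog.example"],
    "hub.example": ["niche.example"],
}

scores = harmonic_centrality(toy_graph)
ranked = sorted(scores.items(), key=lambda item: -item[1])
for position, (domain, score) in enumerate(ranked, start=1):
    print(position, domain, round(score, 2))
```

In this toy graph `hub.example` scores highest because three domains link to it directly. `niche.example` comes second with a single inbound link, because that link comes from the hub and puts it two hops from everything else.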
PageRank measures endorsement. It counts incoming links but weighs them by the authority of the linking site. A single link from a trusted source counts more than hundreds of links from unknown sites. Google's founders developed this metric in 1998, and it shaped how we think about web authority for decades.
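The weighting idea can be sketched with a basic power-iteration PageRank. This is a simplified illustration, not the implementation Common Crawl runs. The `pagerank` helper, the damping factor of 0.85, the iteration count, and the example domains are all assumptions chosen for the demo.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank for a {source: [targets]} link graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # A page splits its own rank evenly across its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (no outgoing links): spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: ten feeder sites endorse one authority,
# which links to site-a. Three unknown sites link to site-b.
demo_graph = {f"feeder{i}.example": ["authority.example"] for i in range(10)}
demo_graph["authority.example"] = ["site-a.example"]
for i in range(3):
    demo_graph[f"unknown{i}.example"] = ["site-b.example"]

ranks = pagerank(demo_graph)
print("site-a:", round(ranks["site-a.example"], 4))
print("site-b:", round(ranks["site-b.example"], 4))
```

Here `site-a.example` has one inbound link and `site-b.example` has three, yet site-a ends up with the higher score because its single link comes from a domain that is itself heavily endorsed.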
Common Crawl uses Harmonic Centrality to determine crawl priority. The lower your HC rank, the more frequently you get crawled. More frequent crawling means more of your pages appear in monthly archives. More pages in archives means greater representation in AI training data. Greater representation means higher AI visibility.
The chain works like this. Lower HC rank means more frequent indexing in Common Crawl. More indexing means LLMs train on more of your content. More training data means AI models understand and trust your brand. Trust means AI responses are more likely to recommend you.
This creates a compounding effect. If your domain has a low HC rank, Common Crawl visits it more often. Your content appears in more monthly snapshots. AI models trained on this data see your brand mentioned across multiple time periods. This builds what researchers call "baseline familiarity" in the model's parametric knowledge. The model learns to understand and trust your brand. You become a brand the AI already knows and is more likely to recommend before any real-time retrieval happens.
The Two Sides Of AI Search Optimization
AI search visibility is a game of two halves. They should not be confused.
Online visibility is about real-time search grounding. When ChatGPT or Perplexity searches the live web to answer a question, will your content get cited? This depends on freshness, relevance, and whether AI crawlers can access your pages right now.
Offline visibility is about training data. Before any real-time search happens, AI models already have knowledge baked into their parameters. This knowledge came from training datasets like Common Crawl. If your brand appeared frequently in training data, the model already knows you exist.
Harmonic Centrality measures your offline visibility. It determines how much of your content ends up in AI training datasets. It does not replace the need for online visibility tracking. You need both.
Think of it like traditional SEO. Domain Rating (DR) or Domain Authority (DA) measures your backlink authority, but it does not tell you if you rank for specific keywords. You still need to track rankings separately. HC works the same way for AI. It measures your training data authority, but you still need to track your AI search visibility to see if you are actually getting cited in real-time responses.
Why Harmonic Centrality Differs From PageRank
Think of it this way. Domain Rating (DR) and similar metrics measure your authority for traditional Google SEO. They optimize for backlinks and help with online search rankings. Harmonic Centrality measures your authority for AI offline visibility. It optimizes for your weighting in LLM training datasets. DR is for online. HC is for offline. Both matter, but they measure different things.

Both metrics measure authority, but they capture different aspects of influence. PageRank focuses on who links to you. Harmonic Centrality focuses on your position in the overall network.
Consider Wikipedia. It ranks #14 in Harmonic Centrality but only #37 in PageRank according to Common Crawl's October-November-December 2025 data. Yet Wikipedia is one of ChatGPT's most frequently cited sources. Its position as a central hub in the web's link topology may matter as much as raw inbound link authority.
Infrastructure and CDN domains illustrate this further. Sites like Cloudflare, jsDelivr, and Gstatic rank extremely high in Harmonic Centrality because they are embedded across millions of websites. They serve as connective tissue in the web's structure even without being content destinations.
For AI visibility, both metrics matter. PageRank signals that other authoritative sites trust you. Harmonic Centrality determines how much of your content Common Crawl captures. A site with a low HC rank gets crawled more frequently and represented more thoroughly in AI training data. That translates directly to higher AI share of voice.
The Evidence For Training Data Influence
Does your HC rank actually affect AI visibility? The research points in that direction.
The Mozilla Foundation report confirmed that Common Crawl uses Harmonic Centrality to determine crawl priority. Domains with lower HC ranks appear more frequently in the archive. The report also noted that "digitally marginalized communities" are less likely to be included due to this approach. The prioritization creates real differences in how much content from each domain ends up in AI training data.
Citation studies show patterns worth noting. Semrush analyzed over 150,000 citations and found Reddit, Wikipedia, Google, and YouTube among the most frequently cited sources. Profound analyzed 680 million citations and found similar concentration. These are all domains with low HC ranks in Common Crawl's WebGraph.
The domains with the lowest HC ranks are also among the most cited by LLMs. They get crawled most frequently. They have the most content in training data. They show up most often in AI responses. The correlation is clear even if causation involves multiple factors.
Metehan Yesilyurt, an SEO consultant who built a tool to explore this data, found something interesting during his research. He discovered a domain with low Semrush and Ahrefs authority scores that was getting mentioned across thousands of ChatGPT responses. Its Common Crawl authority metrics told a different story than traditional SEO tools.
Model Knowledge Has A Delay
Even if you improve your Harmonic Centrality today, AI models will not immediately reflect that change. Large language models have knowledge cutoffs from their training data.
| AI Search Engine | Large Language Model (LLM) | Knowledge Cutoff | Training Data Impact |
|---|---|---|---|
| ChatGPT (OpenAI) | GPT-5.2 | June 2024 | Changes after this date not in parametric memory |
| Gemini (Google) | Gemini 3 | January 2025 | More recent cutoff, newer Common Crawl data included |
Common Crawl releases new WebGraph snapshots every month. The data is relatively fresh. But AI models take time to absorb new training data. If you are building authority now, the benefits may appear in future model versions rather than current ones.
This is why some practitioners are treating Common Crawl authority as a long-term investment. The links you build today and the position you establish in the web graph could influence how AI models perceive your brand for years.
How To Check Your Common Crawl Authority
There is a free tool at webgraph.metehan.ai that makes Common Crawl's WebGraph data accessible. You can check any domain's Harmonic Centrality rank and PageRank across five time periods from 2023 to 2025.

The tool indexes approximately 18 million domains: the top 10 million per time period by HC rank. If your domain does not appear, it likely ranks below the top 10 million in all indexed periods.
You can also verify whether Common Crawl is actually crawling your site. The Common Crawl Index Server at index.commoncrawl.org lets you search any URL pattern against their crawl archives. Enter your domain and see which pages have been captured.
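The same lookup can be scripted. The sketch below builds a query URL for the index server's CDX-style endpoint and parses the newline-delimited JSON it returns. The crawl ID `CC-MAIN-2025-05` is only a placeholder (current IDs are listed at index.commoncrawl.org/collinfo.json), and the helper names and sample response lines are hypothetical. `fetch_captures` is defined but not called here, so nothing touches the network until you run it yourself.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, domain, limit=50):
    """Build a query URL matching every captured page under a domain."""
    params = {"url": f"{domain}/*", "output": "json", "limit": limit}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_captures(body):
    """Parse a newline-delimited JSON response into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_captures(crawl_id, domain, limit=50):
    """Run the query over the network and return the parsed captures."""
    from urllib.request import urlopen

    query = build_index_query(crawl_id, domain, limit)
    with urlopen(query, timeout=30) as response:
        return parse_captures(response.read().decode("utf-8"))


# Made-up response lines, shaped like the index server's JSON output.
sample_body = (
    '{"url": "https://example.com/", "status": "200", "timestamp": "20250115000000"}\n'
    '{"url": "https://example.com/pricing", "status": "200", "timestamp": "20250116000000"}\n'
)
captures = parse_captures(sample_body)
distinct_urls = {capture["url"] for capture in captures}

print(build_index_query("CC-MAIN-2025-05", "example.com"))
print(len(captures), "captures,", len(distinct_urls), "distinct URLs")
```

Counting distinct URLs per crawl gives a rough view of how much of a site each monthly archive actually holds.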
What the tier rankings mean
The tool shows tier badges based on your domain's HC rank. Lower is better.
- Elite means Top 100. These domains get crawled most frequently and have maximum representation in AI training data.
- Top 1K means position 101 to 1,000. Very high crawl priority.
- Top 10K means position 1,001 to 10,000. Strong crawl priority.
- Top 100K means position 10,001 to 100,000. Moderate crawl priority.
- Top 1M means position 100,001 to 1,000,000. Limited crawl priority.
- Long Tail means position 1,000,001 and beyond. Minimal crawl priority.
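The tier boundaries above translate into a small lookup. This is a sketch of the thresholds as listed, not the tool's own code, and the `hc_tier` function name is hypothetical.

```python
# Upper bound of each tier (inclusive), in ascending order.
TIERS = [
    (100, "Elite"),
    (1_000, "Top 1K"),
    (10_000, "Top 10K"),
    (100_000, "Top 100K"),
    (1_000_000, "Top 1M"),
]


def hc_tier(rank):
    """Map an HC rank (1 is best) to its tier label."""
    if rank < 1:
        raise ValueError("HC rank starts at 1")
    for upper_bound, label in TIERS:
        if rank <= upper_bound:
            return label
    return "Long Tail"


for example_rank in (14, 2_500, 1_000_001):
    print(example_rank, hc_tier(example_rank))
```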
If your domain sits in the long tail, you have what Metehan calls an "invisible ceiling." Your content could be excellent. Your freshness could be perfect. But Common Crawl captures less of your site because your HC rank is too high. Less content in training data means lower AI share of voice. You can track your AI search visibility to see if your HC rank correlates with citation performance.
What This Means For AI Optimization
Content quality, freshness, and structure still matter for AI visibility. But if you want to improve your AI share of voice, you need to optimize for Harmonic Centrality too. A lower HC rank means Common Crawl captures more of your content. More content in training data means AI models know your brand better.
Link building strategy shifts
Traditional link building rewarded volume. More backlinks meant more authority. For AI visibility, where your links come from matters more than how many you have. Getting linked by a well-connected domain improves your position in the web graph. Getting linked by an isolated domain does almost nothing. One link from a site embedded deep in the web's core can move your HC rank more than a hundred links from sites on the outskirts.
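A toy calculation shows why link source can outweigh link count under harmonic centrality. In this sketch, site A earns one link from a hub that 300 domains link to, while site B earns 100 links from domains nobody links to. The graph, the numbers, and the `harmonic_score` helper are all hypothetical, and real web graphs are far messier.

```python
from collections import deque


def harmonic_score(inlinks, target):
    """Sum of 1/distance over every domain that can reach `target` via links."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        # Walk incoming links backwards, one hop at a time.
        for source in inlinks.get(current, []):
            if source not in dist:
                dist[source] = dist[current] + 1
                queue.append(source)
    return sum(1.0 / d for d in dist.values() if d > 0)


# Hypothetical graph stored as target -> list of domains linking to it.
inlinks = {
    # A core hub that 300 other domains link to.
    "hub.example": [f"site{i}.example" for i in range(300)],
    # Site A earns a single link, from the hub.
    "a.example": ["hub.example"],
    # Site B earns 100 links, all from domains nobody links to.
    "b.example": [f"orphan{i}.example" for i in range(100)],
}

score_a = harmonic_score(inlinks, "a.example")  # 1 + 300 * (1/2)
score_b = harmonic_score(inlinks, "b.example")  # 100 * 1
print("one hub link:", score_a)
print("100 outskirt links:", score_b)
```

Site A scores 151 against site B's 100. The result depends on how well connected the hub is. In this setup the single link only wins once the hub has roughly 200 or more inbound domains, which is why the position of the linking site is worth checking before chasing the link.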
When evaluating potential link sources, consider their position in Common Crawl's WebGraph. Links from well-connected domains improve your HC rank. A better HC rank means more crawl priority. More crawl priority means more of your content in AI training data.
Tracking a new metric
Tools like Semrush, Ahrefs, and Moz do not currently track Harmonic Centrality. Stephen Burns from Common Crawl predicted that platforms like Semrush, Ahrefs, Profound, and LLMrefs will integrate HC data in the near future. LLMrefs has already started development work to bring Harmonic Centrality into our AI visibility tracking analytics. When optimizing for AI becomes as routine as optimizing for Google, understanding your position in the web's link topology will become standard practice.
Long-term authority building
If AI models take months or years to incorporate new training data, authority building becomes a long-term play. The work you do now to improve your web graph position could influence AI responses for years. This is different from traditional SEO where results can appear within weeks.
Do not ignore it entirely
If your domain is stuck in Common Crawl's long tail despite having strong content, your HC rank may be the problem. A high HC rank means less crawl priority. Less crawl priority means less content in training data. Less content in training data means lower AI share of voice.
The data is now accessible. Check your HC rank. Compare it to competitors who are getting cited more often. If there is a gap, you have found a lever to pull.
Key Takeaways For AI Optimization
Here is what you need to know about Common Crawl and Harmonic Centrality.
What Common Crawl does:
- Archives billions of web pages every month
- Provides foundational training data for most major LLMs (64% of the 47 models in the Mozilla Foundation analysis)
- Uses Harmonic Centrality to prioritize which domains get crawled most frequently
What Harmonic Centrality measures:
- Your domain's position in the web graph topology
- How "close" you are to other well-connected domains
- Your crawl priority in Common Crawl's monthly archives
The HC to AI recommendation chain:
- Lower HC rank = more frequent indexing in Common Crawl
- More indexing = LLMs train on more of your content
- More training data = AI models understand and trust your brand
- More trust = AI responses more likely to recommend you
How to improve your HC rank:
- Build links from domains with low HC ranks (well-connected sites)
- Focus on link topology, not just volume
- Treat it as a long-term investment since AI models have knowledge cutoffs
What HC does not measure:
- Real-time AI citations (you need online visibility tracking for that)
- Content quality or freshness
- Whether you are currently being mentioned in ChatGPT responses
From Index And Rank To Train And Retrieve
Stephen Burns from Common Crawl described this shift well. The old model was "index and rank." The new model is "train and retrieve." If you are not in the crawl, you cannot be in the model.
Common Crawl is the foundation of AI training data. Harmonic Centrality determines how much of your content makes it into that foundation. A lower HC rank means more frequent indexing. More indexing means more of your content in training data. More training data means AI models understand and trust your brand. Trust means you are more likely to be recommended in AI responses.
But remember the two halves. HC measures your offline visibility in training data. It does not tell you if ChatGPT is citing your content right now in live responses. You still need end-to-end AI visibility tracking for the online half.
This metric is measurable. It is trackable. And as AI search grows, optimizing for Harmonic Centrality may become as important as optimizing for PageRank was a decade ago. But it is one piece of the puzzle, not the whole picture.
The question is not whether training data matters. It clearly does. The question is how much you can influence your position. Now you have the data to start finding out.