How to Find All Pages on a Website: A Complete Guide
Written by LLMrefs Team • Last updated April 15, 2026
You’re usually asked to find all pages on a website when something already feels off.
A client says the site has “about a thousand pages,” but the crawl finishes suspiciously fast. Search Console shows fewer URLs than expected. Analytics includes landing pages nobody mentioned. The sitemap looks tidy, but it’s too tidy. That’s the point where a technical audit stops being a checklist and becomes an investigation.
If you want to find all pages on a website, you need more than one source of truth. A crawler shows what’s discoverable through links. A sitemap shows what the site owner wants search engines to see. Logs and analytics show what was requested. AI answer engines add another layer, because some URLs matter even when they aren’t prominent in Google’s visible index. A complete inventory is no longer just an SEO housekeeping task. It’s the basis for crawl diagnostics, content pruning, migration planning, internal linking, and AI visibility analysis.
Why Finding Every Page Is Your First Step to SEO Mastery
Most SEO problems hide inside incomplete inventories.
A team thinks it has a clean site architecture, but the site itself includes old campaign pages, faceted URLs, forgotten blog tag archives, PDFs, and thin pages no one has touched in years. If those URLs exist, they affect crawl paths, internal equity, duplication, reporting, and the way search systems understand the site.

The page count problem is usually bigger than expected
A page inventory isn’t just a count. It’s a map of what the business has published, what search engines can reach, and what users can still access.
That distinction matters because “live” and “indexed” are not the same thing. Traditional workflows often stop at sitemaps, site:domain.com, or Search Console. That’s too narrow if your goal is a genuine audit.
Use the inventory to answer questions like these:
- What exists: Public HTML pages, PDFs, older content hubs, staging leftovers, and parameter variants.
- What’s discoverable through internal links: The URLs a standard crawler can reach from the starting point.
- What search engines have likely seen: Indexed URLs and crawled URLs are helpful, but they’re still only part of the picture.
- What deserves optimization: Some URLs exist but shouldn’t. Others should exist in your strategy but are buried or orphaned.
Practical rule: If you don’t know the full URL inventory, you can’t trust your recommendations on internal linking, indexation, consolidation, or migration scope.
AI visibility changes what “all pages” means
Here, many guides stop too early.
Most content about page discovery focuses on classic SEO tools. It rarely addresses pages that matter for AI answer engines, especially pages referenced in citations, surfaced through conversational retrieval, or left out of official sitemap files. That gap matters because LLMrefs data shows AI engines cite external sources 40 to 60 percent more frequently than traditional SERPs in conversational queries (SE Ranking).
That changes the workflow. “All pages” now includes pages that contribute to visibility in ChatGPT, Perplexity, Gemini, Claude, and AI Overviews, even if those URLs aren’t obvious in a Google-first audit.
If you need a plain-language refresher on how indexed visibility differs from total accessible content, this overview of Google Indexing is a useful supporting reference before you start reconciling counts.
What a complete inventory lets you do
Once the inventory is accurate, decisions get sharper.
You can identify sections that need stronger internal links, isolate pages that should be canonicalized or removed, and spot assets that deserve inclusion in AI-focused workflows. That’s also the handoff point into GEO work, because citation-worthy pages are often hidden in long-tail documentation, old comparison pages, or support content that never made it into a neat sitemap.
Starting Your Search with Crawlers and Sitemaps
A full site inventory starts with the sources you control. Crawl the site yourself, then compare that crawl with every sitemap the site publishes. That gives you two different views of the URL set. One shows what the site exposes through links and rendering. The other shows what the site owner intended search engines to find.

Run a proper crawler first
For most technical SEO audits, Screaming Frog is still the fastest way to build the first working URL list.
The catch is configuration. A default crawl often overcollects assets and undercollects pages that depend on JavaScript, faceted navigation, or deeper crawl paths. On a modern site, that creates a false sense of coverage. You get a file full of URLs, but not a page inventory you can trust.
Use a setup that reflects the site you are auditing:
- Start at the correct scope. Crawl the root domain if you need the whole site. Crawl the exact subfolder or subdomain if the audit is limited to a specific property.
- Choose a crawl depth on purpose. Shallow settings miss buried pages. Extremely deep settings can waste time on traps, calendars, or parameter loops.
- Limit resource noise. Excluding CSS, JS, and other non-page assets keeps the export focused on documents that can rank, get cited, or create duplication issues.
- Turn on JavaScript rendering when the site needs it. Many internal links, product grids, and support hubs only appear after rendering.
- Export internal HTML URLs separately from the full crawl file. That export is usually the right starting point for reconciliation.
This first crawl is not the final truth. It is the observed structure of the site at the time you ran it.
That distinction matters for AI visibility too. If a page is only reachable after rendering, hidden three states deep in a client-side app, or poorly linked from the main architecture, traditional crawlers and AI retrieval systems may both have trouble reaching it. If AI discovery is part of the audit scope, run an extra pass with the AI crawl checker to confirm whether AI bots can access the same site structure your standard crawl is surfacing.
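If you want a scriptable second opinion next to the Screaming Frog export, a minimal same-host crawl is often enough to sanity-check coverage on a small, mostly static section. The sketch below is an illustration under assumptions, not a production crawler: it only follows plain `<a href>` links, ignores robots.txt and JavaScript rendering, and the start URL and page limit are placeholders.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"   # placeholder start point
MAX_PAGES = 500                          # keep the sanity check small

def crawl_internal_html(start_url, max_pages=MAX_PAGES):
    """Breadth-first crawl of same-host HTML pages reachable through <a href> links."""
    host = urlparse(start_url).netloc
    seen, queue, found = {start_url}, deque([start_url]), []

    while queue and len(found) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable URLs simply never make it into the list

        if resp.status_code != 200 or "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # only 200 HTML documents belong in this working list

        found.append(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # resolve relative links, drop fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return found

if __name__ == "__main__":
    for page in crawl_internal_html(START_URL):
        print(page)
```

Because this pass never renders JavaScript, the gap between its output and your rendered crawl is itself useful evidence about how much of the site depends on client-side links.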
Clean the crawl before you trust it
The raw export is a working set, not a final page list.
Review the crawl with page classification in mind. Separate true HTML documents from redirects, canonical duplicates, soft 404s, and assets. Keep redirecting URLs and canonicalized duplicates in your worksheet because they explain how the site behaves. Do not treat them as primary pages in the final inventory unless the audit has a specific reason to track alternates.
A simple review table keeps the process consistent:
| Crawl output type | Keep in working inventory | Keep in final page list |
|---|---|---|
| 200 HTML page | Yes | Yes |
| Redirecting URL | Yes | No |
| Canonicalized duplicate | Yes | Usually no |
| CSS or JS file | No | No |
| PDF or document asset | Depends on audit scope | Depends on audit scope |
I also check internal inlinks early. A page with one weak internal link is different from a page linked sitewide. Both may be live, but they carry very different discovery risk for Google and for AI systems that rely on crawlable pathways and repeated references.
Pull every sitemap you can find
After the crawl, gather all XML sitemaps tied to the property.
Start with robots.txt, then open the sitemap index and each child sitemap. On larger sites, this usually means separate files for products, categories, articles, images, news, videos, or CMS-specific content types. Do not stop at the first sitemap.xml file you find. Enterprise CMS setups often publish multiple sitemap groups, and old sitemap references can linger after migrations.
Sitemaps help because they show declared intent. They often surface URLs that are weakly linked, recently published, or isolated in templates your crawler did not hit on the first run. They also reveal governance problems fast. Important sections missing from the sitemap are usually not random. They point to broken generation rules, publishing workflow gaps, or content the site no longer treats as strategically important.
Sitemaps also have limits:
- They list submitted URLs, not proven discoveries.
- They do not confirm indexation.
- They often omit legacy or orphaned content after redesigns and platform changes.
- They may include URLs that should not be in a clean primary inventory.
A sitemap is a declared inventory. A crawl is an observed inventory. Use both, then investigate the mismatch.
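Collecting the declared inventory can also be scripted: read robots.txt, pull every Sitemap: reference, then expand sitemap indexes into their child files. This is a minimal sketch that assumes plain XML sitemaps; gzipped files and sitemaps submitted only through Search Console would need extra handling, and the domain is a placeholder.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemaps_from_robots(site_root):
    """Collect every Sitemap: line declared in robots.txt."""
    robots = requests.get(urljoin(site_root, "/robots.txt"), timeout=10).text
    return [line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")]

def expand_sitemap(url, collected):
    """Recurse through sitemap indexes and collect <loc> values from urlsets."""
    root = ET.fromstring(requests.get(url, timeout=10).content)
    if root.tag.endswith("sitemapindex"):
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            expand_sitemap(loc.text.strip(), collected)
    else:
        for loc in root.findall("sm:url/sm:loc", NS):
            collected.add(loc.text.strip())

if __name__ == "__main__":
    declared = set()
    for sitemap_url in sitemaps_from_robots("https://www.example.com/"):  # placeholder domain
        expand_sitemap(sitemap_url, declared)
    print(f"{len(declared)} URLs declared across all sitemaps")
```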
Use the site operator carefully
The site: operator is still useful as a quick sense check, especially for spotting strange indexed folders, old staging paths, tag archives, or parameter patterns that deserve a closer look.
It is not reliable enough to count total pages. Google does not present site: results as a full index export, and the result estimates can shift for reasons that have nothing to do with the actual number of live URLs. Use it to identify patterns worth reviewing, not to settle page counts.
That matters even more if your end goal includes GEO. A page can be absent from a sitemap, buried in internal linking, or inconsistently indexed and still get cited by AI systems through secondary discovery paths, historical retrieval, or external references. The site operator will not give you that full picture.
What this first pass should produce
By the end of this stage, you should have four concrete outputs:
- A crawler export of internal HTML URLs
- A collected set of XML sitemap URLs and child sitemap files
- A short list of patterns spotted through the site: operator
- A note on whether JavaScript rendering changed URL discovery
That baseline is usually enough to expose the first gaps. It will not catch every orphan, deprecated URL, or externally discovered page. It does give you the foundation you need before you bring in logs, analytics, and external data sources.
Digging Deeper with Logs, Analytics, and Third-Party Tools
A crawl gives you the site structure the crawler could reach. Discovery work gets more interesting when you compare that version of the site against requested URLs, visited URLs, and externally referenced URLs.
That comparison is where missing pages show up.
A URL can sit outside the current internal link graph and still matter. Googlebot may still request it. Users may still land on it from bookmarks, old SERP listings, partner links, or copied URLs in Slack threads and docs. AI answer engines can also surface pages through cached references, link-based discovery, or third-party citations even when your current sitemap and nav no longer expose them cleanly.

Why a crawl misses part of the site
In practice, crawlers miss pages for predictable reasons. Internal links may be weak, JavaScript events may hide paths, legacy URLs may no longer be linked, and some sections may only be reachable through search, forms, feeds, or old campaign paths. A crawler is still the right starting point, but it is one source, not the source.
I treat every source after the crawl as a challenge set. If logs, analytics, or third-party indexes contain URLs the crawler missed, that gap usually points to one of four problems: poor discoverability, weak governance, stale architecture, or intentional hiding that nobody documented.
Server logs show requested URLs at the edge
If you can access server logs, start there.
Logs record what bots and users requested from the server. That makes them one of the few sources that can expose URLs without depending on your crawl path, sitemap quality, or analytics setup. They are messy, but they are often the fastest way to find orphaned content, retired sections that still receive demand, and parameter patterns that waste crawl budget.
Review logs for patterns like these:
- URLs requested by Googlebot or other major crawlers but missing from your crawl exports
- Legacy directories that still receive user or bot traffic
- Parameter variants that generate repeated requests for near-duplicate content
- Non-HTML files that attract search or AI retrieval behavior
- Unexpected status code patterns on old URLs, such as repeated 301, 404, or 410 responses
One caution matters here. Logs show requests, not page quality. A noisy URL can appear important because it gets hit often, when in reality it is a broken parameter loop or bot trap. Normalize hosts, protocols, trailing slashes, and parameters before you draw conclusions.
Field note: When a URL appears only in logs, I verify it before I classify it. Those one-source URLs often uncover the biggest indexing, migration, and content governance problems.
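A first pass over an access log can be as simple as pulling the request path and user agent out of each line and keeping anything Googlebot requested that your crawl never surfaced. The regex below assumes a combined log format, the file names are placeholders, and user agents can be spoofed, so treat matches as leads to verify rather than confirmed Googlebot activity.

```python
import re
from urllib.parse import urlsplit

# Rough pattern for a combined-format access log line (an assumption; adjust to your logs).
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(log_path):
    """Return paths requested by a Googlebot user agent, with the status codes observed."""
    hits = {}
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LOG_LINE.match(line)
            if not m or "Googlebot" not in m.group("agent"):
                continue
            path = urlsplit(m.group("path")).path  # drop query strings for a first grouping
            hits.setdefault(path, set()).add(m.group("status"))
    return hits

if __name__ == "__main__":
    with open("crawl_paths.txt", encoding="utf-8") as fh:        # placeholder crawl export
        crawled = {line.strip() for line in fh if line.strip()}
    for path, statuses in sorted(googlebot_paths("access.log").items()):  # placeholder log file
        if path not in crawled:
            print(f"{path}  requested by Googlebot (statuses: {', '.join(sorted(statuses))}) but missing from crawl")
```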
Analytics helps you find pages people still reach
Analytics is weaker than logs for raw discovery, but stronger for prioritization.
Export landing pages and page paths over a date range long enough to catch seasonality and old campaign residue. This surfaces URLs users still reach through email, referrals, social shares, saved links, partner mentions, and old search results. It also helps separate dormant clutter from pages that still carry business value.
A few rules keep this useful:
- Use landing pages first. They reveal URLs people entered the site through, which is more useful for discovery than total pageview paths.
- Expect undercounting. Analytics will not show URLs with no visits, blocked scripts, consent suppression, or broken tagging.
- Flag hidden revenue pages. A URL with traffic or conversions deserves review even if the site no longer links to it well.
If you want a practical process for reviewing those exports before you merge them into the master inventory, this guide on how to analyze website traffic is a useful reference.
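If the landing page export is a CSV of paths and sessions, a short script can aggregate it and flag URLs the crawl never reached, which is usually where the hidden revenue pages show up. The column names and file names below are placeholders; match them to whatever your analytics platform actually exports.

```python
import csv

def landing_pages_not_in_crawl(analytics_csv, crawled_paths, min_sessions=1):
    """Aggregate sessions per landing path and return paths missing from the crawl export."""
    sessions = {}
    with open(analytics_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            path = row["landing_page"].strip()                     # placeholder column name
            sessions[path] = sessions.get(path, 0) + int(row["sessions"] or 0)
    return sorted(
        ((path, count) for path, count in sessions.items()
         if count >= min_sessions and path not in crawled_paths),
        key=lambda item: item[1],
        reverse=True,
    )

if __name__ == "__main__":
    with open("crawl_paths.txt", encoding="utf-8") as fh:          # placeholder crawl export
        crawled = {line.strip() for line in fh if line.strip()}
    for path, count in landing_pages_not_in_crawl("landing_pages.csv", crawled):
        print(f"{count:>6} sessions  {path}")
```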
Third-party tools find URLs your site forgot
External indexes add a different kind of evidence. Backlink databases, historical crawlers, and URL index tools often surface pages the current site no longer supports internally.
This matters most on older domains. Product pages, campaign microsites, comparison pages, PDF assets, and deprecated blog posts often keep attracting links long after the CMS, nav, and sitemap moved on. Those URLs may still influence branded search behavior, link equity distribution, and AI retrieval because they continue to exist in the wider web ecosystem.
Compare what each source can and cannot find before you decide which gaps to chase:
| Source | What it often finds | What it misses |
|---|---|---|
| Server logs | Requested URLs from bots and users | URLs with no recent requests |
| Analytics | Pages with visits | Pages with zero traffic or broken tracking |
| Backlink tools | Legacy or orphaned URLs with external references | Unlinked URLs with no external signals |
| Historical URL indexes | Older pages removed from current architecture | Fresh pages not yet discovered externally |
The goal at this stage is not to pile up more URLs for the sake of volume. The goal is to find contradictions between data sources, explain why those contradictions exist, and decide which URLs still matter for search, operations, and AI visibility.
That is how you move from “pages we can crawl” to “pages that can still be discovered.”
Navigating Parameters, Pagination, and Canonicalization
Raw URL exports overstate site size fast.
The usual cause is not bad crawling. It is URL variation. A single template can generate hundreds or thousands of addresses through filters, sort orders, tracking tags, session IDs, pagination states, and inconsistent canonical handling. If those variants stay mixed together, page counts drift, crawl waste looks smaller than it is, and AI-focused discovery work gets distorted because answer engines can surface URLs Google would never treat as primary.
Parameter handling needs rules before cleanup starts
Parameters are not just noise. They are clues.
Some show how marketing systems tag traffic. Some expose faceted navigation that search engines can crawl. Some create useful landing pages. Others produce duplicate or near-duplicate states with no standalone value. The job is to separate operationally important URLs from URLs that deserve a place in the final primary inventory.
Use a policy that reflects the site type and audit goal:
- Strip tracking parameters from the primary URL list. UTM tags, click IDs, and similar values usually describe the same document.
- Review filter and sort parameters by template, not one URL at a time. On ecommerce sites, color or size filters may expose meaningful inventory states. Sort orders usually do not.
- Keep parameterized URLs in a diagnostic tab if they affect crawling, rendering, or internal linking. Excluding them too early hides waste.
- Treat internal search result URLs carefully. They often create thin pages, but they can still be publicly accessible and retrievable by AI systems if left open.
I usually split this work into two outputs. One list answers, "What URLs exist and can be discovered?" The other answers, "What URLs count as preferred indexable pages?" That distinction keeps the analysis honest.
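That two-output split is easy to enforce in code: strip known tracking parameters for the primary list and route anything that still carries filter, sort, or session parameters into a diagnostics set. The parameter names below are common conventions, not a complete list, so extend them to match the site's own tagging.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Common tracking parameters (an assumption; extend for the platforms the site actually uses).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid", "msclkid"}

def split_by_parameter_policy(urls):
    """Return (primary, diagnostic): tracking params stripped everywhere,
    URLs that still carry other parameters kept aside for review."""
    primary, diagnostic = set(), set()
    for url in urls:
        parts = urlsplit(url)
        params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                  if k.lower() not in TRACKING_PARAMS]
        cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path,
                              urlencode(params), ""))   # drop fragments too
        primary.add(cleaned)
        if params:                                       # filters, sorts, session IDs remain
            diagnostic.add(url)
    return primary, diagnostic

if __name__ == "__main__":
    sample = [
        "https://www.example.com/shoes/?utm_source=mail&utm_campaign=spring",
        "https://www.example.com/shoes/?color=red&sort=price_asc",
    ]
    primary, diagnostic = split_by_parameter_policy(sample)
    print(sorted(primary))
    print(sorted(diagnostic))
```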
Pagination deserves its own review
Pagination errors rarely show up in a top-level crawl summary.
They appear in long category chains, blog archives, forum threads, and product listings where page 1 is linked clearly and deeper pages are weakly linked, loaded after interaction, or canonicalized badly. Analysts who stop at the first visible layer miss inventory that still consumes crawl budget, attracts links, or feeds retrieval systems.
Check pagination with a short QA sequence:
- Confirm the crawler reached deep enough into each series.
- Render JavaScript if page links or "load more" controls depend on it.
- Inspect canonicals on paginated URLs individually.
- Record whether each paginated page is indexable, canonicalized away, or effectively orphaned.
Paginated URLs often belong in the working inventory even when they are not targets for ranking. They are still public documents. They shape internal link flow, expose products or articles beyond page 1, and can become the version another system cites.
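A quick way to confirm the crawler reached deep into each series is to group paginated URLs by their base path and report the deepest page number found. The sketch below assumes /page/N/ or ?page=N style pagination, which is only an assumption; adjust the patterns to the templates the site actually uses and cross-check the results against canonicals from the crawl export.

```python
import re
from urllib.parse import urlsplit, parse_qs

# One common pagination convention (an assumption; match it to the real templates).
PATH_PAGE = re.compile(r"^(?P<base>.*?)/page/(?P<num>\d+)/?$")

def deepest_pages(urls):
    """Group paginated URLs by series and return the highest page number seen per series."""
    deepest = {}
    for url in urls:
        parts = urlsplit(url)
        qs = parse_qs(parts.query)
        m = PATH_PAGE.match(parts.path)
        if m:
            base, num = m.group("base") or "/", int(m.group("num"))
        elif qs.get("page", [""])[0].isdigit():          # ?page=N style
            base, num = parts.path, int(qs["page"][0])
        else:
            continue
        deepest[base] = max(deepest.get(base, 1), num)
    return deepest

if __name__ == "__main__":
    with open("crawl_urls.txt", encoding="utf-8") as fh:  # placeholder export
        crawled = [line.strip() for line in fh if line.strip()]
    for base, num in sorted(deepest_pages(crawled).items()):
        print(f"{base}  deepest page reached: {num}")
```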
Canonicalization separates existence from preference
A URL can exist, return 200, and still be the wrong version to count as primary.
That is the practical value of canonical analysis. It tells you which URL the site declares as preferred, and it reveals where that declaration conflicts with redirects, internal links, or actual content uniqueness. On large ecommerce sites, inflated page counts usually stem from these conflicts. On publishers and marketplaces, similar issues obscure duplicate archives and parameter states.
Use canonicals in four buckets:
- Self-canonical, indexable URL. Usually keep in the final primary set.
- URL canonicalized to another equivalent page. Keep for diagnostics, but exclude from the primary count.
- Duplicate or near-duplicate URL with no usable canonical. Investigate the template or CMS rule.
- Paginated or filtered URL canonicalized to page 1 or a parent category without justification. Flag it. That pattern often suppresses valid discovery paths.
Filter to HTML documents while you do this. Assets, feeds, scripts, and PDFs may matter elsewhere, but they should not inflate the count of crawlable HTML pages unless your audit scope explicitly includes them. If you need a broader QA pass before this step, use a website auditing checklist for technical SEO reviews.
Canonicals reduce duplication on paper. They do not fix weak architecture, poor internal linking, or uncontrolled URL generation.
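Those four buckets translate directly into a small classification pass over the crawl export. The column names here (address, status, canonical) are placeholders, so map them to whatever your crawler exports, and treat the pagination check as one example pattern rather than a complete rule set.

```python
import csv

def canonical_bucket(row):
    """Assign one of the four canonical buckets to a crawl-export row."""
    url = row["address"].rstrip("/")
    canonical = (row.get("canonical") or "").rstrip("/")
    if not canonical:
        return "no usable canonical: investigate the template or CMS rule"
    if canonical == url:
        return "self-canonical: candidate for the final primary set"
    if "/page/" in url or "?page=" in row["address"]:
        return "paginated URL canonicalized elsewhere: flag for review"
    return "canonicalized to another URL: keep for diagnostics only"

if __name__ == "__main__":
    with open("crawl_export.csv", newline="", encoding="utf-8") as fh:  # placeholder file
        for row in csv.DictReader(fh):
            if row.get("status") == "200":   # only classify live HTML rows at this stage
                print(canonical_bucket(row), row["address"])
```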
JavaScript can hide both pagination and canonical signals
Modern frameworks complicate discovery because the initial HTML often does not show the full link graph. Filters, product grids, archive pages, and even canonical tags may be injected after rendering. A non-rendered crawl can miss deep URLs entirely or misread how the site declares preferred versions.
That matters for SEO. It also matters for GEO. AI answer engines and retrieval systems may fetch rendered and non-rendered states differently, which means hidden links or inconsistent canonicals can produce partial inventories and unexpected citations.
Run a rendered crawl on any site where navigation, filtering, or listing content depends on JavaScript. Then compare the rendered and raw exports. The gap between those two files usually tells you where discovery breaks.
How to Consolidate and Verify Your Final URL List
At this point, you should have multiple exports. Crawl files. Sitemap URLs. Log-derived URLs. Analytics landing pages. Maybe a list from a custom crawler or a backlink platform.
The next job is synthesis. You’re building one master URL set that’s deduplicated, labeled, and verified.

Build a reconciliation sheet
Create a master spreadsheet with one row per normalized URL.
Then add columns for each source. Mark whether the URL appeared in the crawl, in a sitemap, in logs, in analytics, and in any third-party export. Add technical columns for status code, canonical target, content type, and notes.
A simple structure works well:
| URL | Crawl | Sitemap | Logs | Analytics | Status | Canonical | Final keep |
|---|---|---|---|---|---|---|---|
| /example-page/ | Yes | Yes | Yes | Yes | 200 | Self | Yes |
This immediately shows where the disagreements are. Those disagreements matter more than the easy rows.
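Building that sheet is mostly set bookkeeping. A sketch like the one below takes one URL set per source (normalized as described in the next step) and writes a row per URL with a Yes/No flag for each source, which is enough to start sorting by disagreement. The file names are placeholders for your own exports.

```python
import csv

def load(path):
    """One normalized URL per line (placeholder exports from the earlier steps)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def build_reconciliation_sheet(sources, out_path="reconciliation.csv"):
    """Write one row per URL with a Yes/No column for every source it appeared in."""
    all_urls = sorted(set().union(*sources.values()))
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["url", *sources.keys(), "source_count"])
        for url in all_urls:
            flags = ["Yes" if url in urls else "No" for urls in sources.values()]
            writer.writerow([url, *flags, flags.count("Yes")])

if __name__ == "__main__":
    build_reconciliation_sheet({
        "crawl": load("crawl_urls.txt"),
        "sitemap": load("sitemap_urls.txt"),
        "logs": load("log_urls.txt"),
        "analytics": load("analytics_urls.txt"),
    })
```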
Normalize before deduplicating
Deduplication only works if the URL formatting is consistent.
Standardize protocol, host casing, trailing slash treatment, and removable tracking parameters. If your exports contain both absolute and relative paths, convert them into one format. If subdomains matter, keep them distinct on purpose rather than merging them accidentally.
A practical sequence is:
- Trim tracking parameters first
- Standardize host and path format
- Convert relative URLs to full URLs
- Deduplicate on the normalized field, not the raw export
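In code, that sequence collapses into one normalization function applied to every URL before deduplication. The rules below (lowercase host, strip default ports, drop tracking parameters, force a trailing slash on extensionless paths) are one reasonable policy rather than the only correct one; pick rules that match how the site actually serves URLs, then dedupe on the normalized value.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}  # extend to match the site's tagging

def normalize(url, base="https://www.example.com/"):
    """Normalize a single URL so deduplication compares like with like."""
    url = urljoin(base, url)                            # convert relative paths to full URLs
    parts = urlsplit(url)
    host = parts.netloc.lower().removesuffix(":80").removesuffix(":443")
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                       if k.lower() not in TRACKING_PARAMS])
    path = parts.path or "/"
    if "." not in path.rsplit("/", 1)[-1] and not path.endswith("/"):
        path += "/"                                     # one trailing-slash policy, applied everywhere
    return urlunsplit((parts.scheme.lower() or "https", host, path, query, ""))

if __name__ == "__main__":
    raw = [
        "https://WWW.Example.com/shoes?utm_source=mail",
        "/shoes/",
        "https://www.example.com:443/shoes",
    ]
    print(sorted({normalize(u) for u in raw}))  # all three collapse to one normalized URL
```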
Verify the URLs, don’t just merge them
A URL appearing in a source doesn’t guarantee it’s still live.
Run the combined list through a bulk status check or a second-pass crawl to confirm response codes. Separate live pages from redirects, errors, and soft-dead remnants. This pass often reveals that many “mystery URLs” are simply old migration leftovers.
A custom crawler can help in edge cases. According to IPRoyal, a Python crawler using BeautifulSoup can achieve 95 to 100% coverage on static sites, but it may miss 20 to 40% of orphan pages and 70% of JavaScript-rendered content without a headless browser. The same source recommends combining crawler output with sitemap parsing and exporting to CSV for reconciliation (IPRoyal).
That mirrors the practical audit workflow. No single export is final. The CSV reconciliation step is where the actual inventory gets built.
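The bulk status check itself does not need a full crawler. A short pass of HEAD requests, falling back to GET where HEAD is refused, is enough to split the merged list into live pages, redirects, and dead URLs. The file name, timeout, and thread count below are placeholders; keep concurrency low on shared hosting.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def check(url):
    """Return (url, status, final_url) for one URL in the merged inventory."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code in (405, 501):               # some servers refuse HEAD
            resp = requests.get(url, allow_redirects=True, timeout=10)
        return url, resp.status_code, resp.url
    except requests.RequestException as exc:
        return url, None, str(exc)

if __name__ == "__main__":
    with open("master_urls.txt", encoding="utf-8") as fh:  # placeholder merged list
        urls = [line.strip() for line in fh if line.strip()]
    with ThreadPoolExecutor(max_workers=5) as pool:        # keep concurrency polite
        for url, status, final in pool.map(check, urls):
            redirected = f"  ->  {final}" if status and final != url else ""
            print(f"{status}  {url}{redirected}")
```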
Label the final list by purpose
Don’t stop at one giant pile of URLs.
Split the final dataset into useful views:
- Primary HTML pages
- Canonical duplicates
- Redirecting legacy URLs
- Document assets
- Pages present in logs but absent from crawl
- Pages present in analytics but absent from sitemap
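If the reconciliation sheet already carries the source flags plus status and canonical columns added during verification, these views are just labels derived from those columns. The sketch below assumes the column names used in the earlier reconciliation example and is deliberately simplified; document assets, for instance, would need their own check on content type.

```python
import csv

def view_for(row):
    """Derive one audit view label from a reconciliation-sheet row."""
    if row["status"] in {"301", "302", "308"}:
        return "redirecting legacy URL"
    if row["canonical"] not in ("", "Self", row["url"]):
        return "canonical duplicate"
    if row["logs"] == "Yes" and row["crawl"] == "No":
        return "in logs, missing from crawl"
    if row["analytics"] == "Yes" and row["sitemap"] == "No":
        return "in analytics, missing from sitemap"
    return "primary HTML page" if row["status"] == "200" else "error or dead URL"

if __name__ == "__main__":
    with open("reconciliation.csv", newline="", encoding="utf-8") as fh:  # from the earlier sketch
        for row in csv.DictReader(fh):
            print(view_for(row), row["url"])
```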
That final segmentation is what turns a URL list into an audit asset. If you need a broader QA framework to pair with this reconciliation work, this website auditing checklist fits well alongside the final validation pass.
Your Complete Page Inventory: A Strategic Asset for SEO and AI
A full page inventory is not a reporting exercise. It’s operational infrastructure.
Once you know what exists, you can fix crawl barriers, prune waste, improve internal linking, plan migrations with less risk, and protect high-value URLs that would otherwise be missed. You also stop making decisions based on partial evidence, which is where many technical SEO projects go sideways.
The bigger shift is strategic. Search visibility no longer ends with Google’s visible index. Brands now need to understand which pages are available to AI systems, which pages get cited, and which useful assets stay buried because nobody connected crawl discovery with answer engine visibility. That’s where a complete inventory becomes the foundation for GEO.
A useful way to think about it is this:
- Discovery tells you what exists
- Verification tells you what’s live
- Reconciliation tells you what matters operationally
- AI citation analysis tells you which of those pages influence modern discovery
That workflow is why inventory quality has become a competitive advantage. It’s the same reason mature teams don’t rely on one report in paid media either. Even outside SEO, practitioners use layered inputs to make better decisions. For example, a specialist working through bid logic on marketplaces might use a tool-specific reference like this Amazon PPC bid calculator guide because rough estimates aren’t enough when budgets are involved. Site discovery works the same way. Better inputs produce better decisions.
If you’re auditing a site today, don’t stop when the crawler gives you a neat export. Keep pushing until the crawl, sitemap, logs, analytics, and URL verification all make sense together. That’s how you find all pages on a website in a way that supports real SEO work and modern AI visibility.
If you want to move from page discovery into AI visibility analysis, LLMrefs helps you inspect citations, track brand presence across answer engines, and connect your verified URL inventory to actual GEO opportunities.