Web Content Processing
When you provide a web page URL as a knowledge base source, ZipTier automatically processes the page through a multi-pass cleanup pipeline. This ensures your AI Assistant learns from meaningful content — not cookie banners, navigation menus, or boilerplate footers. This guide explains what content is retained, what is removed, and how to get the best results.
Why We Clean Web Content
Web pages contain far more than their main content. Navigation bars, cookie dialogs, footer links, and other structural elements can account for over half the text on a typical page. If ingested as-is, this noise degrades your AI Assistant's accuracy and wastes processing resources.
Improve Response Accuracy
By removing irrelevant content, the AI Assistant focuses on the information that actually matters to your prospects.
Minimize Knowledge Base Noise
Navigation menus, cookie policies, and footer links would clutter your knowledge base and confuse responses.
Reduce Processing Costs
Cleaner content means fewer tokens processed per conversation, so your message credits go further.
Prevent Irrelevant Citations
Without cleanup, the AI might cite cookie policies or reCAPTCHA notices when answering prospect questions.
What We Retain
The cleanup pipeline preserves all content that contributes to a rich, informative knowledge base for your AI Assistant.
| Content Type | Examples |
|---|---|
| Headings and body text | Page titles, section headings, paragraphs — the main content of the page |
| Actionable links with URLs | "Register now", "Download the whitepaper", "Learn more about pricing" |
| Inline links within sentences | "Our cloud migration guide covers the full process" |
| Descriptive link text (4+ words) | "View the complete case study" — long link text always preserves its URL |
| Code blocks and technical content | Code snippets, API examples, configuration samples |
| Tables and structured data | Pricing tables, feature comparisons, specification lists |
| List items with substantive content | Feature lists, step-by-step instructions, bullet-point descriptions |
What We Remove
The following elements are stripped during processing because they add noise without contributing meaningful knowledge.
| Content Type | Why It's Removed |
|---|---|
| Navigation menus and mega-menus | Repeated on every page; not part of the page's unique content |
| Site headers and footers | Site-wide banner and footer areas (not the page's own headings); copyright notices, legal links, and social media icons are boilerplate |
| Cookie consent dialogs | Legal compliance UI that has no informational value |
| Images, thumbnails, icons, banners | Visual elements that cannot be interpreted as text knowledge |
| Breadcrumb navigation | Site structure markers, not meaningful content |
| Form fields and dropdowns | Country selectors, search boxes, and input elements are interactive UI |
| Boilerplate text | Privacy notices, reCAPTCHA disclaimers, and standard legal copy |
| Author bio blocks | Repeated author cards that appear across multiple blog posts |
| Table of contents / sidebar sections | Navigation aids that duplicate heading structure |
| Duplicate content blocks | Repeated CTAs, testimonials, or sections that appear more than once |
| Garbled UI text | Rendered mockups or screenshots that produce nonsensical text when parsed |
| Orphaned UI labels | Standalone button text such as "Benefits" or "Pricing" with no surrounding context |
| Generic CTA buttons without URLs | "Get started", "Buy now" — no destination link, no informational value |
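As a rough illustration of this removal pass, the sketch below drops text found inside a few structural HTML elements and collapses repeated blocks. The tag list, the class name `MainTextExtractor`, and the function `extract_main_text` are assumptions made for this example; the actual pipeline's rules are not documented here and are likely more involved.

```python
from html.parser import HTMLParser

# Assumed mapping of the removed categories onto HTML tags (illustrative only).
SKIPPED_TAGS = {"nav", "header", "footer", "form", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that sits outside the skipped structural elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> list[str]:
    """Return retained text blocks in page order, dropping exact duplicates."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    seen = set()
    unique = []
    for chunk in parser.chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```

For a page whose markup is `<nav>…</nav>`, a heading, the same paragraph twice, and a `<footer>`, this sketch keeps only the heading and one copy of the paragraph.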
Link Handling
Links on web pages are handled selectively. The goal is to preserve URLs that point prospects to useful resources while stripping navigation-style links that add clutter.
How link decisions are made: The cleanup pipeline evaluates each link based on its text, position, and context within the page. Here's how different link types are handled:
| Link Type | URL Preserved? | Example |
|---|---|---|
| Actionable verb links | Yes | "Register for the webinar", "Download the PDF", "Visit our partner page" |
| Long link text (4+ words) | Yes | "View the complete migration case study" |
| Inline links (within a sentence) | Yes | "See our data security overview for details" |
| Short nav-style links (1–3 words, no action verb) | No | "About", "Blog", "Contact Us" |
| Image-only links | Yes (bare URL) | A logo linking to a partner page — the URL is kept even though no text exists |
Tip on Action Verbs: The action verbs that trigger URL preservation include go to, visit, register, sign up, download, learn more, read, view, explore, check out, get started with, apply, subscribe, join, and watch.
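The rules in the table and tip above can be sketched as a single decision function. This is a minimal sketch, not the pipeline's implementation: the function name `keeps_url`, its `inline` and `image_only` parameters, and the choice to match action verbs only at the start of the link text are all assumptions made for illustration.

```python
import re

# Verbs taken from the tip above.
ACTION_VERBS = (
    "go to", "visit", "register", "sign up", "download", "learn more",
    "read", "view", "explore", "check out", "get started with", "apply",
    "subscribe", "join", "watch",
)

# Assumption: a link counts as "actionable" when its text starts with one of the verbs.
_ACTION_RE = re.compile(
    r"^(?:" + "|".join(re.escape(verb) for verb in ACTION_VERBS) + r")\b",
    re.IGNORECASE,
)


def keeps_url(link_text: str, inline: bool = False, image_only: bool = False) -> bool:
    """Return True if the link's URL would be preserved under the rules above."""
    text = link_text.strip()
    if image_only:
        return True  # kept as a bare URL even though there is no text
    if inline:
        return True  # links inside a sentence keep their URL
    if len(text.split()) >= 4:
        return True  # long, descriptive link text
    # Short links (1 to 3 words) survive only when they start with an action verb.
    return bool(_ACTION_RE.match(text))
```

Under these assumptions, `keeps_url("Download the PDF")` is true because of the leading verb, while `keeps_url("Contact Us")` is false because the text is short and has no action verb.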
Best Practices
Follow these tips to get the best results when using web pages as knowledge base sources.
Choose the Right Pages
- Use content-rich pages — Pages with clear headings, structured paragraphs, and substantive text produce the best knowledge base entries.
- Prefer informational pages over marketing landing pages — Landing pages often consist mostly of images, short taglines, and CTAs with little extractable text.
- Pages with meaningful links to resources produce richer knowledge — If the page links to whitepapers, case studies, or documentation, those URLs are preserved for reference.
Avoid Common Pitfalls
- Avoid pages that are mostly images or video — The pipeline processes text content only; image-heavy pages yield very little usable knowledge.
- Avoid single-page apps with dynamically loaded content — Content rendered entirely by JavaScript after page load may not be captured.
- Avoid pages behind authentication — The crawler cannot access gated or login-protected content.
Note: If a web page produces very little content after processing, consider uploading a PDF or document version of the same content instead. Documents give you full control over what the AI Assistant learns.
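One way to anticipate a low-yield page before adding it is to count the words present in the page's static HTML, ignoring scripts and styles. The sketch below is a rough pre-check under that assumption; the helper `static_word_count` is hypothetical and is not part of ZipTier, and a low count only suggests (it does not prove) that the page is image-heavy or rendered by JavaScript.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Counts words outside <script>, <style>, and <noscript> elements."""

    HIDDEN = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._hidden = 0  # > 0 while inside a hidden element
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden:
            self.words += len(data.split())


def static_word_count(html: str) -> int:
    """Return the number of whitespace-separated words in the static markup."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words
```

Run it on the raw HTML you get from saving the page source (not the rendered page), and compare the count against whatever minimum you consider useful for a knowledge base entry.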