Web Content Processing

When you provide a web page URL as a knowledge base source, ZipTier automatically processes the page through a multi-pass cleanup pipeline. This ensures your AI Assistant learns from meaningful content — not cookie banners, navigation menus, or boilerplate footers. This guide explains what content is retained, what is removed, and how to get the best results.

Why We Clean Web Content

Web pages contain far more than their main content. Navigation bars, cookie dialogs, footer links, and other structural elements can account for over half the text on a typical page. If ingested as-is, this noise degrades your AI Assistant's accuracy and wastes processing resources.

Improve Response Accuracy

By removing irrelevant content, the AI Assistant focuses on the information that actually matters to your prospects.

Minimize Knowledge Base Noise

Navigation menus, cookie policies, and footer links would clutter your knowledge base and confuse responses.

Reduce Processing Costs

Cleaner content means fewer tokens processed per conversation, keeping your AI Conversations efficient.

Prevent Irrelevant Citations

Without cleanup, the AI might cite cookie policies or reCAPTCHA notices when answering prospect questions.

What We Retain

The cleanup pipeline preserves all content that contributes to a rich, informative knowledge base for your AI Assistant.

Content Type	Examples
Headings and body text	Page titles, section headings, paragraphs — the main content of the page
Actionable links with URLs	"Register now", "Download the whitepaper", "Learn more about pricing"
Inline links within sentences	"Our cloud migration guide covers the full process"
Descriptive link text (4+ words)	"View the complete case study" — long link text always preserves its URL
Code blocks and technical content	Code snippets, API examples, configuration samples
Tables and structured data	Pricing tables, feature comparisons, specification lists
List items with substantive content	Feature lists, step-by-step instructions, bullet-point descriptions

What We Remove

The following elements are stripped during processing because they add noise without contributing meaningful knowledge.

Content Type	Why It's Removed
Navigation menus and mega-menus	Repeated on every page; not part of the page's unique content
Headers and footers	Copyright notices, legal links, and social media icons are boilerplate
Cookie consent dialogs	Legal compliance UI that has no informational value
Images, thumbnails, icons, banners	Visual elements that cannot be interpreted as text knowledge
Breadcrumb navigation	Site structure markers, not meaningful content
Form fields and dropdowns	Country selectors, search boxes, and input elements are interactive UI
Boilerplate text	Privacy notices, reCAPTCHA disclaimers, and standard legal copy
Author bio blocks	Repeated author cards that appear across multiple blog posts
Table of contents / sidebar sections	Navigation aids that duplicate heading structure
Duplicate content blocks	Repeated CTAs, testimonials, or sections that appear more than once
Garbled UI text	Rendered mockups or screenshots that produce nonsensical text when parsed
Orphaned UI labels	Standalone button text such as "Benefits" or "Pricing" with no surrounding context
Generic CTA buttons without URLs	"Get started", "Buy now" — no destination link, no informational value
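
To make the removal rules above more concrete, here is a minimal sketch of structural cleanup using Python's standard `html.parser` module. It is not ZipTier's actual pipeline: the `NOISE_TAGS` set, the `ContentExtractor` class, and the `extract_content` function are hypothetical names chosen for illustration. The sketch assumes noise can be identified by tag name alone and that duplicates are exact text matches. A real pipeline would likely also use position and context.

```python
from html.parser import HTMLParser

# Tags treated as noise in this sketch (an assumption, not ZipTier's actual list).
NOISE_TAGS = {"nav", "header", "footer", "form", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text blocks found outside noise elements, dropping exact repeats."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any noise element
        self._seen = set()    # lowercased text of blocks already kept
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self._skip_depth > 0:
            return
        key = text.lower()
        if key in self._seen:  # duplicate content block
            return
        self._seen.add(key)
        self.blocks.append(text)


def extract_content(html: str) -> list[str]:
    """Return the retained text blocks in document order."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Images fall out naturally here because an `<img>` element carries no text. One limitation of this sketch is that inline tags split a sentence into several blocks, so it should be read as an illustration of the idea rather than a usable extractor.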

Link Handling

Links on web pages are handled selectively. The goal is to preserve URLs that point prospects to useful resources while stripping navigation-style links that add clutter.

How link decisions are made: The cleanup pipeline evaluates each link based on its text, position, and context within the page. Here's how different link types are handled:

Link Type	URL Preserved?	Example
Actionable verb links	Yes	"Register for the webinar", "Download the PDF", "Visit our partner page"
Long link text (4+ words)	Yes	"View the complete migration case study"
Inline links (within a sentence)	Yes	"See our data security overview for details"
Short nav-style links (1–3 words, no action verb)	No	"About", "Blog", "Contact Us"
Image-only links	Yes (bare URL)	A logo linking to a partner page — the URL is kept even though no text exists

Tip on Action Verbs: The following action verbs trigger URL preservation: go to, visit, register, sign up, download, learn more, read, view, explore, check out, get started with, apply, subscribe, join, and watch.
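
The rules in the link table can be read as a small decision function. The sketch below is one possible interpretation, not the pipeline's actual logic. The `keep_url` function and its `inline` and `image_only` parameters are hypothetical. It also assumes that an action verb may appear anywhere in the link text and is matched on word boundaries, which the guide does not specify.

```python
import re

# Action verbs from the tip above.
ACTION_VERBS = (
    "go to", "visit", "register", "sign up", "download", "learn more",
    "read", "view", "explore", "check out", "get started with", "apply",
    "subscribe", "join", "watch",
)

# Word-boundary, case-insensitive matching is an assumption of this sketch.
_VERB_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(verb) for verb in ACTION_VERBS) + r")\b",
    re.IGNORECASE,
)


def keep_url(link_text: str, inline: bool = False, image_only: bool = False) -> bool:
    """Return True if the link's URL would be preserved under the table's rules."""
    if image_only:
        return True  # image-only links keep a bare URL
    if inline:
        return True  # links inside a sentence keep their URL
    text = link_text.strip()
    if len(text.split()) >= 4:
        return True  # long, descriptive link text
    # Short text (1-3 words) keeps its URL only with an action verb.
    return bool(_VERB_RE.search(text))
```

Under these assumptions, `keep_url("Register now")` and `keep_url("View the complete migration case study")` return `True`, while `keep_url("About")` and `keep_url("Contact Us")` return `False`.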

Best Practices

Follow these tips to get the best results when using web pages as knowledge base sources.

Choose the Right Pages

Use content-rich pages — Pages with clear headings, structured paragraphs, and substantive text produce the best knowledge base entries.
Prefer informational pages over marketing landing pages — Landing pages often consist mostly of images, short taglines, and CTAs with little extractable text.
Pages with meaningful links to resources produce richer knowledge — If the page links to whitepapers, case studies, or documentation, those URLs are preserved for reference.

Avoid Common Pitfalls

Avoid pages that are mostly images or video — The pipeline processes text content only; image-heavy pages yield very little usable knowledge.
Avoid single-page apps with dynamically loaded content — Content rendered entirely by JavaScript after page load may not be captured.
Avoid pages behind authentication — The crawler cannot access gated or login-protected content.

Note: If a web page produces very little content after processing, consider uploading a PDF or document version of the same content instead. Documents give you full control over what the AI Assistant learns.
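
Before adding a page as a source, you can roughly estimate how much text its static HTML contains. Pages that are image-heavy or rendered by JavaScript tend to score low. The sketch below is a hypothetical pre-check and not part of ZipTier. The names `static_word_count` and `likely_thin_source` are illustrative, and the default threshold of 50 words is an arbitrary assumption. The function expects HTML you have already saved or fetched yourself.

```python
from html.parser import HTMLParser

# Elements whose text is never shown as page content.
HIDDEN_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextCounter(HTMLParser):
    """Counts words in static HTML, ignoring text inside hidden elements."""

    def __init__(self) -> None:
        super().__init__()
        self._hidden_depth = 0  # > 0 while inside a hidden element
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.word_count += len(data.split())


def static_word_count(html: str) -> int:
    """Return the number of words present without running any JavaScript."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.word_count


def likely_thin_source(html: str, min_words: int = 50) -> bool:
    """Flag pages whose static HTML has fewer than min_words words (threshold is an assumption)."""
    return static_word_count(html) < min_words
```

A low count does not prove that a page will fail, and a high count does not guarantee good results, since this check still includes navigation and footer text that cleanup would remove.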