Web URL Crawling

The Web URL Crawling feature in Antimanual extends your AI's knowledge beyond your local WordPress environment. By ingesting content from external web pages, you can give your AI broader context, including third-party documentation, industry news, or supplementary resources published on the public web.

Table of Contents
- Feature Overview
- Adding a Web URL
- Advanced Extraction Logic
- Managing Indexed URLs
- Frequently Asked Questions

Feature Overview

Available exclusively in the Pro version, Web URL Crawling serves as a bridge between your chatbot and the information available on the internet. Instead of copying text by hand, the system automates the retrieval and processing of web data, so your AI assistant can draw on a more comprehensive data set.

This feature is particularly useful for businesses that maintain documentation across multiple platforms or want to feed their AI niche-specific information from trusted industry sources. Once these URLs are indexed, the AI can reference specific external facts, provide more accurate support, and generate content that reflects that external context.

Adding a Web URL to the Knowledge Base

Adding a new URL takes three steps:

1. Navigate to the Knowledge Base: In the Antimanual dashboard, open the “Website URL” tab.
2. Enter the Target URL: Enter the full, public URL of the web page you want to crawl in the text field. Make sure the URL is accessible without a login.
3. Execution: Click the “Submit” button. The system starts a background task to fetch and process the page.

Once the URL is submitted, the system runs a multi-stage process: it fetches the HTML, parses the page structure, and generates embeddings so the content is searchable for your chatbot.
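
To make the stages above more concrete, here is a minimal sketch in Python of the fetch and parse steps, plus a simple split into chunks that could then be sent to an embedding provider. This is not Antimanual's actual implementation, which is not documented here. Every name in it (NOISE_TAGS, MAIN_TAGS, MainTextExtractor, extract_main_text, chunk_words, fetch_html, the user agent string) is hypothetical, as are the tag lists and the chunk size. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumed tag lists for illustration; the plugin's real rules may differ.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form", "noscript"}
MAIN_TAGS = {"main", "article"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, preferring text inside <main> or <article>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0  # > 0 while inside a noise element
        self.main_depth = 0   # > 0 while inside <main> or <article>
        self.main_parts = []  # text found inside main-content tags
        self.all_parts = []   # fallback: all text outside noise elements

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self.noise_depth > 0:
            return
        self.all_parts.append(text)
        if self.main_depth > 0:
            self.main_parts.append(text)


def extract_main_text(html: str) -> str:
    """Return cleaned text, falling back to all non-noise text if no main tag exists."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.main_parts or parser.all_parts)


def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words, ready for embedding."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a public page; pages that require a login are out of scope."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


sample = """<html><body><nav>Home About Us</nav>
<article><h1>Setup Guide</h1><p>Install the plugin.</p></article>
<footer>Copyright</footer></body></html>"""
print(extract_main_text(sample))  # prints: Setup Guide Install the plugin.
```

In this sketch, a real run would call fetch_html on the submitted URL, pass the result to extract_main_text, and hand each chunk from chunk_words to whichever embedding model is configured. The embedding call is left out because it depends on the provider.
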
Advanced Extraction Logic

One of the primary challenges of web crawling is “noise”: irrelevant data such as navigation menus, sidebars, footers, and advertisements. Antimanual applies filtering logic so that only the core substance of the page is indexed.

- Semantic Identification: The crawler identifies the main article or content tags within the HTML structure.
- Noise Reduction: Elements like headers, cookie banners, and social media widgets are automatically discarded.
- Content Cleaning: The extracted text is stripped of scripts and styling, leaving clean text that is well suited to AI embedding and retrieval.

Because of this cleaning, your chatbot is not confused by “Home,” “About Us,” or other repetitive site navigation text when it answers a query. It works only from the informative content of the page.

Managing Indexed URLs

All processed URLs are listed in a dedicated management table, so you can audit exactly what information is currently influencing your AI’s responses. The table shows the following for each entry:

- URL Source: The original link that was crawled.
- AI Model: The embedding model used to index the data (e.g., one from OpenAI or Google Gemini).
- Actions: A delete option that removes outdated or irrelevant URLs from the Knowledge Base immediately.

Frequently Asked Questions

Can I crawl pages that require a login?
No, the crawler currently supports only publicly accessible URLs. Pages behind a paywall or login screen cannot be indexed with this tool.

How often is the content updated?
The content is indexed once, at the time of submission. If the external webpage changes significantly, delete the old entry and re-submit the URL to refresh the Knowledge Base.

Is there a limit to how many URLs I can add?
While the plugin allows for extensive indexing, your total Knowledge Base capacity is subject to the limits of your specific Pro license and the storage limits of your AI provider’s embedding database.