How Does Crawl Depth Work in the Knowledge Base

Overview

The knowledge base uses web crawling to collect content from websites and convert it into searchable embeddings, which are retrieved when users ask questions. Crawl depth controls how many pages are collected, so understanding it is essential for optimizing your knowledge base.

What is Crawl Depth?

Crawl depth is the number of link levels the system follows, counting the original URL you provide as the first level:

  • Depth 1: Only the specific URL you entered is processed
  • Depth 2: The original URL plus all pages directly linked from it
  • Depth 3: Includes all pages from depth 2, plus all pages linked from those pages
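
The depth levels above can be sketched as a breadth-first traversal. This is only an illustration, not the product's actual crawler: the `crawl` function, the `links` dictionary (a stand-in for fetching and parsing real pages), and the example.com URLs are all hypothetical.

```python
from collections import deque

def crawl(start_url, links, max_depth):
    """Return the URLs reached from start_url within max_depth levels.

    `links` is a hypothetical dict mapping each URL to the URLs it links to;
    a real crawler would download and parse each page instead.
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 1)])  # depth 1 is the starting URL itself
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical site structure used only for this example.
SITE = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a1"],
    "https://example.com/docs/b": [],
}

for depth in (1, 2, 3):
    print(depth, crawl("https://example.com/docs", SITE, depth))
```

With this sample structure, depth 1 yields only the starting URL, depth 2 adds the two pages it links to, and depth 3 also adds the page linked from one of those.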

How Resources and Embeddings Work

Resources vs. Embeddings

  • Resource: A complete document, text, or URL in your knowledge base
  • Embedding: A smaller chunk of a resource (approximately 2000 characters) that has been converted into a vector representation for similarity searching

When you add a URL to your knowledge base:

  1. The URL is saved as a single resource
  2. The crawler visits the URL and any additional linked pages based on the crawl depth
  3. The content from each page is divided into embeddings (chunks)
  4. These embeddings are stored with references to their source resource

Content Processing Details

  • Each page starts a new embedding (chunk)
  • Longer pages are split into multiple embeddings based on character count
  • The system attempts to break text at natural points (periods, paragraph breaks)
  • A single embedding never combines content from different pages
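
As a rough sketch of the chunking rules above, the following splits each page on its own and prefers paragraph breaks, then sentence ends, when a page exceeds the size limit. The function names, the record layout, and the exact break-point priority are assumptions made for illustration; only the approximate 2000-character size and the "natural break" behavior come from this article.

```python
def split_page(text, max_chars=2000):
    """Split one page's text into chunks of at most max_chars characters."""
    chunks = []
    rest = text.strip()
    while len(rest) > max_chars:
        window = rest[:max_chars]
        cut = -1
        # Assumed priority: paragraph break, then sentence end, then any space.
        for sep in ("\n\n", ". ", " "):
            pos = window.rfind(sep)
            if pos > 0:
                cut = pos + len(sep)
                break
        if cut == -1:
            cut = max_chars  # no natural break point found: hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks

def chunk_pages(pages):
    """Chunk every page separately so no chunk mixes content from two pages.

    `pages` is a hypothetical dict of {url: page_text}.
    """
    records = []
    for url, text in pages.items():
        for chunk in split_page(text):
            records.append({"source": url, "text": chunk})
    return records

print(split_page("First sentence. Second sentence. Third.", max_chars=20))
```

Because `chunk_pages` calls `split_page` once per page, every record carries exactly one `source` URL, which mirrors the rule that content from different pages is kept apart.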

Searching the Knowledge Base

When a query is made against your knowledge base:

  1. The query is converted to the same vector format as your embeddings
  2. The system compares this query vector against all embeddings in the knowledge base
  3. Each embedding receives a similarity score that reflects how closely it matches the query
  4. The 5 embeddings with the highest similarity scores are returned as results
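
The search steps above might look like the sketch below. It assumes cosine similarity as the scoring measure, which this article does not specify, and it uses tiny hand-written vectors in place of a real embedding model. The names `cosine`, `top_k`, and the record fields are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, embeddings, k=5):
    """Score every stored embedding against the query and keep the k best."""
    scored = [(cosine(query_vector, e["vector"]), e) for e in embeddings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 2-dimensional vectors; real embeddings have many more dimensions.
store = [
    {"id": 0, "vector": [1.0, 0.0]},
    {"id": 1, "vector": [0.9, 0.1]},
    {"id": 2, "vector": [0.0, 1.0]},
    {"id": 3, "vector": [-1.0, 0.0]},
    {"id": 4, "vector": [0.5, 0.5]},
    {"id": 5, "vector": [0.1, 0.9]},
    {"id": 6, "vector": [1.0, 1.0]},
]

for score, item in top_k([1.0, 0.0], store):
    print(item["id"], round(score, 3))
```

Comparing the query against every stored embedding is the simplest approach and matches the description above; larger stores often use an approximate index instead, but that is outside what this article describes.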

Optimization Tips

  • Setting the right crawl depth: Higher depths gather more information but may include less relevant content and take longer to process
  • URL selection: Choose specific, content-rich starting URLs rather than very general pages
  • Regular updates: Recrawl your resources periodically to ensure information stays current
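
To get a feel for why higher depths take longer, the short calculation below estimates an upper bound on pages crawled. It assumes every page links to the same number of previously unseen pages, which real sites rarely do, so treat the numbers as a worst-case illustration. The function name and the figure of 10 links per page are hypothetical.

```python
def max_pages(links_per_page, depth):
    """Upper bound on pages crawled if every page links to `links_per_page` new pages."""
    # Depth 1 contributes 1 page, depth 2 adds links_per_page more, and so on.
    return sum(links_per_page ** level for level in range(depth))

for depth in (1, 2, 3):
    print(depth, max_pages(10, depth))
```

Under that assumption, each extra level multiplies the number of newly added pages by the number of links per page, which is why a small increase in depth can greatly increase processing time.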

By understanding how crawl depth works, you can more effectively build and maintain a knowledge base that accurately represents your content.
