How Does Crawl Depth Work In The Knowledge Base

Overview

The knowledge base system crawls websites to collect their content and converts that content into searchable embeddings, which are retrieved when users ask questions. Crawl depth controls how many pages are collected, so understanding it helps you keep your knowledge base relevant and quick to process.

What is Crawl Depth?

Crawl depth sets how many levels of pages the system processes, counting the original URL you provide as the first level. For example:

Depth 1: Only the specific URL you entered is processed
Depth 2: The original URL plus all pages directly linked from it
Depth 3: Includes all pages from depth 2, plus all pages linked from those pages
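
The depth levels above can be sketched as a breadth-first traversal with a depth limit. This is a minimal illustration, not the system's actual crawler: the `LINKS` dictionary is a hypothetical stand-in for real pages, and the `crawl` function and example.com URLs are invented for the example.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> URLs it links to.
LINKS = {
    "https://example.com/docs": [
        "https://example.com/docs/setup",
        "https://example.com/docs/faq",
    ],
    "https://example.com/docs/setup": [
        "https://example.com/docs/install",
        "https://example.com/docs",  # link back to the start page
    ],
    "https://example.com/docs/faq": [],
    "https://example.com/docs/install": [],
}

def crawl(start_url, max_depth, links=LINKS):
    """Return the URLs visited, in breadth-first order, for a given crawl depth."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 1)])  # the starting URL counts as depth 1
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are processed, but their links are not followed
        for linked in links.get(url, []):
            if linked not in seen:  # skip pages already scheduled
                seen.add(linked)
                visited.append(linked)
                queue.append((linked, depth + 1))
    return visited
```

With this sample graph, `crawl("https://example.com/docs", 1)` returns only the starting URL, depth 2 adds the setup and FAQ pages, and depth 3 also adds the install page.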

How Resources and Embeddings Work

Resources vs. Embeddings

Resource: A complete document, text, or URL in your knowledge base
Embedding: A smaller chunk of a resource (approximately 2000 characters) that has been converted into a vector representation for similarity searching

When you add a URL to your knowledge base:

The URL is saved as a single resource
The crawler visits the URL and any additional linked pages based on the crawl depth
The content from each page is divided into embeddings (chunks)
These embeddings are stored with references to their source resource
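
One way to picture the relationship between a resource and its embeddings is the sketch below. The class names, field names, and the `add_url` helper are all hypothetical and chosen only for illustration; the actual storage schema is not described here.

```python
from dataclasses import dataclass, field

@dataclass
class Embedding:
    resource_id: int   # reference back to the source resource
    page_url: str      # crawled page the chunk came from
    text: str          # chunk text (roughly 2000 characters)
    vector: list = field(default_factory=list)  # would be filled in by an embedding model

@dataclass
class Resource:
    id: int
    url: str           # the single URL the user added
    crawl_depth: int
    embeddings: list = field(default_factory=list)

def add_url(resource_id, url, crawl_depth, pages):
    """Build one resource from crawled pages.

    pages is a hypothetical input: a mapping of page URL -> list of chunk texts.
    """
    resource = Resource(id=resource_id, url=url, crawl_depth=crawl_depth)
    for page_url, chunks in pages.items():
        for chunk in chunks:
            resource.embeddings.append(Embedding(resource_id, page_url, chunk))
    return resource
```

However many pages the crawl reaches, the sketch keeps one `Resource` per added URL, and every `Embedding` carries the ID of that resource.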

Content Processing Details

Each page starts a new chunk (embedding)
Longer pages may be split into multiple embeddings based on character count
The system attempts to break text at natural points (periods, paragraph breaks)
Content from different pages is never combined in a single embedding
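
The splitting rules above could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, and the order of preferred break points (paragraph break, then sentence end, then any space) is one plausible reading of "natural points", not the system's documented behavior.

```python
MAX_CHARS = 2000  # approximate chunk size mentioned earlier

def split_page(text, max_chars=MAX_CHARS):
    """Split one page's text into chunks of at most max_chars characters."""
    chunks = []
    rest = text.strip()
    while len(rest) > max_chars:
        window = rest[:max_chars]
        # Prefer the last paragraph break in the window, then the last sentence end.
        cut = window.rfind("\n\n")
        if cut <= 0:
            cut = window.rfind(". ")
            if cut > 0:
                cut += 1  # keep the period with its sentence
        if cut <= 0:
            cut = window.rfind(" ")  # fall back to any space
        if cut <= 0:
            cut = max_chars  # no natural break found: hard cut
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks

def split_pages(pages):
    """pages: mapping of URL -> page text. Each page is split on its own,
    so no chunk ever contains text from two pages."""
    return {url: split_page(text) for url, text in pages.items()}
```

For example, `split_page("Alpha beta. Gamma delta. Epsilon.", max_chars=15)` yields one chunk per sentence, because each cut lands on the last sentence end that fits within the limit.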

Searching the Knowledge Base

When a query is made against your knowledge base:

The query is converted to the same vector format as your embeddings
The system compares this query vector against all embeddings in the knowledge base
The top 5 most relevant embeddings are returned as results
These results come from the chunks with the highest similarity scores
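
The search steps above amount to a nearest-neighbor lookup. The sketch below assumes cosine similarity as the scoring function, which the article does not specify, and uses tiny hand-written vectors in place of real model output. The `search` function and its input format are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def search(query_vector, embeddings, top_k=5):
    """embeddings: list of (chunk_text, vector) pairs.

    Returns the top_k (score, chunk_text) pairs, highest score first.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in embeddings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because only the five best-scoring chunks are returned, adding many loosely related pages increases the number of chunks that compete for those five slots.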

Optimization Tips

Setting the right crawl depth: Higher depths gather more information but may include less relevant content and take longer to process
URL selection: Choose specific, content-rich starting URLs rather than very general pages
Regular updates: Recrawl your resources periodically to ensure information stays current
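
To get a feel for how quickly a crawl can grow with depth, here is a rough upper-bound estimate. It assumes, purely for illustration, that every page links to the same number of new, unique pages; real sites have overlapping links, so actual counts are usually lower. The `max_pages` function is hypothetical.

```python
def max_pages(crawl_depth, links_per_page):
    """Upper bound on pages crawled if every page links to
    links_per_page pages that have not been seen before."""
    # Depth 1 contributes 1 page, depth 2 adds links_per_page pages, and so on.
    return sum(links_per_page ** level for level in range(crawl_depth))
```

With 10 links per page, this gives at most 1 page at depth 1, 11 pages at depth 2, and 111 pages at depth 3.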

By understanding how crawl depth works, you can more effectively build and maintain a knowledge base that accurately represents your content.
