Domains

Domains allow an AI agent to consume data from a website. Each domain can be associated with one or more AI agents, and a domain can be either a top-level domain or a subpath of one. Each scraped webpage counts towards the total number of documents available under your plan. Scraped pages are available in the Knowledge Hub inside the Domain pages folder.

Domains can be created or deleted from the following menu: Settings > Domains

More about Domains

The required details for creating a domain are:

  • URL (The domain to scrape content from)
    • The URL must be a valid domain name without the https:// or http:// prefix.
    • When the URL is a subpath, scraping is limited to that subpath and its children only.

`Domains settings.`

By defining a topic, you give the associated AI Agents knowledge of the content scraped from the domain.

Sitemap location

By default, the sitemap is expected at the provided URL followed by /sitemap.xml. To force a different location, enter the full path in the Sitemap path field.
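As a rough illustration of the default-location rule, the sketch below builds the sitemap URL from a domain entered without a scheme, as the URL field above requires (the function name is illustrative, not part of the product):

```python
def default_sitemap_url(domain: str) -> str:
    """Build the default sitemap location for a configured domain.

    `domain` is entered without a scheme (e.g. "docs.example.com/guides"),
    mirroring the URL field described above.
    """
    # Prepend https:// since the Domains form stores the URL without a scheme,
    # then append /sitemap.xml, the default lookup location.
    base = "https://" + domain.rstrip("/")
    return base + "/sitemap.xml"
```

For a subpath domain such as `example.com/blog`, this yields `https://example.com/blog/sitemap.xml`, matching the subpath-scoped scraping behaviour described earlier.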

danger

For security reasons, domain scraping is limited to a publicly exposed sitemap.xml file. If the domain does not have one, it will not be scraped and its status will change to No sitemap.

Domain scraping

The scraping process starts once the domain is verified, and the status changes to syncing. Depending on the size of the domain, scraping may take anywhere from a few minutes to a few hours. ToothFairyAI scrapes the domain and its children recursively to build a site tree of the domain - this keeps incremental scraping operations efficient and avoids re-scraping the entire domain on every sync.

Once the scraping process is complete, the status will change to active and the domain will be available for use by the associated AI Agents.

Images extraction

To extract images from the domain, enable the Extract images option. The Images retrieval instruction field will then appear, where you must provide instructions telling the AI Agent which images are relevant to the domain. This option is disabled by default.

Sync cycle

The sync cycle is the frequency at which the domain will be scraped for new content. The sync cycle can be set to:

  • Manual (default)
  • 24 hours (daily)
  • 72 hours (3 days)
  • Every week (7 days)

By default, the sync cycle is set to manual and the user will need to manually trigger the sync process by clicking on the Sync button.

info

Any change to the configuration of the domain will take effect only after the next sync cycle.

Domain verification (only needed for domains with over 500 pages)

To sync websites with over 500 pages, administrators must add a DNS record to a domain to verify ownership; this is a security measure to ensure that only the domain owner can scrape content from the domain. Once the ownership validation process is complete, the website scraping will begin for the full domain.

`Domains verification.`

By adding the records provided by ToothFairyAI to the associated DNS, ToothFairyAI will verify the domain by checking the registered records.

During the verification process, the status progresses through:

  • verifying
  • approved
  • syncing
  • completed

If verification fails, the status changes to failed, or to noDomain if the DNS record is not found.

How We Scrape Data

ToothFairyAI uses a multi-layered approach to scrape website content, ensuring the best possible results for each page:

Scraping Methods (in order of priority)

  1. TF native - Our primary scraping engine, optimised for fast and efficient content extraction, converting data natively to Markdown
  2. HTTP Requests - Standard HTTP requests using Python's requests library for simple HTML pages
  3. Headless Chrome (Pyppeteer) - A fallback method for JavaScript-heavy websites that require rendering

Each page is attempted with a 15-second timeout. If one method fails, the next method is automatically tried.
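The fallback chain above can be sketched as follows. This is an illustrative outline, not the actual engine: the function names are invented, and stdlib urllib stands in for the requests-based tier to keep the sketch dependency-free.

```python
import urllib.error
import urllib.request

PAGE_TIMEOUT = 15  # seconds, matching the per-page limit described above
USER_AGENT = "TF-AI-Crawler/1.0 (+https://toothfairyai.com/crawler)"

def fetch_plain_http(url):
    """Second-tier fetch: a plain HTTP GET, enough for static HTML pages."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=PAGE_TIMEOUT) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError):
        return None

def scrape_page(url, methods):
    """Try each scraping method in priority order; the first success wins."""
    for method in methods:
        html = method(url)
        if html is not None:
            return html
    return None  # all tiers failed; the page is left unscraped
```

In practice, `methods` would hold the three tiers in priority order (TF native, plain HTTP, headless Chrome), each wrapped to return `None` on failure or timeout.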

Content Processing

  • HTML is converted to Markdown format for optimal AI consumption
  • Navigation, forms, scripts, styles, headers, and footers are automatically removed
  • Sidebars, ads, and other non-content elements are filtered out
  • Only text content is extracted (images can be optionally enabled)
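A minimal sketch of the element-stripping step, using Python's stdlib `html.parser` (the real pipeline also converts the cleaned result to Markdown; the class and tag list here are illustrative assumptions):

```python
from html.parser import HTMLParser

# Non-content elements dropped during cleanup, per the list above.
REMOVED = {"script", "style", "nav", "header", "footer", "form", "aside"}

class ContentExtractor(HTMLParser):
    """Collect text while skipping non-content elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a removed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Feeding it a page with a `<nav>` menu and inline `<script>` leaves only the main headings and body text.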

Crawler Identification

Our crawler identifies itself with the User-Agent:

`TF-AI-Crawler/1.0 (+https://toothfairyai.com/crawler)`

Website administrators can allow or block our crawler via robots.txt. See our crawler policy for details.
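For example, a site could explicitly allow the crawler in its robots.txt like this (the commented-out variant shows blocking a hypothetical private section):

```
# Allow ToothFairyAI's crawler everywhere
User-agent: TF-AI-Crawler
Allow: /

# ...or block it from a private section:
# User-agent: TF-AI-Crawler
# Disallow: /internal/
```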

Limitations

| Limitation | Details |
| --- | --- |
| Sitemap Required | All domains must have a publicly accessible sitemap.xml file |
| Page Limit (Unverified) | Domains without ownership verification are limited to 500 pages |
| No Authentication | Pages behind login walls or password protection cannot be scraped |
| Per-Page Timeout | Each page has a 15-second timeout limit |
| Robots.txt | We respect robots.txt directives; blocked pages will not be scraped |
| Rate Limiting | We crawl at a respectful rate to avoid overloading servers |

tip

For domains with over 500 pages, complete the domain verification process to unlock unlimited scraping.

Common Scraping Issues

If a domain or specific pages fail to scrape correctly, check the following common causes:

Sitemap Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| Status: noSitemap | No sitemap.xml found at the default location | Provide a custom sitemap URL in the Sitemap path field |
| Empty sitemap | Sitemap exists but contains no URLs | Check that your sitemap is properly formatted XML with valid `<loc>` entries |
| Invalid URLs | Sitemap contains malformed or broken URLs | Validate your sitemap using online sitemap validators |
| Sitemap not indexed | Pages exist but aren't in the sitemap | Ensure all important pages are included in your sitemap |
| Dynamic sitemap | Sitemap requires JavaScript to load | Generate a static sitemap.xml file |

Page-Level Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| JavaScript Required | Page content loads via JavaScript after page load | No action needed - our headless Chrome fallback handles most JS content |
| CAPTCHA / Anti-bot | Page has security measures blocking automated access | Contact support with the specific URL for manual review |
| Geo-blocking | Page is restricted to specific countries | Whitelist our crawler IP ranges (contact support for details) |
| Slow Server | Page takes longer than 15 seconds to load | The page may time out; consider optimising server response time |
| Authentication Required | Page is behind a login | These pages cannot be scraped; remove them from the sitemap or make them public |
| Non-HTML Content | URL points to a PDF, image, or other file | These are skipped; only HTML pages are scraped |
| Server Errors (4xx/5xx) | Page returns an error status code | Fix broken links and remove erroring pages from the sitemap |

Domain-Level Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| All pages fail | Domain blocks our crawler User-Agent | Add `User-agent: TF-AI-Crawler` with `Allow: /` to robots.txt |
| Partial scraping | Some pages blocked by robots.txt | Review your robots.txt file for restrictions |
| Timeout on sync | Domain too large, sync taking too long | Contact support to increase timeout thresholds |

Checking Your Sitemap

To verify your sitemap is working correctly:

  1. Visit https://yourdomain.com/sitemap.xml in a browser
  2. Ensure it's valid XML (should open without errors)
  3. Check that URLs are properly formatted with full paths (e.g., https://yourdomain.com/page)
  4. Verify the sitemap is accessible without authentication
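Steps 2 and 3 above can also be checked programmatically. This sketch (function name is illustrative) parses the sitemap XML with the stdlib and rejects entries that are not full URLs:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol for <urlset>/<url>/<loc> elements.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the URLs listed in a sitemap, validating them along the way.

    Raises ET.ParseError on invalid XML (step 2: "should open without
    errors") and ValueError on entries that are not full paths (step 3).
    """
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    bad = [u for u in urls if not u.startswith(("http://", "https://"))]
    if bad:
        raise ValueError(f"relative or malformed URLs in sitemap: {bad}")
    return urls
```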
info

If you continue to experience scraping issues after checking the above, contact our support team at support@toothfairyai.com with your domain details.