Domains
Domains allow an AI agent to consume content from a website. Each domain can be associated with one or more AI agents, and a domain can be either a top-level domain or a subpath of one.
Each webpage scraped counts towards the total number of documents available under each plan.
The pages scraped will be available under the Knowledge Hub inside the Domain pages folder.
Menu location
Domains can be created or deleted from the following menu:
Settings > Domains
More about Domains
The required details for creating a domain are:
- URL (the domain to scrape content from)
  - The URL must be a valid domain name without the `https://` or `http://` prefix.
  - When the URL is a subpath, the scraping process will be limited to that subpath and its children only.

Once the domain is defined, the associated AI Agents will have knowledge of the content scraped from it.
Sitemap location
By default, ToothFairyAI looks for the sitemap at `<your URL>/sitemap.xml`. However, you can force the location of the sitemap by entering its full path in the Sitemap path field.
For security reasons, domain scraping is limited to a publicly exposed sitemap.xml file. If the domain does not have a sitemap.xml file, the domain will not be scraped and the status will change to noSitemap.
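For reference, a minimal valid sitemap follows the standard sitemaps.org schema (the URLs below are placeholders for your own pages):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/</loc>
  </url>
  <url>
    <loc>https://yourdomain.com/docs/getting-started</loc>
  </url>
</urlset>
```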
Domain scraping
The scraping process will start once the domain is verified and the status will change to syncing.
Depending on the size of the domain, the scraping process may take a few minutes to a few hours to complete.
ToothFairyAI will scrape the domain and its children recursively to create a site tree of the domain; this ensures that incremental scraping operations are efficient and do not require re-scraping the entire domain each time.
Once the scraping process is complete, the status will change to active and the domain will be available for use by the associated AI Agents.
Images extraction
If the user wants to extract images from the domain, the Extract images option should be enabled.
Upon enabling this option, the Images retrieval instruction field will appear and the user will be required to provide the instructions for the AI Agent to extract the images that are relevant to the domain.
By default this option is disabled.
Sync cycle
The sync cycle is the frequency at which the domain will be scraped for new content. The sync cycle can be set to:
- Manual (default)
- 24 hours (daily)
- 72 hours (3 days)
- Every week (7 days)
By default, the sync cycle is set to manual and the user will need to manually trigger the sync process by clicking on the Sync button.
Any change to the configuration of the domain will take effect only after the next sync cycle.
Domain verification (only needed for domains with over 500 pages)
To sync websites with over 500 pages, administrators must add a DNS record to the domain to verify ownership. This is a security measure ensuring that only the domain owner can scrape content from the domain. Once the ownership validation process is complete, website scraping will begin for the full domain.

Add the records provided by ToothFairyAI to the associated DNS configuration; ToothFairyAI will then verify the domain by checking for the registered records.
During the verification process, the status will progress through:
- verifying
- approved
- syncing
- completed
If verification fails, the status will change to failed, or to noDomain if the DNS record is not found.
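As an illustration only, a DNS verification record typically looks like the following; the exact record type, host, and value are provided by ToothFairyAI during setup, and the values below are hypothetical:

```
Type:  TXT
Host:  _tf-verification.yourdomain.com
Value: tf-verify=abc123...
```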
How We Scrape Data
ToothFairyAI uses a multi-layered approach to scrape website content, ensuring the best possible results for each page:
Scraping Methods (in order of priority)
- TF native - Our primary scraping engine, optimised for fast and efficient content extraction natively converting data to markdown
- HTTP Requests - Standard HTTP requests using Python's `requests` library for simple HTML pages
- Headless Chrome (Pyppeteer) - A fallback method for JavaScript-heavy websites that require rendering
Each page is attempted with a 15-second timeout. If one method fails, the next method is automatically tried.
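The fallback order above can be sketched as follows. This is an illustrative sketch only, not ToothFairyAI's actual implementation; the method functions are stand-ins:

```python
# Illustrative sketch: try each scraping method in priority order with a
# shared per-page timeout, falling back to the next method on failure.
def scrape(url, methods, timeout=15):
    for name, method in methods:
        try:
            return name, method(url, timeout=timeout)
        except Exception:
            continue  # this method failed; try the next one
    return None, None  # every method failed for this page

# Stand-in methods for demonstration; real ones would fetch the page.
def tf_native(url, timeout):
    raise RuntimeError("page needs JavaScript rendering")

def http_request(url, timeout):
    return "<html><body>hello</body></html>"

used, content = scrape("https://example.com/page",
                       [("tf-native", tf_native), ("http", http_request)])
print(used)  # → http
```

Because the first stand-in raises, the sketch falls through to the second method, mirroring how a JavaScript-heavy page would skip past the simpler scrapers.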
Content Processing
- HTML is converted to Markdown format for optimal AI consumption
- Navigation, forms, scripts, styles, headers, and footers are automatically removed
- Sidebars, ads, and other non-content elements are filtered out
- Only text content is extracted (images can be optionally enabled)
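The filtering described above can be illustrated with a minimal sketch (not the actual pipeline) that drops non-content elements and keeps only visible text, using Python's standard-library HTML parser:

```python
# Illustrative sketch: strip navigation, scripts, styles, headers,
# footers, and forms from HTML, keeping only the remaining text content.
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "header", "footer", "form", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

p = TextExtractor()
p.feed("<nav>Menu</nav><p>Article text</p><script>x()</script>")
print(" ".join(p.chunks))  # → Article text
```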
Crawler Identification
Our crawler identifies itself with the User-Agent:
TF-AI-Crawler/1.0 (+https://toothfairyai.com/crawler)
Website administrators can allow or block our crawler via robots.txt. See our crawler policy for details.
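For example, to explicitly allow the crawler site-wide (or to block it), a site's robots.txt could include:

```
# Allow ToothFairyAI's crawler on the whole site
User-agent: TF-AI-Crawler
Allow: /

# Or, to block it entirely, use instead:
# User-agent: TF-AI-Crawler
# Disallow: /
```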
Limitations
| Limitation | Details |
|---|---|
| Sitemap Required | All domains must have a publicly accessible sitemap.xml file |
| Page Limit (Unverified) | Domains without ownership verification are limited to 500 pages |
| No Authentication | Pages behind login walls or password protection cannot be scraped |
| Per-Page Timeout | Each page has a 15-second timeout limit |
| Robots.txt | We respect robots.txt directives; blocked pages will not be scraped |
| Rate Limiting | We crawl at a respectful rate to avoid overloading servers |
For domains with over 500 pages, complete the domain verification process to unlock unlimited scraping.
Common Scraping Issues
If a domain or specific pages fail to scrape correctly, check the following common causes:
Sitemap Issues
| Issue | Cause | Solution |
|---|---|---|
| Status: noSitemap | No sitemap.xml found at the default location | Provide a custom sitemap URL in the Sitemap path field |
| Empty sitemap | Sitemap exists but contains no URLs | Check that your sitemap is properly formatted XML with valid <loc> entries |
| Invalid URLs | Sitemap contains malformed or broken URLs | Validate your sitemap using online sitemap validators |
| Sitemap not indexed | Pages exist but aren't in the sitemap | Ensure all important pages are included in your sitemap |
| Dynamic sitemap | Sitemap requires JavaScript to load | Generate a static sitemap.xml file |
Page-Level Issues
| Issue | Cause | Solution |
|---|---|---|
| JavaScript Required | Page content loads via JavaScript after page load | No action needed - our headless Chrome fallback handles most JS content |
| CAPTCHA / Anti-bot | Page has security measures blocking automated access | Contact support with the specific URL for manual review |
| Geo-blocking | Page is restricted to specific countries | Whitelist our crawler IP ranges (contact support for details) |
| Slow Server | Page takes longer than 15 seconds to load | The page may timeout; consider optimizing server response time |
| Authentication Required | Page is behind a login | These pages cannot be scraped; remove from sitemap or make public |
| Non-HTML Content | URL points to PDF, image, or other file | These are skipped; only HTML pages are scraped |
| Server Errors (4xx/5xx) | Page returns error status codes | Fix broken links and remove erroring pages from sitemap |
Domain-Level Issues
| Issue | Cause | Solution |
|---|---|---|
| All pages fail | Domain blocks our crawler User-Agent | Add User-agent: TF-AI-Crawler with Allow: / to robots.txt |
| Partial scraping | Some pages blocked by robots.txt | Review your robots.txt file for restrictions |
| Timeout on sync | Domain too large, sync taking too long | Contact support to increase timeout thresholds |
Checking Your Sitemap
To verify your sitemap is working correctly:
- Visit `https://yourdomain.com/sitemap.xml` in a browser
- Ensure it's valid XML (it should open without errors)
- Check that URLs are properly formatted with full paths (e.g., `https://yourdomain.com/page`)
- Verify the sitemap is accessible without authentication
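The checks above can be partly automated. A minimal sketch (assuming the sitemap has already been downloaded locally, e.g. with curl) that parses the XML and lists the declared URLs:

```python
# Parse a sitemap and list its <loc> URLs. ET.fromstring raises a
# ParseError when the XML is malformed, which covers the validity check.
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/page</loc></url>
</urlset>"""

for url in sitemap_urls(sample):
    # URLs must be full, absolute paths as noted above
    assert url.startswith("https://") or url.startswith("http://")
print(sitemap_urls(sample))  # → ['https://yourdomain.com/page']
```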
If you continue to experience scraping issues after checking the above, contact our support team at support@toothfairyai.com with your domain details.