
URLs & Feeds

Keep your knowledge base up to date with web content. Add individual URLs for one-time indexing or RSS/Atom feeds for automatic updates.

Add a URL

Index content from a web page:
POST /personalities/:personality_id/knowledge/urls
{
  "url": "https://docs.example.com/getting-started",
  "metadata": {
    "category": "documentation"
  }
}
Response:
{
  "url_id": "url_abc123",
  "url": "https://docs.example.com/getting-started",
  "metadata": {
    "category": "documentation"
  },
  "processing": {
    "status": "processing"
  },
  "created_at": "2024-01-15T10:30:00Z"
}
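As a minimal client-side sketch of the call above, the request can be assembled with Python's standard library. The base URL, the bearer-token header, and the `pers_123` ID are assumptions made for illustration; only the path and body fields come from this page. The request is built but not sent, so the snippet runs offline.

```python
import json
import urllib.request

# Assumption (not from the docs): the API base URL.
BASE_URL = "https://api.example.com/v1"


def build_add_url_request(personality_id, url, metadata=None, token="YOUR_API_KEY"):
    """Build (but do not send) the POST request that indexes a single URL."""
    body = {"url": url}
    if metadata:
        body["metadata"] = metadata
    return urllib.request.Request(
        f"{BASE_URL}/personalities/{personality_id}/knowledge/urls",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
        },
        method="POST",
    )


req = build_add_url_request(
    "pers_123",  # hypothetical personality ID
    "https://docs.example.com/getting-started",
    metadata={"category": "documentation"},
)

# To actually send it:
# with urllib.request.urlopen(req) as resp:
#     created = json.load(resp)  # should carry "url_id" and processing.status
```

The returned `url_id` is what the refresh and delete endpoints later in this page expect.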

Crawl a Website

Index multiple pages from a website:
POST /personalities/:personality_id/knowledge/urls/crawl
{
  "url": "https://docs.example.com",
  "options": {
    "max_pages": 100,
    "max_depth": 3,
    "include_patterns": ["/docs/*", "/guides/*"],
    "exclude_patterns": ["/blog/*", "/changelog/*"]
  },
  "metadata": {
    "source": "documentation-site"
  }
}
Response:
{
  "crawl_id": "crawl_abc123",
  "url": "https://docs.example.com",
  "status": "crawling",
  "progress": {
    "pages_found": 0,
    "pages_indexed": 0
  },
  "created_at": "2024-01-15T10:30:00Z"
}

Crawl Options

Option            Description                      Default
max_pages         Maximum pages to index           100
max_depth         How many links deep to follow    3
include_patterns  URL patterns to include (glob)   ["*"]
exclude_patterns  URL patterns to skip (glob)      []
respect_robots    Honor robots.txt                 true
wait_time_ms      Delay between requests (ms)      1000
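The sketch below mirrors the defaults from the table in a small dataclass and shows one plausible reading of how include and exclude globs combine. The matching rule (a path must match at least one include pattern and no exclude pattern, tested with `fnmatchcase`) is an assumption; the service may match differently, for example against full URLs.

```python
from dataclasses import asdict, dataclass, field
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class CrawlOptions:
    """Crawl options, pre-filled with the documented defaults."""
    max_pages: int = 100
    max_depth: int = 3
    include_patterns: list = field(default_factory=lambda: ["*"])
    exclude_patterns: list = field(default_factory=list)
    respect_robots: bool = True
    wait_time_ms: int = 1000


def would_crawl(url, options):
    """Assumed semantics: included by some pattern and excluded by none."""
    path = urlparse(url).path or "/"
    included = any(fnmatchcase(path, p) for p in options.include_patterns)
    excluded = any(fnmatchcase(path, p) for p in options.exclude_patterns)
    return included and not excluded


opts = CrawlOptions(
    include_patterns=["/docs/*", "/guides/*"],
    exclude_patterns=["/blog/*", "/changelog/*"],
)

# Request body for POST .../knowledge/urls/crawl
payload = {"url": "https://docs.example.com", "options": asdict(opts)}
```

Checking a handful of known URLs with `would_crawl` before starting a large crawl is a cheap way to catch a pattern that is broader or narrower than intended.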

Check Crawl Status

GET /personalities/:personality_id/knowledge/urls/crawl/:crawl_id
{
  "crawl_id": "crawl_abc123",
  "url": "https://docs.example.com",
  "status": "completed",
  "progress": {
    "pages_found": 87,
    "pages_indexed": 85,
    "pages_skipped": 2,
    "pages_failed": 0
  },
  "urls_created": [
    {"url_id": "url_abc123", "url": "https://docs.example.com/"},
    {"url_id": "url_def456", "url": "https://docs.example.com/docs/intro"}
  ],
  "completed_at": "2024-01-15T10:45:00Z"
}
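Because a crawl runs asynchronously, a client will usually poll this endpoint until the status settles. The loop below is one way to do that. It takes any callable that returns the parsed status JSON, so the HTTP layer stays out of the sketch. The `failed` status name, the poll interval, and the timeout are assumptions; this page only shows `crawling` and `completed`.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}  # "failed" is an assumed status name


def wait_for_crawl(fetch_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll until the crawl reaches a terminal status or the timeout elapses.

    fetch_status is a zero-argument callable returning the parsed JSON of
    GET .../knowledge/urls/crawl/:crawl_id.
    """
    waited = 0.0
    while True:
        crawl = fetch_status()
        if crawl["status"] in TERMINAL_STATUSES:
            return crawl
        if waited >= timeout_seconds:
            raise TimeoutError(f"crawl still {crawl['status']!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration: canned responses stand in for real HTTP calls.
responses = iter([
    {"status": "crawling", "progress": {"pages_found": 12, "pages_indexed": 3}},
    {"status": "completed", "progress": {"pages_found": 87, "pages_indexed": 85,
                                         "pages_skipped": 2, "pages_failed": 0}},
])
result = wait_for_crawl(lambda: next(responses), sleep=lambda seconds: None)

progress = result["progress"]
accounted = progress["pages_indexed"] + progress["pages_skipped"] + progress["pages_failed"]
```

In the sample response, indexed, skipped, and failed pages add up to `pages_found`, which makes a handy sanity check once a crawl completes.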

Add an RSS/Atom Feed

Subscribe to a feed for automatic updates:
POST /personalities/:personality_id/knowledge/feeds
{
  "url": "https://blog.example.com/feed.xml",
  "options": {
    "check_interval": "1h",
    "max_items": 50
  },
  "metadata": {
    "category": "blog"
  }
}
Response:
{
  "feed_id": "feed_abc123",
  "url": "https://blog.example.com/feed.xml",
  "title": "Example Blog",
  "options": {
    "check_interval": "1h",
    "max_items": 50
  },
  "stats": {
    "items_indexed": 0,
    "last_checked": null
  },
  "created_at": "2024-01-15T10:30:00Z"
}

Feed Options

Option              Description                      Default
check_interval      How often to check for updates   "1h"
max_items           Maximum items to keep indexed    100
index_full_content  Fetch and index full article     true
Supported intervals: 15m, 30m, 1h, 6h, 12h, 24h
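Since only a fixed set of intervals is accepted, it can help to validate the value client-side before subscribing. The helper below is an illustrative sketch: the function name and the choice to raise `ValueError` are assumptions, while the interval list and defaults are taken from this section.

```python
# Supported check intervals, mapped to their length in seconds.
SUPPORTED_INTERVALS = {
    "15m": 900,
    "30m": 1800,
    "1h": 3600,
    "6h": 21600,
    "12h": 43200,
    "24h": 86400,
}


def build_feed_payload(url, check_interval="1h", max_items=100,
                       index_full_content=True, metadata=None):
    """Build the body for POST .../knowledge/feeds, rejecting unknown intervals."""
    if check_interval not in SUPPORTED_INTERVALS:
        allowed = ", ".join(SUPPORTED_INTERVALS)
        raise ValueError(f"check_interval must be one of: {allowed}")
    body = {
        "url": url,
        "options": {
            "check_interval": check_interval,
            "max_items": max_items,
            "index_full_content": index_full_content,
        },
    }
    if metadata:
        body["metadata"] = metadata
    return body


feed_payload = build_feed_payload(
    "https://blog.example.com/feed.xml",
    check_interval="1h",
    max_items=50,
    metadata={"category": "blog"},
)
```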

List URLs

GET /personalities/:personality_id/knowledge/urls
{
  "urls": [
    {
      "url_id": "url_abc123",
      "url": "https://docs.example.com/getting-started",
      "processing": {
        "status": "completed",
        "chunks_created": 15,
        "last_indexed": "2024-01-15T10:32:00Z"
      },
      "metadata": {
        "category": "documentation"
      }
    }
  ],
  "pagination": {
    "next_cursor": "eyJ...",
    "has_more": true
  }
}
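The `pagination` object suggests cursor-based paging. A generator like the following can walk every page. How the cursor is passed back to the API (for example as a `cursor` query parameter) is not documented here, so the sketch delegates that to a caller-supplied `fetch_page` function.

```python
def iter_urls(fetch_page):
    """Yield every URL record, following next_cursor while has_more is true.

    fetch_page(cursor) must return one parsed list response; cursor is None
    for the first page. How the cursor is sent to the API is an assumption
    left to the caller.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["urls"]
        pagination = page.get("pagination", {})
        if not pagination.get("has_more"):
            break
        cursor = pagination["next_cursor"]


# Offline demonstration with two canned pages keyed by cursor.
pages = {
    None: {
        "urls": [{"url_id": "url_abc123"}],
        "pagination": {"next_cursor": "c2", "has_more": True},
    },
    "c2": {
        "urls": [{"url_id": "url_def456"}],
        "pagination": {"next_cursor": None, "has_more": False},
    },
}
all_ids = [record["url_id"] for record in iter_urls(pages.get)]
```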

List Feeds

GET /personalities/:personality_id/knowledge/feeds
{
  "feeds": [
    {
      "feed_id": "feed_abc123",
      "url": "https://blog.example.com/feed.xml",
      "title": "Example Blog",
      "stats": {
        "items_indexed": 47,
        "last_checked": "2024-01-20T14:00:00Z",
        "next_check": "2024-01-20T15:00:00Z"
      }
    }
  ]
}
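The `stats` timestamps make it possible to spot feeds whose scheduled check has slipped. The helper below is an illustrative sketch that treats a feed as overdue when its `next_check` lies in the past; both that definition and the function names are assumptions.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overdue_feeds(feeds, now):
    """Return the feed_ids whose next_check is earlier than now."""
    overdue = []
    for feed in feeds:
        next_check = feed["stats"].get("next_check")
        if next_check and parse_ts(next_check) < now:
            overdue.append(feed["feed_id"])
    return overdue


feeds = [
    {
        "feed_id": "feed_abc123",
        "stats": {
            "items_indexed": 47,
            "last_checked": "2024-01-20T14:00:00Z",
            "next_check": "2024-01-20T15:00:00Z",
        },
    }
]

before = datetime(2024, 1, 20, 14, 30, tzinfo=timezone.utc)
after = datetime(2024, 1, 20, 15, 30, tzinfo=timezone.utc)
```

With the sample data, `overdue_feeds(feeds, before)` is empty and `overdue_feeds(feeds, after)` contains `feed_abc123`.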

Refresh a URL

Re-index a URL to get updated content:
POST /personalities/:personality_id/knowledge/urls/:url_id/refresh
Response:
{
  "url_id": "url_abc123",
  "processing": {
    "status": "processing"
  }
}

Refresh a Feed

Immediately check a feed for new items:
POST /personalities/:personality_id/knowledge/feeds/:feed_id/refresh

Delete a URL

DELETE /personalities/:personality_id/knowledge/urls/:url_id

Delete a Feed

DELETE /personalities/:personality_id/knowledge/feeds/:feed_id
Deleting a feed removes it and all of its indexed items from the knowledge base.

Bulk Add URLs

Add multiple URLs at once:
POST /personalities/:personality_id/knowledge/urls/bulk
{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
  ],
  "metadata": {
    "category": "documentation"
  }
}
Response:
{
  "added": 3,
  "urls": [
    {"url_id": "url_abc123", "url": "https://docs.example.com/page1", "status": "processing"},
    {"url_id": "url_def456", "url": "https://docs.example.com/page2", "status": "processing"},
    {"url_id": "url_ghi789", "url": "https://docs.example.com/page3", "status": "processing"}
  ]
}
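For long URL lists, a client may want to deduplicate and split the list into several bulk requests. The sketch below does this with a `batch_size` of 100, which is an assumed value; this page does not state a per-request limit.

```python
def bulk_payloads(urls, metadata=None, batch_size=100):
    """Yield request bodies for POST .../knowledge/urls/bulk.

    Duplicates are dropped (first occurrence wins) and the remainder is split
    into batches. batch_size is an assumption, not a documented limit.
    """
    unique = list(dict.fromkeys(urls))  # preserves the original order
    for start in range(0, len(unique), batch_size):
        body = {"urls": unique[start:start + batch_size]}
        if metadata:
            body["metadata"] = metadata
        yield body


batches = list(bulk_payloads(
    [
        "https://docs.example.com/page1",
        "https://docs.example.com/page2",
        "https://docs.example.com/page1",  # duplicate, dropped
        "https://docs.example.com/page3",
        "https://docs.example.com/page4",
    ],
    metadata={"category": "documentation"},
    batch_size=2,
))
```

After each call, comparing the response's `added` count with the batch length is a simple way to detect URLs that were rejected.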

Content Extraction

Web pages are processed to extract meaningful content:
  1. HTML parsing - Extract text content
  2. Boilerplate removal - Remove navigation, footers, ads
  3. Structure preservation - Keep headings, lists, tables
  4. Metadata extraction - Title, description, publish date
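To make the four steps concrete, here is a deliberately small sketch using Python's built-in `html.parser`. It is not the service's extraction pipeline: the list of boilerplate tags, the kept block tags, and the output shape are all assumptions, and publish-date detection is left out.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}  # assumed boilerplate
KEEP_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}


class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self.blocks = []       # (tag, text) pairs, in document order
        self._skip_depth = 0   # > 0 while inside a boilerplate element
        self._current = None   # the kept tag currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get("name") == "description":
                self.description = attr_map.get("content")
        elif self._skip_depth == 0 and tag in KEEP_TAGS:
            self._current, self._buffer = tag, []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.blocks.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._current:
            self._buffer.append(data)


html = """<html><head><title>Getting Started</title>
<meta name="description" content="Install and configure the SDK."></head>
<body><nav><a href="/">Home</a><li>Docs</li></nav>
<h1>Getting Started</h1><p>Install the <b>SDK</b> first.</p>
<ul><li>Step one</li></ul>
<footer><p>Copyright Example</p></footer></body></html>"""

extractor = ContentExtractor()
extractor.feed(html)
```

Keeping the tag alongside each text block is what lets headings and list items survive as structure instead of collapsing into one run of text.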

Handling JavaScript Sites

For sites that require JavaScript rendering, enable it in the request options:
{
  "url": "https://app.example.com/docs",
  "options": {
    "render_javascript": true,
    "wait_for_selector": ".content-loaded"
  }
}
Note: JavaScript rendering increases processing time.