Knowledge bases

What is a knowledge base?

A knowledge base (KB) is the collection of data your assistant can search. It acts as the index that powers Retrieval-Augmented Generation (RAG).
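Conceptually, answering with RAG means embedding the user's question, retrieving the closest chunks from the KB's vector index, and letting the model answer from that context. The sketch below illustrates that loop; `embed`, `vectorStore`, and `llm` are hypothetical stand-ins, not Smartly.AI APIs.

```typescript
// Minimal retrieve-then-generate loop behind RAG (conceptual sketch).
// embed(), vectorStore, and llm are hypothetical stand-ins, not Smartly.AI APIs.
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: {
  search(v: number[], opts: { topK: number }): Promise<{ text: string }[]>;
};
declare const llm: { complete(prompt: string): Promise<string> };

async function answer(question: string): Promise<string> {
  const queryVector = await embed(question);                       // embed the user query
  const hits = await vectorStore.search(queryVector, { topK: 4 }); // nearest KB chunks
  const context = hits.map((h) => h.text).join("\n---\n");
  // the model answers grounded in the retrieved context
  return llm.complete(`Use only this context:\n${context}\n\nQuestion: ${question}`);
}
```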

What is a data source?

A Data Source is a specific document or set of documents that the system uses to answer user queries. It acts as a building block for your Knowledge Base.

What type of data source can I use?

  • Web Content: HTML pages provided via URLs.
  • Uploaded Files: PDF, DOC, CSV, TXT (images like PNG/JPG are OCR-processed).
  • Raw Text: Plain text.

ℹ️ Tip: Unstructured data works best. Large spreadsheets or tables are supported, but not optimal.

Limits: You can attach up to 50 knowledge bases per assistant, and each KB can host up to 5,000 data sources.

Reminder: Once you’re satisfied with a KB, don’t forget to attach it to your assistant.

What if I have a big website with many pages?

Managing a large website with multiple pages can be challenging, but we've made it simpler for you. To add your extensive web content into your Knowledge Base, just do the following:

  1. Add a 'Web Content' data source to your Knowledge Base.
  2. Click on the "Search URLs" button.
  3. Add a parent page and click "Search URLs".

Results populate within the associated parent folder, allowing you to selectively keep or remove URLs as needed.

This automatically extracts all the URLs from the designated page, saving you significant time. Once the URL list is generated, you can remove any irrelevant URLs by clicking the 'Delete' button. By extracting URLs from key pages of your website, you can swiftly map out your website's essential content in your Knowledge Base. Rest assured, duplicate URLs are not an issue; they will be ignored during the training phase.
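Under the hood, this kind of link discovery works roughly like the sketch below: fetch the parent page, collect every same-site link, and let a Set drop duplicates. This is an illustration using Cheerio and Node's built-in fetch, not the platform's actual implementation.

```typescript
import * as cheerio from "cheerio";

// Collect all same-site links from a parent page: conceptually what
// "Search URLs" does. Illustrative only, not the platform's implementation.
async function searchUrls(parentUrl: string): Promise<string[]> {
  const html = await (await fetch(parentUrl)).text();
  const $ = cheerio.load(html);
  const origin = new URL(parentUrl).origin;
  const found = new Set<string>(); // a Set silently drops duplicate URLs
  $("a[href]").each((_, el) => {
    const href = $(el).attr("href");
    if (!href) return;
    const url = new URL(href, parentUrl); // resolve relative links against the parent
    if (url.origin === origin) found.add(url.href); // keep same-site links only
  });
  return [...found];
}
```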

What if I have a scanned document?

If you have scanned documents to include in your Knowledge Base, fret not. The Smartly platform is equipped with OCR (Optical Character Recognition) algorithms that can extract text from image-based documents. To take advantage of this feature, simply add your scanned document as a PNG or JPG file in your data sources. The OCR engine will then automatically process these files to extract and index the text, making it part of your assistant's Knowledge Base.

Smartly.AI OCR (Optical Character Recognition)
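If you want a rough preview of what OCR is likely to recover from a scan before uploading it, you can run a quick local check. The sketch below uses the open-source tesseract.js library purely as an illustration; it is not the engine Smartly.AI runs internally.

```typescript
import Tesseract from "tesseract.js";

// Quick local preview of what OCR can recover from a scan
// (illustrative only; not the engine the platform uses internally).
async function previewOcr(imagePath: string): Promise<string> {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  return data.text; // extracted text, as a rough quality check before upload
}

// Hypothetical file name, for illustration.
previewOcr("scanned-contract.png").then(console.log);
```

A blurry or skewed scan that comes back as garbage here will likely index poorly too, so this is a cheap sanity check.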

What happens to my data sources?

Your data sources undergo several processes:

  1. Ingestion (Web scraping for web content, OCR for scanned docs)
  2. Cleaning
  3. Splitting
  4. Vectorization (via embeddings)
  5. Storage in a local vector store

After defining your data sources, click on the Train button to update your Knowledge Base with the new content. Once ingested and processed, the Knowledge Base will be used by the bot to answer user questions.
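To make steps 3-5 concrete, the sketch below splits a document into overlapping fixed-size windows, vectorizes each chunk, and stores it in an in-memory list. `embedText` is a hypothetical stand-in for whichever embedding model the platform uses, and real chunking is typically smarter about sentence boundaries.

```typescript
// Sketch of steps 3-5 (splitting, vectorization, storage).
// embedText is a hypothetical stand-in for an embedding model.
declare function embedText(text: string): Promise<number[]>;

type Chunk = { text: string; vector: number[] };

function split(doc: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < doc.length; i += size - overlap) {
    chunks.push(doc.slice(i, i + size)); // fixed-size windows with overlap
  }
  return chunks;
}

async function ingest(doc: string, store: Chunk[]): Promise<void> {
  for (const text of split(doc)) {
    store.push({ text, vector: await embedText(text) }); // vectorize, then store
  }
}
```

The overlap between windows keeps sentences that straddle a chunk boundary retrievable from at least one chunk.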


Web Scraping

What is web scraping and how is it used in my Gen AI?

Web scraping is a method we use to gather relevant data from web pages you specify as data sources in your knowledge base. This enriches your Gen AI with up-to-date information from the web.

Can we scrape any web page on the web?

While our goal is to scrape a broad range of web pages, some restrictions apply. Some websites have anti-scraping mechanisms, and certain FAQ sections that require interaction to view answers may pose challenges. We recommend you test different web pages and examine the scraped data for compatibility. Rest assured, we're continually enhancing our scraping capabilities and plan to introduce additional libraries in the near future.

How to deal with intranet web pages?

Scraping content from intranet pages is more complex, but we offer several options:

  • You can send us content through an API using a relay script within your intranet (see the sketch after this list).
  • Depending on your IT policy, a reverse proxy could be configured to securely route intranet content to our scraper.
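The relay-script option boils down to a small job that runs inside your network, fetches intranet pages only it can reach, and pushes their content out to an ingestion API. The endpoint and API key below are placeholders, not real Smartly.AI values.

```typescript
// Relay script running inside the intranet: reads internal pages and
// pushes their content to an external ingestion API.
const INGEST_ENDPOINT = "https://example.com/kb/ingest"; // placeholder URL
const API_KEY = process.env.KB_API_KEY ?? "";            // placeholder credential

async function relay(intranetUrl: string): Promise<void> {
  const html = await (await fetch(intranetUrl)).text(); // reachable only from inside
  await fetch(INGEST_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ url: intranetUrl, html }), // ship the page out for ingestion
  });
}
```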

What are the available options for web scraping?

Here are the options you can choose from, and when to use each (short code sketches of both follow the list):

  • Cheerio (HTML parsing)

    • Best for:
      • Static pages where HTML is fully returned by the server.
      • High-throughput, low-cost scraping (fast and lightweight).
      • Simple extraction from blogs, documentation, news, or sitemap-linked pages.
    • Pros:
      • Very fast and resource efficient.
      • Low maintenance and easier to scale.
    • Cons:
      • Cannot execute JavaScript. Won’t see content loaded dynamically by client-side frameworks (React/Vue/Angular).
    • Use it when:
      • You can fetch the final HTML via a simple HTTP GET.
      • The content you need is visible in the page source or in static endpoints (RSS, JSON API, sitemap).
      • You want to minimize cost and complexity.
    • Tip:
      • If some parts are missing, check if the page calls a JSON API you can query directly.
  • Puppeteer (headless browser automation)

    • Best for:
      • Dynamic or SPA sites where content is rendered by JavaScript after load.
      • Pages that require user interaction (click, scroll, form submit) or login sessions.
      • Scraping behind infinite scroll, pagination buttons, or cookie walls.
    • Pros:
      • Executes JavaScript, renders the page like a real browser.
      • Can handle waits, selectors, screenshots, and complex flows.
    • Cons:
      • Heavier, slower, and more expensive to run at scale.
      • More sensitive to anti-bot protections (may need human-like delays, retries, proxies).
    • Use it when:
      • Cheerio returns incomplete content because the page builds content client-side.
      • You need to simulate user actions or wait for elements to load.
      • The site requires authentication or complex navigation.
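For reference, here is roughly what each approach looks like in code. First, a Cheerio pass: one HTTP GET, then parsing whatever HTML the server already returned. The `.faq-item` selector is illustrative, not tied to any specific site.

```typescript
import * as cheerio from "cheerio";

// Cheerio pass: one HTTP GET, then parse the static HTML.
// The ".faq-item" selector is illustrative, not from a real site.
async function scrapeStatic(url: string): Promise<string[]> {
  const html = await (await fetch(url)).text(); // the final HTML must come from the server
  const $ = cheerio.load(html);
  return $(".faq-item")
    .map((_, el) => $(el).text().trim())
    .get(); // .get() converts the Cheerio collection to a plain array
}
```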
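And the Puppeteer equivalent: render the page in a headless browser so client-side JavaScript runs before the content is read. Same illustrative selector as above.

```typescript
import puppeteer from "puppeteer";

// Puppeteer pass: render the page in a headless browser so client-side
// JavaScript executes before we read the content.
async function scrapeDynamic(url: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" }); // wait for JS-driven requests to settle
    await page.waitForSelector(".faq-item");             // content exists only after rendering
    return await page.$$eval(".faq-item", (els) =>
      els.map((el) => el.textContent?.trim() ?? "")
    );
  } finally {
    await browser.close(); // always release the browser
  }
}
```

Note how much more machinery the Puppeteer version carries; this is the cost behind the "start with Cheerio" recommendation below.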

Recommended decision path

  • Start with Cheerio:
    • Works in most cases for static pages.
    • Faster, cheaper, simpler.
  • Switch to Puppeteer if:
    • Content doesn’t appear in page source.
    • You see placeholders like “Loading…” or empty containers until JS runs.
    • You must click, scroll, or log in to reveal content.

Indexation

What is indexation?

Indexation prepares your knowledge base so your assistant can quickly search and retrieve answers. It includes ingestion, cleaning, splitting, vectorization, and storage.

When is indexation needed?

Re-index whenever:

  • KB settings are updated,
  • KB content is changed, or
  • linked external sources (like websites) may have changed.

👉 The dashboard will highlight when indexation is required.


How to start indexation

Click the Index button next to your assistant.


Tracking progress

A popup will display the progress and confirm once indexation is complete.


After indexation

  • Review content chunks inside each data source (via View Content).
  • Refine scraping and chunking settings until the KB looks right.
  • Attach the KB to your assistant.

Your assistant is now ready for testing and production.