Knowledge bases
What is a knowledge base?
A Knowledge Base is the dataset that serves as an index for the Retrieval-Augmented Generation (RAG) system. It's the foundational layer that the system uses to generate answers.
What is a data source?
A Data Source is a specific document or set of documents that the system uses to answer user queries. It acts as a building block for your Knowledge Base.
What type of data source can I use?
The system performs best with unstructured data. Tabular data and large Excel files can be ingested, but they are not ideal for performance.
Available types of data sources:
- Web Content: HTML pages provided via URLs.
- Uploaded Files: PDF, DOC, CSV, TXT. Images like PNG and JPG will be processed via OCR (Optical Character Recognition).
- Raw Text: Plain text can also be used as a data source.
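As a rough illustration of how the source types above differ, the sketch below routes a source string to an ingestion path based on the formats listed. The function name and the route labels (`web_content`, `ocr`, `file_parse`, `raw_text`) are assumptions made for this example, not the platform's actual identifiers.

```python
from pathlib import Path

# Extensions come from the list above; the route names are illustrative only.
TEXT_EXTENSIONS = {".pdf", ".doc", ".csv", ".txt"}
IMAGE_EXTENSIONS = {".png", ".jpg"}


def classify_source(source: str) -> str:
    """Return a hypothetical ingestion route for a data source string."""
    # URLs are treated as web content, whatever their path ends with.
    if source.startswith(("http://", "https://")):
        return "web_content"
    suffix = Path(source).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "ocr"  # images go through OCR
    if suffix in TEXT_EXTENSIONS:
        return "file_parse"
    # Anything else is handled as plain text in this sketch.
    return "raw_text"
```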
File Hosting Note
All media and documents added to the platform's Gen AI data sources, whether web pages or uploaded files, are stored securely in our local cloud-based file system for processing and retrieval. This covers PDF, DOC, CSV, and TXT files, as well as PNG and JPG images, which are processed using OCR (Optical Character Recognition).
What if I have a big website with many pages?
To add the content of a large website to your Knowledge Base, do the following:
- Add a 'Web Content' data source to your Knowledge Base.
- Click on the "Search URLs" button.
This action automatically extracts all the URLs from the designated web pages. Once the URL list is generated, you can remove any irrelevant URLs by clicking the 'Delete' button. Extracting URLs from the key pages of your website lets you quickly map its essential content into your Knowledge Base. Duplicate URLs are not an issue: they are ignored during the training phase.
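To make the idea of URL extraction and duplicate handling concrete, here is a minimal sketch using only the Python standard library. It is not the platform's implementation: the `LinkCollector` and `extract_urls` names are hypothetical, and dropping fragments and non-HTTP links is an assumption about what counts as a duplicate or irrelevant URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html: str, base_url: str) -> list[str]:
    """Return absolute, de-duplicated http(s) URLs in first-seen order."""
    parser = LinkCollector()
    parser.feed(html)
    seen: set[str] = set()
    urls: list[str] = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so anchors on the
        # same page are not counted as separate URLs.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Keeping first-seen order means the resulting list mirrors the order of links on the page, which makes it easier to review before deleting entries.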
What if I have a scanned document?
The Smartly platform uses OCR (Optical Character Recognition) to extract text from image-based documents. To use this feature, add your scanned document as a PNG or JPG file in your data sources. The file is then processed automatically to extract and index the text, making it part of your Gen AI's Knowledge Base.
What happens to my data sources?
Your data sources undergo several processes:
- Ingestion (Web scraping for web content, OCR for scanned docs)
- Cleaning
- Splitting
- Vectorization (via embeddings)
- Storage in a local vector store
After defining your data sources, click on the Train button to update your Knowledge Base with the new content. Once ingested and processed, the Knowledge Base will be used by the bot to answer user questions.
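The cleaning, splitting, vectorization, and storage steps can be pictured with the toy pipeline below. It is a sketch under stated assumptions, not the platform's code: a term-count vector stands in for a real embedding model, an in-memory list stands in for the vector store, and every name (`clean`, `split`, `embed`, `InMemoryVectorStore`, `train`) and the chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def clean(text: str) -> str:
    """Cleaning: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split(text: str, chunk_words: int = 50, overlap: int = 10) -> list[str]:
    """Splitting: cut text into overlapping windows of words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be >= 0 and smaller than chunk_words")
    words = text.split()
    if not words:
        return []
    step = chunk_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_words]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Vectorization: toy stand-in for an embedding (sparse term counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Storage: keeps (vector, chunk) pairs in a plain list."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add(self, chunk: str) -> None:
        self.rows.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: -cosine(q, row[0]))
        return [chunk for _, chunk in ranked[:k]]


def train(documents: list[str], store: InMemoryVectorStore) -> None:
    """Run clean -> split -> vectorize -> store for each ingested document."""
    for doc in documents:
        for chunk in split(clean(doc)):
            store.add(chunk)
```

At query time, the same `embed` function is applied to the user question, and the closest chunks are handed to the bot as context for its answer.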
Web Scraping
What is web scraping and how is it used in my Gen AI?
Web scraping is a method we use to gather relevant data from web pages you specify as data sources in your knowledge base. This enriches your Gen AI with up-to-date information from the web.
Can we scrape any web page on the web?
We aim to scrape a broad range of web pages, but some restrictions apply. Some websites have anti-scraping mechanisms, and FAQ sections that require interaction to reveal answers may not be captured correctly. We recommend testing different web pages and examining the scraped data for compatibility. We are continually enhancing our scraping capabilities and plan to introduce additional libraries in the near future.
How to deal with intranet web pages?
Scraping content from intranet pages is more complex, but we offer several options:
- You can send us content through an API using a relay script within your intranet.
- Depending on your IT policy, a reverse proxy could be configured to securely route intranet content to our scraper.
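The first option, a relay script, could look roughly like the sketch below: a script running inside your network fetches an intranet page and forwards its content over HTTPS. The endpoint URL, the token, and the JSON field names are all placeholders invented for this example; use the values and payload format you are actually given.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the real endpoint and credentials.
INGEST_ENDPOINT = "https://api.example.com/ingest"
API_TOKEN = "YOUR_TOKEN"


def build_relay_request(page_url: str, html: str) -> urllib.request.Request:
    """Build the POST request that forwards one page's content."""
    payload = json.dumps({"url": page_url, "content": html}).encode("utf-8")
    return urllib.request.Request(
        INGEST_ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


def relay_page(page_url: str, timeout: float = 10.0) -> int:
    """Fetch an intranet page from inside the network and forward it."""
    with urllib.request.urlopen(page_url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    request = build_relay_request(page_url, html)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return reply.status  # HTTP status code returned by the endpoint
```

Because the script only makes outbound requests, no inbound access to the intranet is needed, which is the main difference from the reverse-proxy option.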
What are the available options for web scraping?
Here are the available options for web scraping:
- Scraping Library: Options include `Puppeteer`, with `Cheerio` and `Playwright` coming soon.
- Minimum Waiting Time: Define the minimum waiting time in seconds for the web page to fully load. The default is set at 2 seconds. Reducing this to zero will prompt the scraper to pull content immediately upon page loading. Note that some websites use delayed loading as a security measure against bots, so a longer waiting period may be needed.
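The effect of the waiting-time option can be sketched as follows. This is an illustration only: `ScrapeOptions`, `scrape`, and the `open_page` / `read_content` callables are hypothetical stand-ins for whatever the scraping library provides, and only the 2-second default comes from the text above.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ScrapeOptions:
    library: str = "puppeteer"
    min_wait_seconds: float = 2.0  # default stated in the options above

    def __post_init__(self) -> None:
        if self.min_wait_seconds < 0:
            raise ValueError("min_wait_seconds must be zero or positive")


def scrape(
    url: str,
    open_page: Callable[[str], Any],
    read_content: Callable[[Any], str],
    options: ScrapeOptions = ScrapeOptions(),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Open a page, wait the configured minimum, then read its content."""
    page = open_page(url)
    # A wait of zero reads the content as soon as the page has loaded.
    if options.min_wait_seconds > 0:
        sleep(options.min_wait_seconds)
    return read_content(page)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a page that loads its content late would simply need a larger `min_wait_seconds`.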
Prepare Your Gen AI
After configuring your knowledge base, instructions, and various settings, the next step is to initiate a "training" process. This step is crucial for preparing your Gen AI for production use.
Why is training necessary?
The training process is responsible for performing all the backend operations required to make your Gen AI fully functional. Depending on the size of your knowledge base, this process may take some time as it involves multiple steps to correctly ingest and index each document.
How to know when training is needed?
Your dashboard will indicate which Gen AI models require training. Training is needed in particular after you update your knowledge base or modify the Search Engine settings.
Initiating the training process
To start the training process, simply click on the "Train" button associated with your Gen AI.
Progress Status
Upon launching the training, a popup will appear displaying the progress status of the operation.
Post-Training Steps
Once the training process is successfully completed, you're all set to deploy and test your Gen AI in a production environment.