
Your data retriever

This section will grow into a full guide to retrieval-augmented generation (RAG).

For starters, let's review what we learned during development. These tips will help you get more out of your data in our implementation.

Building prompts

  • Be specific. Think about how you phrase your question, and use keywords to get the best retrieval results.
  • Keywords are company- or field-specific words that narrow the field of interest. You know them best, so experiment with them.
  • Even if your data is in another language, you can always try writing the query in English. Sometimes the answers are better.
  • Don't expect database-like responses from the chat. It is a summarization engine: it takes the best-matching documents and condenses them into a shorter response.
    • If you need exact answers, wait for our text-to-SQL chat with database retrieval.
  • Don't ask about numbers, and don't trust any number you see. Check the referenced file for the final answer.
  • Finally, it is not really a chat: it doesn't remember your previous queries. This is because of the model's context length limit, which the loaded documents use up very quickly. That isn't a bad thing, though ;)
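
To illustrate why specific keywords matter, here is a minimal sketch of a retrieval step. It is not our implementation: the real retriever uses a vector store index, while this toy version ranks documents by plain word overlap (bag-of-words cosine similarity). The document names, their contents, and the function names are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query: str, documents: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of up to k documents that share words with the query."""
    q = Counter(tokenize(query))
    scored = {name: cosine(q, Counter(tokenize(text))) for name, text in documents.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return [name for name in ranked[:k] if scored[name] > 0]


# Hypothetical documents, invented for this illustration
documents = {
    "leave_policy.txt": "Employees request annual leave through the HR portal.",
    "gpu_servers.txt": "The GPU cluster uses A100 cards for model training.",
    "invoices.txt": "Invoices are issued monthly by the accounting team.",
}

print(top_matches("how does it work", documents))      # [] - a vague query matches nothing
print(top_matches("annual leave request", documents))  # ['leave_policy.txt']
```

Each call is independent, just like each query in the interface: nothing from the first question is carried into the second, so every query has to contain its own keywords.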

Upload new data

If you gave us some initial data, it should already be uploaded under your tenant. It is accessible through the vector store index and as raw files. You can also upload more data, which becomes available as soon as the indexer finishes its job (the interface is grayed out while it runs). New data is accessible to all users in your tenant.

Supported file types

  • PDF
  • CSV
  • JSON
  • TXT
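
Before uploading a larger batch, it can help to check which files need converting first. The following is a minimal sketch of such a check. The function name and the file names are hypothetical, and treating extensions as case-insensitive is an assumption, not a documented behavior of the uploader.

```python
from pathlib import Path

# Extensions taken from the list of supported file types above
SUPPORTED_EXTENSIONS = {".pdf", ".csv", ".json", ".txt"}


def split_by_support(file_names):
    """Partition file names into (supported, unsupported) by extension."""
    supported, unsupported = [], []
    for name in file_names:
        # Assumption: the extension check ignores letter case
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


# Hypothetical file names
ok, needs_conversion = split_by_support(["report.PDF", "prices.xlsx", "notes.txt"])
print(ok)                # ['report.PDF', 'notes.txt']
print(needs_conversion)  # ['prices.xlsx']
```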

How to

Just drag and drop files into the box on the left.

How to upload your files

Then click Send.
After a while (watch the running indicator in the top right corner), the new files are in the knowledge base and you are good to go.

Convert XLS using Pandas

If you want to insert XLS files, convert them to CSV or JSON first. The current data loaders handle JSON better because column names are preserved.

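from pathlib import Path

import pandas as pd

file_name = 'some_file.xlsx'
name_wo_extension = Path(file_name).with_suffix('')  # safe for names containing dots

# Read the first sheet once; the first row holds the column names
df = pd.read_excel(file_name, header=0)

# Save to CSV (without the row index)
df.to_csv(f'{name_wo_extension}.csv', index=False)

# Save to JSON as one object per row, keyed by column name
# (index=False is rejected by to_json with the default orient)
df.to_json(f'{name_wo_extension}.json', orient='records')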

In future implementations, there will be an auto converter of XLS and XLSX files.
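
If you already have CSV exports and would rather upload JSON, a conversion along these lines may help. It is only a sketch that uses the Python standard library instead of Pandas; the function name and the sample data are hypothetical. It shows what "preserved column names" means in practice: every row becomes an object that repeats the column names as keys.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of row objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical data, invented for this illustration
csv_text = "product,price\nServer A,1200\nServer B,1500\n"
print(csv_to_records(csv_text))
```

Note that `csv.DictReader` reads every value as a string, so numbers such as `1200` end up quoted in the JSON output.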

Prepare data

You can always try to upload any kind of raw data, but you will get the best results if you prepare it with the retriever in mind.

Example

If your data is a PDF that consists mainly of tables, consider converting those tables to CSV or, even better, JSON files. An auto parser is in development.

Also, if your XLS files are not really database-like (rows and columns), consider converting them to PDF.

Known problems

Some are real, some are just annoying

  • Language - even if the question and context are written in Polish, the response will be in English in 99% of cases. We're working on it, but for now we don't have a sufficient Polish-language corpus to fine-tune the model.
  • Disappearing API-KEY - this is caused by Streamlit, the technology used in the front end. We don't see any future with it, so we treat it only as a quick-to-implement interface for testing our back end.

Ask for more

We keep up with the newest solutions and are always developing better versions of our software. If you don't like what you see, talk to us. We love your feedback because it is the only way to get better.

Connect with us at ai@comtegra.pl