Webiks-Hebrew-RAGbot-KolZchut-Paragraph-Corpus

This repo contains a corpus of all paragraphs from the Kol-Zchut website.
The file Webiks_Hebrew_RAGbot_KolZchut_Paragraphs_Corpus_v1.0.json is the paragraph corpus used to train the Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder model. Instructions on how to use this data for training can be found in the following repository. Each entry in the corpus represents one paragraph. The corpus includes all relevant paragraphs from the Kol-Zchut website. They were extracted by splitting the Kol-Zchut webpages according to their HTML titles and then combining paragraphs up to the maximum context size of the me5-large model, which is 512 tokens.
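
The combining step described above can be illustrated with a minimal sketch. It is not the code used to build this corpus: the function name `pack_paragraphs`, the greedy strategy, and the whitespace-based token counter are all assumptions made for illustration. A real pipeline would presumably count tokens with the me5-large tokenizer instead.

```python
from typing import Callable, List


def pack_paragraphs(
    paragraphs: List[str],
    max_tokens: int = 512,
    # Assumption: whitespace splitting stands in for the real model tokenizer.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Greedily merge consecutive paragraphs into chunks of at most max_tokens.

    A single paragraph that is longer than max_tokens is kept as its own
    chunk rather than being split, so it may exceed the limit.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n

    # Flush whatever is left after the last paragraph.
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Passing a different `count_tokens` callable is the intended way to swap in a real tokenizer, for example one that returns the length of the token ID list produced for the text.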

Fields

The corpus was extracted from the Kol-Zchut website in May 2024 and is not necessarily up to date with the current website.

This data is published under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license.
More info: https://creativecommons.org/licenses/by-nc-sa/2.5/
