NNLP-IL/Webiks-Hebrew-RAGbot-KolZchut-Paragraph-Corpus

Webiks-Hebrew-RAGbot-KolZchut-Paragraph-Corpus

  • This repo contains a corpus of all paragraphs from the Kol-Zchut website.
  • The file Webiks_Hebrew_RAGbot_KolZchut_Paragraphs_Corpus_v1.0.json is the paragraph corpus used to train the Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder model. Instructions on how to use this data for training can be found in the following repository. Each entry in the corpus represents one paragraph. The corpus includes all relevant paragraphs from the Kol-Zchut website. They were extracted by splitting the Kol-Zchut webpages at their HTML titles and then combining paragraphs up to 512 tokens, the maximum context size of the me5-large model.
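The combining step described above can be sketched as a greedy packer: paragraphs are appended to the current chunk until adding the next one would exceed the token budget. This is only an illustration and not necessarily how the corpus was built. The function names are hypothetical, and the whitespace token counter is a stand-in assumption for the me5-large tokenizer, which would give different counts.

```python
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    # Assumption: whitespace splitting stands in for the real model tokenizer.
    return len(text.split())


def pack_paragraphs(
    paragraphs: List[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> List[str]:
    """Greedily merge consecutive paragraphs into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Close the current chunk if this paragraph would push it over budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

In this sketch a single paragraph longer than the budget is kept as its own oversized chunk rather than being split; whether the original extraction handled that case the same way is not stated in this README.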

Fields

  • doc_id: The unique identifier of the website page.
  • title: The title of the website page.
  • link: The website page link.
  • content: The text of the paragraph extracted from the page.
  • license: The license under which the file is published.
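As a rough illustration of working with these fields, the sketch below loads the corpus and groups paragraph texts by doc_id. It assumes the JSON file holds a top-level list of objects carrying the field names listed above; that layout is not confirmed here, so adjust the loader if the file is structured differently. The helper names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_corpus(path: str) -> List[dict]:
    # Assumption: the file is a JSON array of entry objects.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_by_doc_id(entries: List[dict]) -> Dict[object, List[str]]:
    """Collect the 'content' of every entry under its page's 'doc_id'."""
    pages: Dict[object, List[str]] = defaultdict(list)
    for entry in entries:
        pages[entry["doc_id"]].append(entry["content"])
    return dict(pages)
```

A typical call would be `group_by_doc_id(load_corpus("Webiks_Hebrew_RAGbot_KolZchut_Paragraphs_Corpus_v1.0.json"))`, which returns a mapping from each page identifier to its paragraphs in file order.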

Notes

  • The corpus was extracted from the Kol-Zchut website in May 2024 and may not reflect the current state of the website.

License

This data is published under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license.
More info: https://creativecommons.org/licenses/by-nc-sa/2.5/
