Commit b905e89

authored
Added book extraction script
1 parent 80767dd commit b905e89

1 file changed, +10 -1 lines changed

README.md

+10-1
@@ -19,7 +19,7 @@ at the Goethe Universität, has created tokenised versions of four languages
 (Chinese, Japanese, Thai, Vietnamese). They are included in this collection but they can also be found
 [here](https://www.hucompute.org/ressourcen/corpora).

-Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools). If you are looking for a quick way to generating a raw text version of each Bible, you can use following Python snippet (replace `lang` with the name of the XML file):
+If you are looking for a quick way to generating a raw text version of each Bible, you can use following Python snippet (replace `lang` with the name of the XML file):
 ```
 import xml.etree.ElementTree as ET
 lang = 'English'
@@ -28,3 +28,12 @@ with open(lang + '.txt', 'w', encoding='utf-8') as out:
     for n in root.iter('seg'):
         out.write(n.text.strip() + '\n')
 ```
+or for a specific book:
+```
+book_id = 'b.GEN'
+with open(lang + '-' + book_id + '.txt', 'w', encoding='utf-8') as out:
+    for n in root.findall('.//div[@id="'+book_id+'"]/*seg'):
+        out.write(n.text.strip() + '\n')
+```
+
+Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools).
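
Note on the added per-book snippet: the path ends in `/*seg`, which is not standard XPath syntax, and `n.text.strip()` would raise on a `seg` element with no text. A minimal, self-contained sketch of the same idea follows. It assumes each book is a `div` whose `id` is the book code and that verses are `seg` elements nested somewhere below it. The sample XML and the helper name `book_verses` are hypothetical, made up for illustration, and not part of the commit.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document. The layout (book div > chapter div > seg) is an
# assumption about the corpus files, not something shown in the diff.
SAMPLE = """<cesDoc><text><body>
<div id="b.GEN" type="book">
  <div id="b.GEN.1" type="chapter">
    <seg id="b.GEN.1.1" type="verse"> In the beginning </seg>
    <seg id="b.GEN.1.2" type="verse"/>
  </div>
</div>
<div id="b.EXO" type="book">
  <div id="b.EXO.1" type="chapter">
    <seg id="b.EXO.1.1" type="verse">These are the names</seg>
  </div>
</div>
</body></text></cesDoc>"""


def book_verses(root, book_id):
    """Return the stripped text of every seg below the div with this id."""
    book = root.find('.//div[@id="' + book_id + '"]')
    if book is None:
        # Unknown book id: return nothing instead of failing.
        return []
    # iter() walks all descendants, so the nesting depth does not matter.
    # Segs without text (n.text is None) or with only whitespace are skipped.
    return [n.text.strip() for n in book.iter('seg') if n.text and n.text.strip()]


root = ET.fromstring(SAMPLE)
verses = book_verses(root, 'b.GEN')
```

For a real corpus file, `root` would come from `ET.parse(...)` as in the README snippet, and the returned list could be written out one verse per line.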
