Wednesday, August 6, 2008

Sword Project Bible Reader in Python

Update: I've updated this and put it on GitHub.

The SWORD API is complex. I did at one point try the SWIG wrapper, but that was still quite difficult to get right.

Fortunately, as complicated as the code is, what it actually does turns out to be very simple. With the help of zverse.cpp, libtool gdb examples/cmdline/lookup, and the Holy Spirit, I came up with this simple Python file to read a verse given an index:

#!/usr/bin/env python

# * ztext format documentation
# I'll use Python's struct module's format strings.
# See http://docs.python.org/lib/module-struct.html
# Take the Old Testament (OT) for example. Three files:
#
# - ot.bzv: Maps verses to character ranges in compressed buffers.
#   10 bytes ('<IIH') for each verse in the Bible:
#     - buffer_num (I): which compressed buffer the verse is located in
#     - verse_start (I): the location in the uncompressed buffer where the verse begins
#     - verse_len (H): length of the verse, in uncompressed characters
#   These 10-byte records are densely packed, indexed by VerseKey 'indices' (docs later).
#   So the record for the verse with index x starts at byte 10*x.
#
# - ot.bzs: Tells where the compressed buffers start and end.
#   12 bytes ('<III') for each compressed buffer:
#     - offset (I): where the compressed buffer starts in the file
#     - size (I): the length of the compressed data, in bytes
#     - uc_size (I): the length of the uncompressed data, in bytes (unused)
#   These 12-byte records are densely packed, indexed by buffer_num (see previous).
#   So the record for compressed buffer buffer_num starts at byte 12*buffer_num.
#
# - ot.bzz: Contains the compressed text. Read 'size' bytes starting at 'offset'.
#
# NT is analogous.

# Configuration (set this to your own modules path):

modules_path = '/home/kcarnold/.sword/modules/texts/ztext'


import struct, zlib
from os.path import join as path_join

class ZModule(object):
    def __init__(self, module):
        self.module = module
        self.files = {
            'ot': self.get_files('ot'),
            'nt': self.get_files('nt')
            }

    def get_files(self, testament):
        '''Given a testament ('ot' or 'nt'), returns a tuple of files
        (verse_to_buf, buf_to_loc, text)
        '''
        base = path_join(modules_path, self.module)
        v2b_name, b2l_name, text_name = [path_join(base, '%s.bz%s' % (testament, code))
                                         for code in ('v', 's', 'z')]
        return [open(name, 'rb') for name in (v2b_name, b2l_name, text_name)]

    def text(self, testament, index):
        '''Get the text for a given index.'''
        verse_to_buf, buf_to_loc, text = self.files[testament]

        # Read the verse record.
        verse_to_buf.seek(10*index)
        buf_num, verse_start, verse_len = struct.unpack('<IIH', verse_to_buf.read(10))

        uncompressed_text = self.uncompressed_text(testament, buf_num)
        return uncompressed_text[verse_start:verse_start+verse_len]

    def uncompressed_text(self, testament, buf_num):
        verse_to_buf, buf_to_loc, text = self.files[testament]

        # Determine where the compressed data starts and ends.
        buf_to_loc.seek(buf_num*12)
        offset, size, uc_size = struct.unpack('<III', buf_to_loc.read(12))

        # Get the compressed data.
        text.seek(offset)
        compressed_data = text.read(size)
        return zlib.decompress(compressed_data)

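if __name__=='__main__':
    import sys
    mod_name = sys.argv[1]
    testament = sys.argv[2]
    index = int(sys.argv[3])

    module = ZModule(mod_name)
    # Calling print with parentheses works on both Python 2 and Python 3
    # (on Python 3 the verse is a bytes object).
    print(module.text(testament, index))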

This was a one-evening project, mostly taken up by the reverse-engineering. As you can see, Python provides a lot of the foundation work so that the code we actually have to write is very small.
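To make the three-file layout concrete without needing an installed module, here is a minimal sketch that builds tiny in-memory stand-ins for the .bzv, .bzs and .bzz files and then reads a verse back using the same seek/unpack/decompress steps as the script above. The verse texts are invented, and the helper names (build_fake_testament, read_verse) are hypothetical; nothing here touches the real SWORD library, and real modules may contain entries this toy data does not model.

```python
import io
import struct
import zlib

# Record layouts as described in the format notes above.
VERSE_RECORD = struct.Struct('<IIH')   # buffer_num, verse_start, verse_len
BUFFER_RECORD = struct.Struct('<III')  # offset, size, uc_size


def build_fake_testament(buffers):
    """Pack lists of made-up verses (bytes) into in-memory bzv/bzs/bzz files."""
    bzv, bzs, bzz = io.BytesIO(), io.BytesIO(), io.BytesIO()
    for buf_num, verses in enumerate(buffers):
        start = 0
        for verse in verses:
            # One 10-byte record per verse, in index order.
            bzv.write(VERSE_RECORD.pack(buf_num, start, len(verse)))
            start += len(verse)
        raw = b''.join(verses)
        compressed = zlib.compress(raw)
        # One 12-byte record per compressed buffer.
        bzs.write(BUFFER_RECORD.pack(bzz.tell(), len(compressed), len(raw)))
        bzz.write(compressed)
    return bzv, bzs, bzz


def read_verse(bzv, bzs, bzz, index):
    """Look up one verse by index, following the three-file layout."""
    bzv.seek(VERSE_RECORD.size * index)
    buf_num, verse_start, verse_len = VERSE_RECORD.unpack(
        bzv.read(VERSE_RECORD.size))

    bzs.seek(BUFFER_RECORD.size * buf_num)
    offset, size, _uc_size = BUFFER_RECORD.unpack(
        bzs.read(BUFFER_RECORD.size))

    bzz.seek(offset)
    text = zlib.decompress(bzz.read(size))
    return text[verse_start:verse_start + verse_len]


# Two compressed buffers holding three invented "verses" in total.
fake_files = build_fake_testament([
    [b'first verse', b'second verse'],
    [b'third verse'],
])
print(read_verse(*fake_files, index=2))
```

Because the records are fixed-size and densely packed, a lookup costs two seeks into the index files plus one decompression of the buffer that holds the verse.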

I still need to figure out how human-readable verse identifiers get mapped to those numerical indices. It's hidden somewhere in VerseKey...

2 comments:

Anonymous said...

Nice reverse engineering. I wonder if you are going to further develop this. The SWIG method is unintelligible.

Anonymous said...

Nice. Little suggestion, though:
import os
modules_path = os.environ["HOME"]+"/.sword/modules/texts/ztext"