Here at the Research Software Company, we currently have a big corpus of TEI-encoded Hebrew songs we are working with. (If you have Hava Nagila playing in your head now, we apologize, but we promise Hebrew music is much richer and more complex than that tired Bar Mitzvah standard. But I digress.) Since we don’t like working with thousands of files, we store the documents in a database. In this case, since these are TEI documents, we decided to use eXist-db .
We’ve written a Python script that takes all the files, organizes them a bit and stores each file as a document in the database, using eXist-db as a path for each document. The name we’ve chosen was based on the filename of the file.
This all worked very well. We were able to issue xQuery queries on the database and we were very happy. Not dancing to Hava Nagila happy, but you know, able to continue our work.
Then we wanted something a little more complex. We wanted to take the TEI documents, perform some sort of processing on them and store the processed documents in a different collection in the database. This is basically pretty simple, the way to do that is:
- Query all the documents you want to process.
- Update each document and store it in a new document path.
This proved to be very tricky. The queries documents were all properly retrieved from the database. The paths, however, were returned as gibberish. Not all paths, just paths containing Hebrew letters. This was clearly a UTF-8 decoding problem.
After some investigation we nailed it down to a pretty obscure eXist-db bug. We used pyexistdb for accessing eXist-db from Python. Turns out eXist-db offers two HTTP-based APIs which are both used by pyexistdb. There’s a REST API and an XML-RPC API.
Pyexistdb uses the REST API to perform most xQuery queries. However, to get the document path we needed to use another method, which turned out to be using the XML-RPC API. Using this API caused the issue.
When using the REST API, eXist-db returns the following content-type header:
content-type: application/xml; charset=UTF-8
However, using XML-RPC the header is:
Note the missing
charset=UTF-8? The HTTP standard specifies
ISO-8559-1 as the default encoding, which caused all our Hebrew text to be decoded into gibberish.
Thankfully, the eXist-db people are blazingly fast, and have already fixed this. Version 4.20 should work properly. We hope it’s going to be released soon. Meanwhile, we’ll be dancing the hora.