First end-to-end demonstration

In an earlier post, I described how one can view content from a remote repository within the HUBzero implementation of the Bamboo Work Space. In another recent post, Steve Masover described a proof-of-concept scholarly service running on the BSP that proxies the Google translation service. In this post I connect the dots: I describe fetching content from HathiTrust through the repository browser interface, preparing and submitting the content to the BSP's service, and finally retrieving the results back into the work space.

Process

Step 1. Fetch the text

In the earlier post I showed a screen shot of part of the page image for physical page 10 from the 1840 title, The Youth of Shakespeare. Let's use that page as an example. In addition to page images, HathiTrust delivers two other representations of a page: 1) an XML file that maps words within lines onto regions of the image, using a schema amusingly called "DjVuXML"; and 2) a plain text transcription of the page. I chose the XML file because its more explicit structure makes it easier to work with. By slightly tweaking the repository browser code, I could use the repository browser to capture the file. Here's an excerpt from the file that corresponds to the first few lines of the text.

<DjVuXML>
<BODY>
<OBJECT data="file://localhost//tmp/derive/youthofshakspear02will//youthofshakspear02will.djvu" height="3409" type="image/x.djvu" usemap="youthofshakspear02will_0010.djvu" width="1932">
<PARAM name="PAGE" value="youthofshakspear02will_0010.djvu"/>
<PARAM name="DPI" value="500"/>
<HIDDENTEXT>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="155,153,187,112,153">4</WORD>
<WORD coords="556,154,669,120,154">THE</WORD>
<WORD coords="713,154,894,119,152">YOUTH</WORD>
<WORD coords="939,152,1002,118,151">OF</WORD>
<WORD coords="1047,151,1417,117,150">SHAKSPEARE.</WORD>
</LINE>
</PARAGRAPH>
<PARAGRAPH>
<LINE>
<WORD coords="152,303,260,252,302">that</WORD>
<WORD coords="288,303,361,252,302">his</WORD>
<WORD coords="399,304,592,252,302">clothes</WORD>
<WORD coords="620,304,750,271,302">were</WORD>
<WORD coords="797,302,1049,250,300">saturated</WORD>
<WORD coords="1098,299,1216,249,298">with</WORD>
<WORD coords="1246,300,1330,249,299">the</WORD>
<WORD coords="1359,298,1496,265,298">same</WORD>
<WORD coords="1525,298,1774,247,297">moisture.</WORD>
</LINE>
<LINE>
<WORD coords="153,377,282,326,377">This</WORD>
<WORD coords="308,378,454,328,377">made</WORD>
<WORD coords="508,378,609,327,377">him</WORD>
<WORD coords="656,377,801,327,376">make</WORD>
<WORD coords="856,376,916,345,376">an</WORD>
<WORD coords="966,375,1251,324,374">immediate</WORD>
<WORD coords="1298,391,1512,334,372">attempt</WORD>
<WORD coords="1563,373,1613,336,372">to</WORD>
<WORD coords="1664,372,1774,323,371">rise,</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>
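
To make the structure concrete, here is a minimal Python sketch (not part of the original demo) that reads a DjVuXML page with the standard library. The function names lines_from_djvu and word_boxes are my own illustrative choices, and SAMPLE is a trimmed copy of the excerpt above. The coords values are kept as plain integers without interpreting what each one means.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above; a real page would be read from a file.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION>
<PARAGRAPH><LINE>
<WORD coords="155,153,187,112,153">4</WORD>
<WORD coords="556,154,669,120,154">THE</WORD>
<WORD coords="713,154,894,119,152">YOUTH</WORD>
</LINE></PARAGRAPH>
<PARAGRAPH><LINE>
<WORD coords="152,303,260,252,302">that</WORD>
<WORD coords="288,303,361,252,302">his</WORD>
</LINE></PARAGRAPH>
</REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def lines_from_djvu(xml_text):
    """Return one string per LINE element, words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("LINE"):
        words = [(w.text or "").strip() for w in line.iter("WORD")]
        lines.append(" ".join(w for w in words if w))
    return lines


def word_boxes(xml_text):
    """Return (word, [ints]) pairs; coords are parsed but not interpreted."""
    root = ET.fromstring(xml_text)
    return [
        ((w.text or "").strip(),
         [int(n) for n in w.get("coords", "").split(",") if n])
        for w in root.iter("WORD")
    ]
```

Calling lines_from_djvu(SAMPLE) gives one string per printed line, which is essentially the plain text we want to hand to the translation tool.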

Step 2. Prepare the text

The thing to notice about this file is that it's not in the form we need. This is inconvenient but not surprising. We often have to prepare a text for submission to a tool, which is another way of saying that we need to apply one or more tools to objects in pipeline fashion before we can achieve the results we want. (Some of the tools may be local, some remote.)

A simple XSLT transformation will produce a file in the form we need for the translation tool. Here's the code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs"
    xmlns:tns="http://org.projectbamboo/translationservice"
    xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" version="2.0">

    <xsl:output indent="yes" method="xml"/>

    <xsl:template match="DjVuXML/BODY/OBJECT/HIDDENTEXT">
        <tns:Translation xmlns:tns="http://org.projectbamboo/translationservice"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://org.projectbamboo/translationservice ../../main/resources/TranslationService.xsd ">
            <tns:description>A test</tns:description>
            <tns:sourceDocument>
                <tns:title>Test</tns:title>
                <tns:language>EN</tns:language>
                <xsl:apply-templates select="PAGECOLUMN/REGION/PARAGRAPH/LINE"/>
            </tns:sourceDocument>
        </tns:Translation>
    </xsl:template>

    <xsl:template match="LINE">
        <tns:line>
            <tns:lineNumber><xsl:value-of select="position()"/></tns:lineNumber>
            <tns:lineContent>
                <xsl:for-each select="WORD">
                    <xsl:value-of select="."/><xsl:if test="position() >= 1"><xsl:text> </xsl:text></xsl:if>
                </xsl:for-each>
            </tns:lineContent>
        </tns:line>
    </xsl:template>
</xsl:stylesheet>

This yields a file of the kind Steve described in his post. An excerpt:

<Translation xmlns="http://org.projectbamboo/translationservice">
    <status>Created</status>
    <self>http://esb-d3.calnet.berkeley.edu:8181/cxf/bsp/translationservice/translations/2011-06-02T10:35:01.061-07:00</self>
    <description>A test</description>
    <translator>Google Language API</translator>
    <translatedDate>2011-06-02T10:35:01.061-07:00</translatedDate>
    <sourceDocument>
        <title>Test</title>
        <language>en</language>
        <line>
            <lineNumber>1</lineNumber>
            <lineContent>4 THE YOUTH OF SHAKSPEARE. </lineContent>
        </line>
        <line>
            <lineNumber>2</lineNumber>
            <lineContent>that his clothes were saturated with the same moisture. </lineContent>
        </line>
        <line>
            <lineNumber>3</lineNumber>
            <lineContent>This made him make an immediate attempt to rise, </lineContent>
        </line>
    </sourceDocument>
</Translation>
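
For readers without an XSLT processor at hand, the same preparation step can be sketched in Python with the standard library. This is an illustration under assumptions, not the code used in the demo: build_request and PAGE are hypothetical names, the sketch omits the xsi:schemaLocation attribute, and it joins words without the trailing space that the stylesheet leaves (its position() >= 1 test is always true).

```python
import xml.etree.ElementTree as ET

# Namespace taken from the stylesheet above.
TNS = "http://org.projectbamboo/translationservice"

# Tiny stand-in for a DjVuXML page (hypothetical sample data).
PAGE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION>
<PARAGRAPH><LINE><WORD>4</WORD><WORD>THE</WORD><WORD>YOUTH</WORD></LINE></PARAGRAPH>
<PARAGRAPH><LINE><WORD>that</WORD><WORD>his</WORD></LINE></PARAGRAPH>
</REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def build_request(djvu_xml, title="Test", description="A test", language="EN"):
    """Build a Translation request document from a DjVuXML page."""
    ET.register_namespace("tns", TNS)
    page = ET.fromstring(djvu_xml)

    root = ET.Element(f"{{{TNS}}}Translation")
    ET.SubElement(root, f"{{{TNS}}}description").text = description
    source = ET.SubElement(root, f"{{{TNS}}}sourceDocument")
    ET.SubElement(source, f"{{{TNS}}}title").text = title
    ET.SubElement(source, f"{{{TNS}}}language").text = language

    # Number lines across the whole page, as position() does in the stylesheet.
    for number, line in enumerate(page.iter("LINE"), start=1):
        entry = ET.SubElement(source, f"{{{TNS}}}line")
        ET.SubElement(entry, f"{{{TNS}}}lineNumber").text = str(number)
        words = [(w.text or "").strip() for w in line.iter("WORD")]
        ET.SubElement(entry, f"{{{TNS}}}lineContent").text = " ".join(words)

    return ET.tostring(root, encoding="unicode")
```

build_request(PAGE) returns the request as a string, ready to be written to the client's source directory.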

Step 3. Call the service and retrieve the results

I wrote a simple RESTful client for the translation service in Java. It scans a source directory, submits each file it finds there to the service, and writes the results to a target directory under the same file name. The client is bare-bones: no error handling to speak of, no UI. Remember, this is a proof-of-concept demonstration, not something anyone would use.
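
The shape of that client is easy to sketch. The following Python is an illustration only (the actual client was written in Java and is not shown here); process_directory is a hypothetical name, and the service call is passed in as a function so the loop can be tried without a network. The usage at the bottom substitutes upper-casing for translation.

```python
import tempfile
from pathlib import Path


def process_directory(source_dir, target_dir, translate):
    """Send every file in source_dir through translate(bytes) -> bytes and
    write each result to target_dir under the same file name."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        out = target / path.name
        out.write_bytes(translate(path.read_bytes()))
        written.append(out)
    return written


# Usage with a stand-in for the service call (upper-casing, not translating).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "in")
    src.mkdir()
    (src / "page10.xml").write_bytes(b"<x/>")
    outputs = process_directory(src, Path(tmp, "out"), lambda data: data.upper())
    names = [p.name for p in outputs]
    contents = [p.read_bytes() for p in outputs]
```

Like the original, this sketch has no error handling: a failed call simply raises and stops the run.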

Still, for the demo to count, we need to run this in HUBzero, and for good measure in the same instance I used to demonstrate the repository browser: the test instance at Indiana. I used HUBzero's embedded VNC client to upload the Java application into my file space within HUBzero's maxwell service. This service gives any user access to a file-system-level sandbox on the HUBzero machine, from which the user can run code written in a number of different languages: C, Java, Perl, Python, Ruby, Fortran.

The source document, page10.xml, was placed in the directory I had configured as the source directory. It was the only file in the directory. Here is a screen shot of the app's output.


Notice the two-step process. First we call the translation service to submit the file. Then we retrieve the file, as though from a results cache.
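
A rough Python sketch of that two-step exchange follows. It rests on assumptions I have not verified against the service: that submission is an HTTP POST of the XML document, and that the result is fetched with a GET on the URL carried in the response's self element (the element itself does appear in the service's output). The function names and the injectable post/get parameters are hypothetical, added so the logic can be exercised without a live service.

```python
import urllib.request
import xml.etree.ElementTree as ET

TNS = "http://org.projectbamboo/translationservice"


def http_post(url, body):
    """POST an XML body and return the raw response bytes."""
    request = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def http_get(url):
    """GET a URL and return the raw response bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def submit_then_retrieve(service_url, request_xml, post=http_post, get=http_get):
    """Step 1: submit the document. Step 2: fetch the stored result.

    Assumes the submission response contains a <self> element holding
    the URL of the stored translation.
    """
    created = post(service_url, request_xml)
    self_url = ET.fromstring(created).findtext(f"{{{TNS}}}self")
    if not self_url:
        raise ValueError("submission response carried no <self> URL")
    return get(self_url.strip())
```

Passing stand-in post and get functions makes it possible to check the control flow offline before pointing the client at a real endpoint.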

Step 4. Inspect the results

Here's an excerpt from the results file:

<Translation xmlns="http://org.projectbamboo/translationservice">
    <status>Created</status>
    <self>http://esb-d3.calnet.berkeley.edu:8181/cxf/bsp/translationservice/translations/2011-06-02T10:35:01.061-07:00</self>
    <description>A test</description>
    <translator>Google Language API</translator>
    <translatedDate>2011-06-02T10:35:01.061-07:00</translatedDate>
    <sourceDocument>
        <title>Test</title>
        <language>en</language>
        <line>
            <lineNumber>1</lineNumber>
            <lineContent>4 THE YOUTH OF SHAKSPEARE. </lineContent>
        </line>
        <line>
            <lineNumber>2</lineNumber>
            <lineContent>that his clothes were saturated with the same moisture. </lineContent>
        </line>
        <line>
            <lineNumber>3</lineNumber>
            <lineContent>This made him make an immediate attempt to rise, </lineContent>
        </line>
    </sourceDocument>
    <targetDocument>
        <title>Test</title>
        <language>fr</language>
        <line>
            <lineNumber>1</lineNumber>
            <lineContent>4 LA JEUNESSE de Shakespeare.</lineContent>
        </line>
        <line>
            <lineNumber>2</lineNumber>
            <lineContent>que ses vêtements ont été saturées avec la même humidité.</lineContent>
        </line>
        <line>
            <lineNumber>3</lineNumber>
            <lineContent>Ceci fait de lui faire une tentative immédiate de prendre la
                parole,</lineContent>
        </line>
    </targetDocument>
</Translation>
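
For a quick side-by-side inspection, the source and target lines can be paired by line number. The Python below is a small sketch using only the standard library; paired_lines is a hypothetical helper, and RESULT is an abbreviated copy of the excerpt above. Whitespace inside lineContent is collapsed, since the pretty-printed file wraps long lines.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://org.projectbamboo/translationservice"}

# Abbreviated copy of the results excerpt above.
RESULT = """<Translation xmlns="http://org.projectbamboo/translationservice">
  <sourceDocument>
    <language>en</language>
    <line><lineNumber>1</lineNumber><lineContent>4 THE YOUTH OF SHAKSPEARE. </lineContent></line>
    <line><lineNumber>3</lineNumber><lineContent>This made him make an immediate attempt to rise, </lineContent></line>
  </sourceDocument>
  <targetDocument>
    <language>fr</language>
    <line><lineNumber>1</lineNumber><lineContent>4 LA JEUNESSE de Shakespeare.</lineContent></line>
    <line><lineNumber>3</lineNumber><lineContent>Ceci fait de lui faire une tentative immédiate de prendre la
        parole,</lineContent></line>
  </targetDocument>
</Translation>"""


def _lines(doc, name):
    """Map lineNumber -> whitespace-normalised lineContent for one document."""
    out = {}
    for line in doc.findall(f"t:{name}/t:line", NS):
        number = int(line.findtext("t:lineNumber", namespaces=NS))
        text = line.findtext("t:lineContent", default="", namespaces=NS)
        out[number] = " ".join(text.split())
    return out


def paired_lines(result_xml):
    """Return (number, source text, target text) tuples in line order."""
    doc = ET.fromstring(result_xml)
    source = _lines(doc, "sourceDocument")
    target = _lines(doc, "targetDocument")
    return [(n, source[n], target.get(n, "")) for n in sorted(source)]
```

Printing the tuples from paired_lines(RESULT) puts each English line next to its French counterpart, which makes it easier to spot where the machine translation goes astray.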

Comments

One of the moving parts is missing from this demonstration. We aren't writing the file directly to the HUBzero file system yet. We're working on that. We'll call it an implementation detail for now.

This demonstration, I think, proves the end-to-end concept we aim to implement more fully in BTP Phase One: retrieve content from a major repository into a Bamboo Work Space using the CI adapter infrastructure, apply tools to it (perhaps local tools, but certainly remote tools invoked through the BSP, and in this case both), and collect the results back into the Work Space for inspection. More to come.