Saturday, February 23, 2013

Reading utmp/wtmp and utmpx/wtmpx with Perl's Convert::Binary::C module

Last week I revisited a project which I had done back in 2007: analyzing wtmpx on Solaris 10. Since wtmpx is in binary format, its contents need to be extracted with some tool. wtmpx is a sequence of equally sized chunks, each chunk representing a 'struct' element which is defined in a corresponding C header file.
My old approach was to use the Solaris tool fwtmp and the flow was something like
cat /var/adm/wtmpx | /usr/lib/acct/fwtmp |  wtmpx.pl ....
fwtmp contains all the binary-to-ASCII transformation logic and outputs long lines of the form
andreash                         dt   console                               1506
   7 0000 0000 1311377458 0 0 3 :0 Sa Jul 23 01:30:58 2011
(this is one line).
These lines are fixed-field text and can easily be parsed by awk, Perl or whatever; my subsequent wtmpx.pl then analyzed the contents.
In the meantime I had learned about Perl's pack/unpack functions, and the following recipe can be found in various places on the Internet:
$wtmpx     = "/var/adm/wtmpx";
$typedef   = 'A32 A4 A32 i s s s b b i i i i5 s A257 b';
$sizeof    = length pack ($typedef, () );

open(WTMPX, $wtmpx) || die("Cannot open $wtmpx\n");

# read chunks of length 'sizeof'
while( read(WTMPX, $buffer, $sizeof) == $sizeof ) {
  ($ut_user, $ut_id, $ut_line, $ut_pid, $ut_type, $ut_e_termination, $ut_e_exit,
   $dummy, $dummy, $ut_tv_sec, $ut_tv_usec, $ut_session, $ut_pad[0], $ut_pad[1],
   $ut_pad[2], $ut_pad[3], $ut_pad[4], $ut_syslen, $ut_host, $dummy)
  = unpack($typedef, $buffer);

  # ... and now access the fields 
}
close(WTMPX);
So what fwtmp does needs to be done in Perl by defining a template which unpack can use to decipher the binary content. The template mimics the 'struct' elements (struct futmpx in /usr/include/utmpx.h). Each letter represents a variable type, e.g. A32 stands for a string of length 32, i for integer, s for short integer and so on (all defined in perlfunc). Interesting (and what makes this somewhat complex) is the introduction of dummy bytes here and there. This is done to comply with how variables are represented and aligned on the machine. Here we have an alignment of 4, so the sequence of 2-byte short integers 's s s' followed by a 4-byte integer needs to be filled with 2 extra bytes (the integer needs to start at a 4-byte boundary), which leads to 's s s b b i'. As you can see this definition is very platform and OS specific, and such a script is not easily portable.
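
To see the effect of such padding in isolation, here is a toy example (not from the original script; note that pack's canonical pad byte is 'x', which the recipe above emulates with 'b'):
# Toy illustration of alignment padding: three 2-byte shorts followed by
# a 4-byte int. Without padding pack yields 10 bytes; with two pad bytes
# ('x') the int starts at a 4-byte boundary, just like in the struct.
$naive  = length pack('s s s i',     0, 0, 0, 0);   # 10
$padded = length pack('s s s x x i', 0, 0, 0, 0);   # 12
print "$naive vs. $padded bytes\n";
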
When I wanted to port this to Linux (Ubuntu 11 in my case) there were a couple of things to note. On Solaris utmp/wtmp have been deprecated and utmpx/wtmpx are used instead, residing in /var/adm. On Linux it's the other way round: there are no utmpx/wtmpx files but only utmp in /var/run and wtmp in /var/log. There is no fwtmp equivalent on Linux, and of course the 'struct' definition differs from Solaris (luckily, though, the names of the struct members match to a great extent). So finding a portable solution seemed an impossible task from the start.

Then I found Convert::Binary::C, one of the most astonishing Perl modules I have ever used. It allows you to use C definitions in Perl, and it does so in a very nifty way which I'll show below.

First let's recap what I want to achieve:

  • 4. create a platform-independent Perl script which
  • 3. reads utmpx/wtmpx and utmp/wtmp files
  • 2. and thus needs to determine the size of the C utmpx struct in an easy way,
  • 1. i.e. by using the system's /usr/include/utmpx.h file

    Step 1: Read utmpx.h

    Below would have been the easiest code imaginable: tell Convert::Binary::C to look for utmpx.h in the /usr/include directory.
    use Convert::Binary::C;
    $c = Convert::Binary::C->new( Include => ['/usr/include'] );
    $c->parse_file('utmpx.h');
    
    This does not work though and thus I could not quite achieve goal 4.

    Solaris fails with:

    sys/isa_defs.h, line 503: #error "ISA not supported"
            included from /usr/include/sys/feature_tests.h:12
            included from /usr/include/utmpx.h:36 at - line 3.
    
    which needs to be rectified by defining one of the machine-type macros __i386, i386, __ppc, __sparc or sparc. This code works (note the definition of __sparc):
    use Convert::Binary::C;
    $c = Convert::Binary::C->new( 
               Include => ['/usr/include'], 
               Define => [qw(__sparc)] 
            );
    $c->parse_file('utmpx.h');
    

    Ubuntu fails with:

    features.h, line 323: file 'bits/predefs.h' not found
     included from /usr/include/utmp.h:22 at - line 3.
    
    which needs to be rectified by adding another include directory
    use Convert::Binary::C;
    $c = Convert::Binary::C->new( 
               Include => ['/usr/include', '/usr/include/i386-linux-gnu']);
    $c->parse_file('utmpx.h');
    

    Funnily enough, both pieces of code can be combined: setting __sparc on Ubuntu has no effect since it is not used anywhere in the include files, and adding a non-existent include directory on Solaris does not hurt either. So I have unified code and have achieved goal 1, though not as nicely as I had hoped.

    use Convert::Binary::C;
    $c = Convert::Binary::C->new( 
               Include => ['/usr/include', '/usr/include/i386-linux-gnu'], 
               Define => [qw(__sparc)] 
            );
    $c->parse_file('utmpx.h');
    
    But what Convert::Binary::C has achieved is nothing less than making all macros and definitions of the C include file available in Perl, not a small task.
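
    To get a first impression of what the parser picked up, the $c object can be queried directly; a small exploration sketch (output differs per platform):
    # List the types the parser collected from utmpx.h and its includes
    print "structs:  ", join(", ", $c->struct_names),  "\n";
    print "typedefs: ", join(", ", $c->typedef_names), "\n";
    print "enums:    ", join(", ", $c->enum_names),    "\n";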

    In the next sections I will show how to explore a 'struct' and its elements (or members as the C folks prefer).

    Step 2: Determine The Size Of The 'struct'

    Again there is a difference between Solaris and Ubuntu: the 'struct' is named futmpx on Solaris and utmpx on Ubuntu. So the code will differ here, but note that most of the member names are equal, a feature which will be used later on.

    futmpx on Solaris:

    struct futmpx {
            char    ut_user[32];            /* user login name */
            char    ut_id[4];               /* inittab id */
            char    ut_line[32];            /* device name (console, lnxx) */
            pid32_t ut_pid;                 /* process id */
            int16_t ut_type;                /* type of entry */
            struct {
                    int16_t e_termination;  /* process termination status */
                    int16_t e_exit;         /* process exit status */
            } ut_exit;                      /* exit status of a process */
            struct timeval32 ut_tv;         /* time entry was made */
            int32_t ut_session;             /* session ID, used for windowing */
            int32_t pad[5];                 /* reserved for future use */
            int16_t ut_syslen;              /* significant length of ut_host */
            char    ut_host[257];           /* remote host name */
    };
    

    utmpx on Ubuntu (in fact I shortened it a little for better readability):

    struct utmpx
    {
      short int ut_type;  /* Type of login.  */
      __pid_t ut_pid;  /* Process ID of login process.  */
      char ut_line[__UT_LINESIZE]; /* Devicename.  */
      char ut_id[4];  /* Inittab ID. */
      char ut_user[__UT_NAMESIZE]; /* Username.  */
      char ut_host[__UT_HOSTSIZE]; /* Hostname for remote login.  */
      struct __exit_status ut_exit; /* Exit status of a process marked
           as DEAD_PROCESS.  */
      long int ut_session;  /* Session ID, used for windowing.  */
      struct timeval ut_tv;  /* Time entry was made.  */
      __int32_t ut_addr_v6[4]; /* Internet address of remote host.  */
      char __unused[20];  /* Reserved for future use.  */
    };
    
    Another important consideration at this point relates to the Perl code above, where we defined the template for Perl's unpack function: some extra bytes were introduced into the template to get the correct alignment. The alignment has to be set in Convert::Binary::C as well, but in a much easier way. One could set the alignment to one, two, ... byte boundaries, but there is also the option to use the native alignment, and this is what I'm going to do. With the alignment set, there is a 'sizeof' function to determine the size of the 'struct'.
    # Choose native alignment
    $c->configure( Alignment => 0 );   # the same on both OSs
    
    $sizeof = $c->sizeof('futmpx');    # on Solaris (=372)
    
    $sizeof = $c->sizeof('utmpx');     # on Ubuntu  (=384)
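
    Convert::Binary::C can also report where each member ends up, which makes the padding visible, and it can reveal which struct name the header actually defines. A sketch (assuming the members listed below exist on both platforms, as the two struct listings above suggest):
    # Let the parser pick the struct name instead of hard-coding the OS
    %defined = map { $_ => 1 } $c->struct_names;
    $struct  = $defined{futmpx} ? 'futmpx' : 'utmpx';

    # Print byte offsets of some shared members; gaps between consecutive
    # offsets are the alignment padding inserted by the compiler
    printf "%-8s at offset %3d\n", $_, $c->offsetof($struct, $_)
        for qw(ut_user ut_id ut_line ut_pid ut_type ut_exit ut_tv);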
    

    Step 3: Read Log Entries wtmpx

    Since the filenames are different they need to be defined differently (or passed as an argument to the Perl script).
    $wtmpx = "/var/adm/wtmpx";   # on Solaris, or /var/adm/utmpx
    
    $wtmpx = "/var/log/wtmp";    # on Ubuntu,  or /var/run/utmp
    
    or of course saved copies of these files could be used too (many admins keep the last n copies and rotate filenames).
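
    If rotated copies should be analyzed as well, the reading loop shown below could simply run over a file list; a sketch (the wtmpx.0, wtmpx.1, ... naming scheme is an assumption):
    # Process the live file plus rotated copies (hypothetical names)
    for $file ( $wtmpx, glob("$wtmpx.[0-9]*") ) {
        # open/read/unpack each file as shown below ...
    }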

    Finally these files can be read and the elements of the structure can be accessed.

    open(WTMPX, $wtmpx) || die("Cannot open $wtmpx\n");
    while( read(WTMPX, $buffer, $sizeof) == $sizeof ) {
    
      $unpacked = $c->unpack('futmpx', $buffer);   # Solaris
    
      $unpacked = $c->unpack('utmpx', $buffer);    # Ubuntu
    
      # And now do something with the content
      # ...
    }
    close(WTMPX);
    
    Accessing the elements of the 'struct' is quite easy using the names of the members as defined in the include file, e.g. $unpacked->{ut_pid} to get the process id of the entry. For certain types of variables (e.g. strings) some special handling is needed though: Convert::Binary::C unpacks a C array of characters as a Perl array of characters. If one wants a nice string, some conversion needs to be done, and I am using the Perl pack recipe.
    @u = @{$unpacked->{ut_user}};      # Save the ut_user array of chars
    $ut_user = pack( "C*", @u);        # Convert it to a string and save 
                                       # it in a new variable. Of course 
                                       # these two steps could be combined into one
    
    Here is some code which retrieves the user, process id, terminal line and exit code. Contrary to the first Perl solution (with unpack) it does not matter whether ut_user is the first (Solaris) or fifth (Ubuntu) element of the 'struct': Convert::Binary::C has made this element accessible by name, and we don't need to care about its position, a welcome simplification.
    use Convert::Binary::C;
    
    $utmpxh = "utmpx.h";               # include file
    
    # two OS specific settings
    $struct = "futmpx";
    $wtmpx  = "/var/adm/wtmpx";        # on Solaris, or /var/adm/utmpx
    
    
    $c = Convert::Binary::C->new( 
               Include => ['/usr/include', '/usr/include/i386-linux-gnu'], 
               Define => [qw(__sparc)] 
            );
    $c->parse_file( $utmpxh );
    
    # Choose native alignment
    $c->configure( Alignment => 0 );    # the same on both OSs
    $sizeof = $c->sizeof( $struct );    # on Solaris (=372)
    
    open(WTMPX, $wtmpx) || die("Cannot open $wtmpx\n");
    while( read(WTMPX, $buffer, $sizeof) == $sizeof ) {
    
      $unpacked = $c->unpack( $struct, $buffer);   # Solaris
      $ut_user = pack( "C*", @{$unpacked->{ut_user}} );
      $ut_line = pack( "C*", @{$unpacked->{ut_line}} );
      
      print $ut_user, 
            " ", $unpacked->{ut_pid},
            " ", $ut_line,
            " ", $unpacked->{ut_exit}->{e_exit},
            "\n";
    }
    close(WTMPX);
    
    Note the exit code entry: ut_exit is a struct which is a member of the main struct futmpx. $unpacked->{ut_exit}->{e_exit} gets an element of ut_exit.

    Step 4: a platform independent script (?)

    Did I achieve a completely platform independent script? No.

    Even with that great module (which does a lot of things behind the scenes) I could not achieve complete independence. I could live with the different file names, but the different 'struct' names and, to a lesser extent, the compile settings (include paths, compiler macros like __sparc) require some platform-specific coding.

    What have I achieved (other than learning about a great module)? I could eliminate the unpack template, i.e. the cumbersome manual sequence of types, counts and alignments, and use the system's definitions instead, which is what I was looking for initially after all.
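
    The remaining platform-specific settings can at least be collected in one spot; a minimal sketch keyed on Perl's $^O, using the struct names and file locations discussed above:
    # Isolate the OS-specific settings in one dispatch table
    %os_config = (
        solaris => { struct => 'futmpx', wtmpx => '/var/adm/wtmpx' },
        linux   => { struct => 'utmpx',  wtmpx => '/var/log/wtmp'  },
    );
    $cfg = $os_config{$^O} or die "No settings for platform '$^O'\n";
    # then: $c->sizeof($cfg->{struct}), open(WTMPX, $cfg->{wtmpx}), ...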


    A little fine-tuning of the string conversion

    In the script above I used the following to convert an array of characters to a string.
    $ut_user = pack( "C*", @{$unpacked->{ut_user}} );
    
    In fact I was cheating a little. When you print $ut_user it looks ok in the terminal, but when you investigate the output you'll find that it is actually a 32-byte string: the real characters followed by NULL bytes padding it to the full length (I use spaces here to better illustrate the issue).
    a n d r e a s h \0 \0 \0 \0 ....
    
    Using this object in comparisons will fail:
    if( $ut_user eq "andreash" ) { }   # will not be reached
    

    There are two ways to resolve this.

  • Use Perl's unpack function with the 'Z*' template (available in newer Perl versions). It will remove the trailing NULL bytes.
    $ut_user = pack( "C*", @{$unpacked->{ut_user}} );   # a n d r e a s h \0 \0 \0 ...
    $ut_user = unpack( "Z*", $ut_user);                 # a n d r e a s h
    
  • Use the conversions offered by Convert::Binary::C
    # Tell the conversion object that its 'ut_user' element is a string
    $c->tag( $struct.'.ut_user', Format => "String" );
    
    # No further conversion is required
    # Use   $unpacked->{ut_user}   as is, it is equal to 'andreash', no more NULL bytes
    
    i.e. instead of applying a pair of pack/unpack operations to the actual data in your code, tell Convert::Binary::C to do this for you by configuring the conversion object accordingly; after reading, the data is already formatted and usable as is.

    These conversions work for both Solaris and Ubuntu in the same way and should be applied to the other strings 'ut_line', 'ut_host' etc. too.
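
    A sketch of how all of them could be tagged in one go (the member list is taken from the struct definitions above):
    # Tag every char-array member as a string, so each unpacked entry
    # comes back NUL-trimmed and ready for comparisons
    $c->tag( "$struct.$_", Format => 'String' )
        for qw(ut_user ut_id ut_line ut_host);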

    Wednesday, February 13, 2013

    How to XML-ify a tab separated text file with xsltproc (revisited)

    My last post described an XSLT 1.0 solution for transforming a plain tab separated text file into XML.

    Here I present a solution which uses some features available in extensions (they are language features in XSLT 2.0), namely tokenize and node-set. tokenize allows me to split a string into tokens at once rather than having to call substring-before and substring-after repeatedly. In a certain way it contrasts with the template thinking of XSLT, but it is of course useful. node-set is a mighty tool since it allows me to transform variables into node sets, and with that comes the ability to use proper XPath functions on the nodes.
    The xsltproc version on my Mac contains some EXSLT extensions (visible via xsltproc --dumpextensions), so here are the required namespaces which need to be declared at the beginning of the script:

    function     namespace
    tokenize     xmlns:strings="http://exslt.org/strings"
    node-set     xmlns:common="http://exslt.org/common"

    And here is how to use them:

    Usage

    tokenize: I use tokenize in a for-each loop to split $someText delimited by $newline
    <xsl:for-each select="strings:tokenize($someText,$newline)" >
    ...
    </xsl:for-each>
    
    node-set: transform the contents of a variable $lines into a node-set $lineNodes
    <xsl:variable name="lineNodes" select="common:node-set($lines)" />
    

    All the work is done in the parseDelimited template, and it follows pretty much old-style programming conventions. There is one loop which splits the complete input by newline. The first line is split by delimiter into the names of the headers. All other lines are then split by delimiter into their individual fields. Everything is wrapped into elements as follows and put into a variable. The pseudo-code is already close to its implementation.

    element "data"
      for each line tokenize the line by delimiter
        element "row"
          for each field n
            element "header n"
              content of field n
            end of element "header n"
          end for
        end of element "row"
      end for
    end of element "data"
    

    Here is the complete code.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:strings="http://exslt.org/strings"
            xmlns:common="http://exslt.org/common"
            >
    <!-- From strings we use:  tokenize
         From common we use:  node-set
    -->
    
    <!-- Define delimiter and newline -->
    <xsl:variable name="delim"   select="'&#x9;'" />
    <xsl:variable name="newline" select="'&#xA;'" />
    
    <!-- Define node1 and node2 for the output -->
    <xsl:variable name="node1"   select="'data'" />
    <xsl:variable name="node2"   select="'row'" />
    
    
    <xsl:template match="/">
      <!-- Take whatever input is coming, don't care about 'fakeroot' -->
      <xsl:call-template name="root"/>
    </xsl:template>
    
    <xsl:template name="root">
        <!-- Call the line parser with the whole content of the file -->
        <xsl:call-template name="parseDelimited">
          <xsl:with-param name="delimitedText" select="." />
        </xsl:call-template>
    </xsl:template>
    
    <xsl:template name="parseDelimited">
      <xsl:param name="delimitedText" />
    
      <!-- Split the file content by newline -->
      <xsl:variable name="lines">
        <xsl:for-each select="strings:tokenize($delimitedText,$newline)" >
          <line>
          <xsl:value-of select='.' />
          </line>
        </xsl:for-each>
      </xsl:variable>
      <!-- Create a node-set out of the previous 'lines'
           in order to be able to use them as an XPATH var -->
      <xsl:variable name="lineNodes" select="common:node-set($lines)" />
    
      <!-- The first line containing the header fields -->
      <xsl:variable name="first" select='$lineNodes/line[1]' />
      <xsl:variable name="headers" >
        <xsl:for-each select="strings:tokenize($first,$delim)" >
          <head>
          <xsl:value-of select='.' />
          </head>
        </xsl:for-each>
      </xsl:variable>
      <!-- Create a node-set out of the previous 'headers'
           in order to be able to use them as an XPATH var -->
      <xsl:variable name="headerNodes" select="common:node-set($headers)" />
    
      <!-- Loop through all lines, we can do this since it is a node set.
           This creates the actual XML content -->
      <xsl:variable name="output" >
        <!-- Start tag <data> -->
        <xsl:element name="{$node1}">
        <xsl:value-of select='$newline' />
    
        <xsl:for-each select="$lineNodes/line">
          <!-- Skip the first line of course -->
          <xsl:if test="position() > 1">
    
            <!-- Start tag <row> -->
            <xsl:element name="{$node2}">
    
            <!-- Split the line by 'delim'
                 and create an element for each entry.
                 The element name is coming from the header line -->
            <xsl:for-each select="strings:tokenize(.,$delim)" >
              <xsl:variable name="p" select="position()" />
              <xsl:element name="{$headerNodes/head[$p]}">
                <!-- Print the actual content , phew! -->
                <xsl:value-of select="." />
              </xsl:element>
            </xsl:for-each>
            <!-- End tag <row> -->
            </xsl:element>
            <xsl:value-of select='$newline' />
    
          </xsl:if>
        </xsl:for-each>
    
        <!-- End tag </data> -->
        </xsl:element>
        <xsl:value-of select='$newline' />
    
      </xsl:variable>
    
      <xsl:variable name="all" select="common:node-set($output)" />
      <!-- Output of nodified elements -->
      <xsl:copy-of select="($all)/*" />
    
      <!-- With a node-set one can now use its advantages
           e.g. sum up all Num values -->
      <xsl:value-of select='$newline' />
      <xsl:element name="Sum_Num">
      <xsl:value-of disable-output-escaping="yes"  select="sum(common:node-set($output)/data/row/Num)"/>
      </xsl:element>
    
    </xsl:template>
    
    </xsl:stylesheet>
    
    There are two interesting pieces here.
  • How to get the header names into the game? The inner for-each loop over strings:tokenize(.,$delim) tokenizes a line. Each field has an index which you can get via position() in XSLT. An element is created and it gets the name of the header field at this exact index: <xsl:element name="{$headerNodes/head[$p]}"> (this works since the header line has the same number of fields as every other line; the variable 'p' to store the position is actually superfluous, but it makes the code more readable).
  • At the end, the Sum_Num element shows how to use the XPath function sum to get the total of the Num fields.

    This script, call it data.xsl, needs to be fed the same wrapped input as before; here is the wrapper script which I omitted last time.

    #!/bin/sh
           
    # A shell wrapper for non-xml parsing with xslt
    
    FILE=data.txt
    FAKEROOT=fakeroot   # Important for XML completeness but will be skipped by XSLT
    
    (
    echo "<?xml version=\"1.0\"?>"
    printf "<$FAKEROOT>"
    cat $FILE
    echo "</$FAKEROOT>"
    )  |
    xsltproc data.xsl -
    

    The result is as follows. Note the 71 in the last line, which is the sum of Num (this makes the output non-XML; it's just there to show the possibilities).

    <?xml version="1.0"?>
    <data>
    <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row>
    <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row>
    <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row>
    <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row>
    <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row>
    </data>
    <Sum_Num>71</Sum_Num>
    
    Tuesday, February 5, 2013

    How to XML-ify a tab separated text file with xsltproc

    In the past I have done a lot with plain, somehow delimited, text files: CSV, where the comma can stand for any character playing the delimiter.

    I used the usual UNIX tools or Perl or whatever, and I always wondered whether the file manipulation couldn't be done with XSLT as well. I never took the time to really look into it since my normal toolset worked so well and I didn't see the need, though the curiosity persisted.

    So assume you have this file which consists of a header line and tab-separated data (it is not really important what the entries mean, and you can see one of the well-known issues of human readability: the column headers do not sit exactly on top of their columns, in contrast to fixed-field files):

    Date    Num     Duration
    20120415        13      2310
    20120510        9       1470
    20120526        16      3817
    20120701        5       2269
    20120831        28      4505
    
    and you want to transform it into an XML file (something I would describe as XML-ifying, though I'm not sure this term exists elsewhere) like this:
    <?xml version="1.0"?>
    <data>
    <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row>
    <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row>
    <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row>
    <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row>
    <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row>
    </data>
    
    You can see that I want the header fields in the first line to become the enclosing tags for the data. For lack of something definite the root is called <data> and the various entries are <row>.

    The first important consideration when working with XSLT is the question of platform and tool, which determine the XSLT version and the available extensions.
    In my case that was MacOS 10.5.8, whose standard xsltproc is based on libxslt and implements XSLT 1.0.

    The second important issue is of course that XSLT requires XML input which my delimited text isn't at all.

    So can I be successful and what do I need?
    In XSLT terms all I have is a big string. XSLT will happily accept it if it is wrapped a little to make it look like XML. So putting the following around the file (one can use whatever tool fits; I used simple shell echo/printf) will get XSLT started. The node name 'fakeroot' is not important; the script will not check for it.

    <?xml version="1.0"?>
    <fakeroot>
    ...
    </fakeroot>
    

    Now XSLT needs to parse the string properly. XSLT 2.0, or extensions of 1.0 (e.g. EXSLT 'common'), would have provided nice string functions like tokenize or node-set which would easily dissect the string and allow its chunks to be used in various ways. I wanted a pure XSLT 1.0 solution though, and this meant handcrafting the tokenization and also somehow managing the header line whose fields should become the node names for each line.

    The solution takes into account that the big input string can be split by newlines (they are part of the content) into lines and each line can then further be split by the delimiting character. Aside from the first 'match="/"' template I need three templates.

  • the first template 'match="fakeroot"' takes the whole input, chops off the first line and passes the remaining lines to the next template
  • the second template 'parseDelimited' does the split by newline: it chops off one line, feeds it to the next template and then calls itself recursively with one line less
  • the third template 'parseLine' does the split by delimiter: it chops off the content before the first delimiter (and likewise the header field before its delimiter), puts out the <tag>...</tag> entries and calls itself recursively with the remainder of the line after the delimiter

    The functions substring-before() and substring-after() and the recursive call of templates are the main ingredients in this whole process.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
    <!-- Define delimiter, newline and root -->  
    <xsl:variable name="delim"   select="'&#x9;'" />
    <xsl:variable name="newline" select="'&#xA;'" />
    
    <!-- Define node1 and node2 for the output -->  
    <xsl:variable name="node1"   select="'data'" />
    <xsl:variable name="node2"   select="'row'" />
    
          
    <xsl:template match="/">
      <!-- Take whatever input is coming, don't care about 'fakeroot' -->
      <xsl:call-template name="root"/>
    </xsl:template>
    
    <xsl:template name="root">
      <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$node1,'&gt;')" />
      <xsl:value-of select="$newline" />
      <xsl:call-template name="parseDelimited">
        <!-- Chop off the first line with the header fields
             It will be passed as a parameter to all other templates -->
        <xsl:with-param name="headerLine" select="substring-before(.,$newline)" />
        <xsl:with-param name="delimitedText" select="substring-after(.,$newline)" />
      </xsl:call-template>
      <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/',$node1,'&gt;')" />
    </xsl:template>
      
        
    <xsl:template name="parseDelimited">   
      <xsl:param name="headerLine" />
      <xsl:param name="delimitedText" />
    
      <xsl:variable name="line" select="substring-before($delimitedText,$newline)" />
      <xsl:variable name="remaining" select="substring-after($delimitedText,$newline)" />
         
      <!-- Handle one line which has been chopped off -->
      <xsl:if test="string-length($line) > 0">
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$node2,'&gt;')" />
    
        <xsl:call-template name="parseLine">
          <xsl:with-param name="headerLine" select="concat($headerLine,$delim)" />
          <xsl:with-param name="line" select="concat($line,$delim)" />
        </xsl:call-template>
    
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/',$node2,'&gt;')" />
        <xsl:value-of select="$newline" />
      </xsl:if>
    
      <!-- Call the template recursively with the remaining lines -->
      <xsl:if test="string-length($remaining) > 0">
        <xsl:call-template name="parseDelimited">
          <xsl:with-param name="headerLine" select="$headerLine" />
          <xsl:with-param name="delimitedText" select="$remaining" />    
        </xsl:call-template>
      </xsl:if>
    
    </xsl:template>
    
    
    <xsl:template name="parseLine">
      <xsl:param name="headerLine" />
      <xsl:param name="line" />
    
      <!-- Retrieve the fields before the delimiter -->
      <xsl:variable name="fieldName" select="substring-before($headerLine,$delim)" />
      <xsl:variable name="field" select="substring-before($line,$delim)" />
    
      <xsl:if test="string-length($fieldName) > 0">
        <!-- This is the actual output -->
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$fieldName,'&gt;',$field,'&lt;/',$fieldName,'&gt;')" />
    
        <!-- Call the template recursively with the remaining fields -->
        <xsl:call-template name="parseLine">
          <xsl:with-param name="headerLine" select="substring-after($headerLine,$delim)" />
          <xsl:with-param name="line" select="substring-after($line,$delim)" />
        </xsl:call-template>
      </xsl:if>
    
    </xsl:template>
    
    </xsl:stylesheet>
    

    There are a few tricks being used here.

  • the header line is always passed as a parameter to the templates because it cannot be stored in a global variable or an array (that is XSLT after all, not an imperative programming language)
  • in order to catch the last field I simply append a delimiter at the end of each line. This ensures that substring-before will always catch something. Otherwise I would have had to use some if-else-logic to handle the case of the last field

    Of course one can add newlines, indents etc. to change the look, but that is not important for this exercise. The same goes for parameterizing the script: the delimiter could be passed as --stringparam to xsltproc to make it more flexible, and the output node names 'data' and 'row' could likewise come from command line parameters in a true production script.


    Now looking at this, I could of course have achieved the XML-ifying much more simply with this little awk script, which took me less than 10 minutes to write. The XSLT exercise was nice, but since this was eventually all about string handling one is probably still better off with the traditional tools.
    BEGIN { FS="\t"; }       # Tab is the field separator
    NR==1 { split($0,header,FS); printf "<data>\n"}
    NR>1  {
            printf "<row>";
            for(i=1;i<=NF;i++)  printf "<%s>%s</%s>", header[i],$i,header[i];
            printf "</row>\n";
    }
    END   { printf "</data>\n" }
    
    which likewise results in
    <data>
    <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row>
    <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row>
    <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row>
    <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row>
    <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row>
    </data>