Why Txt2XML Parser?

Txt2XML Parser is a library that parses text strings or files and generates XML documents or XML files. Its parsing rules can be stored in XML "formatter" files, which the Txt2XML Parser library can then apply in batch mode.

A Txt2XML example

Txt2XML can parse any text, from simple CSV to complex structured file formats.
The following image shows Txt2XML being used to parse SIP messages.

[Image: The Txt2XML-View with an input-editor]

The philosophy behind Txt2XML

Txt2XML parses text using regular expressions. The philosophy behind the parser is quite simple:

  1. select text within the document to parse using a regular expression;
  2. recursively apply step 1 to the selected text.

Because of this simple idea, Txt2XML Parser makes it easy to parse a complicated text file: divide it into parts, then process each part separately. For a CSV file, for example, one can first separate the rows and then process each row according to the delimiter used.
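
The two steps above can be sketched in a few lines of Python. This only illustrates the select-then-recurse idea with Python's re module; it is not Txt2XML's implementation, and the select helper and the sample text are hypothetical.

```python
import re

def select(pattern, text):
    """Step 1: return the text captured by group 1 for every match of `pattern`."""
    return [m.group(1) for m in re.finditer(pattern, text)]

sample = "a,b,c\r\nd,e,f"

# Step 1 on the whole document: select each row.
rows = select(r"([^\r\n]+)", sample)

# Step 2: apply step 1 again to every selected row to get its columns.
table = [select(r"([^,]+)", row) for row in rows]

print(table)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```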

Regular expressions are the key

Working with Txt2XML is easy if you know regular expressions. To make writing the formatter easier, you can also use Txt2XML-Developer, an Eclipse plugin that greatly simplifies editing the formatter file.


Txt2XML by example

Before describing how to integrate Txt2XML Parser library in your code, let's see what Txt2XML can do for you. Let's start with a very simple example.

Say you need to parse this simple text (a CSV file):

Head1,Head2,Head3
Greg,Rous,23
Nick,Franx,32
Jack,Troth,58

where the lines are separated by a carriage return and line feed ("\r\n"). The last line has no trailing "\r\n".

Step 1

Say you want to put all the text within a tag called text. This can be done using the following formatter:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="text"/>

The output will be:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<text>Head1,Head2,Head3&#13;
Greg,Rous,23&#13;
Nick,Franx,32&#13;
Jack,Troth,58</text>

Let's look at the formatter tags.

The first tag is txtParser which, in this case, has only one attribute: tagName.

The value of tagName will be the name of the tag in the output XML (in the example, text is the root tag of the output XML file because it is the tagName of the outermost txtParser element of the formatter). Note that the "\r" of each "\r\n" is now written as &#13;, while the "\n" remains a literal line break.
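
Why does the carriage return show up as &#13;? An XML parser normalizes a literal "\r\n" to a single "\n" when reading a document, so a writer that wants to preserve the "\r" has to emit it as a character reference. The sketch below shows this with Python's standard library only; it does not use Txt2XML, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "Head1,Head2,Head3\r\nGreg,Rous,23"

# Write "\r" as a character reference, as in the output above.
with_reference = "<text>" + escape(text, {"\r": "&#13;"}) + "</text>"
# Write "\r" literally.
literal = "<text>" + escape(text) + "</text>"

# The character reference survives parsing; the literal "\r\n" is normalized to "\n".
print(repr(ET.fromstring(with_reference).text))
print(repr(ET.fromstring(literal).text))
```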

Step 2

Let's complicate the example a little. Say you want to put each row in a separate tag called row. We can try the following formatter:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)\r\n" iterate="true">
    <txtParser parentGroup="1" tagName="row"/>
</txtParser>

The result is the following:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
    <row>Head1,Head2,Head3</row>
    <row>Greg,Rous,23</row>
    <row>Nick,Franx,32</row>
</CSV>

Let's look at the formatter tags.

The first tag is txtParser, which now has the following attributes: tagName, regEx and iterate. The parser's child is again a txtParser element.

The value of tagName will be the name of the tag in the output XML (in the example, CSV and row; CSV is the root tag of the output XML file because it is the tagName of the outermost txtParser element of the formatter).

The value of the regEx attribute is the regular expression used to parse the text. In the example it has been chosen to extract a single row (all the characters up to the \r\n). The group is used to select the text to extract (in the example, all the characters of the row are selected except the \r\n).

The iterate attribute applies the pattern repeatedly. The default value is "false". In the example, iteration parses only the first three rows, because the last one doesn't end with "\r\n".

In the child txtParser element, a parentGroup attribute is used to select the needed regEx group. The default parentGroup is 0 (all the text that matches the parent regEx is selected). In the example group 1 is selected. There are no child tags, so no further processing is done on the selected group; its value is simply put under a tag called row.
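
The behaviour of iterate and parentGroup described above can be imitated with Python's re module. This is a minimal sketch, not Txt2XML code: the iterate helper is hypothetical, and it assumes that iteration re-applies the pattern to whatever text remains after the previous match.

```python
import re

CSV_TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"

def iterate(pattern, text, group=0):
    """Match `pattern` at the start of `text`, keep `group`, repeat on the rest."""
    selected = []
    while text:
        match = re.match(pattern, text)
        if match is None or match.end() == 0:
            break  # no match, or an empty match that would loop forever
        selected.append(match.group(group))
        text = text[match.end():]  # continue after the matched text
    return selected

rows = iterate(r"^(.*?)\r\n", CSV_TEXT, group=1)  # like parentGroup="1"
whole = iterate(r"^(.*?)\r\n", CSV_TEXT)          # like the default parentGroup, 0

print(rows)  # ['Head1,Head2,Head3', 'Greg,Rous,23', 'Nick,Franx,32']
print(whole)
```

With group 0 every selected piece still carries its trailing "\r\n", which is why the formatter selects group 1. In both cases the row "Jack,Troth,58" is missing, because nothing after it matches "\r\n".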

To parse the last row as well, we have to change the formatter a bit.

Step 3

Let's try the following formatter:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
    <txtParser parentGroup="1" tagName="row"/>
</txtParser>

The corresponding output will be:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
    <row>Head1,Head2,Head3</row>
    <row>Greg,Rous,23</row>
    <row>Nick,Franx,32</row>
    <row>Jack,Troth,58</row>
</CSV>

The difference from the formatter in Step 2 is in the regEx: a second group, (\r\n|$), is added so that a row can also end at the end of the text.

Note that no txtParser selects group 2, so it is ignored.
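
To see what the extra group changes, the two patterns can be tried on their own in Python. This is a sketch using Python's re module rather than Txt2XML itself; the variable names are hypothetical.

```python
import re

last_row = "Jack,Troth,58"                  # the final row has no trailing "\r\n"
inner_row = "Nick,Franx,32\r\nJack,Troth,58"

step2 = re.match(r"^(.*?)\r\n", last_row)
step3 = re.match(r"^(.*?)(\r\n|$)", last_row)

print(step2)           # None: the Step 2 pattern requires a "\r\n"
print(step3.groups())  # ('Jack,Troth,58', '')

# For any other row, group 2 captures the line break and group 1 is unchanged.
print(re.match(r"^(.*?)(\r\n|$)", inner_row).groups())  # ('Nick,Franx,32', '\r\n')
```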

Step 4

Finally, say we want to separate each row into columns using the "," character as separator. The following formatter can be used:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
    <txtParser parentGroup="1" tagName="row" regEx="^(.*?)(,|$)" iterate="true">
        <txtParser parentGroup="1" tagName="column"/>
    </txtParser>
</txtParser>

The output will be:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
    <row>
        <column>Head1</column>
        <column>Head2</column>
        <column>Head3</column>
    </row>
    <row>
        <column>Greg</column>
        <column>Rous</column>
        <column>23</column>
    </row>
    <row>
        <column>Nick</column>
        <column>Franx</column>
        <column>32</column>
    </row>
    <row>
        <column>Jack</column>
        <column>Troth</column>
        <column>58</column>
    </row>
</CSV>

In this example the row group selected by the first regEx is further processed by a second regEx and the selected group is then saved into the column tag.
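
The nesting can be mimicked in Python by running one iteration inside another and building the tree with xml.etree.ElementTree. As before, this is a sketch under assumptions: the iterate helper is hypothetical and restarts the pattern on the remaining text after each match, which may differ from how Txt2XML iterates internally.

```python
import re
import xml.etree.ElementTree as ET

CSV_TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"

def iterate(pattern, text):
    """Yield group 1 of `pattern`, matched repeatedly at the start of the remaining text."""
    while text:
        match = re.match(pattern, text)
        if match is None or match.end() == 0:
            break
        yield match.group(1)
        text = text[match.end():]

csv_element = ET.Element("CSV")
for row_text in iterate(r"^(.*?)(\r\n|$)", CSV_TEXT):      # outer txtParser
    row = ET.SubElement(csv_element, "row")
    for column_text in iterate(r"^(.*?)(,|$)", row_text):  # nested txtParser
        ET.SubElement(row, "column").text = column_text

print(ET.tostring(csv_element, encoding="unicode"))
```

The printed document has four row elements with three column children each, like the output above but without indentation.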

If you want to use different names for the column tags and the number of columns is constant, the following formatter can be used:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
    <txtParser parentGroup="1" tagName="row" regEx="^(.*?),(.*?),(.*?)$">
        <txtParser parentGroup="1" tagName="column1"/>
        <txtParser parentGroup="2" tagName="column2"/>
        <txtParser parentGroup="3" tagName="column3"/>
    </txtParser>
</txtParser>

The output will be:
 
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
    <row>
        <column1>Head1</column1>
        <column2>Head2</column2>
        <column3>Head3</column3>
    </row>
    <row>
        <column1>Greg</column1>
        <column2>Rous</column2>
        <column3>23</column3>
    </row>
    <row>
        <column1>Nick</column1>
        <column2>Franx</column2>
        <column3>32</column3>
    </row>
    <row>
        <column1>Jack</column1>
        <column2>Troth</column2>
        <column3>58</column3>
    </row>
</CSV>
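
Here the row pattern is applied once (no iterate) and each numbered group goes to its own tag. A minimal Python sketch of that mapping, using the re module and xml.etree.ElementTree instead of Txt2XML, with hypothetical variable names:

```python
import re
import xml.etree.ElementTree as ET

ROW_PATTERN = r"^(.*?),(.*?),(.*?)$"

row_text = "Greg,Rous,23"
match = re.match(ROW_PATTERN, row_text)

row = ET.Element("row")
# parentGroup 1, 2 and 3 map to column1, column2 and column3.
for group_number in (1, 2, 3):
    ET.SubElement(row, f"column{group_number}").text = match.group(group_number)

print(ET.tostring(row, encoding="unicode"))
# <row><column1>Greg</column1><column2>Rous</column2><column3>23</column3></row>

# A row with fewer than three columns does not match the pattern at all.
print(re.match(ROW_PATTERN, "Greg,Rous"))  # None
```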

Step 5


Say you now want to use the first row as the header that supplies the column names. This can be done using the following formatter:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="(.*?)\r\n(.*?)$">
    <txtParser parentGroup="1" tagName="headerRow" regEx="(.*?)(,|$)" iterate="true">
        <txtParser parentGroup="1" tagName="header"/>
    </txtParser>
    <txtParser parentGroup="2" tagName="rows" regEx="(.*?)(\r\n|$)" iterate="true">
        <txtParser parentGroup="1" tagName="row" regEx="^(.*?),(.*?),(.*?)$">
            <txtParser parentGroup="1" tagName="/CSV/headerRow/header[1]"/>
            <txtParser parentGroup="2" tagName="/CSV/headerRow/header[2]"/>
            <txtParser parentGroup="3" tagName="/CSV/headerRow/header[3]"/>
        </txtParser>
    </txtParser>
</txtParser>

The output will be:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
    <headerRow>
        <header>Head1</header>
        <header>Head2</header>
        <header>Head3</header>
    </headerRow>
    <rows>
        <row>
            <Head1>Greg</Head1>
            <Head2>Rous</Head2>
            <Head3>23</Head3>
        </row>
        <row>
            <Head1>Nick</Head1>
            <Head2>Franx</Head2>
            <Head3>32</Head3>
        </row>
        <row>
            <Head1>Jack</Head1>
            <Head2>Troth</Head2>
            <Head3>58</Head3>
        </row>
    </rows>
</CSV>

In this example the first row is selected first and written to the XML as header elements. The values of those elements are then selected with XPath expressions and used as the tag names for the columns of the remaining rows.
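
The lookup of tag names in the already generated XML can be sketched with xml.etree.ElementTree, whose find methods accept a limited XPath subset. The sketch is not Txt2XML code and makes several assumptions: it passes re.DOTALL so that "." can cross line breaks, it uses str.split instead of nested patterns for brevity, and it uses the relative path headerRow/header[n] because ElementTree does not accept absolute paths on an element.

```python
import re
import xml.etree.ElementTree as ET

CSV_TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"

# Group 1 is the header row, group 2 is everything after the first "\r\n".
header_text, rows_text = re.match(r"(.*?)\r\n(.*?)$", CSV_TEXT, re.DOTALL).groups()

csv_element = ET.Element("CSV")
header_row = ET.SubElement(csv_element, "headerRow")
for name in header_text.split(","):
    ET.SubElement(header_row, "header").text = name

rows = ET.SubElement(csv_element, "rows")
for row_text in rows_text.split("\r\n"):
    row = ET.SubElement(rows, "row")
    for position, value in enumerate(row_text.split(","), start=1):
        # Relative form of the formatter's /CSV/headerRow/header[n] path.
        tag_name = csv_element.findtext(f"headerRow/header[{position}]")
        ET.SubElement(row, tag_name).text = value

print(ET.tostring(rows[0], encoding="unicode"))
# <row><Head1>Greg</Head1><Head2>Rous</Head2><Head3>23</Head3></row>
```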

Step 6


Here is a more complicated example. The text to be parsed is the following:

alias "jdoe" jdoe@xmli.com
note "jdoe" <country:US><zip:45202><state:OH><city:Cincinnati><address:34 Fountain Square Plaza><name:John Doe>
alias "jsmith" jsmith@worth-it.com
note "jsmith" <first:Jack><last:Smith><name:Jack Smith>
alias "pdupont" pdupont@pineapples.net
note "pdupont" <name:Pierre Dupont>

The formatter is the following:

<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="Txt2XML" regEx="^.*$">
    <txtParser tagName="persons" regEx="((.*?)(\r\n|$)){2}" tagValueGroup="1">
        <txtParser tagName="person" regEx="alias &quot;(.*?)&quot; (.*?)\r\nnote &quot;\1&quot; (.*)$">
            <txtParser parentGroup="1" tagName="alias"/>
            <txtParser parentGroup="2" tagName="email"/>
            <txtParser parentGroup="3" tagName="note" regEx="&lt;(.*?)&gt;" iterate="true">
                <txtParser parentGroup="1" tagName="parameter" regEx="(.*?):(.*?)$">
                    <txtParser parentGroup="1" tagName="name"/>
                    <txtParser parentGroup="2" tagName="value"/>
                </txtParser>
            </txtParser>
        </txtParser>
    </txtParser>
</txtParser>

The corresponding output is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<Txt2XML>
    <persons>
        <person>
            <alias>jdoe</alias>
            <email>jdoe@xmli.com</email>
            <note>
                <parameter>
                    <name>country</name>
                    <value>US</value>
                </parameter>
                <parameter>
                    <name>zip</name>
                    <value>45202</value>
                </parameter>
                <parameter>
                    <name>state</name>
                    <value>OH</value>
                </parameter>
                <parameter>
                    <name>city</name>
                    <value>Cincinnati</value>
                </parameter>
                <parameter>
                    <name>address</name>
                    <value>34 Fountain Square Plaza</value>
                </parameter>
                <parameter>
                    <name>name</name>
                    <value>John Doe</value>
                </parameter>
            </note>
        </person>
    </persons>
</Txt2XML>
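
Two details of this formatter are easy to miss: the patterns are XML-escaped inside the regEx attributes (&quot;, &lt;, &gt;), and the \1 backreference ties each note line to the alias on the line before it. The Python sketch below shows both, using only the standard library and a cut-down copy of the formatter; it is not Txt2XML code, and it assumes that Python's re module treats these particular patterns the same way.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down copy of the person and note parsers from the formatter above.
formatter = ET.fromstring(
    r'<txtParser tagName="person" '
    r'regEx="alias &quot;(.*?)&quot; (.*?)\r\nnote &quot;\1&quot; (.*)$">'
    r'<txtParser parentGroup="3" tagName="note" regEx="&lt;(.*?)&gt;"/>'
    r'</txtParser>'
)

# Reading the attributes undoes the XML escaping and yields the raw patterns.
person_regex = formatter.get("regEx")
note_regex = formatter[0].get("regEx")

text = (
    'alias "jdoe" jdoe@xmli.com\r\n'
    'note "jdoe" <country:US><zip:45202><state:OH><city:Cincinnati>'
    '<address:34 Fountain Square Plaza><name:John Doe>'
)

alias, email, note = re.match(person_regex, text).groups()

# Each <name:value> item is split by the innermost pattern of the formatter.
parameters = [
    re.match(r"(.*?):(.*?)$", item).groups()
    for item in re.findall(note_regex, note)
]

print(alias, email)
print(parameters)

# The backreference rejects a note that belongs to a different alias.
mismatch = re.match(person_regex, 'alias "jdoe" a@b.c\r\nnote "other" <name:X>')
print(mismatch)  # None
```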