Txt2XMLParser is a powerful
library able to parse text strings or files and generate XML
documents or XML files.
Txt2XML
parser rules can be stored in XML "Formatter" files that can then be
used by the Txt2XML Parser library in batch mode.
Txt2XML can be used to parse any text from simple CSV to
complex structured file formats.
In the following image there is an example of how to use Txt2XML to
parse SIP messages.
Txt2XML parses text using regular expressions. The philosophy behind the parser is quite simple: divide the text into pieces, then process each piece separately.
This simple idea makes it easy to parse even a complicated text file. For example, with a CSV file one can first separate the rows and then process each row according to the delimiter used.
Working with Txt2XML is very easy if you know regular expressions. To make writing the formatter easier, you can also use Txt2XML-Developer, an Eclipse plugin that greatly simplifies editing the formatter file.
Before describing how to integrate Txt2XML Parser library in your code, let's see what Txt2XML can do for you. Let's start with a very simple example.
Say you need to parse this simple text (a CSV file):
Head1,Head2,Head3
Greg,Rous,23
Nick,Franx,32
Jack,Troth,58
where the lines are separated by a carriage return and line feed ("\r\n"). The last line has no trailing "\r\n".
Step 1
Say you want to put all the text within a tag called
text. This can be done using the following formatter:
<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="text"/>
The output will be:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<text>Head1,Head2,Head3
Greg,Rous,23
Nick,Franx,32
Jack,Troth,58</text>
Let's look at the formatter tags. The first tag is txtParser which, in this case, has only one attribute: tagName. The
value of tagName is the name of the tag in the output XML (in
the example, text is the root tag of
the output XML file because it is the tagName of the top-level txtParser
element of the formatter). Note that the \r\n is now .
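As a rough illustration of Step 1, the sketch below uses Python's standard xml.etree.ElementTree module to wrap the whole input in a single element. It is not the Txt2XML API: wrap_in_tag is a hypothetical helper, and the only behaviour it models is "no regEx, so all the text goes under tagName".

```python
import xml.etree.ElementTree as ET

# The sample CSV text from the example: "\r\n" between lines, none after the last one.
TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"


def wrap_in_tag(tag_name, text):
    """Hypothetical helper: return an element named tag_name holding the unmodified text."""
    element = ET.Element(tag_name)
    element.text = text
    return element


root = wrap_in_tag("text", TEXT)
print(root.tag)           # name taken from the tagName attribute
print(root.text == TEXT)  # the content is the input text, untouched
```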
Step 2
Let's complicate the example a little bit. Say you want to put each row in its own tag called row. We can try the following formatter:
<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)\r\n" iterate="true">
<txtParser parentGroup="1" tagName="row"/>
</txtParser>
The result is the following:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
<row>Head1,Head2,Head3</row>
<row>Greg,Rous,23</row>
<row>Nick,Franx,32</row>
</CSV>
Let's look at the formatter tags.
The first tag is txtParser, which now has the following attributes: tagName, regEx and iterate. The parser's child is again a txtParser element.
The value of tagName is the name of the tag in the output XML (in the example, CSV and row; CSV is the root tag of the output XML file because it is the tagName of the top-level txtParser element of the formatter).
The value of the regEx attribute is the regular expression used to parse the text. In the example it has been chosen to extract a single row (all the characters up to the \r\n). The group selects the text to extract (in the example, all the characters of the row except the \r\n).
The iterate attribute applies the pattern iteratively. The default value is "false". In the example, iteration parses only the lines that end with "\r\n" (the header line and the first two data rows); the last line is skipped because it doesn't end with "\r\n".
In the child txtParser element, the parentGroup attribute selects the needed regEx group. The default parentGroup is 0 (all the text that matches the parent regEx is selected). In the example, group 1 is selected. The child has no sub-tags, so no further processing is done on the selected group; its value is simply put under a tag called row.
To also parse the last line (the third data row), we have to change the formatter a bit.
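To see why the last line is lost in Step 2, here is a minimal Python sketch of the iteration. It assumes that iterate="true" means "re-apply the regEx to the text left over after the previous match"; that reading, the helper name iterate_matches, and the use of Python's re module in place of the library's own regex engine are all assumptions made for illustration.

```python
import re

# The sample CSV text from the example: "\r\n" between lines, none after the last one.
TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"


def iterate_matches(pattern, text):
    """Hypothetical helper: apply an anchored pattern repeatedly to the unconsumed text."""
    regex = re.compile(pattern)
    matches = []
    rest = text
    while rest:
        m = regex.match(rest)
        if m is None or m.end() == 0:
            break  # no match, or an empty match that would never advance
        matches.append(m)
        rest = rest[m.end():]  # drop the matched part and try again
    return matches


# Group 1 is what parentGroup="1" would select for each <row>.
rows = [m.group(1) for m in iterate_matches(r"^(.*?)\r\n", TEXT)]
print(rows)
```

Under these assumptions the loop stops when only "Jack,Troth,58" is left, because that remainder contains no "\r\n" for the pattern to match.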
Step 3
Let's try the following formatter:
<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
<txtParser parentGroup="1" tagName="row"/>
</txtParser>
The corresponding output will be:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
<row>Head1,Head2,Head3</row>
<row>Greg,Rous,23</row>
<row>Nick,Franx,32</row>
<row>Jack,Troth,58</row>
</CSV>
The difference from the formatter in Step 2 is in the regEx: a second group, (\r\n|$), matches either the line terminator or the end of the text, so the last line is matched too.
Note that no txtParser selects group 2, so it is ignored.
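The effect of the second group can be checked with a small Python sketch. As before, this only imitates the formatter: it assumes iteration re-applies the regEx to the unconsumed text and stops when the text is used up, and split_rows is a hypothetical helper, not part of Txt2XML.

```python
import re

# The sample CSV text from the example: "\r\n" between lines, none after the last one.
TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"


def split_rows(pattern, text):
    """Hypothetical helper: collect (group 1, group 2) for each successive match."""
    regex = re.compile(pattern)
    pairs = []
    rest = text
    while rest:  # assumption: iteration ends once all the text is consumed
        m = regex.match(rest)
        if m is None or m.end() == 0:
            break  # guard against a match that consumes nothing
        pairs.append((m.group(1), m.group(2)))
        rest = rest[m.end():]
    return pairs


pairs = split_rows(r"^(.*?)(\r\n|$)", TEXT)
for row_text, terminator in pairs:
    # repr() makes the terminator visible: '\r\n' for most rows, '' for the last
    print(repr(row_text), repr(terminator))
```

Group 2 holds "\r\n" for the first three lines and an empty string for the last one, where $ matched the end of the text. Only group 1 would end up in the output.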
Step 4
Finally, say we want to split each row into columns using the "," character as the separator. The following formatter can be used:
<?xml version="1.0" encoding="UTF-8"?>
<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
<txtParser parentGroup="1" tagName="row" regEx="^(.*?)(,|$)" iterate="true">
<txtParser parentGroup="1" tagName="column"/>
</txtParser>
</txtParser>
The output will be:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Document created with Txt2XML-->
<CSV>
<row>
<column>Head1</column>
<column>Head2</column>
<column>Head3</column>
</row>
<row>
<column>Greg</column>
<column>Rous</column>
<column>23</column>
</row>
<row>
<column>Nick</column>
<column>Franx</column>
<column>32</column>
</row>
<row>
<column>Jack</column>
<column>Troth</column>
<column>58</column>
</row>
</CSV>
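Putting the four steps together, the following Python sketch is a toy interpreter for the attributes used in this tutorial (tagName, regEx, iterate, parentGroup). It is not the Txt2XML implementation: the functions find_matches and apply_rule are hypothetical, the iteration and leaf-rule behaviour are assumptions inferred from the examples above, and Python's re module stands in for the library's regex engine.

```python
import re
import xml.etree.ElementTree as ET

# The Step 4 formatter, kept as a raw string so "\r\n" reaches the regex engine as written.
FORMATTER = r'''<txtParser tagName="CSV" regEx="^(.*?)(\r\n|$)" iterate="true">
  <txtParser parentGroup="1" tagName="row" regEx="^(.*?)(,|$)" iterate="true">
    <txtParser parentGroup="1" tagName="column"/>
  </txtParser>
</txtParser>'''

TEXT = "Head1,Head2,Head3\r\nGreg,Rous,23\r\nNick,Franx,32\r\nJack,Troth,58"


def find_matches(rule, text):
    """Return the matches of rule's regEx on text; repeat only if iterate="true"."""
    regex = re.compile(rule.get("regEx"))
    iterate = rule.get("iterate", "false") == "true"  # default is "false"
    matches, rest = [], text
    while rest:
        m = regex.match(rest)
        if m is None or m.end() == 0:
            break  # no match, or a match that consumes nothing
        matches.append(m)
        if not iterate:
            break
        rest = rest[m.end():]
    return matches


def apply_rule(rule, text):
    """Build the output element for one txtParser rule applied to text."""
    out = ET.Element(rule.get("tagName"))
    children = list(rule)
    if rule.get("regEx") is None or not children:
        out.text = text  # assumption: a leaf rule stores the selected text as-is
        return out
    for m in find_matches(rule, text):
        for child in children:
            group = int(child.get("parentGroup", "0"))  # default group is 0
            out.append(apply_rule(child, m.group(group)))
    return out


root = apply_rule(ET.fromstring(FORMATTER), TEXT)
table = [[column.text for column in row] for row in root]
print(root.tag)
print(table)
```

Each level of nesting in the formatter becomes one level of recursion: the outer rule divides the text into rows, and the inner rule divides each row into columns.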