Having previously used the Text Data Processing Blueprints for DS 4.1, I downloaded the updated 4.2 Blueprints from SAP SCN today and imported them into my SAP Data Services 4.2 SP07 development installation.
However, I cannot get the "Text Data Processing Blueprints 4.2 - English" package to work.
I created the required database tables using the supplied SQL create script, copied the files to the right locations, and updated the Python installation with the supplied 2.6.2 libraries. I then imported the ATL file and updated the datastore and all file locations in the Data Flows with the correct paths.
I then ran the TdpBlueprintEn_Basic and TdpBlueprintEn_Binary jobs, which both ran fine.
The problem is with the TdpBlueprintEn_DictionaryGenerate job.
It generates the TdpEnOutDictionary.xml file, but the subsequent TdpBlueprintEn_DictionaryCompile script fails when it calls the tf-ncc compiler.
The output in the error log isn't very helpful, but running this step from the command prompt gives much better insight into the problem.
Here is the output:
[MSG]Processing [C:/SAP/Data Services/TextAnalysis/languages/tf.nc-config]
[MSG]Processing [C:\SAP\Data Services\TextAnalysis\TdpEnOutDictionary.xml]
C:\SAP\Data Services\TextAnalysis\TdpEnOutDictionary.xml:61:7:[ERR]invalid content after root element's end tag
So there appears to be a problem with the XML. To isolate the issue, I removed all but two entries from the input dictionary Excel sheet. Here is the resulting TdpEnOutDictionary.xml file:
<?xml version="1.0" encoding = "UTF-8" ?><ns1:dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ns1="http://www.sap.com/ta/4.0"
>
<ns1:entity_category name = "RESTAURANT" ><ns1:entity_name standard_form = "Green Apple Italian Bistro" ><ns1:variant name = "Green Apple" ></ns1:variant>
<ns1:variant name = "Green Apple restaurant" ></ns1:variant>
</ns1:entity_name>
</ns1:entity_category>
</ns1:dictionary>
<?xml version="1.0" encoding = "UTF-8" ?><ns1:dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ns1="http://www.sap.com/ta/4.0"
>
<ns1:entity_category name = "RESTAURANT" ><ns1:entity_name standard_form = "Green Apple Italian Bistro" ><ns1:variant name = "Green Apple" ></ns1:variant>
<ns1:variant name = "Green Apple restaurant" ></ns1:variant>
</ns1:entity_name>
</ns1:entity_category>
</ns1:dictionary>
The problem is very clearly the repeated XML declaration, followed by a second root element. A well-formed XML document has exactly one XML declaration, at the very start, and exactly one root element. Note that the file above contains the complete document twice, once per input entry.
I am also not sure whether the dictionary node should be repeated for each entity or variant. Shouldn't these all be part of the same dictionary?
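To confirm the diagnosis outside of tf-ncc, a minimal sketch along these lines can check the file for well-formedness and split it at each XML declaration. It is written for Python 3 using only the standard library and is not tested against the 2.6.2 libraries shipped with the blueprints. The helper names `is_well_formed` and `split_documents` are made up for illustration, and the sample string is a shortened copy of the output above (the unused xmlns:xsi attribute is left out).

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the generated output shown above (illustrative sample).
ONE_DOC = (
    '<?xml version="1.0" encoding = "UTF-8" ?>'
    '<ns1:dictionary xmlns:ns1="http://www.sap.com/ta/4.0">\n'
    '<ns1:entity_category name = "RESTAURANT" >'
    '<ns1:entity_name standard_form = "Green Apple Italian Bistro" >'
    '<ns1:variant name = "Green Apple" ></ns1:variant>\n'
    '<ns1:variant name = "Green Apple restaurant" ></ns1:variant>\n'
    '</ns1:entity_name>\n'
    '</ns1:entity_category>\n'
    '</ns1:dictionary>\n'
)
# The failing file is the same document written twice in a row.
BROKEN = ONE_DOC + ONE_DOC


def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


def split_documents(text):
    """Split concatenated XML documents at each '<?xml' declaration."""
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print("whole file well-formed:", is_well_formed(BROKEN))
    parts = split_documents(BROKEN)
    print("documents found:", len(parts))
    print("each part well-formed:", all(is_well_formed(p) for p in parts))
    print("all parts identical:", len(set(parts)) == 1)
```

If every part turns out to be identical, as in the two-entry test above, keeping only the first part might serve as a stopgap input for the compiler. That is an assumption on my side, and the real fix still belongs in the Data Flow.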
Clearly something isn't right with the nested "MapToXsd" query in the TdpBlueprintEn_DictionaryCreate Data Flow.
The Row Generation transform is set to produce one row, but it then forms a Cartesian product with the rows read directly from the TdpEnInDictionary Excel source. Ideally the query should produce only one dictionary parent node and therefore one XML declaration. Instead it repeats the dictionary parent node for each dictionary entry, and that is where things go wrong.
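For comparison, here is a rough sketch of the shape I would expect the output to have: one declaration, one dictionary root, and every entry nested under it. It is Python 3 standard library only and has nothing to do with how Data Services builds the file internally. The function name `build_dictionary` and the row layout (category, standard form, variant) are assumptions for illustration. The namespace URI is taken from the generated file. ElementTree reserves prefixes of the form ns1 for its own use, so the sketch registers an arbitrary prefix `ta` instead, and I have not checked whether tf-ncc accepts a different prefix.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sap.com/ta/4.0"  # namespace URI from the generated file
# "ns1" cannot be registered (ElementTree reserves nsN prefixes),
# so an arbitrary prefix is used here; this choice is an assumption.
ET.register_namespace("ta", NS)


def tag(local_name):
    """Return a namespace-qualified tag name in ElementTree notation."""
    return "{%s}%s" % (NS, local_name)


def build_dictionary(rows):
    """Build ONE dictionary element from (category, standard_form, variant) rows."""
    root = ET.Element(tag("dictionary"))
    categories = {}  # category name -> entity_category element
    names = {}       # (category, standard_form) -> entity_name element
    for category, standard_form, variant in rows:
        cat_el = categories.get(category)
        if cat_el is None:
            cat_el = ET.SubElement(root, tag("entity_category"), {"name": category})
            categories[category] = cat_el
        key = (category, standard_form)
        name_el = names.get(key)
        if name_el is None:
            name_el = ET.SubElement(
                cat_el, tag("entity_name"), {"standard_form": standard_form}
            )
            names[key] = name_el
        ET.SubElement(name_el, tag("variant"), {"name": variant})
    return root


# The two entries left in the test Excel sheet (hypothetical row layout).
rows = [
    ("RESTAURANT", "Green Apple Italian Bistro", "Green Apple"),
    ("RESTAURANT", "Green Apple Italian Bistro", "Green Apple restaurant"),
]

# Serialise without a declaration, then add exactly one declaration by hand.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(
    build_dictionary(rows), encoding="unicode"
)

if __name__ == "__main__":
    print(xml_text)
```

The point of the sketch is only the nesting: the parent node is created once, outside the loop over the input rows, which is what I would expect the nested query to do as well.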
Has anyone else run into this particular issue, and if so, was it the Cartesian product with the Row Generation transform that caused the problem? Has no one at SAP noticed this issue? Or is this somehow something that went wrong during the import of the ATL file?