I have set up source (Hive) and target (HANA) datastores in Data Services to copy data from Hive to HANA using batch jobs. This works, but the data is staged in a local file system before it is finally inserted into HANA. This slows down the data movement, because the data is first written to the staging area and then written again to HANA.
I have gone through the Data Services Performance Optimization Guide. According to it, data staging happens under the four conditions below:
======================================================================================================
With the Bulk load option selected in the target table editor, any one of the following conditions triggers the staging mechanism:
● The data flow contains a Map_CDC_Operation transform.
● The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows.
● The data flow contains a Table_Comparison transform.
● The Auto correct load option in the target table editor is set to Yes.
If none of these conditions are met, that means the input data contains only INSERT rows. Therefore Data Services does only a bulk insert operation, which does not require a staging table or the need to execute any additional SQL.
======================================================================================================
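To make my reading of the quoted passage explicit, here is a minimal sketch in Python. The class and field names are my own shorthand (hypothetical, not Data Services settings or API names), and the function only encodes the four conditions quoted above, not how the product actually decides.

```python
from dataclasses import dataclass


@dataclass
class DataFlowConfig:
    """Hypothetical flags describing a data flow; not Data Services API names."""
    bulk_load: bool
    has_map_cdc_operation: bool = False
    map_operation_outputs_update_or_delete: bool = False
    has_table_comparison: bool = False
    auto_correct_load: bool = False


def staging_expected(cfg: DataFlowConfig) -> bool:
    """Return True if the quoted rules say bulk loading would use staging."""
    if not cfg.bulk_load:
        # The quoted rules only describe the case where Bulk load is selected.
        return False
    # Any one of the four conditions triggers the staging mechanism.
    return any([
        cfg.has_map_cdc_operation,
        cfg.map_operation_outputs_update_or_delete,
        cfg.has_table_comparison,
        cfg.auto_correct_load,
    ])


# My data flow as I understand it: bulk load on, none of the four conditions met.
my_flow = DataFlowConfig(bulk_load=True)
print(staging_expected(my_flow))
```

By this reading, my data flow should result in a plain bulk insert with no staging.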
In my case none of the above conditions holds true: I do not use any transform between the source and target tables, and the Auto correct load option is not set. See the screenshots below.
No transform in the above data flow.
Auto correct load is set to "No".
So ideally the data staging should not happen, yet it is happening, as we can see below.
=======================================================================================
bash-4.1$ pwd
/build/dsbop-ips/dataservices/log/hadoop/HIVE_events_test_partd_ext_deviceseverity_59266/hiveRead_dir
bash-4.1$ ls -l
total 22057092
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000000_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000001_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000002_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000003_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000004_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000005_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000006_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000007_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000008_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000009_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000010_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000011_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000012_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 21:59 000013_0
-rw-r--r-- 1 dsuser dsuser 348530727 Aug 3 22:00 000014_0
=======================================================================================
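To quantify how much data ends up in the staging directory, a small Python sketch like the following could be run on the job server. The function name is my own, and it simply sums the sizes of the regular files directly inside a directory; it assumes nothing about how Data Services creates them.

```python
from pathlib import Path


def staged_bytes(directory: str) -> int:
    """Sum the sizes (in bytes) of regular files directly inside `directory`."""
    return sum(p.stat().st_size for p in Path(directory).iterdir() if p.is_file())


# Example call, using the directory from the listing above:
# staged_bytes("/build/dsbop-ips/dataservices/log/hadoop/"
#              "HIVE_events_test_partd_ext_deviceseverity_59266/hiveRead_dir")
```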
Any pointers on how to avoid this data staging would be helpful.
Thanks,
Sanjay