Saturday, April 8, 2017

Generate Office Open XML files (DOCX, XLSX, PPTX…) using XSLT Transformations


Note: Download OOXML, XSL Transformation Sample here.


The Office Open XML format (OOXML) was introduced by Microsoft for representing Office file formats like Excel, Word and PowerPoint, and was later standardized by ECMA and ISO/IEC. These formats have become industry-wide (universal) standards and are portable across other office suites like OpenOffice or LibreOffice.

XSLT is one of the fastest tools to generate office documents: you can generate an Office document (say a Word or Excel file) from an input XML file and a target XSLT transformation, as detailed here. This was easy for formats like XML Spreadsheet 2003 (MSO), a proprietary Excel format from Microsoft, where an entire workbook is represented by a single XML file and XSLT typically outputs a single XML document. Since it is a proprietary format, other office applications do not support it and it is not portable; it only works with Microsoft Office products.

Now, OOXML is a newer specification, where an Excel workbook or Word document is split into many parts (which can be binary as well), packaged together in ZIP format. You can simply rename an XLSX/DOCX file to ZIP and open it in any archive viewer to find its contents. You can find an in-depth dive into the concept here and here.
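As a quick illustration, the package can be inspected programmatically. The sketch below (Python, stdlib `zipfile`) builds a minimal OOXML-like package in memory and lists its parts; the part names follow the OPC convention, but the contents are illustrative stand-ins, not a valid workbook:

```python
import io
import zipfile

# Build a minimal OPC-style package in memory. A real XLSX produced by Excel
# contains more parts and valid SpreadsheetML markup; these are stand-ins.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as pkg:
    pkg.writestr("[Content_Types].xml", "<Types/>")
    pkg.writestr("_rels/.rels", "<Relationships/>")
    pkg.writestr("xl/workbook.xml", "<workbook/>")
    pkg.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Any real OOXML file can be opened the same way, e.g.
# zipfile.ZipFile("BookTemplate.xlsx").namelist()
with zipfile.ZipFile(buf) as pkg:
    parts = pkg.namelist()
print(parts)
```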

As OOXML is a ZIP format containing many parts, you cannot apply XSLT as-is to the ZIP file to generate office documents, and XSLT was not worth the effort, until Eric White introduced a solution based on Flat OPC XML.

He converted the ZIP package to a single Flat OPC XML file, which can then be handled using XSL transformation constructs. OPC (Open Packaging Conventions), again, was introduced by Microsoft (refer to the System.IO.Packaging namespace for more) as a container specification which can embed multiple sub-objects under it, and was later standardized by ISO to make it an industry-wide standard that works across a variety of platforms/applications.

The actual OOXML to Flat OPC format conversion was written by Eric White, and his implementations are free to download from here.


The flow is depicted below:



The solution steps are summarised below with a sample transformation:


1. Create your Office Document Template  in an Office Application:

Save it in OOXML format (DOCX or XLSX).


See BookTemplate.xlsx, in the attached zip file.



2. Convert this Template to a single Flat OPC  XML file:

We can use the C# application on this page to do the conversion.


See BookTemplate.xlsx.xml, in the attached zip file.



3. Define your XML Input file:


See BookStore.xml, in the attached zip file.



4. Build the XSL Transformation from this OPC XML Template file:

Now modify the OPC XML file to replace the repetitive tags with XSL constructs that fetch values from the input XML. Embed the required XSL constructs to make it an XSL transformation file, and change the extension to .XSL.
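For instance, a repeated sheet row in the Flat OPC template can be replaced with a loop. The fragment below is a hypothetical sketch: the input element names (BookStore/Book/Title/Author) are assumptions based on the sample file names, and a real sheet row carries the full SpreadsheetML attributes:

```xml
<xsl:for-each select="/BookStore/Book">
  <row>
    <c t="inlineStr"><is><t><xsl:value-of select="Title"/></t></is></c>
    <c t="inlineStr"><is><t><xsl:value-of select="Author"/></t></is></c>
  </row>
</xsl:for-each>
```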


See BookStoreToFlatOPCxml.xsl, in the attached zip file. The below screenshots show the updated portions of the OPC XML file with the required XSL transformations. As simple as that.






5. Apply XSL Transformation to the XML input file to generate the final Flat OPC XML file:

Here we are applying the transformation directly in the Visual Studio IDE for demonstration purposes. The good thing is you can debug/step in/step through on the fly and fix bugs pretty easily.


To know more about XSLT Debugging in Visual Studio, See here.


6. Convert the XSL output, Flat OPC XML file back to OOXML format:

Now we have the final output file in Flat OPC XML format. This has to be converted back to OOXML format, so that Office applications can process it.


We can use the C# application on this page to do the conversion.


7. Open the Office Open XML file in your favourite Office Application:



8. Automate the Process using C#/Java Programs:

We can automate the above steps through programs: for example, pull the Flat OPC XSL file from a SharePoint list and apply it to an XML input generated from an Oracle database. The actual XSL transformation can be done with a few lines of code, as listed here. Also, ensure that you do not use C# functions in XSLT files; this example was meant for demo purposes only. In actual scenarios, put all such C# functions inside a .NET assembly and add it as an extension to the XslTransform process.

Monday, January 30, 2017

Utilize Ethernet over Wifi, When Plugged In – Linux Networking Scenario


I've two network interfaces in my Laptop: Ethernet (eth0) and Wifi (wlan0). Both are bridged together through ARP proxying in a software bridge (br0) using the 'parprouted' utility.

Now, by default I would like to use my Wifi to connect to my LAN and the internet. But whenever a LAN cable is plugged in, I would like to use Ethernet for both LAN and internet. This is to get optimal performance and bandwidth, as my Ethernet works faster than Wifi. As soon as I'm about to move around with my Laptop (by unplugging the LAN cable), Wifi should take over and serve both LAN and internet once again. If I plug in my LAN cable again, Ethernet should spring back. The switch-over should be smooth enough that applications which require a net connection work with minimal glitches after the switch.

The big picture is given below.


Below is an excerpt from my '/etc/rc.local' file that shows the bridging setup:

sudo brctl addbr br0
sudo brctl setfd br0 0
sudo brctl stp br0 off
sudo brctl addif br0 eth0
sudo parprouted wlan0 br0
sudo bcrelay -d -i wlan0 -o br0
sudo sysctl net.ipv4.conf.wlan0.proxy_arp=1
sudo sysctl net.ipv4.conf.br0.proxy_arp=1

Problem Statement:

But as soon as I boot up my system, Wifi is the only network interface serving both my internet/LAN. Even if the Ethernet cable is plugged in, it never gets used.

After much troubleshooting I found that the issue was being caused by the 'parprouted' utility. The tool sets the default gateway for Wifi only. It never respects my Ethernet state (no matter whether it is plugged in or not). See my routing table after system boot. Notice the last line: it always uses 'wlan0'.

sudo route
Kernel IP routing table
Destination   Gateway   Genmask   Flags   Metric   Ref   Use   Iface
default                           UG      7        0     0     wlan0


Since picking the correct network device (interface) for the default gateway is the solution, the fix is pretty straightforward. The outline is given below.

1. Whenever Ethernet cable is plugged in, Update the routing table so that, Default Gateway should use Ethernet (eth0/br0)

2. Whenever Ethernet cable is unplugged, Update the routing table so that, Default Gateway should use Wifi (wlan0)

3. On system startup, Perform both Step 1 & 2

To detect Ethernet cable plug-in/unplug events, we can use the utility called 'ifplugd'. Configure it for ethernet (eth0), so that both plug-in and unplug are monitored. We will then hook a simple bash script to these events to modify the network interface for the default gateway.
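The decision the hook script makes can be boiled down to a few lines. The sketch below models only the logic (which interface should own the default route), using the interface names from the setup above:

```python
def pick_gateway_iface(eth_plugged: bool, current_gw_iface: str):
    """Return the interface that should own the default route,
    or None if the routing table is already correct."""
    # Ethernet traffic rides on the bridge (br0); wifi is wlan0.
    desired = "br0" if eth_plugged else "wlan0"
    return None if current_gw_iface == desired else desired

# Cable plugged in while wifi owns the route -> switch to the bridge
switch_to = pick_gateway_iface(True, "wlan0")
# Cable out while wifi already owns the route -> nothing to do
no_change = pick_gateway_iface(False, "wlan0")
```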

sudo apt-get install ifplugd
vi /etc/default/ifplugd
vi /etc/ifplugd/action.d/ifupdown
#Append the below line, which calls our custom script
sudo bash -c

Now, my custom script, given below, modifies the network interface for the default gateway as desired (on Ethernet plug-in/unplug events).

set +e   # Don't exit on error status
#Removed some code.........
#Get LAN cable plugged-in status
is_eth_plugged=$(cat /sys/class/net/eth0/carrier)
#Is Wifi our gateway device?
is_wifi_gw=$(route | grep "default.*wlan0" | tr -s ' ' | cut -d' ' -f8 | head -n1 | wc -l)
#Is Ethernet our gateway device?
is_eth_gw=$(route | grep "default.*br0" | tr -s ' ' | cut -d' ' -f8 | head -n1 | wc -l)
#We've already set the device properly, so exit without doing anything
[ $is_eth_plugged -eq 1 ] && [ $is_eth_gw -eq 1 ] && exit 0
[ $is_eth_plugged -ne 1 ] && [ $is_wifi_gw -eq 1 ] && exit 0
#If ethernet is plugged in, the device will be eth0/br0, else it will be wifi (wlan0)
[ $is_eth_plugged -eq 1 ] && iface="br0" || iface="wlan0"
#Ping over the interface to see if it can actually serve as the gateway
ping_iface=$(ping -I $iface -c 1 $defGateway | grep "64 bytes from $defGateway" | wc -l)
[ $ping_iface -ne 1 ] && exit 0
#Now remove the existing default route(s) from the routing table
while route del default 2>/dev/null; do :; done
#Clear ARP cache
ip -s -s neigh flush all
#Now add the default gateway with the correct device
route add -net default gw $defGateway dev $iface

Add the same script in /etc/rc.local, so that the check will also be done on system startup.

sudo bash -c

Now, as soon as you plug in your LAN cable, Ethernet will be used as the primary device for internet/networking. If it gets unplugged, Wifi will take over.

Here are my scripts for downloading.

Sunday, January 29, 2017

RaspberryPi2 As Network Failover/Load Balancer/Edge Router–A Reliable Implementation

I. MultiHoming & Multipath Routing:
Sometime back I proposed an idea about using RPi2 as a Network FailOver device for a home network, given there are multiple sources of internet.
I've tried many solutions including NAT pooling, custom routing tables, network interface bonding etc. But none gave me a reliable implementation.
Finally I found the concept of MultiHoming & Multipath Routing in Linux, which actually did the trick. In MultiHoming you have multiple network interfaces in your Linux box, each providing a separate subnet/network. You can also configure a single network interface to provide multiple subnets (e.g. if eth0 is your device, you can configure eth0:0 and eth0:1 to deliver two separate network segments).
The real magic of Network Failover/Load Balancing is provided by MultiPath Routing, which seems surprisingly simple as below:
ip route add default \
        nexthop via $P1 dev $IF1 weight 1 \
        nexthop via $P2 dev $IF2 weight 1
By default Linux allows a single gateway for routing packets which are not destined for your local network. Now the trick is to change this single gateway to a MultiPath gateway, as shown above. In the above example, P1 is the gateway for interface IF1 and P2 is the gateway for interface IF2, and we added both with equal weight, which in turn works as a Load Balancer. Linux will try to split the packet traffic equally between the two interfaces.
Now if you change the weights, say you give weight 5 to IF1, it will work as a Network Failover Router, as most of the traffic will be routed to IF1, and if it is down traffic will be routed to IF2. See our example, where we've three network interfaces and each of them provides internet.

ip route add default \
      nexthop via dev eth0 weight 8 \
      nexthop via dev usb0 weight 6 \
      nexthop via dev ppp0 weight 3
eth0 is our LAN interface, which is also connected to an ADSL Broadband router providing internet; this is our Primary Source of Internet. Next, usb0 is a 4G USB Ethernet modem, considered the Secondary source of internet. If both are down, we've a 3G GSM USB modem (ppp0) running the PPP protocol, and it has the least priority, as it is much slower compared to the other two. So the above single line actually provides the required Load Balancing/Network Failover for the home network. (Of course we've Source-NATed both the usb0 and ppp0 networks, as they are in a different address space compared to the LAN - eth0.)
I've followed this article to implement the multipath routing for my own environment, and voila, it worked like a charm! Other than this I had to perform a few tweaks with DNS name server resolution, NATing my secondary internet sources etc., which are detailed in the solution sections below.
II. Linux Kernel Version Problems past 3.6:
The above solution worked because the Linux Kernel caches the route when a packet opts for a particular route through a specified interface. All subsequent packets in that connection follow the same route through the same interface, and hence we've a reliable connection with Load Balancing/Network Failover. This is the flow-based load balancing technique.
Around version 3.6 of the Linux Kernel, the “Routing Cache” was removed, since it was considered to have a Denial Of Service (DoS) vulnerability. As of now, I'm using Ubuntu 14.04, which has a Kernel version of 3.13, with the Routing Cache removed. The above implementation mysteriously stopped working after the 14.04 upgrade. Since the Routing Cache has been removed, packets belonging to the same session no longer stick to the same route/interface; they pick any available interface in quasi-random fashion. The result: you may never be able to establish a TCP connection at all, as the initial SYN packet may pick eth0, but the subsequent SYN+ACK may pick usb0, and the TCP 3-way handshake may not succeed at all.
Though route caching has been re-introduced in Kernel version 4.4 (Ubuntu 16.04) with a more robust hash over source/destination addresses than the flow-based algorithm, I don't want to depend on Kernel features, as they can change in future, which may break my implementation. So I've decided to mimic the “Route Caching” with my own implementation using “TCP Connection Tracking”.
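The breakage can be simulated in a few lines. The sketch below is a contrived stand-in for the kernel's path selection (not its actual algorithm): without a cache, every packet may pick any uplink, while flow-pinning hashes the 4-tuple once so the whole session stays on one uplink:

```python
import random

ifaces = ["eth0", "usb0", "ppp0"]

def per_packet_pick(rng):
    # Post-3.6 behaviour without a route cache: each packet may take any path.
    return rng.choice(ifaces)

def flow_pinned_pick(flow):
    # Route-cache / flow-hash behaviour: every packet of a flow shares one path.
    return ifaces[hash(flow) % len(ifaces)]

# One TCP session, identified by its 4-tuple (addresses are illustrative).
flow = ("", 51515, "", 443)
rng = random.Random(0)

per_packet = {per_packet_pick(rng) for _ in range(100)}  # scattered uplinks
pinned = {flow_pinned_pick(flow) for _ in range(100)}    # a single uplink
```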
III. Reliable Solution: MultiHoming, Multipath Routing with TCP Connection Tracking:
I owe thanks to WILL COOKE for his article and the initial scripts shared on his blog on how to realize this method. I've modified his scripts according to my environment, and introduced a few other scripts and changes to achieve this implementation fully.
Before diving into the solution, refer to the below figure, which depicts our environment.
As you can see I’ve three networks.
a. Home LAN (Subnet: 24) – Connected to shared Network Switch
Home LAN also has an attached cabled ADSL Broadband Modem (, the source of primary internet.
It has its own WAN port running a subnet range of 117.197.X.X. LAN addresses are SNATed by the ADSL router itself for outgoing packets to its WAN port.
b. 4G USB Ethernet WAN (Subnet: X) – Connected to RPi2
Source of secondary internet. Priority: 2.
LAN addresses are SNATed while routing through this interface, as it has a different address space.
c. 3G USB PPP WAN (Subnet: X) – Connected to RPi2
Source of secondary internet. Priority: 3.
LAN addresses are SNATed while routing through this interface, as it has a different address space.
The RaspberryPi2, for its sole purpose, will act as the Edge Router, DHCP Server and DNS Server for the Home Network, and I've disabled the DHCP server on every other system and in the ADSL Router. Now we need to implement a few scripts to make it a Network FailOver/Load Balancer. By default the Pi2 will route all internet traffic to the Primary Internet; once it is down, traffic will be routed to the Secondary Internet (4G Modem), and if that too is down, traffic will be routed to the other Secondary Internet (3G Modem).
If the Primary Internet comes back online, new traffic will be re-routed through the Primary, i.e. internet traffic should always pick the source which has the highest priority and is online. This was the major requirement: to switch back to the primary when it is online.
RPi2-Configuration Steps
Solution Step1 : Enable Router Mode-
We've to enable IP forwarding and ICMP redirects in /etc/sysctl.conf. See the sysctl.conf file in the attached ZIP file.
Solution Step2 : Setup DHCP, DNS Server and Policies-
We're using DNSMASQ as the DHCP and DNS server. We've to point the DHCP clients to RPi2 as the DNS/name server for address resolution (using the --server and --dhcp-option=6 options). Mention RPi2's LAN address for these settings. See the 'dnsmasq' file in the attached ZIP file.
Now we've to mention the upstream DNS servers in '/etc/resolvconf/resolv.conf.d/head', i.e. the DNS IP addresses of the ADSL Broadband modem, 4G USB modem and 3G USB modem. Also we've to enable 'options timeout:5 attempts:2' in the same file, so that RPi2 falls back to the other secondary DNS servers if the primary interface is down and the primary DNS server is not available at that moment. This is the network failover setting for the DNS servers, without which we cannot resolve DNS names even though a fallback secondary network is running (the primary DNS server is not reachable from the secondary internet source network, which only knows its own DNS server, the secondary one that should be used during fallback). See the 'head' and 'tail' files in the attached ZIP file.
Solution Step3 : Setup Routing Tables, NAT rules, Multipath Routes and Connection Marking/Tracking-
The steps and concepts are detailed in depth here, so I'm only providing hints on certain constructs based on my environment.
Refer "" in the attached zip file; here we do the below, using PreRouting, Filter and PostRouting rules:
a. Mark all new packets, based on the incoming network interface, as 1, 2 or 3, to track them during transit
b. Enable Source NAT on 4G USB (usb0) and 3G USB (ppp0)
c. Enable IP Forwarding between LAN and WAN interfaces
Refer "" in the attached zip file; here we do the below:
a. Create separate routing tables for LAN, 4G USB, 3G USB and populate them with their gateways
b. Create a dedicated routing table (loadbal) with multipath routes to the LAN and WANs, with corresponding weights, for network failover/load balancing
c. Add a rule to pick the 'loadbal' routing table for all unmarked/untracked (new) packets, so that they are load balanced or pick a failover route
d. Add a rule to pick the corresponding routing table for all marked/tracked packets, as per the marked number
e. Allocate each network interface's send/receive queues to separate CPU cores for better performance
Refer "" in the attached zip file; here we do the below:
a. Check the route configuration every 15 seconds; if any glitch happens, rebuild all routes
It should be run on startup using the command, $ sudo bash
Run it manually or place it in /etc/rc.local
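As an illustration, the heart of the marking scheme (the "mark" and "pick the corresponding table" steps above) boils down to rules of this shape. The mark number and routing table name below are assumptions for illustration, not the exact contents of the attached scripts, and the commands require root:

```shell
# (a) Stamp each NEW connection with a mark identifying its uplink, then copy
#     that mark back onto every later packet of the same connection.
iptables -t mangle -A PREROUTING -i eth0 -m conntrack --ctstate NEW -j CONNMARK --set-mark 1
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# (d) Marked packets keep using their original uplink's routing table, so an
#     established TCP session never hops to another uplink mid-flight.
ip rule add fwmark 1 table 1
```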

Solution Step4 : Redial 3G USB modem periodically, if it goes down-
Refer "" in the attached zip file; here we do the below.
It should be run on startup using the command, $ sudo bash
Run it manually or place it in /etc/rc.local
Solution Deployment: Configure the scripts at startup -
Now, when it comes to firing up the Pi2 as a FailOver/LoadBalance router, just plug in the LAN cable, 4G USB Ethernet modem and 3G USB PPP modem, and switch it on.
Run the below two scripts on start (run them manually after bootup or configure them in /etc/rc.local so they start automatically after boot); they will run indefinitely to periodically validate and refresh the routing config and initiate the 3G dialup if it gets disconnected.
$ sudo bash
$ sudo bash
Congratulations, now you've transformed your RaspberryPi2 into a highly fault-tolerant Network Failover/Load Balance router!
A few screenshots from my environment: 

RPi2 Running

Monday, February 29, 2016

Setting up Hive2 Server On Hadoop 2.7+ (Multi-Node-Cluster On Ubuntu 14.04 LXD Containers)

In this article we build Hive2 Server on a Hadoop 2.7 cluster. We've a dedicated node (HDAppsNode-1) for Hive (and other apps) within the cluster, which is highlighted in the below deployment diagram showing our cluster model in Azure. We will keep the Hive Meta Store in a separate MySQL instance running on a separate host (HDMetaNode-1) to have a production-grade system, rather than keeping it in the default embedded database. This article assumes you've already configured Hadoop 2.0+ on your cluster. The steps we've followed to create the cluster can be found here, which is to build a Single Node Cluster. We've cloned the Single Node to multiple nodes (7 nodes as seen below), and then updated the Hadoop configuration files to transform it into a multi-node cluster. This blog has helped us to do the same. The updated Hadoop configuration files for the below model (Multi-Node Cluster) have been shared here for your reference.


Lets get started.

1. Create Hive2 Meta Store in MySql running on HDMetaNode-1.

sudo apt-get install mysql-server

<Log in to MySQL using the default user: root>

CREATE DATABASE hivemetastore;
USE hivemetastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
GRANT all on *.* to 'hive'@'HDAppsNode-1' identified by 'hive';

2. Get Hive2.

We are keeping the Hive binaries under /media/SYSTEM/hadoop/hive/apache-hive-2.0.0.


tar -xvf apache-hive-2.0.0-bin.tar.gz

mv apache-hive-2.0.0-bin apache-hive-2.0.0
cd apache-hive-2.0.0

mv conf/hive-default.xml.template conf/hive-site.xml

Edit 'hive-site.xml' to configure the MySQL Meta Store and the Hadoop-related configurations. Please change them as per your environment. /media/SYSTEM/hadoop/tmp is our Hadoop TMP directory in the local filesystem.

Apart from that, we had to replace all the occurrences below to make Hive2 work with our cluster,

${}/ with /media/SYSTEM/hadoop/tmp/hive/

/${} with /



  <property>
    <value>/media/SYSTEM/hadoop/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/media/SYSTEM/hadoop/tmp/hive/resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://HDMetaNode-1/hivemetastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password for connecting to mysql server</description>
  </property>
3. Update Hadoop Config

core-site.xml (Add the below tags; typically these are the proxy-user entries that allow the user running HiveServer2 (hduser in our case) to impersonate other users. Adjust as per your environment.)

  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.groups</name>
    <value>*</value>
  </property>
4. Setup Hive Server

Update ~/.bashrc and ~/.profile ,  to contain the Hive2 path



export HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-2.0.0

export PATH=$PATH:$HIVE_HOME/bin


Refresh the environment

source ~/.bashrc

Setup and create the Meta Store in MySQL (you may need to download the MySQL Connector JAR file into Hive's lib folder)

bin/schematool -dbType mysql -initSchema

Start Hive2 Server

bin/hiveserver2

Sunday, February 28, 2016

Setting up Oozie 4.1.0 On Hadoop 2.7+ (Multi-Node-Cluster On Ubuntu 14.04 LXD Containers)

In this article we build Oozie 4.1.0 on a Hadoop 2.7 cluster. We've not selected the latest Oozie 4.2.0, which has build issues with Hadoop 2.0+ till date. We've a dedicated node (HDAppsNode-1) for Oozie (or other apps) within the cluster, which is highlighted in the below deployment diagram showing our cluster model in Azure. We will keep the Oozie metadata in a separate MySQL instance running on a separate host (HDMetaNode-1) to have a production-grade system, rather than keeping it in the default Derby database. This article assumes you've already configured Hadoop 2.0+ on your cluster. The steps we've followed to create the cluster can be found here, which is to build a Single Node Cluster. We've cloned the Single Node to multiple nodes (7 nodes as seen below), and then updated the Hadoop configuration files to transform it into a multi-node cluster. This blog has helped us to do the same. The updated Hadoop configuration files for the below model (Multi-Node Cluster) have been shared here for your reference.


Lets get started. We've referenced the following blogs to prepare Oozie under Hadoop 2.0 (Link1, Link2, Link3). Also, to make Oozie work, you've to start your Job History Server along with YARN. I've configured the Job History Server on the same node as YARN (HDResNode-1); it is started along with YARN using the command ( start historyserver).

1. Firstly, the Codehaus Maven repository referenced in the Oozie build file has been moved to another mirror. We need to override the Maven settings to point to the new location.

Edit or create (/home/hduser/.m2/settings.xml) and add the below.


        <id>Codehaus repository</id>

2. Now we have to create a Meta Store for Oozie in MySql running on HDMetaNode-1.

sudo apt-get install mysql-server

<Log in to MySQL using the default user: root>

create database oozie;

grant all privileges on oozie.* to 'oozie'@'HDAppsNode-1' identified by 'oozie';
grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';

Edit (/etc/mysql/my.cnf) to enable MySQL to accept connections from hosts other than localhost

bind-address  = HDMetaNode-1

2. Get Oozie 4.1.0 and Build on HDAppsNode-1.

We are keeping the Oozie binaries under /media/SYSTEM/hadoop/oozie-4.1.0

cd /media/SYSTEM/hadoop/

tar -xvf oozie-4.1.0.tar.gz
cd oozie-4.1.0

Update the pom.xml to change the default Hadoop version to 2.3.0. The reason we're not changing it to Hadoop version 2.6.0 here is that 2.3.0-oozie-4.1.0.jar is the latest available jar file. Luckily it works with higher versions in the 2.x series.

vim pom.xml

--Search for
--Replace it with

Continue with the Oozie build…

sudo apt-get install maven
bin/ -DskipTests -P hadoop-2 -DjavaVersion=1.7 -DtargetJavaVersion=1.7

cd ..
mv /media/SYSTEM/hadoop/oozie-4.1.0 /media/SYSTEM/hadoop/oozie-4.1.0-build
cp -R /media/SYSTEM/hadoop/oozie-4.1.0-build/distro/target/oozie-4.1.0-distro/oozie-4.1.0 /media/SYSTEM/hadoop/oozie-4.1.0

3. Prepare Oozie Libraries.

Update both ~/.profile, ~/.bashrc file to contain Oozie path. Append the below

export PATH=$PATH:/media/SYSTEM/hadoop/oozie-4.1.0/bin

Reload the environment.

source ~/.bashrc

Prepare Oozie

cd /media/SYSTEM/hadoop/oozie-4.1.0
mkdir libext

cp /media/SYSTEM/hadoop/oozie-4.1.0-build/hadooplibs/target/oozie-4.1.0-hadooplibs.tar.gz .

tar -xvf oozie-4.1.0-hadooplibs.tar.gz

cp oozie-4.1.0/hadooplibs/hadooplib-2.3.0.oozie-4.1.0/* libext/

cd libext


rm -fr /media/SYSTEM/hadoop/oozie-4.1.0/oozie-4.1.0-hadooplibs.tar.gz

4. Update Hadoop and Oozie Config files

core-site.xml (Add the below tags; the proxy-user entries for the oozie user. Adjust as per your environment.)

  <!-- OOZIE -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>

oozie-site.xml (Add/update the below tags). Please note the MySQL and Hadoop directory configurations. Please change them as per your environment. The MySQL entries match the Meta Store we created above:

  <property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://HDMetaNode-1:3306/oozie</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>oozie</value>
  </property>
  <property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>oozie</value>
  </property>
4. Add Oozie user and setup Oozie Server

Add Oozie user

sudo adduser oozie --ingroup hadoop

sudo chown -R oozie:hadoop /media/SYSTEM/hadoop/oozie-4.1.0

sudo chmod -R a+rwx /media/SYSTEM/hadoop/oozie-4.1.0

su oozie

cd /media/SYSTEM/hadoop/oozie-4.1.0

Setup MySql connector

tar -zxf mysql-connector-java-5.1.31.tar.gz
cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar /media/SYSTEM/hadoop/oozie-4.1.0/libext

Setup Logs

mkdir logs
sudo chmod -R a+rwx /media/SYSTEM/hadoop/oozie-4.1.0/logs

Setup and Create Meta Tables in MySql

bin/ db create -run

Setup Oozie WebApplication

sudo apt-get install zip
bin/ prepare-war

Setup Oozie Share Library in HDFS (Change the name node URL, as per your environment)

bin/ sharelib create -fs hdfs://HDNameNode-1:8020

Start Oozie and Test the status

bin/ start
bin/oozie admin -oozie http://localhost:11000/oozie -status


5. Prepare Oozie Samples and run a sample through Oozie

Our Name Node runs on HDNameNode-1:8020 and our Resource Manager (YARN) runs on HDResNode-1:8032. Hence we've to update the configuration of the samples as below. Change the host and port as per your environment.

tar -zxvf oozie-examples.tar.gz
find examples/ -name "" -exec sed -i "s/localhost:8020/HDNameNode-1:8020/g" '{}' \;
find examples/ -name "" -exec sed -i "s/localhost:8021/HDResNode-1:8032/g" '{}' \;

Put the samples to HDFS
hdfs dfs -mkdir /user/oozie/examples

hdfs dfs -put examples/* /user/oozie/examples/

Run a sample by submitting a Job

oozie job -oozie http://HDAppsNode-1:11000/oozie -config examples/apps/map-reduce/ -run

Check the status of the job

#now open a web browser and access "http://HDAppsNode-1:11000/oozie", you will see the submitted job

Tuesday, February 16, 2016

RDP Over SSH or RDP with HTTP-Proxy (Or Any Protocol over SSH)

This tutorial addresses the below scenarios:

1. Alternatives to X11 forwarding (X11 forwarding has performance issues)

2. RDP to a remote host using a corporate HTTP proxy server

3. Circumventing native RDP's inability to use a corporate proxy

4. Getting the remote desktop of a public server through a corporate proxy/firewall

Typically, if you're inside your corporate network, your network will be protected by a proxy server and firewall. But suppose you would like to access the remote desktop of your Linux machine residing on the internet or in a public cloud (e.g. Azure). By default all TCP connections to the outside will be channelled through the proxy server, and firewall rules will be applied.

We can use RDP through an existing SSH tunnel with the concept called SSH Port Forwarding. In our case, TCP packets to one of our local machine's ports (e.g. 5000) will be routed to a desired port (e.g. 3389, the RDP port) on the remote host through the encrypted SSH connection. The proxy/firewall only sees the SSH connection; it won't see the RDP connection, as it is hidden inside the encrypted SSH session.
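Stripped of the encryption, the forwarding mechanics can be sketched as a plain TCP relay between two ports. The Python sketch below demonstrates the idea with an in-process echo server standing in for the remote RDP port; it illustrates port forwarding only, and is not an SSH implementation:

```python
import socket
import threading

def pipe(src, dst):
    """Copy bytes one way until the source closes, then half-close dst."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass

def forward_once(listener, target):
    """Accept one client and relay traffic between it and the target."""
    client, _ = listener.accept()
    remote = socket.create_connection(target)
    threading.Thread(target=pipe, args=(remote, client), daemon=True).start()
    pipe(client, remote)

# Stand-in for the remote service (think: port 3389 on the server).
echo_srv = socket.socket()
echo_srv.bind(("", 0))
echo_srv.listen(1)

def echo_one():
    conn, _ = echo_srv.accept()
    conn.sendall(conn.recv(4096))
    conn.close()
threading.Thread(target=echo_one, daemon=True).start()

# Stand-in for the local forwarded port (think: localhost:5000).
listener = socket.socket()
listener.bind(("", 0))
listener.listen(1)
threading.Thread(target=forward_once,
                 args=(listener, echo_srv.getsockname()), daemon=True).start()

# The client talks only to the local port, unaware of the relay in between.
cli = socket.create_connection(listener.getsockname())
cli.sendall(b"hello")
reply = cli.recv(4096)
cli.close()
```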

Please note that this is not restricted to RDP alone; you can redirect any port with SSH, be it VNC, FTP, HTTP, HTTPS... you can use SSH to forward any protocol of your choice. One obvious advantage is that you get the high security and unbreakable encryption of SSH as the base for your channels.


Lets implement this on Windows/Linux client machines in a corporate network (with a proxy server) which want to connect to a remote Ubuntu server on the internet.

*) Install putty

*) Provide Proxy settings  (You can check your wpad URL, to view your proxy settings)


*) Enable tunneling and forward local port 5000, to RDP port 3389 on server


*) Connect the putty session as usual to the remote server port 22


*) Open mstsc and connect to server using ‘localhost:5000’


The RDP session will now be routed through SSH, and the server will respond with an RDP login screen.


Linux Client Machines:

** If you're using Linux clients like Ubuntu, open SSH from your command-line shell. The command will look like:

ssh -L 5000:localhost:3389 -p 22  -X username@  -o "ProxyCommand=nc -X connect -x %h %p"

Here, 'localhost' in the forwarding specification is resolved on the remote machine, i.e. it denotes the remote server itself.

Then open an RDP client like Remmina or rdesktop and connect to localhost:5000.


Read more here.

Hadoop Multi Node Cluster On Linux Containers - Backed By ZFS RAID File System - Inside Azure VM

We're successful in building a Multi-Node Hadoop Cluster with only LXD Linux Containers.

i.e. all Name/Data Nodes have been built inside four LXD containers (one Name Node, three Data Nodes) hosted inside a single Azure Ubuntu VM. Here are the major points about the implementation.

For unparalleled scalability, I've chosen a standard Azure Ubuntu VM (8-core CPU, 14 GB RAM, 4x250GB data disks) with a ZFS-RAID file system. i.e. To increase the hardware capability, I can scale up my Azure VM hardware any time. To scale up the storage, I can dynamically add Azure storage disks on demand and then add them to the ZFS pool, which will dynamically provide the additional storage to all running containers. Cool hah? :)

Also with ZFS file system you can clone, snapshot your file system at any moment. You can also Hot-Add additional storage any time, which will be immediately visible to the OS as it is running. ZFS also provides RAID-0 (I've chosen) and RAID-Z provisioning.

1. Provisioned the Ubuntu VM inside Azure. Installed LXDE, RDP packages to get the remote desktop on my Local Machine

2. Installed ZFS Tools and created a new ZFS Pool containing the 4x250GB disks (1TB)

4. Mounted ZFSPool so that, LXD directories will reside inside the ZFS

5. Installed LXD, and updated networking settings (Changed the LAN subnet IPs)

6. Pulled Ubuntu-Trusty-AMD64 image from LXD repository

7. Created the first LXD container (HDNameNode-1) and configured
Hadoop Single Node Cluster In it

8. Snapshotted the ZFS file system, to keep my Hadoop-Single-Node-Cluster for later retrieval if required

9. Now I've updated the HDNameNode-1 container to have multi-node configurations (which all nodes in a cluster should have).
(I've used tutorials 1 & 2). Tutorial-2 is for Hadoop 1.0+, so I had to rely on Tutorial-1 to get the settings for a Hadoop 2.0+ multi-node configuration using YARN.

10. Cloned HDNameNode-1 container using LXD utility, to have 3-more containers which will act as Data Nodes
(HDDataNode-1, HDDataNode-2, HDDataNode-3)

11. Updated IP and Network settings for Name/Data nodes as desired

12. Updated HDNameNode-1 container, to have name-node specific multi-node configurations

13. Updated all DataNode containers, to have Data-node specific multi-node configurations

14. Restarted all containers, and started Hadoop on the Name Node, which in turn powered up the Data Nodes.

15. Now I'm running a 4-Node-Hadoop-Cluster.

16. I can clone any Data Node any time, to have more data nodes if desired in future

17. Snap-shotted my ZFS pool for disaster recovery.

A few commands Listed below for your reference:

---Add ZFS package

apt-get update
apt-get install ubuntu-zfs

---Create ZFS Pool and Mount for LXD directories
sudo zpool create -f ZFS-LXD-Pool sdc sdd sde sdf -m none
sudo zfs create -p -o mountpoint=/var/lib/lxd    ZFS-LXD-Pool/LXD/var/lib
sudo zfs create -p -o mountpoint=/var/log/lxd    ZFS-LXD-Pool/LXD/var/log
sudo zfs create -p -o mountpoint=/usr/lib/lxd    ZFS-LXD-Pool/LXD/usr/lib
---Install LXD and update network settings for containers
add-apt-repository ppa:ubuntu-lxc/lxd-stable
apt-get update
sudo apt-get install lxd
#To update the subnet to and DHCP Leases
sudo vi /etc/default/lxc-net
sudo service lxc-net restart
--Add Ubuntu Trusty AMD64 Container Image and create first container, to setup SingleNodeCluster
lxc remote add images
lxc image copy images:/ubuntu/trusty/amd64 local: --alias=Trusty64
lxc launch Trusty64 HDNameNode-1
lxc stop HDNameNode-1
zfs snapshot  ZFS-LXD-Pool/LXD/var/lib@Lxd-Base-Install-With-Trusty64
zfs list -t snapshot

***Configure Hadoop Single Node Cluster On first container and Snapshot it

**change ip to

lxc start HDNameNode-1
lxc exec HDNameNode-1 /bin/bash
lxc stop HDNameNode-1
zfs snapshot  ZFS-LXD-Pool/LXD/var/lib@Hadoop-Single-Node-Cluster
zfs list -t snapshot

***Update Name Node configuration to have Multi-Node-Cluster-Settings

-----Clone Name Node to 3 Data Nodes

lxc copy HDNameNode-1 HDDataNode-1
lxc copy HDNameNode-1 HDDataNode-2
lxc copy HDNameNode-1 HDDataNode-3
#Change IPs to,4 and 5

***Update all Data Node configurations to have Multi-Node-Cluster-Settings

---Snapshot the Multi Node Cluster