Monday, December 21, 2015

Setting up Hadoop 2.0+ - Hadoop @ Desk (Single Node Cluster)

This is part II of the series about Hadoop @ Desk. Hope you’ve gone through the setup prerequisites in Part I, and are ready to continue with the Hadoop 2.0+ setup.

As discussed in part I, I’ve set up my permanent desktop OS (the Host OS; it’s also dual-booting with Windows 7) as Lubuntu 14.04 LTS. I’ve also set up a Type 1 virtualization suite, Qemu-KVM, inside my Host. I’m using the exact same OS (Lubuntu 14.04 LTS) for my Guest as well. I’ve set up my Guest using the Virt-Manager 1.0 UI.

I’ve compiled the steps based on some external blogs (read here, here and here).

Power on your guest now. Steps below (all steps are done on your Guest, at the bash command line):

A. Setting up the base environment for Hadoop

1. Update your system sources

$ sudo apt-get update

2. Install Java (Open JDK)

$ sudo apt-get install default-jdk

3. Add a dedicated Hadoop group and user. Then add the ‘hduser’ user to the ‘sudo’ group.

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

$ sudo adduser hduser sudo

4. Install Secure Socket Shell (ssh)

$ sudo apt-get install ssh

5. Setup ‘ssh’ certificates

$ su hduser

$ ssh-keygen -t rsa -P ""

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

$ ssh localhost

The 3rd command adds the newly created key to the list of authorized keys so that Hadoop can use ssh without prompting for a password.

Note: While prompting for the file name, just hit enter to accept the default file name.

At this point, snapshot your VM and restart the VM.

B. Setting up Hadoop

Now that we’ve set up the basic environment required by Hadoop, we will move on to installing Hadoop 2.7. Please note that I am setting up the entire Hadoop installation on a dedicated ext4 partition (not the root partition) for better management. You need to replace this path as per your environment (i.e. replace /media/SYSTEM as per your environment).

Note: I’ve underlined the paths in the below steps, that you need to replace with your own.


6. Get the core Hadoop package (version 2.7). Extract it, then move it to our dedicated Hadoop partition.

$ sudo chmod 777 /media/SYSTEM

$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

$ tar xvzf hadoop-2.7.0.tar.gz

$ sudo mv hadoop-2.7.0 /media/SYSTEM/hadoop

$ sudo chown -R hduser:hadoop /media/SYSTEM/hadoop

Also create a ‘tmp’ folder, which will be used by hadoop later.

$ sudo mkdir -p /media/SYSTEM/hadoop/tmp
$ sudo chown hduser:hadoop /media/SYSTEM/hadoop/tmp



C. Setting up the Hadoop Configuration Files

The following files will have to be modified to complete the Hadoop setup:  (Replace the paths based on your environment)


7. Update .bashrc

Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed to set the JAVA_HOME environment variable using the following command:

$ update-alternatives --config java

Append the below to the end of .bashrc. Change ‘JAVA_HOME’ and ‘HADOOP_INSTALL’, as per your environment.

$ vi ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/media/SYSTEM/hadoop/hadoop-2.7.0
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
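In addition to the three exports above, single-node guides for Hadoop 2.x commonly add the lines below as well; treat these as an assumption about what the original post listed, and adjust the install path to your environment. Without the PATH additions, the hadoop/hdfs commands used in the later steps won't be found:

```shell
# Assumed supplementary exports -- verify against your own environment.
HADOOP_INSTALL="${HADOOP_INSTALL:-/media/SYSTEM/hadoop/hadoop-2.7.0}"
export PATH="$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin"
export HADOOP_MAPRED_HOME="$HADOOP_INSTALL"
export HADOOP_COMMON_HOME="$HADOOP_INSTALL"
export HADOOP_HDFS_HOME="$HADOOP_INSTALL"
export YARN_HOME="$HADOOP_INSTALL"
```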

8. Update ‘hadoop-env.sh’

We need to set JAVA_HOME by modifying the hadoop-env.sh file. Adding the statement below ensures that the value of the JAVA_HOME variable is available to Hadoop whenever it starts up.

$ vi /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64



9. Update ‘core-site.xml’

The ‘core-site.xml’ file contains configuration properties that Hadoop uses when starting up.  This file can be used to override the default settings that Hadoop starts with.

Open the file and enter the following in between the <configuration></configuration> tag:

$ vi /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/media/SYSTEM/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>




10. Update ‘mapred-site.xml’

Create ‘mapred-site.xml’ from the existing template available with the Hadoop installation:

$ cp /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/mapred-site.xml.template /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

11. Update ‘hdfs-site.xml’

The ‘/media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/hdfs-site.xml’ file needs to be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.

Before editing this file, we need to create two directories which will contain the namenode and the datanode for this Hadoop installation. 
This can be done using the following commands:

$ sudo mkdir -p /media/SYSTEM/hadoop/hadoop_store/hdfs/namenode
$ sudo mkdir -p /media/SYSTEM/hadoop/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /media/SYSTEM/hadoop/hadoop_store


Open the file and enter the following content in between the <configuration></configuration> tag:

$ vi /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/media/SYSTEM/hadoop/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/media/SYSTEM/hadoop/hadoop_store/hdfs/datanode</value>
</property>



At this point, snapshot your VM and restart the VM to make the changes take effect.


D. Format the HDFS file system

Now, the Hadoop file system needs to be formatted so that we can start to use it. The format command should be issued with write permission, since it creates a ‘current’ directory under the ‘/media/SYSTEM/hadoop/hadoop_store/hdfs/namenode’ folder:

12. Format Hadoop file system

$ su hduser

$ hadoop namenode -format

Note that the hadoop namenode -format command should be executed only once, before we start using Hadoop. If this command is executed again after Hadoop has been used, it'll destroy all the data on the Hadoop file system.
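Since a second format wipes everything, a small guard like the sketch below is a reasonable safety net (my own addition, not from the original post): the first successful format creates a ‘current’ subdirectory under the namenode folder, so its presence means formatting already happened.

```shell
#!/bin/sh
# Format HDFS only once: skip if the namenode folder already holds 'current'.
NAMENODE_DIR="/media/SYSTEM/hadoop/hadoop_store/hdfs/namenode"  # adjust to your path
if [ -d "$NAMENODE_DIR/current" ]; then
    echo "Namenode already formatted; skipping."
else
    echo "Formatting namenode..."
    # hadoop namenode -format    # uncomment on a real installation
fi
```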

E. Start/Stop Hadoop

Now it's time to start the newly installed single node cluster. 

13. Start Hadoop

We can use start-all.sh, or (start-dfs.sh and start-yarn.sh):

$ start-all.sh


We can check if it's really up and running (for this setup, jps should list daemons such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager):

$ jps

You can also verify by hitting the below URL in the browser.

http://localhost:50070/ - web UI of the NameNode daemon



14. Stop Hadoop

We run stop-all.sh, or (stop-dfs.sh and stop-yarn.sh), to stop all the daemons running on our machine:

$ stop-all.sh



15. Test your installation

For testing, I’ve moved a small file to the HDFS file system and displayed its content from there.
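A smoke test of that shape might look like the sketch below; the file and HDFS paths are my own illustrations, and the commands assume the cluster from step 13 is up and the Hadoop bin directory is on your PATH:

```shell
echo "hello hadoop" > /tmp/smoke.txt          # a small local test file
hdfs dfs -mkdir -p /user/hduser               # create a home directory in HDFS
hdfs dfs -put /tmp/smoke.txt /user/hduser/    # copy the file into HDFS
hdfs dfs -cat /user/hduser/smoke.txt          # display its content from HDFS
```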


16. Snapshot your VM

Now you have a working version of Hadoop. Snapshot your VM at this point, so that if you have any issues with future experiments, you can roll the VM back to this known working state of Hadoop.


Sunday, December 20, 2015

Prepare Yourself - Hadoop @ Desk (Single Node Cluster)

The whole idea of setting up Hadoop on your very own desktop or laptop seems a bit crazy, especially if you have an entry-level configuration for your system. This is one of the primary reasons why many stay away from the Big Data development experience. Many think they need servers, or at least multiple machines (a ready-made lab environment), to set up Hadoop, and they never think about playing with it at home.

I had the same conviction, till I got a working Hadoop 2.0 on my desktop (with low specs: a dual-core processor with 4GB of RAM). So I thought sharing my experience would be helpful for the many having the same thoughts. In this blog, I will lay down the prerequisites, or environment preferences, needed for a usable Hadoop installation on your desktop/laptop.
Based on my experience, I've found the choices below to be extremely important for success. In other words, these are the same bottlenecks which restrict many from using Hadoop on a typical desktop or laptop.

A. Prefer a hands-on installation of Hadoop over Quick Start VMs

Many choose the easy way out for experiencing Hadoop. There are quick start VMs available from Cloudera and Hortonworks. I agree that you can jump directly into using the system if you’ve a decent configuration. But it has the following cons.
i.  The prebuilt VMs pack a huge set of modules which you will never use. They eat up your precious system resources, clogging your system down, while you may only use a few modules in actual scenarios.
ii.  You never see the bare bones of your Hadoop system, which is essential for better understanding your Map-Reduce programs and for troubleshooting low-level issues.
In essence, go for installing the Hadoop system on your own. Keep only the modules you actually want.

e.g. I’ve installed Hadoop from scratch, starting from hadoop core.

B. Prefer a lightweight Desktop Host OS

Once you are ready to handle the Hadoop installation by yourself, the next hurdle is choosing the desktop OS which best suits the need. As Hadoop may need a good chunk of system resources, the key point is to choose a Host OS which is as light on system resources as possible, with a decent GUI desktop and the features needed for our daily use.

e.g. I’ve chosen Lubuntu 14.04 LTS as my Host OS, which after boot-up consumes only 160MB of RAM with a working desktop.

C. Prefer Virtualization Technology over bare installation on the Host system

Many get lost with the Hadoop installation when they mess too much with their primary desktop OS. So the rule of thumb is: never pollute your primary desktop OS (Host OS). Keep it only for your daily tasks like browsing, document editing and other personal stuff. Separate the serious stuff into Linux Containers or Virtual Machines (Guest or Guest OS).
e.g. I’ve set up the whole Hadoop system inside a Virtual Machine, and also inside Linux Containers.

D. Prefer a superior Virtualization Technology

It's also important to choose the right virtualization suite, as virtualization itself adds some performance overhead. So choose the one with the least overhead. Linux Containers are the top-notch choice for this purpose; both LXC and LXD are a good fit for setting up the cluster. If you go with Virtual Machines, Type 1 hypervisors perform better than Type 2 ones. Many choose VMware Player or VirtualBox, which are essentially Type 2 and significantly slower. Choose a Type 1 hypervisor instead in such cases, like KVM (though debates are still going on about its Type 1 status). If Docker/containers work for you, that is the best option.
One more option to maximize performance is to para-virtualize the guest OS virtual devices, rather than fully virtualizing them.

The other good thing about using containers/virtualization over the host machine is that you can easily save the state of the Container/Guest OS (snapshots) at any point in time. Since the Hadoop installation steps are pretty long and error-prone, take a snapshot of your Container/VM once in a while. When something goes wrong, restore the Container/VM back to the most recent working state, so that you won't lose days of precious hard work.
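With Qemu-KVM, internal snapshots can be taken from Virt-Manager or from the command line; a sketch with virsh is below (the domain name 'hadoop-vm' and the snapshot name are hypothetical, for illustration):

```shell
virsh snapshot-create-as hadoop-vm clean-base "Guest before Hadoop setup"  # take a snapshot
virsh snapshot-list hadoop-vm                                              # review snapshots
virsh snapshot-revert hadoop-vm clean-base                                 # roll back when needed
```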

e.g. I’ve chosen LXD/LXC Containers with ZFS to set up my cluster.
If your choice is Virtual Machines, I recommend Qemu-KVM (Kernel Virtual Machine) under Linux, using internal snapshots. I’ve para-virtualized the Display/Disk/Network/Memory access. I’ve detailed both concepts in this blog.

E. Prefer a lightweight Guest OS

This step is similar to choosing your Host OS. Your Container/VM (Guest OS) should also be as light as possible, to leave more resources for Hadoop. If you are using Linux Containers, choose the Ubuntu LTS image. For Virtual Machines, you can choose Ubuntu Core or Ubuntu Server if you're comfortable with the CLI. Ubuntu Core is as light as possible and will provide a better Hadoop experience than the others. You can also choose an OS with a GUI desktop, but select one which is very lightweight.

e.g. For my LXD container, I’ve chosen the Trusty container image. For clusters using Virtual Machines, I’ve chosen Lubuntu 14.04 LTS as my Guest OS as well.

We will see the actual Hadoop 2.0+ setup in Lubuntu (Host/Guest) in the next blog. Thank you!

Tuesday, September 8, 2015

Raspberry Pi2 as a Router with Network Failover

This experiment is all about using the Pi2 as a home router with network failover. Assume you’ve both wired internet (a dial-up/ADSL modem) and mobile 3G internet (e.g. a 3G USB dongle). Your wired internet can be considered the primary, and the mobile 3G internet the backup or secondary.

The requirement is to provide internet through the primary by default. As soon as the primary network goes down, switch to the secondary. Your network clients (i.e. the PCs, smartphones and laptops connected to your local network through Ethernet and WiFi) should not need to be reconfigured; they should automatically be routed to the secondary network for internet access when the primary goes down.

Here the Pi2 is configured as the single gateway for all the network clients. By default it routes packets destined for the internet to the primary network. Whenever the primary is not available, it dials in the secondary network and routes packets through it. When the primary is back online, the Pi2 disconnects the secondary and routes packets back through the primary. All packets from the attached clients are routed to the Pi2 first, where the actual routing happens. The Pi2 is running Ubuntu 14.04.

The conceptual diagram is shown below.


The configuration steps are detailed below.

1.  Primary internet is available through the ADSL modem, which is connected to a Wireless router.

2.  The DHCP server is disabled in the Wireless router, since DHCP requests will be served by the Pi2.

3.  The Pi2 is connected to the Wireless Router over Ethernet, through its ‘eth0’ interface. It is the default gateway for the LAN-connected devices.

4.  The DNSMASQ utility has been configured to run on the Pi2 as a DHCP service, which serves IP addresses to the LAN-connected devices.

The DHCP range and gateway are configured so that the advertised gateway is the Pi2 itself, and the clients therefore route their packets to the Pi2.
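For reference, the DHCP side of step 4 is only a few lines of dnsmasq configuration; every address below is a placeholder for illustration, since the post's actual addresses aren't shown:

```
# Hypothetical /etc/dnsmasq.conf fragment
interface=eth0                               # serve DHCP on the LAN-facing interface
dhcp-range=192.168.1.50,192.168.1.150,12h    # address pool for LAN clients
dhcp-option=3,192.168.1.10                   # option 3 = default gateway (the Pi2's own IP)
```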

5.  The Pi2 sets its default route for the internet, in the IP routing table, to the Wireless Router.

i.e. route add default gw <wireless-router-ip>

6.  The Pi2 is also connected to a 3G USB modem, which has been configured with the ‘wvdial’ utility; this is the secondary network. It is recognized as the ‘ppp0’ interface on the Pi2.

7.  The Pi2 is configured to forward packets from eth0 to ppp0 through NAT (using iptables).
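Step 7 typically boils down to enabling IP forwarding plus a masquerade rule; a minimal sketch (not necessarily the post's exact rules) is:

```shell
sudo sysctl -w net.ipv4.ip_forward=1                        # let the Pi2 route between interfaces
sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE   # NAT LAN traffic out via the 3G link
sudo iptables -A FORWARD -i eth0 -o ppp0 -j ACCEPT
sudo iptables -A FORWARD -i ppp0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
```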

8.  When the internet is not available through the primary (i.e. through eth0), the Pi2 has to dial ppp0.

9.  The Pi2 has to delete the default route through the Wireless Router, so that from then on packets are routed via ppp0 to reach the internet.

Note: You can check internet connectivity through any interface using the ‘ping’ command, where you can optionally give the interface name as well (in this case eth0).

10.  Once the primary network is back, ppp0 is brought down. The default route is added back to the Wireless router, so that packets are routed through it from then on.

11.  Steps 8-10 will be done through a shell script running in the background.
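The background script from step 11 could be sketched as below. The gateway address, ping target and the way the 3G modem is dialed are assumptions for illustration; adjust them for your setup:

```shell
#!/bin/sh
# Failover loop: prefer the wired gateway (eth0), fall back to 3G (ppp0).
GW=192.168.1.1            # hypothetical wireless-router address
on_secondary=0
wvdial_pid=""
while true; do
    if ping -c 2 -I eth0 8.8.8.8 >/dev/null 2>&1; then
        # Primary is reachable again; if we were on 3G, switch back (step 10).
        if [ "$on_secondary" -eq 1 ]; then
            [ -n "$wvdial_pid" ] && kill "$wvdial_pid" 2>/dev/null   # hang up ppp0
            route add default gw "$GW"
            on_secondary=0
        fi
    elif [ "$on_secondary" -eq 0 ]; then
        # Primary is down; drop its route and dial the 3G modem (steps 8-9).
        route del default gw "$GW" 2>/dev/null
        wvdial >/dev/null 2>&1 &
        wvdial_pid=$!
        on_secondary=1
    fi
    sleep 30
done
```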

Monday, September 7, 2015

Setting up Wifi using Command Line - Raspberry Pi2 (Ubuntu 14.04)

Note: This tutorial is based on Ubuntu 14.04 (for the Raspberry Pi2, it’s the ARM version of Ubuntu 14.04; the installation procedure is detailed here).

Though we can rely on the Network Manager GUI, setting up WiFi using the command line has some obvious advantages in certain scenarios, e.g. bridging your Ethernet and WiFi interfaces.

The tutorial below discusses setting up WiFi using the command line alone. Before moving further, I hope you have set up your WiFi card in Ubuntu, with the necessary firmware and driver. This tutorial is highly recommended if you’ve not.

The first thing you’ve to do is install certain packages that make the WiFi setup a breeze under a Debian environment. Install the following packages.

sudo apt-get install wireless-tools
sudo apt-get install wpasupplicant

Once done, let's find the WiFi interface which is ready to be used from the command line. Issue the below command.

iwconfig
A sample output is given below:

br0       no wireless extensions.

eth0      no wireless extensions.

lo        no wireless extensions.

wlan0 unassociated  Nickname:"rtl_wifi"
          Mode:Managed  Access Point: Not-Associated   Sensitivity:0/0 
          Retry:off   RTS thr:off   Fragment thr:off
          Power Management:off
          Link Quality:0  Signal level:0  Noise level:0
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:0   Missed beacon:0

The result shows that wlan0 is the wireless interface that can be used. Now let's scan for wireless networks using either of the below commands. The former scans for every available wireless network in range. If you already know the SSID of the WiFi network you want to connect to, you can use the latter command, which specifies the exact SSID to scan for.

iwlist wlan0 scanning
sudo iwlist wlan0 scanning essid "Your Wifi SSID"

The below is a sample output.

wlan0     Scan completed :
                   Cell 01 - Address: AC:F1:DF:CD:2B:D4
                    ESSID:"My Wifi SSID"
                    Protocol:IEEE 802.11bgn
                    Frequency:2.412 GHz (Channel 1)
                    Encryption key:on
                    Bit Rates:1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s; 6 Mb/s
                              9 Mb/s; 12 Mb/s; 18 Mb/s; 24 Mb/s; 36 Mb/s
                              48 Mb/s; 54 Mb/s
                    IE: WPA Version 1
                        Group Cipher : TKIP
                        Pairwise Ciphers (2) : TKIP CCMP
                        Authentication Suites (1) : PSK
                    IE: IEEE 802.11i/WPA2 Version 1
                        Group Cipher : TKIP
                        Pairwise Ciphers (2) : TKIP CCMP
                        Authentication Suites (1) : PSK
                    Signal level=58/100

Identify and note the key details above (the ESSID, the encryption settings and the WPA/WPA2 authentication suites). We require those details to set up the ‘wpa_supplicant’ configuration, which we will discuss shortly.

Now we’ve all the details necessary to build our ‘wpa_supplicant.conf’, except one: we’ve to generate the PSK, using the WiFi SSID and the secret passphrase used to connect to your WiFi router. The below command generates it for you.

wpa_passphrase "YourWifiSSID" "YourWifiSecretPassPhrase"
This will generate the below output, which can be copied to your ‘wpa_supplicant.conf’ file.
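The output has the shape shown below; the psk value is a placeholder here, as wpa_passphrase derives a 64-hex-digit key from your actual SSID and passphrase:

```
network={
        ssid="YourWifiSSID"
        #psk="YourWifiSecretPassPhrase"
        psk=<64-hex-digit-key-printed-by-wpa_passphrase>
}
```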

Now using the scan result and the passphrase psk details, fill in the ‘/etc/wpa_supplicant.conf’ file. Use the below commands.

sudo leafpad /etc/wpa_supplicant.conf

Now update this with the above details. A sample is given below;
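The original sample isn't preserved here, so below is a representative configuration for a WPA/WPA2-PSK network like the one in the scan output above; the values are illustrative and should be replaced with your own:

```
ctrl_interface=/var/run/wpa_supplicant

network={
        ssid="My Wifi SSID"
        proto=WPA RSN            # WPA1 and WPA2, as advertised in the scan
        key_mgmt=WPA-PSK         # the PSK authentication suite from the scan
        pairwise=TKIP CCMP
        group=TKIP
        psk=<psk-generated-by-wpa_passphrase>
}
```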



To know about the different configurations options in ‘wpa_supplicant.conf’, refer this FreeBSD reference manual.

Save the file. We're now all set; issue the below commands to test connecting to our WiFi router.

sudo wpa_supplicant -Dnl80211 -iwlan0 -c/etc/wpa_supplicant.conf

(or use this command if the above driver is not supported by your device:
sudo wpa_supplicant -Dwext -iwlan0 -c/etc/wpa_supplicant.conf)

You can use the -d or -dd switch with the above command to get a detailed log of the connection. If everything went well, you can see logs on the console saying the association succeeded. Now you're connected to your WiFi router. From another terminal, issue the below command.

sudo dhclient wlan0

This will initiate DHCP requests to the router, and the WiFi interface (wlan0) will be configured with a dynamic IP address issued by the WiFi router. Once done, try accessing the internet or other machines in your network, to test whether you're actually connected.

To make these settings permanent (so that you are automatically connected on boot-up), edit the network interfaces file and add the below content.

sudo leafpad /etc/network/interfaces

Append the below configuration.

auto wlan0
iface wlan0 inet dhcp
wpa-conf /etc/wpa_supplicant.conf

Now during every boot, your machine will automatically get connected to wifi. Happy surfing!




Saturday, August 29, 2015

Configuring WIFI USB Adapter with Raspberry Pi2

In this tutorial, we will discuss how to configure a WiFi USB adapter with the Raspberry Pi2 in station/client mode. I hope you already have a WiFi router to which your Pi2 can connect using the adapter.
I usually use my Pi2 as a download box, to download large files which sometimes take days to complete. The advantages over downloading on your PC are quite obvious: the Pi2 consumes very little energy, is silent, and is cost effective. If you use your PC for downloading large files, you’ve to keep it on for a long time, hence it consumes more energy than the Pi2. Since it has spinning hard disks and fans, the noise also makes it a less attractive choice for late-night downloads.
For this experiment, I’ve used my Belkin USB WiFi adapter (F7D2101 802.11n Surf & Share Wireless Adapter v1000). Internally it uses the Realtek RTL8192SU chipset. I’ve referred to this Raspberry thread to compile the details. Though this article uses the Belkin device, the steps are generic, so you can apply them to your own device by substituting the firmware and driver for your specific chipset.
The steps are detailed below.
1. Have a RaspberryPi2 with Ubuntu14.04 installed. (Follow this tutorial)
Note: Install Lubuntu Desktop, as per the tutorial.
2. Plug in your WIFI USB Adapter and then issue the command ‘lsusb’ (in LXTerminal)
Now find your USB device listed in the output. For mine, it's something like,
`Bus 001 Device 004: ID 050d:845a Belkin Components F7D2101 802.11n Surf & Share Wireless Adapter v1000 [Realtek RTL8192SU]`
Now we’ve to install the firmware and the kernel driver for the device, if not already installed. In this scenario, I’ve to install the firmware and driver for the ‘RTL8192SU’, as the Belkin adapter internally uses this chipset.
3. Installing the WIFI USB Firmware
This is specific to your device. There are various Realtek chipsets, each supported by a specific driver and firmware module. You can find this mapping in this wiki. According to these mappings, search for your device and find the name of the driver and firmware you’ve to use.
Then download the firmware for the various Realtek chipsets from here. By referring to this link, I’ve found that I’ve to install this firmware for the RTL8192SU.
Now download the firmware and copy it to /lib/firmware/rtlwifi/
(In this scenario, I’ve /lib/firmware/rtlwifi/rtl8712u.bin)
4. Installing the WIFI USB Driver
Edit the file "/etc/modprobe.d/blacklist.conf" and add the line: blacklist rtl8192cu
Download the kernel driver for your device. There are various Realtek chipsets, each supported by a specific driver module; you can find this mapping in this wiki. In our case the module name is ‘8192cu.ko’. You can either compile it from source for ARM (which is hard) or download the precompiled one (for ARM) from here.
Open LXTerminal and then issue the below commands;
tar xf 8192cu.tar.gz
sudo install -p -m 644 8192cu.ko /lib/modules/3.18.0-14-rpi2/kernel/drivers/net/wireless
(Verify the above path, as it can differ based on your kernel version; in our case it is 3.18.0-14-rpi2)
sudo depmod -a
5. Reboot and Configure WIFI
After the reboot, your WIFI adapter's LED should be blinking. Now you can either use Network Manager to add the WIFI networks, or use the wpa_supplicant system to configure the WIFI.
PS: Download the driver and firmware from here, if you are not able to access the aforementioned links.