Monday, March 11, 2013

Installing Cloudera Hadoop on Mac

Do not search for it in the App Store

I know that some of you searched for Hadoop in the App Store :P .. It ain't there yet, too bad isn't it? And no, it's not as easy as "sudo apt-get install hadoop" :)

But do not worry, it's not that difficult either. It's much like installing Apache Hadoop on Linux distributions. All it takes is a few simple baby steps to install the elephant on your Mac.

This tutorial will guide you through installing Hadoop in pseudo-distributed mode on your Mac.

Step-1: Passwordless SSH


Try SSH-ing into your own machine:

$ ssh localhost

Without passwordless SSH, you will need to enter your password every time you log in to your own system through ssh. To set it up, generate a key pair:

$ ssh-keygen

The command will ask for the location of the id_rsa and id_rsa.pub keys; press Enter to accept the default location.
It will then ask for a passphrase; just press Enter to have no passphrase at all (after all, we are trying to achieve passwordless SSH).
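
(If you'd rather skip the prompts entirely, ssh-keygen can take the answers on the command line. A minimal one-liner, assuming you want the default RSA key location and it doesn't already exist:)

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa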

The command sometimes ends with a weird randomart image of the key, like this:

+--[ RSA 2048]----+
|               -.|
|              .+o|
|              A=o|
|       .      ..=|
|        P .. . O |
|       . . .o    |
|          . F.. .|
|           +.o.o |
|          o O=+  |
+-----------------+

or just a cryptic fingerprint like this: a2:b1:5e:6f:2a:a2:d7:3f:d1:e5:5a:aa:ab:c5:e8:2a

But yeah, don't get scared; you do not have to remember them :P
Now go to your home directory and copy the public key into authorized_keys.

$ cd /Users/rajgopalv
$ cd .ssh
$ cp id_*.pub authorized_keys
$ ssh localhost


Sometimes, when you log in to a host for the first time, you will be greeted with a security warning of this kind:


The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is <<[then finger print]>>
Are you sure you want to continue connecting (yes/no)?

Just type "yes" and continue. The system should not ask for a password, and you should be able to log in successfully.
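
If ssh still asks for a password after this, the usual suspect is permissions: sshd ignores keys it considers too open. A quick fix worth trying (a sketch, assuming the default locations):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys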


Step-2: Download the CDH4 tarballs.


Hadoop and its family of software are available as tarballs here: https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs

Go ahead and download the stuff you want.
But to begin with, let me start by downloading the "hadoop-2.0.0+922" tarball. If you want to run MapReduce version 1 (i.e. not YARN), then download the "hadoop-0.20-mapreduce-0.20.2+1341" tarball too (recommended).

Now, unpack these tarballs and place them wherever you want them installed. I personally prefer a "Softwares" folder in my home directory.

$ pwd
/Users/rajgopalv/Softwares
$ ls -ld hadoop*
drwxr-xr-x@ 14 rajgopalv  1668562246  476 Feb  7 07:00 hadoop-2.0.0-cdh4.1.3
drwxr-xr-x@ 29 rajgopalv  1668562246  986 Feb  6 11:20 hadoop-2.0.0-mr1-cdh4.1.3
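
If you prefer to stay in the terminal, fetching and unpacking looks roughly like this; the download URL below is only a placeholder, use whatever link the Cloudera page actually gives you:

$ cd /Users/rajgopalv/Softwares
$ curl -O http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.3.tar.gz
$ tar -xzf hadoop-2.0.0-cdh4.1.3.tar.gz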

Step-3: Configure

DFS configuration:
Go to the hadoop-2.0.0-cdh4.1.3/etc/hadoop directory and edit core-site.xml to look like this:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/rajgopalv/hadoop/data</value>
        <!-- of course, you can use any directory you want. -->
    </property>
</configuration>

and hdfs-site.xml to look like this:

<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
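
One thing worth doing right now: the directory you pointed hadoop.tmp.dir at should exist and be writable by you (more on that in the troubleshooting section below), so create it up front:

$ mkdir -p /Users/rajgopalv/hadoop/data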

Similarly, configure MapReduce. Go to the hadoop-2.0.0-mr1-cdh4.1.3/conf/ directory and edit mapred-site.xml to look like this:

<configuration>
 <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
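
The MRv1 daemons also need to know where HDFS lives. The steps above don't strictly require it, but if the JobTracker later complains that it cannot reach the filesystem, copying the same core-site.xml (and hdfs-site.xml) into the MRv1 conf directory is a reasonable thing to try:

$ cd /Users/rajgopalv/Softwares
$ cp hadoop-2.0.0-cdh4.1.3/etc/hadoop/core-site.xml hadoop-2.0.0-mr1-cdh4.1.3/conf/
$ cp hadoop-2.0.0-cdh4.1.3/etc/hadoop/hdfs-site.xml hadoop-2.0.0-mr1-cdh4.1.3/conf/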

Step-4: Run!

Now it's time to format and start the DFS.
Go to the hadoop-2.0.0-cdh4.1.3 folder in your terminal:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3
$ bin/hdfs namenode -format

13/03/12 00:27:06 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Nucleus/192.168.2.106
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.0.0-cdh4.1.3
STARTUP_MSG:   classpath = /Users/rajgopalv.......... [etc etc..]

************************************************************/

blah blah blah

13/03/12 00:27:07 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/03/12 00:27:07 INFO util.ExitUtil: Exiting with status 0
13/03/12 00:27:07 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Nucleus/192.168.2.106
************************************************************/

The important thing to notice is "Exiting with status 0". Status 0 indicates all is well. :)
Now start the DFS.


$ sbin/start-dfs.sh

Now, http://localhost:50070/dfshealth.jsp should display the health of your DFS.
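
You can also poke the namenode from the command line instead of the browser; dfsadmin prints a quick summary of capacity and live datanodes:

$ bin/hdfs dfsadmin -report
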
To start the MapReduce module, go to the hadoop-2.0.0-mr1-cdh4.1.3 directory in your terminal:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-mr1-cdh4.1.3
$ bin/start-mapred.sh

Now, http://localhost:50030/jobtracker.jsp should show your MapReduce jobs.
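
Before celebrating, a quick smoke test doesn't hurt: push a file into HDFS, read it back, and fire a tiny example job. The examples jar name below is what my tarball shipped with; double-check yours with ls:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3
$ echo "hello elephant" > /tmp/hello.txt
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/rajgopalv
$ bin/hdfs dfs -put /tmp/hello.txt /user/rajgopalv/
$ bin/hdfs dfs -cat /user/rajgopalv/hello.txt
$ cd ../hadoop-2.0.0-mr1-cdh4.1.3
$ bin/hadoop jar hadoop-examples-2.0.0-mr1-cdh4.1.3.jar pi 2 10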

Bravo! You are good to go.


Possible things that could go wrong: 

Set your JAVA_HOME before you start anything.

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
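
On some OS X versions that framework path does not exist; asking the system where Java actually lives is more reliable:

$ export JAVA_HOME=$(/usr/libexec/java_home)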

Try different port numbers in the configuration.

Although I've shown the DFS configured on port 8020 and MapReduce on 8021 here, some other software might already be using these ports, so feel free to try different ones.
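
A quick way to check whether something is already squatting on a port (8020 here) before you pick a new one:

$ lsof -i :8020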


Do you have permissions on the hadoop.tmp.dir?

The directory that you specified in hadoop.tmp.dir must be writable by you. This is the reason I've specified a directory under my home directory itself.

When in doubt, check out the *.log files in hadoop-2.0.0-cdh4.1.3/logs/ and hadoop-2.0.0-mr1-cdh4.1.3/logs/. They can be a little cryptic if you are a beginner, but you will get used to them :)
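
A handy way to watch the latest complaints while you restart things (adjust the path to whichever daemon is misbehaving):

$ tail -f /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3/logs/*.log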

Let me know if there are any doubts!



Comments:

  1. got this :

    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ bin-mapreduce1/start-mapred.sh
    +================================================================+
    | Error: HADOOP_HOME is not set correctly |
    +----------------------------------------------------------------+
    | Please set your HADOOP_HOME variable to the absolute path of |
    | the directory that contains hadoop-core-VERSION.jar |
    +================================================================+
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ env
    TERM_PROGRAM=Apple_Terminal
    SHELL=/bin/bash
    TERM=xterm-256color
    TMPDIR=/var/folders/8j/4lsh8bvs6pjbs4_0nk9xryyc0000gn/T/
    Apple_PubSub_Socket_Render=/tmp/launch-JvtmhX/Render
    TERM_PROGRAM_VERSION=303.2
    OLDPWD=/Users/Smaikap/hadoop/tmpDir
    TERM_SESSION_ID=393EE8AC-0435-44DB-924C-5400D0D1E627
    USER=Smaikap
    COMMAND_MODE=unix2003
    SSH_AUTH_SOCK=/tmp/launch-8vEwVe/Listeners
    Apple_Ubiquity_Message=/tmp/launch-Ny09iQ/Apple_Ubiquity_Message
    __CF_USER_TEXT_ENCODING=0x1F5:0:0
    PATH=/Volumes/Apps&Data/devTools/sbt/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/devtools/ez:/Applications:.:/opt/X11/bin:/usr/X11/bin
    PWD=/Users/Smaikap/hadoop/hadoop-2.0.0-cdh4.3.0
    HOME=/Users/Smaikap
    SHLVL=1
    LOGNAME=Smaikap
    LC_CTYPE=UTF-8
    DISPLAY=/tmp/launch-0avV0c/org.macosforge.xquartz:0
    _=/usr/bin/env
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ env | grep hadoop
    OLDPWD=/Users/Smaikap/hadoop/tmpDir
    PWD=/Users/Smaikap/hadoop/hadoop-2.0.0-cdh4.3.0
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$

  2. As of CDH4.3, there is no separate tarball for MRv1: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Release-Notes/cdh4rn_topic_3_3.html
