Monday, March 11, 2013

Installing Cloudera Hadoop on Mac

Do not search for it in the App Store

I know that some of you searched for Hadoop in the App Store :P .. It ain't there yet, too bad isn't it? And no, it's not as easy as "sudo apt-get install hadoop" :)

But do not worry, it's not that difficult either. It's much like installing Apache Hadoop on Linux distributions. All it takes is a few simple baby steps to install the elephant on your Mac.

This tutorial will guide you through installing Hadoop in pseudo-distributed mode on your Mac.

Step-1: Passwordless SSH


Try SSH-ing into your own machine:

$ ssh localhost

Without passwordless SSH, you will need to enter your password every time you log in to your own system through ssh. To set it up, generate a key pair:

$ ssh-keygen

The command will ask for the location of the id_rsa and id_rsa.pub keys; press Enter to accept the default location.
It will then ask for a passphrase; just press Enter to have no passphrase at all (after all, we are trying to achieve passwordless SSH).
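
(If you'd rather skip the prompts entirely, ssh-keygen can take the answers on the command line. A minimal one-liner, assuming you want the default RSA key location and it doesn't already exist:)

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa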

The command sometimes ends with a weird randomart image of the key, like this:

+--[ RSA 2048]----+
|               -.|
|              .+o|
|              A=o|
|       .      ..=|
|        P .. . O |
|       . . .o    |
|          . F.. .|
|           +.o.o |
|          o O=+  |
+-----------------+

or just a cryptic fingerprint like this: a2:b1:5e:6f:2a:a2:d7:3f:d1:e5:5a:aa:ab:c5:e8:2a

But yeah, don't get scared; you do not have to remember them :P
Now go to your home directory and copy the public key into authorized_keys.

$ cd /Users/rajgopalv
$ cd .ssh
$ cp id_*.pub authorized_keys
$ ssh localhost


Sometimes, when you log in to a host for the first time, you will be greeted with a security warning of this kind:


The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is <<[then finger print]>>
Are you sure you want to continue connecting (yes/no)?

Just type "yes" and continue. The system should not ask for a password, and you should be able to log in successfully.
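
If ssh still asks for a password after this, the usual suspect is permissions: sshd ignores keys it considers too open. A quick fix worth trying (a sketch, assuming the default locations):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys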


Step-2: Download the CDH4 tarballs.


Hadoop and its family of software are available as tarballs here: https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs

Go ahead and download the stuff you want.
But to begin with, let me start by downloading the "hadoop-2.0.0+922" tarball. If you want to run MapReduce version 1 (i.e. not YARN), then download the "hadoop-0.20-mapreduce-0.20.2+1341" tarball too (recommended).

Now, unpack these tarballs and place them wherever you want them installed. I personally prefer a "Softwares" folder in my home directory.

$ pwd
/Users/rajgopalv/Softwares
$ ls -ld hadoop*
drwxr-xr-x@ 14 rajgopalv  1668562246  476 Feb  7 07:00 hadoop-2.0.0-cdh4.1.3
drwxr-xr-x@ 29 rajgopalv  1668562246  986 Feb  6 11:20 hadoop-2.0.0-mr1-cdh4.1.3
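
If you prefer to stay in the terminal, fetching and unpacking looks roughly like this; the download URL below is only a placeholder, use whatever link the Cloudera page actually gives you:

$ cd /Users/rajgopalv/Softwares
$ curl -O http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.3.tar.gz
$ tar -xzf hadoop-2.0.0-cdh4.1.3.tar.gz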

Step-3: Configure

DFS configuration:
Go to the hadoop-2.0.0-cdh4.1.3/etc/hadoop directory and edit core-site.xml to look like this:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/rajgopalv/hadoop/data</value>
        <!-- of course, you can use any directory you want. -->
    </property>
</configuration>

and hdfs-site.xml to look like this:

<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
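
One thing worth doing right now: the directory you pointed hadoop.tmp.dir at should exist and be writable by you (more on that in the troubleshooting section below), so create it up front:

$ mkdir -p /Users/rajgopalv/hadoop/data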

Similarly, configure MapReduce. Go to the hadoop-2.0.0-mr1-cdh4.1.3/conf/ directory and edit mapred-site.xml to look like this:

<configuration>
 <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
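
The MRv1 daemons also need to know where HDFS lives. The steps above don't strictly require it, but if the JobTracker later complains that it cannot reach the filesystem, copying the same core-site.xml (and hdfs-site.xml) into the MRv1 conf directory is a reasonable thing to try:

$ cd /Users/rajgopalv/Softwares
$ cp hadoop-2.0.0-cdh4.1.3/etc/hadoop/core-site.xml hadoop-2.0.0-mr1-cdh4.1.3/conf/
$ cp hadoop-2.0.0-cdh4.1.3/etc/hadoop/hdfs-site.xml hadoop-2.0.0-mr1-cdh4.1.3/conf/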

Step-4: Run!

Now it's time to format and start the DFS.
Go to the hadoop-2.0.0-cdh4.1.3 folder in your terminal:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3
$ bin/hdfs namenode -format

13/03/12 00:27:06 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Nucleus/192.168.2.106
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.0.0-cdh4.1.3
STARTUP_MSG:   classpath = /Users/rajgopalv.......... [etc etc..]

************************************************************/

blah blah blah

13/03/12 00:27:07 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/03/12 00:27:07 INFO util.ExitUtil: Exiting with status 0
13/03/12 00:27:07 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Nucleus/192.168.2.106
************************************************************/

The important thing to notice is "Exiting with status 0". Status 0 indicates all is well. :)
Now start the DFS.


$ sbin/start-dfs.sh

Now, http://localhost:50070/dfshealth.jsp should display the health of your DFS.
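
You can also poke the namenode from the command line instead of the browser; dfsadmin prints a quick summary of capacity and live datanodes:

$ bin/hdfs dfsadmin -report
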
To start the MapReduce module, go to the hadoop-2.0.0-mr1-cdh4.1.3 directory in your terminal:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-mr1-cdh4.1.3
$ bin/start-mapred.sh

Now, http://localhost:50030/jobtracker.jsp should show your MapReduce jobs.
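
Before celebrating, a quick smoke test doesn't hurt: push a file into HDFS, read it back, and fire a tiny example job. The examples jar name below is what my tarball shipped with; double-check yours with ls:

$ cd /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3
$ echo "hello elephant" > /tmp/hello.txt
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/rajgopalv
$ bin/hdfs dfs -put /tmp/hello.txt /user/rajgopalv/
$ bin/hdfs dfs -cat /user/rajgopalv/hello.txt
$ cd ../hadoop-2.0.0-mr1-cdh4.1.3
$ bin/hadoop jar hadoop-examples-2.0.0-mr1-cdh4.1.3.jar pi 2 10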

Bravo! You are good to go.


Possible things that could go wrong: 

Set your JAVA_HOME before you start anything.

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
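
On some OS X versions that framework path does not exist; asking the system where Java actually lives is more reliable:

$ export JAVA_HOME=$(/usr/libexec/java_home)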

Try different port numbers in the configuration.

Although I've shown the DFS configured on port 8020 and MapReduce on 8021 here, some other software might already be using these ports, so feel free to try different ones.
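
A quick way to check whether something is already squatting on a port (8020 here) before you pick a new one:

$ lsof -i :8020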


Do you have permissions on the hadoop.tmp.dir?

The directory that you specified in hadoop.tmp.dir must be writable by you. This is the reason I've specified a directory under my home directory itself.

When in doubt, check out the *.log files in hadoop-2.0.0-cdh4.1.3/logs/ and hadoop-2.0.0-mr1-cdh4.1.3/logs/. They can be a little cryptic if you are a beginner, but you will get used to them :)
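
A handy way to watch the latest complaints while you restart things (adjust the path to whichever daemon is misbehaving):

$ tail -f /Users/rajgopalv/Softwares/hadoop-2.0.0-cdh4.1.3/logs/*.log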

Let me know if there are any doubts!



Comments:

  1. got this :

    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ bin-mapreduce1/start-mapred.sh
    +================================================================+
    | Error: HADOOP_HOME is not set correctly |
    +----------------------------------------------------------------+
    | Please set your HADOOP_HOME variable to the absolute path of |
    | the directory that contains hadoop-core-VERSION.jar |
    +================================================================+
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ env
    TERM_PROGRAM=Apple_Terminal
    SHELL=/bin/bash
    TERM=xterm-256color
    TMPDIR=/var/folders/8j/4lsh8bvs6pjbs4_0nk9xryyc0000gn/T/
    Apple_PubSub_Socket_Render=/tmp/launch-JvtmhX/Render
    TERM_PROGRAM_VERSION=303.2
    OLDPWD=/Users/Smaikap/hadoop/tmpDir
    TERM_SESSION_ID=393EE8AC-0435-44DB-924C-5400D0D1E627
    USER=Smaikap
    COMMAND_MODE=unix2003
    SSH_AUTH_SOCK=/tmp/launch-8vEwVe/Listeners
    Apple_Ubiquity_Message=/tmp/launch-Ny09iQ/Apple_Ubiquity_Message
    __CF_USER_TEXT_ENCODING=0x1F5:0:0
    PATH=/Volumes/Apps&Data/devTools/sbt/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/devtools/ez:/Applications:.:/opt/X11/bin:/usr/X11/bin
    PWD=/Users/Smaikap/hadoop/hadoop-2.0.0-cdh4.3.0
    HOME=/Users/Smaikap
    SHLVL=1
    LOGNAME=Smaikap
    LC_CTYPE=UTF-8
    DISPLAY=/tmp/launch-0avV0c/org.macosforge.xquartz:0
    _=/usr/bin/env
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$ env | grep hadoop
    OLDPWD=/Users/Smaikap/hadoop/tmpDir
    PWD=/Users/Smaikap/hadoop/hadoop-2.0.0-cdh4.3.0
    smaikap:hadoop-2.0.0-cdh4.3.0 Smaikap$

  2. As of CDH4.3, there is no separate tarball for MRv1: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Release-Notes/cdh4rn_topic_3_3.html
