How to Install Hadoop on Ubuntu

This is how to install Hadoop on Ubuntu Linux in pseudo-distributed mode. The installation is simpler than you might expect.

First, install the packages below.

root@ruo91:~# apt-get -y install ssh rsync java7-jdk

 

Create a hadoop account.

root@ruo91:~# adduser hadoop

 

Log in as the hadoop account.

root@ruo91:~# su -l hadoop

 

Try connecting to localhost over ssh to see whether it works without entering a password.
(Oops.. it asks for a password!)

hadoop@ruo91:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is bb:91:c8:60:3c:09:86:75:4e:db:e7:c1:77:47:1b:a4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hadoop@localhost's password:

 

So, generate a key pair with an empty passphrase and authorize it, so that logging in to localhost works without a password.

hadoop@ruo91:~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
hadoop@ruo91:~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
hadoop@ruo91:~$ chmod 644 ~/.ssh/authorized_keys
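
If ssh still asks for a password after this, file permissions are a common cause: with sshd's StrictModes setting (on by default), keys are ignored when the .ssh directory or authorized_keys is writable by group or others. The following is only a rough sketch of such a check; the function name check_ssh_perms is made up for illustration, and it inspects only the two paths shown, not the home directory.

```shell
# Hypothetical helper: check that an .ssh directory and its authorized_keys
# file are not writable by group or others.
# $1 = path to the .ssh directory
check_ssh_perms() {
    dir="$1"
    keys="$dir/authorized_keys"
    if [ ! -f "$keys" ]; then
        echo "missing: $keys"
        return 1
    fi
    for p in "$dir" "$keys"; do
        # First 10 characters of `ls -ld` are the mode string, e.g. drwx------
        mode=$(ls -ld "$p" | cut -c1-10)
        case "$mode" in
            # Character 6 is group-write, character 9 is other-write.
            ?????w*|????????w*)
                echo "too permissive: $p ($mode)"
                return 1
                ;;
        esac
    done
    echo "ok: $keys"
}
```

Running check_ssh_perms ~/.ssh and then ssh -o BatchMode=yes localhost true is one way to confirm the setup; with BatchMode, ssh fails instead of prompting for a password.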

 

Set the Java environment variables.

hadoop@ruo91:~$ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386; export PATH=$PATH:$JAVA_HOME/bin' >> ~/.profile
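
The JVM path above belongs to a 32-bit OpenJDK 7 install, so it may differ on your machine (on 64-bit systems the directory name usually ends differently). As a small sketch, the directory can be derived from the real location of javac; the function name java_home_from_javac is made up for illustration.

```shell
# Hypothetical helper: derive JAVA_HOME from the full path of a javac binary.
# $1 = resolved path to javac, e.g. /usr/lib/jvm/<jdk-dir>/bin/javac
java_home_from_javac() {
    # Drop the trailing /bin/javac by taking the parent of the parent directory.
    dirname "$(dirname "$1")"
}

# Example (assumes javac is on PATH and GNU readlink is available, as on Ubuntu):
# java_home_from_javac "$(readlink -f "$(command -v javac)")"
```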

 

Download hadoop and extract it.

hadoop@ruo91:~$ wget http://mirror.yongbok.net/apache/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
hadoop@ruo91:~$ tar xzvf hadoop-1.0.3.tar.gz
hadoop@ruo91:~$ cd hadoop-1.0.3

 

To configure the namenode, replication, and jobtracker, open the files below and edit them to suit your environment.
- conf/hadoop-env.sh

hadoop@ruo91:~/hadoop-1.0.3$ nano conf/hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 

- conf/core-site.xml

hadoop@ruo91:~/hadoop-1.0.3$ nano conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

 

- conf/hdfs-site.xml
Point dfs.name.dir and dfs.data.dir at directories under the location where hadoop is installed.
ex) hadoop location : /home/hadoop/hadoop-1.0.3

hadoop@ruo91:~/hadoop-1.0.3$ nano conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hadoop-1.0.3/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hadoop-1.0.3/data</value>
</property>
</configuration>

 

- conf/mapred-site.xml

hadoop@ruo91:~/hadoop-1.0.3$ nano conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
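
If you would rather not edit the three XML files by hand, they can also be written from a script. The following is a minimal sketch with the same values as above; the function name write_pseudo_conf and its two arguments are made up for illustration. Note that it overwrites any existing files of the same name.

```shell
# Hypothetical helper: write the three pseudo-distributed config files.
# $1 = conf directory, $2 = hadoop install directory (parent of name/ and data/)
write_pseudo_conf() {
    conf_dir="$1"
    hadoop_home="$2"

    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>$hadoop_home/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>$hadoop_home/data</value>
</property>
</configuration>
EOF

    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
EOF
}

# Example: write_pseudo_conf conf /home/hadoop/hadoop-1.0.3
```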

 

Once the configuration is done, format the distributed filesystem (HDFS).

hadoop@ruo91:~/hadoop-1.0.3$ bin/hadoop namenode -format
12/07/22 06:00:21 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ruo91/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.3
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
************************************************************/
12/07/22 06:00:22 INFO util.GSet: VM type = 32-bit
12/07/22 06:00:22 INFO util.GSet: 2% max memory = 19.33375 MB
12/07/22 06:00:22 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/07/22 06:00:22 INFO util.GSet: recommended=4194304, actual=4194304
12/07/22 06:00:24 INFO namenode.FSNamesystem: fsOwner=hadoop
12/07/22 06:00:24 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/22 06:00:24 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/22 06:00:24 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/07/22 06:00:24 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/22 06:00:24 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/07/22 06:00:24 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/07/22 06:00:24 INFO common.Storage: Storage directory /home/hadoop/hadoop-1.0.3/name has been successfully formatted.
12/07/22 06:00:24 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ruo91/127.0.1.1
************************************************************/
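
Formatting again later would wipe the namenode metadata, so a small guard can help before rerunning the command. This is only a sketch, assuming that a formatted dfs.name.dir contains a current/VERSION file, as it does in Hadoop 1.x; the function name is_formatted is made up for illustration.

```shell
# Hypothetical helper: succeed if the given dfs.name.dir already looks formatted.
# Assumption: a formatted name directory contains the file current/VERSION.
# $1 = path configured as dfs.name.dir
is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Example guard, using the path from hdfs-site.xml:
# is_formatted /home/hadoop/hadoop-1.0.3/name || bin/hadoop namenode -format
```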

 

Start all of the daemons.

hadoop@ruo91:~/hadoop-1.0.3$ bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-namenode-ruo91.out
localhost: starting datanode, logging to /home/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-datanode-ruo91.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-secondarynamenode-ruo91.out
starting jobtracker, logging to /home/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-jobtracker-ruo91.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-tasktracker-ruo91.out
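
The start script only reports that each daemon was launched, not that it stayed up. One way to check is jps, which ships with the JDK and lists running Java processes. Below is a rough sketch that reads jps output and names any of the five daemons from the log above that are absent; the function name missing_daemons is made up for illustration.

```shell
# Hypothetical helper: read `jps` output ("PID ClassName" per line) on stdin
# and list expected Hadoop 1.x daemons that are not running.
# Prints nothing and returns 0 when all five are present.
missing_daemons() {
    running=$(awk '{print $2}')
    status=0
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qx "$d"; then
            echo "not running: $d"
            status=1
        fi
    done
    return "$status"
}

# Example: jps | missing_daemons
```

If a daemon is reported as missing, its log file under the logs directory named in the start-all.sh output is the first place to look.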

 

Hadoop lets you check the status of the namenode and jobtracker from web pages.
- namenode : http://localhost:50070/

- jobtracker : http://localhost:50030/

To try out the filesystem formatted above, copy something into it. The command below copies the local conf directory into HDFS as input.

hadoop@ruo91:~/hadoop-1.0.3$ bin/hadoop fs -put conf input

 

Test whether Map and Reduce work.

hadoop@ruo91:~/hadoop-1.0.3$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
12/07/22 06:05:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/22 06:05:22 WARN snappy.LoadSnappy: Snappy native library not loaded
12/07/22 06:05:22 INFO mapred.FileInputFormat: Total input paths to process : 16
12/07/22 06:05:24 INFO mapred.JobClient: Running job: job_201207220601_0001
12/07/22 06:05:25 INFO mapred.JobClient: map 0% reduce 0%
12/07/22 06:06:33 INFO mapred.JobClient: map 6% reduce 0%
12/07/22 06:06:37 INFO mapred.JobClient: map 12% reduce 0%
12/07/22 06:07:53 INFO mapred.JobClient: map 12% reduce 4%
12/07/22 06:08:01 INFO mapred.JobClient: map 18% reduce 4%
12/07/22 06:08:09 INFO mapred.JobClient: map 25% reduce 4%
12/07/22 06:08:16 INFO mapred.JobClient: map 25% reduce 6%
12/07/22 06:08:21 INFO mapred.JobClient: map 25% reduce 8%
12/07/22 06:08:59 INFO mapred.JobClient: map 31% reduce 8%
12/07/22 06:09:03 INFO mapred.JobClient: map 37% reduce 8%
12/07/22 06:09:12 INFO mapred.JobClient: map 37% reduce 12%
12/07/22 06:10:41 INFO mapred.JobClient: map 50% reduce 12%
12/07/22 06:10:51 INFO mapred.JobClient: map 50% reduce 16%
12/07/22 06:12:04 INFO mapred.JobClient: map 62% reduce 16%
12/07/22 06:12:20 INFO mapred.JobClient: map 62% reduce 20%
12/07/22 06:12:54 INFO mapred.JobClient: map 75% reduce 20%
12/07/22 06:13:06 INFO mapred.JobClient: map 75% reduce 25%
12/07/22 06:13:46 INFO mapred.JobClient: map 81% reduce 25%
12/07/22 06:13:50 INFO mapred.JobClient: map 87% reduce 25%
12/07/22 06:14:01 INFO mapred.JobClient: map 87% reduce 27%
12/07/22 06:14:04 INFO mapred.JobClient: map 87% reduce 29%
12/07/22 06:14:36 INFO mapred.JobClient: map 100% reduce 29%
12/07/22 06:14:52 INFO mapred.JobClient: map 100% reduce 100%
12/07/22 06:15:04 INFO mapred.JobClient: Job complete: job_201207220601_0001
12/07/22 06:15:05 INFO mapred.JobClient: Counters: 30
12/07/22 06:15:05 INFO mapred.JobClient: Job Counters
12/07/22 06:15:05 INFO mapred.JobClient: Launched reduce tasks=1
12/07/22 06:15:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1046529
12/07/22 06:15:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/22 06:15:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/07/22 06:15:05 INFO mapred.JobClient: Launched map tasks=16
12/07/22 06:15:05 INFO mapred.JobClient: Data-local map tasks=16
12/07/22 06:15:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=496516
12/07/22 06:15:05 INFO mapred.JobClient: File Input Format Counters
12/07/22 06:15:05 INFO mapred.JobClient: Bytes Read=27119
12/07/22 06:15:05 INFO mapred.JobClient: File Output Format Counters
12/07/22 06:15:05 INFO mapred.JobClient: Bytes Written=238
12/07/22 06:15:05 INFO mapred.JobClient: FileSystemCounters
12/07/22 06:15:05 INFO mapred.JobClient: FILE_BYTES_READ=128
12/07/22 06:15:05 INFO mapred.JobClient: HDFS_BYTES_READ=28873
12/07/22 06:15:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=368728
12/07/22 06:15:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=238
12/07/22 06:15:05 INFO mapred.JobClient: Map-Reduce Framework
12/07/22 06:15:05 INFO mapred.JobClient: Map output materialized bytes=218
12/07/22 06:15:05 INFO mapred.JobClient: Map input records=770
12/07/22 06:15:05 INFO mapred.JobClient: Reduce shuffle bytes=212
12/07/22 06:15:05 INFO mapred.JobClient: Spilled Records=10
12/07/22 06:15:05 INFO mapred.JobClient: Map output bytes=112
12/07/22 06:15:05 INFO mapred.JobClient: Total committed heap usage (bytes)=3252158464
12/07/22 06:15:05 INFO mapred.JobClient: CPU time spent (ms)=127030
12/07/22 06:15:05 INFO mapred.JobClient: Map input bytes=27119
12/07/22 06:15:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=1754
12/07/22 06:15:05 INFO mapred.JobClient: Combine input records=5
12/07/22 06:15:05 INFO mapred.JobClient: Reduce input records=5
12/07/22 06:15:05 INFO mapred.JobClient: Reduce input groups=5
12/07/22 06:15:05 INFO mapred.JobClient: Combine output records=5
12/07/22 06:15:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=2256572416
12/07/22 06:15:05 INFO mapred.JobClient: Reduce output records=5
12/07/22 06:15:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6466576384
12/07/22 06:15:05 INFO mapred.JobClient: Map output records=5
12/07/22 06:15:06 INFO mapred.FileInputFormat: Total input paths to process : 1
12/07/22 06:15:08 INFO mapred.JobClient: Running job: job_201207220601_0002
12/07/22 06:15:09 INFO mapred.JobClient: map 0% reduce 0%
12/07/22 06:15:39 INFO mapred.JobClient: map 100% reduce 0%
12/07/22 06:16:00 INFO mapred.JobClient: map 100% reduce 100%
12/07/22 06:16:11 INFO mapred.JobClient: Job complete: job_201207220601_0002
12/07/22 06:16:11 INFO mapred.JobClient: Counters: 30
12/07/22 06:16:11 INFO mapred.JobClient: Job Counters
12/07/22 06:16:11 INFO mapred.JobClient: Launched reduce tasks=1
12/07/22 06:16:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=32548
12/07/22 06:16:12 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/22 06:16:12 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/07/22 06:16:12 INFO mapred.JobClient: Launched map tasks=1
12/07/22 06:16:12 INFO mapred.JobClient: Data-local map tasks=1
12/07/22 06:16:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19552
12/07/22 06:16:12 INFO mapred.JobClient: File Input Format Counters
12/07/22 06:16:12 INFO mapred.JobClient: Bytes Read=238
12/07/22 06:16:12 INFO mapred.JobClient: File Output Format Counters
12/07/22 06:16:12 INFO mapred.JobClient: Bytes Written=82
12/07/22 06:16:12 INFO mapred.JobClient: FileSystemCounters
12/07/22 06:16:12 INFO mapred.JobClient: FILE_BYTES_READ=128
12/07/22 06:16:12 INFO mapred.JobClient: HDFS_BYTES_READ=355
12/07/22 06:16:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42813
12/07/22 06:16:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=82
12/07/22 06:16:12 INFO mapred.JobClient: Map-Reduce Framework
12/07/22 06:16:12 INFO mapred.JobClient: Map output materialized bytes=128
12/07/22 06:16:12 INFO mapred.JobClient: Map input records=5
12/07/22 06:16:12 INFO mapred.JobClient: Reduce shuffle bytes=0
12/07/22 06:16:12 INFO mapred.JobClient: Spilled Records=10
12/07/22 06:16:12 INFO mapred.JobClient: Map output bytes=112
12/07/22 06:16:12 INFO mapred.JobClient: Total committed heap usage (bytes)=210632704
12/07/22 06:16:12 INFO mapred.JobClient: CPU time spent (ms)=4680
12/07/22 06:16:12 INFO mapred.JobClient: Map input bytes=152
12/07/22 06:16:12 INFO mapred.JobClient: SPLIT_RAW_BYTES=117
12/07/22 06:16:12 INFO mapred.JobClient: Combine input records=0
12/07/22 06:16:12 INFO mapred.JobClient: Reduce input records=5
12/07/22 06:16:12 INFO mapred.JobClient: Reduce input groups=1
12/07/22 06:16:12 INFO mapred.JobClient: Combine output records=0
12/07/22 06:16:12 INFO mapred.JobClient: Physical memory (bytes) snapshot=182882304
12/07/22 06:16:12 INFO mapred.JobClient: Reduce output records=5
12/07/22 06:16:12 INFO mapred.JobClient: Virtual memory (bytes) snapshot=763695104
12/07/22 06:16:12 INFO mapred.JobClient: Map output records=5

 

The stored results can be fetched to the local filesystem and viewed there:

hadoop@ruo91:~/hadoop-1.0.3$ bin/hadoop fs -get output output
hadoop@ruo91:~/hadoop-1.0.3$ cat output/*
cat: output/_logs: Is a directory
1 dfs.data.dir
1 dfs.name.dir
1 dfs.replication
1 dfs.server.namenode.
1 dfsadmin

 

They can also be printed directly from HDFS with fs -cat:

hadoop@ruo91:~/hadoop-1.0.3$ bin/hadoop fs -cat output/*
cat: File does not exist: /user/hadoop/output/_logs
1 dfs.data.dir
1 dfs.name.dir
1 dfs.replication
1 dfs.server.namenode.
1 dfsadmin
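
The cat errors in both listings come from the _logs directory, which the job creates next to the result files and which the * wildcard also matches. A small sketch for printing only the results from a locally fetched output directory follows, assuming the result files use the usual part-* naming; the function name cat_parts is made up for illustration.

```shell
# Hypothetical helper: print only the result files from a job output directory
# on the local filesystem, skipping _logs and anything else.
# $1 = local output directory (e.g. the one created by `fs -get output output`)
cat_parts() {
    for f in "$1"/part-*; do
        # If nothing matched, the pattern stays literal and is skipped here.
        if [ -f "$f" ]; then
            cat "$f"
        fi
    done
}

# Example: cat_parts output
```

On the HDFS side, the same idea should work by narrowing the pattern, for example bin/hadoop fs -cat 'output/part-*'.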
