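I had long been curious whether Ruby could be used to write MapReduce programs.
After some Googling I found map-reduce-with-ruby-using-apache-hadoop, which uses Hadoop Streaming mode.
Streaming mode lets you write MapReduce jobs in a language you already know, provided that language is also installed on every Hadoop node.
Here is a simple example written in Ruby: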
Input file raw_data:
one
two
three
one
two
six
# Ruby code for map.rb
ARGF.each do |line|
  # remove any newline
  line = line.chomp
  # set key
  key = line
  # value is a count of 1
  value = 1
  puts key + "\t" + value.to_s
end
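The mapper above treats each whole line as one key, which fits the one-word-per-line sample input. As a rough sketch only, assuming input lines may hold several whitespace-separated words or be blank, the same idea could be written as follows. The names `map_line` and `run_mapper` are hypothetical helpers, not part of the original scripts, and the demo uses in-memory `StringIO` objects instead of `ARGF` and `$stdout`.

```ruby
require "stringio"

# Hypothetical helper: turn one input line into "word\t1" records.
# Blank lines produce no records at all.
def map_line(line)
  line.split.map { |word| "#{word}\t1" }
end

# Hypothetical driver: read lines from `input`, write records to `output`.
def run_mapper(input, output)
  input.each_line do |line|
    map_line(line).each { |record| output.puts record }
  end
end

# Demo with in-memory data standing in for ARGF and $stdout.
sample = StringIO.new("one two\n\nthree one\n")
result = StringIO.new
run_mapper(sample, result)
puts result.string
```

In an actual streaming job the call would presumably be `run_mapper(ARGF, $stdout)`, so the script still reads from standard input and writes to standard output.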
# Ruby code for reduce.rb
prev_key = nil
key_total = 0
ARGF.each do |line|
  # remove any newline
  line = line.chomp
  # split key and value on tab character
  key, value = line.split(/\t/)
  # when the key changes, emit the total for the previous key
  if prev_key && key != prev_key
    puts prev_key + "\t" + key_total.to_s
    # reset key total for new key
    key_total = 0
  end
  prev_key = key
  # add to count for this current key
  key_total += value.to_i
end
# emit the last key (skipped when there was no input at all)
puts prev_key + "\t" + key_total.to_s if prev_key
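The reducer relies on records with the same key arriving next to each other. The same grouping can be sketched with `Enumerable#chunk`, which groups consecutive elements that share a block result. This is only an alternative illustration, not the original author's code; `reduce_lines` is a hypothetical helper and the sample records are written inline rather than read from `ARGF`.

```ruby
# Hypothetical helper: sum the counts of consecutive records sharing a key.
# `lines` must already be sorted by key, as the sort step provides.
def reduce_lines(lines)
  # Split each "key\tvalue" record into a [key, value] pair.
  pairs = lines.map { |line| line.chomp.split("\t", 2) }
  # chunk groups runs of consecutive pairs that have the same key.
  pairs.chunk { |pair| pair.first }.map do |key, group|
    total = group.sum { |pair| pair.last.to_i }
    "#{key}\t#{total}"
  end
end

sorted = ["one\t1\n", "one\t1\n", "six\t1\n", "three\t1\n", "two\t1\n", "two\t1\n"]
puts reduce_lines(sorted)
```

Unlike the line-by-line loop in reduce.rb, this version builds the whole list of pairs in memory, so it is better suited to small inputs or to unit-testing the counting logic.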
In local mode you can test the scripts with the following command and see the result:
masato@host ~ cat raw_data | ruby map.rb | sort | ruby reduce.rb
Output:
one 2
six 1
three 1
two 2
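The `sort` step in the local pipeline matters because the reducer only sums runs of identical keys. The sketch below replays the pipeline in a single Ruby process with the sample words from this post, once with sorting and once without. `reduce_records` is a hypothetical in-process stand-in for reduce.rb, not the script itself.

```ruby
# Hypothetical in-process stand-in for reduce.rb: sums runs of equal keys.
def reduce_records(records)
  output = []
  prev_key = nil
  total = 0
  records.each do |record|
    key, value = record.split("\t")
    # A key change closes the current run and emits its total.
    if prev_key && key != prev_key
      output << "#{prev_key}\t#{total}"
      total = 0
    end
    prev_key = key
    total += value.to_i
  end
  output << "#{prev_key}\t#{total}" if prev_key
  output
end

# Sample words copied from the raw_data file; map phase mimics map.rb.
raw_data = %w[one two three one two six]
mapped = raw_data.map { |word| "#{word}\t1" }

puts "with sort:"
puts reduce_records(mapped.sort)
puts "without sort:"
puts reduce_records(mapped)
```

With sorting, the duplicates of `one` and `two` are adjacent and get summed. Without it, every record forms its own run and each word is reported with a count of 1, repeated keys included.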
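To run the same job on a Hadoop cluster in streaming mode, submit it with a shell script like this:
#!/bin/bash
# Set the Hadoop home directory
HADOOP_HOME=/usr/local/hadoop
# Location of the hadoop-streaming jar, relative to HADOOP_HOME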
JAR=contrib/streaming/hadoop-streaming-0.20.2+737.jar
HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"
$HSTREAMING \
  -mapper 'ruby map.rb' \
  -reducer 'ruby reduce.rb' \
  -file map.rb \
  -file reduce.rb \
  -input '/user/phil/input/*' \
  -output /user/phil/output