The three papers published by Google:
MapReduce: Simplified Data Processing on Large Clusters
Bigtable: A Distributed Storage System for Structured Data
The Google File System
These notes focus on two frameworks: Hadoop and Spark.
Hadoop Single Node

Prerequisites: make sure Hadoop is installed, configured, and running. For more details, see the Hadoop single-node installation guide.
I also ran into some problems during installation, mainly two:

- the `JAVA_HOME` environment variable was reported as not set (even though it was set in `/etc/profile`)
- a permission error when starting Hadoop
The fix for both: in the Hadoop installation directory, edit the file `etc/hadoop/hadoop-env.sh` and add:

```sh
export PDSH_RCMD_TYPE=ssh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
```
Fill in the actual value of `JAVA_HOME` according to where Java is installed on your machine; the officially recommended JDK version is 8.
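If you are not sure which path to use, one quick check is to ask a running JVM where it lives. This is just a convenience sketch of mine, not part of the Hadoop setup:

```java
// JavaHomeCheck.java — print where the current JVM is installed.
// Compile and run with: javac JavaHomeCheck.java && java JavaHomeCheck
public class JavaHomeCheck {
    public static void main(String[] args) {
        // On JDK 8 java.home often points at the jre/ subdirectory of the JDK;
        // in that case JAVA_HOME should be set to its parent directory.
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("java.version = " + System.getProperty("java.version"));
    }
}
```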
Inputs and Outputs

The MapReduce framework operates exclusively on `<key, value>` pairs: it views a job's input as a set of key/value pairs and produces a set of key/value pairs as the job's output.
At the heart of the framework are two functions, `map` and `reduce`. `map` transforms the input data; the example in the paper is:
```
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
```
Roughly, the pseudocode above says: `map` reads a document name and the document's contents and, for every word in the contents, emits a key/value pair `<word, 1>`; `reduce` receives the grouped output of `map`, so each distinct word comes with a list of counts, and `reduce` adds them up. After reading this code, two questions come to mind:
- Can `map` and `reduce` run in parallel?
- Can `map` read its input from a stream?
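On the first question, the paper's answer is yes: `map` invocations are independent of one another, so they are distributed across many machines by partitioning the input data, and `reduce` invocations are distributed by partitioning the intermediate key space; a `reduce` for a given key can only start once all the map output feeding it has been grouped. On the second, Hadoop reads input through its InputFormat abstraction, which normally supplies splits of HDFS files rather than an arbitrary stream, though custom InputFormats are possible. To make the grouping ("shuffle") step explicit, here is a plain-Java simulation of the pseudocode above — my own sketch, not Hadoop's API:

```java
import java.util.*;
import java.util.AbstractMap.SimpleEntry;

public class MapReduceSketch {
    // map: document contents -> list of intermediate <word, 1> pairs
    static List<SimpleEntry<String, Integer>> map(String contents) {
        List<SimpleEntry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            out.add(new SimpleEntry<>(w, 1));
        }
        return out;
    }

    // reduce: a word plus all of its emitted counts -> total occurrences
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int v : counts) result += v;
        return result;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("hello world", "hello mapreduce");

        // "Shuffle": group intermediate values by key. Each map(doc) call is
        // independent, which is exactly what lets the real framework run map
        // tasks in parallel across machines.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : documents) {
            for (SimpleEntry<String, Integer> kv : map(doc)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```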
Example: WordCount

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>top.ourfor.start</groupId>
    <artifactId>hello</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>HelloHadoop</name>

    <properties>
        <hadoopcommon.version>3.3.1</hadoopcommon.version>
        <hadoopcore.version>1.2.1</hadoopcore.version>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoopcommon.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>${hadoopcore.version}</version>
        </dependency>
    </dependencies>
</project>
```
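The POM above only sets up the build; the job itself is the classic WordCount from the official Hadoop MapReduce tutorial, roughly as follows (the class name and file layout are my choice, the POM does not dictate them):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: for each word in the input line, emit <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // reduce: sum all counts emitted for the same word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Build with `mvn package` and run with `hadoop jar target/hello-1.0-SNAPSHOT.jar WordCount <input dir> <output dir>`; the jar name follows from the `artifactId` and `version` in the POM.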