Ansible Playbook that installs a CDH4 Hadoop cluster (running on Java 7, supported from CDH 4.4), with Ganglia, Fluentd, ElasticSearch and Kibana 3 for monitoring and centralized log indexing.
Hire/Follow @analytically. NEW: Deploys Hive Metastore and Facebook Presto!
- Ansible 1.4 or later
- 6 + 1 Ubuntu 12.04 LTS, 13.04 or 13.10 hosts - see ubuntu-netboot-tftp if you need automated server installation
- Mandrill API key for sending emails
- an ansibler user in the sudo group without a sudo password prompt (see the Bootstrapping section below)
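A minimal sketch of how the passwordless-sudo requirement is usually met. The user name matches the requirement above; the demo file path is an assumption so the snippet can run anywhere — on a real node the target would be /etc/sudoers.d/ansibler.

```shell
# Sketch: grant the 'ansibler' user passwordless sudo.
# A demo path is used here so the snippet runs without root;
# on a real node write to /etc/sudoers.d/ansibler instead.
SUDOERS_FILE="${SUDOERS_FILE:-/tmp/ansibler_sudoers_demo}"
echo 'ansibler ALL=(ALL) NOPASSWD:ALL' > "$SUDOERS_FILE"
chmod 0440 "$SUDOERS_FILE"   # sudoers files must not be world-readable
cat "$SUDOERS_FILE"
```

Validate the file with `visudo -cf` before installing it for real; a syntax error in sudoers can lock you out of sudo entirely.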
Cloudera (CDH4) Hadoop Roles
If you're assembling your own Hadoop playbook, these roles are available for you to reuse:
- cdh_common - sets up Cloudera's Ubuntu repository and key
- cdh_hadoop_common - common packages shared by all Hadoop nodes
- cdh_hadoop_config - common configuration shared by all Hadoop nodes
- cdh_hadoop_datanode - installs Hadoop DataNode
- cdh_hadoop_journalnode - installs Hadoop JournalNode
- cdh_hadoop_mapreduce - installs Hadoop MapReduce
- cdh_hadoop_mapreduce_historyserver - installs Hadoop MapReduce history server
- cdh_hadoop_namenode - installs Hadoop NameNode
- cdh_hadoop_yarn_nodemanager - installs Hadoop YARN node manager
- cdh_hadoop_yarn_proxyserver - installs Hadoop YARN proxy server
- cdh_hadoop_yarn_resourcemanager - installs Hadoop YARN resource manager
- cdh_hadoop_zkfc - installs Hadoop ZooKeeper Failover Controller
- cdh_hbase_common - common packages shared by all HBase nodes
- cdh_hbase_config - common configuration shared by all HBase nodes
- cdh_hbase_master - installs HBase Master
- cdh_hbase_regionserver - installs HBase RegionServer
- cdh_hive_common - common packages shared by all Hive nodes
- cdh_hive_config - common configuration shared by all Hive nodes
- cdh_hive_metastore - installs Hive metastore (using a PostgreSQL database)
- cdh_zookeeper_server - installs ZooKeeper Server
Facebook Presto Roles
- presto_common - downloads Presto to /usr/local/presto and prepares the node configuration
- presto_coordinator - installs Presto coordinator config
- presto_worker - installs Presto worker config
Customize the following files:
Required:
- group_vars/all - site_name and notify_email
- roles/postfix_mandrill/defaults/main.yml - set your Mandrill account (API key)
Optional:
- roles/2_aggregated_links/defaults/main.yml - aggregated link bond mode and MTU
- roles/cdh_hadoop_config/defaults/main.yml - Hadoop settings
- roles/presto_coordinator/templates/config.properties - Presto coordinator configuration
- roles/presto_worker/templates/config.properties - Presto worker configuration
When specifying/reusing roles, you can override their vars, e.g.:
- { role: postfix_mandrill, postfix_domain: example.com, mandrill_username: joe, mandrill_api_key: 123 }
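For example, a play could combine the roles above with such overrides. This is an illustrative sketch, not a play from the playbook itself; the values are placeholders:

```yaml
# Illustrative playbook fragment: reusing roles with overridden vars.
- hosts: datanodes
  roles:
    - cdh_hadoop_common
    - cdh_hadoop_datanode
    - { role: postfix_mandrill, postfix_domain: example.com, mandrill_username: joe, mandrill_api_key: 123 }
```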
Edit the hosts file and list hosts per group (see Inventory for more examples):
[datanodes]
hslave010
hslave[090:252]
hadoop-slave-[a:f].example.com
Make sure that the zookeepers and journalnodes groups each contain at least 3 hosts, and an odd number of hosts.
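For instance, a quorum-safe layout looks like this (host names are placeholders; an odd group size means ZooKeeper and JournalNode quorums can always reach a majority):

```ini
# Placeholder hosts: both groups have 3 members (odd, >= 3),
# so a quorum majority survives the loss of one node.
[zookeepers]
hslave010
hslave011
hslave012

[journalnodes]
hslave010
hslave011
hslave012
```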
Since we're using unicast mode for Ganglia (which significantly reduces chatter), you may have to wait 60 seconds after node startup before the node shows up in the web interface.
To run Ansible:
./site.sh
To e.g. just install ZooKeeper, add the zookeeper tag as an argument (available tags: apache, bonding, configuration, elasticsearch, fluentd, ganglia, hadoop, hbase, hive, java, kibana, ntp, presto, rsyslog, tdagent, zookeeper):
./site.sh zookeeper
- link aggregation - configures Link Aggregation if 2 interfaces are available on the nodes (balance-alb by default)
- Htop
- curl, checkinstall, intel-microcode/amd64-microcode, net-tools, zip
- NTP configured with the Oxford University NTP service by default
- Postfix with Mandrill configuration
- local 'apt' repository for Oracle Java packages
- unattended upgrades email to inform success/failure
- php5-cli, sysstat, hddtemp to report device metrics (reads/writes/temp) to Ganglia every 10 minutes.
- LZO (Lempel–Ziv–Oberhumer) and Google Snappy 1.1.1 compression
Instructions on how to test the performance of your CDH4 cluster.
- SSH into one of the machines.
- Change to the hdfs user: sudo su - hdfs
- Set HADOOP_MAPRED_HOME:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
cd /usr/lib/hadoop-mapreduce
- Run TeraGen: hadoop jar hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=1000 10000000000 /tera/in
- Run TeraSort: hadoop jar hadoop-mapreduce-examples.jar terasort /tera/in /tera/out
- Run TestDFSIO (write test): hadoop jar hadoop-mapreduce-client-jobclient-2.0.0-cdh4.5.0-tests.jar TestDFSIO -write
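TeraGen writes fixed 100-byte rows, so the row count in the command above determines the dataset size. A quick sanity check of the arithmetic (plain shell math, no cluster needed):

```shell
# Each TeraGen row is 100 bytes; 10,000,000,000 rows as in the example above.
ROWS=10000000000
BYTES=$((ROWS * 100))                   # total dataset size in bytes
echo "$((BYTES / 1000000000000)) TB"    # prints: 1 TB
```

Scale ROWS down (e.g. to 10000000 for ~1 GB) for a smoke test before committing the cluster to a full 1 TB sort.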
Paste your public SSH RSA key in bootstrap/ansible_rsa.pub and run bootstrap.sh to bootstrap the nodes specified in bootstrap/hosts. See bootstrap/bootstrap.yml for more information.
You can manually install additional components after running this playbook. Follow the official CDH4 Installation Guide.
Licensed under the Apache License, Version 2.0.
Copyright 2013 Mathias Bogaert.