In my previous post, I listed the services installed on an Oracle Big Data Cloud Service instance when you select “full” as the deployment profile. In this post, I’ll explain these services and software.
HDFS: HDFS is a distributed, scalable, and portable file system written in Java for Hadoop. It stores the data, so it is the main component of our cluster. A Hadoop (big data) cluster nominally has a single namenode plus a cluster of datanodes, but redundancy options are available for the namenode because it is so critical. The namenode and datanode services can run on the same server (although this is not recommended in a production environment). In our small cluster, we have 1 active namenode, 1 standby namenode and 3 datanodes, distributed across 3 servers.
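To make the namenode/datanode split a bit more concrete, here is a minimal sketch of the kind of bookkeeping a namenode does: it splits a file into blocks and remembers which datanodes hold a copy of each block. The function name `place_blocks`, the node names, and the simple round-robin placement are my own assumptions for illustration; real HDFS uses a rack-aware placement policy and defaults to 128 MB blocks with 3 replicas.

```python
def place_blocks(file_size, block_size, datanodes, replication):
    """Return {block_index: [datanode, ...]} for one file.

    Toy model only: blocks are spread round-robin, which is NOT
    the placement policy real HDFS uses.
    """
    num_blocks = -(-file_size // block_size)  # ceiling division
    copies = min(replication, len(datanodes))  # cannot exceed node count
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(copies)
        ]
    return placement


MB = 1024 * 1024
# Hypothetical 300 MB file on a 3-datanode cluster, 2 copies per block.
layout = place_blocks(300 * MB, 128 * MB, ["node1", "node2", "node3"], 2)
for block, nodes in layout.items():
    print(block, nodes)
```

The point of the sketch is that the datanodes only hold blocks, while the mapping itself lives on the namenode, which is why losing the namenode is so much worse than losing a datanode.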
YARN + MapReduce (v2): MapReduce is a programming model popularized by Google to process large datasets in a parallel and scalable way. YARN is a framework for cluster resource management and job scheduling. YARN consists of a Resource Manager and Node Managers (for redundancy, we can create a standby Resource Manager). The Resource Manager tracks how many live nodes and resources are available on the cluster and decides which applications submitted by users get these resources. Each datanode should have a Node Manager to run MapReduce jobs.
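The MapReduce programming model is easiest to understand with the classic word count. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process. It only illustrates the model; it does not use Hadoop's API, and the function names are my own.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data big cluster", "data node"]))
```

On a real cluster, the map and reduce calls run in parallel on different nodes, and YARN decides where each task gets its resources.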
Hive: Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in HDFS. Hive provides an SQL-like interface to query data. These queries are implicitly converted into MapReduce or Tez jobs. It stores structured data, like traditional databases. It requires a relational database to store the metadata for Hive tables. In our cluster, MySQL is used for the Hive metastore.
Tez: Tez is the next-generation Hadoop query processing framework built on top of YARN. Hive uses it to increase performance.
Pig: Pig is a scripting platform for processing and analyzing large data sets. It has its own language (Pig Latin). Pig is ideal for extract-transform-load (ETL) jobs, research on raw data, and iterative data processing.
ZooKeeper: ZooKeeper is an open source project that provides centralized infrastructure and services that enable synchronization across a cluster. All other services in our cluster register themselves with ZooKeeper, so whenever a service is required, it can be found through ZooKeeper.
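The register-then-look-up idea can be sketched with a small in-memory registry. This is a toy model and not the ZooKeeper API: the class `ToyRegistry`, the `/services/...` path layout, and the host names are all hypothetical, chosen only to show how a client can find a service without knowing in advance where it runs.

```python
class ToyRegistry:
    """In-memory stand-in for a coordination service (NOT ZooKeeper's API)."""

    def __init__(self):
        self._nodes = {}  # path -> network address

    def register(self, service, instance, address):
        # Each running instance records itself under its service name.
        self._nodes[f"/services/{service}/{instance}"] = address

    def unregister(self, service, instance):
        # Called when an instance shuts down or is declared dead.
        self._nodes.pop(f"/services/{service}/{instance}", None)

    def lookup(self, service):
        # Clients ask for a service by name and get every live address.
        prefix = f"/services/{service}/"
        return sorted(
            addr for path, addr in self._nodes.items() if path.startswith(prefix)
        )


registry = ToyRegistry()
registry.register("hiveserver2", "a", "node1.example.com:10000")
registry.register("hiveserver2", "b", "node2.example.com:10000")
registry.unregister("hiveserver2", "a")
print(registry.lookup("hiveserver2"))
```

In the real thing, the registry is replicated across several ZooKeeper servers, and an entry can disappear automatically when the process that created it loses its session.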
Spark: Apache Spark is a fast, in-memory data processing engine that allows data scientists to build and run sophisticated applications on Hadoop. Spark comes with a set of high-level tools that support SQL-like queries, streaming data applications, complex analytics such as machine learning, and graph algorithms.
Zeppelin Notebook: Zeppelin Notebook is a web-based notebook that enables interactive data analytics. It can be used to make beautiful data-driven, interactive and collaborative documents with SQL, Spark, Scala and more. In our cluster, we have 3 Zeppelin servers, one running on each node.
Alluxio: Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system. It creates a storage layer between your applications and the underlying storage systems.
BDCSCE Logstash Agent: Logstash is a popular data collection engine with real-time pipelining capabilities. In our cluster, it is used to gather (and parse) the logs of the YARN Node Managers. The parsed logs are stored in HDFS. It is not normally part of a Hadoop cluster.
Nginx Reverse Proxy: Nginx Reverse Proxy is used to provide authorization for Zeppelin Notebook. It’s also used for the Big Data Console.
Spark Cloud Service UI: It’s written by Oracle to serve Zeppelin through the Nginx proxy.
Spocs (or Spoccs) Fabric Service: It’s written by Oracle to provide a REST API for job and log management.
And of course, we have the Ambari service. Ambari is a completely open operational framework for provisioning, managing, and monitoring Hadoop clusters. It includes a set of operator tools and RESTful APIs that mask the complexity of Hadoop, simplifying the operation of clusters.
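As a small taste of the RESTful API, the sketch below builds (but does not send) a request that lists the clusters known to an Ambari server. The host name and credentials are placeholders, and I am assuming plain HTTP on port 8080 with basic authentication; your cluster may well use HTTPS or a different port. The helper name `build_ambari_request` is my own.

```python
import base64
import urllib.request


def build_ambari_request(host, user, password, path="/api/v1/clusters", port=8080):
    """Build a GET request for the Ambari REST API without sending it."""
    url = f"http://{host}:{port}{path}"  # assumption: HTTP on port 8080
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    # Ambari requires this header on modifying calls; harmless on a GET.
    request.add_header("X-Requested-By", "ambari")
    return request


# Placeholder host and credentials, for illustration only.
req = build_ambari_request("ambari.example.com", "admin", "secret")
print(req.get_method(), req.full_url)
```

To actually run the call, you would pass `req` to `urllib.request.urlopen` and parse the JSON response.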
In my next post, I’ll talk about the Ambari console and how we can add “missing views” to Ambari.