Sunday, August 12, 2018

The 10 Most Important Hadoop Terms You Need to Know and Understand

Big data, the catchy name for large volumes of structured, unstructured or semi-structured data, is notoriously difficult to capture, store, manage, share, analyze and visualize, at least with traditional software applications and databases. That's where big data technologies come in: they can manage and process huge volumes of data efficiently and effectively. Apache Hadoop provides the framework and associated technologies to process large datasets across clusters of computers in a distributed way. Here, let's take a look at the most important terms you'll hear about Hadoop and what they mean.


But first, a look at how Hadoop works

Before diving into the Hadoop ecosystem, you need to clearly understand two fundamental things. The first is how a file is stored in Hadoop; the second is how the stored data is processed. All Hadoop-related technologies work mainly on these two areas and make them easier to use.


Now, on to the terms.



Hadoop Common


The Hadoop framework has different modules for different functionalities, and these modules can interact with one another for a variety of reasons. Hadoop Common can be defined as the library of common utilities that supports these modules in the Hadoop ecosystem. These utilities are basically Java archive (JAR) files, used mostly by developers at development time.
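
To see what Hadoop Common supplies in practice, here is a minimal Java sketch using its Configuration class, which the other modules rely on to read cluster settings; the property shown and its fallback default are just illustrations:

import org.apache.hadoop.conf.Configuration;

public class CommonExample {
    public static void main(String[] args) {
        // Configuration (shipped in the hadoop-common JAR) automatically loads
        // core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // fs.defaultFS names the default file system, e.g. hdfs://namenode:8020
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default file system: " + defaultFs);
    }
}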


Hadoop Distributed File System (HDFS)


The Hadoop Distributed File System (HDFS) is a subproject of Apache Hadoop under the Apache Software Foundation. It is the storage backbone of the Hadoop framework: a distributed, scalable, fault-tolerant file system that spans the commodity hardware known as the Hadoop cluster. The goal of HDFS is to store large volumes of data reliably while giving applications high-throughput access to their data. HDFS follows a master/slave architecture, where the master is known as the NameNode and the slaves are known as DataNodes.
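
To make the NameNode/DataNode picture concrete, here is a minimal Java sketch that writes and reads a file through the HDFS API. The file path is just an example, and it assumes your core-site.xml points fs.defaultFS at the NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);      // talks to the NameNode

        Path file = new Path("/tmp/hello.txt");    // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");          // blocks are replicated to DataNodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}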


MapReduce


Hadoop MapReduce is also a subproject of the Apache Software Foundation. MapReduce is a software framework written in Java. Its main objective is to process large datasets on a distributed system (composed of commodity hardware) in a completely parallel manner. The framework manages all the bookkeeping, such as task scheduling, monitoring, execution and re-execution.
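
The classic illustration is word count: the map phase emits (word, 1) pairs and the reduce phase sums them per word. Here is a compact sketch using the org.apache.hadoop.mapreduce API, with input and output paths supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}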


HBase


Apache HBase is known as the Hadoop database. It is a distributed, scalable big data store, and a type of NoSQL database, meaning it is not a relational database management system. HBase applications are also written in Java, built on top of Hadoop, and run on HDFS. HBase is used when you need real-time read/write and random access to big data. It is modeled on the concepts of Google's Bigtable.
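
Here is a minimal sketch of that random, row-keyed access using the HBase Java client. It assumes a pre-created table called "users" with a column family "info" (both hypothetical names):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Random-access write, keyed by row
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random-access read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}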


Hive


Apache Hive is an open-source data warehouse software system. Hive was originally developed by Facebook before coming to the Apache Software Foundation and becoming open source. It facilitates managing and querying large datasets residing in distributed, Hadoop-compatible storage. Hive performs all of its activities using a SQL-like language known as HiveQL.
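
One common way to run HiveQL from Java is over JDBC against HiveServer2. A minimal sketch, assuming a HiveServer2 on localhost at its default port 10000 and a hypothetical products table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is compiled into distributed jobs
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM products GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}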


Apache Pig

Pig was originally started at Yahoo to develop and execute MapReduce jobs over large volumes of distributed data. It has since become an open-source project of the Apache Software Foundation. Apache Pig can be defined as a platform for analyzing very large datasets efficiently. Pig's infrastructure layer produces sequences of MapReduce jobs to do the actual processing, while its language layer, known as Pig Latin, provides SQL-like features for querying distributed datasets.
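
To show both layers together, here is a hedged sketch that embeds Pig Latin in Java through the PigServer API; the input path, schema and output path are all hypothetical:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Pig Latin: load, filter and group a distributed dataset
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray, bytes:long);");
        pig.registerQuery("big = FILTER logs BY bytes > 1024;");
        pig.registerQuery("by_user = GROUP big BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(big);");
        // store() triggers the underlying MapReduce jobs
        pig.store("counts", "/data/output/heavy_users");
        pig.shutdown();
    }
}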


Apache Spark


Spark was originally developed by the AMPLab at UC Berkeley. Apache Spark can be defined as an open-source, general-purpose cluster-computing framework that makes data analytics much faster. It can run on top of the Hadoop Distributed File System, but it is not tied to the MapReduce framework, and its performance is much faster than MapReduce's. Spark provides high-level APIs in Scala, Python and Java.
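
Here is the same word-count idea from the MapReduce section, written against Spark's Java API; note how the whole pipeline becomes a few chained transformations (the HDFS paths are assumptions):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class SparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // assumed path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum); // aggregation runs in memory across the cluster
            counts.saveAsTextFile("hdfs:///data/output");
        }
    }
}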


Apache Cassandra


Apache Cassandra is another open-source NoSQL database. Cassandra is widely used to manage large volumes of structured, semi-structured and unstructured data across multiple data centers and cloud storage. Cassandra is designed around a "masterless" architecture, meaning it does not follow the master/slave model: all nodes are equal, and data is automatically distributed across all of them. Cassandra's most important features are continuous availability, linear scalability, built-in and customizable replication, no single point of failure, and operational simplicity.
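
A minimal sketch using the DataStax Java driver (a third-party client; the contact point, keyspace and table are assumptions) shows the masterless model from the client's point of view: you connect to any node, not to a master:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Any node can coordinate a request; there is no master to connect to.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("shop")) {
            session.execute("INSERT INTO users (id, name) VALUES (uuid(), 'Alice')");
            ResultSet rs = session.execute("SELECT id, name FROM users");
            for (Row row : rs) {
                System.out.println(row.getUUID("id") + " " + row.getString("name"));
            }
        }
    }
}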


Yet Another Resource Negotiator (YARN)

Yet Another Resource Negotiator (YARN) is also known as MapReduce 2.0, but it actually falls under Hadoop 2.0. YARN can be defined as a framework for resource management and job scheduling. The basic idea of YARN is to split the JobTracker's functionality between two separate daemons, one responsible for resource management and the other for scheduling/monitoring. In this new architecture there is a global ResourceManager (RM) and a per-application master known as the ApplicationMaster (AM). The global ResourceManager (RM) and the NodeManager (a per-node slave) form the actual data-computation framework. Existing MapReduce v1 applications can also run on YARN, but they must be recompiled against Hadoop 2.x JARs.
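
As a small illustration of the ResourceManager's role, here is a hedged Java sketch using the YARN client API to ask the RM for the applications currently known to the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        // The global ResourceManager tracks every ApplicationMaster on the cluster.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "\t"
                + app.getName() + "\t" + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}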


Impala

Impala can be defined as a massively parallel processing (MPP) SQL query engine that runs natively on the Apache Hadoop framework. Impala is positioned as part of the Hadoop ecosystem: it shares the same flexible file system (HDFS), metadata, resource management and security frameworks used by the other components of the Hadoop ecosystem. The most important point to note is that Impala is much faster at query processing than Hive. But we should also remember that Impala is meant for querying and analysis over smaller datasets and is designed primarily as an analytics tool that works on processed, structured data.
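
Because Impala speaks the same HiveServer2 protocol as Hive, one common approach is to query it from Java with the Hive JDBC driver against Impala's default port 21050; the host, authentication settings and table below are assumptions for your cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://impalad-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT city, COUNT(*) FROM orders GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}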


