This article addresses Big Data, with a focus on HDInsight, the Hadoop distribution from Microsoft and Hortonworks.
General concepts on Big Data
Why Big Data?
Data, as we usually encounter it, is structured: organized, typed and easy to read.
However, much of the relevant data is actually stored in unstructured form, usually in plain text files.
This lack of structure plays a non-negligible role, not only in the quality of the processed information, but also in processing time. Nowadays, these data sets are so large and heterogeneous that it is difficult to handle them with an "all-in-one" tool or with traditional transactional applications.
In addition, these types of data are often generated continuously; social network exchanges and real-time trading platforms are the best examples.
With the explosion in the volume of information, access to it is becoming slower. For example, reading a terabyte of data can take more than two hours, with transfer rates painfully reaching 100 MB/s on average.
To respond to these issues of data volume and performance, a processing approach began to be discussed in the 1990s: bring the processing to the data, rather than the reverse. Technically, the idea is to parallelize access across multiple disks, each storing a portion of the data, so as to drastically reduce read time.
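The potential gain is easy to estimate. Here is a quick back-of-the-envelope sketch in Python, using the 100 MB/s figure mentioned above (the disk counts are purely illustrative):

```python
# Rough illustration: time to read 1 TB sequentially vs. in parallel.
TB = 10**12               # bytes
THROUGHPUT = 100 * 10**6  # bytes per second per disk (100 MB/s, from the text)

def read_time_seconds(total_bytes, disks=1):
    """Time to read total_bytes when the data is spread evenly
    over `disks` disks that are read in parallel."""
    return total_bytes / (THROUGHPUT * disks)

single = read_time_seconds(TB)         # one disk
parallel = read_time_seconds(TB, 100)  # 100 disks in parallel
print(f"1 disk:    {single / 3600:.1f} hours")    # ~2.8 hours
print(f"100 disks: {parallel / 60:.1f} minutes")  # ~1.7 minutes
```

The same volume of data drops from hours to minutes simply by splitting it across independent disks, which is the intuition behind distributed storage.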
Okay, but what does « Big Data » really mean?
In fact, "Big Data" is a generic (and popular) term used to describe the exponential growth and availability of both structured and unstructured data.
The idea behind this buzzword is to capture and manage large volumes of exogenous or endogenous data, and to analyze them in order to quickly extract meaningful information.
The essence of Big Data revolves around the letter V, which refers to 4 keywords:
- Volume: with the exponential and perpetual growth of various information to store, it is becoming less unusual to have terabytes or even petabytes of data to be processed.
- Velocity: roughly speaking, it refers to the increasingly high frequency of data processing (generation, transformation,…).
- Variety: it refers to heterogeneous data (spatial data, textual data, multimedia data, cryptographic data, …).
- Veracity: it refers to the quality of data.
Big Data allows each organization to access data from any source and analyze them with a good level of performance.
Big Data vs. traditional data
The table below provides some comparisons between Big Data and traditional data:
| | Traditional data | Big Data |
| --- | --- | --- |
| Amount of data | Gigabytes | Terabytes to petabytes |
| Access mode | Interactive (application) | Batch |
| Update pattern | Frequent read and write operations | Occasional write operations, frequent read operations |
The main challenges of Big Data
Continuously processing an exponentially growing volume of data at a high level of performance generates its own set of challenges for Big Data:
- Data quality: optimal processing techniques must be adopted in order to analyze the data and extract meaningful information.
- Loading information: given the large amount of data to be processed, ETL (Extract, Transform, Load) operations can be expensive.
- Storage: to speed up and parallelize access to the information, an ever-increasing number of large storage units has to be used and maintained.
- High availability: to cope with potential data loss or unavailability, replication solutions must be considered.
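To illustrate the last point, here is a back-of-the-envelope sketch of why replication helps availability (plain Python; the 1% node-failure probability and the independence assumption are illustrative, not from the text):

```python
def block_loss_probability(p_node_failure, replication=3):
    """Probability that a given block is lost, i.e. that every node
    holding one of its replicas fails (failures assumed independent)."""
    return p_node_failure ** replication

# With a 1% chance of any single node failing, triple replication
# brings the loss probability of a block down to about one in a million.
print(block_loss_probability(0.01, 1))  # no replication
print(block_loss_probability(0.01, 3))  # triple replication, roughly 1e-06
```

This is exactly the trade-off distributed systems such as Hadoop make: extra storage in exchange for a drastically lower probability of losing data.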
Concrete use cases of Big Data
For some concrete examples of using a Big Data solution in the real world, you can take a look here: http://hadoopilluminated.com/hadoop_book/Hadoop_Use_Cases.html.
Overview of Hadoop and its modules
Hadoop: what is it?
Hadoop is an open source project developed in Java at Apache (by Doug Cutting and his team), inspired by Google's MapReduce, GoogleFS and BigTable, and oriented around the parallel processing of large amounts of data across clusters of servers. It has been designed to scale from a single server to hundreds of distributed machines, with a (very) high tolerance for failures and data loss. Hadoop works on the assumption that hardware failures are not only common, but should be handled by software. For more information, you can take a look here: http://hadoop.apache.org/.
Hadoop’s general architecture
Hadoop has various modules (or features) to ensure reliable operation:
- A distributed file system inspired by GoogleFS: HDFS (Hadoop Distributed File System).
- A MapReduce engine for the parallel processing of data.
- A central resource-management platform called YARN (Yet Another Resource Negotiator), which ensures the consistency and security of processing across Hadoop clusters.
- A column-oriented database for storage: HBase (for Hadoop Database).
- A configuration-management and coordination service for Hadoop, including HBase: ZooKeeper (inspired by Chubby, a Google creation).
- Various platforms for analyzing the data.
The concept of HDFS
As seen in the previous section, HDFS (Hadoop Distributed File System) is the heart of Hadoop. It is a distributed file management system designed according to the following principles:
- Hardware failure is the norm, not the exception.
- Processing is done mostly in batch mode rather than interactively (i.e., through application access).
- Applications handle very large volumes of data (a typical HDFS file is measured in GB, TB or even PB).
- Once a file is created and its data written, the data does not need to be changed (write once, read many).
- Moving the computation is cheaper than moving the data: the idea is to bring the processing to the data rather than the reverse.
- Portability from one platform to another is reinforced.
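To illustrate how a distributed file system such as HDFS splits huge files, here is a minimal sketch in Python (the 128 MB block size is an assumption based on common HDFS defaults; it is not stated in the text):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (64 MB in older versions)

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies; the last block may be partially filled."""
    return math.ceil(file_size_bytes / block_size)

print(block_count(10**12))  # a 1 TB file spans 7451 blocks of 128 MB
```

Each of these blocks can then be stored (and replicated) on a different DataNode, which is what makes the parallel reads of the previous section possible.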
An HDFS architecture is composed of two important types of components (or nodes):
- NameNode: the node where the file system metadata are stored. It resolves, among other things, the correspondence between files and the blocks they use, and on which node each block of data is stored.
- DataNode: a node where the actual data are stored. To signal its activity, each DataNode sends a heartbeat to the NameNode every 3 seconds. DataNodes can also communicate with one another, for example for:
- Rebalancing the amount of stored data.
- Replicating data.
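The heartbeat mechanism described above can be sketched as follows (plain Python; the 10-minute dead-node timeout is an assumption close to Hadoop's historical default, and the class is invented for illustration):

```python
import time

HEARTBEAT_INTERVAL = 3  # seconds, as stated in the text
DEAD_AFTER = 10 * 60    # assumption: declare a DataNode dead after ~10 min of silence

class NameNodeMonitor:
    """Minimal sketch of how a NameNode could track DataNode liveness."""

    def __init__(self):
        self.last_seen = {}  # DataNode id -> timestamp of its last heartbeat

    def heartbeat(self, datanode_id, now=None):
        # Called every HEARTBEAT_INTERVAL seconds by each DataNode.
        self.last_seen[datanode_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        # DataNodes whose last heartbeat is older than the timeout.
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > DEAD_AFTER]
```

A real NameNode also uses these heartbeats to trigger re-replication of the blocks that a dead DataNode was holding.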
Communications are made via TCP/IP, and file storage is distributed across multiple machines. Clients communicate with the cluster using an RPC (Remote Procedure Call) protocol.
Within a Hadoop architecture, the NameNode is a SPOF (Single Point of Failure): if it fails, Hadoop stops working. To address this, a secondary NameNode has been introduced into the Hadoop architecture; it periodically copies the metadata of the main NameNode. Thus, in case of a failure of the main NameNode, the secondary NameNode makes it possible to take over with metadata that are sufficiently up to date.
Hadoop has 2 main task managers:
- The Job Tracker, which handles resource management (tracking resource availability, managing the life cycle and progress of tasks, handling fault tolerance, …). It is the service that distributes MapReduce tasks to specific nodes of a Hadoop cluster.
- The Task Tracker, whose role is to carry out the orders given by the Job Tracker. Each Task Tracker owns a set of slots, i.e. the number of tasks it can accept to execute. Thus, each time the Job Tracker plans to launch a task, it first looks for a free slot, ideally on a node close to the data involved in the MapReduce operation.
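The slot-based assignment can be sketched like this (a simplified model in Python: the preference for data locality reflects Hadoop's general behavior, but the data structures and function are invented for illustration):

```python
def pick_task_tracker(trackers, data_node):
    """Pick a Task Tracker with a free slot, preferring one running
    on the same node as the data (data locality)."""
    local = [t for t in trackers if t["node"] == data_node and t["free_slots"] > 0]
    if local:
        return local[0]
    free = [t for t in trackers if t["free_slots"] > 0]
    return free[0] if free else None

trackers = [
    {"name": "tt1", "node": "node-a", "free_slots": 0},  # local but saturated
    {"name": "tt2", "node": "node-b", "free_slots": 2},  # remote
    {"name": "tt3", "node": "node-a", "free_slots": 1},  # local with a free slot
]
print(pick_task_tracker(trackers, "node-a")["name"])  # tt3
```

Running the task where the data already resides avoids shipping large blocks over the network, which is the whole point of the "bring the processing to the data" principle.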
The concept of MapReduce
Hadoop MapReduce is a platform that facilitates the processing of a large amount of data (> 1TB) in parallel, across several servers (> 100 nodes). It also ensures the scalability of data treatments on hundreds of servers in a Hadoop cluster.
In fact, the term "MapReduce" refers to two distinct tasks:
- A Map task, which takes a set of data and converts it into another set of data where individual elements are split into tuples (key/value pairs).
- A Reduce task, which takes the output of the Map operations and combines the mapped tuples into a smaller set of tuples.
The simple diagram below summarizes the overall process of a MapReduce operation (we assume textual data have to be processed):
Translation from french: Résultats = results.
As seen above, the Shuffle & Sort phase is the heart of MapReduce operations. This step ensures that each reducer works on data sorted by key, after the data have been transferred (shuffled) from the mappers.
Other optional phases exist (Combining, Partitioning,…).
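Putting the Map, Shuffle & Sort and Reduce phases together, the classic word-count example can be sketched in plain Python (no Hadoop involved; the function names are ours):

```python
from collections import defaultdict

def map_task(text):
    # Map: emit a (word, 1) tuple for every word in the input split.
    return [(word, 1) for word in text.split()]

def shuffle_sort(mapped):
    # Shuffle & Sort: group the values by key, so that each reducer
    # receives one key at a time, in sorted order.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_task(key, values):
    # Reduce: combine the mapped tuples into a smaller set of tuples.
    return (key, sum(values))

splits = ["big data big clusters", "data everywhere"]  # two input splits
mapped = [pair for split in splits for pair in map_task(split)]
result = dict(reduce_task(k, v) for k, v in shuffle_sort(mapped))
print(result)  # {'big': 2, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

In a real cluster, each `map_task` call would run on a different DataNode and the Shuffle & Sort phase would move the intermediate tuples over the network to the reducers.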
To summarize, a typical operation of Hadoop MapReduce is as follows:
- A client program submits a job to the Job Tracker.
- The Job Tracker retrieves from the NameNode the location of the data within the DataNodes.
- The Job Tracker places the client program (in general, a compressed JAR file) in an HDFS shared folder.
- The Job Tracker assigns tasks to Task Trackers on the DataNodes where the data to be processed are located.
- Once the client program has been decompressed from the HDFS shared folder, the Task Tracker launches Map tasks on the DataNodes.
- The Task Tracker informs the Job Tracker about its progress.
- When a Map task completes, an intermediate file is created on the Task Tracker's local storage.
- The results of the operations performed by the Map tasks are passed to the Reduce task.
- The Reduce task processes all the mapped data and writes the final result of its operations to HDFS.
- After the Reduce task finishes, the intermediate file is deleted by the Task Tracker.
Introduction to HDInsight
What is HDInsight?
HDInsight is a Big Data solution jointly implemented by Microsoft and Hortonworks. It makes it possible to process, analyze and publish large volumes of structured and unstructured data using the power of Hadoop on server clusters.
Data analysis can be done via various tools such as PowerPivot, PowerView, …
In reality, HDInsight exists in two forms:
- Windows Azure HDInsight, which makes it possible to deploy and use Hadoop clusters on the Microsoft Azure cloud.
- Microsoft HDInsight Server for Windows, which no longer "officially" exists, except within 2 solutions:
The diagram below summarizes the HDInsight ecosystem:
- In green: packages (Apache Mahout is a library of scalable Machine Learning algorithms; RHadoop is a set of R packages for manipulating Hadoop from the R language for statistical processing; Pegasus is used for graph mining).
- In blue: data processors (database, scripting, querying, …).
- In red: the Hadoop core on which HDInsight is built, providing distributed storage (HDFS) and distributed processing (MapReduce).
- In purple: some complementary tools on the HDInsight solution (storage, analysis, reporting, monitoring, security, …).
- In orange: the movement of data (Apache Flume collects, aggregates and routes large data flows; Apache Sqoop transfers bulk data between Hadoop and relational databases; the REST API is a web service that handles HTTP operations).
For the management and storage of data, HDInsight uses Azure Blob Storage as its default file system. It is nevertheless possible to store data in the local HDFS of a node, bearing in mind that the Hadoop clusters are optimized for MapReduce operations.
In terms of job management, HDInsight uses Azure PowerShell.
For further information…
Keep an eye here for other articles about Big Data.
Here are some interesting references: