Focus: Hadoop (Part 1)

A Google Trends graph for “Hadoop” and related technologies shows an interesting picture. Search interest in Hadoop has risen steadily over time and continues to climb, and “Hadoop” and “Big Data” appear to be replacing “Data Mining” as keywords. Hadoop has aided the Big Data analytics that is a buzzword everywhere these days. What was “Big” a few years back seems very “small” now; “Big” keeps becoming “Bigger”, and Hadoop enables us to bridge the gap.

[Image: Google Trends graph of search interest in “Hadoop” and related terms over time]

This brief article (Part 1 in the series) gives an overview of Hadoop: its history, the technology behind it, and future trends.

Hadoop is not new. The underlying technology is used by Google for web indexing and by organizations worldwide for Big Data analytics; it was even used in the Mars rover mission to help determine whether life ever existed on Mars. It is the sheer volume of data to be handled where Hadoop shines, through its cluster-based distributed system.

In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are difficult to put into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is also well addressed by Hadoop.

Hadoop is an open source project from Apache that has rapidly evolved into a major technology movement. It has emerged as the best way to handle massive amounts of data, including not only structured data but also complex, unstructured data.

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. The name Hadoop is not an acronym; it’s a made-up name. Cutting explains how it came about:

“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”

The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo! has played a key role in developing Hadoop for enterprise applications.

Simply put, Hadoop provides a reliable shared storage and analysis system. The storage is provided by the Hadoop Distributed File System (HDFS) and the analysis by the MapReduce framework. These are the main kernel components of Hadoop. However, Hadoop also has several other components, such as:

  • Hive (queries and data summarization)
  • Pig (processing large data sets)
  • HBase (column oriented NoSQL data storage system)
  • ZooKeeper (coordinating processes)
  • Ambari (administration)
  • HCatalog (metadata management service)

HDFS is a filesystem designed for storing very large files reliably, with streaming data access patterns, running on clusters of commodity hardware. As the name implies, HDFS is a distributed filesystem, and hence has all the complications of network-based filesystems, such as consistency and node failures. However, by distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.
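
As a concrete illustration, below is a minimal sketch of writing and reading a file through Hadoop’s Java FileSystem API. The namenode address and file path are placeholders, and the exact configuration key name can differ between Hadoop versions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder cluster address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits it into blocks and replicates
            // each block across datanodes.
            Path file = new Path("/demo/hello.txt");
            FSDataOutputStream out = fs.create(file, true);
            out.writeBytes("Hello, HDFS\n");
            out.close();

            // Stream the file back: the streaming access pattern HDFS favors.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
        }
    }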

MapReduce is a framework for processing “embarrassingly parallel” problems across huge datasets using a large number of computers. It exploits the locality of data to reduce the transmission of data between nodes. As the name implies, it consists of two steps: Map and Reduce. “Map” divides the problem into sub-problems and distributes them across the cluster of nodes, while “Reduce” collects the answers from all the nodes in the cluster and merges the results. MapReduce is not specific to Hadoop, and it has been applied in different schemes to other solutions; at Google, for example, the MapReduce algorithm was used to completely regenerate Google’s index of the World Wide Web.
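
To make the two steps concrete, here is a sketch of the classic word-count job written against Hadoop’s Java MapReduce API: Map emits a (word, 1) pair for every word in its input split, and Reduce sums the counts for each word. Input and output paths are supplied on the command line; treat this as an illustrative sketch rather than a tuned production job.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts gathered for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            // On very old Hadoop versions, use new Job(conf, "word count").
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that the reducer is also registered as a combiner, so counts are pre-aggregated on each node before being sent over the network, another application of the data-locality idea mentioned above.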

The premise of MapReduce is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power: MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to answer before can now be answered, which in turn leads to new questions and new insights. This is what makes Big Data analytics practical.

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you load your organization’s data into Hadoop, the software breaks it into pieces and spreads those pieces across your servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because Hadoop stores multiple copies, data on a server that goes offline or dies can be automatically restored from a known good copy.
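
The number of copies is configurable per cluster. As a sketch, the dfs.replication property in hdfs-site.xml sets how many replicas of each block HDFS maintains; the default is 3.

    <!-- hdfs-site.xml: keep three replicas of every block (the default). -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>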

Despite all the advantages Hadoop provides, there are use cases it does not serve well. These include scenarios where we have:

  • Low-latency access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

Quantcast recently announced the open-sourcing of its Quantcast File System (QFS), which claims to provide better throughput than HDFS. It will be interesting to see how the two compare in performance tests. But Quantcast isn’t the only company to have replaced HDFS: MapR’s commercial distribution of Hadoop uses a proprietary file system, and DataStax Enterprise uses Apache Cassandra in place of HDFS.

In the next parts of this series, we shall talk about the Hadoop components in more detail.

References:

Hadoop: What it is, how it works, and what it can do http://strata.oreilly.com/2011/01/what-is-hadoop.html

What is Apache Hadoop? http://hortonworks.com/what-is-apache-hadoop/

Trends in Big Connectivity: Big Data, Hadoop and Life on Mars http://blogs.datadirect.com/2012/08/trends-in-big-connectivity-big-data-hadoop-and-life-on-mars.html

Quantcast Open Sources Hadoop Distributed File System Alternative http://techcrunch.com/2012/09/27/quantcast-open-sources-hadoop-distributed-file-system-alternative/
