Big Data Tutorials

Big Data Overview

2014-02-09T05:15:00.003-08:00

What is Big Data?

Big Data is a term coined for collections of data assets so large that traditional methods of storage and analysis cannot handle them effectively. The key characteristics that differentiate Big Data from regular data sets are known as the 3 Vs:

Volume
Velocity
Variety

Let's look at each of these in more detail.

Volume
This concerns the sheer size of the data assets that need to be handled. With the overall data available for consumption doubling every year, we need frameworks and solutions that can tackle data assets on the scale of petabytes or exabytes.

The amount of data an average person creates every day through electronic exchanges on mobile phones, PCs, laptops, the internet and so on is a clear sign that the data volumes our traditional relational models could handle are being outgrown. In the new era that has already dawned, volume should not be a bottleneck.

Velocity
The days of batch processing are gone. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.

Modern business models need data to be available at speed, because they rely on real-time data to plan customer interactions. For example, a targeted message could reach a prospective customer as he walks into our store, customized using his location and the preferences gleaned from his social media interactions.
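The difference between the two processing styles can be sketched in a few lines of Python. This is only an illustration: the function names and the sample readings are made up for the example, and a real streaming system would read from a live source rather than a list.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: wait for the whole chunk, then compute the result once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result after every incoming event."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings used only for this example.
events = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_average(events)
stream_results = list(streaming_average(events))
print(batch_result)     # 25.0
print(stream_results)   # [10.0, 15.0, 20.0, 25.0]
```

The batch version produces nothing until all the data has arrived, while the streaming version has a usable answer after the first event.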

Variety
This is the most interesting of the 3 Vs that define Big Data. The earlier assumption that enterprise data resides in a structured (mostly relational) model is undergoing a paradigm shift. The schema-on-write approach, which force-fits incoming data into a predefined relational schema, is giving way to a schema-on-read approach in which all data is received in its raw, original form. Schema definitions are applied only at the time of data provisioning or consumption by business apps or other analytics systems, based on their unique needs.

Today's analytics runs against sensor logs, Twitter data, geospatial maps, handwritten documents, images, scanned documents and more. This calls for a new way of storing and interpreting these unstructured data assets.
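A minimal Python sketch of the schema-on-read idea follows. The raw records, the field names and the `read_with_schema` helper are all hypothetical; the point is only that the data is stored untouched and each consumer applies its own schema when reading.

```python
import json

# Raw events are stored exactly as received; no schema is enforced on write.
raw_store = [
    '{"user": "alice", "lat": 12.97, "lon": 77.59, "text": "hello"}',
    '{"user": "bob", "text": "no location here", "lang": "en"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a consumer-specific schema at read time.

    `schema` maps a field name to a (converter, default) pair. Fields that
    are absent from a record take the default instead of failing the load.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Two consumers read the same raw data through different schemas.
geo_schema = {"user": (str, None), "lat": (float, None), "lon": (float, None)}
text_schema = {"user": (str, None), "text": (str, ""), "lang": (str, "unknown")}

geo_rows = list(read_with_schema(raw_store, geo_schema))
text_rows = list(read_with_schema(raw_store, text_schema))
```

With schema on write, the second record would have been rejected or reshaped when it arrived; here it is kept as-is and each reader decides how to interpret it.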

These phenomena, at work across the world, are driving the collective movement represented by Big Data and the solutions built to tackle it.

Big Data Trends

2014-02-09T05:15:00.000-08:00

The latest trends from the world of Big Data are summarized below for quick reading.

1) More Analytics and Less Guesses
Bringing together knowledge assets as never before, spanning both internal and external data assets, is the breakthrough most businesses were waiting for, and that is exactly what Big Data has achieved. On top of this, the Big Data analytics capabilities being explored by leading analytics platforms pave the way for smarter and faster decision making using the plethora of information now available.

Needless to say, Big Data and its analytics form a trend that will define how business decisions are made in the future.

2) Privacy and Security on Big Data
As data assets grow in volume, it is critical to put the right measures and checkpoints in place to ensure privacy and security over these assets. Most organizations are placing this area high on their priority lists. This trend also overlaps with security in the cloud, another development that is redefining what privacy and security really mean in today's data management landscape.

3) Real Investments on Big Data
Prior years saw mostly POCs (proofs of concept), in which businesses explored the potential of big data. Compared to the investments now being planned, that was just the tip of the iceberg. Having realized the immense capabilities of Big Data and its analytics, businesses have decided to invest heavily in related technology platforms to ensure that they are at the forefront of this move toward smarter, faster decision making.

4) A coalition of Big Data - Cloud - Social - Mobile - NoSQL and Analytics
A convergence is slowly evolving, in which the following workflow of critical knowledge assets is becoming evident:

Social & Mobile platform will continue to be the key data generators
Big Data Platforms leveraging cloud solutions would be the data management solutions
NoSQL and Analytics coupled with the power of in-memory analytics will drive data consumption

Big Data Challenges

2014-02-09T05:14:00.001-08:00

Big Data is here to stay; there are no two ways about it.

At the same time, the unique opportunity that Big Data brings to the current technology landscape also brings a few unique challenges that need to be understood and tackled:

1) Effective Analytics
Merging the traditional structured data assets and the unstructured/alternative structured data assets towards holistic analytics would remain one of the key challenges for Big Data as a Platform.

2) Privacy and Security
The growing volume, variety and velocity of data assets will undermine traditional privacy and security models. Hence it is imperative to develop evolved processes and frameworks to support the changing needs of Big Data.

3) Performance with Schema on Read
The shift from the age-old schema-on-write paradigm, which most current data warehouses follow, to the schema-on-read paradigm promoted by the likes of data lakes in the Big Data world brings a critical challenge: meeting the same performance benchmarks that data retrieval applications used to get from the old warehouses.

4) Co-exist or Replace Data-warehouses?
The key question many Big Data practitioners are raising today is how Big Data platforms should evolve in an enterprise. Should they co-exist with the warehouse and be limited to the role of an all-encompassing landing layer for it? Or should they move ahead and replace the warehouses by directly bridging the gap between the data-generating source systems and the data consumption systems?

5) Survival of a Distributed Model in a relational world
Big Data has become synonymous with the Hadoop data platform as its strongest implementation framework. However, Hadoop's fully distributed mode of functioning will bring intrinsic challenges when business apps and logic accustomed to the relational model are assimilated into the underlying distributed framework of Hadoop, powered by MapReduce as the data processing mode.

Flume Overview

2014-02-09T05:01:00.003-08:00

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Solr Overview

2014-02-09T04:56:00.003-08:00

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Solr™ Features

Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.

Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
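As a small illustration of the HTTP GET query interface, the Python sketch below only builds a query URL and does not contact a server. The host, the collection name `articles` and the helper function are assumptions made for the example; `q`, `rows` and `wt` are standard Solr query parameters sent to the `/select` handler.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query_url(base_url, collection, query, rows=10, fmt="json"):
    """Build a GET URL for Solr's /select handler (nothing is sent)."""
    params = {"q": query, "rows": rows, "wt": fmt}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance and collection name.
url = build_solr_query_url("http://localhost:8983/solr", "articles", "title:hadoop")
print(url)

# Decode the query string again to show what the server would receive.
sent_params = parse_qs(urlsplit(url).query)
```

Because the interface is plain HTTP, the same URL could be issued from a browser, a shell or any programming language with an HTTP client.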

Solr Uses the Lucene™ Search Library and Extends it!

A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
Powerful Extensions to the Lucene Query Language
Faceted Search and Filtering
Geospatial Search with support for multiple points per document and geo polygons
Advanced, Configurable Text Analysis
Highly Configurable and User Extensible Caching
Performance Optimizations
External Configuration via XML
An AJAX based administration interface
Monitorable Logging
Fast near real-time incremental indexing and index replication
Highly Scalable Distributed search with sharded index across multiple hosts
JSON, XML, CSV/delimited-text, and binary update formats
Easy ways to pull in data from databases and XML files from local disk and HTTP sources
Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
Apache UIMA integration for configurable metadata extraction
Multiple search indices

Detailed Features

Schema

Defines the field types and fields of documents
Can drive more intelligent processing
Declarative Lucene Analyzer specification
Dynamic Fields enable on-the-fly addition of new fields
CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
Explicit types eliminate the need for guessing types of fields
External file-based configuration of stopword lists, synonym lists, and protected word lists
Many additional text analysis components including word splitting, regex and sounds-like filters
Pluggable similarity model per field

Query

HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)
Sort by any number of fields, and by complex functions of numeric fields
Advanced DisMax query parser for high relevancy results from user-entered queries
Highlighted context snippets
Faceted Searching based on unique field values, explicit queries, date ranges, numeric ranges or pivot
Multi-Select Faceting by tagging and selectively excluding filters
Spelling suggestions for user queries
More Like This suggestions for given document
Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.
Range filter over Function Query results
Date Math - specify dates relative to "NOW" in queries and updates
Dynamic search results clustering using Carrot2
Numeric field statistics such as min, max, average, standard deviation
Combine queries derived from different syntaxes
Auto-suggest functionality for completing user queries
Allow configuration of top results for a query, overriding normal scoring and sorting
Simple join capability between two document types
Performance Optimizations

Core

Dynamically create and delete document collections without restarting
Pluggable query handlers and extensible XML data format
Pluggable user functions for Function Query
Customizable component based request handler with distributed search support
Document uniqueness enforcement based on unique key field
Duplicate document detection, including fuzzy near duplicates
Custom index processing chains, allowing document manipulation before indexing
User configurable commands triggered on index changes
Ability to control where docs with the sort field missing will be placed
"Luke" request handler for corpus information

Caching

Configurable Query Result, Filter, and Document cache instances
Pluggable Cache implementations, including a lock free, high concurrency implementation
Cache warming in background
When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
Autowarming in background
The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
Fast/small filter implementation
User level caching with autowarming support
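The autowarming idea described in the list above can be sketched with a small LRU cache in Python. This is not Solr's implementation; the `LRUCache` class, the `autowarm` helper and the sample queries are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache (oldest entry first, newest last)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def autowarm(old_cache, new_cache, compute, count):
    """Re-populate new_cache with the `count` most recently used keys.

    Values are recomputed with `compute`, because results cached against
    the old index may no longer be valid for the new one.
    """
    if count <= 0:
        return
    for key in list(old_cache.entries)[-count:]:
        new_cache.put(key, compute(key))


old = LRUCache(capacity=3)
for query in ["hadoop", "pig", "hive"]:
    old.put(query, f"results-v1:{query}")
old.get("hadoop")  # "hadoop" becomes the most recently used entry

new = LRUCache(capacity=3)
autowarm(old, new, compute=lambda q: f"results-v2:{q}", count=2)
print(list(new.entries))  # ['hive', 'hadoop']
```

While the new cache is being filled, the old one can keep serving requests, which mirrors the behaviour described for the current searcher during warming.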

SolrCloud

Centralized Apache ZooKeeper based configuration
Automated distributed indexing/sharding - send documents to any node and it will be forwarded to correct shard
Near Real-Time indexing with immediate push-based replication (also support for slower pull-based replication)
Transaction log ensures no updates are lost even if the documents are not yet indexed to disk
Automated query failover, index leader election and recovery in case of failure
No single point of failure
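The "send documents to any node" behaviour relies on every node applying the same routing rule. The Python sketch below illustrates that idea only; it uses CRC32 with a modulo as a stand-in hash, which is an assumption for the example and not a description of SolrCloud's actual routing code. The node names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in [0, num_shards)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class Node:
    def __init__(self, name, num_shards):
        self.name = name
        self.num_shards = num_shards

    def route(self, doc_id):
        # Every node applies the same deterministic rule, so the target
        # shard does not depend on which node received the document.
        return shard_for(doc_id, self.num_shards)


nodes = [Node("node-a", 4), Node("node-b", 4), Node("node-c", 4)]
targets = {node.route("doc-42") for node in nodes}
```

All three nodes agree on the target shard, so a client can hand the document to whichever node is reachable.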

Admin Interface

Comprehensive statistics on cache utilization, updates, and queries
Interactive schema browser that includes index statistics
Replication monitoring
SolrCloud dashboard with graphical cluster node status
Full logging control
Text analysis debugger, showing result of every stage in an analyzer
Web Query Interface w/ debugging output
Parsed query output
Lucene explain() document score detailing
Explain score for documents outside of the requested range to debug why a given document wasn't ranked higher.

Sqoop Overview

2014-02-09T04:53:00.002-08:00

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
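One way to picture why a parallel import yields multiple files is to divide the table's key range among the parallel tasks, with each task writing its own output file. The Python sketch below shows only that partitioning step; the function name and the numbers are assumptions for illustration and it is not Sqoop's own code.

```python
def split_ranges(min_id, max_id, num_tasks):
    """Divide the inclusive range [min_id, max_id] into contiguous,
    non-overlapping sub-ranges, one per parallel task."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder
        if size == 0:
            continue  # more tasks than rows: skip the empty ones
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# A hypothetical table with primary keys 1..10 imported by 4 tasks.
parts = split_ranges(1, 10, 4)
print(parts)  # [(1, 3), (4, 6), (7, 8), (9, 10)]
```

Each sub-range would correspond to one output file in HDFS, which is why the imported table ends up as a set of files rather than a single one.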

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.

Pig Overview

2014-02-09T04:29:00.001-08:00

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
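The "data flow sequence" style can be imitated in plain Python to show what it looks like: each step takes a data set and hands a new one to the next step. This is an analogy written in Python, not Pig Latin, and the sample records and helper names are made up for the example.

```python
from collections import defaultdict

# (user, url, status) tuples standing in for a loaded log file.
records = [
    ("alice", "/home", 200),
    ("bob", "/home", 404),
    ("alice", "/cart", 200),
    ("carol", "/home", 200),
]


def load(rows):
    return iter(rows)


def filter_rows(rows, predicate):
    return (row for row in rows if predicate(row))


def group_by(rows, key):
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def count_each(groups):
    return {key: len(rows) for key, rows in groups.items()}


# The flow reads top to bottom: load, filter, group, count.
ok = filter_rows(load(records), lambda r: r[2] == 200)
by_url = group_by(ok, key=lambda r: r[1])
hits = count_each(by_url)
print(hits)  # {'/home': 2, '/cart': 1}
```

Because every step is a self-contained transformation over a data set, a system such as Pig is free to decide how to run the steps in parallel.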

Mahout Overview

2014-02-09T04:28:00.000-08:00

The Apache Mahout machine learning library's goal is to build scalable machine learning libraries.

Mahout currently has

User and Item based recommenders
Matrix factorization based recommenders
K-Means, Fuzzy K-Means clustering
Latent Dirichlet Allocation
Singular value decomposition
Logistic regression based classifier
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance java collections (previously colt collections)
A vibrant community

With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.

Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.

Currently Mahout supports mainly three use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
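To make the recommendation use case concrete, here is a single-machine Python sketch based on item co-occurrence counts. It does not use Mahout's API; the users, items and function names are hypothetical, and a production recommender would use a more careful similarity measure.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user behavior: which items each user has interacted with.
baskets = {
    "alice": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "carol": ["book", "desk"],
    "dave": ["book"],
    "erin": ["book", "lamp"],
}


def cooccurrence(user_items):
    """Count how often each ordered pair of items appears together."""
    counts = Counter()
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, user_items, counts, top_n=2):
    """Suggest unseen items that co-occur most with the user's items."""
    owned = set(user_items[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


counts = cooccurrence(baskets)
suggestions = recommend("dave", baskets, counts)
print(suggestions)  # ['lamp', 'desk']
```

Here "lamp" ranks first for dave because it appears alongside "book" for three users, while "desk" does so for two.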

HIVE Overview

2014-02-09T04:24:00.000-08:00

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

HBase Overview

2014-02-09T04:07:00.002-08:00

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

However, HBase has many features that support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles in terms of both storage and processing capacity. An RDBMS can scale well, but only up to a point (specifically, the size of a single database server), and for the best performance it requires specialized hardware and storage devices. HBase features of note are:

Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
Automatic RegionServer failover
Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
Java Client API: HBase supports an easy to use Java API for programmatic access.
Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.
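The automatic sharding item in the list above can be pictured with a toy model in Python: each region owns a contiguous range of row keys and splits in two when it grows too large. The `RegionMap` class and the row-count threshold are assumptions for illustration; this is not how HBase itself decides when to split, and duplicate keys are not handled.

```python
import bisect


class RegionMap:
    """Toy model of a table split into key-range regions."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.start_keys = [""]  # one region initially covers all keys
        self.rows = [[]]        # sorted row keys held by each region

    def region_index(self, row_key):
        # The owning region is the last one whose start key <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.region_index(row_key)
        bisect.insort(self.rows[i], row_key)
        if len(self.rows[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        rows = self.rows[i]
        mid = len(rows) // 2
        # The upper half becomes a new region starting at the middle key.
        self.rows[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])


regions = RegionMap(max_rows=4)
for key in ["a", "b", "c", "d", "e"]:
    regions.put(key)

print(regions.start_keys)  # ['', 'c']
print(regions.rows)        # [['a', 'b'], ['c', 'd', 'e']]
```

After the split, lookups for keys below "c" go to the first region and the rest go to the second, and the two regions could then be hosted on different servers.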

Cassandra Overview

2014-02-09T03:55:00.002-08:00

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.

Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

HDFS Overview

2014-02-09T03:44:00.001-08:00

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail.

The HDFS architecture diagram depicts basic interactions among NameNode, the DataNodes, and the clients. Clients contact NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
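The division of labour between the NameNode and the DataNodes can be sketched as a toy in-memory model in Python. The class and function names below are chosen for the example and do not correspond to the HDFS client API; replication, failures and networking are left out entirely.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Stores only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.files = {}  # path -> ordered list of (block_id, datanode name)

    def add_block(self, path, block_id, datanode_name):
        self.files.setdefault(path, []).append((block_id, datanode_name))

    def get_block_locations(self, path):
        return list(self.files[path])


def write_file(path, data, namenode, datanodes, block_size):
    names = sorted(datanodes)
    for n, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#blk{n}"
        target = names[n % len(names)]  # simple round-robin placement
        datanodes[target].write_block(block_id, data[offset:offset + block_size])
        namenode.add_block(path, block_id, target)


def read_file(path, namenode, datanodes):
    # Step 1: ask the NameNode for metadata. Step 2: read from the DataNodes.
    locations = namenode.get_block_locations(path)
    return b"".join(datanodes[dn].read_block(block_id) for block_id, dn in locations)


datanodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
namenode = NameNode()
write_file("/logs/app.log", b"hello big data", namenode, datanodes, block_size=4)
content = read_file("/logs/app.log", namenode, datanodes)
```

Note that the file bytes never pass through the NameNode object; it only answers the question of where the blocks are.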

The following are some of the salient features that could be of interest to many users.

Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability for large set of distributed applications, is an integral part of Hadoop.
HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
Hadoop is written in Java and is supported on all major platforms.
Hadoop supports shell-like commands to interact with HDFS directly.
The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
- File permissions and authentication.
- Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: an administrative mode for maintenance.
- fsck: a utility to diagnose health of the file system, to find missing files or blocks.
- fetchdt: a utility to fetch DelegationToken and store it in a file on the local system.
- Rebalancer: tool to balance the cluster when the data is unevenly distributed among DataNodes.
- Upgrade and rollback: after a software upgrade, it is possible to rollback to HDFS' state before the upgrade in case of unexpected problems.
- Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of file containing log of HDFS modifications within certain limits at the NameNode.
- Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS. Replaces the role previously filled by the Secondary NameNode, though is not yet battle hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
- Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.

Avro Overview

2014-02-09T03:32:00.002-08:00

Apache Avro™ is a data serialization system.

Avro provides:

Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
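As an illustration, the Python snippet below defines a small record schema as JSON and inspects it with the standard `json` module only. The `User` record, its namespace and its fields are made up for the example; no Avro library is involved.

```python
import json

# A hypothetical Avro record schema, written as ordinary JSON.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)
field_names = [field["name"] for field in schema["fields"]]
print(field_names)  # ['name', 'age']
```

Any language with a JSON parser can read the schema this way, which is the implementation advantage mentioned above.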

Comparison with other systems

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.

Ambari Getting Started

2014-02-09T03:21:00.002-08:00

Follow the installation guide for Ambari 1.4.3.

Note: Ambari currently supports the 64-bit version of the following Operating Systems:

RHEL (Redhat Enterprise Linux) 5 and 6
CentOS 5 and 6
OEL (Oracle Enterprise Linux) 5 and 6
SLES (SuSE Linux Enterprise Server) 11

Ambari Overview

2014-02-09T03:07:00.002-08:00

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

The set of Hadoop components that are currently supported by Ambari includes:

HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

Ambari enables System Administrators to:

Provision a Hadoop Cluster
- Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
- Ambari handles configuration of Hadoop services for the cluster.

Manage a Hadoop Cluster
- Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

Monitor a Hadoop Cluster
- Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
- Ambari leverages Ganglia for metrics collection.
- Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

Ambari enables Application Developers and System Integrators to:

Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.
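A minimal Python sketch of preparing a call to the Ambari REST API is shown below. It only builds the request object and does not send it. The host name, the credentials and the helper function are placeholders, and the `/api/v1/clusters` path and port 8080 are assumptions that should be checked against the documentation for your Ambari version.

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, port=8080):
    """Prepare (but do not send) a GET request listing the clusters."""
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")  # HTTP Basic auth
    return request


# Placeholder host and credentials, for illustration only.
request = build_clusters_request("ambari.example.com", "admin", "admin")
print(request.get_method(), request.full_url)

# Passing `request` to urllib.request.urlopen would perform the call
# against a real Ambari server.
```

An application embedding Hadoop management would wrap calls like this one and parse the JSON responses.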

Hadoop Overview

2014-02-09T02:54:00.003-08:00

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
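The MapReduce programming model itself is small enough to sketch on a single machine. The Python example below runs the classic word count through map, shuffle and reduce steps; it illustrates the model only and does not use Hadoop's API. The function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine the values of one key into a single result."""
    return word, sum(counts)


lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In Hadoop, the map calls and the reduce calls would run on many machines at once, with the framework performing the shuffle between them.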

Other Hadoop-related projects at Apache include:

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
HBase: A scalable, distributed database that supports structured data storage for large tables.
HIVE: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A Scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
