tag:blogger.com,1999:blog-24277012845730351092024-03-06T19:33:54.549-08:00Big Data TutorialsUnknownnoreply@blogger.comBlogger17125tag:blogger.com,1999:blog-2427701284573035109.post-78003282759317019622014-02-09T05:15:00.003-08:002014-02-09T09:58:50.951-08:00Big Data Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Arial, Helvetica, sans-serif;"><b>What is Big Data?</b></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Big Data is a term for collections of data assets so large that traditional methods of storage and analysis fall short of handling them effectively. The key characteristics that differentiate Big Data from regular data sets are known as the 3 Vs:</span><br />
<ul style="text-align: left;">
<li><span style="font-family: Arial, Helvetica, sans-serif;">Volume</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Velocity</span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Variety</span></li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgvT43zUpcJ5IvDwebsI_zPtZ-JG0JMVvD8HI2EcRIHiX9XR9BgD1-rOaAc4_TX5vCe_iaZ7frIo5PXconRjwawyn0-Rq3Z1W2BGVHg4yHl4J6SFTmcmQ52DopZ_IAbOykP-8q1o_4Wh6A/s1600/BigData.001.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgvT43zUpcJ5IvDwebsI_zPtZ-JG0JMVvD8HI2EcRIHiX9XR9BgD1-rOaAc4_TX5vCe_iaZ7frIo5PXconRjwawyn0-Rq3Z1W2BGVHg4yHl4J6SFTmcmQ52DopZ_IAbOykP-8q1o_4Wh6A/s1600/BigData.001.jpg" height="239" width="320" /></a></div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span><span style="font-family: Arial, Helvetica, sans-serif;">Let's look at each of these in more detail.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Volume</b></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">This deals with the sheer size of the data assets that need to be handled. With the overall data available for consumption doubling every year, we need frameworks and solutions that can handle data assets at petabyte or exabyte scale.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">The amount of data an average person creates every day through electronic interactions on mobile phones, PCs, laptops, and the internet is a clear sign that the era of data volumes manageable by traditional relational models is coming to an end. In the new era that has already dawned, volume should not be a bottleneck.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Velocity</b></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">The days of batch processing are gone. In the batch model, one takes a chunk of data, submits a job to the server, and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With new sources of data such as social and mobile applications, the batch process breaks down. The data now streams into the server continuously in real time, and the result is only useful if the delay is very short.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Newer business models depend on how quickly data is made available, because they need real-time data to plan customer interactions. For example, a targeted message could reach a prospective customer as he walks into a store, customized using his location and preferences gathered from sources such as social media interactions.</span><br />
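<span style="font-family: Arial, Helvetica, sans-serif;">The contrast between batch and streaming described above can be illustrated with a minimal Python sketch. It is only an illustration, not a production design: the function name, the arrival times, and the five-second window are all made up for the example. Each event updates the result as soon as it arrives, instead of waiting for a whole batch to finish.</span><br />

```python
from collections import deque


def sliding_window_counts(events, window_seconds):
    """Yield (timestamp, events seen in the trailing window) as each event arrives."""
    window = deque()
    for timestamp in events:
        window.append(timestamp)
        # Drop events that have fallen out of the trailing window.
        while window[0] <= timestamp - window_seconds:
            window.popleft()
        # The result is available immediately, without waiting for a batch job.
        yield timestamp, len(window)


# Hypothetical arrival times in seconds (assumed to be in increasing order).
arrivals = [0, 1, 2, 10, 11, 30]
results = list(sliding_window_counts(arrivals, window_seconds=5))
```

<span style="font-family: Arial, Helvetica, sans-serif;">Because the function is a generator, a consumer can react to every new count as it is produced, which is the property that matters when the result is only useful while it is fresh.</span><br />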
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><b>Variety</b></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">This is the most interesting of the three Vs that define Big Data. The earlier assumption that enterprise data resides in a structured (mostly relational) model is undergoing a paradigm shift. The schema-on-write approach of force-fitting incoming data into a predefined relational schema is giving way to a schema-on-read approach, in which all data is received in its raw form. Schema definitions are applied only at the time of data provisioning or consumption by business apps or other analytics systems, based on their unique needs.</span><br />
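<span style="font-family: Arial, Helvetica, sans-serif;">As a rough illustration of schema on read, the Python sketch below stores raw JSON events untouched and applies a schema only when a consumer reads them. The event fields, the <code>BILLING_SCHEMA</code> name, and the <code>read_with_schema</code> helper are hypothetical and chosen for the example; real platforms implement this idea in their own ways.</span><br />

```python
import json

# Raw events are stored exactly as received; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "channel": "mobile"}',
    '{"user": "u2", "amount": "5", "geo": {"lat": 12.9, "lon": 77.6}}',
    '{"user": "u3"}',
]

# Each consumer declares its own schema: field name -> (converter, default).
BILLING_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time, ignoring fields the consumer does not need."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


billing_rows = list(read_with_schema(RAW_EVENTS, BILLING_SCHEMA))
```

<span style="font-family: Arial, Helvetica, sans-serif;">A different consumer, for example one interested only in the location data, could read the same raw events with a different schema without any change to what was stored.</span><br />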
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Today's analytics runs against sensor logs, Twitter data, geospatial maps, handwritten documents, images, scanned documents, etc. This calls for a new way of storing and interpreting these unstructured data assets.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">These phenomena, at work across the world, are driving the collective movement represented by Big Data and the solutions that tackle it.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
</div>
Big Data Trends (published 2014-02-09)<div dir="ltr" style="text-align: left;" trbidi="on">
The latest trends from the world of Big Data are summarized below.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdrlmCbGA0ysdOxT5ksWY4EnSEHBsW3SW-Hw8HCJZJAm_npzChp5IHvFv_TX5QguPlPU4XZclbV27NiL-ewvbUWFUEFVlyoy6ykAtEPOu91gLO_wSmWz7pnxQPEC4p1j4aSPNdTV3vHQ9b/s1600/Big-Data+Trends.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdrlmCbGA0ysdOxT5ksWY4EnSEHBsW3SW-Hw8HCJZJAm_npzChp5IHvFv_TX5QguPlPU4XZclbV27NiL-ewvbUWFUEFVlyoy6ykAtEPOu91gLO_wSmWz7pnxQPEC4p1j4aSPNdTV3vHQ9b/s1600/Big-Data+Trends.jpg" height="213" width="320" /></a></div>
<br />
<br />
<br />
<b>1) More Analytics and Fewer Guesses</b><br />
Bringing together knowledge assets on an unprecedented scale, spanning both internal and external data assets, is the breakthrough most businesses were waiting for, and that is exactly what Big Data has achieved. On top of this, the Big Data analytics capabilities being explored by leading analytics platforms pave the way for smarter and faster decision making using the wealth of information now available.<br />
<br />
Needless to say, Big Data and its analytics is a trend that will define how business decisions are made in the future.<br />
<br />
<b>2) Privacy and Security on Big Data</b><br />
As data assets grow in volume, it is critical to put the right measures and checkpoints in place to ensure privacy and security over these assets. This is an area that most organizations are placing high on their priority lists. The trend also has synergy with cloud security, another angle that is redefining what privacy and security really mean in today's data management landscape.<br />
<br />
<b>3) Real Investments on Big Data</b><br />
Prior years saw mostly POCs in which businesses explored the potential of Big Data. Compared to the investments now being planned, that was just the tip of the iceberg. Having realized the capabilities of Big Data and its analytics, businesses have decided to invest heavily in related technology platforms to ensure that they are at the forefront of this move toward smarter, faster decision making.<br />
<br />
<b>4) A coalition of Big Data - Cloud - Social - Mobile - NoSQL and Analytics</b><br />
A convergence is slowly emerging in which the following workflow of critical knowledge assets is becoming evident:<br />
<br />
<ul style="text-align: left;">
<li>Social and mobile platforms will continue to be the key data generators</li>
<li>Big Data Platforms leveraging cloud solutions would be the data management solutions</li>
<li>NoSQL and Analytics coupled with the power of in-memory analytics will drive data consumption</li>
</ul>
<br />
<br />
<div>
<br /></div>
<div>
<br /></div>
</div>
Big Data Challenges (published 2014-02-09)<div dir="ltr" style="text-align: left;" trbidi="on">
Big Data is here to stay, and there are no two ways about it.<br />
<br />
At the same time, the unique opportunity that Big Data brings to the current technology landscape also comes with a few unique challenges that need to be understood and tackled:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfx1mDlIoHLDR6czP-Ajc4g7Ddg0cUqlo7ImRkwJAY2z1_m5bWvskRQXPp7u4MY8tAffbTUeoX1DOeA1UyXZq5xMtmCuYARA6DkZ8B7vat58jWAaHs9vU-N3l9QlaHXzqgXfPIUhoCyVMf/s1600/en_challenges-300x175.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfx1mDlIoHLDR6czP-Ajc4g7Ddg0cUqlo7ImRkwJAY2z1_m5bWvskRQXPp7u4MY8tAffbTUeoX1DOeA1UyXZq5xMtmCuYARA6DkZ8B7vat58jWAaHs9vU-N3l9QlaHXzqgXfPIUhoCyVMf/s1600/en_challenges-300x175.jpg" /></a></div>
<br />
<br />
<br />
<b>1) Effective Analytics</b><br />
Merging traditional structured data assets with unstructured or alternatively structured data assets for holistic analytics will remain one of the key challenges for Big Data as a platform.<br />
<br />
<b>2) Privacy and Security</b><br />
The growing volume, variety, and velocity of data assets will undermine traditional privacy and security models. Hence it is imperative to develop evolved processes and frameworks to support the changing needs of Big Data.<br />
<br />
<b>3) Performance with Schema on Read</b><br />
The shift from the age-old schema-on-write paradigm, which most current data warehouses follow, to the schema-on-read paradigm promoted by the likes of data lakes in the Big Data world brings a critical challenge: meeting the same performance benchmarks that data retrieval applications used to get from the old warehouses.<br />
<br />
<b>4) Co-exist or Replace Data-warehouses?</b><br />
The key question many Big Data practitioners are raising today is how Big Data platforms should evolve in an enterprise. Should they co-exist with the warehouse and be limited to the role of an all-encompassing landing layer for it? Or should they move ahead and replace the warehouse by directly bridging the gap between the data-generating source systems and the data consumption systems?<br />
<br />
<b>5) Survival of a Distributed Model in a relational world</b><br />
Big Data has become synonymous with the Hadoop data platform, its strongest implementation framework. However, Hadoop's fully distributed mode of operation brings intrinsic challenges when business apps and logic accustomed to the relational model are moved onto the underlying distributed framework of Hadoop, powered by MapReduce as the data processing mode.<br />
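<br />
To give a feel for the gap between the relational and the MapReduce way of thinking, here is a small single-process Python sketch. It only mimics the map, shuffle and reduce phases in memory and does not use Hadoop itself; the sales rows and function names are invented for the example. It computes what a relational user would write as a GROUP BY with SUM.<br />

```python
from collections import defaultdict

# Hypothetical rows. Relational equivalent:
#   SELECT region, SUM(amount) FROM sales GROUP BY region
SALES = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


def map_phase(row):
    """Emit one (key, value) pair per input row."""
    region, amount = row
    yield region, amount


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


mapped = [pair for row in SALES for pair in map_phase(row)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

What is a single declarative statement in SQL becomes three explicit phases here, which hints at why logic written for the relational model does not carry over to a distributed framework unchanged.<br />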
<br /></div>
Flume Overview (published 2014-02-09)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; font-size: 16px; line-height: 20.799999237060547px; text-align: justify;">
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.</div>
<div style="background-color: white; font-size: 16px; line-height: 20.799999237060547px; text-align: justify;">
<br /></div>
<div style="background-color: white; font-size: 16px; line-height: 20.799999237060547px; text-align: justify;">
<br /></div>
<div>
<div style="background-color: white; font-size: 16px; line-height: 20.799999237060547px; text-align: justify;">
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.</div>
<div style="background-color: white; font-size: 16px; line-height: 20.799999237060547px; text-align: justify;">
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.</div>
</div>
</div>
Solr Overview (published 2014-02-09)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; border: 0px; font-family: 'Lucida Grande', Geneva, Verdana, Arial, Helvetica, sans-serif; font-size: 13px; line-height: 21.450000762939453px; margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">
<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene<span style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: xx-small; margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: super;">TM</span> project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.</div>
<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.</div>
<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
<span style="background-color: transparent; color: #333333; font-size: 3.5em; letter-spacing: -2px; line-height: normal;">Solr</span><span style="background-color: transparent; border: 0px; color: #333333; font-size: xx-small; letter-spacing: -2px; line-height: normal; margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: super;">TM</span><span style="background-color: transparent; color: #333333; font-size: 3.5em; letter-spacing: -2px; line-height: normal;"> </span><span style="background-color: transparent; color: #333333; font-size: 3.5em; letter-spacing: -2px; line-height: normal;">Features</span></div>
</div>
<div style="background-color: white; border: 0px; font-family: 'Lucida Grande', Geneva, Verdana, Arial, Helvetica, sans-serif; font-size: 13px; line-height: 21.450000762939453px; margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">
<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.</div>
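<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
The index-then-query cycle described above might look roughly like the following Python sketch, which only builds the HTTP requests and does not send them. The host, port, and the core name <code>articles</code> are assumptions made for the example, and the exact update path and parameters can vary between Solr versions, so treat this as a starting point rather than a reference.</div>

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for a real installation.
SOLR_BASE = "http://localhost:8983/solr/articles"


def build_update_request(docs):
    """Build the URL, body and headers for indexing documents as JSON over HTTP."""
    url = f"{SOLR_BASE}/update?{urlencode({'commit': 'true'})}"
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


def build_select_url(query, rows=10):
    """Build an HTTP GET URL that asks for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


update_url, update_body, update_headers = build_update_request(
    [{"id": "1", "title": "Big Data Overview"}]
)
select_url = build_select_url("title:overview")
```

<div style="background-color: transparent; border: 0px; outline: rgb(0, 0, 0); padding: 10px; vertical-align: baseline;">
The request pieces can then be handed to any HTTP client, for example <code>urllib.request</code> from the Python standard library, which is what makes Solr easy to use from almost any language.</div>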
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Advanced Full-Text Search Capabilities</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Optimized for High Volume Web Traffic</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Standards Based Open Interfaces - XML, JSON and HTTP</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Comprehensive HTML Administration Interfaces</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Server statistics exposed over JMX for monitoring</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Linearly scalable, auto index replication, auto failover and recovery</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Near Real-time indexing</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Flexible and Adaptable with XML configuration</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Extensible Plugin Architecture</li>
</ul>
<h2 id="solr-uses-the-lucenewzxhzdk2tmwzxhzdk3-search-library-and-extends-it" style="background-color: transparent; border: 0px; font-family: 'Trebuchet MS', Tahoma, Arial, sans-serif; font-size: 22px; font-weight: normal; margin: 0px; outline: rgb(0, 0, 0); padding: 20px 10px 5px; vertical-align: baseline;">
Solr Uses the Lucene<span style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: xx-small; margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: super;">TM</span> Search Library and Extends it!</h2>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Powerful Extensions to the Lucene Query Language</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Faceted Search and Filtering</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Geospatial Search with support for multiple points per document and geo polygons</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Advanced, Configurable Text Analysis</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Highly Configurable and User Extensible Caching</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Performance Optimizations</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">External Configuration via XML</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">An AJAX based administration interface</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Monitorable Logging</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Fast near real-time incremental indexing and index replication</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Highly Scalable Distributed search with sharded index across multiple hosts</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">JSON, XML, CSV/delimited-text, and binary update formats</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Easy ways to pull in data from databases and XML files from local disk and HTTP sources</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Apache UIMA integration for configurable metadata extraction</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Multiple search indices</li>
</ul>
<h2 id="detailed-features" style="background-color: transparent; border: 0px; font-family: 'Trebuchet MS', Tahoma, Arial, sans-serif; font-size: 22px; font-weight: normal; margin: 0px; outline: rgb(0, 0, 0); padding: 20px 10px 5px; vertical-align: baseline;">
Detailed Features</h2>
<h3 id="schema" style="margin: 0px; padding: 0px;">
Schema</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Defines the field types and fields of documents</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Can drive more intelligent processing</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Declarative Lucene Analyzer specification</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Dynamic Fields enable on-the-fly addition of new fields</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Explicit types eliminate the need for guessing types of fields</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">External file-based configuration of stopword lists, synonym lists, and protected word lists</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Many additional text analysis components including word splitting, regex and sounds-like filters</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Pluggable similarity model per field</li>
</ul>
<h3 id="query" style="margin: 0px; padding: 0px;">
Query</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, CSV, binary)</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Sort by any number of fields, and by complex functions of numeric fields</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Advanced DisMax query parser for high relevancy results from user-entered queries</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Highlighted context snippets</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Faceted Searching based on unique field values, explicit queries, date ranges, numeric ranges or pivot</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Multi-Select Faceting by tagging and selectively excluding filters</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Spelling suggestions for user queries</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">More Like This suggestions for given document</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Function Query - influence the score by user specified complex functions of numeric fields or query relevancy scores.</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Range filter over Function Query results</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Date Math - specify dates relative to "NOW" in queries and updates</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Dynamic search results clustering using Carrot2</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Numeric field statistics such as min, max, average, standard deviation</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Combine queries derived from different syntaxes</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Auto-suggest functionality for completing user queries</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Allow configuration of top results for a query, overriding normal scoring and sorting</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Simple join capability between two document types</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Performance Optimizations</li>
</ul>
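<div>
Several of the query features above are driven by request parameters on the HTTP interface. The sketch below builds such a request URL. The host, collection name and field names are hypothetical; the parameter names themselves (q, defType, qf, fq, sort, facet, hl, wt) are standard Solr ones.</div>

```python
from urllib.parse import urlencode

# Hypothetical host, collection and field names, used for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def build_query_url(user_text):
    """Return a Solr select URL for a user-entered search string."""
    params = [
        ("q", user_text),
        ("defType", "dismax"),               # DisMax parser for user-entered queries
        ("qf", "title^2 description"),       # fields to search, title boosted
        ("fq", "added:[NOW-7DAYS TO NOW]"),  # date math relative to NOW
        ("sort", "price asc, score desc"),   # sort by more than one field
        ("facet", "true"),
        ("facet.field", "category"),         # facet on unique field values
        ("hl", "true"),
        ("hl.fl", "description"),            # highlighted context snippets
        ("wt", "json"),                      # response format
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_query_url("solid state drive")
print(url)
```

<div>
Fetching the printed URL from a running Solr instance with a matching schema would return a JSON response containing the results, facet counts and highlighting sections.</div>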
<h3 id="core" style="margin: 0px; padding: 0px;">
Core</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Dynamically create and delete document collections without restarting</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Pluggable query handlers and extensible XML data format</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Pluggable user functions for Function Query</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Customizable component based request handler with distributed search support</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Document uniqueness enforcement based on unique key field</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Duplicate document detection, including fuzzy near duplicates</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Custom index processing chains, allowing document manipulation before indexing</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">User configurable commands triggered on index changes</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Ability to control where docs with the sort field missing will be placed</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">"Luke" request handler for corpus information</li>
</ul>
<h3 id="caching" style="margin: 0px; padding: 0px;">
Caching</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Configurable Query Result, Filter, and Document cache instances</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Pluggable Cache implementations, including a lock free, high concurrency implementation</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Cache warming in background: when a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Autowarming in background: the most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Fast/small filter implementation</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">User level caching with autowarming support</li>
</ul>
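<div>
The autowarming idea can be illustrated with a small LRU cache. This is only a sketch of the concept, not Solr's cache implementation; the class and the regenerate callback are hypothetical names. In Solr itself the number of entries to carry over is set with the autowarmCount attribute on a cache in solrconfig.xml.</div>

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU cache; the most recently used keys sit at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

    def autowarm(self, old_cache, regenerate, count):
        """Re-populate from the `count` most recently used keys of old_cache.

        Values are recomputed with `regenerate`, because results cached
        against the old index may no longer be valid for the new one.
        """
        recent_keys = list(old_cache.items)[-count:] if count > 0 else []
        for key in recent_keys:
            self.put(key, regenerate(key))


old = LRUCache(capacity=3)
for query in ["a", "b", "c"]:
    old.put(query, "old result for " + query)
old.get("a")  # "a" becomes the most recently used key

new = LRUCache(capacity=3)
new.autowarm(old, regenerate=lambda q: "new result for " + q, count=2)
print(list(new.items))  # ['c', 'a']
```

<div>
While the new cache is being filled, the old one can keep serving requests, which is the behaviour described in the list above.</div>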
<h3 id="solrcloud" style="margin: 0px; padding: 0px;">
SolrCloud</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Centralized Apache ZooKeeper based configuration</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Automated distributed indexing/sharding - send documents to any node and they will be forwarded to the correct shard</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Near Real-Time indexing with immediate push-based replication (also support for slower pull-based replication)</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Transaction log ensures no updates are lost even if the documents are not yet indexed to disk</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Automated query failover, index leader election and recovery in case of failure</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">No single point of failure</li>
</ul>
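<div>
To give a feel for how a node can decide which shard a document belongs to, here is a deliberately simplified sketch that hashes the document id and takes it modulo the number of shards. SolrCloud's real router uses its own hash function and per-shard hash ranges, so the functions below are illustrative assumptions, not Solr code.</div>

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(docs, num_shards):
    """Group documents by target shard, as a receiving node might before forwarding."""
    batches = {shard: [] for shard in range(num_shards)}
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return batches


docs = [{"id": "doc-%d" % n} for n in range(10)]
batches = route(docs, num_shards=3)
for shard, batch in batches.items():
    print(shard, [doc["id"] for doc in batch])
```

<div>
Because the hash depends only on the id, any node computes the same target shard for the same document, so a client can send updates to whichever node it likes.</div>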
<h3 id="admin-interface" style="margin: 0px; padding: 0px;">
Admin Interface</h3>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: rgb(0, 0, 0); padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Comprehensive statistics on cache utilization, updates, and queries</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Interactive schema browser that includes index statistics</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Replication monitoring</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">SolrCloud dashboard with graphical cluster node status</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Full logging control</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Text analysis debugger, showing result of every stage in an analyzer</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Web Query Interface with debugging output</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Parsed query output</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Lucene explain() document score detailing</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style-image: url(https://lucene.apache.org/images/bullet.gif); margin: 0px; outline: rgb(0, 0, 0); padding: 0px; vertical-align: baseline;">Explain score for documents outside of the requested range to debug why a given document wasn't ranked higher.</li>
</ul>
</div>
</div>
Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-2427701284573035109.post-72019664661837439992014-02-09T04:53:00.002-08:002014-02-09T04:57:30.106-08:00Sqoop Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="line-height: 19.200000762939453px;">
<span style="font-family: Arial, Helvetica, sans-serif;">Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.</span></div>
<div style="line-height: 19.200000762939453px;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="line-height: 19.200000762939453px;">
<span style="font-family: Arial, Helvetica, sans-serif;">Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.</span></div>
<div style="line-height: 19.200000762939453px;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="font-family: verdana; line-height: 19.200000762939453px;">
With Sqoop, you can <span class="emphasis"><em>import</em></span> data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.</div>
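<div>
As a rough illustration, the sketch below assembles the command line for such an import. The JDBC URL, table name and target directory are hypothetical, and credentials are left out; the options shown (--connect, --table, --target-dir, --num-mappers, --fields-terminated-by) are standard Sqoop import arguments.</div>

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, mappers=4, delimiter=","):
    """Build the argument list for a parallel, delimited-text Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,                # database to read from
        "--table", table,                     # input table
        "--target-dir", target_dir,           # HDFS directory for the output files
        "--num-mappers", str(mappers),        # degree of parallelism
        "--fields-terminated-by", delimiter,  # field separator in the text files
    ]


# Hypothetical connection details, for illustration only.
args = sqoop_import_args(
    "jdbc:mysql://db.example.com/shop", "orders", "/user/demo/orders"
)
print(shlex.join(args))
```

<div>
With four mappers the import would typically leave one output file per map task in the target directory, which is why the result is a set of files rather than a single one.</div>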
<div style="font-family: verdana; line-height: 19.200000762939453px;">
A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.</div>
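<div>
If you do parse the delimited files yourself, a small record type does much the same job as the generated class. The sketch below is a stand-in written for illustration: the OrderRecord name and its three columns are assumptions, and it does not handle escaped or enclosed delimiters.</div>

```python
from dataclasses import dataclass


@dataclass
class OrderRecord:
    """Hypothetical stand-in for the class generated for one table row."""

    order_id: int
    customer: str
    total: float

    @classmethod
    def parse(cls, line, delimiter=","):
        """Build a record from one line of delimited text."""
        order_id, customer, total = line.rstrip("\n").split(delimiter)
        return cls(int(order_id), customer, float(total))

    def to_line(self, delimiter=","):
        """Serialize the record back to its delimited-text form."""
        return delimiter.join([str(self.order_id), self.customer, str(self.total)])


lines = ["1,alice,19.5\n", "2,bob,5.25\n"]
records = [OrderRecord.parse(line) for line in lines]
print(sum(record.total for record in records))  # 24.75
```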
</div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-2427701284573035109.post-5998456421690854232014-02-09T04:29:00.001-08:002014-02-09T04:57:42.631-08:00Pig Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
<strong>Apache Pig</strong> is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.</div>
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:</div>
<ul style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px 25px;">
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Ease of programming.</strong> It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Optimization opportunities.</strong> The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Extensibility.</strong> Users can create their own functions to do special-purpose processing.</li>
</ul>
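<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
To show what a data flow sequence looks like, the sketch below mirrors a four-step Pig Latin script in plain Python, with the corresponding Pig Latin statement in a comment above each step. The relation names, the sample rows and the is_slow helper (standing in for a user-defined function) are made up for illustration, and the Python runs on one machine rather than in parallel.</div>

```python
from collections import defaultdict

# visits = LOAD 'visits.tsv' AS (user:chararray, url:chararray, ms:int);
visits = [
    ("alice", "/home", 120),
    ("bob", "/home", 80),
    ("alice", "/cart", 300),
    ("carol", "/home", 150),
]


def is_slow(row):
    """Plays the role of a user-defined filter function."""
    return row[2] > 100


# slow = FILTER visits BY ms > 100;
slow = [row for row in visits if is_slow(row)]

# by_url = GROUP slow BY url;
by_url = defaultdict(list)
for row in slow:
    by_url[row[1]].append(row)

# counts = FOREACH by_url GENERATE group, COUNT(slow);
counts = {url: len(rows) for url, rows in by_url.items()}
print(counts)  # {'/home': 2, '/cart': 1}
```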
</div>
Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-2427701284573035109.post-12636221898687606762014-02-09T04:28:00.000-08:002014-02-09T04:28:46.448-08:00Mahout Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<h2 style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-weight: normal; line-height: 25.350000381469727px; margin: 0px; outline: 0px; padding: 20px 10px 5px; text-rendering: optimizelegibility; vertical-align: baseline;">
<span style="font-size: x-small;">The goal of the Apache Mahout project is to build scalable machine learning libraries.</span></h2>
<div class="highlights" style="background-color: #dfe9ef; border: 1px solid rgb(238, 238, 238); color: #555555; display: inline; float: right; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; margin: 0px 10px; outline: 0px; padding: 15px; vertical-align: baseline; width: 400px;">
<h4 style="background-color: transparent; border: 0px; color: inherit; font-family: inherit; line-height: 20px; margin: 0px; outline: 0px; padding: 5px 5px 0px; text-rendering: optimizelegibility; vertical-align: baseline;">
Mahout currently has</h4>
<ul style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; list-style: none; margin: 10px; outline: 0px; padding: 0px 0px 0px 10px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">User and Item based recommenders</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Matrix factorization based recommenders</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">K-Means, Fuzzy K-Means clustering</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Latent Dirichlet Allocation</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Singular value decomposition</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Logistic regression based classifier</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Complementary Naive Bayes classifier</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">Random forest decision tree based classifier</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">High performance java collections (previously colt collections)</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; line-height: 20px; list-style-image: url(http://mahout.apache.org/images/highlight-bullet.gif); margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">A vibrant community</li>
</ul>
</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
By scalable we mean:</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
Scalable to reasonably large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to give good performance for non-distributed algorithms as well.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.</div>
<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
Mahout currently supports three main use cases. Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.</div>
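<div style="background-color: white; border: 0px; color: #555555; font-family: Opensans, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; line-height: 22.100000381469727px; outline: 0px; padding: 10px; vertical-align: baseline;">
The recommendation use case can be sketched with a tiny item co-occurrence recommender. This is plain Python written for illustration, not Mahout's API, and the users and items are invented: an item scores higher the more often it was liked together with items the user already likes.</div>

```python
from collections import Counter
from itertools import permutations

# Hypothetical user -> liked items data.
likes = {
    "ann": {"book", "lamp", "pen"},
    "ben": {"book", "pen"},
    "cat": {"book", "lamp", "mug"},
    "dan": {"pen", "mug"},
}


def cooccurrence(likes):
    """Count how often each ordered pair of items is liked by the same user."""
    counts = Counter()
    for items in likes.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, likes, top_n=2):
    """Score unseen items by co-occurrence with the items the user likes."""
    seen = likes[user]
    scores = Counter()
    for (a, b), n in cooccurrence(likes).items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score (descending), then by name, for a deterministic result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("ben", likes))  # ['lamp', 'mug']
```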
<div>
<br /></div>
</div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-2427701284573035109.post-83144900617607126342014-02-09T04:24:00.000-08:002014-02-09T05:25:35.810-08:00HIVE Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="color: #333333; font-family: Helvetica, Arial, 'Liberation Sans', FreeSans, sans-serif; font-size: 16px;">The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. </span><br />
<span style="color: #333333; font-family: Helvetica, Arial, 'Liberation Sans', FreeSans, sans-serif; font-size: 16px;"><br /></span>
<span style="color: #333333; font-family: Helvetica, Arial, 'Liberation Sans', FreeSans, sans-serif; font-size: 16px;">At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.</span></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-41627269101383194692014-02-09T04:07:00.002-08:002014-02-09T04:07:56.578-08:00HBase Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; line-height: 19.200000762939453px;">
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.</div>
<div style="background-color: white; line-height: 19.200000762939453px;">
<br /></div>
<div style="background-color: white; line-height: 19.200000762939453px;">
However, HBase has many features that support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles in both storage and processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance it requires specialized hardware and storage devices. HBase features of note are:</div>
<div style="background-color: white; line-height: 19.200000762939453px;">
<br /></div>
<div class="itemizedlist" style="background-color: white; line-height: 19.200000762939453px; margin: 0px;">
<ul class="itemizedlist" style="line-height: 1.2; margin: 0px 0px 0px 55.390625px;" type="disc">
<li class="listitem">Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.</li>
<li class="listitem">Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.</li>
<li class="listitem">Automatic RegionServer failover</li>
<li class="listitem">Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.</li>
<li class="listitem">MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.</li>
<li class="listitem">Java Client API: HBase supports an easy to use Java API for programmatic access.</li>
<li class="listitem">Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.</li>
<li class="listitem">Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.</li>
<li class="listitem">Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.</li>
</ul>
</div>
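<div>
The automatic sharding item above can be pictured with a toy model: a table is a sorted list of regions, each covering a contiguous range of row keys, and a region splits once it grows past a threshold. This is only a sketch in plain Python, not the HBase client API; <tt>ToyTable</tt>, its methods, and the tiny split threshold are hypothetical.</div>

```python
import bisect


class ToyTable:
    """Toy model of a table split into regions by sorted row-key ranges."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        # start_keys[i] is the first row key served by regions[i];
        # the first region starts at the empty key and covers everything.
        self.start_keys = [""]
        self.regions = [{}]

    def _region_index(self, row_key):
        # The owning region is the last one whose start key <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._region_index(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)

    def _split(self, i):
        # Split at the median key: the upper half moves to a new region.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        upper = {k: v for k, v in self.regions[i].items() if k >= mid}
        self.regions[i] = {k: v for k, v in self.regions[i].items() if k < mid}
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, upper)


table = ToyTable(max_rows_per_region=4)
for n in range(10):
    table.put(f"row{n:02d}", n)
print(table.start_keys)
print(table.get("row07"))
```

<div>
In a real cluster the new regions would also be reassigned across RegionServers; the sketch leaves that step out.</div>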
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-67711673306530879512014-02-09T03:55:00.002-08:002014-02-09T03:57:59.248-08:00Cassandra Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; border: 0px; margin-bottom: 1.5em; padding: 0px; vertical-align: baseline;">
<span style="color: #222222; font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><span style="font-size: 14px; line-height: 21px;">The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.</span></span></div>
<div style="background-color: white; border: 0px; margin-bottom: 1.5em; padding: 0px; vertical-align: baseline;">
<span style="color: #222222; font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;"> Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.</span></div>
<div style="background-color: white; border: 0px; margin-bottom: 1.5em; padding: 0px; vertical-align: baseline;">
<span style="color: #222222; font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;">Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.</span></div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-38792485051842714822014-02-09T03:44:00.001-08:002014-02-09T04:29:05.595-08:00HDFS Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail. </div>
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
<br /></div>
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
<br /></div>
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
The HDFS architecture diagram depicts basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3T_yJy0J53qkuJEP-lP_4XpFibXDqKftFoDv_-DiPzgwPGRKuWLYRMU7kjAor2nITigityl4f-zn-nIfjKubtNIqaYWdv2u3gqgPtNk0JzdtQaCGt0L14D4MiViir2Z34TnjwYqKQxoDz/s1600/hdfsarchitecture.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3T_yJy0J53qkuJEP-lP_4XpFibXDqKftFoDv_-DiPzgwPGRKuWLYRMU7kjAor2nITigityl4f-zn-nIfjKubtNIqaYWdv2u3gqgPtNk0JzdtQaCGt0L14D4MiViir2Z34TnjwYqKQxoDz/s1600/hdfsarchitecture.png" height="221" width="320" /></a></div>
<br />
<br /></div>
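<div>
The split between metadata and data described above can be sketched with a small in-memory model. It is not the HDFS client API: the classes, the 4-byte block size, and the round-robin placement are hypothetical simplifications, and replication is left out. In real HDFS the NameNode, not the client, chooses where blocks go.</div>

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, datanode_name)

    def add_block(self, path, block_id, datanode_name):
        self.block_map.setdefault(path, []).append((block_id, datanode_name))

    def get_block_locations(self, path):
        return list(self.block_map[path])


class DataNode:
    """Stores the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


def write_file(namenode, datanodes, path, data, block_size=4):
    # Split data into fixed-size blocks and spread them round-robin.
    names = sorted(datanodes)
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        target = names[index % len(names)]
        datanodes[target].blocks[block_id] = data[offset:offset + block_size]
        namenode.add_block(path, block_id, target)


def read_file(namenode, datanodes, path):
    # Step 1: ask the NameNode for metadata. Step 2: read bytes from DataNodes.
    locations = namenode.get_block_locations(path)
    return b"".join(datanodes[dn].blocks[bid] for bid, dn in locations)


nn = NameNode()
dns = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
write_file(nn, dns, "/logs/a.txt", b"hello hdfs!")
print(read_file(nn, dns, "/logs/a.txt"))
```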
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
<br /></div>
<div style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 12px; line-height: 1.3em;">
The following are some of the salient features that could be of interest to many users.</div>
<ul style="background-color: white; font-family: Verdana, Helvetica, Arial, sans-serif; font-size: 13px;">
<li style="color: #333333; font-size: 12px;">Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability for a large set of distributed applications, is an integral part of Hadoop.</li>
<li style="color: #333333; font-size: 12px;">HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.</li>
<li style="color: #333333; font-size: 12px;">Hadoop is written in Java and is supported on all major platforms.</li>
<li style="color: #333333; font-size: 12px;">Hadoop supports shell-like commands to interact with HDFS directly.</li>
<li style="color: #333333; font-size: 12px;">The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.</li>
<li style="color: #333333; font-size: 12px;">New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:<ul>
<li>File permissions and authentication.</li>
<li>Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.</li>
<li>Safemode: an administrative mode for maintenance.</li>
<li><tt>fsck</tt>: a utility to diagnose health of the file system, to find missing files or blocks.</li>
<li><tt>fetchdt</tt>: a utility to fetch DelegationToken and store it in a file on the local system.</li>
<li>Rebalancer: tool to balance the cluster when the data is unevenly distributed among DataNodes.</li>
<li>Upgrade and rollback: after a software upgrade, it is possible to roll back to HDFS' state before the upgrade in case of unexpected problems.</li>
<li>Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.</li>
<li>Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS. Replaces the role previously filled by the Secondary NameNode, though it is not yet battle hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.</li>
<li>Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.</li>
</ul>
</li>
</ul>
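<div>
The checkpointing performed by the Secondary NameNode and the Checkpoint node can be pictured as merging a log of edits into a namespace image so that the log can start over empty. The sketch below is a simplified model in plain Python; the <tt>("create", path)</tt> and <tt>("delete", path)</tt> edit format and the function names are hypothetical, not HDFS's on-disk formats.</div>

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a new image and return it with an empty log."""
    namespace = set(image)
    for edit in edit_log:
        apply_edit(namespace, edit)
    return frozenset(namespace), []


image = frozenset({"/data"})
edits = [("create", "/data/a"), ("create", "/data/b"), ("delete", "/data/a")]
image, edits = checkpoint(image, edits)
print(sorted(image), edits)
```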
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-85408194409714068132014-02-09T03:32:00.002-08:002014-02-09T03:32:42.294-08:00Avro Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="section" style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px;">
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Apache Avro™ is a data serialization system.</div>
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Avro provides:</div>
<ul style="margin: 0px; padding: 0px 25px;">
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;">Rich data structures.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;">A compact, fast, binary data format.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;">A container file, to store persistent data.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;">Remote procedure call (RPC).</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;">Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.</li>
</ul>
</div>
<h2 class="h3" style="background-color: white; font-family: 'Trebuchet MS', verdana, arial, helvetica, sans-serif; font-size: 18px; margin: 22px 0px 3px; padding: 0px;">
Schemas</h2>
<div class="section" style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px;">
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Avro relies on <em>schemas</em>. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.</div>
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.</div>
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.</div>
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.</div>
</div>
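<div>
Because schemas are plain JSON, any language with a JSON library can load and inspect one. The sketch below uses only Python's standard <tt>json</tt> module rather than an Avro library; the <tt>User</tt> record and its fields are a hypothetical example.</div>

```python
import json

# A hypothetical record schema written in Avro's JSON schema syntax.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```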
<h2 class="h3" style="background-color: white; font-family: 'Trebuchet MS', verdana, arial, helvetica, sans-serif; font-size: 18px; margin: 22px 0px 3px; padding: 0px;">
Comparison with other systems</h2>
<div class="section" style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px;">
<div style="line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.</div>
<ul style="margin: 0px; padding: 0px 25px;">
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><em>Dynamic typing</em>: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><em>Untagged data</em>: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><em>No manually-assigned field IDs</em>: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.</li>
</ul>
</div>
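<div>
Resolving differences "symbolically, using field names" can be sketched as follows: fields are matched by name, fields only the writer knows are dropped, and fields only the reader knows take the reader's default. This is a simplified illustration in plain Python, not the Avro library; the function <tt>resolve</tt> and the sample schemas are hypothetical, and type promotion and aliases are ignored.</div>

```python
def resolve(record, writer_fields, reader_fields):
    """Project a record written with one schema onto a reader's schema.

    Fields are matched by name: writer-only fields are dropped, and
    reader-only fields take the reader's default value.
    """
    writer_names = {field["name"] for field in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


writer_fields = [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]
reader_fields = [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]
print(resolve({"name": "Ada", "age": 36}, writer_fields, reader_fields))
```

<div>
No numeric field IDs appear anywhere in the sketch; the field names alone carry the correspondence between the two schemas.</div>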
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-77307694394158083872014-02-09T03:21:00.002-08:002014-02-09T03:23:12.931-08:00Ambari Getting Started<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
Follow the <a class="externalLink" href="https://cwiki.apache.org/confluence/display/AMBARI/Instructions+for+installing+Ambari-1.4.3+bits" style="background-image: none; background-position: 100% 50%; background-repeat: no-repeat no-repeat; color: #0088cc; padding-right: 0px; text-decoration: none;" target="_blank">installation guide for Ambari 1.4.3</a>.</div>
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
Note: Ambari currently supports the 64-bit version of the following Operating Systems:</div>
<ul style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 0px 0px 10px 25px; padding: 0px;">
<li style="color: #404040;">RHEL (Red Hat Enterprise Linux) 5 and 6</li>
<li style="color: #404040;">CentOS 5 and 6</li>
<li style="color: #404040;">OEL (Oracle Enterprise Linux) 5 and 6</li>
<li style="color: #404040;">SLES (SuSE Linux Enterprise Server) 11</li>
</ul>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-56103728948690811172014-02-09T03:07:00.002-08:002014-02-09T03:22:48.012-08:00Ambari Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.</div>
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
The set of Hadoop components that are currently supported by Ambari includes:</div>
<div style="background-color: white; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
<span style="font-family: Helvetica Neue, Helvetica, Arial, sans-serif;"><span style="font-size: 14px; line-height: 20px;">HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop</span></span></div>
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
Ambari enables System Administrators to:</div>
<ul style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 0px 0px 10px 25px; padding: 0px;">
<li style="color: #404040;">Provision a Hadoop Cluster<ul style="margin: 0px 0px 0px 25px; padding: 0px;">
<li>Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.</li>
<li>Ambari handles configuration of Hadoop services for the cluster.</li>
</ul>
</li>
</ul>
<ul style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 0px 0px 10px 25px; padding: 0px;">
<li style="color: #404040;">Manage a Hadoop Cluster<ul style="margin: 0px 0px 0px 25px; padding: 0px;">
<li>Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.</li>
</ul>
</li>
</ul>
<ul style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 0px 0px 10px 25px; padding: 0px;">
<li style="color: #404040;">Monitor a Hadoop Cluster<ul style="margin: 0px 0px 0px 25px; padding: 0px;">
<li>Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.</li>
<li>Ambari leverages Ganglia for metrics collection.</li>
<li>Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).</li>
</ul>
</li>
</ul>
<div style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 10px; margin-left: 7px; margin-right: 7px;">
Ambari enables Application Developers and System Integrators to:</div>
<ul style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin: 0px 0px 10px 25px; padding: 0px;">
<li style="color: #404040;">Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.</li>
</ul>
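<div>
As a hedged illustration of calling the REST APIs from an application, the sketch below builds (but does not send) an HTTP request that would list clusters. The <tt>/api/v1/clusters</tt> path, port 8080, HTTP Basic authentication, the host name, and the credentials are all assumptions for the example; check the Ambari documentation for your version before relying on them.</div>

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request that would list clusters."""
    # Assumed endpoint layout; verify against your Ambari version.
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host and credentials, for illustration only.
req = build_clusters_request("ambari.example.com", "admin", "admin")
print(req.full_url)
# To send it against a reachable server: urllib.request.urlopen(req)
```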
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2427701284573035109.post-60016514710717519972014-02-09T02:54:00.003-08:002014-02-09T04:31:32.548-08:00Hadoop Overview<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.</div>
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.</div>
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
The project includes these modules:</div>
<ul style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px 25px;">
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Hadoop Common</strong>: The common utilities that support the other Hadoop modules.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Hadoop Distributed File System (HDFS™)</strong>: A distributed file system that provides high-throughput access to application data.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Hadoop YARN</strong>: A framework for job scheduling and cluster resource management.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><strong>Hadoop MapReduce</strong>: A YARN-based system for parallel processing of large data sets.</li>
</ul>
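<div>
The "simple programming models" mentioned above usually refers to MapReduce. The sketch below imitates its three stages (map, shuffle, reduce) for a word count, in a single Python process. It is not the Hadoop API; the function names are hypothetical, and a real job would run the map and reduce stages in parallel across the cluster.</div>

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big clusters", "data data"]))
```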
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
<br /></div>
<div style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; line-height: 15.360000610351563px; margin-bottom: 1em; margin-top: 0.5em;">
Other Hadoop-related projects at Apache include:</div>
<ul style="background-color: white; font-family: Verdana, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px 25px;">
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>Ambari</u></b>: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>Avro</u></b>: A data serialization system.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>Cassandra</u></b>: A scalable multi-master database with no single points of failure.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>HBase</u></b>: A scalable, distributed database that supports structured data storage for large tables.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>HIVE</u></b>: A data warehouse infrastructure that provides data summarization and ad hoc querying.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>Mahout</u></b>: A scalable machine learning and data mining library.</li>
<li style="margin-bottom: 0.5em; margin-top: 0.5em; padding: 0px 5px;"><b><u>Pig</u></b>: A high-level data-flow language and execution framework for parallel computation.</li>
</ul>
</div>
Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-2427701284573035109.post-54586953128347495502014-02-09T01:53:00.002-08:002014-02-09T05:21:01.664-08:00Images<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguWPPUQdtF0Hc_pJK51SKAiI9YMCHWDE7_5YTsWy3ABGfFtfoz6bXEjpS7DWC1Qk1gLSNj1xlWgEry-ui-Ke-6I9sdOXNCAwcEWo2_hLhwZX2QcRYNuPW4Ol-VdguLOKHMd_CMvs96m0Zo/s1600/HIVE.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguWPPUQdtF0Hc_pJK51SKAiI9YMCHWDE7_5YTsWy3ABGfFtfoz6bXEjpS7DWC1Qk1gLSNj1xlWgEry-ui-Ke-6I9sdOXNCAwcEWo2_hLhwZX2QcRYNuPW4Ol-VdguLOKHMd_CMvs96m0Zo/s320/HIVE.jpg" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKy6YyvGh5aDMeGwdw6tshdn5IreSBPGdG70VifWGdrvOeP8LXyCdO8Ooo2ONtm7VVEDN6KLQPGOC9kE8XMA-hvfI7GQKin7rMIA4mbNmk5q1iKLc5eOs8QsDuFc5nnObmGkFic0Emfibt/s1600/images.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKy6YyvGh5aDMeGwdw6tshdn5IreSBPGdG70VifWGdrvOeP8LXyCdO8Ooo2ONtm7VVEDN6KLQPGOC9kE8XMA-hvfI7GQKin7rMIA4mbNmk5q1iKLc5eOs8QsDuFc5nnObmGkFic0Emfibt/s320/images.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGC7xX6licBxX8x5Mis7jS_xvA4Kfc6wPwJZ_F2I4UBuw-wScPSwLdg5suj5n-HdZdwrSqrgbCL-9yl4bXkOsCysBtdyFVlIuxTGEbWuaEz09DNzJnLWryEh8kc6MOtqspcWCEnfGV2js9/s1600/Big-Data+Trends.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGC7xX6licBxX8x5Mis7jS_xvA4Kfc6wPwJZ_F2I4UBuw-wScPSwLdg5suj5n-HdZdwrSqrgbCL-9yl4bXkOsCysBtdyFVlIuxTGEbWuaEz09DNzJnLWryEh8kc6MOtqspcWCEnfGV2js9/s320/Big-Data+Trends.jpg" /></a></div>Unknownnoreply@blogger.com0