Data Ingestion for Enterprise Data Platforms

May 5, 2014

The Ingestion box in the reference architecture is displayed as the smallest box, yet it is the component that integrates with all the available data sources.  Ingestion tends to be among the most complex and time-consuming tasks, but it is often relegated to a lower priority, which is a big mistake.

We need to prioritize the data sources that generate the most value and ensure we can ingest their data into the Big Data platform for the subsequent “cool” analytics.

In my experience, it is also extremely important to have a robust user interface for the ingestion component.  Otherwise, a series of manual steps can lead to errors and the ingestion of “bad” data, which will diminish the impact of subsequent analytics.
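That validation can be automated.  Here is a minimal sketch (the schema and field names are hypothetical) of checking each incoming record against a declared schema so “bad” data is rejected before it ever reaches the platform:

```python
# Hypothetical ingestion schema: field name -> required Python type.
SCHEMA = {"id": int, "source": str, "value": float}

def validate(record):
    # A record is good only if it has exactly the declared fields,
    # each with the declared type.
    return (set(record) == set(SCHEMA)
            and all(isinstance(record[f], t) for f, t in SCHEMA.items()))

records = [
    {"id": 1, "source": "crm", "value": 9.5},
    {"id": "2", "source": "crm", "value": 1.0},   # bad: id is a string
]
good = [r for r in records if validate(r)]        # only the first record passes
```

A real ingestion layer would also quarantine and report the rejected records rather than silently dropping them.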



Should we accept eventual consistency?

November 3, 2013

The NoSQL solutions available today provide distributed architectures with fault tolerance and scalability. However, to provide these benefits many NoSQL solutions have given up the strong data consistency and isolation guarantees provided by relational databases, coining a new term – “eventually consistent” – to describe their weak data consistency guarantees.

Is this acceptable? Shouldn’t we be demanding something at least close to real-time consistency?

A must-read article by Dave Rosenthal.
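To make the trade-off concrete, here is a toy sketch (not modeled on any real NoSQL API) of the read-your-writes anomaly that an eventually consistent store permits: a write is acknowledged by one replica, and a read against another replica sees stale data until replication catches up.

```python
class EventuallyConsistentStore:
    """Toy two-replica store where replication is deliberately lazy."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []  # replication queue for the second replica

    def write(self, key, value):
        # Acknowledge the write after updating replica 0 only.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def replicate(self):
        # "Eventually": drain the queue to the second replica.
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read("balance", replica=1)   # None: write not yet visible here
store.replicate()
fresh = store.read("balance", replica=1)   # 100: consistent "eventually"
```

A strongly consistent system would block or reroute the read until the write was visible everywhere, which is exactly the latency cost these systems are trading away.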

Object Storage or Block Storage

October 2, 2013

Are you evaluating Object storage?  Have you converted existing applications to utilize Object storage?  What is your experience doing this?

If you are not familiar with Object storage, check this out.

Here is a basic overview of the difference between block and object storage –

It seems like the time has come to consider Object storage.
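For a concrete flavor of the difference, here is a minimal Python sketch (the dict is just a stand-in for an S3-style bucket) contrasting the two access models: block storage exposes a mutable, seekable device you can patch in place, while object storage exposes whole-object PUT/GET on keys.

```python
import io

# Block-style access: overwrite 4 bytes in place at an arbitrary offset.
device = io.BytesIO(b"\x00" * 16)   # stands in for a raw block device
device.seek(8)
device.write(b"DATA")

# Object-style access: no partial update; you replace the whole object
# under its key, every time.
bucket = {}
bucket["logs/2013-10-02.txt"] = b"first version"
bucket["logs/2013-10-02.txt"] = b"entire object rewritten"  # a full PUT, not a patch
```

This is why converting an existing application matters: code that assumed cheap in-place updates has to be restructured around whole-object reads and writes.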



HBase Region splitting not entirely hands free

October 2, 2013

While HBase provides a lot of built-in functionality to manage region splits, it is not sufficient to ensure optimal performance. The number of regions in a table, and how those regions are split, are crucial factors in understanding and tuning your HBase cluster load. You should monitor the load distribution across the regions at all times; if the load distribution changes over time, use manual splitting or set more aggressive region split sizes.

Good article from HortonWorks to help you get started on Region splitting.
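One technique discussed there is pre-splitting a table so load is spread evenly from the start. Here is a minimal sketch of computing evenly spaced hex split keys, similar in spirit to HBase’s HexStringSplit algorithm (this is a simplified illustration, not HBase’s actual code):

```python
def hex_split_keys(num_regions, key_width=8):
    # Divide the hex keyspace [0, 16**key_width) into num_regions equal
    # ranges; the range boundaries are the split keys handed to HBase
    # when the table is created.
    max_key = 16 ** key_width
    return [format(i * max_key // num_regions, f"0{key_width}x")
            for i in range(1, num_regions)]

print(hex_split_keys(4))  # → ['40000000', '80000000', 'c0000000']
```

This only distributes load evenly if your row keys are themselves uniformly distributed over the hex keyspace (e.g., hashed keys); with sequential keys you would still get hot regions.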

MapReduce or Traditional Databases

May 5, 2013

This article does an excellent job of detailing the business use cases where Hadoop is useful and where traditional Database Management systems might be more appropriate.


Monitoring Hadoop MapReduce Applications

April 29, 2013

While users have access to many tools that assist in performing large scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult.  Step #1 is to create a test suite that you can reliably run after every change.
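If your mappers and reducers are written as pure functions, that test suite can be very simple. Here is a sketch (assuming a Hadoop Streaming-style word count, with the shuffle phase simulated locally) of the kind of unit test you can run after every change:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reducer(key, values):
    # Sum the counts for one word.
    return (key, sum(values))

def run_job(lines):
    # Simulate the shuffle phase locally: group mapper output by key,
    # then apply the reducer to each group.
    grouped = defaultdict(list)
    for line in lines:
        for k, v in mapper(line):
            grouped[k].append(v)
    return dict(reducer(k, vs) for k, vs in grouped.items())

assert run_job(["the quick fox", "the fox"]) == {"the": 2, "quick": 1, "fox": 2}
```

Tests like this catch logic regressions cheaply; profiling the same functions on realistic data sizes is then the natural next step for the performance questions above.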

Should I use traditional HPC or Hadoop?

April 12, 2013

Excellent blog by Guident comparing Hadoop with traditional High Performance Computing.  A specific use case of reading large log files is compared, and Hadoop is the winner in terms of performance.

What happens if we have access to traditional HPC hardware?  Should we use Hadoop on HPC?  Check out the excellent article by S. Krishnan on this.  The results are not conclusive, but it is an interesting read.

Bottom line: it appears to depend on the use case.  Has anyone done a more detailed comparison?

Organizing Unstructured Data

January 6, 2013

The organizational approach in enterprise content and records management systems has traditionally relied upon the use of document attribute data as the primary information source for content storage and retrieval.  Attributes such as business units, document types, expiration dates, functional areas, and similar things are common.

In the past few years, content and records management applications have often used a “faceted” taxonomy design to define how the content and/or records are classified.

Is this sufficient?  It might be acceptable in some cases for content management, but falls short of expectations for knowledge management.

Large volumes of knowledge content are often well suited to auto-categorization. The tools and methods most commonly used for auto-categorization are text analytics and image analysis, which can be expensive to implement.  This makes it challenging for small law offices, which also have large volumes of data to contend with, to benefit from the technology.
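To illustrate the cheapest end of the text-analytics spectrum, here is a toy sketch of keyword-based auto-categorization (the category names and keyword lists are hypothetical):

```python
# Hypothetical categories for a small law office, each defined by a
# handful of keywords.
CATEGORIES = {
    "contracts": {"agreement", "party", "clause", "term"},
    "litigation": {"plaintiff", "defendant", "court", "motion"},
}

def categorize(text):
    # Score each category by keyword overlap with the document's words;
    # fall back to "uncategorized" when nothing matches.
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

print(categorize("The plaintiff filed a motion with the court"))  # → litigation
```

Real auto-categorization tools use statistical models rather than hand-written keyword lists, but even this naive approach shows why the technique scales better than manually tagging every document.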

In coming blogs, I will talk about how to deliver inexpensive solutions to the problem.


Minimize need for Data Scientists

December 31, 2012

Anywhere we turn, we read about the shortage of Data Scientists to help us make sense of Big Data.  How do we resolve this bottleneck?

As an analogy, look at Content Management Systems.  In the late 90s everybody wanted a website, and IT expertise was a bottleneck: every new piece of content had to be coded by an IT elite.  We resolved the issue by abstracting the basic needs and making them easy for non-techies.

We need to do this again for Big Data. The industry is crying out for a solution.

Move from Batch Processing to Real Time Serving of Big Data

December 9, 2012

Check out Kiji.

With Kiji, we can use HBase as a real-time data persistence and serving layer for applications.