Organizing Unstructured Data

January 6, 2013

The organizational approach in enterprise content and records management systems has traditionally relied on document attribute data as the primary information source for content storage and retrieval. Attributes such as business unit, document type, expiration date, and functional area are common.

In the past few years, content and records management applications have often used a "faceted" taxonomy design to define how content and records are classified.
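To make the idea concrete, here is a minimal sketch of faceted retrieval: each document carries attribute metadata, and a query is a set of facet constraints. The documents, attribute names, and the `facet_search` helper are all illustrative, not taken from any particular product.

```python
# Illustrative faceted retrieval: documents are attribute dictionaries,
# and a search narrows by any combination of facet values.

documents = [
    {"id": 1, "business_unit": "Legal", "doc_type": "Contract", "year": 2012},
    {"id": 2, "business_unit": "HR", "doc_type": "Policy", "year": 2011},
    {"id": 3, "business_unit": "Legal", "doc_type": "Memo", "year": 2012},
]

def facet_search(docs, **facets):
    """Return documents matching every requested facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in facets.items())]

# Drill down: Legal documents from 2012.
legal_2012 = facet_search(documents, business_unit="Legal", year=2012)
print([d["id"] for d in legal_2012])  # -> [1, 3]
```

The appeal of facets is exactly this kind of drill-down: users combine attributes rather than navigating a single fixed hierarchy.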

Is this sufficient?  It might be acceptable in some cases for content management, but it falls short of expectations for knowledge management.

Large volumes of knowledge content are often well suited to auto-categorization.  The tools most commonly used for auto-categorization are text analytics and image analysis, but these can be expensive to implement.  That puts the technology out of reach for small law offices, which also have large volumes of data to contend with.
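As a taste of how far an inexpensive approach can go, here is a toy keyword-based auto-categorizer. This is a deliberately cheap stand-in for commercial text-analytics tools; the category names and keyword lists are hypothetical examples, not a real legal taxonomy.

```python
# Toy auto-categorizer: score each category by keyword overlap with the
# document text and pick the best-scoring one. Categories and keywords
# below are made-up examples.

CATEGORY_KEYWORDS = {
    "litigation": {"plaintiff", "defendant", "motion", "court"},
    "contracts": {"agreement", "party", "term", "clause"},
    "employment": {"employee", "salary", "termination", "benefits"},
}

def categorize(text):
    """Return the category whose keywords best match the text."""
    words = set(text.lower().split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

print(categorize("The plaintiff filed a motion with the court"))  # -> litigation
```

Real text-analytics products do far more (stemming, phrase detection, statistical models), but even this simple pattern can triage a backlog of documents at essentially no cost.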

In coming blogs, I will talk about how to deliver inexpensive solutions to this problem.



Minimize need for Data Scientists

December 31, 2012

Anywhere we turn, we read about the shortage of Data Scientists to help us make sense of Big Data.  How do we resolve this bottleneck?

As an analogy, look at Content Management Systems.  In the late 90s everybody wanted a website, and IT expertise was the bottleneck – every new piece of content had to be coded by an IT elite.  We resolved the issue by abstracting the basic needs and making them easy for non-techies.

We need to do this again for Big Data. Industry is crying out for a solution.

Move from batch processing to real-time serving of Big Data

December 9, 2012

Check out


With Kiji, we can use HBase as a real-time data persistence and serving layer for applications.

HDFS becoming the de facto standard for Big Data

December 9, 2012

With the growing use of in-memory data grids, the underlying data store no longer needs to be fast.  It does, however, need to be fault-tolerant and scalable.  HDFS fills this requirement nicely.
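The division of labor can be sketched as a read-through cache in front of a slower but durable backing store. The classes below are plain in-process stand-ins I made up for illustration; in a real deployment the grid would be a product like Oracle Coherence or Hazelcast and the backing store would be HDFS.

```python
# Sketch of the pattern: fast in-memory layer for serving, slow durable
# layer for safety. Dicts stand in for the real systems.

class DurableStore:
    """Stand-in for a slow but fault-tolerant store such as HDFS."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class InMemoryGrid:
    """Read-through cache: serve from memory, fall back to the store."""
    def __init__(self, backing):
        self.backing = backing
        self.cache = {}
    def write(self, key, value):
        self.cache[key] = value
        self.backing.put(key, value)   # durability comes from the store
    def read(self, key):
        if key not in self.cache:      # cache miss: take the slow path
            self.cache[key] = self.backing.get(key)
        return self.cache[key]

grid = InMemoryGrid(DurableStore())
grid.write("user:42", {"name": "Ada"})
grid.cache.clear()                     # simulate a node restart / cold cache
print(grid.read("user:42"))            # -> {'name': 'Ada'}
```

The point of the sketch: after the cache is wiped, the read still succeeds because durability lives in the backing store, so that store only has to be reliable, not fast.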

A good explanation of the details is at


When do we need in-memory computing?

December 7, 2012

Corona to the rescue of Hadoop

November 9, 2012

Are batch processing jobs still not meeting user expectations after putting Hadoop in the mix?  Great article with a good analogy regarding bottlenecks while grocery shopping.

Corona divides the job tracker’s responsibilities in two. First, a new manager manages cluster resources and keeps an eye on what’s available in that cluster. At the same time, Corona creates a dedicated job tracker for each job, which means the job tracker no longer has to be tied to the cluster. With Corona, smaller jobs can be processed right on the requester’s own machine.
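The split described above can be sketched in a few lines: a cluster manager that only tracks capacity, and a scheduler that gives small jobs the local shortcut. This is a toy model of the idea, not Corona's actual code; the class names and the size cutoff are mine.

```python
# Toy model of Corona's split: the cluster manager knows only about
# resources, and each job gets its own tracking decision. Small jobs
# bypass the cluster entirely. The cutoff value is hypothetical.

SMALL_JOB_TASKS = 4  # made-up threshold for "run on the requester's machine"

class ClusterManager:
    """Tracks free slots; knows nothing about individual jobs."""
    def __init__(self, slots):
        self.free_slots = slots
    def allocate(self, n):
        granted = min(n, self.free_slots)
        self.free_slots -= granted
        return granted

def schedule(job_tasks, cluster):
    """Decide where a job runs, mimicking the small-job shortcut."""
    if job_tasks <= SMALL_JOB_TASKS:
        return "local"                      # dedicated tracker, no cluster wait
    granted = cluster.allocate(job_tasks)
    return f"cluster:{granted}-slots"

cluster = ClusterManager(slots=10)
print(schedule(2, cluster))   # -> local
print(schedule(8, cluster))   # -> cluster:8-slots
```

Even in this toy form you can see the throughput argument: small jobs never queue behind a single central job tracker, and the resource manager's bookkeeping stays trivial.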

Will this help improve overall throughput?  Looking forward to giving it a dry run.

Hadoop Cluster Management Players

October 2, 2012

Here are some of the competitors for managing Hadoop clusters: