Data Ingestion for Enterprise Data Platforms

May 5, 2014

The Ingestion box in the reference architecture is drawn as the smallest box, yet it is the component that integrates with all of the available data sources. Ingestion tends to be among the most complex and time-consuming tasks, but it is often relegated to a lower priority, which is a big mistake.

Prioritize the data sources that generate the most value, and make sure their data can be ingested into the Big Data platform for the subsequent “cool” analytics.

In my experience, it is also extremely important to have a robust user interface for the ingestion component.  Otherwise, a series of manual steps can lead to errors and to the ingestion of “bad” data, which minimizes the impact of all subsequent analytics.
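To make that concrete, here is a minimal sketch of an automated validation gate that rejects bad records before they reach the platform. All field names and rules below are hypothetical placeholders, not something from a real ingestion product:

```python
# Hypothetical required fields for an incoming record (illustrative only).
REQUIRED_FIELDS = {"source_id", "timestamp", "payload"}

def validate_record(record):
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if "timestamp" in record and not isinstance(record["timestamp"], (int, float)):
        problems.append("timestamp is not numeric")
    return problems

def ingest(records):
    """Split records into (accepted, rejected) instead of loading bad data."""
    accepted, rejected = [], []
    for rec in records:
        problems = validate_record(rec)
        if problems:
            rejected.append((rec, problems))  # quarantine with the reasons
        else:
            accepted.append(rec)
    return accepted, rejected
```

The point is the shape, not the rules: every record either passes all checks or lands in a quarantine with an explanation, so bad data never silently enters the analytics pipeline.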



Top 10 eDiscovery Production errors

December 11, 2012

Which of these errors does your litigation support vendor commit?

I have found that automated QC can address most of these issues.
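As one hedged example of what automated QC can look like, the check below scans a production set for gaps and duplicates in Bates numbering, a common class of production error. The prefix format and sample numbers are hypothetical:

```python
import re

def bates_gaps_and_dupes(bates_numbers):
    """Report gaps and duplicates in a Bates-numbered production set.

    Assumes Bates labels are a non-numeric prefix followed by digits,
    e.g. "ABC0001" (a hypothetical format for illustration).
    """
    nums = sorted(int(re.sub(r"\D", "", b)) for b in bates_numbers)
    # A duplicate is any number equal to its predecessor in sorted order.
    dupes = [n for i, n in enumerate(nums[1:], 1) if n == nums[i - 1]]
    # A gap is any number skipped between consecutive labels.
    gaps = [n for a, b in zip(nums, nums[1:]) for n in range(a + 1, b)]
    return gaps, dupes
```

Running a check like this on every outgoing production catches numbering problems mechanically, before opposing counsel does.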

You can’t run away from metadata

December 7, 2012



Can I afford defensible deletion?

December 6, 2012

Is it hard to justify ROI for defensible deletion?

Here is a tip: try to calculate how much money was spent last year processing and reprocessing useless data for eDiscovery purposes, rejecting it time after time at considerable expense. There’s a big chunk of ROI there.
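The arithmetic behind that tip is simple enough to sketch. Every number below is a hypothetical placeholder for illustration, not a figure from any real matter:

```python
def wasted_spend(useless_gb, passes, cost_per_gb):
    """Cost of processing and reprocessing the same useless data."""
    return useless_gb * passes * cost_per_gb

# e.g. 500 GB of junk data reprocessed 3 times at $100/GB (made-up numbers)
print(wasted_spend(500, 3, 100))  # → 150000
```

Whatever your actual volumes and rates, that product is the spend that defensible deletion would have avoided.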

Getting Value from Machine Learning Frameworks

November 7, 2012

A lot of focus has been placed on delivering automated solutions with no human interaction. However, we need to invest in more than just making machines smarter. We need to train our employees to become more sophisticated consumers of the outputs of their machines. Then the network effect will begin to bring more value out of data than ever before.

The biggest victories in the man-machine framework come when machine learning is appropriately delivered to respect the role of humans.

In the eDiscovery space, the better machine learning products actively learn from documents marked by human reviewers to produce continuously improving results, expediting the review process.
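The loop behind that approach can be sketched in a few lines. The "model" below is a toy keyword scorer, purely illustrative, but the loop shape is the point: score the unreviewed documents, route the most uncertain one to a human, fold the answer back in, repeat:

```python
def train(labeled):
    """'Train' by collecting words seen in responsive vs non-responsive docs.

    Toy stand-in for a real classifier; `labeled` is a list of
    (text, is_responsive) pairs marked by human reviewers.
    """
    responsive_words, other_words = set(), set()
    for text, is_responsive in labeled:
        (responsive_words if is_responsive else other_words).update(text.split())
    return responsive_words, other_words

def score(model, text):
    """Probability-like score in [0, 1] that a document is responsive."""
    responsive_words, other_words = model
    words = set(text.split())
    pos = len(words & responsive_words)
    neg = len(words & other_words)
    return pos / (pos + neg) if pos + neg else 0.5

def most_uncertain(model, unlabeled):
    """Pick the document whose score is closest to 0.5 for human review."""
    return min(unlabeled, key=lambda text: abs(score(model, text) - 0.5))
```

Each pass through the loop spends reviewer time where the model is least sure, which is why the results keep improving as the review proceeds.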

Time to use Predictive Coding, not evaluate and measure

November 3, 2012

We can talk all day long about measuring, researching, and evaluating, and get nowhere.  It is time to start using Predictive Coding technology and reap its benefit of a lower cost of review.

Cloud and Big Data

September 9, 2012

Cloud computing is a boon to big data. Paying by consumption destroys the barriers to entry that prohibit many organizations from playing with large datasets, because there’s no up-front investment. In many ways, big data gives clouds something to do.

SOLR vs ElasticSearch

September 5, 2012

A more comprehensive comparison of Solr and Elasticsearch.

SOLR or ElasticSearch

September 4, 2012

A good article comparing the two open-source search servers built on top of Lucene.

Big Data Investment Map

August 14, 2012