Secret Sauce of Predictive Analytics

January 30, 2012

1 – Understand your domain

Predictive analytics encompasses a variety of statistical techniques that analyze current and historical facts to make predictions about future events. The key is knowing what you’re looking for and where to find it. The best statistician in the world will be useless if they don’t understand the context of the business. It takes exposure to your subject matter to pick this up. Read as much as you can about your business and industry, stay involved in every conversation that’s even tangentially related, and be patient. Domain expertise will follow.

2 – Simplify Data Capture and Analysis – Make it easy

What differentiates good data analysts from the rest? The ability to do simple analysis quickly and easily, with minimal “friction”, so you can do more analysis faster.

      • Set up “one click” access to data.
      • Develop shell scripts that drop you directly into your database as a read-only user
      • Set up libraries to get your data from your database into a clean format in your preferred analysis environment
      • Try to pick consistent time periods for how far back you look at data
      • Memorize your database schema. At the very least, know what the tables are named and roughly how they relate to each other. You’ll know what you can find from the database alone, and you’ll save a ton of time by not running SHOW TABLES and DESCRIBE TABLE X all the time.
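As a rough illustration of the "one click" idea, here is a minimal Python sketch. It assumes a SQLite file as a stand-in for whatever database you actually use, and the helper names `connect_read_only` and `schema_summary` are made up for this example. It opens the database read-only and prints the table and column names in one call, which is a handy way to learn a schema until you know it by heart.

```python
import sqlite3


def connect_read_only(db_path):
    """Open a SQLite database read-only via a URI (writes will fail)."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def schema_summary(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    summary = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        summary[table] = [col[1] for col in columns]
    return summary


# Hypothetical usage (the file name is an assumption):
# conn = connect_read_only("analytics.db")
# for table, columns in schema_summary(conn).items():
#     print(table, columns)
```

With MySQL or another server database the same idea applies, but the connection call and the schema queries would differ.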

3 – Look at lots of Data

The only way to know what “normal” looks like in your data is to look at it, a lot. Be the most voracious consumer of new information and data that you can. Keep finding that next source of information. It doesn’t need to serve any specific investigation of the moment, but it will pay dividends down the line.

Protest supporting Predictive Review

January 30, 2012

Karl Schieneman protesting to bring wider acceptance of predictive coding. 

Thanks Karl

DIY models for eDiscovery

January 26, 2012

Vendors claim that we can do complete eDiscovery with a SaaS model.

To that end, Kroll Ontrack is offering Verve, BIA has TotalDiscovery, and IE Discovery introduced eDiscovery DIY™.

Are these vendors securely & efficiently transferring large amounts of data?

Business Process improvements with Predictive Coding

January 26, 2012

The following are some of the business process improvements of the past couple of years that can be credited to predictive coding technologies:

  1. Acceleration of the review process – law firms are accelerating case development by prioritizing document review, starting with the documents that have the highest relevancy scores and then progressively working back.
  2. Stratified review process – high-scoring documents might be assigned to senior reviewers, while low-scoring, low-potential documents are reviewed by lower-cost contract reviewers. In so doing, the firm can balance risk and cost.
  3. Systematic QA in the litigation process – rather than doing a simple random test, predictive coding allows the firm to compare the software’s relevance scores against the decisions of the human reviewers. QA then focuses on the “discrepancy” documents, where the software and humans did not agree. This allows the firm to systematize the whole quality process.
  4. Early case assessment – predictive coding puts the “assessment” back into ECA by enabling users to zoom in on the most relevant documents and make informed assessments of the winnability and potential cost of the case.
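The first three items can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Doc` record, the 0.7 tier threshold, and the 0.5 relevance cutoff are hypothetical choices for the example, not values taken from any vendor's product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    doc_id: str
    score: float                           # software relevance score, assumed in [0, 1]
    human_relevant: Optional[bool] = None  # reviewer's call, None if not yet reviewed


def prioritize(docs: List[Doc]) -> List[Doc]:
    """Item 1: review the highest-scoring documents first."""
    return sorted(docs, key=lambda d: d.score, reverse=True)


def assign_tier(doc: Doc, threshold: float = 0.7) -> str:
    """Item 2: route high scorers to senior reviewers (threshold is an assumption)."""
    return "senior" if doc.score >= threshold else "contract"


def discrepancies(docs: List[Doc], cutoff: float = 0.5) -> List[Doc]:
    """Item 3: reviewed documents where the software and the human disagree."""
    return [
        d for d in docs
        if d.human_relevant is not None and (d.score >= cutoff) != d.human_relevant
    ]


docs = [
    Doc("a", 0.9, True),
    Doc("b", 0.2, True),    # human says relevant, software says not
    Doc("c", 0.6, False),   # software says relevant, human says not
    Doc("d", 0.1, False),
]
print([d.doc_id for d in prioritize(docs)])
print([d.doc_id for d in discrepancies(docs)])
```

In practice the cutoff and tier threshold would be tuned per matter; the point is only that QA effort concentrates on the disagreement set rather than on a blind random sample.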

Any other business process that has been transformed with predictive coding?

Differentiating predictive coding vendors

January 26, 2012

While the fundamental principles underlying  all predictive coding technologies are similar, there are significant differences between offerings in the market.

These differences are manifest in the validity of the training process, quality and defensibility of results, ability of the tools to quantify outcomes, statistical veracity, and capacity of the tools to verify output and perform quality assurance.

Any other differentiators?

Introduction to HBase

January 24, 2012

Good video on the basics of HBase

eDiscovery trends for 2012

January 23, 2012

A good summary of trends for this year.


Data Volume Estimates AND Conversions

January 19, 2012

In several cases, we need a rough estimate of data size. The following approximate figures are compiled from multiple sources.

Storage Estimates:
CD = 650 MB = 50,000 pages.
DVD = 4.7 GB = 350,000 pages.
DLT Tape = 40/80 GB = 3 to 6 Million pages.
Super DLT Tape = 60/120 GB = 4 to 9 Million pages.

Page Estimates:
1 MB is about 75 pages.
1 GB is about 75,000 pages (pick-up truck full of documents).
Avg. pages per email: 1.5 (100,099 pages per GB).
Avg. pages per Word document: 8 (64,782 pages per GB).
Avg. pages per spreadsheet: 50 (165,791 pages per GB).
Avg. pages per PowerPoint: 14 (17,552 pages per GB).

Email File Estimates:
100 MB .PST file is 900 emails and 300 attachments.
400 MB .PST file is 3,500 emails and 1,200 attachments.
600 MB .PST file is 5,500 emails and 1,600 attachments.
A 1.00 GB .NSF file is 9,000 emails and 3,000 attachments.
A 1.5 GB .NSF file is 13,500 emails and 4,500 attachments.
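To turn the per-GB figures above into a quick estimate, a small Python sketch like the following may help. The function name `estimate_pages` and the file-type keys are assumptions made for this example; the page counts are the ones listed above, with the generic 75,000 pages per GB used for anything not listed.

```python
# Pages-per-GB figures from the estimates above.
PAGES_PER_GB = {
    "email": 100_099,
    "word": 64_782,
    "spreadsheet": 165_791,
    "powerpoint": 17_552,
}
GENERIC_PAGES_PER_GB = 75_000  # fallback for file types not listed


def estimate_pages(gigabytes_by_type):
    """Estimate total pages from a {file_type: size_in_GB} mapping."""
    total = 0.0
    for file_type, gigabytes in gigabytes_by_type.items():
        total += gigabytes * PAGES_PER_GB.get(file_type, GENERIC_PAGES_PER_GB)
    return round(total)


# Example: a collection with 2 GB of email and 1 GB of Word documents.
print(estimate_pages({"email": 2, "word": 1}))
```

These are rules of thumb, so treat the result as an order-of-magnitude figure rather than a count.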

Bits and Bytes Sizes:
8 bits are equal to 1 byte (roughly one character of text).
1,024 bytes are equal to 1 kilobyte (KB).
1,024 kilobytes (KB) are equal to 1 megabyte (MB or Meg).
1,024 megabytes are equal to 1 gigabyte (GB or Gig) (truck full of paper).
1,024 gigabytes are equal to 1 terabyte (TB) (50,000 trees of paper).
1,024 terabytes are equal to 1 petabyte (PB) (250 Billion Pages of Text).
1,024 petabytes are equal to 1 exabyte (EB) (about 1.15 quintillion bytes; the decimal exabyte is 1,000,000,000,000,000,000 bytes).
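The 1,024-based ladder above is easy to encode. The following Python sketch (the helper names `to_bytes` and `human_readable` are made up for this example) converts between units using the binary convention from the table:

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes, using 1,024 per step."""
    return value * 1024 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Format a byte count with the largest unit that keeps the number under 1,024."""
    size = float(num_bytes)
    for unit in UNITS:
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(to_bytes(1, "EB"))                   # bytes in one binary exabyte
print(human_readable(to_bytes(650, "MB")))
```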

Is Predictive coding defensible?

January 18, 2012

A 2008 TREC study analyzing the success of keyword searching indicated that on average, “Boolean keyword search found only 24% of the total number of responsive documents in the target data set.” Since this is the current court-accepted standard, shouldn’t predictive coding only need to beat this standard to come out ahead?
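The 24% figure is a recall measurement: the share of all responsive documents that the search actually found. A minimal Python sketch of that comparison follows; the document counts are hypothetical numbers chosen for illustration, not figures from the TREC study.

```python
def recall(found_responsive, total_responsive):
    """Fraction of all responsive documents that a method retrieved."""
    if total_responsive <= 0:
        raise ValueError("total_responsive must be positive")
    return found_responsive / total_responsive


KEYWORD_BASELINE = 0.24  # average keyword-search recall quoted above

# Hypothetical example: 1,000 responsive documents exist and a method finds 240.
print(recall(240, 1000))

# Hypothetical example: another method finds 700 of the same 1,000.
print(recall(700, 1000) > KEYWORD_BASELINE)
```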




Alternate terms for Predictive Coding

January 18, 2012

The following are some of the terms vendors have used to describe predictive coding:

• Prognostic Data Profiling
• Predictive Ranking
• Relevance Assessment
• Suggestive Coding
• Predictive Categorization
• Automatic Categorization
• “Propagated Coding” or “Replicated Coding”
• Automated Document Categorization

Yikes! Like we need more market confusion.