Discovery in cases involving a high volume of data can often feel like March Madness. A reviewer must winnow a huge field down to the “final four” of documents: those that are actually relevant to the underlying case. As with making picks in the NCAA tournament, different methods are employed with varying degrees of accuracy, time, and use of resources. Whether picking March Madness winners or choosing relevant documents in discovery, putting relevant data in the context of metadata should improve the accuracy and efficiency of the selection process.
A common form of document review occurs in two steps. First, opposing counsel agree on a set of data and the “keyword(s)” with which to search that data. Second, counsel and paralegals for the producing party manually review every page of every document that contains a potentially relevant keyword. This process can be both over-inclusive (capturing documents that contain a keyword but are completely irrelevant) and under-inclusive (missing relevant documents that do not contain the critical keyword). Because of the time involved, the manual portion of this process is rarely something a more experienced litigator can take on. Many researchers have therefore sought to make the discovery of high-volume data more accurate and efficient.
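To see why, consider a purely hypothetical example. The three emails and the keyword below are invented, but they show how a bare keyword filter can sweep in an irrelevant document while missing a plainly relevant one.

```python
# Toy illustration of over- and under-inclusive keyword search.
# The documents and the keyword are hypothetical.
documents = [
    "The merger agreement was signed on Friday.",                      # relevant, no keyword
    "Please review the attached acquisition terms.",                   # relevant, has keyword
    "My daughter's soccer team made a great acquisition at tryouts!",  # irrelevant, has keyword
]
keyword = "acquisition"

# A simple keyword filter, the computerized half of the two-step process.
hits = [doc for doc in documents if keyword in doc.lower()]

# The hit list includes the soccer email (over-inclusive) and misses the
# merger email, which is clearly relevant (under-inclusive).
print(hits)
```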
Southern District of New York Magistrate Judge Andrew Peck recently weighed in on the issue in his Opinion and Order, which cites heavily to his article, “Search, Forward: Will Manual Document Review and Keyword Searches be Replaced by Computer-Assisted Coding?” He lauds computer-assisted coding as a way to find relevant documents more accurately and efficiently in high-volume discovery cases. Computer-assisted coding begins with a human review of a “seed set” of documents (a small subset of representative documents). Based on that “seed set” review, computer software creates an algorithm that analyzes the contents of the selected documents for data and metadata such as time of creation, author, and program type. Once the algorithm is created, the computer can code documents for relevance based on how the “seed set” was coded (the existence of certain data in the context of certain metadata). On the back end, a human can review randomly selected sample sets to confirm that the algorithm produced an accurate search. Because reviewing both the “seed set” and the “sample set” takes far less time than a “keyword” search review, a more senior attorney can control more of the review process even though a computer executes most of it. In short, putting data in context can produce a more accurate, cheaper, and less time-consuming computerized search of documents. That same principle has been used to make more accurate predictions of NCAA Division I men’s basketball games.
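The general workflow can be sketched in a few lines of code. What follows is only an illustration of the idea, using a hypothetical seed set and an off-the-shelf scikit-learn text classifier; it is not any court-approved or vendor-specific system, and a real tool would also fold in metadata features (author, creation time, file type) alongside the document text.

```python
# Minimal sketch of the computer-assisted coding workflow described above.
# All documents, labels, and attorney calls here are hypothetical.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical "seed set": documents a senior attorney has already coded.
seed_docs = [
    "Q3 revenue forecast attached per our call",        # relevant
    "lunch order for the team offsite",                 # not relevant
    "draft of the licensing agreement, see redlines",   # relevant
    "IT notice: password reset required by Friday",     # not relevant
]
seed_labels = [1, 0, 1, 0]  # 1 = relevant, 0 = not relevant

# Train a simple classifier on the seed set.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(seed_docs, seed_labels)

# The computer then codes the rest of the corpus for relevance.
corpus = [
    "updated revenue forecast for Q4",
    "reminder: bring your badge to the offsite",
]
predicted = model.predict(corpus)

# Back-end quality control: a human reviews a random sample of the
# machine-coded documents and the calls are compared.
sample = random.sample(range(len(corpus)), k=2)
human_review = {0: 1, 1: 0}  # hypothetical attorney calls on the sampled docs
agree = sum(int(predicted[i] == human_review[i]) for i in sample)
print(f"machine/human agreement on sample: {agree}/{len(sample)}")
```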
At the MIT Sloan Sports Analytics Conference, Mark Bashuk presented the SevenOvertimesMetric (SOM), a ranking system he created that does not “focus on just the final score of the game” (a single data point, like a “keyword”). Instead, SOM uses an interesting set of algorithms to process play-by-play metadata and produce a ranking based on teams’ behavior at critical times in the game. For instance, Bashuk theorizes that a win by a certain number of points is statistically less significant than understanding the context of that point differential. He notes that a 9-point win attributable to garbage-time buckets in the last two minutes of a blowout is very different from a 9-point win attributable to excellent free-throw shooting at the end of a tight game, where the team in the lead is intentionally fouled. SOM is meant to account for those contextual differences. SOM has done a better job this year at predicting wins and losses than the RPI but has not yet outpaced Las Vegas oddsmakers. That said, SOM is still being developed, and I, for one, would like to see how this theory could apply to the playoff-less (sigh) NCAA Division I football rankings.
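Bashuk’s actual algorithms are not spelled out here, but the intuition behind them can be illustrated with a toy calculation: discount points scored once a game is already decided, so that two 9-point wins with very different contexts no longer look identical. The weighting scheme, thresholds, and play-by-play numbers below are invented for illustration only.

```python
# Toy illustration of context-weighting a margin of victory.
# This is NOT SOM itself; the 120-second / 15-point "decided game" rule
# and the 0.25 discount are arbitrary assumptions.

def context_weighted_margin(plays):
    """plays: list of (seconds_remaining, margin_before_play, margin_change).
    Points scored after the game is effectively decided count less."""
    weighted = 0.0
    for seconds_remaining, margin_before, change in plays:
        decided = seconds_remaining < 120 and abs(margin_before) > 15
        weight = 0.25 if decided else 1.0  # discount garbage-time scoring
        weighted += weight * change
    return weighted

# Two hypothetical 9-point wins with very different contexts.
# Blowout: a 20-point lead trimmed by 11 garbage-time points in the final two minutes.
blowout = [(2400, 0, 20), (110, 20, -11)]
# Tight game: a close contest sealed by late free throws after intentional fouls.
tight_game = [(2400, 0, 2), (300, 2, 3), (60, 5, 4)]

print(context_weighted_margin(blowout))     # ~17.25: the rout looks dominant
print(context_weighted_margin(tight_game))  # 9.0: the clutch win keeps full credit
```

Both games end as 9-point wins, but the weighted margins diverge, which is the contextual distinction the paragraph above describes.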