DOI: 10.1145/3573128.3609341
short-paper
Open access

Technology-Assisted Review for Spreadsheets and Noisy Text

Published: 22 August 2023

Abstract

In a large-scale eDiscovery effort, human assessors participated in a technology-assisted review ("TAR") process employing a modified version of Grossman and Cormack's Continuous Active Learning® ("CAL®") tool to review Excel spreadsheets and poor-quality OCR text (defined as a 30-50% Markov error rate). In the legal industry, these document classes are typically considered unsuitable for TAR and, consequently, are usually subjected to exhaustive manual review. Our results assuage this concern by showing that a CAL TAR process, using feature engineering techniques adapted from spam filtering, can achieve satisfactory results on Excel spreadsheets and noisy OCR text. Our findings are cause for optimism in the legal industry: adding these document classes to TAR datasets will make large reviews more manageable and less costly.
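
The abstract does not spell out the modified CAL tool, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes overlapping character 4-grams as features (one feature type from the spam-filtering literature that tolerates OCR misspellings), a small hand-written logistic regression, and a simulated reviewer. The names `char_ngrams`, `train`, `cal_review`, and `oracle`, and every parameter value, are hypothetical.

```python
import math
import zlib


def char_ngrams(text, n=4, buckets=1 << 16):
    """Hash overlapping character n-grams into a set of feature ids.

    Overlapping n-grams degrade gracefully on noisy text: one wrong
    character only disturbs the few n-grams that contain it.
    """
    text = text.lower()
    return {
        zlib.crc32(text[i:i + n].encode("utf-8")) % buckets
        for i in range(max(len(text) - n + 1, 1))
    }


def score(weights, feats):
    """Linear score of a document under the current model."""
    return sum(weights.get(f, 0.0) for f in feats)


def train(labeled, epochs=20, rate=0.1):
    """Fit logistic regression by stochastic gradient steps.

    `labeled` is a list of (feature_set, label) pairs with label 0 or 1.
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in labeled:
            z = max(-30.0, min(30.0, score(weights, feats)))
            p = 1.0 / (1.0 + math.exp(-z))
            step = rate * (label - p)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + step
    return weights


def cal_review(docs, seed_id, oracle, batch_size=1, budget=None):
    """Simulate a continuous active learning review.

    Each round retrains on all labels so far, ranks the unreviewed
    documents, and sends the top-scoring ones to the reviewer (`oracle`).
    Returns the review order and the collected labels.
    """
    feats = {doc_id: char_ngrams(text) for doc_id, text in docs.items()}
    labels = {seed_id: oracle(seed_id)}
    order = [seed_id]
    limit = len(docs) if budget is None else min(budget, len(docs))
    while len(order) < limit:
        weights = train([(feats[d], labels[d]) for d in order])
        unreviewed = [d for d in docs if d not in labels]
        unreviewed.sort(key=lambda d: score(weights, feats[d]), reverse=True)
        take = min(batch_size, limit - len(order))
        for d in unreviewed[:take]:
            labels[d] = oracle(d)
            order.append(d)
    return order, labels


if __name__ == "__main__":
    # Toy collection: "b" and "d" imitate OCR noise in relevant documents.
    docs = {
        "a": "invoice payment overdue contract breach",
        "b": "inv0ice paymnt overdue c0ntract",
        "c": "lunch menu pizza salad",
        "d": "contrct breach paymemt inv0ice",
        "e": "holiday party schedule lunch",
    }
    relevant = {"a", "b", "d"}
    order, labels = cal_review(docs, "a", lambda d: int(d in relevant))
    print("review order:", order)
```

In this toy run the misspelled documents still share many 4-grams with the seed, so a word-level tokenizer is not needed to rank them early. A production process would also need a stopping rule and a way to estimate recall, both of which are outside this sketch.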



    Information

    Published In

    DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023
    August 2023
    187 pages
    ISBN: 9798400700279
    DOI: 10.1145/3573128
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. CAL
    2. Classification
    3. Continuous Active Learning
    4. Electronic Discovery
    5. IR
    6. Information Retrieval
    7. OCR
    8. Optical Character Recognition
    9. Recall
    10. Spreadsheets
    11. TAR
    12. Technology-Assisted Review
    13. eDiscovery

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    DocEng '23
    DocEng '23: ACM Symposium on Document Engineering 2023
    August 22 - 25, 2023
    Limerick, Ireland

    Acceptance Rates

    DocEng '23 Paper Acceptance Rate 9 of 27 submissions, 33%;
    Overall Acceptance Rate 194 of 564 submissions, 34%


    Bibliometrics

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 422
    • Downloads (Last 12 months): 201
    • Downloads (Last 6 weeks): 24
    Reflects downloads up to 01 Mar 2025
