One of the biggest challenges in dealing with electronic data is estimating the volume of a collection when all that is known is the total number of gigabytes (GB) to process. Because the overall volume has a significant impact on the project as a whole, it is important to understand the factors that drive that estimate.
Means of Measuring
Pages
In many cases, the overall review time and cost for a project can be determined from the total number of pages that will be reviewed and eventually produced. The more you know about the collection, the better this can be estimated. If you can break the total volume down into email data, application data, and non-printable data, you can get a more accurate estimate than you would from volume alone.
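As a rough illustration of why the breakdown matters, the sketch below estimates page counts separately for email and application data. The function name `estimate_pages` and its parameters are hypothetical; the rates are the median files-per-GB and images-per-file figures from the industry benchmark table later in this article, with one image counted as one page. Treat the result as an order-of-magnitude guide, not a prediction.

```python
# Median benchmark figures from the industry survey table in this article.
# Real collections can differ widely from these values.
MEDIAN_FILES_PER_GB = {"email": 22_572, "app": 15_791}
MEDIAN_IMAGES_PER_FILE = {"email": 4, "app": 10}


def estimate_pages(email_gb: float, app_gb: float) -> int:
    """Estimate total pages (images) from GB of email and application data."""
    total = 0.0
    for kind, gb in (("email", email_gb), ("app", app_gb)):
        # files in this category, multiplied by pages per file
        total += gb * MEDIAN_FILES_PER_GB[kind] * MEDIAN_IMAGES_PER_FILE[kind]
    return round(total)


# Hypothetical 10 GB collection: 6 GB of email and 4 GB of application files.
pages = estimate_pages(email_gb=6, app_gb=4)
print(f"Estimated pages: {pages:,}")
```

A product of medians is not itself a median, so this per-type figure can differ noticeably from the table's direct images-per-GB benchmark. Comparing the two is a useful sanity check.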
Number of Documents
Another important driver of document review effort is the number of documents that will be reviewed, so an estimate of this number is valuable. Although there are quick ways to count the documents in a collection, it is harder to quickly identify how many documents will be removed by the culling process.
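When only the size of the collection is known, a pre-culling document count can be bracketed rather than guessed at as a single number. The sketch below is one possible approach, not an established method: `document_count_range` is a hypothetical helper, and the low, median, and high files-per-GB rates are taken from the industry benchmark table later in this article.

```python
# (low, median, high) files per GB, from the benchmark table in this article.
FILES_PER_GB = {
    "email": (9_934, 22_572, 36_530),
    "app": (7_553, 15_791, 20_305),
}


def document_count_range(email_gb: float, app_gb: float) -> dict[str, int]:
    """Return low/median/high document-count estimates before any culling."""
    sizes = {"email": email_gb, "app": app_gb}
    result = {}
    for i, label in enumerate(("low", "median", "high")):
        # Sum the estimated files across both data types for this scenario.
        result[label] = round(sum(sizes[k] * FILES_PER_GB[k][i] for k in sizes))
    return result


# Hypothetical collection: 3 GB of email and 2 GB of application files.
print(document_count_range(email_gb=3, app_gb=2))
```

The spread between the low and high figures is wide, which is a reminder that the count should be replaced with an actual inventory as soon as one is available.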
Culling Rate
The amount of deduplication can vary greatly depending on the nature of the data (backups, live data, or a combination), the scope of the deduplication (within or across custodians), and the custodians' retention habits.
Searching and filtering is another important aspect to consider when estimating the overall volume that will be delivered for review. Depending on the number of terms and the nature of the documents, the results can vary greatly.
Nonprintable Files
Nonprintable files are documents that, in general, will not be delivered or reviewed. It is therefore important to exclude them from the document, GB, and page estimates to yield more accurate results.
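The culling steps and the nonprintable exclusion described above can be combined into a simple funnel. The sketch below rests on several assumptions that the source does not state: each rate is read as the share of documents removed at that step, the steps are applied one after another to whatever remains, and the order is nonprintable files, then deduplication, then searching/filtering. The function `documents_for_review` is hypothetical, and its defaults are the median rates from the benchmark table later in this article.

```python
def documents_for_review(
    total_documents: int,
    nonprintable_rate: float = 0.05,  # median share of non-printable files
    dedup_rate: float = 0.21,         # median deduplication rate
    filter_rate: float = 0.61,        # median searching/filtering rate
) -> int:
    """Apply each culling rate in turn to the documents that remain."""
    remaining = float(total_documents)
    for rate in (nonprintable_rate, dedup_rate, filter_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate must be between 0 and 1, got {rate}")
        # Assumption: the rate is the fraction removed at this step.
        remaining *= 1.0 - rate
    return round(remaining)


# Hypothetical starting point of 200,000 documents.
print(documents_for_review(200_000))
```

Because the rates multiply, a modest change in any single step shifts the review population noticeably, so it is worth rerunning the estimate with the low and high rates as well.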
Industry Benchmark Survey
The table below lists industry averages that can be used as guidance when estimating a document collection:
| Benchmark | High | Median | Low |
|---|---|---|---|
| Images[1] per GB | 78,671 | 47,213 | 18,534 |
| Images per file email | 11 | 4 | 2 |
| Images per file app files | 63 | 10 | 3 |
| Files per GB email | 36,530 | 22,572 | 9,934 |
| Files per GB app files | 20,305 | 15,791 | 7,553 |
| GB per custodian email | 5 | 2 | 1 |
| GB per custodian app files | 4 | 1 | 0 |
| Culling Rate Percentages | | | |
| Deduplication | 51% | 21% | 6% |
| Searching/Filtering | 64% | 61% | 23% |
| Non-printable files | 22% | 5% | 2% |
| Processing Speeds | | | |
| Process time per GB native | 117 | 33 | 11 |
| Process time per GB image | 35 | 32 | 23 |
| Process time to first deliverable | 53 | 35 | 21 |
| Process time by file type | 4 | 3 | 2 |
| Process time by file type | 6 | 4 | 3 |
| Process time by file type | 2 | 3 | 2 |
| Quality | | | |
| First pass quality yield %[2] | 57% | 78% | 73% |
Footnotes
[1] Images are counted one per page, so a 4-page multi-page TIFF would count as 4 images.
[2] The percentage of data that runs through without intervention or exception handling.
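When not even the total GB is known yet, the per-custodian benchmarks above give a starting point. The sketch below chains GB per custodian into images per GB for a low, median, or high scenario. The `Scenario` class and `estimate_from_custodians` function are hypothetical, and pairing all the low figures together (and all the high figures together) is an assumption made to bracket the range, not something the survey states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    email_gb_per_custodian: float
    app_gb_per_custodian: float
    images_per_gb: int


# Values copied from the benchmark table; the low app figure is a rounded 0.
SCENARIOS = {
    "low": Scenario(1, 0, 18_534),
    "median": Scenario(2, 1, 47_213),
    "high": Scenario(5, 4, 78_671),
}


def estimate_from_custodians(custodians: int, scenario: str = "median"):
    """Return (total GB, total images) for a number of custodians."""
    s = SCENARIOS[scenario]
    total_gb = custodians * (s.email_gb_per_custodian + s.app_gb_per_custodian)
    return total_gb, round(total_gb * s.images_per_gb)


# Hypothetical matter with 25 custodians.
for name in SCENARIOS:
    gb, images = estimate_from_custodians(25, name)
    print(f"{name}: {gb} GB, {images:,} images")
```

These figures describe the collection before culling, so they are best used for sizing storage and processing rather than review.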
Paper-to-Electronic Estimate Conversion Table
| Boxes of Documents | Approximate Total Pages | Megabytes, Gigabytes, Terabytes |
|---|---|---|
| 1 | 2,500 | 50 Megabytes |
| 10 | 25,000 | 500 Megabytes |
| 20 | 50,000 | 1 Gigabyte |
| 100 | 250,000 | 5 Gigabytes |
| 200 | 500,000 | 10 Gigabytes |
| 300 | 750,000 | 15 Gigabytes |
| 400 | 1,000,000 | 20 Gigabytes |
| 500 | 1,250,000 | 25 Gigabytes |
| 1,000 | 2,500,000 | 50 Gigabytes |
| 2,000 | 5,000,000 | 100 Gigabytes |
| 5,000 | 12,500,000 | 250 Gigabytes |
| 10,000 | 25,000,000 | 500 Gigabytes |
| 20,000 | 50,000,000 | 1 Terabyte |
| 40,000 | 100,000,000 | 2 Terabytes |
| 60,000 | 150,000,000 | 3 Terabytes |
Source: EDRM (edrm.net)
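Every row of the paper-to-electronic conversion table follows the same ratio as its first row: 2,500 pages and 50 megabytes per box, with decimal steps of 1,000 between megabytes, gigabytes, and terabytes. The sketch below reproduces that arithmetic for box counts the table does not list. The function name `boxes_to_estimate` is hypothetical, and the constant ratio is inferred from the table rather than stated by the source.

```python
PAGES_PER_BOX = 2_500  # from the table's first row
MB_PER_BOX = 50        # from the table's first row


def boxes_to_estimate(boxes: int) -> tuple[int, str]:
    """Return (pages, human-readable size) for a number of document boxes."""
    pages = boxes * PAGES_PER_BOX
    megabytes = boxes * MB_PER_BOX
    # The table uses decimal steps: 1,000 MB per GB and 1,000 GB per TB.
    if megabytes >= 1_000_000:
        size = f"{megabytes / 1_000_000:g} TB"
    elif megabytes >= 1_000:
        size = f"{megabytes / 1_000:g} GB"
    else:
        size = f"{megabytes} MB"
    return pages, size


# A few rows from the table, plus one box count it does not list (750).
for boxes in (1, 20, 750, 20_000):
    print(boxes, boxes_to_estimate(boxes))
```

The ratio works out to roughly 20 KB per page, which is an approximation for scanned paper and should not be applied to native electronic files.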