I’ve recently switched to Linux (Ubuntu 8.10) as my main operating system. I find it’s a more effective workspace for most of my tasks. Check it out if you haven’t already; Linux really is growing up. I do keep Windows around for a couple of tasks, mainly gaming, but Linux is closing the gap on that, too, through the latest releases of Wine.
One thing I’ve noticed, though, and haven’t been able to pin down a reason for, is that PDF files generated in Linux seem larger than those generated in Windows. I know this is a somewhat generic statement: on either operating system, the result depends on the software doing the compression. Yet there seems to be a consistent discrepancy between the two when it comes to PDF file sizes. Looking around online, my observations seem to be somewhat validated. A popular solution on forums is to use the DjVu format instead, but I’d prefer sticking with the fairly universal PDF file format. To its credit, DjVu seems to match or beat PDF when it comes to black-and-white documents, but it falls behind in grayscale.
So I ran a little test, scanning the front page of my offer letter for my new job. It consists of a company logo at the top and a full page of text. It is somewhat indicative of what I archive. All scans were done in black-and-white or grayscale. Results (file size in bytes):
18474 150dpiLinuxDjVu-BW.djvu
241812 150dpiLinuxDjVu-Gray.djvu
55298 150dpiLinuxLZW-BW.pdf
813876 150dpiLinuxLZW-Gray.pdf
50213 150dpiWin-BW.pdf
29172 150dpiWinG4-BW.tif
34410 150dpiWinG4-Gray.tif
58947 150dpiWin-Gray.pdf
47280 150dpiWinLZW-BW.tif
1304736 150dpiWinLZW-Gray.tif
29229 300dpiLinuxDjVu-BW.djvu
688967 300dpiLinuxDjVu-Gray.djvu
113726 300dpiLinuxLZW-BW.pdf
2670089 300dpiLinuxLZW-Gray.pdf
81978 300dpiWin-BW.pdf
59188 300dpiWinG4-BW.tif
73842 300dpiWinG4-Gray.tif
114967 300dpiWin-Gray.pdf
5024631 300dpiWin-Gray-300dpiPDF.pdf
5024632 300dpiWin-Gray-600dpiPDF.pdf
5040863 300dpiWin-GrayThenPDF.pdf
8955576 300dpiWin-Gray.tif
132170 300dpiWinLZW-BW.tif
5577814 300dpiWinLZW-Gray.tif
759067 CNNLinux.pdf
237794 CNNWin600dpi.pdf
In order of size:
18474 150dpiLinuxDjVu-BW.djvu
29172 150dpiWinG4-BW.tif
29229 300dpiLinuxDjVu-BW.djvu
34410 150dpiWinG4-Gray.tif
47280 150dpiWinLZW-BW.tif
50213 150dpiWin-BW.pdf
55298 150dpiLinuxLZW-BW.pdf
58947 150dpiWin-Gray.pdf
59188 300dpiWinG4-BW.tif
73842 300dpiWinG4-Gray.tif
81978 300dpiWin-BW.pdf
113726 300dpiLinuxLZW-BW.pdf
114967 300dpiWin-Gray.pdf
132170 300dpiWinLZW-BW.tif
237794 CNNWin600dpi.pdf
241812 150dpiLinuxDjVu-Gray.djvu
688967 300dpiLinuxDjVu-Gray.djvu
759067 CNNLinux.pdf
813876 150dpiLinuxLZW-Gray.pdf
1304736 150dpiWinLZW-Gray.tif
2670089 300dpiLinuxLZW-Gray.pdf
5024631 300dpiWin-Gray-300dpiPDF.pdf
5024632 300dpiWin-Gray-600dpiPDF.pdf
5040863 300dpiWin-GrayThenPDF.pdf
5577814 300dpiWinLZW-Gray.tif
8955576 300dpiWin-Gray.tif
Make note of the file extensions; there are actually three different file types in those listings. The file names lead with resolution, with the exception of the two starting with “CNN.” Those two were PDFs created by printing cnn.com’s cover page to PDF in Linux and Windows (using PDF Creator). The cover page contained slightly different content but not enough to explain the file size difference. After the resolution in the file name comes the operating system, followed by the compression algorithm where applicable. Immediately after the hyphen is the grayscale/black-and-white indicator, and where there is a second hyphen, it indicates the file was post-processed with a PDF printer at the stated resolution.
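For anyone who wants to slice the listings by resolution or operating system, the naming scheme is regular enough to parse. Here is a minimal Python sketch; the pattern, the field names and the function `parse_scan_name` are my own illustration, not part of any scanning tool. Names that don't follow the scheme (the two “CNN” files and 300dpiWin-GrayThenPDF.pdf) simply come back as None.

```python
import re

# Illustrative pattern for names such as "150dpiLinuxLZW-BW.pdf" or
# "300dpiWin-Gray-600dpiPDF.pdf". The field names are assumptions made
# for this sketch.
NAME_PATTERN = re.compile(
    r"^(?P<dpi>\d+)dpi"               # scan resolution
    r"(?P<os>Linux|Win)"              # operating system
    r"(?P<compression>DjVu|LZW|G4)?"  # compression, where listed
    r"-(?P<mode>BW|Gray)"             # black-and-white or grayscale
    r"(?:-(?P<pdf_dpi>\d+)dpiPDF)?"   # optional PDF-printer pass
    r"\.(?P<ext>\w+)$"                # file extension
)


def parse_scan_name(name):
    """Split a scan file name into its parts, or return None if it
    does not follow the naming scheme."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["dpi"] = int(fields["dpi"])
    if fields["pdf_dpi"] is not None:
        fields["pdf_dpi"] = int(fields["pdf_dpi"])
    return fields


if __name__ == "__main__":
    for name in ("150dpiLinuxLZW-BW.pdf",
                 "300dpiWin-Gray-600dpiPDF.pdf",
                 "CNNLinux.pdf"):
        print(name, "->", parse_scan_name(name))
```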
For Windows, where a compression algorithm is not listed, I used the software included with my Canon LiDE 50 scanner, which saves directly to PDF. In Linux, I used the popular gscan2pdf GUI. Having OCR on or off did not seem to make much of a difference to file size. For gscan2pdf, the file was also processed with Unpaper, which should optimize the file further (it also creates blockiness in the document’s whitespace that is undesirable to me, but it’s fine for archiving documents).
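To put numbers on the gap, here is a quick Python sketch that divides each Linux PDF size by its Windows counterpart. The sizes are copied from the listings; the pairing is my own assumption about which files are comparable (gscan2pdf output against the Canon software's direct-to-PDF output at the same resolution and mode, plus the two cnn.com printouts).

```python
# File sizes in bytes, copied from the listings above.
SIZES = {
    "150dpiLinuxLZW-BW.pdf": 55298,
    "150dpiWin-BW.pdf": 50213,
    "150dpiLinuxLZW-Gray.pdf": 813876,
    "150dpiWin-Gray.pdf": 58947,
    "300dpiLinuxLZW-BW.pdf": 113726,
    "300dpiWin-BW.pdf": 81978,
    "300dpiLinuxLZW-Gray.pdf": 2670089,
    "300dpiWin-Gray.pdf": 114967,
    "CNNLinux.pdf": 759067,
    "CNNWin600dpi.pdf": 237794,
}

# Assumed pairings: (label, Linux file, Windows file).
PAIRS = [
    ("150 dpi black-and-white", "150dpiLinuxLZW-BW.pdf", "150dpiWin-BW.pdf"),
    ("150 dpi grayscale", "150dpiLinuxLZW-Gray.pdf", "150dpiWin-Gray.pdf"),
    ("300 dpi black-and-white", "300dpiLinuxLZW-BW.pdf", "300dpiWin-BW.pdf"),
    ("300 dpi grayscale", "300dpiLinuxLZW-Gray.pdf", "300dpiWin-Gray.pdf"),
    ("cnn.com printout", "CNNLinux.pdf", "CNNWin600dpi.pdf"),
]


def size_ratios(sizes, pairs):
    """Return (label, linux_bytes / windows_bytes) for each pair."""
    return [(label, sizes[linux] / sizes[win]) for label, linux, win in pairs]


if __name__ == "__main__":
    for label, ratio in size_ratios(SIZES, PAIRS):
        print(f"{label}: Linux PDF is {ratio:.1f}x the Windows PDF")
```

The black-and-white pairs come out fairly close (roughly 1.1x and 1.4x), while the grayscale pairs are where the gap opens up: roughly 14x at 150 dpi and 23x at 300 dpi.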
So there you go. The difference is significant. One would have to dig into the underpinnings of the software, I think, to expose the reason for this, but I’m definitely curious. Again, DjVu pulls close to and even surpasses PDF when it comes to black-and-white scanning, but even it falls short when using grayscale (which happens to be my method of choice). I’ll admit I don’t relish the idea of booting into Windows simply to archive documents.