On 07-Feb-02 MJ Ray wrote:
Ted Harding:
However, after 'ps2pdf' this PS file converts to a PDF file of size 40667 bytes -- actually smaller than the original! (of course, there is compression involved here; if I turn off compression I end up with 207810 bytes)
I notice that ps.gz is missing from your figures?
On purpose. The compression referred to is the 'Flate' compression (implemented as gzip when you use ps2pdf) internal to the PDF file. Only chunks of the PDF file are internally compressed in this way, not the whole file. Since Flate compression is the norm for PDF, that's why I made a point of mentioning the uncompressed outcome as well.
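To illustrate the "only chunks" point, here is a rough sketch in Python. It is not what ps2pdf actually does internally; the object number, the sample content and the helper name pdf_stream_object are all made up for the illustration. It only shows the general shape: the stream data is deflated with zlib (the same deflate algorithm gzip uses), while the object wrapper and dictionary around it stay as readable text.

```python
import zlib

# Hypothetical page content stream: repetitive text-drawing operators,
# which is the kind of data that Flate compresses well.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 50

# Flate is zlib/deflate; only the stream data goes through it.
compressed = zlib.compress(content)


def pdf_stream_object(obj_num, data, flate):
    """Wrap `data` as a PDF-style stream object (illustrative helper).

    The dictionary and the stream/endstream keywords are never compressed;
    only the bytes passed in as `data` may have been deflated beforehand.
    """
    filt = b" /Filter /FlateDecode" if flate else b""
    header = b"%d 0 obj\n<< /Length %d%s >>\nstream\n" % (obj_num, len(data), filt)
    return header + data + b"\nendstream\nendobj\n"


plain_obj = pdf_stream_object(4, content, flate=False)
flate_obj = pdf_stream_object(4, compressed, flate=True)

print(len(plain_obj), len(flate_obj))
```

The second figure printed is much the smaller, yet the first line of flate_obj is still plain text that you can read in an editor, which is why gzipping the whole file afterwards is a different thing from the compression inside it.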
Anyway, PDF is a good idea poorly executed, I think. By introducing multiple incompatible versions of the format and making implementers have to play "chase the document trail" if they want to do a full version, they've pretty much guaranteed it's always going to cause pain. (The document trail comment is based on comments from the CL-PDF library authors, not personal experience.)
Yes, I've also heard comment to this effect (and others). But in my experience such pain is only rarely encountered.
Oh, remember that xpdf (or even pdf2ps and ps2ascii in some cases) recovers you the original text for editing.
[I think you mean pdftotext here? Out of the same stable as xpdf of course].
You can grab small quantities of text from xpdf by cut&paste with the mouse, quite successfully. There seems to be no way (at least in the xpdf which I have) of saving to text a whole PDF file opened with xpdf.
While pdftotext will (usually) save the text content, often there is so much garbage as well (including masses of space characters) that cleaning this up prior to editing would be a horrible pain.
pdf2ps and ps2ascii can do a fair job when they work -- which, on the whole, is when the files are very simple. All sorts of things go wrong with more complicated layouts -- chunks missing, spurious "text", etc. I have some PS examples (mostly docs consisting mainly of tables) where, out of say 40K of text characters (i.e. characters that get printed and are meant to be read), maybe a few hundred are extracted by ps2ascii.
The point to remember about PS (and from this point of view it applies also to PDF) is that it is primarily designed to place marks at precise places on a page. These can be placed in any order; in an extreme example, a PS file could be constructed which rendered a page of print by taking the characters in random order, each with the coordinates of its position, and planting them as they come. The printed result would be the same; but ps2ascii would make nothing whatever sensible of it.
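The extreme example can be mimicked with a small Python sketch. This is a toy model, not how ps2ascii or pdftotext work: the (x, y, char) triples, the fixed 10-unit character advance and the 14-unit line spacing are all assumptions made for the illustration. Reading the glyphs in the order they occur in the file gives rubbish once they have been shuffled, although the page would print identically; sorting by position recovers the text, but only because this toy page is a perfectly regular grid.

```python
import random

lines = ["PS places marks", "in any order"]

# One (x, y, char) triple per printed character. As in PS, y decreases
# going down the page. The spacing values are arbitrary assumptions.
glyphs = [(72 + 10 * col, 720 - 14 * row, ch)
          for row, text in enumerate(lines)
          for col, ch in enumerate(text)]

# Same marks, same coordinates, random order in the file.
shuffled = glyphs[:]
random.Random(0).shuffle(shuffled)


def stream_order_text(gs):
    """Naive extraction: take the characters in the order they occur."""
    return "".join(ch for _, _, ch in gs)


def layout_order_text(gs):
    """Group glyphs by baseline y, then read each row left to right."""
    rows = {}
    for x, y, ch in gs:
        rows.setdefault(y, []).append((x, ch))
    return "\n".join("".join(ch for _, ch in sorted(rows[y]))
                     for y in sorted(rows, reverse=True))


print(stream_order_text(shuffled))   # same characters, scrambled
print(layout_order_text(shuffled))   # the two lines as printed
```

With real documents the positional approach needs heuristics for proportional fonts, kerning, columns and tables, and baselines that differ slightly within a line, which is where the missing chunks and spurious "text" tend to come from.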
In summary: if such tools work for a particular file, fine. But, in my direct experience, there are so many catastrophic exceptions to this that I don't consider them as serious options to be relied on.
In many cases, xhtml is a better format for document interchange.
Horses for courses, of course; and where xhtml may be useful, XML may be even better. If you're not fussed about how the person at the other end will format the layout, then it's probably OK (depending on what kind of document it is). If what you want them to see is precisely what you see, then it can get very hairy.
Cheers, Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 167 1972
Date: 07-Feb-02                                       Time: 17:22:02
------------------------------ XFMail ------------------------------