[publican-list] Possible alternative to FOP!

Peter Moulder peter.moulder at monash.edu
Sun Aug 28 04:19:02 UTC 2011


On Sun, Aug 28, 2011 at 10:56:43AM +1000, Ryan Lerch wrote:

(Hi Ryan, long time no see.)

> I will have to investigate playing with some of these styles. So far
> i have not played with the actual styles of the document. wkhtmltopdf
> lets you supply a footer.html file to provide custom footers (that
> is all i have been playing with so far.)

Can it be a header instead of a footer?  One of the areas where fop is better
than HTML-based renderers is in making a page fill up the available space
despite the need to honour widows & orphans.

One way of making the problem less noticeable would be to avoid using a footer,
so that one wouldn't be as conscious that the page were under-full.

(If you as a reader can see that the page is under-full, then it sometimes
makes you expect that this is the end of the section, causing a bit of
confusion and loss of concentration when you find that it isn't.)

Granted, it looks like wkhtmltopdf doesn't yet honour widows and orphans to
begin with, but a similar issue occurs with small div.note blocks and the like.
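
To try the header variant, here is a minimal sketch that only builds the
command line.  It assumes the installed wkhtmltopdf accepts --header-html in
the same way as --footer-html (worth checking against the version in use);
the function name and all file names are made up for illustration.

```python
def wkhtmltopdf_command(input_html, output_pdf, header_html=None, footer_html=None):
    """Build an argument list for wkhtmltopdf with an optional header/footer.

    Assumption: the local wkhtmltopdf build supports --header-html and
    --footer-html; nothing is executed here, the list is only returned.
    """
    cmd = ["wkhtmltopdf"]
    if header_html:
        cmd += ["--header-html", header_html]
    if footer_html:
        cmd += ["--footer-html", footer_html]
    # Input and output paths come last on the command line.
    cmd += [input_html, output_pdf]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) to compare
a header-only run with a footer-only run on the same document.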


Regarding the evaluation of wkhtmltopdf, can we try an example with a
multi-page table, and see whether it handles <thead> (repeating the table
header on each page) ?  I think it doesn't yet.
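
A test input for this could be generated rather than hunted for.  The
following is a rough sketch, not taken from any Publican output: the function
name, the row count and the cell text are all arbitrary choices, and the table
markup is deliberately plain.

```python
def build_long_table_html(rows=200, cols=3):
    """Return an HTML page holding one table long enough to span several pages.

    The table has a <thead>, so a renderer that repeats table headers
    should show the "Column N" row at the top of every page.
    """
    header_cells = "".join(f"<th>Column {c + 1}</th>" for c in range(cols))
    body_rows = []
    for r in range(rows):
        cells = "".join(f"<td>row {r + 1}, cell {c + 1}</td>" for c in range(cols))
        body_rows.append(f"<tr>{cells}</tr>")
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Long table test</title></head><body>\n"
        '<table border="1">\n'
        f"<thead><tr>{header_cells}</tr></thead>\n"
        "<tbody>\n" + "\n".join(body_rows) + "\n</tbody>\n"
        "</table>\n</body></html>\n"
    )
```

Writing the returned string to a file and running it through both
wkhtmltopdf and the FOP tool chain's nearest equivalent should make it
obvious whether the header row is repeated after each page break.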

Of course many documents don't have long tables, but it would be good to
make a list of the differences we're aware of, so that deployers can make
an informed decision, perhaps deciding based on characteristics of the
documents they're interested in.  Having the list might also help knowing
how long to keep the FOP option around, as WebKit slowly evolves to
whittle down the list.

Here's an initial list, though note that I don't have the right version
of WebKit / wkhtmltopdf, so there's quite likely to be some factual
errors in this list.

  Doesn't honour 'widows' / 'orphans'
  Doesn't always honour 'page-break-before' / 'page-break-after'.
  Tables of contents have no page numbers
  Footnotes become endnotes
  Limitations in page headers / footers:
    no header/footer suppression on first page
    no roman-numeral stuff up to end of preface
    can't use name of section? (this is just a guess: section names are a
      bit harder than name of chapter, because there can be multiple
      sections in a page but only one chapter on a page in Publican's
      styling)
    only footers, not headers?  [I'd guess headers work fine.]
  Use of bitmap images in some places where fop doesn't?
    (The important/note/warning icons are the most obvious cases,
    which the stylesheet I gave should address.  There are a few
    other cases where bitmap images are used in the html version
    where I don't know whether fop uses bitmaps or not.  Examples
    are some front page stuff, and list item bullets.)
  Chinese html-single output has zero-width-space characters everywhere,
    leading to bad line breaks.  (This is true of the two Chinese
    outputs I've checked, but not true of the one Japanese output I
    checked.  I don't know whether this is something Publican's doing
    or it's just the text that the translator supplied for those
    two documents.  Mechanically removing all zwsp characters from the
    html-single file used as input to wkhtmltopdf is a reasonable
    work-around, though risks removing zwsp characters that should be
    there.)
  Table headers not repeated on each page when a table is split over pages
    [This item would be much higher up the list for documents
     with long tables, but most documents don't have long tables.]
  There may be issues if breaking in the middle of a table, especially if
    using row-spanning cells or long cells with uneven line-heights.
    (Presumably we can mostly avoid this by using
    tr,td,th { page-break-inside: avoid; }.)
  Can't do multiple columns per page?  [At least, this was a concern
    someone raised; has anyone tested, e.g. html { column-count:2;
    -webkit-column-count:2 }, or on body instead of html?]
  Page breaking said not to work well with floats.  (But does Publican
    ever use floats?)
  Margins are never stretchable, so under-full pages are more frequent.
    (Though this isn't so much of an issue for documents that are full of
    large unsplittable objects like screenshots, div.note's, <pre>
    blocks, tables etc., because under-full pages will be very common in
    these documents, so it's less likely for the reader to read anything
    into gaps at the bottoms of pages if there's a gap at the end of
    almost every page.)
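
The zero-width-space work-around mentioned in the list could be sketched as
below.  This is only an illustration: the function name is made up, and it
assumes the character may appear either literally (U+200B) or as a numeric
character reference, which I haven't verified against the actual Publican
output.  As noted, it removes every zwsp, including any that should be there.

```python
import re

ZWSP = "\u200b"
# Assumption: the file may also spell the character as a numeric
# character reference, e.g. &#8203; or &#x200B;.
ZWSP_ENTITY = re.compile(r"&#(?:0*8203|[xX]0*200[bB]);")


def strip_zwsp(html_text):
    """Remove all zero-width-space characters from a string of HTML."""
    without_entities = ZWSP_ENTITY.sub("", html_text)
    return without_entities.replace(ZWSP, "")
```

One would read the html-single file as UTF-8, pass its contents through
strip_zwsp, and write the result to a separate file for wkhtmltopdf, leaving
the original untouched.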

Some of the page-breaking issues in the above list are just based on
things I've read elsewhere, but I don't know how many of them are still
present in the current software.

If the above looks like a daunting list of disadvantages, then remember
that most of them are just due to current limitations in available
html-to-pdf renderers.  If WebKit has recently added support for
page-break-inside and simple footers, then maybe there'll also be
improvements addressing some of the remaining issues (and for that matter
some of the issues in the above list might already be out of date or
otherwise false).

Similarly, there may be other HTML renderers that become available,
offering an alternative to wkhtmltopdf (no doubt with their own
advantages and disadvantages).  I'm working on an HTML renderer myself
[hence my interest in exploring an alternative to FOP],
so we'll see how that goes.  It'll have aesthetic advantages in some
areas, and is focused on print output, though no doubt will have more
bugs and limitations in general layout than WebKit.  Currently there's
not much to recommend it: it does widows and orphans better, and can do
multi-column, but most of the other issues are present, and it does some
things worse.  We'll see how things go in a few months.

> Also, i need to add that this approach will help out a lot with the
> current limitations we hit in FOP wrt images. (we currently limit
> images to 444px for this reason)

How does wkhtmltopdf differ here?  I tentatively added

  img { max-width:444px }

in the stylesheet that I posted, though I haven't checked whether it does
the right thing.  max-width:100% would be another alternative, though I
wasn't sure whether that would get something wrong for a non-full-screen
image; though maybe the selector could be made more specific to allay that
concern.

As the Publican documentation notes, even if the image doesn't get
truncated, it'll still have the problem of being shrunk down to what
might be too small to read comfortably.

For documentation that uses lots of big screenshots, one might consider
formatting the document in landscape format, perhaps in two-column
format, and using page-wide screenshots (e.g. with column-span:all, or as
a page float).  Would that ever be reasonable?  I'm usually not so fond
of wide pages, but it's just an idea.

pjrm.



