
Firefox Reader View heuristics

by videoinu authors. Published under technical.

Reader view is Firefox's answer to bloated websites and to pages where the actual content is buried under non-readable clutter, but how does it work underneath?

The crux is a big bag of heuristics, combined with knowledge about how semantic HTML elements should be used and structured, that strips the page of everything offering no immediate relevance to the reader. All this while also making the font size bigger and more readable, and the page more distraction-free.

If you haven't tried it out (and are still sticking with Firefox), please do! If you have, then welcome to an episode of how does it work (with more code than usual).

The reader view code is available as a library on Mozilla's GitHub and is pure JavaScript. The code is split into two modules: one for checking the readability of a page (Readability-readerable.js) and one for the actual conversion of a document to reader view (Readability.js).

The conversion function takes in the document and returns the article title, an HTML string of the processed article content, the length of the article in characters, an article description (or an excerpt automatically extracted from the content), and author metadata (byline). The whole process is heuristic in approach, as can be seen in, for example, the hardcoded site names and attributes, the calculation of title similarity using the length difference, and the node removal thresholds. In a crude sense the codebase feels pragmatic. Almost industrial. Thankfully nobody has decided to call the if statements with arbitrary numeric thresholds machine learning yet.

Retrieving the title of the article is simple enough: just get the page title, or alternatively whatever metadata the page is trying to push using Dublin Core (dc:title), Twitter (twitter:title), Open Graph (og:title), or Weibo (weibo:title). What was ever wrong with just <title>? Anyway, as long as we get some title-like property we can move on. Next up, excerpt and author metadata.
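The title lookup can be sketched roughly as follows. This is a minimal illustration, not the library's code: `pickTitle`, the plain `meta` object, and the precedence order of the keys are all assumptions made for the example.

```javascript
// Hypothetical helper, not part of Readability.js.
// `meta` is a plain object of already-collected <meta> values,
// `docTitle` is the text of the <title> element.
// The precedence order of the keys is an assumption for illustration.
const TITLE_KEYS = ["dc:title", "og:title", "weibo:title", "twitter:title"];

function pickTitle(meta, docTitle) {
  for (const key of TITLE_KEYS) {
    const value = (meta[key] || "").trim();
    if (value) return value; // first non-empty metadata title wins
  }
  return (docTitle || "").trim(); // otherwise fall back to <title>
}

console.log(pickTitle({ "og:title": " Reader View heuristics " }, "blog"));
console.log(pickTitle({}, "Firefox Reader View heuristics"));
```

Keeping the metadata keys in one array makes the "try each source until one works" idea explicit.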

Getting author information, a.k.a. the byline, relies primarily on either the presence of one of two attribute=value pairs, rel=author or itemprop=author, or on a regexp soup that matches potential bylines. Even on this blog, where I decided to go fancy with a misused <address> element for the author information (or lack thereof), Readability finds the information thanks to the class name class="author".
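A simplified sketch of that byline check might look like this. The `node` argument is a hypothetical plain-object stand-in for a DOM element, and `BYLINE_HINT` is an illustrative pattern; the actual regexp in the library is assumed to be longer.

```javascript
// Illustrative pattern only; the real library's regexp may differ.
const BYLINE_HINT = /byline|author/i;

// `node` is a hypothetical plain object standing in for a DOM element:
// { rel, itemprop, className, id }
function looksLikeByline(node) {
  const rel = node.rel || "";
  const itemprop = node.itemprop || "";
  // Explicit semantic markup is the strongest signal.
  if (rel === "author" || itemprop.includes("author")) return true;
  // Otherwise fall back to pattern-matching the class name and id.
  const matchString = `${node.className || ""} ${node.id || ""}`;
  return BYLINE_HINT.test(matchString);
}

console.log(looksLikeByline({ rel: "author" }));
console.log(looksLikeByline({ className: "author" }));
console.log(looksLikeByline({ className: "content" }));
```

The second call mirrors the <address class="author"> case mentioned above: no semantic attribute, but the class name is enough.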

The excerpt isn't quite as fancy. Extracting it primarily relies on finding a meta tag for the description, similarly to how the title was retrieved (with all the accompanying social media site tags), but if all else fails, the code falls back to retrieving the first paragraph of the text content and trimming it.
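The same "metadata first, content as a fallback" pattern can be sketched for the excerpt. Again, `pickExcerpt`, the key names, and their order are assumptions for illustration rather than the library's exact behavior.

```javascript
// Hypothetical helper; the key list and its order are assumptions.
const EXCERPT_KEYS = [
  "dc:description",
  "og:description",
  "twitter:description",
  "description",
];

// `meta` is a plain object of <meta> values,
// `paragraphs` is an array of paragraph text strings in document order.
function pickExcerpt(meta, paragraphs) {
  for (const key of EXCERPT_KEYS) {
    const value = (meta[key] || "").trim();
    if (value) return value; // a description tag was provided
  }
  // Fallback: the first paragraph of the content, trimmed.
  return paragraphs.length > 0 ? paragraphs[0].trim() : "";
}

console.log(pickExcerpt({ description: "How reader view works" }, []));
console.log(pickExcerpt({}, ["  First paragraph.  ", "Second paragraph."]));
```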

Reader view is an example of very pragmatic solve-a-real-problem coding, hopefully resulting in a zen-like reading experience for the reader. Even with its beautiful goal, we can only hope that as a tool reader view will become less useful as sites start adopting more reader-friendly approaches.

If you're a website owner and concerned about reader view (or the lack of it), here are a few tools to hopefully help with that.

Readability score checker

This checker allows you to check the score your page would get from Mozilla's reader view readability algorithm. The default contents are this page's source.

Paste your page's HTML source (right click -> View page source) in the following text area. If your page renders its HTML with JavaScript (using e.g. React or fetching data from the server on load), you might have to use DevTools to get the rendered HTML (document.documentElement.outerHTML is a quick way to get essentially the whole page's HTML).

I want reader view, and don't have it :(

The reader view algorithm has a few pitfalls your site might fall into. We'll go from the most to the least likely:

  1. Are you using <p>, <pre>, or <div> with immediate <br> children for your text content?
    • These are the only elements the readability algorithm uses to determine the readability of your page
  2. Is the content inside those nodes long enough?
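    • The readability algorithm only starts incrementing your readability score once the node contents reach 140 characters.
    • After 140 characters, the score is increased by sqrt(textContentLength - 140). The square root flattens out as text gets longer, so two paragraphs of ~300 characters (about 25.3 in total) score better than one paragraph of ~600 characters (about 21.4). Don't split too eagerly, though: because of the threshold, two paragraphs of ~150 characters only score about 6.3, less than one paragraph of ~300 characters at about 12.6.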
  3. Are the text paragraphs hidden or considered unlikely text candidates by the algorithm?
    • Make sure the paragraph is hidden neither by display: none nor by aria-hidden
    • Avoid the class names listed here on your text paragraphs
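To get a feel for the numbers, the scoring rule from point 2 can be sketched as a small function. This only models the length threshold and the square-root formula described above; `readabilityScore` is a hypothetical name, and the `MIN_SCORE` cut-off of 20 is an assumption about the default rather than something stated in this post.

```javascript
const MIN_CONTENT_LENGTH = 140; // threshold described in the list above
const MIN_SCORE = 20; // assumption: score at which a page counts as readerable

// `lengths` holds the text length of each candidate node (<p>, <pre>, ...).
function readabilityScore(lengths) {
  let score = 0;
  for (const len of lengths) {
    if (len < MIN_CONTENT_LENGTH) continue; // too short to count at all
    score += Math.sqrt(len - MIN_CONTENT_LENGTH);
  }
  return score;
}

console.log(readabilityScore([150, 150]).toFixed(2)); // 6.32
console.log(readabilityScore([300]).toFixed(2)); // 12.65
console.log(readabilityScore([300, 300]).toFixed(2)); // 25.30
console.log(readabilityScore([600]).toFixed(2)); // 21.45
console.log(readabilityScore([300, 300]) >= MIN_SCORE); // true
```

Paragraphs barely over the threshold contribute very little, while several medium-sized paragraphs add up faster than one long block of the same total length.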

I have reader view, but don't want it!

Let's play devil's advocate for a while. On many websites reader view allows you to essentially skip the paywalls and advertisements sprinkled alongside the article. Therefore, for a publisher it might make sense to actively try to get rid of the reader view.

The most obvious method would be to just get rid of all p, pre, and div with br tags, but this might take a lot of effort in restyling everything. A much easier way is to use one of the hardcoded exceptions that Mozilla has added for non-viable readability elements; for instance, any node with the class skyscraper is treated as an unlikely text candidate and ignored (maybe the logic is that those elements are shrouded by clouds).

We can achieve this on any page with the following snippet.

    // Reader view disabler 9000
    [
      ...document.querySelectorAll("p,pre"),
      ...[...document.querySelectorAll("div > br")].map(e => e.parentNode),
    ].forEach(e => e.classList.add("skyscraper"));
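Why this works can be illustrated with a simplified sketch of a class-name filter. The pattern below is abbreviated and illustrative: only skyscraper comes from the text above, the other entries are assumptions, and the real filter may have exceptions this sketch ignores. Nodes are modeled as hypothetical plain objects so the example runs outside a browser.

```javascript
// Abbreviated, illustrative pattern. Only "skyscraper" is taken from
// the text above; the other entries are assumptions.
const UNLIKELY = /banner|sidebar|skyscraper|sponsor/i;

// `node` is a hypothetical plain object: { className, id, textLength }
function isCandidate(node) {
  const matchString = `${node.className || ""} ${node.id || ""}`;
  return !UNLIKELY.test(matchString);
}

// Plain-object equivalent of classList.add("skyscraper").
function tagAsSkyscraper(node) {
  const className = `${node.className || ""} skyscraper`.trim();
  return { ...node, className };
}

const paragraphs = [
  { className: "intro", textLength: 400 },
  { className: "", textLength: 650 },
];

console.log(paragraphs.filter(isCandidate).length); // 2
console.log(paragraphs.map(tagAsSkyscraper).filter(isCandidate).length); // 0
```

Once every paragraph is filtered out, nothing is left to contribute to the score, so the page never reaches the point where reader view is offered.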