Firefox Reader View heuristics
by videoinu authors. Published under technical.
Reader view is Firefox's answer to bloated websites and to sites that bury their content under non-readable clutter, but how does it function underneath?
The crux is a big bag of heuristics, plus knowledge of how semantic HTML elements should be used and structured, applied to strip the page of content that offers no immediate relevance to the reader. All this while also making the font size bigger and more readable, and the page more distraction-free.
If you haven't tried it out (and are still sticking with Firefox), please do! If you have, then welcome to an episode of how does it work (with more code than usual).
The conversion function takes in the document and pushes out the article title, an HTML string of the processed article content, the length of the article in characters, an article description (or an excerpt automatically extracted from the content), and author metadata (byline). The whole process is heuristic in approach, as can be seen in, for example, the hardcoding of site names and attributes, the calculation of title similarity using the length difference, and the node removal thresholds, but in a crude sense the codebase feels pragmatic. Almost industrial. Thankfully nobody has decided to call the if statements with arbitrary numeric thresholds machine learning yet.
Retrieving the title of the article is simple enough; just get the page title or alternatively whatever metadata we're trying to push using Dublin Core (dc:title), Twitter (twitter:title), Open Graph (og:title), or Weibo (weibo:title). What was ever wrong with just <title>? Anyway, as long as we've got some title-like property we can move on. Next up, excerpt and author metadata.
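As a rough illustration of the title lookup described above, here is a minimal sketch. It works on a plain object of already-collected meta values instead of a live DOM, and the helper name pickTitle, the shape of the meta object, and the priority order of the keys are all assumptions made for this example, not Readability's actual code.

```javascript
// Hypothetical helper: pick a title from already-collected metadata.
// `meta` maps meta-tag names/properties to their content strings;
// `pageTitle` is the text of the <title> element.
// The priority order of the keys is an assumption for illustration.
const TITLE_KEYS = ["dc:title", "og:title", "twitter:title", "weibo:title"];

function pickTitle(meta, pageTitle) {
  for (const key of TITLE_KEYS) {
    const value = (meta[key] || "").trim();
    if (value) return value; // first non-empty metadata title wins
  }
  // Nothing title-like in the metadata: fall back to <title>.
  return (pageTitle || "").trim();
}

const meta = { "og:title": "  Firefox Reader View heuristics " };
console.log(pickTitle(meta, "Some page | Some site"));
// prints "Firefox Reader View heuristics"
```

In a browser the meta object could be filled by looping over document.querySelectorAll("meta") and reading each tag's name or property attribute together with its content attribute.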
Getting the author information, a.k.a. the byline, relies primarily either on the presence of one of two attribute=value pairs (such as itemprop=author) or on a regexp soup matching potential bylines. Even on this blog, where I decided to go fancy with a misused <address> element for the author information (or lack thereof), Readability finds the information due to the classname
The excerpt isn't quite as fancy. Extracting it relies primarily on finding a meta tag for the description, similarly to how the title was retrieved (with all the accompanying social media site tags), but if all else fails, the code falls back to taking the first paragraph of the text content and trimming it.
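The byline and excerpt lookups could be sketched roughly like this. Both functions take plain objects standing in for DOM nodes and meta tags. The names looksLikeByline and pickExcerpt, the short BYLINE_HINT pattern, and the list of description keys are assumptions for illustration; the real regexp soup is longer and may differ.

```javascript
// Assumed, simplified pattern; the real byline regexp may differ.
const BYLINE_HINT = /byline|author/i;

// `node` stands in for a DOM element: { itemprop, className, id }.
function looksLikeByline(node) {
  // Attribute check: itemprop may hold several space-separated values.
  const itemprop = node.itemprop || "";
  if (itemprop.split(/\s+/).includes("author")) return true;
  // Fallback: match the pattern against class name and id.
  const haystack = `${node.className || ""} ${node.id || ""}`;
  return BYLINE_HINT.test(haystack);
}

// Assumed key names, mirroring the title lookup.
const DESCRIPTION_KEYS = [
  "description",
  "dc:description",
  "og:description",
  "twitter:description",
  "weibo:description",
];

// `meta` maps meta names to content; `paragraphs` is an array of
// paragraph text strings in document order.
function pickExcerpt(meta, paragraphs) {
  for (const key of DESCRIPTION_KEYS) {
    const value = (meta[key] || "").trim();
    if (value) return value;
  }
  // No description metadata: use the first non-empty paragraph, trimmed.
  const first = paragraphs.find((text) => text.trim().length > 0);
  return first ? first.trim() : "";
}

console.log(looksLikeByline({ className: "post-byline" })); // true
console.log(pickExcerpt({}, ["", "  First paragraph.  "])); // prints "First paragraph."
```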
Reader view is an example of very pragmatic solve-a-real-problem coding, hopefully resulting in a zen-like reading experience for the reader. Even with its beautiful goal, we can only hope that as a tool Reader view will become less useful as sites start adopting more reader-friendly approaches.
If you're a website owner and concerned about reader view (or the lack of it), here are a few tools to hopefully help with that.
Readability score checker
This checker lets you check the score your page would get from Mozilla's reader view readability algorithm. The default contents are this page's source (document.documentElement.outerHTML is a quick way to get essentially the whole page's HTML).
I want reader view, and don't have it :(
The reader view algorithm has a few pitfalls your site might fall into. We'll go from the most to the least likely:
- Are you using <p> or <pre> elements, or <div> elements with <br> children, for your text content?
  - These are the only elements the readability algorithm uses to determine the readability of your page
- Is the content inside those nodes long enough?
  - The readability algorithm only starts incrementing your readability score after the node contents hit 140 characters.
  - After 140 characters, the score is increased with the square-root formula sqrt(textContentLength - 140), which has diminishing returns: two paragraphs of ~500 characters (about 38 points) score better than one paragraph of ~1000 characters (about 29 points). Very short paragraphs lose out to the 140-character offset, though.
- Are the text paragraphs hidden or considered unlikely text candidates by the algorithm?
  - Make sure paragraph is hidden neither by
  - Avoid the classnames listed here in your text paragraphs
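The scoring rule from the checklist above can be sketched in a few lines. The input is simply a list of text lengths, one per candidate node. The 140-character threshold and the square-root formula come from the text; the function names and the MIN_SCORE cut-off of 20 are assumptions for this example.

```javascript
const MIN_CONTENT_LENGTH = 140; // threshold mentioned in the checklist
const MIN_SCORE = 20; // assumed cut-off, for illustration only

// `textLengths` holds the text length of each candidate node.
function readabilityScore(textLengths) {
  let score = 0;
  for (const length of textLengths) {
    if (length < MIN_CONTENT_LENGTH) continue; // too short to count
    score += Math.sqrt(length - MIN_CONTENT_LENGTH);
  }
  return score;
}

// Hypothetical wrapper: does the page clear the assumed cut-off?
function looksReaderable(textLengths) {
  return readabilityScore(textLengths) > MIN_SCORE;
}

// Nodes under the threshold contribute nothing.
console.log(readabilityScore([120, 130])); // 0

// Diminishing returns: splitting a long text into two nodes scores higher.
console.log(readabilityScore([500, 500]) > readabilityScore([1000])); // true
```

Because the square root flattens out, a page gains more from several medium-length paragraphs than from one enormous block, as long as each paragraph clears the threshold comfortably.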
I have reader view, but don't want it!
Let's play devil's advocate for a while. On many websites reader view allows you to essentially skip the paywalls and advertisements sprinkled alongside the article. Therefore, for a publisher it might make sense to actively try to get rid of the reader view.
The most obvious method would be to just get rid of all <br> tags, but restyling everything afterwards might take a lot of effort. A far easier way is to use one of the hardcoded exceptions that Mozilla has added for non-viable readability elements; for instance, any node with the class skyscraper is considered hidden (maybe the logic is that those elements are shrouded by clouds).
We can achieve this on any page with the following snippet.
// Reader view disabler 9000
[
  ...document.querySelectorAll("p,pre"),
  ...[...document.querySelectorAll("div > br")].map(e => e.parentNode),
].forEach(e => e.classList.add("skyscraper"));