Instant hypertext

Kragen Javier Sitaker, 2013-05-17 (updated 2013-05-20) (14 minutes)

(I think I published this on kragen-tol at some point.)

I'm frustrated with the slowness of web browsers at navigating Wikipedia. Navigating Wikipedia is one of the most important tasks you can do on a computer; it provides access to a reasonable summary of all human knowledge.

Ideally, there should be no perceptible delay between selecting a link and seeing the resulting page. The links browser can more or less realize this for pages it has in its cache, although only up to a certain size and without displaying images. More full-featured browsers, unfortunately, are not capable of such a feat. To some extent this is a consequence of the very powerful features of CSS and JavaScript which do not admit general efficient implementations, but in the case of Wikipedia, the use of CSS and JS is quite stereotyped and not at all essential to the main content. It's entirely feasible to provide no perceptible delay for Wikipedia pages.

My current setup

I'm running kiwix-serve on an English Wikipedia snapshot and displaying it using links:

./kiwix/bin/kiwix-serve --port=8000 ~/Downloads/wikipedia_en_all_nopic_01_2012.zim &
links http://localhost:8000/wikipedia_en_all_nopic_01_2012/

This snapshot, which can be torrented, occupies 9.7 gibibytes of disk space and includes all the text and equations but none of the pictures from the English Wikipedia, not even the most popular pictures free of copyright encumbrance. (I think that an additional gibibyte or two of images could be extremely valuable.) It's deeply unfortunate that the version of kiwix-serve I'm using is not fully functional without JS in the browser, since the main reason for using kiwix-serve rather than kiwix is to be able to browse using an unusual browser.

Back to the ideal: "no perceptible delay" could be operationalized as 200 milliseconds (a traditional number from my memory of HCI research) or 70 milliseconds (the length of a typical keypress: you want to display the result page before the user finishes pressing the key) or 17 milliseconds (the screen refresh time of nearly all current computer monitors).

70 milliseconds should be achievable

Meeting any of these goals (17ms, 70ms, or 200ms) would necessarily require a local replica, though not necessarily a replica on my own machine; ping en.wikipedia.org from my location in Argentina reports 300ms average latency, so even a single round trip would blow the budget. In fact, I haven't found anyplace in Argentina with less than 250ms latency to anyplace in the US.

I argue that a 70-millisecond response time is achievable, with no network server, using a reasonably powerful machine (a netbook from a couple of years back, or an Android phone with a large SD card). The minimal amount of work involved is displaying a screenful of data (around 0.6 megapixels, or 1.8 megs uncompressed) following a single random hard disk seek (a 10ms cost which goes away if you're using an SD card) after determining which of perhaps a few dozen links on the screen the user clicked on. That is, transferring a couple of megabytes of data from a random location on the disk to the framebuffer. This is almost the same task that any movie player performs in 17 milliseconds, except that the movie player decompresses the frame from data in memory instead of loading it from a random location on the disk.

A typical modern laptop hard disk transfers some 20 megabytes per second, or 20 kilobytes per millisecond. That means that in the 70 - 10 - 17 = 43 remaining milliseconds, it can transfer 860 kilobytes. So we could meet our 70-millisecond performance goal if we could compress the screen image to display by only a factor of two. However, we wouldn't be able to fit much of Wikipedia onto a typical netbook hard disk in that representation; so compression is necessary for reasonable coverage, as well as saving us much of those milliseconds to use for other things!
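Spelled out, the budget arithmetic from the last two paragraphs looks like this; the seek, refresh, and transfer-rate numbers are the assumptions stated above, not measurements of any particular machine:

    # Back-of-the-envelope latency budget for the 70 ms goal: one random
    # hard-disk seek, one screen refresh, and the rest spent streaming
    # data off a laptop hard disk.
    budget_ms      = 70
    seek_ms        = 10     # one random hard-disk seek (zero on an SD card)
    refresh_ms     = 17     # one 60 Hz screen refresh
    disk_kb_per_ms = 20     # ~20 MB/s sequential transfer

    transfer_ms  = budget_ms - seek_ms - refresh_ms       # 43 ms left for I/O
    max_bytes    = transfer_ms * disk_kb_per_ms * 1024    # ~860 KB readable
    screen_bytes = 1024 * 600 * 3                         # ~1.8 MB raw screenful

    print(f"can read {max_bytes / 1024:.0f} KB in {transfer_ms} ms")
    print(f"needed compression factor: {screen_bytes / max_bytes:.1f}x")

Running this prints a required compression factor of about 2.1, which is where the "factor of two" above comes from.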

A lot of work is needed

According to Chromium, kiwix-serve can serve up a typical page in under 400ms on my netbook, although it occasionally takes almost 500ms. I don't know how much of that is due to the LZMA2 compression used by the .zim file format, and how much is due to kiwix being suboptimal.
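Chromium's numbers can be cross-checked without a browser at all. Here is a small Python sketch, assuming the kiwix-serve instance from the setup above is still listening on port 8000; it measures serving plus transfer time, not rendering:

    # Fetch the example page a few times from the local kiwix-serve and print
    # wall-clock times.  The URL is the one shown in the screenful above.
    import time, urllib.parse, urllib.request

    base = "http://localhost:8000/wikipedia_en_all_nopic_01_2012/A/"
    url = base + urllib.parse.quote("Banking in Switzerland.html")

    for _ in range(5):
        t0 = time.perf_counter()
        body = urllib.request.urlopen(url).read()
        print(f"{(time.perf_counter() - t0) * 1000:.0f} ms, {len(body)} bytes")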

Lossless image compression

I made a 1024×600 screenshot of links displaying a screenful of a Wikipedia page at 113×33 characters (3729 characters). This screenshot compressed as PNG takes 80 kilobytes, or 72 kilobytes after optipng; pngtopnm decompressing it to raw data takes 72 user milliseconds, although given that the clock time is around 250ms, I don't really trust that. (The gzipped PNM decompresses in 20ms.) A GIF version produced by ImageMagick is 63 kilobytes. That is to say, PNG or GIF compression reduces it to such a size that it needs about 3ms of disk transfer time to load.
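These measurements are easy to reproduce approximately. The following sketch (using Pillow, with a hypothetical filename standing in for the screenshot) compares the PNG file size with the raw pixel size and a zlib-compressed copy of the raw pixels, and times the PNG decode, which is roughly what pngtopnm was doing above:

    # Compare PNG size, raw pixel size, and zlib-compressed raw pixels, and
    # time the decode.  "screenshot.png" is a stand-in filename.
    import time, zlib
    from PIL import Image

    png_path = "screenshot.png"
    png_bytes = open(png_path, "rb").read()

    t0 = time.perf_counter()
    raw = Image.open(png_path).convert("RGB").tobytes()   # decode to raw pixels
    decode_ms = (time.perf_counter() - t0) * 1000

    print(f"PNG size:       {len(png_bytes) / 1024:.0f} KiB")
    print(f"raw pixels:     {len(raw) / 1024:.0f} KiB")
    print(f"zlib -9 on raw: {len(zlib.compress(raw, 9)) / 1024:.0f} KiB")
    print(f"decode time:    {decode_ms:.0f} ms")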

Here is the screenful, stripped of its highlighting and hypertext links, which should not comprise a significant fraction of its size in bytes:

                                                                               Banking in Switzerland (p2 of 17) 
     * 8 International competition                                                                               
     * 9 See also                                                                                                
     * 10 References                                                                                             
     * 11 External links

Overview

   Switzerland is a prosperous nation with a per capita gross domestic product higher than that of most          
   western European nations. In addition, the value of the Swiss franc (CHF) has been relatively stable          
   compared with that of other currencies.[3] In 2009, the financial sector comprised 11.6% of Switzerland's     
   GDP and employed approximately 195,000 people (136,000 of whom work in the banking sector); this represents   
   about 5.6% of the total Swiss workforce. Furthermore, Swiss banks employ an estimated 103,000 people          
   abroad.[4]

   Swiss neutrality and national sovereignty, long recognized by foreign nations, have fostered a stable         
   environment in which the banking sector was able to develop and thrive. Switzerland has maintained            
   neutrality through both World Wars, is not a member of the European Union, and was not even a member of the   
   United Nations until 2002.[5][6]

   Currently an estimated one-third of all funds held outside the country of origin (sometimes called            
   "offshore" funds) are kept in Switzerland. In 2001 Swiss banks managed US$ 2.6 trillion. The following year   
   they handled $400 billion USD less which has been attributed to both a bear market and stricter regulations   
   on Swiss banking.[7] By 2007 this figure has risen to roughly 6.7 trillion Swiss francs (US$6.4 trillion).

   The Bank of International Settlements, an organization that facilitates cooperation among the world's         
   central banks, is headquartered in the city of Basel. Founded in 1930, the BIS chose to locate in             
   Switzerland because of the country's neutrality, which was important to an organization founded by            
   countries that had been on both sides of World War I.[8]

   Foreign banks operating in Switzerland manage 870 billion Swiss francs worth of assets (as of May 2006).[9]

http://localhost:8000/wikipedia_en_all_nopic_01_2012/A/Banking in Switzerland.html#International_competition

This character data (3729 bytes) compresses to 1158 bytes with gzip -9 (a factor of 3.22), 1202 bytes with bzip2 -9, and 1446 bytes with compress.

None of these programs take a significant amount of time to decompress the data, but my measurement noise is around 4ms, so they might be taking as much as 4ms.
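For what it's worth, the same comparison can be made with Python's standard-library codecs, which also give decompression timings below the shell's measurement noise; zlib and bz2 correspond to gzip -9 and bzip2 -9 above, and I've added LZMA since that's what .zim files use. The filename is a stand-in for the screenful of text shown earlier.

    import bz2, lzma, time, zlib

    data = open("screenful.txt", "rb").read()   # the ~3729 characters above

    codecs = [
        ("gzip -9 (zlib)", lambda d: zlib.compress(d, 9), zlib.decompress),
        ("bzip2 -9 (bz2)", lambda d: bz2.compress(d, 9),  bz2.decompress),
        ("xz (lzma)",      lzma.compress,                 lzma.decompress),
    ]
    for name, compress, decompress in codecs:
        c = compress(data)
        t0 = time.perf_counter()
        decompress(c)
        us = (time.perf_counter() - t0) * 1e6
        print(f"{name:15s} {len(c):5d} bytes, ratio {len(data) / len(c):.2f}, "
              f"decompress {us:.0f} us")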

Conclusion: compressed images are very heavy, and to be avoided if possible.

Wikipedia size

As I mentioned, the English Wikipedia snapshot I'm using is 9.7 gibibytes. kiwix-serve reports that it contains "1,766,695MB (3,874,564 articles, 262,623 medias)", which is to say, 1.8 terabytes in 3.9 million articles, or 460 kilobytes per article. I seem to recall that the text of a typical article is some tens of kilobytes, and almost none are over 100 kilobytes, so I assume that the bulk of that is the "medias".

Five random Wikipedia pages (not in the snapshot) are 4, 6, 7, 7, and 10 such screenfuls in links, so I'm going to assume that 7 is a reasonable median and 20 is a reasonable guess at the mean number of screenfuls (article lengths have a long tail, so the mean should be well above the median). That means we have about 80 million screenfuls of text, or 300 gigabytes of text, uncompressed; perhaps it would be 93 gigabytes compressed. This seems surprisingly large, since the original Wikipedia source data is only about 9 gigabytes in bz2-compressed XML. Nevertheless, it's small enough to be feasible to store textual snapshots of this entire Wikipedia on a normal netbook, and you could retrieve any one of them with a single seek if you identified them by number.

(If you printed out these 80 million 113×33 screenfuls in a square, they would be about 800 thousand characters wide and 400 thousand tall; in a normal ten-point font, that's roughly a mile on a side.)
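The arithmetic behind these two estimates, spelled out; the mean of 20 screenfuls per article and the 10-point font metrics (characters roughly 6 points wide, lines roughly 12 points tall) are assumptions, not measurements:

    import math

    articles         = 3_874_564
    mean_screenfuls  = 20            # guessed mean (median observed was ~7)
    chars_per_screen = 113 * 33      # 3729 characters

    screenfuls = articles * mean_screenfuls
    text_bytes = screenfuls * chars_per_screen
    print(f"{screenfuls / 1e6:.0f} M screenfuls, "
          f"{text_bytes / 1e9:.0f} GB raw, "
          f"{text_bytes / 3.22 / 1e9:.0f} GB at the gzip ratio measured earlier")

    # Arrange the screenfuls so the printout is physically square, assuming a
    # 10-point monospace font: characters ~6 pt wide, lines ~12 pt tall.
    cols = math.sqrt(screenfuls * (33 * 12) / (113 * 6))
    chars_wide = cols * 113
    feet_wide  = chars_wide * 6 / 72 / 12
    print(f"{chars_wide / 1e3:.0f} thousand characters wide, "
          f"{feet_wide:.0f} feet (about a mile) on a side")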

Since lossless image compression produced files that were some 50 or 60 times larger than the compressed text, we probably don't have room on the disk for pre-rendered images of the entire Wikipedia. However, it's probably perfectly adequate to lay out and render the text on the fly, within limits.

Image selection

That doesn't help with things that aren't text, and in particular, math articles suffer badly from lack of diagrams. http://en.wikipedia.org/wiki/Cantor_function is rather hard to understand without the graphs, the first of which is an 8-kilobyte SVG drawn with Adobe Illustrator which gzips down to 988 bytes; the third is a 9-kilobyte PNG. If you included a single gigabyte of images, you could include a million images of the complexity of this SVG; you could include one such image per article in four gigabytes.

Presumably you want to optimize some kind of cost-benefit ratio, perhaps using page view counts or lists of featured articles or vital articles to estimate the benefit, and the file size (perhaps adjusted for how well it compresses) to estimate the cost. For example, Georg Cantor is a "vital article" that is read some 700 times a day, and it uses a diagram of bijection which is 1.8 kilobytes gzipped and is also used on four other pages; the 27-kilobyte portrait of young Cantor, used on no other pages, perhaps does not contribute 15 times as much to your replica of Wikipedia. If a picture is one of the 3196 featured pictures, it's probably of higher quality. Possible sources of such rankings:

http://commons.wikimedia.org/wiki/Commons:Picture_of_the_Year
http://commons.wikimedia.org/wiki/Commons:Valued_images
http://commons.wikimedia.org/wiki/Commons:Quality_images
http://commons.wikimedia.org/wiki/Commons:Featured_pictures
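Here is a sketch of that kind of selection: score each candidate image by estimated benefit per compressed byte, then take images greedily until a space budget runs out. The data structure and the two example entries are invented for illustration; real benefit estimates would come from page view counts and the lists above.

    from typing import NamedTuple

    class Candidate(NamedTuple):
        name: str
        compressed_bytes: int   # size as it would be stored in the snapshot
        daily_views: int        # summed page views of the articles using it

    def select(candidates, budget_bytes):
        # Greedy benefit-per-byte selection; a crude stand-in for a real
        # cost-benefit optimization.
        ranked = sorted(candidates,
                        key=lambda c: c.daily_views / c.compressed_bytes,
                        reverse=True)
        chosen, used = [], 0
        for c in ranked:
            if used + c.compressed_bytes <= budget_bytes:
                chosen.append(c)
                used += c.compressed_bytes
        return chosen

    # Invented numbers, loosely following the Cantor example above.
    candidates = [
        Candidate("bijection diagram (Georg Cantor + 4 other pages)", 1_800, 700),
        Candidate("portrait of young Cantor (Georg Cantor only)",    27_000, 700),
    ]
    print(select(candidates, budget_bytes=2**30))   # a 1 GiB image budget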

Wikimedia Commons says it has 15 million files, which add up to 22.5 terabytes, roughly two thousand times larger than any of the textual Wikipedias. The great bulk of this, about 19.6 terabytes, is JPEGs, followed by 880 gigabytes of TIFFs, 600 gigabytes of PNGs, 520 gigabytes of Ogg video, 300 gigabytes of DjVu, and so on.

These files vary enormously in size; even aggregate statistics by MIME type show that the 2597 MIDI files average 5.2 kilobytes while WebM video files average 27 megabytes, roughly four orders of magnitude more. That is, all the MIDI files put together are smaller than a single average WebM video, and of course within each category there are further orders of magnitude of variation.

The Commons metadata is available for bulk download, currently totaling about 2.4 gigabytes, and could be used to measure the file size distribution more precisely.
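Short of downloading the dump, the distribution can also be sampled through the public MediaWiki API, which reports size and MIME type for files. A sketch, assuming the standard list=allimages module and fetching only a single batch (a real measurement would page through with the continuation parameters):

    import json, urllib.parse, urllib.request

    params = urllib.parse.urlencode({
        "action": "query", "format": "json",
        "list": "allimages", "aiprop": "size|mime", "ailimit": "500",
    })
    url = "https://commons.wikimedia.org/w/api.php?" + params
    data = json.load(urllib.request.urlopen(url))
    sizes = sorted(img["size"] for img in data["query"]["allimages"])
    print(len(sizes), "files; median", sizes[len(sizes) // 2],
          "bytes; largest", sizes[-1], "bytes")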

Most of the JPEGs are very large indeed, and a much smaller version would be more than adequate for initial page presentation; some random images I selected using http://commons.wikimedia.org/wiki/Special:Random/File were 0.3 megapixels, 0.3 megapixels, 3.1 megapixels, 5.9 megapixels, and a 22-kilobyte SVG. Most of the time, however, images are presented in the page at a much smaller size. For example, I clicked the "random" link above until I got a big image: http://commons.wikimedia.org/wiki/File:Ebensee1.jpg, which is 7.7 megapixels, but where it's used on http://de.wikipedia.org/wiki/Bezirk_Gmunden, it's scaled down to 150×99, or 0.015 megapixels, 500 times smaller, making it 7.5 kilobytes instead of 4.9 megabytes.
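The saving from pre-scaling is easy to estimate with Pillow; a sketch, using a locally downloaded copy of the example JPEG as a stand-in:

    # Scale a large JPEG down to roughly the width it is displayed at in the
    # article and compare file sizes.  "Ebensee1.jpg" is assumed to be a local
    # copy of the 7.7-megapixel original.
    import io
    from PIL import Image

    img = Image.open("Ebensee1.jpg")
    thumb = img.copy()
    thumb.thumbnail((150, 150))          # preserves aspect ratio, ~150x99

    buf = io.BytesIO()
    thumb.save(buf, format="JPEG", quality=75)
    print(f"original {img.size}, thumbnail {thumb.size}, "
          f"thumbnail is {len(buf.getvalue()) / 1024:.1f} KiB")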

(en.wikipedia.org has its own set of uploaded files, totaling 800 thousand, compared with Commons's 15 million. I'm not sure whether Commons images used on en are included in that total.)

A quick sample of five random images finds one used on the Hebrew Wikipedia and none used on the English Wikipedia. Four more finds one used on nl.wikipedia; eight more finally finds one image used on en.wikipedia (as well as elsewhere), one used on es.wikipedia, one used on de.wikipedia, and one more used on en.wikipedia. Naïvely, two out of 17 in the "sample" suggests that on the order of 12% of the files in Commons are used on en.wikipedia. Both of the en.wikipedia images are used at a reduced size in the articles, and both have an en.wikipedia File: namespace page.
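This sampling could be automated. Here is a sketch that asks Commons for random File: pages and checks, via the GlobalUsage module (which Commons appears to expose), whether each is used on en.wikipedia.org; the batch size and result field names are assumptions about the current API rather than anything verified here.

    import json, urllib.parse, urllib.request

    params = urllib.parse.urlencode({
        "action": "query", "format": "json",
        "generator": "random", "grnnamespace": "6", "grnlimit": "20",
        "prop": "globalusage", "gulimit": "500",
    })
    url = "https://commons.wikimedia.org/w/api.php?" + params
    pages = json.load(urllib.request.urlopen(url))["query"]["pages"].values()
    used_on_en = sum(any(u["wiki"] == "en.wikipedia.org"
                         for u in p.get("globalusage", []))
                     for p in pages)
    print(f"{used_on_en} of {len(pages)} random files are used on en.wikipedia.org")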

Topics