How to save web pages?
Design A
--------
Save the web page and all dependent files as one unit
Pros
- Easy to implement. Easy deletion.
- Good performance. (No extra lookup, all data in one place)
Cons
- Duplicate resources (.jpg, .gif, etc.) are saved for each web page.
Design B
--------
Build a database that maps URLs (or MD5 hashes) to resources.
Pros
- Web pages referencing the same resources have only one copy saved. Experiments show that, in compressed form, the original document takes only 1/5 to 1/10 of the storage. The rest is potentially duplicated. (Note that dependent resources like .jpg often do not compress well.)
Cons
- Deletion requires reference counting before shared resources can be removed.
- Opening a saved web page requires multiple lookups.
- Storage savings only accrue incrementally, i.e. the first web pages saved take the same amount of storage in either design. Savings only happen when adding subsequent web pages from the same site (referencing similar resources).
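A minimal in-memory sketch of Design B, to make the reference-counting cost concrete. The class name ResourceStore and its methods are hypothetical, not an existing implementation, and a real store would keep the blobs on disk rather than in dictionaries.

```python
import hashlib


class ResourceStore:
    """Content-addressed store: one copy per distinct resource body (sketch)."""

    def __init__(self):
        self.blobs = {}      # MD5 hex digest -> resource bytes
        self.refcount = {}   # MD5 hex digest -> number of references from saved pages
        self.pages = {}      # page URL -> {resource URL: MD5 hex digest}

    def save_page(self, page_url, resources):
        """Save a page given a dict of resource URL -> bytes."""
        if page_url in self.pages:
            self.delete_page(page_url)  # replace an earlier snapshot
        manifest = {}
        for url, body in resources.items():
            digest = hashlib.md5(body).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = body   # first copy: stored once
                self.refcount[digest] = 0
            self.refcount[digest] += 1
            manifest[url] = digest
        self.pages[page_url] = manifest

    def load_page(self, page_url):
        """Rebuild the page; this costs one extra lookup per resource."""
        return {url: self.blobs[d] for url, d in self.pages[page_url].items()}

    def delete_page(self, page_url):
        """Drop a page; remove a blob only when nothing references it."""
        for digest in self.pages.pop(page_url).values():
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.blobs[digest]
                del self.refcount[digest]
```

Two pages that share a logo end up with three blobs instead of four, and the logo is only removed once both pages are deleted.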
Design C
--------
Save resources from the same DNS domain in one place. Items under one domain are more manageable.
A compromise? Best of both worlds, or the complexity of both?
Caveat: one web page can have resources coming from several domains.
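The caveat can be illustrated with a small sketch that groups a page's resource URLs by host name. The function name group_by_domain is hypothetical, and the full host name stands in for "DNS domain" here; a real design would have to decide whether www.example.com and images.example.com belong together.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(resource_urls):
    """Map each host name to the resource URLs served from it."""
    groups = defaultdict(list)
    for url in resource_urls:
        # hostname is lowercased by urlsplit; it is None for URLs without a host
        host = urlsplit(url).hostname or ""
        groups[host].append(url)
    return dict(groups)
```

A page whose resources produce more than one key would be spread over several per-domain stores under Design C.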
3 Comments:
At 8:09 AM, Tung Wai Yip said…
2005-Apr-13 17:05
A good idea akin to Design A is to save all files in a single MIME message. Hopefully browsers have good support for this, as it is how HTML email is transmitted.
This idea comes from the book "Practical Internet Groupware" by Jon Udell, an incredibly insightful book I have just discovered.
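A rough sketch of the idea using Python's standard email package: the page and its images become parts of one multipart/related message, each labeled with a Content-Location header holding its original URL. The function name build_archive and the shape of its arguments are assumptions for illustration, and no claim is made that the output opens correctly in any particular browser.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_archive(page_url, html, images):
    """Pack an HTML page and its images into one multipart/related message.

    images maps image URL -> (bytes, subtype), e.g. (data, "gif").
    """
    msg = MIMEMultipart("related", type="text/html")
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url     # original URL of the page
    msg.attach(root)
    for url, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url      # original URL of the image
        msg.attach(part)
    return msg.as_bytes()
```

Everything lives in one file, so deleting a saved page is a single file removal, as in Design A.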
At 8:09 AM, Tung Wai Yip said…
2005-Sep-02 01:07
Heck, this file format is called a web archive file (.mht)! Both IE and Firefox support saving a web page as a single .mht file. Opera can only read it. I guess you get a narrower view if you stick with only one product.
Does it conform to RFC 2557?
At 8:10 AM, Tung Wai Yip said…
2005-Sep-10 17:21
Correction: Firefox cannot read or write .mht files. Opera can read them (mostly) but not write them. Only IE truly supports the format.
Also, a rundown on the 'save webpage with image' feature offered by various browsers. You would think that if a browser can display a web page, it can save it too! Surprisingly, none of them seems to save pages 100% correctly. Firefox is the weakest. Try http://creativecommons.org/. IE is better, and the .mht web archive feature is just sweet. However, it does not handle references from CSS correctly. Opera 8.0 is the best among them. Opera 7.5 had a similar issue to IE, but I'm glad that it is resolved now.
Perhaps MindRetrieve could turn out to be the best :-)