How to save a web page?
Design A
--------
Save the web page and all dependent files as one unit
Pros
- Easy to implement. Easy deletion.
- Good performance. (No extra lookup, all data in one place)
Cons
- Duplicate copies of shared resources (.jpg, .gif, etc.) are saved for each web page.
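
A minimal sketch of Design A, assuming the "one unit" is a ZIP archive held in memory. The archive format and the function names `save_page_as_unit` and `open_saved_page` are illustrative assumptions, not part of the original note.

```python
import io
import zipfile


def save_page_as_unit(resources):
    """Pack a page and its dependent files into a single ZIP archive.

    `resources` maps a file name (e.g. "index.html") to its bytes.
    Returns the archive as bytes, so it can be stored or deleted as one item.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in resources.items():
            archive.writestr(name, data)
    return buf.getvalue()


def open_saved_page(blob):
    """Read every file back out of one archive; no other lookup is needed."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

Two pages that share the same image each carry their own copy inside their archive, which is the duplication listed under Cons.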
Design B
--------
Build a database that maps each URL (or MD5 digest) to its resource.
Pros
- Web pages referencing the same resources have only one copy saved. Experiments show that, in compressed form, the original document takes only 1/5 to 1/10 of the storage. The rest is potentially duplicated. (Note that dependent resources such as .jpg files often do not compress well.)
Cons
- Deletion requires reference counting before removing shared resources.
- Opening a saved web page requires multiple lookups.
- Storage savings only accrue incrementally, i.e. the first web pages saved take the same amount of storage in either design. Savings appear only when adding subsequent web pages from the same site (referencing similar resources).
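
The trade-offs of Design B can be sketched as a small in-memory store keyed by MD5 digest, with a reference count per resource. The class name `ResourceStore`, its methods, and the dictionary-based layout are assumptions made for illustration; a real implementation would presumably sit on a persistent database.

```python
import hashlib


class ResourceStore:
    """In-memory sketch of Design B: shared resources with reference counts."""

    def __init__(self):
        self.blobs = {}     # MD5 digest -> resource bytes (one copy only)
        self.refcount = {}  # MD5 digest -> number of references from saved pages
        self.pages = {}     # page URL -> {resource URL: MD5 digest}

    def save_page(self, page_url, resources):
        """`resources` maps each URL (the page itself included) to its bytes."""
        if page_url in self.pages:
            self.delete_page(page_url)  # replace an earlier snapshot
        manifest = {}
        for url, data in resources.items():
            digest = hashlib.md5(data).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = data  # first copy of this content
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            manifest[url] = digest
        self.pages[page_url] = manifest

    def open_page(self, page_url):
        """Opening needs one lookup per resource, not a single read."""
        manifest = self.pages[page_url]
        return {url: self.blobs[digest] for url, digest in manifest.items()}

    def delete_page(self, page_url):
        """Drop a page; remove a resource only when nothing references it."""
        manifest = self.pages.pop(page_url)
        for digest in manifest.values():
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.blobs[digest]
```

Saving two pages that share a logo stores the logo once; deleting the first page leaves the logo in place because the second page still references it.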
Design C
--------
Save resources from the same DNS domain in one place. Items under one domain are more manageable.
Compromise? Best of both worlds, or the complexity of both?
Caveat: one web page can have resources coming from several domains.
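
A short sketch of the grouping step in Design C, which also shows the caveat in practice. The function name `group_by_domain` and the example URLs are hypothetical; the host name is taken from each URL with the standard `urllib.parse.urlsplit`.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(resource_urls):
    """Group resource URLs by host name, one bucket per DNS domain."""
    groups = defaultdict(list)
    for url in resource_urls:
        host = urlsplit(url).hostname or ""  # empty key for URLs without a host
        groups[host].append(url)
    return dict(groups)


# One page whose resources come from two different domains.
page_resources = [
    "http://www.example.com/index.html",
    "http://www.example.com/style.css",
    "http://images.example.net/logo.gif",
]
buckets = group_by_domain(page_resources)
```

Here `buckets` has two keys, so saving or deleting this single page touches two per-domain stores.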