MindRetrieve Blog

MindRetrieve - an open source desktop search tool for your personal web

Wednesday, February 23, 2005

How to save a web page?

Design A
Save the web page and all dependent files as one unit

- Easy to implement. Easy deletion.
- Good performance. (No extra lookup, all data in one place)

- Duplicated resources (.jpg, .gif, etc.) are saved again for each web page.

Design B
Build a database mapping URL (or MD5 digest) to resources.

- Web pages referencing the same resources have only one copy saved. Experiments show that in compressed form the original document takes only 1/5 to 1/10 of the storage; the rest goes to dependent resources that are potentially duplicated. (Note that dependent resources like .jpg are often already compressed and do not compress well further.)

- Deletion requires reference counting before removing shared resources.
- Opening a saved web page requires multiple lookups.
- Storage savings only accrue incrementally. I.e. the first web page saved takes the same amount of storage in either design; savings only appear when saving subsequent web pages from the same site (referencing the same resources).
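The Design B bookkeeping above can be sketched in a few lines. This is a minimal illustration, not MindRetrieve's actual code; the class and method names are made up for the example. Resources are keyed by MD5 digest so identical .jpg/.gif files are stored once, and deletion uses reference counting exactly as the list describes.

```python
import hashlib

class ResourceStore:
    """Sketch of Design B: shared resources keyed by content digest."""

    def __init__(self):
        self.blobs = {}      # digest -> resource bytes (one copy per unique resource)
        self.refcount = {}   # digest -> number of saved pages referencing it
        self.pages = {}      # page URL -> digests of its dependent resources

    def save_page(self, url, resources):
        digests = []
        for data in resources:
            digest = hashlib.md5(data).hexdigest()
            if digest not in self.blobs:          # store each unique resource once
                self.blobs[digest] = data
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.pages[url] = digests

    def delete_page(self, url):
        # Reference counting: only remove a shared resource
        # once no remaining page references it.
        for digest in self.pages.pop(url):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.blobs[digest]
```

Saving two pages from the same site that share a logo stores the logo bytes once; deleting one page leaves the logo in place for the other, which is exactly the extra complexity Design A avoids.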

Design C
Save resources from the same DNS domain in one place. Items under one domain are more manageable.

Compromise? Best of both worlds, or complexity of both?

Caveat: one web page can have resources coming from several domains.
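A quick sketch of the Design C bucketing, and of the caveat: the URLs and helper name below are invented for illustration. Grouping by domain is a one-line mapping, but a single page's resources can still land in several buckets.

```python
from urllib.parse import urlparse

def domain_bucket(url):
    """Map a resource URL to its Design C storage bucket: the DNS domain."""
    return urlparse(url).netloc

# One page, resources from two domains -- the page spans two buckets.
page_resources = [
    "http://example.com/index.html",
    "http://example.com/style.css",
    "http://images.ad-server.net/banner.gif",
]
buckets = {domain_bucket(u) for u in page_resources}
```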

Why don't we all have 10,000 bookmarks?

Is that because there isn't that much worthy material on the web? That is not true, at least not for me. If there were tools to support it, I am confident I would accumulate a bookmark list of 10,000 items within 10 years.

One obvious issue is that the web browser's bookmark UI is not scalable. The primary UI is a pull-down menu with items organized under hierarchical folders. Access becomes dramatically more difficult when a menu is longer than a screenful or when items are organized at a second-level folder or below. I painstakingly organized my bookmark list to keep frequently used items accessible and avoid having some fall off the radar screen. It was never fully successful. It also means I bookmark a lot less than I want to.

Search-based access is an improvement. Organizing by a flat keyword list (tags a la del.icio.us) also improves upon a strict hierarchical taxonomy.
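The flat-keyword idea reduces to a tiny data structure. A minimal sketch (class and URLs are hypothetical, not from any real bookmark tool): each tag maps to a set of bookmarks, and "navigation" is just set intersection instead of walking a folder tree.

```python
from collections import defaultdict

class TagIndex:
    """Sketch of flat keyword (tag) organization instead of folders."""

    def __init__(self):
        self.by_tag = defaultdict(set)  # tag -> set of bookmark URLs

    def add(self, url, tags):
        for tag in tags:
            self.by_tag[tag].add(url)

    def search(self, *tags):
        # Intersect the tag sets: no hierarchy to navigate,
        # and one bookmark can live under many tags at once.
        results = set(self.by_tag[tags[0]])
        for tag in tags[1:]:
            results &= self.by_tag[tag]
        return results
```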

When you really have 10,000 bookmarks, another issue comes up. People don't seem to have the ability to pay attention to such a large number of items. I purge bookmarks regularly once I think I have a good grip on the information and it cannot provide me extra value. This is usually a mistake. I seldom want to purge something completely from my consciousness. What I was trying to do is keep the bookmark list tidy to prepare for new information. The tools force me to erase the memory completely.

An alternative is to keep recent items handy while pushing old items into an archive. Gmail introduced such an organization technique. It recommends using 'archive' instead of 'delete', so an item goes out of sight but stays accessible via search in the long-term archive. This approach is very promising but unconventional. How much success is Google having with it?

Monday, February 21, 2005

Practical File System Design

Part of the design is reminiscent of a file system. It stores web pages instead of files, and there are attributes like URL and title that are indexed and searchable. I just found Dominic Giampaolo's book "Practical File System Design with the Be File System" available online (http://www.nobius.org/~dbg/practical-file-system-design.pdf). It seems to be a good study of the subject.

Saturday, February 19, 2005

A Comparison of Hyperstructures

A Comparison of Hyperstructures: Zzstructures, mSpaces, and Polyarchies

McGuffin, M. J. and schraefel, m. c. (2004) A Comparison of Hyperstructures: Zzstructures, mSpaces, and Polyarchies. In Proceedings of the ACM Conference on Hypertext and Hypermedia, 2004 (in press), pages 153-162, Santa Cruz, California, USA.

Just found this paper at http://eprints.ecs.soton.ac.uk/9230/. The descriptions of hyperstructures have clarified a lot of my unstructured thoughts on how to organize and structure bookmarks. The del.icio.us way of eschewing hierarchical structure in favor of associated keywords is a simple and effective approach. However the paper says

"A disadvantage of using this approach alone is that users cannot exploit their spatial memory and learn where things are located, or remember paths of links to find them"

It also helps me understand what the semantic web is all about. Great work.

Wednesday, February 02, 2005

Article gone? MindRetrieve saves the day

I had bookmarked an article, "INNOVATION, INFORMATION TECHNOLOGY, AND THE CULTURE OF FREEDOM" by Manuel Castells, for later reading at


When I went back today, the original article had somehow been pulled and replaced with a Spanish piece. The site is in Portuguese and there is no indication of any language button. I began to question my memory: was there ever an English article there?

A search in MindRetrieve quickly dispelled my doubt. In the cache is the English article I had seen.

Anyway, it is largely things about open source that I already know. Nothing particularly insightful there.

TortoiseSVN + Total Commander

I've just set up the TortoiseSVN client with the fresh release of Total Commander 6.5. The result is a wonderful integration of an SVN client with my favourite file manager. A little icon next to each file indicates whether it is modified. Diff, commit and revert commands are readily available in the context menu. Even though the SVN repository is halfway around the globe and I am working off DSL, the speed still blows away the ClearCase I use in a corporate environment. Also, SVN works great offline: diff works without the network!

The only thing I haven't figured out is how to store the session for the svn+ssh protocol. Right now it asks for a password for every remote operation, sometimes even twice.
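One thing worth trying (untested here): Subversion's per-user config file (%APPDATA%\Subversion\config on Windows) has a [tunnels] section where the ssh command can be redefined. Pointing it at PuTTY's plink with a saved session that carries the credentials, or letting TortoiseSVN's bundled TortoisePlink pick up keys from a running Pageant, should avoid the repeated prompts. The session name below is a placeholder.

```ini
[tunnels]
# Hypothetical: route svn+ssh through plink with a saved PuTTY
# session ("mysession") that already holds the login details.
ssh = plink.exe -load mysession
```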