MindRetrieve Blog

MindRetrieve - an open source desktop search tool for your personal web

Friday, December 30, 2005

Incremental Development

I was looking at the subversion checkin log.

I noticed that I have made many, many small checkins, often several times a day, with each checkin covering a group of several files (keeping in mind that I am not working on this full time). This style is quite different from my work on other projects, where I checked in far less often but usually in larger chunks.

Perhaps this says something about productivity? Perhaps I was acting thoughtlessly because I'm the only developer right now. But just now I have come to another characterization: this is incremental development!

Each time I make a small change: add a new feature or a method, refactor code, fix a bug, add a test. I make the code change, test it, and then check in. The code base is functional most of the time. Seldom do I make big changes that break the code base for several days or more.

Is incremental development the best development process? I'll leave that to other discussions. But from a developer's perspective, having a functional system most of the time and being able to test and verify any code change easily is wonderful. Every time I do a checkin I have the satisfaction that something is done. Coming from an environment where changing even one line of code would lead to the tedious work of building a test environment and a painful testing process, this is pure joy.

Wednesday, December 28, 2005

Keep Your Article in One Single Web Page

There is a style widely used in web publishing. It says people don't like to read long articles, and that one should keep modem users in mind, since a long web page results in a slow download. If you have a long article to publish, break it into several short sections and let the user read it page by page.

I have arrived at a contrarian view. I think breaking a long article into several pages is a hassle for users; it is best to publish the entire article as one single page. The issue is that in order to finish the article I need to click next page several times, and every click introduces a delay that interrupts the momentum. Usually the delay is short, like one or two seconds, but it is noticeable. Scrolling is seamless in comparison. For slow sites the delay is much worse; some sites routinely take 10 seconds or more. That feels like an episode ending in a cliffhanger, where we have to wait for the next episode.

Are several short pages a better layout than one long one? I actually prefer the long one. The scroll bar gives a good indication of how far into the article I have progressed. The scroll wheel is very handy for navigation. I can also use the browser to search for a word that appears anywhere in the article. All of this is better than arbitrarily breaking an article into several parts and leaving only a narrow window for the user. The speed issue is a non-issue: any browser should be able to render incrementally, so you can start reading as soon as any text arrives. Even my cellphone does this flawlessly.

In some cases the multiple-page format backfires badly. I'm glad that Yahoo has a mobile version made for cellphones at http://mobile.yahoo.com/. I thought it would be great for reading email. It turns out the problem with cellphone data networks is not just low bandwidth; every time I click a link there is a long delay before I get any response. Since Yahoo mobile breaks every screen into 15 lines or so, even the simplest email costs me multiple clicks to read through. The long delay between clicks makes it plain unusable. I ended up going back to the regular web interface. Even though I have to scroll through tons of irrelevant stuff to get to the email body, I still prefer it to the mobile version.

Of course my comment should not be generalized too far. I have tried loading an entire technical manual as a single web page (on my hard drive). The browser has a noticeable delay processing a page of that size. But I still keep the manual in this format for the ease of searching.

Friday, December 16, 2005

ISO 8601, the metric of date format

Different countries have different conventions for writing dates. Some write in m/d/y order, some in d/m/y order, and so on. One thing to do for localization is to show the date in the right format...

I say ____ the convention. Write it in ISO 8601 format, that is, the YYYY-MM-DD format. No more confusion over whether the day or the month comes first. And when you sort it, it comes out in chronological order. ISO 8601 is the metric of date formats.
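To make the sorting point concrete, here is a small sketch in Python. Producing the format takes one call, and a plain string sort of ISO 8601 dates is automatically chronological:

```python
from datetime import date

# ISO 8601: year, month, day, zero-padded, most significant part first
d = date(2005, 12, 16)
iso = d.strftime('%Y-%m-%d')   # '2005-12-16'; d.isoformat() gives the same

# Because the most significant field comes first, a lexicographic sort
# of ISO 8601 strings is also a chronological sort.
dates = ['2005-12-16', '2005-02-01', '2004-11-30']
dates.sort()   # ['2004-11-30', '2005-02-01', '2005-12-16']
```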

Saturday, December 10, 2005

Your Site Has a Vulnerability

I was testing my web application for security problems. Failure to escape user input is a very common class of security problem. So I created an input string like this:

'"></script><h1><font size=7 color=red>GOTCHA<iframe src=http://mindretrieve.blogspot.com/2005/12/your-site-has-vulnerability.html width=500 height=300>

Cut and paste it into any input field and then click submit. If you see something you don't expect, that site probably has a problem.
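For the defending side, a minimal sketch of the fix: escape user input before echoing it back into the page. This is just my illustration using the modern Python standard library, not MindRetrieve code, but any web framework should have an equivalent.

```python
from html import escape  # Python standard library

# The same kind of hostile input as the test string above
payload = '\'"></script><h1>GOTCHA'

# escape() converts &, <, >, " and ' into HTML entities,
# so the browser renders the input as plain text instead of markup.
safe = escape(payload)
```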

Hurry! Test out your site before hackers do!

Thursday, December 08, 2005

17000 lines of code

It has been a while since I did a tally of the code size. The new statistics show about 17000 lines of code in about 100 modules. That is for a non-trivial application with everything deliberately designed to be as simple as possible.

A closer look inside shows that among them are 5000 lines of unit test code. That is on track with the rule of thumb of a 1 to 1 ratio of production code to test code.

The two largest modules have around 700 lines of code each. The median is around 100 lines. This sounds minuscule compared to most other software. But with Python, 700 lines can be a really sophisticated module. In many cases I break a module down into smaller components before it even reaches 700 lines.
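As an aside, a tally like this is easy to reproduce. A minimal sketch that counts lines per Python module under a directory (the directory name you pass in is up to you; this is not the script I actually used):

```python
import os

def count_lines(root):
    """Count lines of code in each .py module under root."""
    counts = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.py'):
                path = os.path.join(dirpath, name)
                with open(path, encoding='utf-8', errors='replace') as f:
                    counts[path] = sum(1 for _ in f)
    return counts

# counts = count_lines('src/')
# print(len(counts), 'modules,', sum(counts.values()), 'lines')
```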

Wednesday, December 07, 2005

Google vs. bookmarks?

While we are working hard on MindRetrieve to improve bookmarking, some people argue that the entire idea of bookmarking is outdated. Instead of bookmarking, we just google. Indeed Google gives such an excellent experience that people expect the right answer instantaneously, more easily than looking things up from the bookmark menu.

While this is true for some obvious sites, perhaps 'dell' or 'walmart', it isn't necessarily useful for everything you are interested in. How many times have you flipped through pages of marginally useful search results until you finally found the one that gives what you need? (Assuming, of course, you are an advanced searcher who actually goes beyond the first page.) You will not want to repeat this search. And if you do, you won't necessarily find the same item. How about a page that is not a direct search result but is one click, or two clicks, away? Let's say from Google you found a passionate travelogue about an Italy trip. From there you found more links with some hotel recommendations that you really want. Google is a good starting point, but you also made additional effort to arrive at what you needed.

When you search and evaluate the results, you are actually creating something of value. If you put together the link to the travelogue, a guide on how to use the Italian train system and the two cute hotels you have seen, you will in effect have created some kind of travel guide, even if you have not authored any of them. The personal web is a way for you to capture that creation.

Thursday, December 01, 2005

Weblib file specification

The work on MindRetrieve has reached another milestone: I have published the weblib file specification, and it now supports updating. Previously I was using a hack that rewrote the entire file whenever I updated a tiny little piece of data. That actually served me well for several months of development and personal use; rewriting a 200k file every time never seemed to add any drag. In any case the newer format seems well designed and is more ready for future scalability.

I've included a snapshot of the spec here, or you can find the latest version in the source code.

MindRetrieve Weblib Data File Specification Version 0.5

MindRetrieve weblib data is a UTF-8 encoded text file (no other
encoding is supported at this time). The overall format is a block of
headers followed by a blank line and then the body, similar to email and
HTTP messages. Each line of the body part represents a webpage or tag
item. Updates to the weblib are appended as change records at the end of
the file. The entire weblib can be represented by a single file.

file = headers BR body
headers = *(header BR)
header = field-name ":" [ field-value ]
field-name = token
field-value = DSV encoded value
body = column-header BR *((data-line | comment-line | *SP) BR)
column-header = column-name *( "|" column-name)
column-name = token
comment-line = "#" any string
data-line = [change-prefix] (data-record | header)
change-prefix = '[' YYYY-MM-DD SP HH:MM:SS ']' SP ['r' | 'u' | 'h'] '!'
data-record = ["@"] id *( "|" field-value)
BR = CR | LF | CR LF
SP = space characters


* token is defined according to RFC 2616 Section 2.2.

* DSV encoded value is a Unicode string with the characters "\", "|",
  CR and LF encoded as "\\", "\|", "\r" and "\n" respectively.

* There are two kinds of data records: a webpage has a numeric id,
  while a tag has a numeric id prefixed by "@".

//* A record with the same id can appear multiple times in the data file.
//  The last record overwrites preceding records.

* A data-record preceded by a change-prefix denotes an update to the file.

* A record prefixed by "[ISO8601 time] r!" is a remove record. The item
with the corresponding id is to be removed.

* A record prefixed by "[ISO8601 time] u!" is an update record. The item
with the corresponding id is to be replaced.

* A record prefixed by "[ISO8601 time] h!" is a header update record.
  The header value is to be updated. There is no remove-header record;
  a header value can, however, be set to the empty string.

* The last line should always end with BR. If the last line is not
  terminated with BR it is considered a corrupted record and must be
  discarded. Moreover, never append a change record after a corrupted
  record, because the line break would be misplaced.

* The encoding header is defined for future extension only. Only UTF-8
encoding is supported right now.
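To make the DSV rule above concrete, here is a minimal sketch of the encoder and decoder in Python. The helper names are my own for illustration, not taken from the MindRetrieve source.

```python
def dsv_encode(value):
    """Escape "\\", "|", CR and LF as "\\\\", "\\|", "\\r", "\\n" per the spec.
    The backslash must be escaped first, so later escapes are unambiguous."""
    return (value.replace('\\', '\\\\')
                 .replace('|', '\\|')
                 .replace('\r', '\\r')
                 .replace('\n', '\\n'))

_UNESCAPE = {'\\': '\\', '|': '|', 'r': '\r', 'n': '\n'}

def dsv_decode(value):
    """Reverse dsv_encode: walk the string, expanding backslash escapes."""
    out, i = [], 0
    while i < len(value):
        if value[i] == '\\' and i + 1 < len(value):
            out.append(_UNESCAPE.get(value[i + 1], value[i + 1]))
            i += 2
        else:
            out.append(value[i])
            i += 1
    return ''.join(out)
```

Fields encoded this way contain no raw "|", CR or LF, so a data-record line can be split on "|" and kept on a single line of the file.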