MindRetrieve Blog

MindRetrieve - an open source desktop search tool for your personal web

Thursday, December 01, 2005

Weblib file specification

The work on MindRetrieve has reached another milestone: I have published the weblib file specification, and the format now supports updating. Previously I used a hack that rewrote the entire file whenever I updated a tiny piece of data. That actually served me well for several months of development and personal use; rewriting a 200k file every time never seemed to add any drag. In any case, the newer format seems well designed and is better prepared for future scalability.

I've included a snapshot of the spec here, or you can find the latest version in the source code.

MindRetrieve Weblib Data File Specification Version 0.5

MindRetrieve weblib data is a UTF-8 encoded text file (no other
encoding is supported at this time). The overall format is a block of
headers followed by a blank line and then the body, similar to email and
HTTP messages. Each line of the body represents a webpage or tag
item. Updates to the weblib are appended as change records at the end of
the file. The entire weblib can be represented by a single file.

file = headers BR body
headers = *(header BR)
header = field-name ":" [ field-value ]
field-name = token
field-value = DSV encoded value
body = column-header BR *((data-line | comment-line | *SP) BR)
column-header = column-name *( "|" column-name)
column-name = token
comment-line = "#" any string
data-line = [change-prefix] (data-record | header)
change-prefix = '[' YYYY-MM-DD SP HH:MM:SS ']' SP ['r' | 'u' | 'h'] '!'
data-record = ["@"] id *( "|" field-value)
BR = CR | LF | CR LF
SP = space characters
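
As an illustration of the grammar above, here is a minimal Python sketch that splits a weblib file into its header block, column names and remaining body lines. It is not the MindRetrieve implementation; the function name split_weblib and the header names in the usage example are hypothetical, and trimming spaces around a header value is an assumption the grammar does not spell out. Header values are left DSV encoded.

```python
import re

# BR = CR | LF | CR LF, as defined in the grammar.
BR = re.compile(r"\r\n|\r|\n")

def split_weblib(text):
    """Split weblib text into (headers, column_names, body_lines)."""
    lines = BR.split(text)
    headers = {}
    i = 0
    # The header block runs until the first blank line.
    while i < len(lines) and lines[i] != "":
        name, _, value = lines[i].partition(":")
        # Assumption: surrounding spaces in the value are not significant.
        headers[name] = value.strip(" ")
        i += 1
    body = lines[i + 1:]
    # The first body line is the column header.
    columns = body[0].split("|") if body else []
    return headers, columns, body[1:]
```

For example, split_weblib("encoding: utf-8\n\nid|name|url\n1|Example|http://example.com/\n") yields one header, three column names, and the data line followed by an empty string for the text after the final BR.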


Note

* token is defined according to RFC 2616 Section 2.2.

* DSV encoded value is a Unicode string with the characters "\", "|",
CR and LF encoded as "\\", "\|", "\r" and "\n" respectively.

* There are two kinds of data records: a webpage has a numeric id, while
a tag has a numeric id prefixed by "@".

//* A record with the same id can appear multiple times in the data file.
// The last record overwrites preceding records.

* A data-record preceded by a change-prefix denotes an update to the file.

* A record prefixed by "[ISO8601 time] r!" is a remove record. The item
with the corresponding id is to be removed.

* A record prefixed by "[ISO8601 time] u!" is an update record. The item
with the corresponding id is to be replaced.

* A record prefixed by "[ISO8601 time] h!" is a header update record. The
header value is to be updated. There is no remove header record. A
header value can, however, be set to an empty string.

* The last line should always end with BR. If the last line is not
terminated with BR, it is considered a corrupted record and must be
discarded. Moreover, never append a change record to a corrupted record,
because the line break would be misplaced.

* The encoding header is defined for future extension only. Only UTF-8
encoding is supported right now.
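
To make the notes above concrete, here is a rough Python sketch of DSV encoding and decoding, of discarding an unterminated last line, and of replaying change records in file order. It is only a sketch under stated assumptions, not the actual MindRetrieve code: all function names are hypothetical, unknown escape sequences are kept as-is, optional spaces after "!" are tolerated, body lines without a change prefix are treated as data records, and values in header update records are left DSV encoded.

```python
import re

BR = re.compile(r"\r\n|\r|\n")
# change-prefix: "[YYYY-MM-DD HH:MM:SS] " then r, u or h, then "!".
CHANGE = re.compile(r"\[\d{4}-\d{2}-\d{2} +\d{2}:\d{2}:\d{2}\] +([ruh])! *(.*)")
_UNESCAPE = {"\\": "\\", "|": "|", "r": "\r", "n": "\n"}

def dsv_encode(value):
    """Escape backslash, '|', CR and LF; backslash must go first."""
    return (value.replace("\\", "\\\\").replace("|", "\\|")
                 .replace("\r", "\\r").replace("\n", "\\n"))

def dsv_split(line):
    """Split a line on unescaped '|' and decode each field."""
    fields, current, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Assumption: an unknown escape is kept unchanged.
            current.append(_UNESCAPE.get(nxt, "\\" + nxt))
        elif ch == "|":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

def complete_lines(text):
    """Return only BR-terminated lines.

    The last element of the split is the text after the final BR: either
    empty, or an unterminated (corrupted) record. It is dropped either way.
    """
    return BR.split(text)[:-1]

def replay(body_lines):
    """Apply data lines in order; return ({id: fields}, {header: value})."""
    items, header_updates = {}, {}
    for line in body_lines:
        if not line.strip(" ") or line.startswith("#"):
            continue  # blank line or comment-line
        m = CHANGE.match(line)
        kind, rest = (m.group(1), m.group(2)) if m else ("u", line)
        if kind == "h":
            name, _, value = rest.partition(":")
            header_updates[name] = value.strip(" ")
            continue
        fields = dsv_split(rest)
        if kind == "r":
            items.pop(fields[0], None)    # remove record
        else:
            items[fields[0]] = fields[1:]  # plain record or update record
    return items, header_updates
```

Ids are kept as strings, so a tag such as "@2" and a webpage "2" never collide. A writer would apply the same check in reverse: before appending a change record, it would confirm that the file ends with BR.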
