Time Indexing Web Logging Case Study
Some application areas have
always recorded time information for each data item they hold.
One such example is log file data, where applications
write entries into a log file.
Each entry contains a field
that holds the time and date of the log entry.
The way to access that time and the associated data
depends on the format of the log file.
Each application may write its log files using
a different format.
Having data in different formats means that each format needs its own
mechanisms or tools.
Each tool can only use data in the format it was written for.
A tool for one format will
generally not work with log file data in another format.
Doing identical time-based tasks on log files in different formats
requires re-inventing the wheel for each format.
Time-indexing solves this problem by having the time-based operations
defined for the container, for any
time-based data.
The log files shown for this example are for a mail server, a web server,
and a network packet sniffer.
Each application adds an entry to its log
when a significant event occurs.
The entries are discrete and individual pieces of data.
By looking at the formats of the log files presented below,
it is apparent that the way the time is presented and where it is
in each individual entry is different.
Below is a sample from a mail server log file:
Mar 17 14:01:03 netvista postfix/pickup[21336]: 0779CAB0B: uid=0 from=
Mar 17 14:01:03 netvista postfix/cleanup[21598]: 0779CAB0B: message-id=<20030317140102.0779CAB0B@mail.timeindexing.com>
Mar 17 14:01:03 netvista postfix/nqmgr[3029]: 0779CAB0B: from=, size=664, nrcpt=1 (queue active)
Mar 17 14:01:04 netvista postfix/local[21600]: 0779CAB0B: to=, relay=local, delay=2, status=sent ("|/usr/bin/procmail -Y -a $DOMAIN")
Mar 17 14:59:21 netvista postfix/smtpd[21845]: connect from unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/smtpd[21845]: 0E1C7AAAD: client=unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/cleanup[21847]: 0E1C7AAAD: message-id=<1047913159.2358.73.camel@netvista.timeindexing.com>
Mar 17 14:59:22 netvista postfix/smtpd[21845]: disconnect from unknown[81.6.214.74]
Mar 17 14:59:22 netvista postfix/nqmgr[3029]: 0E1C7AAAD: from=, size=821, nrcpt=1 (queue active)
Mar 17 14:59:26 netvista postfix/smtp[21849]: 0E1C7AAAD: to=, relay=smtp.nildram.co.uk[195.112.4.54], delay=4, status=sent (250 Ok: queued as CA0781E22B4)
The time is at the start of the line and has the format: month day time.
An example is Mar 17 14:01:03. The rest of the data is
related to a mail message.
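As a rough illustration of what reading this format by hand involves, the sketch below parses the leading timestamp of a mail log line in Python. The function name parse_mail_log_time is hypothetical, and because the log line carries no year, the year is an assumption that the caller has to supply. The month abbreviation is matched using the current locale, so an English locale is also assumed.

```python
from datetime import datetime

def parse_mail_log_time(line, year):
    """Return the timestamp at the start of a mail log line.

    The line itself has no year, so `year` is supplied by the caller
    (an assumption, not something recorded in the log).
    """
    # The first 15 characters hold "Mon DD HH:MM:SS", e.g. "Mar 17 14:59:21".
    stamp = line[:15]
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

line = "Mar 17 14:59:21 netvista postfix/smtpd[21845]: connect from unknown[81.6.214.74]"
print(parse_mail_log_time(line, 2003))  # prints 2003-03-17 14:59:21
```

A parser like this is tied to one layout: it relies on the time occupying the first 15 characters of the line.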
Below is a sample from a web server log file:
81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/ HTTP/1.1" 200 1341 "http://www.timeindexing.com/timeindexing/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/favicon.gif HTTP/1.1" 404 327 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/movieonly.jpg HTTP/1.1" 404 329 "http://www.timeindexing.com/docs/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/summary.html HTTP/1.1" 200 11243 "http://www.timeindexing.com/docs/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/print.css HTTP/1.1" 200 418 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/normal.css HTTP/1.1" 200 455 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/diagrams/architecture.gif HTTP/1.1" 200 7268 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
81.6.214.74 - - [17/Mar/2003:16:46:55 +0000] "GET /docs/diagrams/web-audio.gif HTTP/1.1" 200 3407 "http://www.timeindexing.com/docs/summary.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020826"
The time is in the middle of the line and has the format:
[day/month/year:time +0000], where +0000 is the offset from UTC. An example is
[17/Mar/2003:16:46:54 +0000]. The first field is the
IP address of the machine that connected to the web server, and the fields
after the time describe the request, including the page that was requested from the server.
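For comparison, a minimal Python sketch for pulling the time out of a web server log line might look like the following. The names TIME_FIELD and parse_web_log_time are hypothetical, and the sketch assumes the time is the first bracketed field on the line, as it is in the sample above.

```python
import re
from datetime import datetime

# Matches the first bracketed field, e.g. [17/Mar/2003:16:46:54 +0000].
TIME_FIELD = re.compile(r"\[([^\]]+)\]")

def parse_web_log_time(line):
    """Return the timezone-aware timestamp found in a web server log line."""
    match = TIME_FIELD.search(line)
    if match is None:
        raise ValueError("no bracketed time field found")
    # %z consumes the UTC offset, so the result carries its timezone.
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")

line = '81.6.214.74 - - [17/Mar/2003:16:46:54 +0000] "GET /docs/ HTTP/1.1" 200 1341'
print(parse_web_log_time(line))  # prints 2003-03-17 16:46:54+00:00
```

Both the location of the field and its layout differ from the mail log, so none of the mail log parsing logic can be reused here.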
A final example is from a network packet sniffer log file:
17:20:06.748556 63.236.73.20.http > netvista.timeindexing.com.3867: . 1:1409(1408) ack 152 win 5792 <nop,nop,timestamp 627422282 62477830> (DF)
17:20:06.748618 netvista.timeindexing.com.3867 > 63.236.73.20.http: . ack 1997 win 8448 <nop,nop,timestamp 62478301 627422282> (DF)
17:20:06.772297 205.156.51.200.http > netvista.timeindexing.com.3868: . 501:1001(500) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478293>
17:20:06.772351 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1001 win 7500 <nop,nop,timestamp 62478303 1056862170> (DF)
17:20:06.780926 205.156.51.200.http > netvista.timeindexing.com.3868: . 1001:1501(500) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478293>
17:20:06.780990 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1501 win 8500 <nop,nop,timestamp 62478304 1056862170> (DF)
17:20:06.869071 205.156.51.200.http > netvista.timeindexing.com.3868: P 1501:1710(209) ack 161 win 65500 <nop,nop,timestamp 1056862170 62478303>
17:20:06.869137 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1710 win 8500 <nop,nop,timestamp 62478313 1056862170> (DF)
17:20:07.338818 205.156.51.200.http > netvista.timeindexing.com.3868: P 1710:1715(5) ack 161 win 65500 <nop,nop,timestamp 1056862171 62478313>
17:20:07.338871 netvista.timeindexing.com.3868 > 205.156.51.200.http: . ack 1715 win 8500 <nop,nop,timestamp 62478360 1056862171> (DF)
The time is also at the start of the line, but in this case the
format is: hour:minute:seconds.microseconds.
An example is 17:20:06.748556. This time format
has no year, month, or day, so on its own it cannot be used to select
data by time beyond the span of a single day.
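A sketch of how such a time could be turned into a full timestamp is shown below. Since the log records only the time of day, the date has to come from somewhere else; here it is passed in as capture_date, and the date used in the example is purely an assumption for illustration. The function name parse_sniffer_time is also hypothetical.

```python
from datetime import date, datetime

def parse_sniffer_time(line, capture_date):
    """Combine the time of day at the start of a sniffer log line with a date.

    `capture_date` is not in the log; it must be known from elsewhere.
    """
    stamp = line.split(" ", 1)[0]  # e.g. "17:20:06.748556"
    time_of_day = datetime.strptime(stamp, "%H:%M:%S.%f").time()
    return datetime.combine(capture_date, time_of_day)

line = "17:20:06.748556 63.236.73.20.http > netvista.timeindexing.com.3867: . 1:1409(1408) ack 152 win 5792"
# The capture date below is an assumed value, not taken from the log.
print(parse_sniffer_time(line, date(2003, 3, 17)))  # prints 2003-03-17 17:20:06.748556
```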
One of the major concerns with management of servers such as
mail servers and web servers is what to do with the log files.
Many system managers keep the logs for each day separately,
and after 7 days the logs are removed. However, this approach
fragments the log files, which contain much useful data, and
eventually throws that data away. This makes it
difficult to do long- and medium-term statistics and analysis
on server usage and behaviour. Other system managers realise
that the fragmentation is problematic, and that 7 days' worth of log data is not
enough. They resort to keeping one large log file. These
managers now have the problem of how to get selections of data,
say for a particular day or a particular hour,
out of the log file in order to process it further. The times and
dates in the log file are hard to process in their textual
form.
Furthermore, any scheme used to get data out of one log file
will not work with a different log file because the time formats are different.
The format of dates in the mail server logs is different from
that in the web server logs, which in turn is different from
that in the network sniffer logs.
This is where time-indexing comes in.
By using a time-index as a container for the log file,
data selections for periods of time can be easily retrieved.
This is because time-indexing has the operations required
to process times. The system manager can keep log files
for as long as needed because any time-based selection can be easily made.
Moreover, a time-based selection for one indexed log will work on
any other indexed log.
That is, if the manager requests all the data between midnight on March 1st
and midnight on April 1st for one indexed log, only the entries in that period
will be presented. The same request will work for all indexed logs.
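The idea of a format-independent selection can be sketched in a few lines of Python. The IndexedLog class below is a hypothetical in-memory stand-in, not the time-indexing system's own interface: it assumes each entry has already been given a timestamp when it was added, and that entries arrive in time order, as log entries normally do.

```python
from bisect import bisect_left
from datetime import datetime

class IndexedLog:
    """Hypothetical stand-in for a time-index: entries kept sorted by time."""

    def __init__(self):
        self._times = []
        self._entries = []

    def append(self, when, entry):
        # Entries must arrive in time order so the lists stay sorted.
        if self._times and when < self._times[-1]:
            raise ValueError("entries must be added in time order")
        self._times.append(when)
        self._entries.append(entry)

    def select(self, start, end):
        """Return the entries whose time satisfies start <= time < end."""
        lo = bisect_left(self._times, start)
        hi = bisect_left(self._times, end)
        return self._entries[lo:hi]

mail_log = IndexedLog()
mail_log.append(datetime(2003, 2, 28, 23, 59, 59), "entry from February")
mail_log.append(datetime(2003, 3, 17, 14, 1, 3), "postfix/pickup[21336]: 0779CAB0B: uid=0 from=")
mail_log.append(datetime(2003, 4, 1, 0, 0, 0), "entry from April")

# Midnight on March 1st up to, but not including, midnight on April 1st.
march = mail_log.select(datetime(2003, 3, 1), datetime(2003, 4, 1))
print(march)
```

The same select call would work unchanged on a web server or sniffer log held in another IndexedLog, because the selection operates on the stored timestamps rather than on the text of each entry.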
Although the indexed log may get quite large over time,
this is not really an issue given the size of discs these days.
Of more importance are
the benefits of using time-indexing.
First, the
data security and data integrity of the content of the log files
are maintained.
Second, it becomes possible to make any time-based selection of entries
from the indexed logs.
Most importantly, the effort and the subsequent cost
of managing log file data are reduced.