Overview

Sequences are something I came up with to make SMTP and POP logs more meaningful. The basic idea is most services you provide have a session, a lifetime of sorts during which a number of various things happen - the variety of which is finite and defined by the protocol, usually the range is actually small. It's just they can potentially be repeated alot so generally simple counts are kept if anything at all. The small range makes it easy to just pull single ascii character keys out of thin air to represent particular requests within a protocol.

Alot of programs out there when they log the end of a session, be it SMTP, POP, or IMAP, will print some counts, like how many times the user issued DELE, RETR, TOP in POP3 for example. This can be useful but it loses too much information, the time dimension is lost completely, there is no preservation of the order of these events, the sequence.

The way ServerKit sequences solve this problem is almost comically simple. By efficiently implementing a way to store the sequence of events in the form of a character string which you would probably log at the end of sessions.

When a session begins, you would create a new sequence with server_sequence_new(). Or if you already have something you allocate when a session begins, like a session structure, just place a server_sequence_t within it, and be sure to call server_sequence_reset() before trying to use it, which will initialize it properly.

Then, for every event / request / command etc you support, you would pick a character. For the purpose of this example I will use POP3.

u
USER
p
PASS
c
CAPA
l
LIST
r
RETR
s
STAT
t
TOP
U
UIDL
d
DELE
q
QUIT
R
RSET

given the above assignments, when USER ocurred, I would call:

server_sequence_append(&session->sequence, 'u');

PASS:

server_sequence_append(&session->sequence, 'p');

QUIT:

server_sequence_append(&session->sequence, 'q');

What ServerKit is doing with these characters is it's assembling a sequence stored in the server_sequence_t. It doesnt just append the letters onto the end of a char * until it runs out of space, though that is basically what it is doing. It also compresses repeated characters by counting the repetitions, any time a character is appended multiple times consecutively its count at that slot in the sequence accumulates rather than taking another slot.

The sequences are limited to 512 slots in the timeline, every slot can have up to 256 repetitions.

When your session is finished, and you wish to log the sequence in string form somewhere you must print it. You do this using the server_sequence_print() function. This function will print into a buffer you provide with a specified length. If the print would have exceeded the length of the buffer you provided the end of the printed sequence gets replaced with a '$' to indicate it has been truncated.

Also, if a slot overflows the repetition count, its character will get followed by a '+' to indicate overflow, meaning $>$ 256 repetitions.

Since these chracters are utilized by ServerKit in the sequence implementation, they are not permitted for use by the append function, you will get a error if you try use '$\backslash$0', '+', or '$'. This means those three characters are not available for you to assign to commands.

Here are some sample sequence prints using the POP3 example:

cupsq
the most common POP3 sequence, capa user pass stat quit

cupsUrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdrdq
Another common sequence pattern, same as above followed with uidl and a series of retr dele retr dele... then quit

cupsr2d2q
An example demonstrating the repetition counts, retr retr dele dele

cupsr+d+q
Same as above but instead of 2 we overflowed the repetition counter in retr and dele.

cupsr$
Illustrating a tiny buffer supplied to the print function, the sequence was truncated and the $ indicates it.

cupq
A bad sign, if this is going on alot MySQL replication may be broken and users can't authenticate. Clients don't login to not even check the stat or uidl, this is a failed login.

cupupupupupupupupup
Whos fishing for logins?

cu+
Another fishing example, some POP3 servers say something different when you enter user for a valid vs. invalid user, and let you try repeatedly... you wouldnt support that, right?

Another interesting thing about having the sequence information is it gives you some very useful insight into usage patterns. After you collect megabytes of these logs you can really see what are the most common patterns to help direct your optimization efforts.

I can tell you "cupsq" is probably the most common POP3 sequence. Knowing this, perhaps the parser should place higher priority on the capa, user, pass, stat, and quit keywords, if your parser is designed such that the patterns are tested in some arbitrary order (ptree use can make this a non-issue).

Another thing this shows is that users usually authenticate, execute STAT then quit. They are not retrieving the sorted list of messages or uidls, so a good optimization would be to change things so a user could login and have the output of STAT available without pounding the Maildir and sorting the list of all the messages and retrieving the uidl list.

Defer the Maildir and uidl retrieval until a command which requires that data is run, and since cupsq is the most frequent command it would be a huge gain in mail system efficiency. Just cache the STAT data somewhere that can be quickly accessed, and update it on Maildir changes.

The repetition is limited to 8 bit fields because I didnt want to make printing the sequences any more expensive than necessary. Large scale server programs do alot of logging, you don't want to waste cpu cycles here, so it's a compromise, and I think it's pretty well balanced.

Knowing that the process of converting an integer to string is not particularly cheap, I limited it to 8 bits per repetition accumulator and simply created a lookup table of the 256 strings. So, when you are printing the sequence, that whole modulo divide remainder base 10 store string in reverse thing isnt happening. It's a simple string copy from a lookup table indexed by the repetition count.

2007-12-06