Re: [ALUG] Wanted - algorithm for bufferless search for a pattern in a stream

7 Aug 2012

      On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:
...
On 06/08/12 19:25, Chris Green wrote:
...
So, obviously one reads a byte at a time from the stream to see if it
matches the first character of the constant string, if that matches one
moves on to the second, etc.  When the whole string has been matched
then it's found - so far so good.  However, when a character *doesn't*
match one can't go back up the input stream, *but* we know the previous
characters as they're the ones in the constant string that we have
matched so far.  Thus we can 'wind back' by using the constant string
for the number of characters that did match and then continue down the
stream.
That's one "obvious" method, but not the only one.
Another would be that you run multiple pattern matches (ie even
though you're already up to N characters in a match, you're still
checking for the start of a new match simultaneously, so that
if/when the first pattern fails you don't rewind but just drop it
and carry on with any other sequences you're still matching).
However, for a long string that *could* use as much space as the length
of the string * (size of a pointer) couldn't it?  That's worse than
using a buffer the size of the string.
...
Bear in mind that if you're looking for "ABAB" in the stream
"xxABABABxx" there are two matches but your algorithm will only pick
up the first one, I don't know if that matters to you.
No, it doesn't.
...
However, since grep can search within a stream surely that would be
a good starting point?
cat /my/stream | grep 'really long pattern'
You can open a process and feed data to it byte-by-byte if you want
to from Any Good Programming Language [TM] - eg for PHP[*]:
    http://www.php.net/manual/en/function.proc-open.php
Er, no, this is on a microcontroller with no OS as such so there aren't
any built in facilities like 'grep'.  It only has 2k of RAM so
conserving RAM is a primary requirement, hence my question.

I think I have actually solved the problem with a simpler approach
because the string I am looking for is an XML identifier (or whatever
they're called) so I know it starts with '<' and ends with '>' and that
there aren't any embedded '<' characters i it.  So, knowing that I can
always restart my search when I hit a '<', simplifies things immensely.
(there may be special cases where one can have an embedded '<' but
they're not going to happen in this code).

-- 
Chris Green