Re: [ALUG] Wanted - algorithm for bufferless search for a pattern in a stream

7 Aug 2012


      On 07-Aug-2012 10:29:25 Chris Green wrote:
...
On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:
...
On 06/08/12 19:25, Chris Green wrote:
...
So, obviously one reads a byte at a time from the stream to see if it
matches the first character of the constant string, if that matches one
moves on to the second, etc.  When the whole string has been matched
then it's found - so far so good.  However, when a character *doesn't*
match one can't go back up the input stream, *but* we know the previous
characters as they're the ones in the constant string that we have
matched so far.  Thus we can 'wind back' by using the constant string
for the number of characters that did match and then continue down the
stream.
That's one "obvious" method, but not the only one.
Another would be that you run multiple pattern matches (ie even
though you're already up to N characters in a match, you're still
checking for the start of a new match simultaneously, so that
if/when the first pattern fails you don't rewind but just drop it
and carry on with any other sequences you're still matching).
However, for a long string that *could* use as much space as the length
of the string * (size of a pointer) couldn't it?  That's worse than
using a buffer the size of the string.
...
Bear in mind that if you're looking for "ABAB" in the stream
"xxABABABxx" there are two matches but your algorithm will only
pick up the first one, I don't know if that matters to you.
No, it doesn't.
...
However, since grep can search within a stream surely that would
be a good starting point?
cat /my/stream | grep 'really long pattern'


You can open a process and feed data to it byte-by-byte if you
want to from Any Good Programming Language [TM] - eg for PHP[*]:
    http://www.php.net/manual/en/function.proc-open.php
Er, no, this is on a microcontroller with no OS as such so there
aren't any built in facilities like 'grep'.  It only has 2k of RAM
so conserving RAM is a primary requirement, hence my question.
I think I have actually solved the problem with a simpler approach
because the string I am looking for is an XML identifier (or whatever
they're called) so I know it starts with '<' and ends with '>' and that
there aren't any embedded '<' characters i it.  So, knowing that I can
always restart my search when I hit a '<', simplifies things immensely.
(there may be special cases where one can have an embedded '<' but
they're not going to happen in this code).
-- 
Chris Green
Well, that certainly simplifies matters! However, on the general
question (and assuming that you're satisfied with just picking
up the first match, i.e. knowing that there's at least one match),
then the only fool-proof general method I can think of is the
following. It presumes that you have enough RAM for two copies
of the target string, plus whatever is needed for the program etc.
[A]: Store the target string in a block of RAM (called "TARGET").
[B]: Set up a FIFO buffer ("FIFO") of the same size as TARGET.
[C]: Set flag "MATCHED" to 0.
Then: Divert (or copy) the input stream into FIFO.
Each time a byte is put into FIFO (starting with the first byte
at which FIFO becomes full), check FIFO byte-for-byte against
TARGET.
If FIFO matches TARGET, set flag MATCHED to 1.
(By the way, just to be clear in the "copy" case: when FIFO is full,
inputting an additional byte drops the first byte and shifts the
FIFO contents down one, making room for the new byte).
The "copy" case would be appropriate if you need the input stream
as it comes, without the delay while the bytes work their way
through FIFO.
Ted.
-------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@wlandres.net
Date: 07-Aug-2012  Time: 12:28:08
This message was sent by XFMail
-------------------------------------------------

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

Re: [ALUG] Wanted - algorithm for bufferless search for a pattern in a stream