Re: [ALUG] Wanted - algorithm for bufferless search for a pattern in a stream

8 Aug 2012


      On 07/08/12 13:51, (Ted Harding) wrote:
...
The sort of "edge case" I was thinking of when deciding against the
"obvious" method (based on your moving pointer algorithm) is:
Target:   ABCDABCDEFH
Input stream: ... ABCDABCDABCDEFHXYZ...
(so that the target is at ABCD|AABCDABCDEFH|XYZ...
It will happily get to the "^" point below:
ABCDABCD^ABCDEFHXYZ...
because the target is matched up to that point. But the next
byte is a mismatch, so then it should move back to byte 5 of the
above input stream and continue matching from that point
(using the accumulated matching bytes from the previous input bytes)
so that it knows it has matched
ABCDABCD
and only needs to watch out for EFHXYZ (which is about to arrive).
My "theory" is that it needs to track where it got to in the target, and 
replay the characters from the start (skipping the first one) as if they came 
from the input stream. So, from where you left off:
     Stream: ABCDABCD^ABCDEFHXYZ...
     Target: ABCDABCD^EFH
Replay the target starting after the first character:
     Stream: ABCDABCD^ABCDEFHXYZ...
     Target (replay): A[BCDABCD]EFH
     Target: ^ABCDABCDEFH
(ie target resets to start but keep track of start "[" and end "]" of the 
section of the buffer being replayed). The first few chars won't match but 
eventually:
     Stream: ABCDABCD^ABCDEFHXYZ...
     Target (replay): ABCDABCD[]EFH
     Target: ABCD^ABCDEFH
.. and you've exhausted the replay "buffer" (although note that the buffer is 
only actually two integers marking start/end of the section being replayed), 
and continue with the stream:
     Stream: ABCDABCDABCDEFH^XYZ...
     Target: ABCDABCDEFH^
...
This is already beginning to get complicated, and one can think
of examples where you may need a "stack" of pending partial matches.
Ah, but that's my point: I can't shake the suspicion that there might be cases 
where I'd effectively need to keep multiple replay "buffers" and if that 
number is arbitrary then this isn't a solution. But (especially after running 
through it above) I can't come up with an example that wouldn't work with the 
single start/end pair as above. Even if you want to pick up multiple matches 
which might overlap, eg ABCDABCD twice in ABCDABCDABCD, the above should still 
work; after the "hit" you still roll back:
     Stream: ABCDABCD^ABCD
     Target: ABCDABCD^ => HIT (1)
=>
     Stream: ABCDABCD^ABCD
     Target (replay): A[BCDABCD]
     Target: ^ABCDABCD
=>
     Stream: ABCDABCD^ABCD
     Target (replay): ABCDABCD[]
     Target: ABCD^ABCD
=>
     Stream: ABCDABCDABCD^
     Target: ABCDABCD^ => HIT (2)
...
At this point, you are beginning to need more RAM, and a more
complicated algorithm, which is why I thought that the FIFO + MATCH
procedure that I suggested would be (a) fool-proof, (b) reasonably
economical.
Your method would actually only need one (RAM) buffer of the size of the 
target string (the other would be in ROM) but if you're trying to match a 2K 
block of text in a 2GB stream you're still going to run out of RAM. I can't 
convince myself that my above algorithm would actually work in all cases 
though. In fact I *think* that my method is essentally the same as yours, but 
where yours stores N chars of received data in a buffer, mine stores it as 
references to the target string, and makes the (I think safe) assumption that 
if the first N chars of your buffer don't match the first N chars of the 
target then they can be discarded.
(I'm enjoying the exercise regardless!)
-- 
Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450
Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

Re: [ALUG] Wanted - algorithm for bufferless search for a pattern in a stream