On 07/08/12 13:51, (Ted Harding) wrote:
The sort of "edge case" I was thinking of when deciding against the "obvious" method (based on your moving pointer algorithm) is:
Target: ABCDABCDEFH
Input stream: ... ABCDABCDABCDEFHXYZ... (so that the target is at ABCD|AABCDABCDEFH|XYZ...
It will happily get to the "^" point below:
ABCDABCD^ABCDEFHXYZ...
because the target is matched up to that point. But the next byte is a mismatch, so then it should move back to byte 5 of the above input stream and continue matching from that point (using the accumulated matching bytes from the previous input bytes) so that it knows it has matched
ABCDABCD
and only needs to watch out for EFHXYZ (which is about to arrive).
My "theory" is that it needs to track where it got to in the target, and replay the characters from the start (skipping the first one) as if they came from the input stream. So, from where you left off: Stream: ABCDABCD^ABCDEFHXYZ... Target: ABCDABCD^EFH
Replay the target starting after the first character: Stream: ABCDABCD^ABCDEFHXYZ... Target (replay): A[BCDABCD]EFH Target: ^ABCDABCDEFH (ie target resets to start but keep track of start "[" and end "]" of the section of the buffer being replayed). The first few chars won't match but eventually: Stream: ABCDABCD^ABCDEFHXYZ... Target (replay): ABCDABCD[]EFH Target: ABCD^ABCDEFH .. and you've exhausted the replay "buffer" (although note that the buffer is only actually two integers marking start/end of the section being replayed), and continue with the stream: Stream: ABCDABCDABCDEFH^XYZ... Target: ABCDABCDEFH^
This is already beginning to get complicated, and one can think of examples where you may need a "stack" of pending partial matches.
Ah, but that's my point: I can't shake the suspicion that there might be cases where I'd effectively need to keep multiple replay "buffers" and if that number is arbitrary then this isn't a solution. But (especially after running through it above) I can't come up with an example that wouldn't work with the single start/end pair as above. Even if you want to pick up multiple matches which might overlap, eg ABCDABCD twice in ABCDABCDABCD, the above should still work; after the "hit" you still roll back: Stream: ABCDABCD^ABCD Target: ABCDABCD^ => HIT (1) => Stream: ABCDABCD^ABCD Target (replay): A[BCDABCD] Target: ^ABCDABCD => Stream: ABCDABCD^ABCD Target (replay): ABCDABCD[] Target: ABCD^ABCD => Stream: ABCDABCDABCD^ Target: ABCDABCD^ => HIT (2)
At this point, you are beginning to need more RAM, and a more complicated algorithm, which is why I thought that the FIFO + MATCH procedure that I suggested would be (a) fool-proof, (b) reasonably economical.
Your method would actually only need one (RAM) buffer of the size of the target string (the other would be in ROM) but if you're trying to match a 2K block of text in a 2GB stream you're still going to run out of RAM. I can't convince myself that my above algorithm would actually work in all cases though. In fact I *think* that my method is essentally the same as yours, but where yours stores N chars of received data in a buffer, mine stores it as references to the target string, and makes the (I think safe) assumption that if the first N chars of your buffer don't match the first N chars of the target then they can be discarded.
(I'm enjoying the exercise regardless!)