On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:
On 06/08/12 19:25, Chris Green wrote:
So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.
That's one "obvious" method, but not the only one.
Another would be that you run multiple pattern matches (ie even though you're already up to N characters in a match, you're still checking for the start of a new match simultaneously, so that if/when the first pattern fails you don't rewind but just drop it and carry on with any other sequences you're still matching).
However, for a long string that *could* use as much space as the length of the string * (size of a pointer) couldn't it? That's worse than using a buffer the size of the string.
Bear in mind that if you're looking for "ABAB" in the stream "xxABABABxx" there are two matches but your algorithm will only pick up the first one, I don't know if that matters to you.
No, it doesn't.
However, since grep can search within a stream surely that would be a good starting point?
cat /my/stream | grep 'really long pattern'
You can open a process and feed data to it byte-by-byte if you want to from Any Good Programming Language [TM] - eg for PHP[*]: http://www.php.net/manual/en/function.proc-open.php
Er, no, this is on a microcontroller with no OS as such so there aren't any built in facilities like 'grep'. It only has 2k of RAM so conserving RAM is a primary requirement, hence my question.
I think I have actually solved the problem with a simpler approach because the string I am looking for is an XML identifier (or whatever they're called) so I know it starts with '<' and ends with '>' and that there aren't any embedded '<' characters i it. So, knowing that I can always restart my search when I hit a '<', simplifies things immensely. (there may be special cases where one can have an embedded '<' but they're not going to happen in this code).