On 07-Aug-2012 10:29:25 Chris Green wrote:
On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:
On 06/08/12 19:25, Chris Green wrote:
So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.
That's one "obvious" method, but not the only one.
Another would be that you run multiple pattern matches (ie even though you're already up to N characters in a match, you're still checking for the start of a new match simultaneously, so that if/when the first pattern fails you don't rewind but just drop it and carry on with any other sequences you're still matching).
However, for a long string that *could* use as much space as the length of the string * (size of a pointer) couldn't it? That's worse than using a buffer the size of the string.
Bear in mind that if you're looking for "ABAB" in the stream "xxABABABxx" there are two matches but your algorithm will only pick up the first one, I don't know if that matters to you.
No, it doesn't.
However, since grep can search within a stream surely that would be a good starting point?
cat /my/stream | grep 'really long pattern'
You can open a process and feed data to it byte-by-byte if you want to from Any Good Programming Language [TM] - eg for PHP[*]: http://www.php.net/manual/en/function.proc-open.php
Er, no, this is on a microcontroller with no OS as such so there aren't any built in facilities like 'grep'. It only has 2k of RAM so conserving RAM is a primary requirement, hence my question.
I think I have actually solved the problem with a simpler approach because the string I am looking for is an XML identifier (or whatever they're called) so I know it starts with '<' and ends with '>' and that there aren't any embedded '<' characters i it. So, knowing that I can always restart my search when I hit a '<', simplifies things immensely. (there may be special cases where one can have an embedded '<' but they're not going to happen in this code). -- Chris Green
Well, that certainly simplifies matters! However, on the general question (and assuming that you're satisfied with just picking up the first match, i.e. knowing that there's at least one match), then the only fool-proof general method I can think of is the following. It presumes that you have enough RAM for two copies of the target string, plus whatever is needed for the program etc.
[A]: Store the target string in a block of RAM (called "TARGET"). [B]: Set up a FIFO buffer ("FIFO") of the same size as TARGET. [C]: Set flag "MATCHED" to 0.
Then: Divert (or copy) the input stream into FIFO. Each time a byte is put into FIFO (starting with the first byte at which FIFO becomes full), check FIFO byte-for-byte against TARGET.
If FIFO matches TARGET, set flag MATCHED to 1.
(By the way, just to be clear in the "copy" case: when FIFO is full, inputting an additional byte drops the first byte and shifts the FIFO contents down one, making room for the new byte).
The "copy" case would be appropriate if you need the input stream as it comes, without the delay while the bytes work their way through FIFO.
Ted.
------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@wlandres.net Date: 07-Aug-2012 Time: 12:28:08 This message was sent by XFMail -------------------------------------------------