Wanted - algorithm for bufferless search for a pattern in a stream - main

List overview All Threads
Download

newer

Wanted - algorithm for bufferless search for a pattern in a stream

older

Keyboard and mouse 'crash'

[ALUG Announce] ALUG Norwich Pub...

Chris Green

6 Aug 2012 6 Aug '12

7:25 p.m.

I can see how to do this in principle but I can't iron out the details and I'm sure there must be some ready made code for doing it 'out there' on the internet. It's just rather difficult to come up with search terms when I don't know what to call it.

To explain in detail I want to search for a (constant) string in a byte stream that I can't rewind and I don't want to use any significant buffer space (the constant string I'm looking for is quite long).

So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.

If anyone can follow that rubbish explanation then could they possibly tell me what the algorithm is called and/or point me at some sample code?

-- Chris Green

Show replies by date

Mark Rogers

7 Aug 7 Aug

9:56 a.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 06/08/12 19:25, Chris Green wrote:

...

So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.

That's one "obvious" method, but not the only one.

Another would be that you run multiple pattern matches (ie even though you're already up to N characters in a match, you're still checking for the start of a new match simultaneously, so that if/when the first pattern fails you don't rewind but just drop it and carry on with any other sequences you're still matching).

There may be others....

Bear in mind that if you're looking for "ABAB" in the stream "xxABABABxx" there are two matches but your algorithm will only pick up the first one, I don't know if that matters to you.

However, since grep can search within a stream surely that would be a good starting point?

cat /my/stream | grep 'really long pattern'

You can open a process and feed data to it byte-by-byte if you want to from Any Good Programming Language [TM] - eg for PHP[*]: http://www.php.net/manual/en/function.proc-open.php

[*] Not necessarily a Good Programming Language but it's what I'm used to and ISTR Chris uses it too.

-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

Chris Green

11:29 a.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:

...

On 06/08/12 19:25, Chris Green wrote:

...
So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.

That's one "obvious" method, but not the only one.

Another would be that you run multiple pattern matches (ie even though you're already up to N characters in a match, you're still checking for the start of a new match simultaneously, so that if/when the first pattern fails you don't rewind but just drop it and carry on with any other sequences you're still matching).

However, for a long string that *could* use as much space as the length of the string * (size of a pointer) couldn't it? That's worse than using a buffer the size of the string.

...

Bear in mind that if you're looking for "ABAB" in the stream "xxABABABxx" there are two matches but your algorithm will only pick up the first one, I don't know if that matters to you.

No, it doesn't.

...

However, since grep can search within a stream surely that would be a good starting point?
cat /my/stream | grep 'really long pattern'
You can open a process and feed data to it byte-by-byte if you want to from Any Good Programming Language [TM] - eg for PHP[*]: http://www.php.net/manual/en/function.proc-open.php

Er, no, this is on a microcontroller with no OS as such so there aren't any built in facilities like 'grep'. It only has 2k of RAM so conserving RAM is a primary requirement, hence my question.

I think I have actually solved the problem with a simpler approach because the string I am looking for is an XML identifier (or whatever they're called) so I know it starts with '<' and ends with '>' and that there aren't any embedded '<' characters i it. So, knowing that I can always restart my search when I hit a '<', simplifies things immensely. (there may be special cases where one can have an embedded '<' but they're not going to happen in this code).

-- Chris Green

Mark Rogers

12:12 p.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 07/08/12 11:29, Chris Green wrote:

...

However, for a long string that *could* use as much space as the length of the string * (size of a pointer) couldn't it? That's worse than using a buffer the size of the string.

That rather depends on the data, yes. However I couldn't quite get my head around how your approach would work without getting quite messy (albeit that sometimes messy is how it gets when you're on a microcontroller).

...

Er, no, this is on a microcontroller with no OS as such so there aren't any built in facilities like 'grep'. It only has 2k of RAM so conserving RAM is a primary requirement, hence my question.

Maybe no grep, but the grep source might be an interesting resource, or (to return to the original post) a good search keyword: what you want(ed) to do is implement grep-like pattern matching within the microcontroller. However your actual task turns out to be much simpler.

Out of curiosity, which controller and what environment are you developing in? The closest I've got to embedded code recently is a Raspberry Pi (ie not very close, and sheer luxury compared with the microcontroller code I played with a decade or more ago).

...

I think I have actually solved the problem with a simpler approach because the string I am looking for is an XML identifier (or whatever they're called) so I know it starts with '<' and ends with '>' and that there aren't any embedded '<' characters i it. So, knowing that I can always restart my search when I hit a '<', simplifies things immensely. (there may be special cases where one can have an embedded '<' but they're not going to happen in this code).

This does get much simpler yes. By definition you don't need to worry about rolling back since the first character is never in the rest of the string. Find first char, keep matching until end, reset on failure. You don't need to specifically reset when you see "<" (by definition, since it can't be in the search string, your matching code will fail at that point anyway).

-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

Chris Green

12:51 p.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On Tue, Aug 07, 2012 at 12:12:10PM +0100, Mark Rogers wrote:

...

Out of curiosity, which controller and what environment are you developing in? The closest I've got to embedded code recently is a Raspberry Pi (ie not very close, and sheer luxury compared with the microcontroller code I played with a decade or more ago).

It's a Nanode based on the ATMEGA328 chip. It's very like an Arduino, same dimensions etc. and one uses avr-gcc to compile.

-- Chris Green

Ted.Harding＠wlandres.net

12:28 p.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 07-Aug-2012 10:29:25 Chris Green wrote:

...

On Tue, Aug 07, 2012 at 09:56:16AM +0100, Mark Rogers wrote:

...
On 06/08/12 19:25, Chris Green wrote:

...
So, obviously one reads a byte at a time from the stream to see if it matches the first character of the constant string, if that matches one moves on to the second, etc. When the whole string has been matched then it's found - so far so good. However, when a character *doesn't* match one can't go back up the input stream, *but* we know the previous characters as they're the ones in the constant string that we have matched so far. Thus we can 'wind back' by using the constant string for the number of characters that did match and then continue down the stream.

That's one "obvious" method, but not the only one.

Another would be that you run multiple pattern matches (ie even though you're already up to N characters in a match, you're still checking for the start of a new match simultaneously, so that if/when the first pattern fails you don't rewind but just drop it and carry on with any other sequences you're still matching).

However, for a long string that *could* use as much space as the length of the string * (size of a pointer) couldn't it? That's worse than using a buffer the size of the string.

...
Bear in mind that if you're looking for "ABAB" in the stream "xxABABABxx" there are two matches but your algorithm will only pick up the first one, I don't know if that matters to you.

No, it doesn't.

...
However, since grep can search within a stream surely that would be a good starting point?
cat /my/stream | grep 'really long pattern'
You can open a process and feed data to it byte-by-byte if you want to from Any Good Programming Language [TM] - eg for PHP[*]: http://www.php.net/manual/en/function.proc-open.php
Er, no, this is on a microcontroller with no OS as such so there aren't any built in facilities like 'grep'. It only has 2k of RAM so conserving RAM is a primary requirement, hence my question.

I think I have actually solved the problem with a simpler approach because the string I am looking for is an XML identifier (or whatever they're called) so I know it starts with '<' and ends with '>' and that there aren't any embedded '<' characters i it. So, knowing that I can always restart my search when I hit a '<', simplifies things immensely. (there may be special cases where one can have an embedded '<' but they're not going to happen in this code). -- Chris Green

Well, that certainly simplifies matters! However, on the general question (and assuming that you're satisfied with just picking up the first match, i.e. knowing that there's at least one match), then the only fool-proof general method I can think of is the following. It presumes that you have enough RAM for two copies of the target string, plus whatever is needed for the program etc.

[A]: Store the target string in a block of RAM (called "TARGET"). [B]: Set up a FIFO buffer ("FIFO") of the same size as TARGET. [C]: Set flag "MATCHED" to 0.

Then: Divert (or copy) the input stream into FIFO. Each time a byte is put into FIFO (starting with the first byte at which FIFO becomes full), check FIFO byte-for-byte against TARGET.

If FIFO matches TARGET, set flag MATCHED to 1.

(By the way, just to be clear in the "copy" case: when FIFO is full, inputting an additional byte drops the first byte and shifts the FIFO contents down one, making room for the new byte).

The "copy" case would be appropriate if you need the input stream as it comes, without the delay while the bytes work their way through FIFO.

Ted.

------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@wlandres.net Date: 07-Aug-2012 Time: 12:28:08 This message was sent by XFMail -------------------------------------------------

Mark Rogers

1:25 p.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 07/08/12 12:28, (Ted Harding) wrote:

...

Well, that certainly simplifies matters! However, on the general question (and assuming that you're satisfied with just picking up the first match, i.e. knowing that there's at least one match), then the only fool-proof general method I can think of is the following. It presumes that you have enough RAM for two copies of the target string, plus whatever is needed for the program etc.

Obviously the target has to be stored somewhere, but this needn't be in RAM (I'm assuming that there's "plenty" of ROM space from the way Chris described the problem).

If so, you start with an integer index into the target set to zero, and when you find a character in the stream that matches the character pointed to by the index you increment the index. When you get to the end of the target string you have you match.

The trick is how to replay things when you get a miss midway into the target, but that's not actually that hard; you loop through the data from the second character in the target to the current (failed) position replaying them as if they came from the stream, retesting them as you go.

From a quick play the hardest bit seems to be coming up with test cases that pick up the edge cases. I can't find any particularly "difficult" examples but also can't prove they aren't out there.

-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

Ted.Harding＠wlandres.net

1:51 p.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 07-Aug-2012 12:25:02 Mark Rogers wrote:

...

On 07/08/12 12:28, (Ted Harding) wrote:

...
Well, that certainly simplifies matters! However, on the general question (and assuming that you're satisfied with just picking up the first match, i.e. knowing that there's at least one match), then the only fool-proof general method I can think of is the following. It presumes that you have enough RAM for two copies of the target string, plus whatever is needed for the program etc.

Obviously the target has to be stored somewhere, but this needn't be in RAM (I'm assuming that there's "plenty" of ROM space from the way Chris described the problem).

If so, you start with an integer index into the target set to zero, and when you find a character in the stream that matches the character pointed to by the index you increment the index. When you get to the end of the target string you have you match.

The trick is how to replay things when you get a miss midway into the target, but that's not actually that hard; you loop through the data from the second character in the target to the current (failed) position replaying them as if they came from the stream, retesting them as you go.

From a quick play the hardest bit seems to be coming up with test cases that pick up the edge cases. I can't find any particularly "difficult" examples but also can't prove they aren't out there.

-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

The sort of "edge case" I was thinking of when deciding against the "obvious" method (based on your moving pointer algorithm) is:

Target: ABCDABCDEFH

Input stream: ... ABCDABCDABCDEFHXYZ... (so that the target is at ABCD|AABCDABCDEFH|XYZ...

It will happily get to the "^" point below:

ABCDABCD^ABCDEFHXYZ...

because the target is matched up to that point. But the next byte is a mismatch, so then it should move back to byte 5 of the above input stream and continue matching from that point (using the accumulated matching bytes from the previous input bytes) so that it knows it has matched

ABCDABCD

and only needs to watch out for EFHXYZ (which is about to arrive).

This is already beginning to get complicated, and one can think of examples where you may need a "stack" of pending partial matches. At this point, you are beginning to need more RAM, and a more complicated algorithm, which is why I thought that the FIFO + MATCH procedure that I suggested would be (a) fool-proof, (b) reasonably economical.

Ted.

------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@wlandres.net Date: 07-Aug-2012 Time: 13:51:00 This message was sent by XFMail -------------------------------------------------

Mark Rogers

8 Aug 8 Aug

10:21 a.m.

New subject: Wanted - algorithm for bufferless search for a pattern in a stream

On 07/08/12 13:51, (Ted Harding) wrote:

...

The sort of "edge case" I was thinking of when deciding against the "obvious" method (based on your moving pointer algorithm) is:

Target: ABCDABCDEFH

Input stream: ... ABCDABCDABCDEFHXYZ... (so that the target is at ABCD|AABCDABCDEFH|XYZ...

It will happily get to the "^" point below:

ABCDABCD^ABCDEFHXYZ...

because the target is matched up to that point. But the next byte is a mismatch, so then it should move back to byte 5 of the above input stream and continue matching from that point (using the accumulated matching bytes from the previous input bytes) so that it knows it has matched

ABCDABCD

and only needs to watch out for EFHXYZ (which is about to arrive).

My "theory" is that it needs to track where it got to in the target, and replay the characters from the start (skipping the first one) as if they came from the input stream. So, from where you left off: Stream: ABCDABCD^ABCDEFHXYZ... Target: ABCDABCD^EFH

Replay the target starting after the first character: Stream: ABCDABCD^ABCDEFHXYZ... Target (replay): A[BCDABCD]EFH Target: ^ABCDABCDEFH (ie target resets to start but keep track of start "[" and end "]" of the section of the buffer being replayed). The first few chars won't match but eventually: Stream: ABCDABCD^ABCDEFHXYZ... Target (replay): ABCDABCD[]EFH Target: ABCD^ABCDEFH .. and you've exhausted the replay "buffer" (although note that the buffer is only actually two integers marking start/end of the section being replayed), and continue with the stream: Stream: ABCDABCDABCDEFH^XYZ... Target: ABCDABCDEFH^

...

This is already beginning to get complicated, and one can think of examples where you may need a "stack" of pending partial matches.

Ah, but that's my point: I can't shake the suspicion that there might be cases where I'd effectively need to keep multiple replay "buffers" and if that number is arbitrary then this isn't a solution. But (especially after running through it above) I can't come up with an example that wouldn't work with the single start/end pair as above. Even if you want to pick up multiple matches which might overlap, eg ABCDABCD twice in ABCDABCDABCD, the above should still work; after the "hit" you still roll back: Stream: ABCDABCD^ABCD Target: ABCDABCD^ => HIT (1) => Stream: ABCDABCD^ABCD Target (replay): A[BCDABCD] Target: ^ABCDABCD => Stream: ABCDABCD^ABCD Target (replay): ABCDABCD[] Target: ABCD^ABCD => Stream: ABCDABCDABCD^ Target: ABCDABCD^ => HIT (2)

...

At this point, you are beginning to need more RAM, and a more complicated algorithm, which is why I thought that the FIFO + MATCH procedure that I suggested would be (a) fool-proof, (b) reasonably economical.

Your method would actually only need one (RAM) buffer of the size of the target string (the other would be in ROM) but if you're trying to match a 2K block of text in a 2GB stream you're still going to run out of RAM. I can't convince myself that my above algorithm would actually work in all cases though. In fact I *think* that my method is essentally the same as yours, but where yours stores N chars of received data in a buffer, mine stores it as references to the target string, and makes the (I think safe) assumption that if the first N chars of your buffer don't match the first N chars of the target then they can be discarded.

(I'm enjoying the exercise regardless!)

-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER

4902

Age (days ago)

4904

Last active (days ago)

main@lists.alug.org.uk

8 comments

3 participants

tags (0)

participants (3)

Chris Green
Mark Rogers
Ted.Harding＠wlandres.net