This isn't really a Linux question, but members of this list might have some ideas or answers.
I have a little 68000 assembler project I'm doing for someone (paid!); basically it consists of replacing a 'built on the fly' look-up table with a ready-made one which is uploaded with the code.
The existing code uses a rather messy hash look-up and one of the reasons for the re-write is to make it simpler and more reliable.
A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
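For concreteness, a minimal C sketch of such a binary-chop lookup over a table sorted by ascending IP address might look like the following; the entry layout, function name and 0xFFFF not-found sentinel are illustrative assumptions rather than anything from the actual project:

#include <stdint.h>

struct map_entry {
    uint32_t ip;    /* key: 32-bit IP address */
    uint16_t atm;   /* value: 16-bit ATM address */
};

/* table[] must be sorted by ascending ip; n is the entry count (up to ~400).
 * Returns the ATM address, or 0xFFFF (assumed unused) if ip isn't present. */
uint16_t ip_to_atm(const struct map_entry *table, unsigned n, uint32_t ip)
{
    unsigned lo = 0, hi = n;                 /* half-open search range [lo, hi) */

    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;   /* rounds down; odd/even table size doesn't matter */
        if (table[mid].ip == ip)
            return table[mid].atm;
        else if (table[mid].ip < ip)
            lo = mid + 1;                    /* target, if present, is above mid */
        else
            hi = mid;                        /* target, if present, is below mid */
    }
    return 0xFFFF;                           /* not found */
}

With 400 entries that is at most nine comparisons per lookup, and keeping the search range half-open removes most of the boundary-condition fiddliness mentioned above.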
On 01-Mar-04 Chris Green wrote:
I have a little 68000 assembler project I'm doing for someone (paid!); basically it consists of replacing a 'built on the fly' look-up table with a ready-made one which is uploaded with the code. [...] A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
As a general rule, btree indexing gives efficient look-up; but your description suggests this may be "messy" to set up.
I have another idea, which I need to look into the properties of before deciding whether it's a good idea or a bad one for your application.
Basically, it involves setting up a "tree of lists", and it makes your database "content-addressable".
Starting from a 4-byte IP address, the top level (0) consists of pointers to 256 different level-1 nodes; which node you are led to is determined by the value of the first byte in the IP address.
Each level-1 node contains either a) a null pointer, denoting that there was no IP address in the data whose first byte had the corresponding value, or b) a set of 256 pointers to level-2 nodes, just like level 0.
And so on down the line until you get to level 4. Each level-4 node contains a pointer to a data structure containing the information to be looked up if you reach that node (in your case, the ATM address).
Searching such a tree would be very quick: given an IP address B1.B2.B3.B4, you go to node B1 in level-1, then node B2 in level-2, then node B3 in level-3, and finally node B4 in level-4 where you then find the information you are looking for. Lookup time is independent of the data.
The issue I need to look into is how much space would be required to store this tree (and also how efficient and/or messy it would be to implement it). Given that the number of actual entries you need to store is about 400, you should not need anything like the superficially apparent 2^32 nodes in total. Only the 400 would lead all the way down to level-4, and all the others would run into a dead end at a higher level. Probably you only need a thousand or two nodes in all to store the tree for 400 level-4 nodes.
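In C, a rough sketch of such a lookup might be as follows; the node layout and names are illustrative rather than a definitive implementation (in particular, the bottom-level slots here point straight at the leaf data rather than at a separate level-4 node):

#include <stdint.h>
#include <stddef.h>

/* Each interior node is 256 slots indexed by one byte of the address.
 * Slots in level-0..2 nodes point at the next node down; slots in
 * level-3 nodes point at the leaf data (the 16-bit ATM address). */
struct trie_node {
    void *child[256];
};

struct leaf {
    uint16_t atm;
};

/* Returns the leaf for ip, or NULL if there is no entry.  Exactly four
 * indexed loads and four null tests, whatever the table contains. */
const struct leaf *trie_lookup(const struct trie_node *root, uint32_t ip)
{
    const void *p = root;
    for (int shift = 24; shift >= 0; shift -= 8) {
        unsigned byte = (ip >> shift) & 0xFF;
        p = ((const struct trie_node *)p)->child[byte];
        if (p == NULL)
            return NULL;     /* dead end: this address isn't in the table */
    }
    return (const struct leaf *)p;
}

Each 256-slot node costs 256 * 4 = 1,024 bytes with 32-bit pointers, so the total memory hinges on how many interior nodes those 400 addresses actually generate.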
As I say, I'll try to look into what this leads to for 400. But you may be interested in the idea.
Best wishes, Ted.
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 167 1972
Date: 01-Mar-04  Time: 12:14:41
On Mon, Mar 01, 2004 at 12:14:41PM -0000, Ted Harding wrote:
On 01-Mar-04 Chris Green wrote:
I have a little 68000 assembler project I'm doing for someone (paid!); basically it consists of replacing a 'built on the fly' look-up table with a ready-made one which is uploaded with the code. [...] A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
As a general rule, btree indexing gives efficient look-up; but your description suggests this may be "messy" to set up.
I have another idea, which I need to look into the properties of before deciding whether it's a good idea or a bad one for your application.
That's really (reading below as well) a sort of hash table, isn't it, using the first byte of the IP address as the hash key.
[snip]
The issue I need to look into is how much space would be required to store this tree (and also how efficient and/or messy it would be to implement it). Given that the number of actual entries you need to store is about 400, you should not need anything like the superficially apparent 2^32 nodes in total. Only the 400 would lead all the way down to level-4, and all the others would run into a dead end at a higher level. Probably you only need a thousand or two nodes in all to store the tree for 400 level-4 nodes.
As I say, I'll try to look into what this leads to for 400. But you may be interested in the idea.
It's certainly a possibility.
Isn't the worst case where every possible first-byte value is present? That gives 256 top-level (first-byte) pointers (not really possible, as some first-byte values aren't allowed), and then 256 * 256 * 4 = 262,144 bytes of pointers at the second level alone. Even that sounds a bit big without going any further. It would also need code to generate it, as one certainly isn't going to upload it in that format.
The advantage of a binary chop is that it is quite possible to generate the table in assembler; it's just a matter of finding some way to sort it, preferably pre-assembler.
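One way to do the pre-assembler sort is a small host-side generator that sorts the entries and emits them as an already-sorted assembler table. The sketch below is a hedged illustration in C: the sample entries, label names and dc.l/dc.w output format are assumptions, not the project's real source format:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct map_entry { uint32_t ip; uint16_t atm; };

/* Comparator for qsort(): ascending IP address, written to avoid overflow. */
static int by_ip(const void *a, const void *b)
{
    uint32_t x = ((const struct map_entry *)a)->ip;
    uint32_t y = ((const struct map_entry *)b)->ip;
    return (x > y) - (x < y);
}

int main(void)
{
    /* In reality these would be read from the non-assembler source file. */
    struct map_entry table[] = {
        { 0xC0A80001u, 0x0101 },   /* 192.168.0.1 -> example ATM id */
        { 0x0A000001u, 0x0042 },   /* 10.0.0.1    -> example ATM id */
    };
    size_t n = sizeof table / sizeof table[0];

    qsort(table, n, sizeof table[0], by_ip);

    /* Emit Motorola-style directives: a long word for the IP, a word for
     * the ATM address, plus an entry count for the search code to use. */
    printf("ip_atm_table:\n");
    for (size_t i = 0; i < n; i++)
        printf("    dc.l    $%08lX\n    dc.w    $%04X\n",
               (unsigned long)table[i].ip, (unsigned)table[i].atm);
    printf("ip_atm_count:\n    dc.w    %u\n", (unsigned)n);
    return 0;
}

Generating the table this way keeps the sortedness guarantee in the build step rather than relying on whoever maintains the source data.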
On 2004.03.01 12:46, Chris Green wrote:
The advantage of a binary chop is that it is quite possible to generate the table in assembler, it's just a matter of finding some way to sort it, preferably pre-assembler.
If you are concerned about a long debug loop perfecting an algorithm implemented in assembler, why not first implement it in C so you know you've got it right and then translate it?
You could compile and debug it on any platform with a decent C compiler and then, when you're ready to go to assembler, you could even use a C cross-compiler to generate an initial implementation in assembler. GCC can compile for the 68000 and has a -S option to generate an assembler file rather than an object file. After that you could go over the assembler by hand to check it is well optimised and remove any dependencies on compiler-supplied libraries, and you're there.
Steve.
Chris Green chris@areti.co.uk writes:
The existing code uses a rather messy hash look-up and one of the reasons for the re-write is to make it simpler and more reliable.
A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
Do you have enough control over the build process to mechanically ensure the table is sorted? Failing that is there any reason you can't sort the table at startup?
What do you think are the 'messy boundary conditions'?
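If sorting at startup were the easier route, the sort itself is small enough to live in the firmware. A minimal, dependency-free sketch in C follows (the map_entry layout matches the earlier sketches and is an assumption; insertion sort is used only because ~400 entries sorted once makes efficiency irrelevant):

#include <stdint.h>

struct map_entry { uint32_t ip; uint16_t atm; };

/* One-off startup sort: O(n^2) but trivial to get right, no library
 * dependencies, and fast enough for ~400 entries sorted exactly once. */
void sort_table(struct map_entry *table, unsigned n)
{
    for (unsigned i = 1; i < n; i++) {
        struct map_entry key = table[i];
        unsigned j = i;
        while (j > 0 && table[j - 1].ip > key.ip) {
            table[j] = table[j - 1];   /* shift larger entries up one slot */
            j--;
        }
        table[j] = key;
    }
}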
On Mon, Mar 01, 2004 at 06:11:20PM +0000, Richard Kettlewell wrote:
Chris Green chris@areti.co.uk writes:
The existing code uses a rather messy hash look-up and one of the reasons for the re-write is to make it simpler and more reliable.
A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
Do you have enough control over the build process to mechanically ensure the table is sorted? Failing that is there any reason you can't sort the table at startup?
What do you think are the 'messy boundary conditions'?
I think I can ensure the table is sorted; the code will probably be generated from a non-assembler source in some sort of format, so that processing can sort it as well.
The 'messy boundary conditions' are things like deciding what to do when finding the middle of a table with an odd number of entries, etc.
On Monday 01 March 2004 19:04, Chris Green wrote:
On Mon, Mar 01, 2004 at 06:11:20PM +0000, Richard Kettlewell wrote:
Chris Green chris@areti.co.uk writes:
The existing code uses a rather messy hash look-up and one of the reasons for the re-write is to make it simpler and more reliable.
A linear search of the table probably isn't a good idea, as it would mean significant differences in look-up time between entries at the beginning and end of the table.
I'm leaning towards a binary-chop search but wondered if anyone had any other ideas. Binary-chop is easy in principle but often quite difficult to implement in practice, because one has to deal with messy boundary conditions. It's also a bit more difficult to create the necessary sorted table automatically, and relying on the creator of the table to sort it might be a bit dodgy.
The table looks up ATM addresses (16-bit values) given IP addresses (32-bit values); the maximum size is of the order of 400 entries.
I'm a little puzzled about the hash being 'messy'. Before throwing out the bathwater, make sure the baby isn't in it. There are several ways of implementing hash tables, some of which tend towards memory efficiency and others towards flat-out speed. You don't say how performance-critical your application is, so maybe it's simply using an inappropriate technique.
The approach I favour - and have used on 8-bit machines in assembler - is to use a small fixed-size table of pointers (typically 64) and a simple, arbitrary hashing algorithm that distributes keys in a near-random manner: for example, adding up all the characters of the key, multiplying by some prime to spread out the results, then taking the result modulo the table size to index into the table. True randomness is achieved only at a significant cost and mostly ain't worth it.
The table contains pointers to linked lists of entries (held outside the table) all having the same hashcode. The first few key hashes will find empty cells in the table, but after a while these get used up. So you just set up a link from the previous entry to the next, with a null pointer to signify the end of the chain. With a 64-entry table the performance will be getting on for 64 times that of a linear search; quite adequate for moderate amounts of data. I used it for a symbol table in assemblers and compilers and it was never the limiting factor on performance, often with far more than 400 symbols. Because entries never get removed from a compiler symbol table they can all go into a linear memory area, with no need to worry about efficient memory reclaim.
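A minimal C sketch of that scheme - a 64-slot table of chain heads, with the sum-and-multiply hash adapted to 4-byte IP keys - might look like the following; the names, the choice of prime and the flat node pool are illustrative assumptions:

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE  64     /* small fixed-size table of chain-head pointers */
#define MAX_ENTRIES 400

struct hash_node {
    uint32_t ip;
    uint16_t atm;
    struct hash_node *next;    /* next entry with the same hash, or NULL */
};

static struct hash_node *buckets[TABLE_SIZE];
static struct hash_node pool[MAX_ENTRIES];   /* entries are never removed, so a flat pool suffices */
static unsigned pool_used;

/* Add up the key's bytes, multiply by a prime to spread the result,
 * then reduce modulo the table size. */
static unsigned hash_ip(uint32_t ip)
{
    unsigned sum = (ip >> 24) + ((ip >> 16) & 0xFF) + ((ip >> 8) & 0xFF) + (ip & 0xFF);
    return (sum * 31u) % TABLE_SIZE;
}

int insert(uint32_t ip, uint16_t atm)
{
    if (pool_used >= MAX_ENTRIES)
        return -1;                        /* pool exhausted */
    struct hash_node *n = &pool[pool_used++];
    unsigned h = hash_ip(ip);
    n->ip = ip;
    n->atm = atm;
    n->next = buckets[h];                 /* push onto this slot's chain */
    buckets[h] = n;
    return 0;
}

/* Returns the matching node, or NULL if the address is absent. */
const struct hash_node *lookup(uint32_t ip)
{
    const struct hash_node *n;
    for (n = buckets[hash_ip(ip)]; n != NULL; n = n->next)
        if (n->ip == ip)
            return n;
    return NULL;
}

With 400 entries spread over 64 slots the chains average six or seven entries each, which is where the 'getting on for 64 times a linear search' figure comes from.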
-- GT
Chris Green chris@areti.co.uk writes:
I think I can ensure the table is sorted; the code will probably be generated from a non-assembler source in some sort of format, so that processing can sort it as well.
You could generate a perfect hash function for the data using gperf or similar.
The 'messy boundary conditions' are things like deciding what to do when finding the middle of a table with an odd number of entries, etc.
I always round down since that's the most convenient to write.