[Perl.sig] Perl-tips] Searching for items in a large list

Mon Jun 18 14:33:18 EST 2007

>From Jacinta Richardson:

==== Upcoming early bird dates ====

    Register by the below dates to get a FREE book, per person, per course.
    Courses with insufficient bookings by the early bird date may be
    cancelled.

            Melbourne
            ------
            http://perltraining.com.au/bookings/Melbourne.html

                    Programming Perl       Early Bird:  9th July

            Sydney
            --------
            http://perltraining.com.au/bookings/Sydney.html

                    Programming Perl       Early Bird:  30th July


==== Searching for items in a large list ====

    Often we need to search for one or more items in a larger list. This tip
    examines how we can do this and examines the efficiency of various
    techniques.


==   Searching for a single item   ==

    Let's pretend we have a list of usernames and we wish to find out
    whether a given name is in that list. A traditional approach using a
    foreach loop would look like this:

            my $found = 0;
            foreach my $username (@usernames) {
                    if($username eq $check_name) {
                            $found = 1;
                            last;
                    }
            }

    Those who have used Perl's ``grep'' function before would probably
    write:

            my $found = grep { $_ eq $check_name } @usernames;

    We could also use ``first'' from the ``Scalar::Util'' module (which
    comes standard with Perl), which allows us to find the first occurrence
    of the element in the list.

            use Scalar::Util qw(first);

            my $found = first { $_ eq $check_name } @usernames;

    Each of these are fine solutions, although the solution using
    ``foreach'' provides the best performance in most circumstances. It
    avoids searching the entire list (which ``grep'' does even in a scalar
    context), and the overhead of subroutine calls (which ``first'' uses for
    each comparison).


==   Searching for multiple items   ==

    Let's say we have a list of items and we want to test each one for
    existence in a much larger list. A simple solution may look like this:

            foreach my $wanted_book (@books_I_want) {
                    foreach my $library_book (@library_books) {
                            if($library_book eq $wanted_book) {
                                    # My book is available.  Order it.
                                    order_book($wanted_book);
                                    last;
                            }
                    }
            }

    An alternate solution using ``grep'' could look like this:

            foreach my $wanted_book (@books_I_want) {
                    if(grep { $_ eq $wanted_book} @library_books) {
                            # My book is available.  Order it.
                            order_book($wanted_book);
                            last;
                    }
            }

    If we have many library books (for example approximately 10,000) and a
    medium demand for book orders (approximately 150), one third of which do
    not exist in the library, then we find that ``foreach'' is moderately
    faster (26%) than ``grep'' which is moderately faster (26%) than
    ``first''. Each of these take between one third and half of a second to
    run on modern hardware.


==   Using a hash   ==

    If we're doing lots of searches, it becomes much faster to have our
    books in a hash, and use a hash-lookup:

            my $found = exists $library_index{$book_name};

    How much faster? About 100,000 times faster on average when compared to
    a linear search with foreach on a list with with 10,000 items in it. Of
    course, there is the additional overhead of building the hash in the
    first place. For two-character long keys, building a hash takes twice
    the time as building an array, and it's even longer for longer keys.
    It's just not worth the extra effort to build a hash if we're going to
    perform just a single search.

    However in our library example we're walking over our arrays multiple
    times, making this a prime candidate for using a hash to speed things
    up. We can use a *hash slice* to add our keys to the hash in a single
    step.

            my %library_search;
            @library_search{@library_books} = ();

            foreach my $book (@books_I_want) {
                    if(exists $library_search{$book}) {
                            # My book is available.  Order it.
                            order_book($book);
                    }
            }

    Or alternately, putting our wanted books into a hash (notice the change
    in the ``foreach'' loop's list).

            # Add library books into a hash
            my %books_I_want;
            @books_I_want{@books_I_want} = ();

            foreach my $library_book (@library_books) {
                    if(exists $books_I_want{$library_book}) {
                            # yay my book's available
                            order_book($library_book);

                            # We've ordered it, delete from hash
                            delete $books_I_want{$library_book};
                    }

                    # end if we have no more books we want
                    last unless keys %books_I_want;
            }

    Building the hash of all the library books (our first example) and then
    searching is much faster (1928%) than our simple linear searches with
    ``foreach'', while building the hash of our wanted books (our second
    example) is even faster (160%) than that.

    Both our examples requires that Perl walks through each of our arrays at
    least once, either to build the hash (using our hash slice), or to
    search the hash (using our foreach loop). Our second example is the
    faster of the two since we're building a smaller hash, which takes less
    time. We also have the potential to exit our main loop early if we find
    all the books we're looking for.

    However our second solution is only superior if we throw away our hashes
    at the end, and rebuild them whenever we need to order new books. If
    we're able to retain our hashes in a persistent process, then it's
    *much* faster to index all the library books once, and then loop over
    the smaller list each time:

            foreach my $wanted_book (@books_I_want) {
                    if(exists $prebuilt_library_books{$wanted_book}) {
                            # yay my book's available
                            order_book($wanted_book);
                    }
            }

    In fact, it's blindingly fast, running at more than 78 times the speed
    of the fastest code that builds the hash each time. Whenever you build a
    hash for the purposes of searching, it's often worthwhile to hang onto
    it as long as possible.


==   Benchmark and code   ==

    We used Perl's ``Benchmark'' module to obtain these results as follows:

                  Rate   first    grep foreach lb-hash  w-hash  pre-built
     first      2.03/s      --    -20%    -37%    -97%    -99%      -100%
     grep       2.55/s     26%      --    -21%    -96%    -98%      -100%
     foreach    3.21/s     58%     26%      --    -95%    -98%      -100%
     lb-hash    65.0/s   3103%   2451%   1928%      --    -62%      -100%
     w-hash      169/s   8238%   6540%   5178%    160%      --       -99%
     pre-built 13444/s 662497% 527577% 419353%  20588%   7846%         --

    (where lb-hash is building the hash from the library books and w-hash is
    building the hash from the wanted books). Note that these results will
    vary slightly from machine to machine and run to run.

    We can tell by the rate that we can run the solution using ``first''
    2.03 times in a second, while we can use the solution using ``foreach''
    3.21 times in a second. The results with less library books (about 5000)
    are much the same.

    ``Benchmark'' is a standard Perl module, and its documentation can be
    found on CPAN ( <http://search.cpan.org/perldoc?Benchmark> ) or can be
    obtained with ``perldoc Benchmark''.

    The original code used to generate these tests can be found at
    <http://perltraining.com.au/tips/listsearch.pl.html>. Note that there
    will be some variation between results from one invocation to the next.


==   Conclusion   ==

    There are reasons why you may have large lists instead of hashes. For
    example you may need the elements to be ordered, or an array may make
    more sense at other points of your program. However, if you are likely
    to be doing lots of random accesses into a list, including looking for
    items which do not exist, then it may be more efficient to create a
    hash. In our situation, even with the cost of creating a hash of all of
    the library books, we discovered a massive performance increase of over
    1900% than using a simple ``foreach'' loop. Using a better approach and
    creating a hash of our wanted books was an improvement of more than
    5100% over the simple ``foreach'' loop. If we can instead use a hash
    instead of a list our lookups are even faster.


==== Upcoming courses ====

    We're proud to be running the following courses over the next couple of
    months:

                Programming Perl
                        Melbourne:   7th - 10th August
                        Sydney:     27th - 30th August
                        Canberra:   20th - 23rd November

                Web Development with Perl
                        Canberra:   26th - 27th September

                Perl Security
                        Canberra:   28th September

    See http://perltraining.com.au/bookings/Melbourne.html
    http://perltraining.com.au/bookings/Sydney.html and
    http://perltraining.com.au/bookings/Canberra.html for more.


==== Corporate Courses ====

    http://perltraining.com.au/corporate.html

    Do you have a large group, or the need for training at a particular
    time? Perl Training Australia is happy to arrange a course in the time
    and place that best suits you. For more information read our page on
    Corporate Courses at http://perltraining.com.au/corporate.html or call
    us on +61-3-9354-6001.
_______________________________________________
This Perl tooltip and associated text is Copyright Perl Training Australia.
You may freely distribute this text so long as it is distributed in full
with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:
   Email: contact at perltraining.com.au
   Phone: 03 9354 6001

Perl-tips mailing list
To change your subscription details visit:
http://perltraining.com.au/cgi-bin/mailman/listinfo/perl-tips