std::equal_range
Today we are talking about std::equal_range. Take a look at the signatures here. This is a handy little algorithm that returns, as a pair of forward iterators, the subrange in which every item matches the given value (where what "matches" means is defined by the Compare). We will look at a practical use in this post, one that relates to a previous post on a piece of software called Contex.
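To make the semantics concrete, here is a minimal sketch (not from the original post) of std::equal_range on a sorted vector of ints; the returned pair brackets the run of matching elements and is equivalent to calling std::lower_bound and std::upper_bound for the same value.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // equal_range requires the range to be sorted (or at least partitioned)
    // with respect to the comparison being used.
    std::vector<int> v{1, 2, 2, 2, 3, 5, 8};

    // p.first points at the first 2, p.second one past the last 2.
    auto p = std::equal_range(v.begin(), v.end(), 2);

    std::cout << "count of 2s:  " << (p.second - p.first) << '\n';  // 3
    std::cout << "first index:  " << (p.first - v.begin()) << '\n'; // 1

    // There is also an overload taking a comparator, e.g.
    // std::equal_range(v.begin(), v.end(), 2, std::greater<int>())
    // for a range sorted in descending order.
    return 0;
}
```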
Suppose we have a vector of strings and a vector of unique queries, and we want to count how many times each query appears in the vector of strings. The obvious way to do it, comparing every query against every string, takes $O(ns)$, where $n$ is the length of the string vector and $s$ is the number of queries (a sketch of this baseline is given further below). This is not very good. There are (at least) two better ideas:
- We can attempt to sweep the strings "at the same time" as the queries.
- We can binary search a sorted version of the strings, once per query, for the beginning and end of the range of strings that match that query.
Both of these involve sorting the vectors, so we pay $O(n \log n)$ (plus $O(s \log s)$ for the queries) off the bat. The first solution then incurs an additional $O(n)$ cost when all (or most) of the queries appear in the vector, since the sweep only moves forward. If many of the queries are not in the vector, however, we suffer greatly: each miss forces a rescan from where the sweep left off, and we fall back towards the $O(sn)$ cost.
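For reference, the obvious $O(ns)$ baseline mentioned above might look something like the following; this sketch (including the name matchingStringsNaive) is not from the original post, it simply calls std::count once per query.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Naive baseline: one linear scan of `strings` per query, O(n*s) overall.
std::unordered_map<std::string, int> matchingStringsNaive(
    const std::vector<std::string>& strings,
    const std::vector<std::string>& queries) {
    std::unordered_map<std::string, int> counts;
    for (const auto& q : queries) {
        counts[q] = static_cast<int>(std::count(strings.begin(), strings.end(), q));
    }
    return counts;
}
```

The first of the two better ideas, sweeping the two sorted vectors together, is what the code below implements.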
```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, int> matchingStrings(std::vector<std::string>& strings,
                                                     std::vector<std::string>& queries) {
    std::sort(strings.begin(), strings.end());
    std::sort(queries.begin(), queries.end());
    std::unordered_map<std::string, int> counts;
    auto s = strings.begin();
    bool notfound;
    int count = 0;
    for (const auto& q : queries) {
        count = 0;
        auto temp = s;       // remember where this query's scan started
        notfound = false;
        // Advance until we hit the query or run off the end (assumes strings is non-empty).
        while (q != *s) {
            if (s != strings.end() - 1) {
                s++;
            } else {
                counts[q] = count;  // query not present: count stays 0
                notfound = true;
                s = temp;           // reset the sweep for the next query
                break;
            }
        }
        if (notfound) continue;
        // Count the run of matching strings.
        while (q == *s) {
            count++;
            if (s != (strings.end() - 1)) {
                s++;
            } else {
                break;
            }
        }
        counts[q] = count;
    }
    return counts;
}
```
If we use the other method, the one based on binary search, we can be more robust in our complexity estimates. The following code has a complexity bound of $O((n+s)\log n)$: sorting costs $O(n \log n)$ (plus $O(s \log s)$ for the queries), and each of the $s$ queries then performs two binary searches at $O(\log n)$ apiece. This holds irrespective of whether we find the query or not; obviously things get a little better when we actually do find it.
```cpp
std::unordered_map<std::string, int> matchingStrings(std::vector<std::string>& strings,
                                                     std::vector<std::string>& queries) {
    std::sort(strings.begin(), strings.end());
    std::sort(queries.begin(), queries.end());
    std::unordered_map<std::string, int> counts;
    auto start = strings.begin();
    for (auto const& q : queries) {
        // First element not less than q, and first element greater than q.
        auto first = std::lower_bound(start, strings.end(), q);
        auto second = std::upper_bound(start, strings.end(), q);
        counts[q] = second - first;   // size of the matching range
        start = second;               // never look before the previous range again
    }
    return counts;
}
```
std::lower_bound and std::upper_bound each use a binary search, incurring a logarithmic cost. Needless to say, the code is much cleaner and more readable. Note that we update the starting point of the binary searches: because both vectors are sorted, looking before the end of the previous query's range would be wasteful. We can go one step further with the STL using std::equal_range...
```cpp
std::unordered_map<std::string, int> matchingStrings(std::vector<std::string>& strings,
                                                     std::vector<std::string>& queries) {
    std::sort(strings.begin(), strings.end());
    std::sort(queries.begin(), queries.end());
    std::unordered_map<std::string, int> counts;
    auto start = strings.begin();
    for (auto const& q : queries) {
        auto p = std::equal_range(start, strings.end(), q);
        counts[q] = p.second - p.first;
        start = p.second;
    }
    return counts;
}
```
So much cleaner than the original code! The complexity of both methods is dominated by the sorting, so they are essentially equivalent for sufficiently large inputs, although the first method can really struggle if a large proportion of the queries are not present in the vector of strings (especially if the missing queries sort near the front alphabetically, since each failed sweep then scans most of the vector).
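As a quick sanity check, any of the versions above can be exercised with a small driver like this one. The sample data and expected counts are my own, not from the original post; note that every version sorts its arguments in place, so pass copies if you need the original order.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Assumes one of the matchingStrings definitions above is in scope.
int main() {
    std::vector<std::string> strings{"ab", "ab", "abc"};
    std::vector<std::string> queries{"ab", "abc", "bc"};

    auto counts = matchingStrings(strings, queries);

    for (const auto& q : std::vector<std::string>{"ab", "abc", "bc"}) {
        std::cout << q << ": " << counts[q] << '\n';  // expect 2, 1, 0
    }
    return 0;
}
```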