contex - obtaining the context in which words appear
Recently I have been studying the standard threading library and thinking about things that I can "multi-thread", and this project seemed to fit the bill. The idea is to create a piece of software that can look at a collection of words and then print the context in which they appear in a text, under certain constraints. This collection of words is given as a set of pairs $\{W, L\}$, and each member of this set we will call a query: $W$ is the word of interest and $L$ is the number of words after $W$ that we want to look at. We also specify $k$, which is the number of words that we want to look at before and after a query. The constraints are:
- We want $k$ words to the left of the query word $W$ and $k + L$ words to the right of $W$. We will refer to this sub-collection of words as a window (a worked example follows this list).
- We want the $L$ words after $W$, and $W$ itself, to be highlighted.
- If we encounter a full stop then we truncate the window to respect it.
- If we have overlapping windows then we merge them into a single window.
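For example (the numbers here are made up purely for illustration), with $k = 2$ and the query $\{\text{cats}, 1\}$: if "cats" appears at word index 10, the window spans indices $10 - 2 = 8$ to $10 + (2 + 1) = 13$, and "cats" plus the one word after it are highlighted. If a full stop sits at index 12, the window is truncated to end at index 11.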
As with many things in programming, there are loads of ways of doing the same thing. I will split this into a few steps:
- Read the queries into a std::unordered_map (hash table).
- Tokenise the corpus, putting the indices of the queries (if found) into a std::list (linked list).
- Form the windows, truncating if a full stop rears its head.
- Join the overlapping windows.
- Print the output.
The following code samples are taken from contex. Reading the queries into a hash table is simple. The reason we choose a hash table is that we need to perform many lookups: the amortised complexity is $O(1)$, as opposed to the $O(\log n)$ of a std::map (a balanced binary tree). Insertion is $O(1)$ on average too, so the following code is quite cheap, roughly $O(Q)$ for $Q$ queries.
void read_pairs(std::fstream& file, std::unordered_map<std::string, int>& query_map) {
    int i;
    std::string s;
    char a; // the separator character between the word and the count
    while (file >> s >> a >> i) {
        std::cout << s << " " << i << std::endl;
        query_map[s] = i; // map each query word W to its look-ahead length L
    }
}
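The on-disk query format is not shown in the post; given that read_pairs extracts a word, a single character, and an integer in that order, a file along these lines would parse (a guess on my part — note the whitespace around the separator, since with "cats, 3" the comma would be swallowed into the word):

cats , 3
why , 1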
Now we have the task of tokenising the input; this is achieved by the following code:
void tokenise(std::fstream& file, std::unordered_map<std::string, int>& query_map,
              std::vector<std::string>& split_string, std::vector<int>& full_stops,
              std::list<std::pair<int, int>>& l) {
    std::string line;
    int i = 0;
    while (file >> line) {
        auto full_stop_location = line.find_first_of(".");
        if (full_stop_location != std::string::npos) {
            // look at the substring to the left of the full stop
            auto split = line.substr(0, full_stop_location);
            auto split_without_punc = strip_punc(split);
            auto it = query_map.find(split_without_punc);
            if (it != query_map.end()) {
                l.push_back(std::make_pair(i, it->second)); // record (position, L)
            }
            split_string.push_back(split);
            // record the position of the full stop
            full_stops.push_back(++i);
            // look at the substring to the right of the full stop
            split = line.substr(full_stop_location);
            split_without_punc = strip_punc(split);
            it = query_map.find(split_without_punc);
            if (it != query_map.end()) {
                l.push_back(std::make_pair(i, it->second));
            }
            split_string.push_back(split);
        } else {
            auto line_without_punc = strip_punc(line);
            auto it = query_map.find(line_without_punc);
            if (it != query_map.end()) {
                l.push_back(std::make_pair(i, it->second));
            }
            split_string.push_back(line);
            if (line == ".") {
                full_stops.push_back(i);
            }
        }
        i++;
    }
}
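The helper strip_punc is used above but is not part of the excerpt. A minimal sketch of what it presumably does, namely removing every punctuation character from a copy of the word, might look like this (an assumption on my part; the real implementation in contex may differ):

#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical reconstruction: drop every punctuation character from a word.
std::string strip_punc(const std::string& word) {
    std::string out;
    std::copy_if(word.begin(), word.end(), std::back_inserter(out),
                 [](unsigned char c) { return !std::ispunct(c); });
    return out;
}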
In the above code we check whether a full stop is present in the current string that has been read in. For instance, we might read in the string {"Hello".} and later in the text read in {Hello}, while our query involves {Hello}; we obviously want to flag both instances. Our code achieves this by stripping each candidate of all punctuation before the lookup. The repeated searching is quite cheap because we are using a std::unordered_map. The above code is $O(N \cdot M)$, where $N$ is the number of words in the corpus and $M$ is the length of the longest word.
Note that, because we push indices onto the query list l in increasing order during a single pass, the list is sorted by design! This is essential for the overlap-merging routine which appears later in the code.
Once we have tokenised and obtained the locations of the queries, we have to form the windows, respecting the locations of the full stops obtained in the above code.
for (auto& p : l) {
    std::cout << p.first << " " << p.second << std::endl;
    // k = 5 is hard-coded here: the window spans [pos - 5, pos + L + 4]
    window temp = {p.first - 5, p.first, p.first + p.second + 4};
    // we have gone too far ->> reset
    if (full_stop > temp[0]) {
        id_fs = log_id;
        full_stop = full_stops[id_fs];
    }
    // if the full stop is before the window we don't care
    log_id = id_fs;
    while (full_stop < temp[0]) {
        full_stop = full_stops[++id_fs];
    }
    // is this full stop inside the window?
    int temp_id = id_fs;
    while (full_stop <= temp[2] && full_stop >= temp[0]) {
        if (full_stop < temp[1]) {
            temp[0] = full_stop + 1; // full stop before the query word: trim the left edge
        } else {
            temp[2] = full_stop - 1; // full stop after the query word: trim the right edge
        }
        full_stop = full_stops[++id_fs];
    }
    id_fs = temp_id;
    windows.push_back(temp);
}
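The window type and the sweep state are declared outside this excerpt; judging from how they are used, declarations along these lines are assumed (a hypothetical reconstruction, not the actual contex source):

#include <array>
#include <vector>

using window = std::array<int, 3>;  // {window start, query word position, window end}

std::vector<window> windows;        // filled by the loop above
int id_fs = 0;                      // current index into full_stops
int log_id = 0;                     // index to rewind to when we overshoot
int full_stop = full_stops[0];      // position of the current full stop
                                    // (assumes the corpus contains at least one full stop)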
This is linear in the number of FOUND query occurrences (this list is very likely longer than the list of queries). We keep track of the full stop position using id_fs and log_id. This allows us to sweep through the full stop list at the same time as we sweep through the query list l.
Then we must find and merge the overlapping windows; this is fairly simple to understand.
bool overlap(window& w1, window& w2) {
    if (w1[0] >= w2[0] && w1[0] <= w2[2]) return true;
    if (w2[0] >= w1[0] && w2[0] <= w1[2]) return true;
    return false;
}

std::vector<window> merge_sorted_winds(std::vector<window>& ws) {
    std::vector<window> outs;
    if (!ws.size()) return outs;
    window out = ws[0];
    for (std::size_t i = 1; i < ws.size(); i++) {
        if (overlap(out, ws[i])) {
            // extend the merged window; std::max guards against a window
            // that is entirely contained within the current one
            out[2] = std::max(out[2], ws[i][2]);
        } else {
            outs.push_back(out);
            out = ws[i];
        }
    }
    outs.push_back(out); // don't forget the final window (also covers ws.size() == 1)
    for (auto& o : outs) {
        std::cout << o[0] << " " << o[2] << std::endl;
    }
    return outs;
}
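As a quick sanity check of the merge (using the window reconstruction above, with purely illustrative values):

std::vector<window> ws = {{0, 2, 5}, {4, 6, 9}, {20, 22, 25}};
auto merged = merge_sorted_winds(ws);
// The first two windows overlap (4 <= 5), so merged holds
// {0, 2, 9} and {20, 22, 25}.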
And finally we output the result:
std::list<std::pair<int, int>>::iterator it = l.begin();
for (auto& o : outs) {
    if (o != *outs.begin()) std::cout << " ";
    bool open = false;
    for (int i = o[0]; i <= o[2]; i++) {
        if (i == it->first) {
            std::cout << "\033[32m"; // open the highlight at the query word
            open = true;
        }
        std::cout << split_string[i];
        if (i == it->first + it->second - 1 || (i == o[2] && open)) {
            std::cout << "\033[0m"; // close the highlight after the L words (or at the window edge)
            it++;
            open = false;
        }
        std::cout << " ";
    }
    // if the window was not cut short by a full stop then words have been omitted
    if (!std::binary_search(full_stops.begin(), full_stops.end(), o[2] + 1)) {
        std::cout << "...";
    } else {
        std::cout << "." << std::endl;
    }
}
The above code is quite important: it highlights the query words (the "\033[32m" escape renders green in most terminals), puts an ellipsis where a window simply ends, and puts a full stop where a window was broken by the presence of a full stop! ✋🛑 We iterate over the merged windows (of which there are no more than the total number of windows). Inside each of these loops we iterate over the range of the merged window and also carry out a binary search over the full stop positions, which is $O(\log F)$.
When this code is executed with the queries {"cats", 3} and {"why", 1} we obtain the following output.
This post is just a brief discussion, in a single-threaded setting, of the core routines surrounding this particular way of tokenising a corpus and obtaining contextual information given some queries. My next goal is to architect this into a usable piece of multithreaded software.
From the above, if we let $F$ be the number of full stops, $N$ the number of words, $M$ the length of the longest word, $Q$ the number of queries, $w$ the number of windows, $o$ the number of merged windows, and $l_N$ the length of the longest merged window, then $$O(Q) + O(N \cdot M) + O(w) + O(o \cdot l_N) + O(o \cdot \log F)$$
is the complexity of this processing pipeline.
