Algorithms for computing similarity joins in MapReduce were offered in Afrati et al. (2012). Similarity joins ask to find input pairs that are within a certain distance d according to some distance measure. Here we explore the “anchor-points algorithm” of Afrati et al. (2012). We continue looking at Hamming distance, and show that the method of that paper can be improved; in particular, if we want to find strings within Hamming distance d, and anchor points are chosen so that every possible input is within Hamming distance k of some anchor point, then it is sufficient to send each input to all anchor points within distance (d/2) + k, rather than d+k as was suggested in the earlier paper. This improves on the communication cost of the MapReduce algorithm, i.e., reduces the amount of data transmitted among machines. Further, the same holds for edit distance, provided inputs all have the same length n and either the length of all anchor points is n − k or the length of all anchor points is n + k. We then explore the problem of finding small sets of anchor points for edit distance, which also provides an improvement on the communication cost. We give a close-to-optimal technique to extend anchor sets (called “covering codes”) from the k = 1 case to any k. We then give small covering codes that use either a single deletion or a single insertion, or – in one algorithm – two deletions. Discovering covering codes for edit distance is important in its own right, since very little work is known.
Bibtex: Afrati et al. (2014)
Foto N Afrati, Anish Das Sarma, David Menestrina, Aditya Parameswaran, and Jeffrey D Ullman. Fuzzy joins using mapreduce. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 498–509. IEEE, 2012. ↩ 1 2
Foto N. Afrati, Anish Das Sarma, Anand Rajaraman, Pokey Rule, Semih Salihoglu, and Jeffrey D. Ullman. Anchor-points algorithms for hamming and edit distances using mapreduce. In Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014., 4–14. 2014. URL: http://dx.doi.org/10.5441/002/icdt.2014.05, doi:10.5441/002/icdt.2014.05. ↩