Fact checking from multiple sources has been studied from many angles, and the complexity and diversity of the problem call for a wide range of methods and techniques [?]. Fact-checking tasks are not easy to perform and, most importantly, it is not clear what kind of computations they involve. Fact checking usually involves a large number of data sources that describe the same things, but we do not know which source holds the correct information, or which holds any information at all about the query of interest. A join among all or some of the data sources can guide the fact-checking process. However, when this join is performed in a distributed computational environment such as MapReduce, it is not obvious how to distribute the records of the data sources efficiently to the reduce tasks so that any subset of the sources can be joined in a single MapReduce job. In this paper, we show that the nature of such sources (they describe similar things) offers this opportunity, i.e., the records can be distributed with low replication. We also show that the multiway algorithm in [Afrati et al.] can be implemented efficiently in MapReduce when the relations in the join have large overlaps in their schemas (i.e., they share a large number of attributes).
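The link between schema overlap and low replication can be illustrated with a small sketch. It assumes a generic hash-based scheme for multiway joins in which each join attribute is given a number of buckets (a "share") and the reducers form a grid over those buckets. This is not necessarily the exact algorithm of [Afrati et al.]. The names `reducers_for`, `replication_factor`, and `shares`, as well as the example attributes, are hypothetical.

```python
import zlib
from itertools import product

def bucket(value, k):
    # Deterministic hash into k buckets (Python's built-in hash() is salted for strings).
    return zlib.crc32(repr(value).encode()) % k

def reducers_for(record, shares):
    """Return the reducer coordinates that must receive `record`.

    record: dict mapping attribute name -> value (the record's own schema only)
    shares: dict mapping every join attribute -> number of buckets
    """
    axes = []
    for attr, k in shares.items():
        if attr in record:
            # Attribute present: its hash fixes this coordinate.
            axes.append([bucket(record[attr], k)])
        else:
            # Attribute absent: replicate across all buckets of this attribute.
            axes.append(range(k))
    return list(product(*axes))

def replication_factor(schema, shares):
    # Number of copies of each record: product of the shares of missing attributes.
    factor = 1
    for attr, k in shares.items():
        if attr not in schema:
            factor *= k
    return factor

# Hypothetical setting: four join attributes, two buckets each (16 reducers).
shares = {"A": 2, "B": 2, "C": 2, "D": 2}
r = {"A": 1, "B": 2, "C": 3}   # source missing only D
s = {"B": 2, "C": 3, "D": 4}   # source missing only A
common = set(reducers_for(r, shares)) & set(reducers_for(s, shares))
```

Under these assumptions, a source that has three of the four attributes sends each record to 2 reducers, whereas a source with only one attribute would need 8 copies per record. Records that agree on their shared attributes still meet at a common reducer, so the join can be computed locally there.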
Bibtex: Afrati et al. (2015)
Foto Afrati, Zaid Momani, and Nikos Stasinopoulos. Cross-checking data sources in MapReduce. In New Trends in Databases and Information Systems, pages 165–174. Springer, 2015.