Handling skew is one of the major challenges in query processing. In distributed computa- tional environments such as MapReduce, uneven distribution of the data to the servers is not desired. The effect of skew in data is particularly evident in the communication cost in this context. In a MapRe- duce job the communication cost is the amount of data that is transferred from the mappers to the reducers. In this paper, we identify join attributes values that appear very frequently, Heavy Hitters, as a cause of increased communication cost. We introduce a novel technique for handling skew when we want to compute a multiway join in one MapReduce round with minimum communication cost. We distribute HH values optimally using an adaptation of the Shares  algorithm to achieve this minimum cost. This is an improvement over standard techniques (widely used so far) even in the simple case of a 2-way join. To this end, we introduce the notion of residual joins to help with the decomposition of joins. Thus, we provide the Extended Shares algorithm which handles Heavy Hitters for any multiway join with an optimal communication cost. Building upon the Extended Shares, we propose the Condensed Extended Shares where different Heavy Hitters are considered in combination. We can show that the condensed join can perform better than the extended join in some cases. Finally, we provide experimental evaluation of how the Shares algorithm can handle skew and Heavy Hitter values in various settings within the MapReduce framework. We show that mitigating the com- munication cost decisively leads to lower computation times in all cases.
Bibtex: Afrati et al. (2015)