att_abstract={{We are often thrilled by the abundance of information surrounding 
us and wish to integrate data from as many sources as possible.
However, understanding, analyzing, and using these data are often 
hard.  Too much data can introduce a huge integration cost, such 
as expenses for purchasing data and resources for integration and 
cleaning.  Furthermore, including low-quality data can even 
deteriorate the quality of integration results instead of bringing 
the desired quality gain. Thus, ``the more the better'' does not
always hold for data integration, and often ``less is more''. 

In this paper, we study how to select a subset of sources before 
integration such that we can balance the quality of integrated data 
and integration cost.  Inspired by the Marginalism principle in 
economic theory, we wish to integrate a new source only if its
marginal gain, often a function of improved integration quality,
is higher than the marginal cost, associated with data-purchase 
expense and integration resources.  As a first step towards this 
goal, we focus on data fusion tasks, where the goal is to resolve
conflicts from different sources.  We propose a randomized 
solution for selecting sources for fusion and show empirically its
effectiveness and scalability on both real-world data and synthetic 
data.}},
	att_authors={ds8961, xd0649, bs621s},
	att_copyright={{VLDB Foundation}},
	att_copyright_notice={{The definitive version was published in Very Large Databases, 2012. {{, 2013-08-26}}}},
	author={Divesh Srivastava and Xin Dong and Barna Saha},
	title={{Less is More: Selecting Sources Wisely for Integration}},