\section{Introduction}

Entity resolution, also known as duplicate detection or entity reconciliation, is the task of identifying and merging records that refer to the same real-world entity \cite{Yalavarthi:2017:SYQ:3132847.3132876}. Although sophisticated machine-learning techniques have improved over the years, humans are still better than computers at recognizing different manifestations of the same entity. Variants of a name that are difficult for a computer to disambiguate may be obvious to a human with the relevant experience and knowledge. For example, given four records with the values (a) University of Stuttgart, (b) Uni Stuttgart Cafe, (c) Technical University Stuttgart, and (d) Universität Stuttgart, a human can readily identify that records (a), (c), and (d) refer to the same entity, whereas this is difficult for algorithmic methods. Similarly, in scenarios that require examining photographs or information specific to a region or country, human input is crucial for resolving entities. Figure~\ref{EntityResolutionProblem} shows two products that are marketed under different names in different countries.
\begin{figure}[H]
\centering%
\includegraphics[width=7pc,height=4pc]{figures/Figure1a}
\includegraphics[width=7pc,height=4pc]{figures/Figure1b}
\caption{Example of \textbf{entity resolution}.}
\label{EntityResolutionProblem}
\end{figure}
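To make the difficulty concrete, the following minimal sketch compares the four records above with a character-based similarity ratio. The metric (Python's standard-library \texttt{difflib} ratio) is our illustrative choice, not a method from the cited work; record (d) is written in ASCII form.

\begin{verbatim}
from difflib import SequenceMatcher
from itertools import combinations

records = ["University of Stuttgart",         # (a)
           "Uni Stuttgart Cafe",              # (b)
           "Technical University Stuttgart",  # (c)
           "Universitaet Stuttgart"]          # (d), ASCII form

def similarity(a, b):
    # Character-based similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Print all pairwise scores. The spurious "Cafe" record (b) scores
# not far below the weakest true match, so any fixed match threshold
# is brittle -- exactly the ambiguity a human resolves at a glance.
for a, b in combinations(records, 2):
    print(f"{similarity(a, b):.2f}  {a} | {b}")
\end{verbatim}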

With the advent of human-computation platforms such as Amazon Mechanical Turk (AMT) and CrowdFlower, it is now much easier to involve humans in the entity-resolution process. These platforms support the crowdsourced execution of so-called Human Intelligence Tasks (HITs), in which workers perform simple tasks requiring little domain knowledge and are paid per job. To get the most out of these platforms and to maximize the quality of entity resolution, it is important to minimize both the cost of crowdsourcing and the human errors that arise from individual biases, malicious behavior, task complexity and ambiguity, or simply a lack of domain expertise. The average error rate can be as high as 25\%, so optimizing crowdsourcing is crucial. The initial solution employs majority voting \cite{paper1}: the same question is posed to multiple participants and the majority answer is accepted. Some approaches assign a universal error rate to all crowdsourcing participants, while others ignore crowd errors entirely and treat quality assurance as orthogonal to the problem of crowdsourced ER. However, approaches that consider both problems, quality assurance and crowdsourced ER, together achieve better ER quality.
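As a back-of-the-envelope illustration of the majority-voting baseline and its cost, the sketch below computes the error probability of a five-worker majority. The 25\% per-worker error rate is the figure quoted above; worker independence is our simplifying assumption.

\begin{verbatim}
from math import comb

p = 0.25   # assumed per-worker error rate (see text)
n = 5      # number of workers asked per question

# The majority answer is wrong iff at least ceil(n/2) workers err,
# assuming workers err independently of one another.
majority_error = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                     for k in range(n // 2 + 1, n + 1))
print(f"{majority_error:.3f}")  # ~0.104
\end{verbatim}

Redundancy thus cuts the error rate from 25\% to roughly 10\%, but at five times the cost per question, which is why reducing the number of questions asked matters.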
A significant contribution \cite{paper3} follows this approach: potential crowd errors are taken into account in an end-to-end solution to the crowdsourcing-based ER problem.
The objective is to determine the next questions to ask for crowdsourced ER in the presence of crowd errors. Methods that consider only local, ad-hoc features when selecting the next questions, and that ignore the structure of the entire clustering, incur a higher crowdsourcing cost to achieve a satisfactory ER accuracy. These problems are tackled by the solution in \cite{paper1}, where a global metric, namely reliability, captures the strength of the entire clustering. The reliability-based next-crowdsourcing algorithm significantly reduces the crowdsourcing cost by reducing the number of crowdsourcing questions.
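The precise definition of reliability is given in the cited work; purely as an illustration of the selection loop that such a global metric enables, the sketch below greedily picks the cross-cluster pair whose expected effect on the reliability of the whole clustering is largest. The \texttt{reliability} and \texttt{match\_prob} functions and the list-of-sets clustering model are hypothetical stand-ins, not the cited algorithm.

\begin{verbatim}
from itertools import combinations

def merge(clustering, a, b):
    # Clustering (a list of sets of record ids) after a "same
    # entity" answer: the clusters containing a and b are unioned.
    ca = next(c for c in clustering if a in c)
    cb = next(c for c in clustering if b in c)
    rest = [c for c in clustering if c is not ca and c is not cb]
    return rest + [ca | cb]

def next_question(clustering, reliability, match_prob):
    # Greedy choice: maximize the EXPECTED gain in the global
    # reliability of the entire clustering, rather than scoring
    # candidate pairs on local, ad-hoc features alone.
    base = reliability(clustering)
    candidates = [(a, b) for c1, c2 in combinations(clustering, 2)
                         for a in c1 for b in c2]
    def expected_gain(pair):
        a, b = pair
        # A "match" answer (probability match_prob) merges two
        # clusters; a "non-match" answer leaves them unchanged.
        return match_prob(a, b) * \
            (reliability(merge(clustering, a, b)) - base)
    return max(candidates, key=expected_gain)
\end{verbatim}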
