Manuscript Number : CSEIT1835255
Handling Duplicate Data in Big Data
Authors (2) : Jony Kumar, Mamta Yadav

Abstract : The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in a data warehouse. The same logical real-world entity often has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may give rise to several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records: records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, thereby improving the quality of the data. The importance of data accuracy and quality has grown with the explosion of data volumes, and is crucial to the success of any cross-enterprise integration, business intelligence, or data mining solution.
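The distinction the abstract draws between exact duplicates and inexact duplicates (syntactically different records for the same real-world entity) can be illustrated with a minimal sketch. The similarity measure, threshold value, and helper names below are illustrative assumptions, not the rule set proposed in the paper:

```python
# Illustrative sketch of exact vs. inexact duplicate detection.
# The 0.85 threshold and the use of difflib are assumptions for
# demonstration only, not the paper's detection/elimination rules.
from difflib import SequenceMatcher

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag two records as duplicates when their similarity ratio
    (0.0 .. 1.0) meets the threshold; exact matches score 1.0."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold

def deduplicate(records):
    """Keep only the first occurrence of each exact or inexact duplicate."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept

rows = ["John Smith", "john smith", "Jon Smith", "Mary Jones"]
print(deduplicate(rows))  # ['John Smith', 'Mary Jones']
```

Here "john smith" is an exact duplicate after normalization, while "Jon Smith" is an inexact duplicate caught by the similarity threshold; both are eliminated, keeping the first representative.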
Keywords : Cross Enterprise Integration, Duplicate Elimination, Semantic Entity, Big Data

Publication Details
Published in : Volume 3 | Issue 5 | May-June 2018

Jony Kumar
M. Tech Scholar CSE, M.D.U Rohtak, YCET Narnaul, Mahendergarh, India
Mamta Yadav
Assistant Professor CSE, M.D.U Rohtak, YCET Narnaul, Mahendergarh, India
Date of Publication : 2018-06-30
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 1163-1167
Publisher : Technoscience Academy