Handling Duplicate Data in Big Data

Authors: Jony Kumar, Mamta Yadav

Detecting and eliminating duplicate data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. The same logical real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity without being syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, so that the resulting data can be used more efficiently. The importance of data accuracy and quality has increased with the explosion of data size. Data quality is crucial to the success of cross-enterprise integration applications, business intelligence, and data mining solutions.
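As an illustration of the distinction between exact and inexact duplicates drawn above, the following is a minimal Python sketch. It is not the authors' rule set, which this abstract does not spell out. The helper names (`normalize`, `similarity`, `deduplicate`), the sample records, and the 0.85 threshold are all assumptions chosen for the example; string similarity is computed with the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase every field and collapse runs of whitespace (hypothetical rule)."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Average per-field string similarity, a value between 0 and 1."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of duplicates.

    A record is dropped when its normalized form equals that of a kept
    record (exact duplicate) or when its similarity to a kept record
    reaches the threshold (inexact duplicate). The threshold is an
    assumed value, not one taken from the paper.
    """
    kept = []
    kept_normalized = []
    for record in records:
        norm = normalize(record)
        is_duplicate = any(
            norm == other or similarity(norm, other) >= threshold
            for other in kept_normalized
        )
        if not is_duplicate:
            kept.append(record)
            kept_normalized.append(norm)
    return kept


# Made-up sample data: four tuples describing two real-world entities.
records = [
    ("John Smith", "12 Park Street"),
    ("john  smith", "12 Park Street"),  # exact duplicate after normalization
    ("Jon Smith", "12 Park Street"),    # inexact duplicate (typographical error)
    ("Mary Jones", "4 Lake Road"),
]
unique = deduplicate(records)
print(unique)
```

This sketch compares every record against every kept record, so its cost grows quadratically with the number of distinct records. At big-data scale, the comparison would typically be restricted to candidate groups (for example, records sharing a key prefix) rather than run over all pairs.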

Authors and Affiliations

Jony Kumar
M. Tech Scholar CSE, M.D.U Rohtak, YCET Narnaul, Mahendergarh, India
Mamta Yadav
Assistant Professor CSE, M.D.U Rohtak, YCET Narnaul, Mahendergarh, India

Keywords: Cross Enterprise Integration, Duplicate Elimination, Semantic Entity, Big Data


Publication Details

Published in : Volume 3 | Issue 5 | May-June 2018
Date of Publication : 2018-06-30
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 1163-1167
Manuscript Number : CSEIT1835255
Publisher : Technoscience Academy

ISSN : 2456-3307

Cite This Article :

Jony Kumar, Mamta Yadav, "Handling Duplicate Data in Big Data", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN : 2456-3307, Volume 3, Issue 5, pp.1163-1167, May-June-2018.
