Duplicate File Detection and Elimination

Kanupriya Joshi; Mrs. Mamta

doi:10.32628/CSEIT19544

Authors

Kanupriya Joshi, M. Tech Scholar, Department of Computer Science and Engineering, I.G.U Rewari, YCET Narnaul, Haryana, India
Mrs. Mamta, Assistant Professor, Department of Computer Science and Engineering, I.G.U Rewari, YCET Narnaul, Haryana, India

Keywords:

Duplicate Record Detection, Cross Language Systems, Entity Matching, Data Cleaning, Big Data, Duplicate Data

Abstract

Detecting and eliminating duplicate files is one of the major problems in the broad area of data cleaning and data quality. The same logical real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates are caused by several types of errors, such as typographical errors and different representations of the same logical value. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is intended to improve the efficiency of the data. The importance of data accuracy and quality has increased with the explosion of data size. In the duplicate elimination step, only one copy of each set of exact duplicate records or files is retained, and the other duplicate records or files are eliminated. The elimination process is essential for producing clean data. Before the elimination process, similarity threshold values are calculated for all the records in the data set. These values are important for the elimination process.
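
The two-stage idea in the abstract (remove exact duplicates first, then use a similarity threshold for inexact ones) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the SHA-256 digest as the exact-match key, the `difflib.SequenceMatcher` ratio as the similarity measure, the 0.85 threshold, and the sample records are all assumptions chosen for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used as an exact-duplicate key (assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def drop_exact_duplicates(records):
    """Keep only the first record seen for each distinct content digest."""
    seen = set()
    kept = []
    for record in records:
        key = content_digest(record.encode("utf-8"))
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def drop_inexact_duplicates(records, threshold=0.85):
    """Keep a record only if it scores below the threshold against every kept record.

    The threshold value is a hypothetical default, not one taken from the paper.
    """
    kept = []
    for record in records:
        if all(similarity(record, other) < threshold for other in kept):
            kept.append(record)
    return kept


# Hypothetical sample data for illustration only.
records = [
    "John Smith, 12 Park Road",
    "John Smith, 12 Park Road",   # exact duplicate
    "Jon Smith, 12 Park Rd",      # inexact duplicate (typo and abbreviation)
    "Mary Jones, 4 Hill Street",
]
cleaned = drop_inexact_duplicates(drop_exact_duplicates(records))
print(cleaned)
```

For files rather than text records, the same digest function could be applied to the file bytes. The pairwise comparison in the inexact stage is quadratic in the number of retained records, so a larger data set would likely need some form of blocking or indexing before comparison.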

