Duplicate File Detection and Elimination
DOI:
https://doi.org/10.32628/CSEIT19544
Keywords:
Duplicate Record Detection, Cross Language Systems, Entity Matching, Data Cleaning, Big Data, Duplicate Data
Abstract
Detecting and eliminating duplicate files is one of the major problems in the broad area of data cleaning and data quality. Often the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is intended to make the data more efficient to use. The importance of data accuracy and quality has increased with the explosion of data size. In the duplicate elimination step, only one copy of each set of exact duplicate records or files is retained, and the other copies are eliminated. The elimination process is essential for producing cleaned data. Before elimination, similarity threshold values are calculated for all records in the data set; the elimination process depends on these values.
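The abstract does not spell out the detection and elimination rules, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes exact duplicates can be found by hashing record content and inexact duplicates by a string similarity score compared against a threshold. The function names, the 0.9 threshold, and the sample records are all hypothetical, and the similarity measure (Python's `difflib.SequenceMatcher`) is a stand-in for whatever measure the paper uses.

```python
import hashlib
from difflib import SequenceMatcher


def content_key(record: str) -> str:
    """Hash the record so that exact duplicates share one key."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after normalizing case and whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def deduplicate(records: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first copy of each record; drop exact and inexact duplicates.

    The threshold is an assumed value, not one taken from the paper.
    """
    seen_keys: set[str] = set()
    kept: list[str] = []
    for record in records:
        key = content_key(record)
        if key in seen_keys:
            continue  # exact duplicate of an earlier record
        seen_keys.add(key)
        if any(similarity(record, other) >= threshold for other in kept):
            continue  # inexact duplicate (e.g. a typographical variant)
        kept.append(record)
    return kept


# Hypothetical sample data: one exact duplicate and one typographical variant.
records = [
    "John Smith, 12 Main Street",
    "John Smith, 12 Main Street",
    "Jon Smith, 12 Main Street",
    "Alice Brown, 4 Oak Avenue",
]
cleaned = deduplicate(records)
print(cleaned)
```

With this sample, the second record is dropped as an exact duplicate and the third as an inexact one. The pairwise comparison is quadratic in the number of retained records, so a larger data set would likely need blocking or sorting to limit the comparisons.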
License
Copyright (c) IJSRCSEIT

This work is licensed under a Creative Commons Attribution 4.0 International License.