Finding the duplicated data in cloud storage by using AdjDup Technique

Pranay Kumar Katta; Yogendra Prasad P

doi:10.32628/CSEIT1835180

Authors

Pranay Kumar Katta M.Tech, Computer Science and Engineering, Sree Rama Engineering College, Tirupathi, Andhra Pradesh, India
Yogendra Prasad P Assistant Professor, Department of CSE, Sree Rama Engineering College, Tirupathi, Andhra Pradesh, India

Keywords:

Data deduplication, delta compression, storage system, index structure, performance evaluation

Abstract

Cloud computing greatly facilitates data providers who want to outsource their data to the cloud without revealing sensitive information to external parties, and who want users with certain credentials to be able to access that data. Data reduction has become increasingly important in storage systems because of the explosive growth of digital data that has ushered in the big data era. One of the main challenges facing large-scale data reduction is how to maximally detect and eliminate redundancy at very low overheads. In this paper, we present DARE, a low-overhead Deduplication-Aware Resemblance detection and Elimination scheme that effectively exploits existing duplicate-adjacency information for highly efficient resemblance detection in deduplication-based backup/archiving storage systems. The main idea behind DARE is to employ a scheme called Duplicate-Adjacency based Resemblance Detection (DupAdj), which considers any two data chunks to be similar (i.e., candidates for delta compression) if their respective adjacent data chunks are duplicates in a deduplication system, and then to further enhance resemblance detection efficiency with an improved super-feature approach. Our experimental results, based on real-world and synthetic backup datasets, show that DARE consumes only about 1/4 of the computation overhead and 1/2 of the indexing overhead required by traditional super-feature approaches, while detecting 2-10% more redundancy and achieving higher throughput. It does so by exploiting existing duplicate-adjacency information for resemblance detection and by finding the "sweet spot" for the super-feature approach.
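
The DupAdj idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that both backup versions are already split into chunks, that SHA-1 fingerprints identify duplicates, and that only the immediate neighbours of a duplicate chunk are examined. The names fingerprint and dupadj_candidates are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Fingerprint used to detect exact duplicates (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def dupadj_candidates(base_chunks, new_chunks):
    """Return (new_index, base_index) pairs of chunks presumed similar.

    A non-duplicate chunk in the new version is paired with a base chunk
    when both sit directly next to the same pair of duplicate chunks.
    """
    base_fps = [fingerprint(c) for c in base_chunks]
    new_fps = [fingerprint(c) for c in new_chunks]

    # Map each fingerprint to its first position in the base version.
    first_pos = {}
    for i, fp in enumerate(base_fps):
        first_pos.setdefault(fp, i)

    # match[j] is the base index that new chunk j duplicates, or None.
    match = [first_pos.get(fp) for fp in new_fps]
    new_fp_set = set(new_fps)

    candidates = {}
    for j, b in enumerate(match):
        if b is None:
            continue  # only duplicate chunks anchor the neighbour check
        for step in (1, -1):  # look at the next and the previous neighbour
            nj, bi = j + step, b + step
            if not (0 <= nj < len(new_chunks) and 0 <= bi < len(base_chunks)):
                continue
            # Both neighbours must be non-duplicates to be delta candidates.
            if match[nj] is None and base_fps[bi] not in new_fp_set:
                candidates.setdefault(nj, bi)
    return sorted(candidates.items())


if __name__ == "__main__":
    base = [b"A", b"B", b"C", b"D"]
    new = [b"A", b"B-modified", b"C", b"D"]
    print(dupadj_candidates(base, new))
```

In this toy input, chunk 1 of the new version is not a duplicate, but its neighbours on both sides are, so it is paired with chunk 1 of the base version as a delta compression candidate. No feature computation or extra index lookup is needed for that pairing, which is the source of the low overhead the abstract claims.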

