Introduction ✨
Migrating large volumes of data across hybrid cloud environments can be challenging. AWS DataSync simplifies and accelerates these migrations by securely transferring data between storage locations. Recently, AWS DataSync introduced a feature called manifests, which lets users provide a list of the source files or objects to transfer, reducing task execution times. In this blog post, we explore how to use AWS DataSync with hundreds of millions of objects, covering strategies, best practices, and considerations for managing large-scale data transfers efficiently. 🛡️✨
AWS DataSync Overview 🔍
AWS DataSync facilitates data migrations to AWS by transferring data securely between on-premises storage, edge locations, other clouds, and AWS Storage services. It is designed to handle large-scale data transfers and offers features such as encryption, data validation, and task scheduling.
Manifests: A New Feature 📑
Manifests are lists of files or objects that users want to transfer using DataSync. Instead of copying every object in a storage bucket, DataSync can copy only the objects specified in the manifest file, which helps reduce task execution times.
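To make this concrete, here is a minimal sketch that builds a CSV manifest of object keys, uploads it to S3 with boto3, and references it when starting a DataSync task execution. The bucket names, role ARN, and task ARN are placeholders, and the exact `ManifestConfig` field names should be confirmed against the current boto3 DataSync documentation.

```python
import boto3

s3 = boto3.client("s3")
datasync = boto3.client("datasync")

# Placeholder identifiers -- replace with your own resources.
MANIFEST_BUCKET = "my-manifest-bucket"
MANIFEST_KEY = "manifests/batch-0001.csv"
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
BUCKET_ACCESS_ROLE_ARN = "arn:aws:iam::111122223333:role/DataSyncS3Access"

# A manifest is simply a CSV listing the object keys to transfer, one per line.
keys_to_copy = ["prefix-a/object-0001.dat", "prefix-a/object-0002.dat"]
s3.put_object(
    Bucket=MANIFEST_BUCKET,
    Key=MANIFEST_KEY,
    Body="\n".join(keys_to_copy).encode("utf-8"),
)

# Start a task execution that copies only the objects named in the manifest.
# The ManifestConfig shape below follows the boto3 documentation at the time of writing.
datasync.start_task_execution(
    TaskArn=TASK_ARN,
    ManifestConfig={
        "Action": "TRANSFER",
        "Format": "CSV",
        "Source": {
            "S3": {
                "ManifestObjectPath": MANIFEST_KEY,
                "S3BucketArn": f"arn:aws:s3:::{MANIFEST_BUCKET}",
                "BucketAccessRoleArn": BUCKET_ACCESS_ROLE_ARN,
            }
        },
    },
)
```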
Challenges with Large Datasets 🤔
When dealing with hundreds of millions of objects, users face various challenges, such as:
• Technical constraints, including network bandwidth and storage limitations.
• Compliance requirements, including Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
• Managing the size and quantity of data sets to be migrated.
• Prolonged data transfer wait times due to latency and bandwidth constraints.
Solution Walkthrough ⚡
1️⃣ Restructure the Data 📦
• Break the source data into smaller tasks, each with fewer than 50 million objects (a partitioning sketch follows this list).
• This prevents exceeding task quotas, which can cause inefficiencies.
• Smaller tasks also allow for better parallel processing.
• Considerations should include data size, structure, and source location.
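As a rough illustration of that partitioning, the sketch below lists the top-level prefixes in a source bucket and creates one DataSync location and task per prefix, keeping each task well under the per-task object quota. The bucket name, role ARN, and destination location ARN are placeholders for this example.

```python
import boto3

s3 = boto3.client("s3")
datasync = boto3.client("datasync")

SOURCE_BUCKET = "my-source-bucket"  # placeholder
DEST_LOCATION_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-dest"  # placeholder
ROLE_ARN = "arn:aws:iam::111122223333:role/DataSyncS3Access"  # placeholder

# Discover top-level prefixes; each prefix becomes its own DataSync task.
prefixes = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Delimiter="/"):
    prefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]

for prefix in prefixes:
    # One S3 location per prefix keeps each task's object count small.
    location = datasync.create_location_s3(
        S3BucketArn=f"arn:aws:s3:::{SOURCE_BUCKET}",
        Subdirectory=f"/{prefix}",
        S3Config={"BucketAccessRoleArn": ROLE_ARN},
    )
    datasync.create_task(
        SourceLocationArn=location["LocationArn"],
        DestinationLocationArn=DEST_LOCATION_ARN,
        Name=f"migrate-{prefix.strip('/')}",
    )
```

This assumes the data is reasonably balanced across top-level prefixes; if a single prefix still holds more than 50 million objects, split it further by sub-prefix.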
2️⃣ Event-Driven with Manifest File 🎟️
Use an event-driven approach to ensure that only the necessary files are transferred (a minimal Lambda sketch follows this list). You can:
• Utilize an Amazon EventBridge schedule rule to invoke Lambda functions.
• Implement buffering for records to optimize performance.
• Use Amazon SQS for persistently capturing s3:ObjectCreated events.
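One way to wire these pieces together is a Lambda function, invoked on an EventBridge schedule, that drains the SQS queue of s3:ObjectCreated notifications, buffers the new object keys into a manifest, and starts an incremental DataSync execution. The queue URL, bucket names, role ARN, and task ARN below are placeholders, and the ManifestConfig shape mirrors the earlier sketch.

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
datasync = boto3.client("datasync")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/new-objects"  # placeholder
MANIFEST_BUCKET = "my-manifest-bucket"                                       # placeholder
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
ROLE_ARN = "arn:aws:iam::111122223333:role/DataSyncS3Access"


def handler(event, context):
    keys = []
    # Drain a batch of s3:ObjectCreated notifications buffered in SQS.
    while len(keys) < 10000:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                keys.append(record["s3"]["object"]["key"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if not keys:
        return {"transferred": 0}

    # Write the buffered keys as a CSV manifest and start an incremental transfer.
    manifest_key = f"manifests/incremental-{int(time.time())}.csv"
    s3.put_object(Bucket=MANIFEST_BUCKET, Key=manifest_key, Body="\n".join(keys).encode("utf-8"))
    datasync.start_task_execution(
        TaskArn=TASK_ARN,
        ManifestConfig={
            "Action": "TRANSFER",
            "Format": "CSV",
            "Source": {
                "S3": {
                    "ManifestObjectPath": manifest_key,
                    "S3BucketArn": f"arn:aws:s3:::{MANIFEST_BUCKET}",
                    "BucketAccessRoleArn": ROLE_ARN,
                }
            },
        },
    )
    return {"transferred": len(keys)}
```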
3️⃣ Large Batch with Includes Filter 🎛️
• For objects not captured by the event-driven flow, run scheduled large-batch tasks that use DataSync include filters to scope each execution to specific prefixes or folders.
• Include filters limit each execution to the paths that actually need to be copied, avoiding a full-bucket transfer.
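For example, a scheduled large-batch run can scope a task execution to a handful of prefixes with an include filter instead of a manifest. The task ARN and prefix patterns below are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder task ARN -- replace with your own.
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"

# Limit this execution to two prefixes; multiple patterns are separated by "|".
datasync.start_task_execution(
    TaskArn=TASK_ARN,
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/prefix-a|/prefix-b"}],
)
```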
4️⃣ Leverage S3 Transfer Acceleration 📈
• Amazon S3 Transfer Acceleration can significantly improve data transfer speeds.
• It is especially useful for large datasets and for moving data between on-premises storage and Amazon S3.
• To leverage it, enable Transfer Acceleration on the S3 bucket and update DataSync tasks to use the accelerated endpoint.
• This speeds up data transfers, particularly with hundreds of millions of objects.
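Enabling acceleration on the bucket is a single API call, as sketched below; whether your specific DataSync configuration can take advantage of the accelerated endpoint should be verified for your setup. The bucket name is a placeholder.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name -- replace with your own.
s3.put_bucket_accelerate_configuration(
    Bucket="my-source-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Confirm the bucket now reports Transfer Acceleration as enabled.
print(s3.get_bucket_accelerate_configuration(Bucket="my-source-bucket"))
```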
5️⃣ Monitor and Optimize DataSync Performance 📈
• Focus on task settings, network configurations, and security settings.
• Regularly monitor DataSync task performance.
• Use Amazon CloudWatch to set alarms and notifications.
• Review and optimize DataSync configurations over time.
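A simple starting point is an alarm on a DataSync task metric. The sketch below assumes the AWS/DataSync namespace with a TaskId dimension and the BytesTransferred metric; confirm the exact metric and dimension names in the CloudWatch console for your task. The task ID and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder identifiers -- replace with your own task ID and SNS topic.
TASK_ID = "task-0123456789abcdef0"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:datasync-alerts"

# Notify operators if a running migration transfers no data for an hour,
# which may indicate a stalled or misconfigured task.
cloudwatch.put_metric_alarm(
    AlarmName="datasync-stalled-transfer",
    Namespace="AWS/DataSync",
    MetricName="BytesTransferred",
    Dimensions=[{"Name": "TaskId", "Value": TASK_ID}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=[SNS_TOPIC_ARN],
)
```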
Conclusion
AWS DataSync offers powerful features for managing data migrations across hybrid cloud environments. In this blog post, we discussed the challenges of migrating hundreds of millions of objects and walked through a solution that combines restructured DataSync tasks, manifests, and include filters. By following these best practices, users can efficiently transfer and synchronize large datasets, meet RTO and RPO requirements, and achieve secure, optimized migrations. 🌟🌐