Excelling at S3: Secret Methods for Bucket Replication You Didn’t Know Existed!
As AWS DevOps engineers, we are often tasked with copying objects from one S3 bucket to another. Before taking action, we usually compare the available approaches and weigh several considerations: Should the replication cross regions? Does it stay within one account or span different accounts? What is the fastest way? What is the cheapest way? We sometimes even think about the frequency of the replication, whether it is a one-time copy or something that runs every week, for instance.
There is no single correct way to copy objects between buckets, and you should decide which method to use according to your unique scenario.
Let’s compare the ways to perform object replication to better understand which method to choose!
Bucket Replication Rules
The most obvious way to replicate data between two buckets is, of course, the bucket replication feature, accessible from the ‘Management’ tab of the S3 bucket. You create one or more replication rules on the source bucket, and from then on every object you upload to it ends up in the destination bucket as well.
There are many advantages to using replication rules:
- Replication rules can be applied cross-account, meaning that you can replicate objects to buckets in other accounts as well.
- When configuring replication rules, you can define filters to avoid replicating objects you don’t need.
- You can replicate delete markers between buckets.
- You can also change the storage class for the replicated objects.
- Replication is fast: most objects replicate within 15 minutes, and with S3 Replication Time Control (RTC) you get an SLA for replicating 99.99% of objects within 15 minutes.
Now let’s discuss two important notes that you have to keep in mind before starting the procedure:
- Before you start replicating, you must enable bucket versioning on both the source and destination buckets. If versioning is disabled on your bucket, AWS will automatically suggest you enable it (a minimal CLI sketch for doing so follows this list).
- You need to specify an IAM role that S3 will assume in order to replicate objects from the source bucket into the destination bucket.
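As a quick sketch of the versioning prerequisite, assuming hypothetical bucket names, you can enable versioning on both buckets from the CLI before creating the rule:

# Enable versioning on the source and destination buckets (bucket names are placeholders)
aws s3api put-bucket-versioning --bucket my-source-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket my-destination-bucket --versioning-configuration Status=Enabled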
One last thing you have to keep in mind is delete marker replication. By default, replication does not copy delete markers, to protect against malicious deletions, but you can configure the rule to copy them by navigating to ‘Additional replication options’ and selecting the ‘Delete marker replication’ option. Note that if you delete the delete marker itself (that is, you ‘permanently delete’ the object), this action is not replicated to the destination bucket.
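Putting these notes together, here is a minimal sketch of a replication configuration applied from the CLI; the role ARN, bucket names, prefix filter, and storage class are hypothetical placeholders you would replace with your own values.

replication.json (a single rule with a prefix filter, delete marker replication, and a changed storage class on the destination):

{
  "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-logs",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "logs/" },
      "DeleteMarkerReplication": { "Status": "Enabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-destination-bucket",
        "StorageClass": "STANDARD_IA"
      }
    }
  ]
}

# Apply the configuration to the source bucket
aws s3api put-bucket-replication --bucket my-source-bucket --replication-configuration file://replication.json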
Another important thing to remember is that replication rules can take some time to copy objects after they are uploaded to the source bucket, so it is not a real-time solution.
But what about the objects that were already created in the source bucket before we decided to start the replication? Will they be replicated as well?
They won’t be replicated, but AWS will suggest creating a Batch Operation to replicate the existing objects. So, what is a Batch Operation?
Batch Operations
These are one-time jobs created to replicate objects between buckets, and they are the solution for copying objects that already exist in your source bucket. A good analogy is a ‘copy and paste’ operation on your computer: you select the data you want to copy and paste it into the specified destination. S3 Batch Operations can perform actions across billions of objects and petabytes of data with a single request.
First, in order to create a job, navigate to the ‘Batch Operations’ tab in the S3 console and choose to create a new job. You’ll have to specify a manifest for the operation you want to perform; the manifest lists the objects you want to include in the job. You can supply a manifest in three different formats:
- CSV - You can specify the location of a CSV file in one of your S3 buckets that lists the objects you want to copy, for example:
Bucketname,Objectname1
Bucketname,Foldername/Objectname2
- You can also add a version ID for each object in a third column:
Bucketname,Objectname3,V1E2R3S4I5O6N7I8D
If you don’t want to spend your entire afternoon filling a CSV file with the thousands of objects in your source bucket, AWS offers a simple feature called the S3 Inventory Report.
- Inventory Report - This is an S3 feature that you can find in each bucket’s ‘Management’ tab. Amazon S3 Inventory produces comma-separated values (CSV), Apache optimized row columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis. If you set up a weekly inventory, a report is generated every Sunday after the initial report. When creating an inventory report, you specify a destination S3 bucket in which to store the results (a CLI sketch for enabling an inventory configuration appears after this list).
- Create manifest using S3 Replication configuration - This is simply a feature of S3 Batch Operations that builds a manifest from the objects located in the bucket you choose; it automatically lists that bucket for you.
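For completeness, here is a minimal sketch of enabling a weekly CSV inventory report from the CLI; the bucket names, configuration ID, and account ID are hypothetical placeholders:

aws s3api put-bucket-inventory-configuration \
  --bucket my-source-bucket \
  --id weekly-inventory \
  --inventory-configuration '{
    "Id": "weekly-inventory",
    "IsEnabled": true,
    "IncludedObjectVersions": "Current",
    "Schedule": { "Frequency": "Weekly" },
    "Destination": {
      "S3BucketDestination": {
        "Bucket": "arn:aws:s3:::my-inventory-bucket",
        "Format": "CSV",
        "AccountId": "111122223333"
      }
    }
  }'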
Then, after specifying the manifest, you select the destination bucket and configure the operation that you want to perform. Many operations are available, such as copy, invoke Lambda, replace all object tags, etc. (this time, we are going to use the ‘copy’ operation). You can even change the storage class for the copied objects or the KMS encryption key used to store them.
The last step in creating the operation is to specify the ‘additional options’, such as the priority of the operation, whether you need to create a completion report and where to store it, and, most importantly, what role your operation is going to use. The operation needs to have basic permissions to read data from the source bucket and put objects in the destination bucket.
Here is an example of a policy that you can use for your IAM role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:PutObjectTagging"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::{{DestinationBucket}}/*"
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectTagging",
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::{{SourceBucket}}",
        "arn:aws:s3:::{{SourceBucket}}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::{{ManifestBucket}}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::{{ReportBucket}}/*"
      ]
    }
  ]
}
Note: If you choose to use the inventory report as the manifest for the Batch Operations job, you can select the report you want to use; you will then automatically get an option to open the job creation process, which can save you some time.
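If you prefer the CLI over the console, the same kind of job can be created with the s3control API. Here is a minimal sketch assuming a hypothetical CSV manifest, report bucket, and IAM role; the account ID, ARNs, and ETag are placeholders you would replace with your own values:

aws s3control create-job \
  --account-id 111122223333 \
  --region us-east-1 \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/batch-operations-role \
  --no-confirmation-required \
  --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::my-destination-bucket"}}' \
  --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::my-manifest-bucket/manifest.csv", "ETag": "REPLACE_WITH_MANIFEST_ETAG"}}' \
  --report '{"Bucket": "arn:aws:s3:::my-report-bucket", "Prefix": "batch-reports", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}'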
AWS DataSync
Another great solution for replicating objects between S3 buckets is the AWS DataSync service. DataSync lets you sync data between two different storage solutions: you can copy data from S3 into EFS, from FSx to S3, or, in our case, replicate from one S3 bucket to another.
Important note: If you want to replicate objects between two buckets located in two different accounts, you can place the DataSync task in either account, but you’ll have to adjust the permissions according to the option you’ve chosen.
Before creating a DataSync task, you need to perform a few simple actions:
- First, create a role that will be used by your DataSync task. This role needs permissions to read from the source bucket and write to the destination bucket, and you also need to consider whether you are replicating within the same account or across accounts.
- Then, update the bucket policies of both the source and destination buckets so that the DataSync task’s role has access to them. If you are replicating across accounts, the bucket in the account that does not host the DataSync task also needs a policy granting access to your IAM identity.
- The last step before creating the task is to create the locations that the task will use. For same-account replication it’s quite simple: you can create the locations from the AWS console. For cross-account, it is currently not possible to create the location from the UI, so you have to use the AWS CLI with the following command:
aws datasync create-location-s3 --s3-bucket-arn arn:aws:s3:::<SOURCE_BUCKET> --s3-storage-class STANDARD --s3-config BucketAccessRoleArn="arn:aws:iam::<DESTINATION_ACCOUNT_ID>:role/datasync-role" --region <REGION>
Note: I’ve included example policies for the bucket policies and the DataSync role here.
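As a rough illustration of what such a cross-account bucket policy can look like, here is a minimal sketch; the account ID, role name, and bucket name are hypothetical placeholders, and you may need to trim or extend the actions for your case:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDataSyncRoleAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/datasync-role"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetObject",
        "s3:GetObjectTagging",
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:AbortMultipartUpload",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-cross-account-bucket",
        "arn:aws:s3:::my-cross-account-bucket/*"
      ]
    }
  ]
}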
Now for the task creation: as a first step, you specify the source and destination locations that you created in the previous step. Then, you configure the task properties, such as the name, bandwidth limit, content filters, delete marker replication, etc. You can choose to run this task on a schedule or on demand, and view the logs in Amazon CloudWatch.
When your entire infrastructure is set up, you can start the task. If everything is configured correctly, DataSync prepares your transfer by examining your source and destination locations to determine what to transfer. It does this by recursively scanning the contents and metadata of both locations to identify differences between the two. This process can take minutes or a few hours, depending on the contents of both locations. Note that your task will fail if you exceed the DataSync service limits.
Once DataSync is done preparing your transfer, it moves your data (including metadata) from the source to the destination bucket based on the settings.
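To sketch the whole flow from the CLI, assuming hypothetical location and task ARNs plus a placeholder schedule, task creation and execution might look like this:

# Create the task from the two locations created earlier (ARNs and schedule are placeholders)
aws datasync create-task \
  --source-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-0123456789abcdef0 \
  --destination-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-0fedcba9876543210 \
  --name s3-to-s3-replication \
  --schedule ScheduleExpression="cron(0 2 * * ? *)"

# Run the task on demand (or let the schedule trigger it)
aws datasync start-task-execution \
  --task-arn arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0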
AWS CLI
The last way to copy data between S3 buckets is simply to use the AWS CLI. Two useful commands from the terminal let us achieve this goal.
The first is ‘aws s3 cp’, which copies objects between buckets, and the second is ‘aws s3 sync’, which synchronizes them. The ‘cp’ command copies objects from your local computer to an S3 bucket, or between two S3 buckets, while the ‘sync’ command syncs two folders/buckets and can even be told to delete destination objects that are missing from the source folder.
aws s3 cp s3://SOURCE_BUCKET/index.html s3://DESTINATION_BUCKET/index.html
aws s3 sync s3://SOURCE_BUCKET s3://DESTINATION_BUCKET
To copy folders using the s3 cp command, you have to add the --recursive flag.
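For example, a recursive copy and a sync that also removes objects missing from the source might look like this (the bucket names and prefix are placeholders, and --delete is destructive, so use it carefully):

# Copy an entire "folder" (prefix) between buckets
aws s3 cp s3://SOURCE_BUCKET/reports/ s3://DESTINATION_BUCKET/reports/ --recursive

# Sync the buckets and delete destination objects that no longer exist in the source
aws s3 sync s3://SOURCE_BUCKET s3://DESTINATION_BUCKET --delete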
You can find more examples of using these commands here.
Conclusion
There are many ways to copy data between two S3 buckets, and each of them can serve you in different use cases. The differences between them come down to pricing, how much of the operation you have to manage yourself, the time it takes for the copied data to be available, and even the filtering options available for your replication. If you plan to continuously copy objects between buckets, you probably want bucket replication rules. If you need a one-time operation, a Batch Operations job may be the right solution, or you can use the DataSync service, which is more flexible for scheduling or performing multiple operations. If you need to integrate the replication with a CI platform or simply run it from your terminal, you’ll find that the easiest way is the AWS CLI. In the end, after reading this article, you should have a much better picture of object replication in S3, and hopefully it will come in handy in the future.