Amazon Web Services (AWS), and in particular the Simple Storage Service (S3), [1] are widely used by individuals and companies to manage their data, websites, and backends. Users range from isolated individuals and small startups to multi-billion-dollar companies such as Pinterest [2] and (formerly) Dropbox. [3] This page is not intended as a guide to getting started with S3; you can find many such guides online. [4] Rather, it is targeted at individuals and companies whose expected S3 costs are between $10 and $10,000 per month. Other similar lists of tips are available online. [5]
This page also does not cover other kinds of S3 optimization, such as choosing bucket, folder, and file names or ordering operations to maximize throughput or minimize latency. Most of those optimizations do not affect costs directly (either positively or negatively), and they generally become relevant only at a substantially greater scale than the target audience for this page is likely to be at. You can read more about them in the official S3 guide [6] and elsewhere.
Steps
Getting a Broad Understanding of S3
1. Understand your S3 use case. S3 can be used for many purposes:
- As a place to store files for live serving on websites, including image files [7] or an entire static website (usually behind a CDN such as Amazon CloudFront or CloudFlare). [8]
- As a "data lake", a place for data that you consume from or generate in your applications: Essentially, S3 becomes the long-term storage for your data, with your initial generated data being logged to S3, and various applications reading from S3, transforming the data, and writing back to S3. [9] X Research source [10] X Research source [11] X Research source
- As a "data warehouse", a place to store long-term backups of structured and unstructured data not intended for further active consumption.
- As a place to store executables, scripts, and configurations necessary to launch new EC2 instances with your applications (or update your applications on existing instances).
2. Understand the main ways S3 usage generates costs. The numbers below are for standard storage; caveats for other storage classes are discussed later. [12] All costs are applied and reported separately for each bucket and each hour; in other words, if you download your detailed billing report, you will see one line item for every combination of bucket, hour, and type of cost. A short worked estimate follows the list below.
- Storage costs: The cost is measured in storage space multiplied by time. You do not pay upfront for an allocated amount of storage space. Rather, every time you use more storage, you pay extra for that extra storage for the amount of time you use it. Costs can therefore fluctuate over time as the amount of data you've stored changes. Storage costs are reported separately for each bucket every hour. The pricing varies by region but is fixed within each region. As of March 2020, the cost ranges from 2.3 cents per GB-month in US Standard (North Virginia) to 4.05 cents per GB-month in São Paulo. [12]
- Request pricing: For standard storage, the cost for PUT, COPY, POST, or LIST requests ranges from $0.005 per 1,000 requests in US regions to $0.007 per 1,000 requests in São Paulo. [12] The cost for GET and all other requests is an order of magnitude smaller, ranging from $0.004 per 10,000 requests in all US regions to $0.0056 per 10,000 requests in São Paulo. Note, however, that most of the cost associated with a GET request is captured in the data transfer costs (if the request is made from outside the region). Note also that for data stored in other storage classes, request pricing is a little higher. Another kind of request that becomes relevant when discussing lifecycle policies is the lifecycle transition request (e.g., transitioning something from standard storage to IA or Glacier).
- Data transfer costs: Costs are zero within the same AWS region (both S3 -> S3 and S3 -> EC2 instances), about 2 cents per GB for data transfer across regions, and about 9 cents per GB for data transfer to outside AWS.
- Retrieval pricing: This does not apply to standard storage; it applies to two of the other storage classes, namely IA and Glacier. This pricing is applied per GB of data retrieved.
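To make these cost components concrete, here is a minimal back-of-the-envelope estimator in Python, using the US Standard prices quoted above; the traffic numbers in the example are made up for illustration, and actual prices change over time.

```python
# Rough monthly S3 cost estimate for standard storage in US Standard (N. Virginia),
# using the per-unit prices quoted above. Treat the numbers as illustrative.
STORAGE_PER_GB_MONTH = 0.023   # $ per GB-month
PUT_LIST_PER_1000 = 0.005      # $ per 1,000 PUT/COPY/POST/LIST requests
GET_PER_10000 = 0.004          # $ per 10,000 GET requests
TRANSFER_OUT_PER_GB = 0.09     # $ per GB transferred out of AWS

def estimate_monthly_cost(storage_gb, put_requests, get_requests, transfer_out_gb):
    storage = storage_gb * STORAGE_PER_GB_MONTH
    requests = (put_requests / 1000) * PUT_LIST_PER_1000 + (get_requests / 10000) * GET_PER_10000
    transfer = transfer_out_gb * TRANSFER_OUT_PER_GB
    return storage + requests + transfer

# Example: 500 GB stored, 200,000 PUTs, 2 million GETs, 50 GB served outside AWS.
print(round(estimate_monthly_cost(500, 200_000, 2_000_000, 50), 2))  # ~17.8
```

In this example, storage and transfer dominate while request costs are comparatively small, which is typical for standard storage.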
3. Understand the central role buckets play in organizing your S3 files, and the use of the term "object" for S3 files. A short listing sketch appears after this list.
- You can create buckets under your account. A bucket is identified by a string (the bucket name). There can be only one bucket with a given bucket name across all of S3, across customers; therefore you may not be able to use a bucket name if somebody else already uses it.
- Each bucket is associated with an AWS region, and is replicated across multiple availability zones within that region. The availability zone information is not available to the end user, but is an internal implementation detail that helps S3 maintain high durability and availability of data.
- Within each bucket, you can store your files either directly under the bucket or in folders. Folders don't need to be created or deleted explicitly: saving a file automatically brings into existence any "folders" needed for its path, and once no files remain underneath it, a folder automatically ceases to exist.
- Under the hood, S3 is a key-value store: each prefix that is not a file name maps to the set of files and folders under that prefix, and each file name maps to the actual file. In particular, different files in a bucket may be stored in very different parts of the data center.
- S3 calls its files "objects", and you might encounter this term when reading up about S3 elsewhere.
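The key-value view above is easy to see from code. Here is a minimal sketch using boto3 (the AWS SDK for Python); the bucket name and prefix are placeholders, and credentials are assumed to be configured already.

```python
import boto3

s3 = boto3.client("s3")

# "Folders" are just key prefixes: listing with Delimiter="/" returns the
# immediate "subfolders" as CommonPrefixes and the files as Contents.
resp = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="logs/2020/", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print("folder:", prefix["Prefix"])
for obj in resp.get("Contents", []):
    print("object:", obj["Key"], obj["Size"], "bytes")
```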
4. Understand the different ways you can interact with S3 files. A short upload/download sketch follows this list.
- You can upload and download files online, by logging in through a browser.
- Command-line tools based on Python include the AWS Command Line Interface, [13] the antiquated s3cmd, [14] and the more recent s4cmd. [15]
- If using Java or another JVM-based language (such as Scala), you can access S3 objects using the AWS Java SDK. [16]
- Deployment tools such as Ansible and Chef offer modules to manage S3 resources.
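As an example of programmatic access, here is a minimal upload/download sketch using boto3; the file, bucket, and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file to a key in the bucket, then download it back.
s3.upload_file("report.csv", "my-example-bucket", "reports/2020/report.csv")
s3.download_file("my-example-bucket", "reports/2020/report.csv", "report_copy.csv")
```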
5. Understand the pros and cons of dealing with S3 and its differences from a traditional filesystem.
- With S3, it requires a bit more gymnastics (and running time) to get a global picture of the amount of data used in a bucket or in subfolders of that bucket. That's because this data is not recorded anywhere directly but rather needs to be computed through recursive key-value lookups.
- Finding all the files that match a regex can be a very expensive operation, particularly when that regex includes wildcards in the middle of the expression rather than at the end.
- It is not possible to append data to a file: you have to get the file, modify it, and put the whole modified file back up (see the point below about sync capabilities, and the sketch after this list).
- Moving or renaming files actually involves deleting objects and creating new ones, and moving a folder involves deleting and recreating all the objects under it. Each file move involves a GET and a PUT call, leading to increased request pricing. Moreover, moving objects can be expensive if they are stored in storage classes (Standard-IA and Glacier) where retrieval costs money.
- S3 supports file sizes up to 5 TB, but cross-region transfers of single files larger than a few hundred megabytes become increasingly unreliable. The AWS CLI uses multipart upload for large files; make sure that any programs of yours that handle large files either use multipart upload or split their output into smaller files.
- S3 does not provide full support for rsync. However, there is a sync command (aws s3 sync in the AWS CLI, s3cmd sync in s3cmd) that syncs contents between a local folder and an S3 folder, or between two S3 folders. For files that exist in both the source and the destination, it can detect that they are identical and skip the transfer; however, it is less efficient than rsync in that it re-transfers the whole file if the contents differ even slightly, whereas rsync sends only a small diff for highly similar files. The other difference from rsync is that it applies to an entire folder, and file names cannot be changed.
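Because there is no append operation, "appending" a line to an object means a full get-modify-put cycle. A minimal boto3 sketch (bucket and key are placeholders) follows; doing this at high frequency runs up request and transfer costs quickly.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "logs/app.log"

# Read the whole object, append locally, and write the whole object back.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
body += b"one more log line\n"
s3.put_object(Bucket=bucket, Key=key, Body=body)
```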
Zipping/Compressing Data
1. Compress data before storing it, wherever the requirements of your application permit. A compression sketch follows this list.
- Explore what forms of zipping and compression are compatible with the processes you use to generate data, and the processes you use to read and process data.
- Make sure you are using zipping and compression for your biggest dumps of data insofar as it does not interfere with your application. In particular, raw user logs and structured data based on user activity are prime candidates for compression.
- As a general rule, compression will save not only on storage costs but also on transfer costs (when reading/writing the data) and might even end up making your application faster if upload/download time is a bigger bottleneck than local compression/decompression time. This is often the case.
- To take an example, converting large structured data files to the BZ2 format can reduce storage space by a factor of 3 to 10; however, BZ2 is compute-intensive to compress and decompress. Other compression algorithms to consider are gzip, lz4, and zstd. [5]
- Other possible ways of reducing space include using column-based rather than row-based storage, and using binary formats (such as Avro) rather than human-readable formats (such as JSON) for long-term data retention. [5]
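Here is a minimal sketch of gzip-compressing a log file locally before uploading it with boto3; the file, bucket, and key names are placeholders.

```python
import gzip
import shutil
import boto3

# Compress the file locally, then upload the compressed version.
with open("events.json", "rb") as src, gzip.open("events.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file("events.json.gz", "my-example-bucket", "logs/events.json.gz")
```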
2. If compressing data is not possible at the point where you first write it out, consider running a separate process to re-ingest and compress the data. This is generally a suboptimal solution and rarely necessary, but there may be cases where it is relevant. If considering such a solution, run the calculations carefully based on the cost of re-ingesting and compressing the data and the total amount of time you intend to retain it.
Optimizing Storage Costs
1. Understand the differences between the four types of S3 storage. [12] A sketch of writing directly into a cheaper storage class follows this list.
- Standard storage is the most expensive for storage but is the cheapest and fastest for making changes to data. It is designed for 99.999999999% durability (over a year, i.e., this is the expected fraction of S3 objects that will survive a year) and 99.99% availability (the probability that a given S3 object is accessible at a given time). Note that in practice it is very rare to lose data in S3, and there are bigger risk factors than data actually disappearing from S3: accidental deletion, somebody maliciously hacking into your account and deleting content, or even Amazon being forced to delete your data under government pressure. [17]
- Reduced Redundancy Storage (RRS) used to be 20% cheaper than standard storage while offering a little less redundancy, and was designed for 99.99% durability and 99.99% availability. You might have used it for large amounts of data that are not highly critical (such as full user logs). However, as of December 2016, price reductions made to standard storage were not accompanied by corresponding reductions to RRS, so RRS is now equally or more expensive. [18] [19]
- Standard storage - Infrequent Access (called S3 - IA) is an option introduced by Amazon in September 2015 that combines the high durability of S3 with a lower availability of only 99%. It is an option for storing long-term archives that do not need to be accessed often but that, when they do need to be accessed, need to be accessed quickly. [20] S3 - IA is charged for a minimum of 30 days (even if objects are deleted before that) and a minimum object size of 128 KB. It is approximately half as expensive as standard S3, though the precise discount varies by region.
- Glacier is the cheapest form of storage. However, Glacier costs money to unarchive and make available again for reading and writing, with the amount you need to pay depending on the number of retrieval requests, the speed with which you want the data retrieved, and the size of data retrieved. Also, Glacier files have a minimum 90-day storage period: files deleted before then are charged for the remainder of the 90 days upon deletion.
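If you already know at write time that data belongs in a cheaper class, you can write it there directly rather than transitioning it later. A minimal boto3 sketch follows (bucket, key, and file name are placeholders); Glacier is typically reached via the lifecycle transitions discussed later in this section.

```python
import boto3

s3 = boto3.client("s3")

# Write an archive object straight into Standard-IA instead of standard storage.
with open("snapshot.tar.gz", "rb") as f:
    s3.put_object(
        Bucket="my-example-bucket",
        Key="archive/2020/snapshot.tar.gz",
        Body=f,
        StorageClass="STANDARD_IA",
    )
```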
2. Get a sense of how your costs are growing.
- In a use case where you have a fixed set of files that you periodically update (effectively removing older versions), your monthly storage costs are approximately constant, with a fairly tight upper bound, and your cumulative storage spend grows linearly. This is a typical scenario for a set of executables and scripts.
- In a use case where you are continually generating new data at a constant rate, your monthly storage costs grow linearly and your cumulative storage cost grows quadratically.
- In a use case where the rate of data generation itself is growing linearly, your monthly storage costs grow quadratically and your cumulative storage cost grows cubically.
- In a use case where the rate of data generation is growing exponentially, both your monthly data storage cost and your cumulative data storage cost grow exponentially as well.
3. Explore whether object versioning makes sense for your goals. [21] A sketch of enabling versioning follows this list.
- Object versioning allows you to keep older versions of a file, so you can revisit or restore an older version after an accidental overwrite or deletion.
- When using object versioning, you can combine it with lifecycle policies to retire versions older than a certain age (as long as they are not the current version).
- If using object versioning, keep in mind that just listing files (using aws s3 ls or the online interface) will cause you to underestimate the total storage used, because you are charged for older versions that aren't included in the list.
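Here is a minimal boto3 sketch of enabling versioning on a bucket and listing all versions of a key, including the older versions that still incur storage charges; the bucket and key are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Turn versioning on for the bucket.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# List every stored version of a key; aws s3 ls would show only the latest.
versions = s3.list_object_versions(Bucket="my-example-bucket", Prefix="config/app.yaml")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["Size"], "bytes", "(latest)" if v["IsLatest"] else "")
```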
4. Explore lifecycle policies for your data. A configuration sketch follows this list.
- You can set policies to automatically delete data in particular buckets, or even under particular prefixes within buckets, once it is more than a certain number of days old. This can help you better control your S3 costs and also help you comply with various privacy and data policies. Note that some data retention laws and policies might require you to keep data for a minimum time; these put a lower bound on the age at which your lifecycle policy can delete data. Other policies or laws might require you to delete data within a certain time period; these put an upper bound on the age at which your lifecycle policy must delete data.
- With a lifecycle policy for deletion, the way your costs grow changes a lot. Now, with a constant stream of incoming data, your monthly storage costs remain constant rather than grow linearly, since you are storing only a moving window of data rather than all data so far. Even if the size of incoming data is growing linearly, your monthly storage costs only grow linearly rather than quadratically. This can help you tie your infrastructure costs to your revenue model: if your monthly revenue is roughly proportional to the rate at which you receive data, your storage model is scalable.
- A technical limitation: you cannot set two policies with the same bucket where one prefix is a subset of the other. Keep this in mind as you think about how to store your S3 data.
- In addition to lifecycle policies for deletion, you can also set policies to archive data (i.e., convert it from standard storage to Glacier), reducing storage costs. However, Glacier has a minimum retention period of 90 days: you are charged for 90 days of Glacier storage even if you delete the data before then. Therefore, if you intend to delete the data shortly, moving it to Glacier first is probably not worthwhile.
- You can also have a lifecycle policy to convert data in S3 (standard storage) to S3 - IA. This policy is ideal for data that you expect to be accessed frequently in the immediate aftermath of its creation but infrequently afterward. Files in IA have a minimum billed object size (files smaller than 128 KB are charged as if they were 128 KB) and a minimum 30-day retention period.
- Note that lifecycle transitions themselves cost money, and it's often better to create objects directly in the desired storage class rather than transition them. You will need to do the calculations for your use case to know whether and when lifecycle transitioning makes sense.
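Here is a sketch of what such a policy can look like with boto3: objects under a logs/ prefix move to Standard-IA after 30 days, to Glacier after 90, and are deleted after 365. The bucket name, prefix, and day counts are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```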
5. Use the following heuristics to determine the best storage class for your use case. While we talk as if we are dealing with a single file, we are really thinking of a setup where this happens separately and independently for a large number of files. A worked break-even calculation follows this list.
- The first step to determining the right storage class is to get an estimate for your file size, retention period, expected number of accesses (as well as how that number varies over time based on age), and the maximum amount of time you can wait when you do need to access something. You can use all of these as parameters into a formula that calculates the expected cost of using each storage class. The formula gets rather complicated.
- Note that exact thresholds for these can vary based on the current prices in your region. Prices vary by region and keep changing over time. In particular, the following matter: storage pricing for each storage class, request pricing for each storage class, retrieval pricing for each storage class, and minimum size and minimum retention period requirements. With these caveats, heuristics are below.
- If you intend to retain data for two weeks or less, standard storage is preferable both to IA and to Glacier storage. The reason is that the minimum retention periods (30 days for IA, 90 days for Glacier) cancel the cost advantages (at most double for IA, about six times for Glacier) at two weeks or less.
- If your file size is 64 KB or less, then standard storage always beats out IA storage. That's because the minimum size requirement of IA (128 KB) cancels the cost advantage (at most double).
- If you intend to access each file once a month or more frequently, then standard storage wins relative to both IA and Glacier. That's because the extra cost of even one data retrieval destroys the monthly storage saving.
- Let's say you have data that you need to initially keep in standard storage for a month, after which you are okay with moving it to IA for a month or more, as you expect to not need to access it at all after that. It makes sense to move it to IA only if the total number of megabyte-months in IA state per file is at least 1. That's because the lifecycle transition cost of moving to IA needs to be overcome by the cost saving. For instance, if you want to keep the data for one additional month, the file size should be at least 1 MB for it to be a worthwhile expenditure. Note the minimum 30-day period makes transitions for shorter times even less worthwhile.
- Similarly, for migration to Glacier, the breakeven is at about 2.5 megabyte-months for each file. Note, however, that the minimum 90-day retention period in Glacier complicates matters; if you intend to benefit from moving data to Glacier for a month, the file size should be 7.5 MB or higher.
- If you expect to not need to access the content after writing it to S3, the optimal strategy is usually either standard or Glacier, with the trade-off depending on the retention period. However, there is a sweet spot in between where IA is the best option (for instance, storing 128 KB for one month). To illustrate this, consider the simple case where you need to keep a single file of a fixed size for a fixed amount of time, with zero expected accesses after it is stored. If you plot retention time in months on the horizontal axis and file size in GB on the vertical axis, and color each point by the cheapest storage class (using US Standard prices as of December 2016), standard wins at short retention times, Glacier wins at long retention times, and IA wins in a band in between.
- As you increase the expected number of accesses of the data, standard becomes optimal for more and more use cases (i.e., for larger data sizes, and for longer retention periods). IA also starts becoming optimal in cases where Glacier would previously have been optimal. In other words, standard takes over from IA and IA takes over from Glacier.
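The break-even numbers quoted above can be reproduced with a small calculation. The sketch below assumes US Standard prices of roughly $0.023, $0.0125, and $0.004 per GB-month for standard, IA, and Glacier respectively, and lifecycle transition requests of roughly $0.01 per 1,000 (to IA) and $0.05 per 1,000 (to Glacier); check current prices for your region before relying on it.

```python
# A transition pays off once (file size in GB) * (months spent in the cheaper
# class) * (storage saving per GB-month) exceeds the one-off transition cost.
def breakeven_mb_months(transition_cost_per_request, saving_per_gb_month):
    return transition_cost_per_request / (saving_per_gb_month / 1024)

print("IA:     ", round(breakeven_mb_months(0.01 / 1000, 0.023 - 0.0125), 2), "MB-months per file")
print("Glacier:", round(breakeven_mb_months(0.05 / 1000, 0.023 - 0.004), 2), "MB-months per file")
# Prints roughly 1 MB-month for IA and between 2.5 and 3 MB-months for Glacier,
# matching the heuristics above (before accounting for minimum retention periods).
```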
6. Use the following common-sense benchmarks based on your storage use case. This will help you get a sense of how much to expect in storage costs.
- If you are live-serving a static website or images: Storage costs are likely to be a few cents, with details depending on the size of your site. The main costs of serving a live site are the request pricing and transfer costs.
- If you are storing a data lake with the main user-generated stream being web or app activity (i.e., web request logs): An individual web request log line can vary in maximum size from 1 KB (if you keep all the standard headers and fields) to 10 KB (if you also include peripheral information about the user and context). If you get a million web requests a month, and keep old web request logs for a month, that translates to somewhere between 1 GB and 10 GB of storage, which is between 2.3 cents and 40.5 cents in monthly storage cost. The cost scales linearly both with your traffic and with your decision on how long to store. For instance, with a billion web requests a month and storing data for a year, your monthly data storage cost shoots up to somewhere between $276 and $4860. Using binary formats and zipping/compressing can bring costs down further.
- If you are storing archives of images and video footage: for instance, if you are a television network that shoots footage regularly and wants to keep archives of old footage available in case it becomes relevant later. This is a use case where the total storage can be quite significant. For instance, with 10 hours of daily video footage, you could be adding something in the range of 100 GB (uncompressed) every day. If you put this footage in standard storage for the first year and then archive to Glacier for another nine years, your total data would come to 365 TB (36.5 TB in standard storage) and your monthly S3 storage costs (before compression) would be about $2200 (two-thirds for Glacier, one-third for standard storage). Compression of various sorts can reduce storage costs by a factor of 2 to 10.
- It's instructive to look at the bills of some power S3 users to get a sense of just how much a bill can vary.
- Dropbox was reported to have 500 petabytes of data in S3 before moving it off to its own servers. [22] At current prices quoted online, that would cost about $10.5 million per month. Although Dropbox likely got a significant discount and achieved benefits from data deduplication and compression, its bill was likely still at least hundreds of thousands of dollars a month.
- Another extreme example of a large user is DigitalGlobe, which is moving 100 PB of high-resolution satellite imagery to S3. [23]
- Pinterest reported that it adds 20 terabytes of data a day, which, in standard storage, means that its monthly bill would go up by roughly $600/month every day. If this rate of data addition continues for ten years, Pinterest would have total storage of about 75 PB and a monthly bill on the order of hundreds of thousands of dollars.
- Beyond these extreme use cases, however, even some of the world's largest companies have fairly low S3 bills. For instance, at the end of 2013, Airbnb reported having 50 TB of high-resolution home photo data, an amount that would cost about $1150 per month at today's prices. [24]
Optimizing Data Transfer Costs
1. If using S3 for live-serving content, put it behind a CDN such as Amazon CloudFront, CloudFlare, or MaxCDN.
- The CDN has a large number of edge locations in different parts of the world, usually ranging from dozens to hundreds.
- The user's request for the page is routed to the nearest CDN edge location. That edge location then checks if it has an updated copy of the resource. If not, it fetches it from S3. Otherwise it serves the copy it has.
- The upshot: end users see higher availability and lower latency (as resources are served from a location physically close to them), and the number of requests and the amount of data transfer out of S3 are kept low. Explicitly, the number of requests is bounded by (number of edge locations) × (number of files) if you never update files; if you do update files, you have to multiply by the number of file updates as well.
2. Understand the key co-location advantage of EC2/S3. If your primary use of S3 is to read and write data from EC2 instances (i.e., any of the use cases other than live serving), this advantage is best reaped if your S3 bucket is located in the same AWS region as the EC2 instances that read from or write to it. Co-location has several advantages:
- Low latency (less than a second)
- High bandwidth (in excess of 100 Mbit/second): Note that bandwidth is actually quite good between the different US regions, so this is not a significant issue if all your regions are in the US, but it can be significant between the US and EU, EU and Asia-Pacific, or the US and Asia-Pacific.
- No data transfer costs (however, you still pay the request pricing) [12]
3. Determine the location (AWS region) of your S3 bucket(s).
- If you're running EC2 instances that read from or write to the S3 buckets: as noted in Step 1, co-locating S3 and EC2, to the extent feasible, helps with bandwidth, latency, and data transfer costs. Therefore, an important consideration in locating your S3 bucket is where you expect to have the EC2 instances that will interact with it. If the EC2 instances are mostly backend instances, consider the regions where those instances are cheapest to run; if they are frontend instances, consider what regions you expect to get most of your traffic from. By and large, EC2 instance considerations are more important than S3 considerations in determining the region, so it generally makes sense to first decide where your EC2 capacity will be and then put your S3 buckets there. S3 costs tend to be lower in the same regions where EC2 is cheaper, so this luckily does not create a conflict.
- If there are other AWS services that you must have, but that are not available in all regions, this might also constrain your choice of region.
- If you are frequently uploading files from your home computer to S3, you might consider getting a bucket in a region closer to your home, to improve the upload latency. However, this should be a minor consideration relative to the others.
- If you expect to use S3 for live-serving static images, decide the location based on where you expect to get your traffic from.
- In some cases, the policies you are obligated to follow based on law or contract constrain your choice of region for S3 data storage. Also keep in mind that the physical location of your S3 bucket could affect what governments are able to legally compel Amazon to release your data (although such occurrences are fairly rare). [25]
4. Investigate whether cross-region replication makes sense for your bucket. [26] Cross-region replication automatically syncs updates to data in one bucket with the data in buckets in other regions. The change may not happen immediately, and large file changes in particular are constrained by bandwidth limitations between regions. Keep in mind the following pros and cons of cross-region replication; a configuration sketch follows the list. [5]
- You pay more in S3 storage costs, because the same data is mirrored across multiple regions.
- You pay in S3 <-> S3 data transfer costs. However, if the data is being read or written by EC2 instances in multiple regions, this might be offset by savings in the S3 -> EC2 data transfer costs. The main way this can help is if you are loading the same S3 data into EC2 instances in many different regions. For instance, suppose you have 100 instances each in US East and US West where you need to load the same data from a S3 bucket in US West. If you do not replicate this bucket in US East, you pay for the transfer cost of the 100 data transfers from the S3 bucket to the US East machines. If you replicate the bucket in US East, you pay only once for the data transfer costs.
- Cross-region replication thus makes a lot of sense for executables, scripts, and relatively static data, where you value cross-region redundancy, where updates to the data are infrequent, and where most of the data transfer is in the S3 -> EC2 direction. Another advantage is that if this data is replicated across regions, it's much faster to spin up new instances, enabling more flexible EC2 instance architectures.
- For logging applications (where data is being read by many frontend instances and needs to be logged in a central location in S3) it is better to use a service such as Kinesis to collate data streams across regions rather than use cross-region replicated S3 buckets.
- If you are using S3 for live-serving of static images on a website, cross-region replication may make sense if your website traffic is global and rapid loading of images is important.
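For reference, here is a rough sketch of what enabling replication looks like with boto3. It assumes, as placeholders, a source and destination bucket that both already exist with versioning enabled, and an IAM role that allows S3 to replicate on your behalf; treat the rule shape as illustrative rather than a definitive configuration.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/my-replication-role",  # placeholder ARN
        "Rules": [
            {
                "ID": "replicate-static-assets",
                "Prefix": "static/",   # replicate only this prefix
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
            }
        ],
    },
)
```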
5. If syncing regular updates to already existing files, choose a folder structure that allows the use of the AWS CLI's sync feature.
- The "aws s3 sync" command behaves like rsync, but can only be run at the level of a folder. Therefore, keep your folder structure such that you can use this command.
6. Keep in mind the following heuristics for estimating transfer costs.
- For a live-serving static website, monthly data transfer out, without CDN, is equal to total traffic times size of each page visited (including images and other resources loaded on the page). For instance, for a million pageviews and an average page size of 100 KB, the total data out is 100 GB, costing $9 per month.
- For a live-serving static website behind a CDN, the CDN imposes an upper bound on the total data transfer out. Specifically, if you don't update the data at all, so that the CDN serves its own cache, the total data transfer out is bounded by the product of your website's total size and the number of edge locations of the CDN, regardless of traffic volume. For instance, if your site has a total of 1000 pages of 100 KB each, the total size is 100 MB. If there are 100 edge locations, that gives a total data transfer out limit of 10 GB per month, or a cost limit of 90 cents per month. However, if you update some of the files, you have to count each file again after each update.
- The extent to which CDNs save relative to having no CDN depends on the diversity of access to your content and also on the geographic spread of access. If your content is accessed in one geographic region, you will save more. If people access a small number of pages on your site, you will save more. If, within each region, people access a small number of pages on your site (even if the pages differ by region), you will save more. CDN savings can range from 50% to 99%. [8]
Optimizing Cost Due to Request Pricing
1. If request pricing is a significant concern, keep your data in standard storage. See Part 3, Step 5 for more information.
2. If live-serving a static site or static images or video through S3, put it behind a CDN. This is for the same reasons as those discussed in Part 4, Step 1.
3. If using S3 as a data store for key-value lookup, trade off PUT request pricing against data transfer pricing when determining the sizes of the files into which you shard your data. A sketch of the trade-off follows this list.
- If you partition the data into a large number of small files, then you need a large number of PUTs to insert the data, but each lookup is faster and uses less data transfer, since you need to read a smaller file from S3.
- On the other hand, if you partition the data into a small number of large files, then you need a small number of PUTs, but each access costs a lot in data transfer cost (as you need to read a large file).
- The trade-off usually happens somewhere in the middle. Mathematically, the number of files you should use is the square root of the ratio of a data transfer cost term to a PUT cost term.
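To spell out the square-root rule: if data of total size D (in GB) is split into N equal files and each lookup reads one whole file, the cost over one refresh cycle of the data is roughly N times the PUT price plus (lookups) × (D / N) times the transfer price, which is minimized at N = sqrt(lookups × D × transfer price / PUT price). A small sketch using the earlier US-region prices follows; all the numbers are illustrative assumptions.

```python
import math

def optimal_file_count(data_gb, lookups_per_refresh,
                       put_cost=0.005 / 1000, transfer_cost_per_gb=0.09):
    # Minimizes N * put_cost + lookups * (data_gb / N) * transfer_cost_per_gb over N.
    return math.sqrt(lookups_per_refresh * data_gb * transfer_cost_per_gb / put_cost)

# Example: 50 GB of data, 100,000 lookups between full refreshes of the data.
print(round(optimal_file_count(50, 100_000)))  # ~300,000 files, i.e. ~170 KB per file
```

In practice you would round this toward the mid-sized-file guidance in the next steps, since very large numbers of tiny files also hurt listing and processing performance.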
4. In general, using a smaller number of medium-sized files is better for a data lake.
5. If you are subdividing data across files, use a small number of mid-sized files (somewhere between 1 MB and 100 MB) to minimize request pricing and overhead.
- A smaller number of larger files reduces the number of requests needed to retrieve and load the data, as well as to write the data.
- Since there is a small amount of latency associated with each file read, distributed computing processes (such as Hadoop-based or Apache Spark-based processes) that read files will generally proceed faster with a small number of mid-sized files than with a large number of small files.
- The fewer your overall number of files, the less costly it is to run queries that try to match arbitrary regular expressions.
- An important caveat is that, in many cases, the natural output is a large number of small files. This is true for the outputs of distributed computing workloads, where each node in the cluster computes and outputs a small file. It is also true if data is being written out in real time and you want to write it out within a short time interval. If you expect to read and process this data repeatedly, consider coalescing it into larger files (a coalescing sketch follows this list). Also, for data coming in in real time, consider using streaming services such as Kinesis to collate data before writing it out to S3.
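Here is a minimal boto3 sketch of coalescing the small objects under one placeholder prefix into a single larger object; for very large inputs you would stream or use multipart upload rather than buffering everything in memory.

```python
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-example-bucket", "events/2020/12/04/"

# Read every small object under the prefix and concatenate the contents.
parts = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        parts.append(s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read())

# Write one combined object under a separate prefix (so re-runs don't re-read it).
s3.put_object(Bucket=bucket, Key="events-coalesced/2020-12-04.log", Body=b"".join(parts))
```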
6. If you see large unexpected request costs, look for rogue processes that are doing regex matching. Make sure that any regex matching keeps wildcards as near the end of the expression as possible (see the sketch below).
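The reason is that S3 can only narrow a listing by a literal key prefix; everything after the first wildcard has to be filtered client-side, after paying LIST requests for all the keys enumerated. A small sketch follows; the bucket, prefixes, and patterns are placeholders.

```python
import fnmatch
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

def matching_keys(bucket, prefix, pattern):
    # Only the literal prefix limits what S3 enumerates (and what you pay LIST
    # requests for); the wildcard pattern is applied client-side.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if fnmatch.fnmatch(obj["Key"], pattern):
                yield obj["Key"]

# Cheap: the wildcard is at the end, so the prefix does most of the narrowing.
list(matching_keys("my-example-bucket", "logs/2020/12/", "logs/2020/12/*.gz"))
# Expensive: the wildcard is in the middle, so everything under logs/ gets listed.
list(matching_keys("my-example-bucket", "logs/", "logs/*/12/errors.gz"))
```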
7. Keep in mind the following heuristics for request costs.
- Request costs should be between 0% and 20% of storage costs. If they are higher, consider whether you are using the right storage class, sharding the data into the right sizes, or doing unnecessary or inefficient operations. Also check for unnecessary lifecycle transitions as well as rogue regex-matching processes.
- Request costs should be less than transfer costs if your data is primarily being shipped outside AWS. (If your data stays within the same AWS region, there are no data transfer costs, so this heuristic does not apply: request costs will be positive while transfer costs will be zero.)
Monitoring and Debugging
1. Set up monitoring for your S3 costs. An alarm sketch follows this list.
- Your AWS account has access to the billing data that provides the full breakdown of costs. Set up a billing alert so that the data starts getting sent to Amazon CloudWatch; you can then set up more alerts using CloudWatch. [27] CloudWatch data comes in as data points every few hours, but does not include a detailed breakdown along all the dimensions of interest.
- At any time, you can download a detailed breakdown by hour and service type from your root account. This data is usually 24-48 hours late, i.e., you will not see information for the most recent 24-48 hours. For S3, you can download the data in spreadsheet or CSV format, broken down by hour, bucket, region, and operation type (GET, POST, LIST, PUT, DELETE, HEADOBJECT, or whatever your operations are).
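Here is a sketch of one such alert with boto3: a CloudWatch alarm on the estimated S3 charges metric. It assumes billing alerts are already enabled on the account (the billing metrics live in us-east-1), and the SNS topic ARN and the $100 threshold are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="s3-monthly-estimated-charges",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[
        {"Name": "Currency", "Value": "USD"},
        {"Name": "ServiceName", "Value": "AmazonS3"},
    ],
    Statistic="Maximum",
    Period=21600,            # 6 hours; billing data only arrives every few hours
    EvaluationPeriods=1,
    Threshold=100.0,         # alert once estimated S3 charges exceed $100
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
)
```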
2. Write scripts to give easy-to-read daily reports of your costs, broken down in various ways. A reporting sketch follows this list.
- At the high level, you may wish to report a breakdown of your costs between storage, transfer, and request pricing.
- Within each of these, you may want to break costs down further based on the storage class (Standard, RRS, IA, and Glacier).
- Within request pricing, you may want to break down costs by the operation type (GET, POST, LIST, PUT, DELETE, HEADOBJECT, or whatever your operations are).
- You can also provide a breakdown by bucket.
- As a general rule, you need to decide the number of dimensions you drill down by trading off ease of quick understanding against sufficient granularity. A generally good tradeoff is to include drilldowns along one dimension at a time (e.g., one drilldown by bucket, one drilldown by storage vs. transfer vs. request pricing, one drilldown by storage class) in your daily report, and drill down further only if something seems out of the ordinary.
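A sketch of such a rollup over the detailed billing CSV. The column names used here ("ProductName", "UsageType", "ResourceId", "UnBlendedCost") are assumptions based on the legacy detailed billing report; adjust them to whatever headers your report actually contains.

```python
import csv
from collections import defaultdict

by_bucket = defaultdict(float)
by_usage_type = defaultdict(float)

with open("detailed-billing.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row.get("ProductName") != "Amazon Simple Storage Service":
            continue                                  # keep only S3 line items
        cost = float(row.get("UnBlendedCost") or 0)
        by_bucket[row.get("ResourceId", "unknown")] += cost      # bucket name
        by_usage_type[row.get("UsageType", "unknown")] += cost   # storage vs. transfer vs. requests

for bucket, cost in sorted(by_bucket.items(), key=lambda kv: -kv[1]):
    print(f"{bucket:40s} ${cost:,.2f}")
for usage_type, cost in sorted(by_usage_type.items(), key=lambda kv: -kv[1]):
    print(f"{usage_type:40s} ${cost:,.2f}")
```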
3. Build an expected cost model and use your script to identify discrepancies between actual costs and your model.
- Without a model of what costs should be, it's hard to take a look at the costs and know if they're wrong.
- The process of building a cost model is a good exercise in clearly articulating your architecture and potentially thinking of improvements even without looking at the pattern of actual costs.
4. Debug high costs.
- If the culprit is storage costs, see Part 3.
- If the culprit is huge data transfer costs, see Part 4.
- If the culprit is request pricing, see Part 5.
Expert Q&A
Tips
- Keep track of your Amazon S3 costs. You cannot optimize what you do not measure. One of the biggest advantages of S3 is that you don't have to think too hard about file storage: you have effectively unlimited storage that is not tied to a particular instance. This offers a lot of flexibility, but it also means you can lose track of how much data you are using and how much it is costing you. Periodically review your S3 costs, and set up AWS billing alerts in Amazon CloudWatch to alert you when S3 costs for a given month exceed a threshold. [28]
- Do not use Amazon S3 for any operations that require latencies of under a second. You may use S3 to initiate instances that run such operations, or for periodic refreshes of the data and configurations on those instances, but don't rely on S3 file GETs for operations where you need to respond within milliseconds. If using S3 for logging, buffer the activities locally on your frontend instance (or write them to a stream such as Kinesis) and then log them periodically to S3.
- Do not use S3 for applications that involve frequent reading and writing of data. Amazon S3 is suited more to medium-term and long-term data logging than to storing data that is rapidly updated and looked up; consider databases (and other data stores) for that. One important thing to remember with S3 is that immediate read/write consistency is not guaranteed: it may take a few seconds after a write for a read to fetch the newly written file. Moreover, you will also see a huge bill if you use S3 this way, because request pricing adds up quickly.
References
1. Amazon S3, Wikipedia
2. Pinterest Architecture Update - 18 Million Visitors, 10x Growth, 12 Employees, 410 TB of Data
3. The Epic Story of Dropbox's Exodus From the Amazon Cloud Empire
4. Amazon S3: The Beginner's Guide
5. Optimizing Costs for S3, Jacek Migdal, SumoLogic
6. Request Rate and Performance Considerations
7. Amazon S3 as Image Hosting
8. About, Hosting section, gwern.net
9. Data lake, Wikipedia
10. Data Lake vs Data Warehouse: Key Differences, KDNuggets
11. Streaming Real-time Data into an S3 Data Lake at MeetMe, Jeff Barr, Amazon Web Services Blog, September 9, 2016
12. S3 Pricing
13. AWS Command Line Interface, retrieved December 4, 2016
14. How to Install s3cmd in Linux and Manage Amazon s3 Buckets, March 10, 2014
15. s4cmd, GitHub (BloomReach), retrieved December 4, 2016
16. AWS SDK for Java, retrieved December 4, 2016
17. Has Amazon S3 ever lost data permanently?
18. Amazon S3 Reduced Redundancy Storage
19. AWS Storage Update – S3 & Glacier Price Reductions + Additional Retrieval Options for Glacier, Jeff Barr, Amazon Web Services Blog, November 21, 2016
20. AWS S3 pricing downshifts with new Infrequent Access tier
21. Using Versioning
22. How Dropbox Moved 500PB Of Customer Files Off AWS
23. AWS Snowmobile – Move Exabytes of Data to the Cloud in Weeks, Jeff Barr, November 30, 2016
24. Airbnb Case Study, Amazon Web Services, retrieved December 4, 2016
25. Amazon Web Services: Whitepaper on EU Data Protection, November 2016
26. Cross-Region Replication
27. Monitor Estimated Charges Using Billing Alerts
28. Monitor Estimated Charges Using Billing Alerts