Pre-Aggregated Batch Data Delivery
The Metamarkets batch ingestion system continuously ingests and processes rolled up event data via Amazon's Simple Storage Service (S3). Clients upload their event streams on an ongoing basis to an S3 bucket that Metamarkets can access. The typical delivery cadence for batch ingestion is hourly uploads, although clients range from delivering as often as once every 15 minutes to as infrequently as once per day. Metamarkets begins to load and process the incoming data as soon as it appears in the client's S3 bucket (the "raw store") and loads the processed data into Druid, our distributed data store. Once the data is loaded into our Druid cluster, it is available for clients to explore on the dashboard.
Setting Up For Batch Ingestion
Data should be delivered in hour-aligned directories. Files posted in an hourly folder should contain only records with timestamps within that hour. It is important to find a balance between the size of individual files and the overall number of files per hour-aligned directory. While extremely large files can be troublesome to deal with, having too many files can introduce excessive upload latencies because each file must be referenced, retrieved, and loaded separately. We highly recommend individual file sizes of around 50MB (compressed), which typically corresponds to roughly 50k records/events per file.
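As a rough illustration of that sizing guidance, the sketch below (Python; the helper name and the records iterable are assumptions for illustration, not part of the delivery spec) splits one hour's worth of rolled up records into chunks of about 50,000 records so that each compressed file lands near the recommended size:

import itertools

# Roughly 50k records per file, which typically compresses to ~50MB (per the guidance above).
RECORDS_PER_FILE = 50_000

def chunk_records(records, chunk_size=RECORDS_PER_FILE):
    """Yield lists of at most chunk_size rolled up records from an iterable."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk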
Formatting Your Data
Files should be in plain-text JSON format with one rolled up record per line. Be sure to use UTF-8 encoding. Files should be compressed with gzip; compression allows for faster ingestion and processing of log files, in addition to reducing your storage and delivery costs. Filenames should indicate the compression format used ("filename.gz").
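As a minimal sketch of this file format (the function name and file path are illustrative), the following Python snippet writes rolled up records as UTF-8, newline-delimited JSON inside a gzip-compressed file:

import gzip
import json

def write_rollup_file(records, path="rollups-0001.gz"):
    """Write one JSON record per line, UTF-8 encoded, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")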
The rolled up JSON data should contain only top level name/value pairs and follow standard formatting practices. Below is an example of a rolled up record:
{
  "timestamp": "2014-03-05T04:58:23.200Z",
  "cnt": 100,
  "app_name": "Metamarkets",
  "bundle": "com.example.app",
  "country": "USA",
  "device_name": "iPhone",
  "winning_bidder": "John",
  "bid_cnt": 200,
  "bid_sum": 1000
}
Note: The record above is "pretty printed" for readability. However, records must be delivered to the Metamarkets batch ingestion system in flat format, one record per line, with each record separated from the next by a newline character.
Two fields are required in every record: "timestamp" (any time within the hour) and "cnt", a simple count of the total records that were rolled up into that permutation. Any other metrics that are important for that permutation, such as bid count, impression count, and revenue, should be summed. On the Metamarkets dashboard, your Account Manager can build additional, derived metrics from these base counts, such as eCPM or Fill Rate.
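To make the roll-up concrete, here is a minimal sketch (in Python; the dimension and metric names are assumed from the example record above, not a prescribed schema) that collapses raw events into one record per dimension permutation, counting them into "cnt" and summing the remaining metrics:

from collections import defaultdict

DIMENSIONS = ("app_name", "bundle", "country", "device_name", "winning_bidder")  # illustrative
METRICS = ("bid_cnt", "bid_sum")  # illustrative

def roll_up(events, hour_timestamp):
    """Collapse raw events into one record per dimension permutation for the given hour."""
    totals = defaultdict(lambda: {"cnt": 0, **{m: 0 for m in METRICS}})
    for event in events:
        key = tuple(event.get(d) for d in DIMENSIONS)
        totals[key]["cnt"] += 1
        for m in METRICS:
            totals[key][m] += event.get(m, 0)
    for key, sums in totals.items():
        yield {"timestamp": hour_timestamp, **dict(zip(DIMENSIONS, key)), **sums}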
Delivering Your Data
Once your files are ready, they can be delivered to the batch ingestion system. Before you send the files:
- Set up a specially configured S3 bucket.
- Understand the specific directory structure for data files.
- Know the delivery frequency specified during your initial setup.
These requirements are described in the following sections. The Metamarkets batch ingestion and processing layer is built to receive data through Amazon S3. Follow the instructions below to set up an S3 bucket and grant Metamarkets read-only access.
- Go to http://aws.amazon.com to create an account.
- Log in to your AWS account on the AWS Management Console and navigate to the S3 tab.
- Create a new bucket, in the US Standard S3 region, named "metamx-NAME-share", where "NAME" is your organization's name. Make sure the bucket name is a DNS-compliant address. For example, the name should not include any underscores; use hyphens instead, as shown in the example above. Be sure to disable the "Requester Pays" feature. For more information on S3 bucket name guidelines, see the AWS documentation.
- Once you have created a bucket, click the "Properties" icon on the upper-right menu to bring up its "Properties".
- Click "Add Bucket Policy" in this "Properties" window. This will bring up a window entitled "Bucket Policy Editor".
- Cut and paste the S3 bucket read-only policy shown below (editing it to include your bucket name where appropriate) into the text edit box of the Bucket Policy Editor window. Click "Save".
{
  "Version": "2008-10-17",
  "Id": "Metamarkets-Ingestion-Bucket-Access-c31b3dd9-68df-470f-a0d3-9df6c0d31b21",
  "Statement": [
    {
      "Sid": "Metamarkets Ingestion Bucket List",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::906138931002:user/ingestion",
          "arn:aws:iam::354387701946:role/aws-prod-ingest-role",
          "arn:aws:iam::906138931002:root"
        ]
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::metamx--share"
    },
    {
"Sid": "Metamarkets Ingestion Bucket List",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::906138931002:user/ingestion",
"arn:aws:iam::354387701946:role/aws-prod-ingest-role",
"arn:aws:iam::906138931002:root"
]
},
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::metamx--share/*"
}
]
}
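If you prefer to script this setup rather than use the console, the boto3 sketch below applies the same policy. It is only a sketch, assuming the policy above has been saved to a local JSON file and that your AWS credentials are allowed to modify the bucket; the bucket name and file path are placeholders.

import json
import boto3

def attach_read_only_policy(bucket_name="metamx-yourorg-share",
                            policy_path="metamx-bucket-policy.json"):
    """Attach the Metamarkets read-only bucket policy to your share bucket."""
    with open(policy_path, encoding="utf-8") as f:
        policy = json.load(f)
    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))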
S3 Directory Structure
Incoming data files need to be organized according to the time frame (in UTC) of the events they contain. We use a hierarchical folder convention to separate years, months, days, and hours. This hierarchical folder structure is required for several reasons, including joining between data feeds, reprocessing data, and debugging. The following example shows three uploaded client files using this folder structure:
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=09/file1ab.gz
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=09/file224.gz
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=10/fileZ.gz
The month, day, and hour in the directory name should always be two digits (zero-padded when single digit). Note that files compressed with gzip have a ".gz" suffix. By looking at the directory structure, we can tell that these files correspond to events that took place on October 17, 2015. More specifically, two of the files hold events for hour 09 (UTC) and the third file holds events for hour 10 (UTC) of that day. Structuring data with this layout makes it easy to quickly navigate and troubleshoot issues according to the timeframe of interest.
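The helper below sketches how such hour-aligned key prefixes can be generated from an event's UTC timestamp; the "auctions" feed name is taken from the example above, and the function itself is illustrative rather than required.

from datetime import datetime, timezone

def hourly_prefix(event_time: datetime, feed="auctions"):
    """Return the hour-aligned key prefix, e.g. "data/auctions/y=2015/m=10/d=17/H=09/"."""
    t = event_time.astimezone(timezone.utc)
    return f"data/{feed}/y={t:%Y}/m={t:%m}/d={t:%d}/H={t:%H}/"

# hourly_prefix(datetime(2015, 10, 17, 9, 30, tzinfo=timezone.utc))
# -> "data/auctions/y=2015/m=10/d=17/H=09/"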
If your log files are already in S3, use a bucket-to-bucket server-side copy; the s3cmd utility is generally the best bet. If you are transmitting data from a local Hadoop DFS to S3, we recommend either using the distcp utility or dumping the processed data using the "s3n" file system (see customizations for Amazon S3).
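If you would rather script the server-side copy than use s3cmd, a boto3 sketch looks roughly like this (the bucket names and key are placeholders):

import boto3

def server_side_copy(src_bucket, key, dest_bucket="metamx-yourorg-share"):
    """Copy an object between S3 buckets without downloading it locally."""
    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket=dest_bucket,
        Key=key,
        CopySource={"Bucket": src_bucket, "Key": key},
    )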
Changing the Schema
To avoid errors, do not rename existing fields. If a field is deprecated, either give it an empty (or null) value or remove it from your records entirely.