Batch Data Delivery
The Metamarkets batch ingestion system is built to continuously ingest and process event data via Amazon's Simple Storage Service (S3). Clients upload their event streams on an ongoing basis to an S3 bucket that Metamarkets can access. The typical data delivery cadence for batch ingestion is hourly uploads, although delivery can occur more frequently, in some cases once every 15 minutes. Metamarkets begins to load and process the incoming data as soon as it appears in the client's S3 bucket (the "raw store") and loads the processed data into Druid, our distributed data store. Once the data is loaded into our Druid cluster, it is available for clients to explore on the dashboard.
Setting Up For Batch Ingestion
- Data must be delivered in hour-aligned directories.
- Data must be in JSON format and gzipped.
- Data of similar content (e.g., impressions) should be merged into the same directory.
- Data posted in an hourly folder should contain only events with timestamps within that hour.
- It is important to find a balance between the size of individual files and the overall number of files per hour-aligned directory. While extremely large files can be troublesome to deal with, having too many files can introduce excessive upload latencies as each file has to be referenced, retrieved, and loaded separately. We suggest individual file sizes of around 50-100MB (compressed), which typically corresponds to roughly 50k records/events per file.
Formatting Your Data
Files should be plain-text JSON with one rolled-up record per line, encoded in UTF-8. Files should be gzip-compressed (".gz"). Compressed files allow for faster ingestion and processing of log files, and also reduce your storage and delivery costs. Filenames should indicate the compression format used (e.g., "filename.gz").
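As a concrete illustration, here is a minimal Python sketch that writes events as UTF-8, newline-delimited JSON and gzip-compresses the output, splitting into files of roughly 50,000 records each in line with the sizing guidance above. The function name, output filenames, and event fields are hypothetical.

import gzip
import json

def write_batch(events, prefix="impressions", records_per_file=50000):
    # Split the hour's events into gzipped files of ~50,000 records each.
    for i in range(0, len(events), records_per_file):
        chunk = events[i:i + records_per_file]
        filename = "%s-%05d.gz" % (prefix, i // records_per_file)
        with gzip.open(filename, "wt", encoding="utf-8") as f:
            for event in chunk:
                f.write(json.dumps(event) + "\n")  # one JSON record per line

# Hypothetical usage with a single example record
write_batch([{"timestamp": "2015-10-17T09:00:00Z", "impressions": 1}])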
For formatting practices, see the Integration Guide for Exchanges or the Integration Guide for DSPs, or speak to your Data Engineer about anything custom.
Delivering Your Data
Once your files are ready, they can be delivered to the batch ingestion system. Before you send the files:
- Set up a specially configured S3 bucket.
- Understand the specific directory structure for data files.
- Know the delivery frequency specified during your initial setup.
These requirements are described in the following sections. The Metamarkets batch ingestion and processing layer is built to receive data through Amazon S3. Follow the instructions below to set up an S3 bucket and grant Metamarkets read-only access to it.
- Go to http://aws.amazon.com to create an account.
- Log in to your AWS account on the AWS Management Console and navigate to the S3 tab.
- Create a new bucket in the US Standard S3 region named "metamx-NAME-share", where NAME is your organization's name. Make sure the bucket name is a DNS-compliant address; for example, it should not include any underscores (use hyphens instead, as in the name above). Be sure to disable the "Requester Pays" feature. For more information on S3 bucket naming guidelines, see the AWS documentation.
- Once you have created the bucket, click the "Properties" icon in the upper-right menu to bring up its properties.
- Click "Add Bucket Policy" in the "Properties" window. This brings up a window entitled "Bucket Policy Editor".
- Copy and paste the S3 bucket read-only policy shown below (editing it to include your bucket name where appropriate) into the text box of the Bucket Policy Editor window, then click "Save".
{
  "Version": "2008-10-17",
  "Id": "Metamarkets-Ingestion-Bucket-Access-c31b3dd9-68df-470f-a0d3-9df6c0d31b21",
  "Statement": [
    {
      "Sid": "Metamarkets Ingestion Bucket List",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::906138931002:user/ingestion",
          "arn:aws:iam::354387701946:role/aws-prod-ingest-role",
          "arn:aws:iam::906138931002:root"
        ]
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::metamx-NAME-share"
    },
    {
      "Sid": "Metamarkets Ingestion Bucket Get",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::906138931002:user/ingestion",
          "arn:aws:iam::354387701946:role/aws-prod-ingest-role",
          "arn:aws:iam::906138931002:root"
        ]
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::metamx-NAME-share/*"
    }
  ]
}
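If you prefer to script the bucket setup rather than use the console, the sketch below shows roughly equivalent boto3 calls. It assumes your AWS client is configured for us-east-1 (US Standard), that "metamx-yourorg-share" is a placeholder for your bucket name, and that the policy above has been saved locally as policy.json; adjust as needed.

import boto3

s3 = boto3.client("s3")
bucket = "metamx-yourorg-share"  # placeholder; use your own "metamx-NAME-share" bucket

# US Standard corresponds to us-east-1, so no LocationConstraint is needed here.
s3.create_bucket(Bucket=bucket)

# Attach the read-only bucket policy shown above (saved locally as policy.json).
with open("policy.json") as f:
    s3.put_bucket_policy(Bucket=bucket, Policy=f.read())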
S3 Directory Structure
Incoming data files need to be organized according to the time frame (in UTC) of the events they contain. We use a hierarchical folder convention that separates years, months, days, and hours. This structure is required for several reasons, including joins between data feeds, data reprocessing, and debugging. The following example shows three uploaded client files using this folder structure:
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=09/file1ab.gz
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=09/file224.gz
s3://metamx-yourorg-share/data/auctions/y=2015/m=10/d=17/H=10/fileZ.gz
The month, day, and hour in the directory name should always be two digits (zero-padded when single-digit). Note that files compressed with gzip have a ".gz" suffix. By looking at the directory structure, we can tell that these files correspond to events that took place in 2015, on October 17. More specifically, two of the files hold events for hour 09 (UTC) and the third file corresponds to events for hour 10 (UTC) of that day. Structuring data with this layout makes it easy to navigate and troubleshoot issues according to the timeframe of interest.
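As an illustration, the hour-aligned key can be derived from an event's UTC timestamp. The sketch below shows one way to do this; the feed name, bucket layout, and helper function are assumptions matching the example paths above.

from datetime import datetime, timezone

def hourly_key(feed, event_time, filename):
    # Build an hour-aligned, zero-padded S3 key from a UTC event timestamp.
    t = event_time.astimezone(timezone.utc)
    return "data/%s/y=%04d/m=%02d/d=%02d/H=%02d/%s" % (
        feed, t.year, t.month, t.day, t.hour, filename)

# Prints: data/auctions/y=2015/m=10/d=17/H=09/file1ab.gz
print(hourly_key("auctions", datetime(2015, 10, 17, 9, 30, tzinfo=timezone.utc), "file1ab.gz"))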
If your log files are already in S3, use a simple bucket-to-bucket server-side copy. The simple s3cmd utility is generally the best bet. If you are transmitting data from a local Hadoop DFS to S3, we recommend either using the distcp utility or just dumping the processed data using the "s3n" file system (see customizations for Amazon S3).
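For a bucket-to-bucket server-side copy, the boto3 sketch below copies a single object without downloading it; the source bucket and key names are placeholders, and s3cmd or distcp remain equally valid options.

import boto3

s3 = boto3.client("s3")

# Server-side copy: S3 copies the object directly between buckets.
s3.copy_object(
    Bucket="metamx-yourorg-share",                                 # destination bucket
    Key="data/auctions/y=2015/m=10/d=17/H=09/file1ab.gz",          # destination key
    CopySource={"Bucket": "your-source-bucket",                    # placeholder source
                "Key": "logs/auctions/2015-10-17-09/file1ab.gz"},
)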
S3 Logging
As part of your data uploads, Metamarkets requires that logging be enabled on the bucket holding the data you intend to send to us for processing. Metamarkets may use these logs for troubleshooting, pipeline optimization, and/or billing. To allow Metamarkets to access these logs, follow the instructions below to set up the proper configuration.
- If you have not enabled logging for the bucket Metamarkets is ingesting data from, follow these steps (a scripted alternative appears after this list):
  - Go to http://aws.amazon.com
  - Log in to your AWS account on the AWS Management Console and navigate to the S3 tab
  - Click the "Properties" icon in the upper-right menu to bring up the "Properties" options for the bucket Metamarkets is ingesting data from
  - Under "Logging", click "Enabled"
  - Under "Target Bucket", select the name of the bucket Metamarkets is processing data from
  - Under "Target Prefix", make sure "s3logs/" is entered
  - Click "Save"
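If you would rather enable logging programmatically, a minimal boto3 sketch looks like the following; the bucket name is a placeholder. Note that outside the console the target bucket must also be granted log delivery permissions, which the console handles automatically.

import boto3

s3 = boto3.client("s3")
bucket = "metamx-yourorg-share"  # placeholder: the bucket Metamarkets ingests data from

# Deliver S3 access logs to the same bucket under the "s3logs/" prefix.
s3.put_bucket_logging(
    Bucket=bucket,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": bucket,
            "TargetPrefix": "s3logs/",
        }
    },
)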
- Log in to your AWS Management Console and, under Services, select CloudFormation. Note: the person implementing this configuration must have permission to manage CloudFormation and IAM resources.
- Click "Create Stack" and, under "Choose a template", either upload the template provided by Metamarkets or use the following S3 URL: https://s3.amazonaws.com/metamx-aws-prod-client-templates/MetamarketsS3LogRole_v2.template
- The template allows you to set the following parameters:
  - "Stack name": name the stack metamx-your-org-s3Logs
  - "S3LogBucketName" (description: "New or existing bucket for S3 logs")
- Acknowledge that the template will create IAM resources, then click "Create".
- Once the stack has been created, send your Metamarkets representative the values shown on the "Outputs" tab (a scripted alternative to this CloudFormation setup follows this list):
  - RoleARN: the IAM role identifier that allows Metamarkets to ingest the S3 logs
  - S3BucketName: the name of the S3 bucket the Metamarkets role can access; it should match the bucket name you entered in CloudFormation
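The CloudFormation steps above can also be scripted. The sketch below creates the stack from the template URL above and prints the Outputs to forward to Metamarkets; the stack name and bucket name are placeholders, and the exact parameter keys depend on the template you were given.

import boto3

cf = boto3.client("cloudformation")
stack_name = "metamx-your-org-s3Logs"  # placeholder stack name

cf.create_stack(
    StackName=stack_name,
    TemplateURL="https://s3.amazonaws.com/metamx-aws-prod-client-templates/MetamarketsS3LogRole_v2.template",
    Parameters=[
        {"ParameterKey": "S3LogBucketName", "ParameterValue": "metamx-yourorg-share"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # acknowledges that the template creates IAM resources
)

# Wait for creation to finish, then print the Outputs (RoleARN, S3BucketName).
cf.get_waiter("stack_create_complete").wait(StackName=stack_name)
stack = cf.describe_stacks(StackName=stack_name)["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])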
Note: You can optionally set a lifecycle policy on the log bucket. You can either leave it unset (no lifecycle policy) or set it to 35 days at minimum. This can be applied through the S3 console on the respective bucket. Reach out to your Technical Account Manager or Sales Director if you have any questions.
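If you do want a lifecycle policy on the log bucket, a boto3 sketch is shown below; the bucket name and rule ID are placeholders, the "s3logs/" prefix matches the logging setup above, and 35 days is the minimum retention mentioned above.

import boto3

s3 = boto3.client("s3")

# Expire S3 access logs under "s3logs/" after 35 days (the suggested minimum).
s3.put_bucket_lifecycle_configuration(
    Bucket="metamx-yourorg-share",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-s3-logs",
                "Filter": {"Prefix": "s3logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 35},
            }
        ]
    },
)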
Changing the Schema
To avoid errors, the names of existing fields should not be changed. However, deprecated fields can be given an empty (or null) value, or be omitted from records entirely.