Handling Lookup Tables
While you can deliver your data denormalized (i.e., the data is completely self contained and self-describing), under some circumstances it is necessary to separate the description of the event stream into multiple files: the essential log file (raw data), and supporting "lookup tables".
Lookup tables contain key-value data that Metamarkets uses to expand certain values in the raw data. Typically a code or ID (the key) is mapped to a longer string of characters (the value). This allows for more compact data since only the IDs need to be provided.
Types of Lookup Solutions Supported by Metamarkets
Based on the how often the raw data's supporting "lookup table" values update with time and whether or not your users would like to keep a historic view of how a dimension has changed over time, Metamarkets will work with your team to apply one of our various lookup table solutions.
Static Lookups
Static Lookups, also otherwise known as Ingestion Time Lookups, are lookups that are ingested at processing time. When a record is being processed, if a match is found between the record and lookup's key, the lookup's corresponding value at that moment in time is extracted and carbon-copied into Druid for the records it processed.
Static lookups are best suited for dimensions with values that require you to how it has changed over time, or if the values are never expected to change. We recommend leveraging Dynamic Lookups if the values are expected to change.
Dynamic Lookups
Since static lookups transform and store the data permanently, any changes to the mapping would require reprocessing the entire data set to ensure consistency.
To address the case when values in a lookup are expected to change with time we, developed dynamic lookups. Dynamic Lookups, also known as Query Time Lookups, are lookups that retrieved at query time, as apposed to being used at ingestion time.
Benefits of dynamic lookups include:
- Historical continuity for dimension values that change frequently without requiring reprocessing the entire data set.
- Time savings, because there is no data set reprocessing required to complete the update.
- Dynamic lookups that are kept separate from the data set. Thus, any human errors introduced in the lookup do not impact the underlying data set.
- The ability for users to create new dimension tables from metadata associated with a dimension table. For example, advertiser or publisher account ownerships can change during the course of a quarter. In such cases, a dynamic lookup can be updated on the fly to reflect the most current changes.
Creating and Using Lookup Tables
Entries in the lookup-table file should be similar to the following (note the JSON format).
{"advertiser_id":001,"advertiser_name":"Chilli One"}
{"advertiser_id":002,"advertiser_name":"Squad Two"}
Translating from an identifier to a human-readable value is accomplished by querying the lookup table. This "normalized" representation of the data is more compact.
Note, however, that if you update the event-stream data you may also need to update the lookup tables themselves. You can maintain a static S3 directory containing the lookup tables and update the tables whenever any new fields or mappings are added (e.g., appending new advertiser ID/Name pairs as they are added to the raw data feed). Here is an example of a static S3 directory for a lookup table (in a file in JSON format) to be used on incoming data:
s3://metamx-<your-org>-share/lookups/adv_mapping.json
You must have an S3 bucket that is set up correctly for use with RDI.
As new data is ingested from your S3 Bucket, the lookup table found at the predefined bucket/filepath extension is imported and used for mappings.
Lookup files could become fairly large as more and more values are appended, and this can adversely affect performance.
If the aggregate size of all your static lookup files exceeds 50MB, contact your Metamarkets account manager to discuss strategies for managing table sizes.
If the aggregate size of all your dynamic lookup files exceeds 10MB, contact your Metamarkets account manager to discuss strategies for managing table sizes.
Lookup Tables Requirements Summary
Lookup-table data must meet the following requirements:
- Single key-value pairs are provided as single lines of flat JSON.
- For static lookups, files must contain all of the ID/name pairs that are needed to process any data that was posted to RDI over the trailing 14 days.
- For dynamic lookups, files must contain all of the ID/name pairs for a given dimension, as the entire table will be pulled in at query time
- The size of all static lookup-table files does not exceed 50MB in size.
- Dynamic Lookups require that each row in the lookup have a unique key and value.
- For example, this would not be acceptable for Dynamic Lookups
{"campaignid":"2057","groupid":"6","campaign_name": "Livestrong Women"}
{"campaignid":"2057","groupid":"1", "campaign_name": "Livestrong Men"}
- The lookup would need to have the key as a concatenated string stored in Druid as well as passed in the Dynamic lookup as follows
{"campaignid_group":"2057-6", "campaign_name": "Livestrong Women"}
{"campaignid_group":"2057-1", "campaign_name": "Livestrong Men"}
- If you have a lookup that contains any numeric values that you intend to use for metrics, static lookups must be applied.
- Lookup-table files must be stored in an AWS S3 bucket.
- The data adheres to all other applicable standards, including the ISO standards.
Examples of Lookup Table Data
The following sections give examples of non-time-series data that may be found in lookup tables and what service may be most applicable.
Campaign Metadata
The lookup table maps a campaign ID to a campaign name. Since campaign names are not bound to change with time, this is probably best suited for static lookups.
{"campaign_id":001,"campaign_name":"Troy & Abed in the Morning"}
{"campaign_id":002,"campaign_name":"Senor Chang's Spanish Class"}
Advertiser Names
The lookup table maps an advertiser ID to an advertiser name. Since advertiser names can possibly change with time, i.e: Rebrandings, Acquisitions, etc, then this is probably best suited for dynamic lookups.
{"advertiser_id":001,"advertiser_name":"Chilli One"}
{"advertiser_id":002,"advertiser_name":"Squad Two"}
Audience Segments
The lookup table maps a segment ID to demographic data. Since audience segment names are not are not bound to change with time, this is probably best suited for static lookups.
{"segment_id":001,"segment_name": "Up-and-Comers"}
{"segment_id":002,"segment_name": "Urban Elders"}
Account Manager & Account Tier Metadata
The lookup table maps account managers to their associated accounts and the strategic tier it falls under. Since accounts can be shifted around by account managers, as well as the tier of the account could change, a dynamic lookup would be best suited for this use case to maintain historical continuity.
{"publisher_id":001,"account_manager_name":"Troy Barnes", "Tier": 1}
{"publisher_id":002,"account_manager_name":"Abed Nadir", "Tier": 2}
Setting Up an S3 Bucket
If you plan to use a lookup table with RDI, you must store the table data in an Amazon S3 bucket that you set up. Follow the instructions below to correctly set up the S3 bucket and authenticate Metamarkets with read-only access.
-
Go to http://aws.amazon.com to create an account.
-
Login to your AWS account on the AWS Management Console and navigate to the S3 tab.
-
Create a new bucket, in the US Standard S3 region, named
metamx--share
, where is your organization name.
Make sure the bucket name is a DNS-compliant address. For example, the name should not include any underscores; use hyphens instead, as shown in the example above. Be sure to disable the Requester Pays feature. For more information on S3 bucket-name guidelines, see the AWS documentation. -
Once you have created a bucket, click the "Properties" icon on the upper-right menu to bring up its Properties.
-
Click "Add Bucket Policy" in this Properties window. This will bring up a window entitled "Bucket Policy Editor".
-
Cut and paste the S3 bucket read-only policy shown below (editing it to include your bucket name where appropriate) into the text edit box of the Bucket Policy Editor window. Click "Save".
{
"Version": "2008-10-17",
"Id": "Metamarkets-Ingestion-Bucket-Access-c31b3dd9-68df-470f-a0d3-9df6c0d31b21",
"Statement": [
{
"Sid": "Metamarkets Ingestion Bucket List",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::906138931002:user/ingestion",
"arn:aws:iam::354387701946:role/aws-prod-ingest-role",
"arn:aws:iam::906138931002:root"
]
},
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::metamx--share"
},
{
"Sid": "Metamarkets Ingestion Bucket List",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::906138931002:user/ingestion",
"arn:aws:iam::354387701946:role/aws-prod-ingest-role",
"arn:aws:iam::906138931002:root"
]
},
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::metamx--share/*"
}
]
}
Note: If the data is uploaded into the S3 bucket by an AWS IAM account different from the
account that owns the bucket, Metamarkets may not be able to access that data. The simplest way to avoid this problem is to upload the data from the bucket owner account (or an IAM user under that account).
Read-Write AWS Bucket Policy
If your Metamarkets account manager has indicated that you must use a read-write policy, please use the permissions below.
{
"Version": "2008-10-17",
"Id": "Metamarkets-Ingestion-Bucket-Access-c31b3dd9-68df-470f-a0d3-9df6c0d31b21",
"Statement": [
{
"Sid": "Metamarkets Ingestion Bucket Access",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::906138931002:user/ingestion",
"arn:aws:iam::354387701946:role/aws-prod-ingest-role",
"arn:aws:iam::906138931002:root"
]
},
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::metamx--share"
},
{
"Sid": "Metamarkets Ingestion Bucket List",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::906138931002:user/ingestion",
"arn:aws:iam::354387701946:role/aws-prod-ingest-role",
"arn:aws:iam::906138931002:root"
]
},
"Action": [
"s3:DeleteObject",
"s3:GetObject",
"s3:PutObject",
"s3:GetObjectAcl",
"s3:PutObjectAcl"
],
"Resource": "arn:aws:s3:::metamx--share/*"
}
]
}
Updated about 7 years ago