Technical Documentation | FAQ

The following are frequently asked questions on the data-ingestion services provided by Metamarkets. For information specific to your use case or for help with any concerns, contact us.

What are the biggest challenges typically seen during integration?

We commonly see clients underestimating the amount of time & work that goes into setting up the data stream on their side. There may be additional work involved to reduce the latency of the data (explained below) or to clean up values in some of the fields.

Can you describe the scenarios under which I'll be uploading or transferring data?

There are three scenarios under which you'll upload data:

Upload from your repository to your Amazon S3 storage: This scenario occurs when you upload lookup tables, initial data for onboarding, and restated data for a backfill.
Transfer from your Amazon S3 storage to Metamarkets storage: You'll allow Metamarkets to read from your S3 storage for purposes such as transferring lookup tables, initial data for onboarding, and restated data for a backfill.
Upload from your repository to Metamarkets real-time ingestion points: Real-time (current) data is streamed from your data source via HTTPS to Metamarkets ingestion points. In some cases, throttled backfill data can also be uploaded this way.

What's the most efficient way to upload or transfer my data?

For AWS services, pricing is controlled by Amazon and is subject to change. You can estimate your costs by looking at the prices Amazon publishes for its S3 service and (if you use Amazon EC2 instances) EC2 service.

One potential way to reduce your costs is to try using the same AWS region as Metamarkets, called "us-east-1". While Amazon does not guarantee that requests for a region will be fulfilled or offered at a reduced rate, this has been found to be an effective way to reduce transfer costs.

Will I have to reformat my data before uploading it?

Your data must follow a schema that is based on industry standards. Metamarkets provides detailed documentation to help you transform your data into a format that follows that schema. In addition, our engineers are experts in answering your questions and providing guidance on meeting requirements for standards-based data formats.

Why can't I just provide my data as is?

The ability to analyze your data in real time depends on that data being ingested, processed, and surfaced in real time. For that reason, the Metamarkets Realtime Data Ingestion (RDI) platform is designed to ingest standards-based data. Conforming to industry standards enables RDI to consistently provide near-instantaneous access to your data, which means you can perform ad-hoc queries on that data seconds after uploading it.

Will I be able to include dimensions with unique values?

Yes. Certain dimensions, such as device ID, contain a very large number of unique values. These dimensions are summarized and presented as a metric to offer a useful view into the data. For example, in the case of device IDs, which can indicate the number of unique users or visitors, it is useful to know the number of unique IDs sent, not the individual IDs themselves. In addition, summarization maintains high performance by reducing the load on the resources that surface your data.
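
As a rough illustration of summarizing a high-cardinality dimension into a metric, the Python sketch below collapses individual device IDs into a count of unique IDs per hour. The field names `timestamp` and `device_id` and the exact set-based counting are assumptions made for this example; they do not describe how Metamarkets computes the metric internally.

```python
from collections import defaultdict

def unique_devices_per_hour(events):
    """Summarize device IDs into a count of unique IDs per hour bucket."""
    seen = defaultdict(set)
    for event in events:
        # Truncate an ISO 8601 timestamp such as 2014-03-05T04:58:23.000Z
        # to its hour prefix, e.g. 2014-03-05T04.
        hour = event["timestamp"][:13]
        seen[hour].add(event["device_id"])
    # Only the counts are kept; the individual IDs are discarded.
    return {hour: len(ids) for hour, ids in seen.items()}

# Hypothetical sample events: device "a" appears twice in the same hour.
sample_events = [
    {"timestamp": "2014-03-05T04:58:23.000Z", "device_id": "a"},
    {"timestamp": "2014-03-05T04:59:01.000Z", "device_id": "a"},
    {"timestamp": "2014-03-05T04:59:30.000Z", "device_id": "b"},
    {"timestamp": "2014-03-05T05:00:10.000Z", "device_id": "a"},
]
hourly_uniques = unique_devices_per_hour(sample_events)
```

The result holds one small number per hour regardless of how many distinct IDs were sent, which is why the summarized form is cheaper to surface than the raw dimension.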

How many times do you process the data?

Metamarkets processes every data stream twice for accuracy: once in real time and once in batch. Real-time ingestion allows data to be surfaced in the Metamarkets dash in a matter of seconds. With batch ingestion, surfacing information can take hours. In addition, real-time ingestion allows clients to stream data directly into the Metamarkets Realtime Data Ingestion service, while batch ingestion requires loading data into a client-owned AWS S3 bucket.

Real-time processing: As data comes in, MMX holds events for 20 minutes for any joined streams (such as impressions) before releasing them to be loaded into the Druid database. A second method of processing handles any events that arrive some time after the Auction occurs.
Batch processing: Because some events may not arrive within the window above, we run a second, batch Hadoop job to process any late events. These batch jobs generally run on a 6-12 hour delay after the auction and can be configured to wait longer. The join window for this job is a much wider 2 hours, depending on setup. Working in tandem, these two systems ensure that the data is both available quickly and accurate in the long term.

How long does Metamarkets retain data for?

Metamarkets retains raw data sent to endpoints for 14 days before purging. Processed data seen on dashboards is retained for the length of the retention period defined in the contract.

Can data be batched to the ingestion endpoints?

Normally, you will stream data to a real-time ingestion endpoint. This endpoint is not designed to batch-process data. Under certain circumstances, such as when backfilling historical data, batching data to a specially provided ingestion endpoint will be necessary. Your Metamarkets account manager will provide guidance as to when to use batch uploads.

How will I know that ingested data is visible in my Metamarkets dash?

You can use the Metamarkets dashboard to query the time periods for which you have provided data. If data you expect to see does not appear, it most likely arrived with a timestamp outside the allowable window. Three different windows are important when sending data to Metamarkets:

Time from Present: Writing data to Druid in real time requires a time boundary so that processing can run continuously. Because of this, we can only load events in real time if they are within 30 minutes of the present time, which includes a few minutes of processing on the MMX side.
Time for Joins: If multiple streams of data are being sent, such as Auctions, Impressions, and Clicks, there is a limit on how long the system can hold events in memory before releasing them to be loaded. Upon receiving the auction summary record, a 20-minute window is opened for joining that record with any late-arriving impression and click records. When the window closes, the joined data is immediately surfaced in the Metamarkets dashboard.
Batch Job: As mentioned above, we have a secondary batch job designed to catch events that miss the two windows above. The job generally runs at a 4-12 hour delay, but if you have a long tail of events coming in up to 24 hours later, we can increase the delay to ensure the job doesn't run until then. When the batch fix-up job runs for a particular hour, all data received for that hour and the previous hour is also reprocessed, creating a join window of up to 2 hours to include clicks and impressions that occur near the time of the original request.
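
The three windows above can be sketched as a small client-side check, which may help when diagnosing why an event did not appear or did not join. This is only an approximation under stated assumptions: the function names are hypothetical, the thresholds are the figures quoted above (30 minutes, 20 minutes, up to 2 hours), and the batch join window is simplified to a plain lag threshold even though the actual job works on hourly buckets.

```python
from datetime import datetime, timedelta, timezone

REALTIME_LOAD_WINDOW = timedelta(minutes=30)  # "Time from Present"
REALTIME_JOIN_WINDOW = timedelta(minutes=20)  # "Time for Joins"
BATCH_JOIN_WINDOW = timedelta(hours=2)        # batch fix-up join window (upper bound)

def loads_in_realtime(event_time, now):
    """Return True if the event timestamp is within 30 minutes of the present."""
    return abs(now - event_time) <= REALTIME_LOAD_WINDOW

def join_path(auction_received, event_received):
    """Guess which process could join a late event to its auction record.

    Assumption: lag is measured between the arrival of the auction record
    and the arrival of the impression or click record.
    """
    lag = event_received - auction_received
    if lag <= REALTIME_JOIN_WINDOW:
        return "realtime"
    if lag <= BATCH_JOIN_WINDOW:
        return "batch"
    # Too late for either join; the event would need auction fields pre-joined.
    return "pre-join required"

# Hypothetical example: a click arriving 90 minutes after its auction.
auction_at = datetime(2014, 3, 5, 4, 0, tzinfo=timezone.utc)
click_path = join_path(auction_at, auction_at + timedelta(minutes=90))
```

An event classified as "batch" would not be joined in the dashboard until the batch fix-up job has run for its hour.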

What if I have events such as Conversions or Installs that may happen hours to days afterward?

We commonly see this question about events that may happen long after the Auction occurs. Such events cannot be joined with the auction record because they miss both the real-time window period (20 minutes) and the batch fix-up window period (4 hours). Metamarkets can handle these types of events with some additional work.

First, in addition to the timestamp of when the Auction occurred, we ask that the record include the timestamp of when the Event occurred.

Additionally, we ask that any dimensions/metrics that are included in the original Auction be pre-joined to the Event being sent.

From here, Metamarkets can load the data seamlessly.

Below is an example of an abbreviated Conversion JSON:

    {
      "id": "AFEWSEBD5EB5FI32DASFCD452BB78DVE",
      "timestamp": "2014-03-05T04:58:23.000Z",
      "at": 2,
      "bcat": ["IAB26","IAB25"],
      "app": { ... },
      "imp": [ ... ],
      "device": { ... },
      "user": { ... },
      "bid_responses": [ ... ],
      "conversions": {
        "conversion_timestamp": "2014-03-07T03:15:00.000Z",
        "conversion_charge": 10,
        "conversion_type": "Subscription"
      }
    }
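
A minimal Python sketch of the pre-join step is shown below, assuming you still hold the original auction record when the conversion happens. The function name `build_conversion_record` and the sample auction values are hypothetical; the `conversions` field names follow the example above.

```python
import json

def build_conversion_record(auction, conversion_timestamp, conversion_charge, conversion_type):
    """Copy the auction's fields and attach the conversion details."""
    # Shallow copy so the stored auction record is left untouched.
    record = dict(auction)
    # The auction "timestamp" is kept as-is; the event time goes in its own field.
    record["conversions"] = {
        "conversion_timestamp": conversion_timestamp,
        "conversion_charge": conversion_charge,
        "conversion_type": conversion_type,
    }
    return record

# Hypothetical, heavily abbreviated auction record.
stored_auction = {
    "id": "AFEWSEBD5EB5FI32DASFCD452BB78DVE",
    "timestamp": "2014-03-05T04:58:23.000Z",
    "at": 2,
}
conversion = build_conversion_record(
    stored_auction, "2014-03-07T03:15:00.000Z", 10, "Subscription"
)
conversion_line = json.dumps(conversion)  # one-line JSON, ready to send
```

Keeping both timestamps in the record means the event no longer depends on either join window.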

Where do you see the biggest validation issues come from?

There are several situations in which Metamarkets may not process data. Events that cannot be processed fall into several categories and are handled in the following ways:

Events posted unsuccessfully will be returned with status code 420 or 503.
Events posted successfully (status code 201) but with a file error (e.g. End of File error) will stall the pipeline. Data engineers will notify you of the corrupt file and restart the pipeline once the file is removed or replaced.
Events posted successfully but with invalid elements (e.g. a malformed timestamp) will be dropped, and at this time no method exists to notify you of this error.
Valid files with valid but incorrect JSON (e.g. a wrong field name or incorrect nesting) will have the fields corresponding to the incorrectly formed JSON dropped.
We do deduplicate identical events by default.
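
On the client side, the status codes listed above could be handled along these lines. This is a sketch only: the function name and the category labels are hypothetical, and what you do with a rejected post (for example, resending it later) is a decision for your own pipeline rather than documented Metamarkets behavior.

```python
# Status codes taken from the list above; the category names are illustrative.
ACCEPTED_STATUS = 201
REJECTED_STATUSES = {420, 503}

def classify_post_status(status_code):
    """Map an HTTP status code from an ingestion post to a coarse outcome."""
    if status_code == ACCEPTED_STATUS:
        # Accepted for ingestion; invalid elements can still be dropped later.
        return "accepted"
    if status_code in REJECTED_STATUSES:
        # The post was unsuccessful and the events were returned.
        return "rejected"
    # Anything else is not covered by the list above.
    return "unexpected"

outcomes = [classify_post_status(code) for code in (201, 420, 503, 404)]
```

Note that "accepted" only means the post itself succeeded; because dropped events are not reported, validating timestamps and field names before sending remains the safer approach.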

Do I need to do any data cleanup work before sending?

To get the most out of Metamarkets, we suggest that any fields you send are "clean", for two reasons. First, a concise, low-cardinality field makes it easier for your end users to find the data they need. Having three values such as "iphone", "iPhone", and "Apple IPhone" requires the user to know the difference and select all three when trying to view a single cohort. Second, dashboard and query performance is directly affected by the cardinality of fields. Replacing nonsensical values with a default such as "Not Available" can make a significant difference.
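
The kind of cleanup described above might look like the following Python sketch. The alias table and the set of placeholder values are assumptions chosen for illustration; in practice they would be built from the values that actually occur in your data.

```python
# Hypothetical alias table: lowercase variant -> canonical spelling.
CANONICAL_DEVICE_NAMES = {
    "iphone": "iPhone",
    "apple iphone": "iPhone",
}
# Hypothetical placeholders that carry no information.
PLACEHOLDERS = {"", "null", "n/a", "unknown", "-"}
DEFAULT_VALUE = "Not Available"

def clean_value(raw):
    """Normalize one dimension value to reduce cardinality."""
    if raw is None:
        return DEFAULT_VALUE
    # Collapse runs of whitespace and trim the ends.
    trimmed = " ".join(str(raw).split())
    key = trimmed.lower()
    if key in PLACEHOLDERS:
        return DEFAULT_VALUE
    # Fall back to the trimmed original when no alias is known.
    return CANONICAL_DEVICE_NAMES.get(key, trimmed)

cleaned = [clean_value(v) for v in ("iphone", "iPhone", "Apple IPhone", "  ", None)]
```

With this mapping, the three spellings from the example collapse into a single value, so the end user selects one cohort instead of three.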

How secure will my data be?

We follow security best practices, both in our approach to safeguarding your data and in remaining vigilant through ongoing threat assessments and proactive responses. You can read more about our security practices here, or contact your Metamarkets representative for a more thorough walkthrough of our security policies.

How am I charged for these services?

Our pricing models are based on the structure of the data being ingested, how closely it complies with standards, and how many events are uploaded. Metamarkets works with clients to understand their requirements for volume and nonstandard content and to set fair pricing for each use case.

If you charge per event, how are events counted?

Data is uploaded to our analytics platform in the form of files that contain records. Each record is a one-line flat JSON object in standard format. An event equals one of those records.
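
To estimate how many events a file contains, you can count its non-empty lines, since each record occupies one line. The helper names below are hypothetical, and this sketch is a client-side estimate only, not a description of how Metamarkets meters usage.

```python
import json

def to_record_lines(events):
    """Serialize events as one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(event, separators=(",", ":")) + "\n" for event in events)

def count_events(payload):
    """Count records in a payload: one non-empty line equals one event."""
    return sum(1 for line in payload.splitlines() if line.strip())

# Hypothetical two-event payload.
payload = to_record_lines([
    {"id": "A1", "timestamp": "2014-03-05T04:58:23.000Z"},
    {"id": "B2", "timestamp": "2014-03-05T04:58:24.000Z"},
])
event_count = count_events(payload)
```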