The Metamarkets Realtime Data Ingestion (RDI) platform receives and processes data in real time. In turn, the data you upload can be viewed in the Metamarkets dashboard in seconds. This document explains real-time ingestion in more detail, and shows you how to upload your data to RDI.
In the context of RDI, ingestion refers to the intake of data for the purpose of processing it and surfacing it as information in the Metamarkets dashboard. Unlike batch-processing systems, which ingest one portion of the data at a time, real-time ingestion accepts a continuous stream of data. The stream is initiated and executed programmatically over an HTTPS connection pool, which allows concurrent, continuous uploads of data files while minimizing latency.
Before uploading data to RDI, be sure to understand what formats and data structures are supported. Your data should follow the standards and schema prescribed in the appropriate Metamarkets Integration Guide.
With RDI, you continually deliver data to an HTTPS endpoint (using POST operations), where it is immediately prepared for display and querying on your Metamarkets dashboard. RDI accepts data on the basis of event timestamp; properly formatted data that does not meet on-time delivery requirements can be processed at a later time.
Use the following as a checklist before getting started with uploading data:
- Expected Volume – Provide your Metamarkets account representative with the expected average and peak data volumes in GB/sec (or GB/hour) that you will deliver to RDI. These estimates will be used to configure quotas on the delivered volume, with headroom to accommodate peaks. Providing volume estimates is a prerequisite to the remaining items in this list.
- Endpoints – HTTPS endpoints are in the form of URL addresses to which you will post your data. RDI ingests the data at those endpoints. You may have one or more endpoints to use, depending on the type of data you are sending. Obtain the URLs from your Metamarkets data engineer.
- Credentials – The endpoints are protected by a set of login credentials. You will have to use these credentials to successfully connect to the endpoints. Obtain the credentials from your Metamarkets data engineer.
- Formatting – Ensure that your data is formatted correctly, as prescribed in the Metamarkets Integration Guide appropriate for your type of data.
- Compression – Ensure that your data is stored in compressed files.
- Test Uploads – Test the connection to the endpoint using sample data.
For example, to use curl to post a test file called realtime.json.gz, use the following:
curl --data-binary @realtime.json.gz \
  --user <username>:<password> \
  -H 'Content-Type: application/json' \
  -H 'Content-Encoding: gzip' \
  https://rt-evaluate.metamarkets.com/events/endpoint-name -v
The username, password, and data feed will be supplied to you.
- Programmatic Uploading – Obtain or build an HTTP client for posting your data to the endpoints. There are many programming languages with HTTP-client libraries that support posting data via HTTPS. To ensure that your data and credentials are sent encrypted, HTTPS is required for connecting to the endpoints.
- Current Data – Ensure that the event timestamps in the data you plan to upload are within 10 minutes of current time (the time you upload the data). RDI accepts current data only. If the data isn't current, see the section "Late or Historical Data" for more information.

See the Best Practices section for more tips on how to make uploading your data an efficient and successful process.
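As a sketch of the pre-flight check this implies (in Python; the function name is illustrative, not part of any Metamarkets API), the following verifies that an ISO 8601 event timestamp is within 10 minutes of the current time:

```python
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(minutes=10)  # RDI accepts current data only

def is_current(event_timestamp, now=None):
    """Return True if an ISO 8601 event timestamp is within 10 minutes of now."""
    now = now or datetime.now(timezone.utc)
    ts = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    return now - ts <= MAX_EVENT_AGE
```

Running this check before upload lets you divert stale records to a backfill path instead of posting them to RDI.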
Data files should be plain-text JSON format with one record per line, with each record separated from the next by a newline character. Records should follow the schema prescribed in the appropriate Metamarkets Integration Guide. Although you can batch records into files of up to a megabyte in size (compressed), batch sizes of approximately 0.5 MB pre-compressed are recommended for best performance.
Be sure to use UTF-8 encoding and to set Content-Type: application/json in the HTTP header.
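A minimal sketch of this formatting step in Python (the helper name is illustrative): each record is serialized as one line of JSON, records are joined by newline characters, and the result is UTF-8 encoded.

```python
import json

def to_ndjson(records):
    """Serialize records as newline-delimited JSON, UTF-8 encoded."""
    lines = (json.dumps(record, ensure_ascii=False) for record in records)
    return "\n".join(lines).encode("utf-8") + b"\n"
```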
Files should be compressed to maximize throughput. Compressed formats allow for faster ingestion of your log files by RDI, in addition to reducing your storage and delivery costs. RDI accepts files compressed with the gzip format.
If using compression (recommended), add an extension to the file name indicating the type of compression (e.g., .gz).
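For example, a batch of newline-delimited records might be gzipped and named like this (a Python sketch; the helper function is hypothetical):

```python
import gzip

def compress_batch(ndjson_bytes, basename):
    """Gzip one batch and add a .gz extension to its file name."""
    return basename + ".json.gz", gzip.compress(ndjson_bytes)
```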
To upload your data, initiate the connection to the RDI endpoints using your chosen HTTP-client library. RDI responds with an HTTP 2xx code within 50–100 ms, depending on the volume of data being uploaded.
Be sure to deliver data in a smooth and continuous pattern. Attempting to deliver data in large batches, for example once per hour, may cause the volume rate to exceed your quota and result in a failed upload (HTTP 420 error code).
If you experience an issue with the upload, see the advice in the Best Practices and the FAQ.
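The upload step can be sketched with Python's standard library; the host, path, and credentials below are placeholders for the values your Metamarkets data engineer supplies, and the function names are illustrative:

```python
import base64
import http.client

def basic_auth(username, password):
    """Build the Basic Authorization header value from the credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return "Basic " + token

def post_batch(host, path, username, password, gz_body, timeout=15):
    """POST one gzipped batch to an RDI endpoint; return the HTTP status code."""
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
        "Authorization": basic_auth(username, password),
    }
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request("POST", path, body=gz_body, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()
```

A production client would reuse connections from a pool rather than opening one per batch, as described under Best Practices.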
These recommendations are intended to provide you with a trouble-free process when uploading your data to RDI. Some recommendations are not applicable for low data volumes, but will make scaling volumes an easier and smoother process.
- Ensure that timestamps in the data are in ISO 8601 format. Other formats will fail.
- Compress the data with gzip.
- Batch data into 300–600 KB chunks (pre-compressed size). Sending larger batches of data can significantly slow the ingestion process.
- Single-event posting is not supported; all data must be batched.
- Use a connection pool in your HTTP client to allow for concurrent connections when data is posted. Set the pool to have a maximum of four connections. If throughput suffers or if you notice that all four connections are often being used simultaneously, increase this number. Set a timeout of 15 seconds; connections should never reach this timeout unless your network is down or extremely noisy.
- To make efficient use of HTTPS connections, ensure that they are persistent (keep-alive is enabled). HTTP/1.1 uses persistent connections by default. See the documentation for your HTTP client for configuring persistent connections.
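A minimal standard-library sketch of such a pool follows: four reusable HTTPS connections with a 15-second timeout, per the sizing guidance above. The connections are created lazily (no socket is opened until the first request), and HTTP/1.1 keep-alive lets each one be reused across posts.

```python
import http.client
import queue

POOL_SIZE = 4    # increase if all four connections are often busy at once
TIMEOUT_S = 15   # should never be reached unless the network is down or noisy

def make_pool(host, size=POOL_SIZE):
    """Create a simple pool of reusable (keep-alive) HTTPS connections."""
    pool = queue.Queue(maxsize=size)
    for _ in range(size):
        pool.put(http.client.HTTPSConnection(host, timeout=TIMEOUT_S))
    return pool
```

A worker thread would `get()` a connection from the pool, post a batch, and `put()` the connection back for reuse.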
- If you wish to upload a substantial amount of "catch up" or historical data, contact your Metamarkets account manager to discuss setting up a backfill operation. A data set that covers several hours or more may be too large to upload via the method described in the troubleshooting section.
- Parse the HTTP status of each response for a success (2xx) or failure (e.g., 4xx) code. See the troubleshooting section below on how to handle errors.
- To prevent data loss, implement a procedure for handling general delivery failures, such as when your volume substantially exceeds quotas.
- For retries, use an exponential back-off algorithm.
- If you are uploading lookup tables to an AWS S3 bucket, use the us-east region to maximize proximity to the RDI servers.
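The recommended exponential back-off can be sketched as a small delay generator (the base, factor, and cap values are illustrative; tune them to your retry budget):

```python
def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=6):
    """Yield exponentially growing retry delays (seconds), capped at max_delay."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor
```

A retry loop would sleep for each yielded delay before re-posting the failed batch, and drop the batch once the delays are exhausted.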
If you experience any problems after you begin uploading data to RDI, refer to these solutions to common issues. If you still cannot resolve your issue, contact your Metamarkets representative or send an email to [email protected].
You may notice that data that has been successfully posted to RDI has not been immediately surfaced in the dashboard. RDI is intended for displaying current data. Current data is defined as data with events that have a timestamp no more than 15 minutes behind the time the data is posted to RDI.
Data with a timestamp more than 30 minutes older than current time is saved and surfaced in your dashboard usually within 24 hours. Data older than 1 day will be stored but not automatically surfaced. In this case, contact your Metamarkets account manager to arrange for the data to be backfilled or set up a custom configuration.
If your data uploads fall behind, do not attempt to backfill all of the data by uploading it at once. That will likely exceed your volume quota and result in a failed upload (HTTP 420 error code). Instead, increase the overall throughput rate at which data is posted to a level between the normal rate and the quota (but no more than twice the normal rate), until data timestamps are near current time. If an HTTP 420 error code is returned, reduce the rate.
In order to support your actual data requirements, your agreement with Metamarkets includes a quota for a maximum number of events per stated time period. If that quota is violated, an HTTP 420 code is returned, indicating that the undelivered data should be dropped or re-sent using an exponential backoff pattern. Contact your Metamarkets account manager if you need more information about your quota or require a change.
Other 4xx codes imply a problem with the received request. For example, 401 indicates an access problem related to permission to perform a POST, while a 404 indicates no such resource. For these problems, confirm that your HTTP-client code has the correct HTTPS endpoint URL and credentials to successfully connect.
If RDI servers are temporarily unavailable or overloaded, an HTTP 5xx error code is returned. Either drop the undelivered data or retry sending it using an exponential backoff pattern.
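Putting the status-code guidance together, a client-side dispatch might look like this sketch (the action names are illustrative):

```python
def handle_status(status):
    """Map an RDI response code to a client action, per the guidance above."""
    if 200 <= status < 300:
        return "ok"             # batch accepted
    if status == 420:
        return "retry-backoff"  # quota exceeded: back off, or drop the data
    if 400 <= status < 500:
        return "fix-request"    # e.g., 401 bad credentials, 404 bad endpoint URL
    if 500 <= status < 600:
        return "retry-backoff"  # RDI temporarily unavailable or overloaded
    return "unexpected"
```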