Importing a custom data set

For most customers connecting data sources will be the primary source of data collection, for users with webhooks, we have a custom webhook integration as well.

However, in some use-cases, you may want to import a custom dataset. This article covers the schema and procedure for importing a custom dataset to AdLibertas.

Uploading the data

To get started all we need is data uploaded to an s3 bucket, please reach out to your account manager for AWS credentials.

All files will be nested underneath an s3 folder in this format:

s3://adlibertas-<<company_name>>/athena/production/singulari/v1/s3_events_123456/

Here is an example of the output

Partitioning:

If you have used hive or AWS glue/ athena before, our s3 paths should be familiar. Each upload should be categorized underneath a dataset_id and an event_date.

A dataset can be as large as you like and contain multiple apps. However, we ask that you separate uploads by an event_date (YYYY-MM-DD).

Breaking down the example file partitions, you can see the dataset_id=com.awesomeapp and event_date=2021-05-28 where *com.awesomeapp* is the dataset ID and *2021-05-28* is the event date.

Formatting:

Each file uploaded is in a JSONL file gzipped format. Please see this example file for the schema of each JSON object (our schema is very close to the Firebase analytics schema). This is the schema used for the Hive data table. Do note most of the columns are not required, but to get the most out of the AdLibertas platform, we ask that you provide as much data as possible.

These columns should be provided at a minimum:

event_timestamp
event_name (ad_impression,in_app_purchase,user_engagement,etc)
event_params (this can be a revenue event [IAP or impression], engagment time in the app, user activity, etc)
user_id (your unique user ID)
user_pseudo_id (IDFV or Google App Set ID)
user_properties (assigned user properties, often AB test assignments)
user_first_touch_timestamp (Install timestamp)
device.vendor_id (IDFV) and/or device.advertising_id (IDFA or GAID)
geo.country (ISO2 country code)
app_info.id (app bundle ID)
platform (IOS or ANDROID)
traffic_source.name (campaign name for tracking LTV & ROAS by campaign)

Example SQL query

Here is the comprehensive schema of the Hive table that the new dataset needs to follow.

CREATE EXTERNAL TABLE `s3_events_123456`(
  `event_timestamp` bigint COMMENT 'from deserializer', 
  `event_name` string COMMENT 'from deserializer', 
  `event_params` array<struct<key:string,value:struct<string_value:string,int_value:bigint,float_value:double,double_value:double>>> COMMENT 'from deserializer', 
  `event_previous_timestamp` bigint COMMENT 'from deserializer', 
  `event_value_in_usd` double COMMENT 'from deserializer', 
  `event_bundle_sequence_id` bigint COMMENT 'from deserializer', 
  `event_server_timestamp_offset` bigint COMMENT 'from deserializer', 
  `user_id` string COMMENT 'from deserializer', 
  `user_pseudo_id` string COMMENT 'from deserializer', 
  `user_properties` array<struct<key:string,value:struct<string_value:string,int_value:bigint,float_value:double,double_value:double,set_timestamp_micros:bigint>>> COMMENT 'from deserializer', 
  `user_first_touch_timestamp` bigint COMMENT 'from deserializer', 
  `user_ltv` struct<revenue:double,currency:string> COMMENT 'from deserializer', 
  `device` struct<category:string,mobile_brand_name:string,mobile_model_name:string,mobile_marketing_name:string,mobile_os_hardware_model:string,operating_system:string,operating_system_version:string,vendor_id:string,advertising_id:string,language:string,is_limited_ad_tracking:string,time_zone_offset_seconds:bigint,browser:string,browser_version:string,web_info:string> COMMENT 'from deserializer', 
  `geo` struct<continent:string,country:string,region:string,city:string,sub_continent:string,metro:string> COMMENT 'from deserializer', 
  `app_info` struct<id:string,version:string,install_store:string,firebase_app_id:string,install_source:string> COMMENT 'from deserializer', 
  `traffic_source` struct<name:string,medium:string,source:string> COMMENT 'from deserializer', 
  `stream_id` string COMMENT 'from deserializer', 
  `platform` string COMMENT 'from deserializer', 
  `event_dimensions` string COMMENT 'from deserializer', 
  `ecommerce` string COMMENT 'from deserializer', 
  `items` array<string> COMMENT 'from deserializer')
PARTITIONED BY ( 
  `dataset_id` string, 
  `event_date` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='app_info,device,ecommerce,event_bundle_sequence_id,event_dimensions,event_name,event_params,event_previous_timestamp,event_server_timestamp_offset,event_timestamp,event_value_in_usd,geo,items,platform,stream_id,traffic_source,user_first_touch_timestamp,user_id,user_ltv,user_properties,user_pseudo_id') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://adlibertas-<<COMPANY>>/athena/production/singulari/v1/s3_events_123456'

PreviousConnecting AdLibertas to BigQuery NextIAP Connections

Last updated 3 years ago