# Importing a custom data set

For most customers connecting data sources will be the primary source of data collection, for users with webhooks, we have a [custom webhook integration](https://docs.adlibertas.com/data-integrations/custom-integrations/sending-events-via-webhooks-to-adlibertas) as well.

However, in some use-cases, you may want to import a custom dataset. This article covers the schema and procedure for importing a custom dataset to AdLibertas.

**Uploading the data**

To get started all we need is data uploaded to an s3 bucket, please reach out to your account manager for AWS credentials.

All files will be nested underneath an s3 folder in this format:

```
s3://adlibertas-<<company_name>>/athena/production/singulari/v1/s3_events_123456/
```

[Here is an example of the output](https://2la1xp2tiyea46tdsj1ai1pe-wpengine.netdna-ssl.com/wp-content/uploads/2022/02/project_idmy-demo-project-dc2a9_dataset_idanalytics_227417573_table_idevents_20210528_000000000000.json_.gz)

**Partitioning:**

If you have used hive or AWS glue/ athena before, our s3 paths should be familiar. Each upload should be categorized underneath a `dataset_id` and an `event_date`.

A dataset can be as large as you like and contain multiple apps. However, we ask that you separate uploads by an `event_date` (YYYY-MM-DD).

Breaking down the example file partitions, you can see the `dataset_id=com.awesomeapp` and `event_date=2021-05-28` where \*com.awesomeapp\* is the dataset ID and \*2021-05-28\* is the event date.

**Formatting:**

Each file uploaded is in a JSONL file gzipped format. Please see [this example file](https://2la1xp2tiyea46tdsj1ai1pe-wpengine.netdna-ssl.com/wp-content/uploads/2022/02/project_idmy-demo-project-dc2a9_dataset_idanalytics_227417573_table_idevents_20210528_000000000000.json_.gz) for the schema of each JSON object (our schema is very close to the Firebase analytics schema). This is the schema used for the Hive data table. Do note most of the columns are not required, but to get the most out of the AdLibertas platform, we ask that you provide as much data as possible.

These columns should be provided at a minimum:

* `event_timestamp`
* `event_name` (ad\_impression,in\_app\_purchase,user\_engagement,etc)
* `event_params` (this can be a revenue event \[IAP or impression], engagment time in the app, user activity, etc)
* `user_id` (your unique user ID)
* `user_pseudo_id` (IDFV or Google App Set ID)
* `user_properties` (assigned user properties, often AB test assignments)
* `user_first_touch_timestamp` (Install timestamp)
* `device.vendor_id` (IDFV) and/or `device.advertising_id` (IDFA or GAID)
* `geo.country` (ISO2 country code)
* `app_info.id` (app bundle ID)
* `platform` (IOS or ANDROID)
* `traffic_source.name` (campaign name for tracking LTV & ROAS by campaign)

**Example SQL query**

Here is the comprehensive schema of the Hive table that the new dataset needs to follow.

```
CREATE EXTERNAL TABLE `s3_events_123456`(
  `event_timestamp` bigint COMMENT 'from deserializer', 
  `event_name` string COMMENT 'from deserializer', 
  `event_params` array<struct<key:string,value:struct<string_value:string,int_value:bigint,float_value:double,double_value:double>>> COMMENT 'from deserializer', 
  `event_previous_timestamp` bigint COMMENT 'from deserializer', 
  `event_value_in_usd` double COMMENT 'from deserializer', 
  `event_bundle_sequence_id` bigint COMMENT 'from deserializer', 
  `event_server_timestamp_offset` bigint COMMENT 'from deserializer', 
  `user_id` string COMMENT 'from deserializer', 
  `user_pseudo_id` string COMMENT 'from deserializer', 
  `user_properties` array<struct<key:string,value:struct<string_value:string,int_value:bigint,float_value:double,double_value:double,set_timestamp_micros:bigint>>> COMMENT 'from deserializer', 
  `user_first_touch_timestamp` bigint COMMENT 'from deserializer', 
  `user_ltv` struct<revenue:double,currency:string> COMMENT 'from deserializer', 
  `device` struct<category:string,mobile_brand_name:string,mobile_model_name:string,mobile_marketing_name:string,mobile_os_hardware_model:string,operating_system:string,operating_system_version:string,vendor_id:string,advertising_id:string,language:string,is_limited_ad_tracking:string,time_zone_offset_seconds:bigint,browser:string,browser_version:string,web_info:string> COMMENT 'from deserializer', 
  `geo` struct<continent:string,country:string,region:string,city:string,sub_continent:string,metro:string> COMMENT 'from deserializer', 
  `app_info` struct<id:string,version:string,install_store:string,firebase_app_id:string,install_source:string> COMMENT 'from deserializer', 
  `traffic_source` struct<name:string,medium:string,source:string> COMMENT 'from deserializer', 
  `stream_id` string COMMENT 'from deserializer', 
  `platform` string COMMENT 'from deserializer', 
  `event_dimensions` string COMMENT 'from deserializer', 
  `ecommerce` string COMMENT 'from deserializer', 
  `items` array<string> COMMENT 'from deserializer')
PARTITIONED BY ( 
  `dataset_id` string, 
  `event_date` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='app_info,device,ecommerce,event_bundle_sequence_id,event_dimensions,event_name,event_params,event_previous_timestamp,event_server_timestamp_offset,event_timestamp,event_value_in_usd,geo,items,platform,stream_id,traffic_source,user_first_touch_timestamp,user_id,user_ltv,user_properties,user_pseudo_id') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://adlibertas-<<COMPANY>>/athena/production/singulari/v1/s3_events_123456'
```

```
```

```
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.adlibertas.com/data-integrations/custom-integrations/importing-a-custom-data-set.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
