Data import methods

This documentation only applies if you are using Predictive AI in standalone mode (previously Tinyclues) or if you have subscribed to the IR add-on (Identity Resolution).

There are several ways to import data into Splio CDP / Predictive CDP:

  • Drop files on an SFTP server on a regular basis
  • Give Splio access to your BigQuery dataset via a service account

The methods are not exclusive: you can import part of the data with the SFTP method and the rest with BigQuery. This largely depends on your data sources and your IT organization.

ℹ️ You can also use our Shopify app to synchronize all your Shopify data with Splio CDP.

SFTP

⚠️ The following requirements must be followed even after the initial setup to avoid import issues.

Please ensure that your SFTP transfers and CSV files meet the following specifications (a short example follows the list):

  • Authentication is done using a login and a password provided by your contact at Splio
  • CSV files must use a consistent delimiter, quote character (quotechar), and escape character (escapechar)
  • Headers are mandatory
  • Encoding: UTF-8
  • File names and headers should not contain accented letters (é, è, ñ, etc.) or spaces, and file names should respect the following pattern: `table_source_subsection_YYYYMMDD.csv` (e.g. `contacts_magento_prospects_20230724.csv`)
  • By default, files must be submitted daily at a fixed time. If required, we can increase the frequency to one import every three hours.

Note that if you plan to send more than one file per day, you should respect the following file naming pattern: `table_source_subsection_YYYYMMDDHHMMSS.csv`

  • Files can only be compressed using ZIP or Gzip format.
  • File size must not exceed 10 GB (once uncompressed).
    • The full data history can be split into separate files to ease the transfer. In this case, please respect the following naming pattern: `table_source_subsection_YYYYMMDDHHMMSS_1.csv`, `table_source_subsection_YYYYMMDDHHMMSS_2.csv`
  • Files must not be password-protected, partitioned, or placed inside folders; they must be exported as stand-alone files.
  • The file structure (header, date formats, etc.) must remain stable over time.
  • Files should only contain the incremental data since the last export.
    • Note that we recommend you send the last 2 days' worth of data in order to avoid gaps (we manage duplicates on our side).
    • Do not forget to send us the updated rows of your user table (including opt-in status updates) and of your product table
  • Note that we can also manage full replacement for specific tables: user, product, or any reference/dimension table
    • ⚠️ Please keep in mind that if you want to delete users, this purge process should be followed: full replacement does not delete the related data scope (purchase history, etc.), nor does it delete anything in linked Splio databases (Marketing Automation, Loyalty, Mobile Wallets)
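
For illustration, here is a minimal Python sketch that produces a file matching these specifications: UTF-8 encoding, a header row, explicit delimiter/quotechar/escapechar, a timestamped file name, and optional gzip compression. The table, source, subsection, column names, and sample rows are hypothetical; adapt them to your own export.

```python
import csv
import gzip
import shutil
from datetime import datetime, timezone

# Hypothetical naming: table="contacts", source="magento", subsection="prospects"
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
filename = f"contacts_magento_prospects_{timestamp}.csv"

# Example rows; replace with your own incremental extract
rows = [
    {"user_id": "42", "email": "jane.doe@example.com", "optin": "true"},
    {"user_id": "43", "email": "john.doe@example.com", "optin": "false"},
]

# UTF-8 encoding, explicit delimiter/quotechar/escapechar, headers on the first line
with open(filename, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["user_id", "email", "optin"],
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        quoting=csv.QUOTE_MINIMAL,
    )
    writer.writeheader()
    writer.writerows(rows)

# Optional gzip compression before dropping the file on the SFTP server
with open(filename, "rb") as src, gzip.open(filename + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```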

BigQuery (via Service account)

In BigQuery, you can create datasets and tables in your own environment and allow Splio to read your data securely through IAM role management. The cost of querying the data you make available to us is charged to Splio’s service account.

⚠️ Any dataset or table that you would like to share with us must be located in the EU multi-region (it cannot be in a single region).

From Google Cloud’s documentation:
“Giving a view access to a dataset is also known as creating an authorized view in BigQuery. An authorized view lets you share query results with particular users and groups without giving them access to the underlying tables. You can also use the view's SQL query to restrict the columns (fields) the users are able to query.”

The process can be summarized as follows:

  1. Create a dedicated Splio dataset for data sharing in your environment
  2. Assign access controls to your project (give access to Splio on the dataset)
  3. Create the tables/views of the data you want to share

The step-by-step tutorial from Google BigQuery explains how to technically proceed with this workflow.
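
As an illustration of steps 1 and 3 (the access grant from step 2 is shown in the “Splio’s service account” section below), here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # hypothetical project ID

# 1. Create a dedicated dataset for sharing with Splio, in the EU multi-region
dataset = bigquery.Dataset("your-project.splio_sharing")
dataset.location = "EU"  # must be the EU multi-region, not a single region
dataset = client.create_dataset(dataset, exists_ok=True)

# 3. Create a view exposing only the columns you want to share
view = bigquery.Table("your-project.splio_sharing.contacts")
view.view_query = """
    SELECT user_id, email, optin_email, created_at
    FROM `your-project.crm.contacts`
"""
client.create_table(view, exists_ok=True)
```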

How should my data be refreshed?

Data freshness plays a crucial role in predictive quality. The data you give us access to should be up to date. We do not need real-time data, but we do need the data to be updated daily.

Next steps

The next sections of this documentation will take you through specific requirements for the tables that will be shared.

Depending on the materialization strategy of the shared data, you may need to set up workflows to keep the data updated (see the following subsection).

⚠️ The tables should always contain the defined history depth, not only incremental updates. For instance, if we agreed on 2 years of transactional data, the table should always contain at least 2 years of transactions. Daily updates should append new data to the existing data, and the oldest day of data can be purged. By default and for GDPR reasons, we accept a maximum of 4 years of history.
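
One way to satisfy this requirement is to expose a view with a rolling time window, so the shared data always covers the agreed history depth. A minimal sketch, assuming a hypothetical `purchases` source table with a DATE column `purchase_date` and a 2-year agreement:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # hypothetical project ID

# Rolling 2-year window: the shared view always contains the agreed history depth
view = bigquery.Table("your-project.splio_sharing.purchases")
view.view_query = """
    SELECT *
    FROM `your-project.sales.purchases`
    WHERE purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
"""
client.create_table(view, exists_ok=True)
```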

Views or tables?

Even though we can work with both data structures, we encourage you to use views as much as possible, since that relieves you of the task of keeping the data updated: as long as the underlying source data is updated, the view is updated as well!

The only limitation is when a view is too complex and requires significant resources to query. In that case, it may be worth materializing it as a table to reduce the cost of querying it.
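
For example, an expensive join can be rebuilt as a table on your side instead of being exposed as a view. The sketch below uses hypothetical table and column names and runs a CREATE OR REPLACE TABLE statement through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # hypothetical project ID

# Periodically rebuild a table from the expensive query instead of sharing the view
sql = """
    CREATE OR REPLACE TABLE `your-project.splio_sharing.purchases_flat` AS
    SELECT p.*, s.store_name
    FROM `your-project.sales.purchases` AS p
    LEFT JOIN `your-project.sales.stores` AS s USING (store_id)
    WHERE p.purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
"""
client.query(sql).result()  # waits for the job to finish
```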

Table Partitioning

For event tables (purchases, broadcast and reaction logs, browsing, etc.), we request that you use partitioned tables (as the shared table or as the underlying source table of the view), which allows cost optimization when querying them.

For those tables, we kindly ask you to share with us the partition rules (field, type, and granularity) in order to adapt our ingestion workflow.
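
As an illustration, here is how a day-partitioned event table could be created with the Python client. The table name, schema, and partition field are hypothetical; the actual partition rules are whatever you share with Splio.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # hypothetical project ID

# Hypothetical browsing-event table, partitioned by day on the event_date column
table = bigquery.Table(
    "your-project.splio_sharing.browsing_events",
    schema=[
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("page_url", "STRING"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # partition field: part of the rules to share with Splio
)
client.create_table(table, exists_ok=True)
```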

Splio’s service account

Access to your dataset should be granted to a Splio service account that will be used exclusively to read your data and save it into Splio’s BigQuery environment. Your Technical Account Manager will provide you with the address of the service account at the beginning of your setup. Please note that we only need read access to your data and will not write directly into your dataset.

The process for granting a service account access to your dataset is explained here.
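
For illustration, the grant can also be done with the Python client by appending a READER access entry for the Splio service account to the dataset’s access controls. The dataset name and service account address below are placeholders; use the address provided by your Technical Account Manager.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # hypothetical project ID

dataset = client.get_dataset("your-project.splio_sharing")

# Append a READER entry for the Splio service account (placeholder address)
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="splio-ingestion@splio-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```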