Data Requisites

Updated 12th April, 2021

Obviously AI requires a structured dataset to produce meaningful predictions. The dataset needs to be structured, but not necessarily clean: it can contain inconsistencies such as text values in number columns or empty cells.

We made a quick DIY checklist to help you make sure your data is well structured and machine-learning ready.

General Guidelines

What columns should I bring in my dataset?

A machine-learning-ready training dataset typically contains several types of columns (features). You don't need all of them, but having as many as possible can help make better predictions. Here is a list of the most common column types:

  • Identifier column: Anything we use to distinguish one customer from another. Only ONE is required. (e.g. User ID, Name, Customer ID, etc.)
  • Demographic columns: Any columns with demographic data that relates to the user OR the line item in the row. (e.g. Age, Location, Income, etc.)
  • Product/Usage columns: Any columns that record activity done by the customer on your product OR details of their account. (e.g. Number of sessions, Account type, etc.)
  • Transactional columns: Any columns with details on transactions done by the customer. (e.g. Monthly charges, Payment method, Contract length, etc.)
  • Prediction column: Historical data for the outcome that you would like to predict. (e.g. Churn, Lead status, Sales, Revenue, etc.)

To learn more about the types of columns, check out the following links:

How do I structure a good training dataset?

A good training dataset is typically structured as follows:

  • One place: Data is organized in a single file, table or view.
  • Few missing values: Keep missing values to a minimum so that rows are not dropped.
  • At least 1,000 rows: Ensures meaningful results. The more rows, the better.
  • At least 5 columns: Ensures meaningful results. The more columns, the better.
  • No long addresses: Break down addresses into City, State, Country and ZIP, instead of long form.
  • No PII required: Personally identifiable information (PII) columns are not required (e.g. Email, Phone, Address, etc.)
  • No long sentences: Avoid long phrases in columns (e.g. chat log, emails, feedback paragraph, etc.)
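
If you want to check a dataset against the guidelines above before uploading it, the following is a minimal sketch and not part of Obviously AI itself. It assumes the rows are already loaded as a list of dictionaries (for example with Python's csv.DictReader). The names profile_columns and long_text_columns, and the 100-character cutoff for a "long" value, are assumptions made for illustration; the guidelines do not define an exact limit.

```python
from typing import Dict, List

# Assumed threshold for illustration only; the guideline does not define "long".
LONG_TEXT_CHARS = 100


def profile_columns(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return the share of missing cells and the longest value length per column."""
    if not rows:
        return {}
    profile = {}
    for column in rows[0]:
        # Treat None, empty strings and whitespace-only strings as missing.
        values = [(row.get(column) or "").strip() for row in rows]
        missing = sum(1 for value in values if value == "")
        profile[column] = {
            "missing_share": missing / len(rows),
            "max_length": max(len(value) for value in values),
        }
    return profile


def long_text_columns(rows: List[Dict[str, str]], limit: int = LONG_TEXT_CHARS) -> List[str]:
    """List columns whose longest value exceeds the assumed length limit."""
    return [
        name
        for name, stats in profile_columns(rows).items()
        if stats["max_length"] > limit
    ]
```

Columns with a high missing_share are candidates for cleanup, and columns returned by long_text_columns are likely free-text fields such as chat logs or feedback paragraphs.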

For CSV files

Below is the checklist of prerequisites for CSV files.

  • File size is less than 25 MB.
  • First row is column names.
  • First column is an ID column.
  • File has a minimum of 1,000 rows and 5 columns.
  • File has very few empty cells.
  • File is in a .CSV format.
  • Here is a sample file for your reference.
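
The CSV checklist above can be approximated with a short local check. This is a sketch under stated assumptions, not the validation that Obviously AI actually runs: check_csv is a hypothetical helper, 1 MB is taken to be 1,048,576 bytes, the 1,000-row minimum is counted without the header row, and "ID column" is interpreted as "first column has no blank or repeated values". Because the checklist gives no threshold for "very few empty cells", the sketch only reports the share of empty cells.

```python
import csv
import os

MAX_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
MIN_ROWS = 1000               # assumption: data rows, header not counted
MIN_COLUMNS = 5


def check_csv(path: str) -> dict:
    """Check a CSV file against the checklist and report the empty-cell share."""
    problems = []
    if not path.lower().endswith(".csv"):
        problems.append("file extension is not .csv")
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is 25 MB or larger")

    row_count = 0
    empty_cells = 0
    seen_ids = set()
    ids_unique = True
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row is expected to hold column names
        for row in reader:
            row_count += 1
            empty_cells += sum(1 for cell in row if cell.strip() == "")
            # Cells missing from a short row also count as empty.
            empty_cells += max(0, len(header) - len(row))
            first = row[0].strip() if row else ""
            if first == "" or first in seen_ids:
                ids_unique = False
            seen_ids.add(first)

    if len(header) < MIN_COLUMNS:
        problems.append("fewer than 5 columns")
    if any(name.strip() == "" for name in header):
        problems.append("header row has blank column names")
    if row_count < MIN_ROWS:
        problems.append("fewer than 1,000 rows")
    if not ids_unique:
        problems.append("first column has blank or repeated values")

    total_cells = row_count * len(header)
    share = empty_cells / total_cells if total_cells else 0.0
    return {"problems": problems, "empty_cell_share": share}
```

Calling check_csv("customers.csv") (a hypothetical file name) returns a dictionary whose "problems" list is empty when every check passes.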

For Databases

Below is the checklist of prerequisites for connecting your database.

  • Ensure Obviously AI's IP address is whitelisted on your firewall. The IP address is listed under Connection Requirements when you add the dataset.
  • First column in your table is an ID column.
  • Table has a minimum of 1,000 rows and 5 columns.
  • Table has very few empty cells.
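
The table-level items in this checklist can be checked with a few queries before you connect. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since this page does not name the supported databases; for another database the same COUNT queries apply, but the way column names are listed will differ. check_table is a hypothetical helper, and empty cells are interpreted here as NULL values only.

```python
import sqlite3

MIN_ROWS = 1000
MIN_COLUMNS = 5


def quote(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def check_table(conn: sqlite3.Connection, table: str) -> dict:
    """Summarize row count, column count and NULL share for one table."""
    # PRAGMA table_info is SQLite-specific; index 1 of each row is the column name.
    info = conn.execute(f"PRAGMA table_info({quote(table)})").fetchall()
    columns = [row[1] for row in info]
    if not columns:
        raise ValueError("table not found: " + table)

    row_count = conn.execute(
        f"SELECT COUNT(*) FROM {quote(table)}"
    ).fetchone()[0]

    # COUNT(column) skips NULLs, so rows minus COUNT(column) is the NULL count.
    null_cells = 0
    for column in columns:
        filled = conn.execute(
            f"SELECT COUNT({quote(column)}) FROM {quote(table)}"
        ).fetchone()[0]
        null_cells += row_count - filled

    total_cells = row_count * len(columns)
    return {
        "rows": row_count,
        "columns": len(columns),
        "enough_rows": row_count >= MIN_ROWS,
        "enough_columns": len(columns) >= MIN_COLUMNS,
        "null_share": null_cells / total_cells if total_cells else 0.0,
    }
```

The checklist gives no threshold for "very few empty cells", so null_share is reported as a number for you to judge.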