7. Data Import

MGRID Aperture imports datasets from a comma-separated values (CSV) file accompanied by a dataset definition file (XML) that describes the structure of the CSV file.

There are three ways to import datasets:

  • Import by discovery in user home directory; this is an automatic process that may be enabled during installation.

  • Import by end-user in Aperture UI; a user can add a dataset in the web frontend.

  • Import by end-user via Token API; see Upload Token API.

Aperture also allows unstructured files to be imported. These files are stored as-is, i.e. without modification or validity checks.

7.1. Dataset Import

The dataset definition file (formerly called the table definition file) is based on the format provided by Aridhia: http://www.aridhia.com/resources/user-guides/loading-data-into-workspace

7.1.1. Example

An example of a CSV file:

patient,pseudo,age,gender,chest pain,rest SBP,cholesterol,fasting blood sugar > 120
"patient1","pseudo1",63,"male","typical ang",145,233,1
"patient2","pseudo2",67,"male","asymptomatic",160,286,0
"patient3","pseudo3",67,"male","asymptomatic",120,229,0
"patient4","pseudo4",37,"male","non-anginal",130,250,0

With its accompanying XML file:

<?xml version="1.0" encoding="utf-8"?>
<DatasetDefinition>
    <Columns>
        <Column Name="patient" Type="text" Deidentify="pseudonymize"/>
        <Column Name="pseudo" Type="text" Deidentify="pseudonym"/>
        <Column Name="age" Type="numeric" Deidentify="drop"/>
        <Column Name="gender" Type="text" Deidentify="drop" StoreInKeyfile="true"/>
        <Column Name="chest pain" Type="text" Deidentify="keep"/>
        <Column Name="rest SBP" Type="numeric" Deidentify="anonymize:numbergroup(10)"/>
        <Column Name="cholesterol" Type="numeric" Deidentify="anonymize:blank"/>
        <Column Name="fasting blood sugar > 120" Type="boolean"/>
    </Columns>
    <Format
        Delimiter=','
        NullQualifier='NA'
        TextQualifier='"'
        Encoding='UTF-8'
        Header='true'
        HeaderCase='lower'
        DateFormat='ISO, MDY'
    />
    <Success RemovefromUpload='True'/>
    <Fail RemovefromUpload='False'/>
</DatasetDefinition>

The XML file specifies the columns (names, types, and optional deidentify actions), the format (delimiter, encoding, presence of a header row, etc.), and whether the files should be removed from the upload location after a successful or a failed import.
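To make the relationship between the two files concrete, the following Python sketch reads a dataset definition and uses its Format attributes to parse the matching CSV. It uses only the standard library. The function names load_definition and read_rows are illustrative and not part of Aperture, the sample data is a shortened version of the example above, and the treatment of a missing Header attribute (assumed 'true') is an assumption; Aperture's own importer may behave differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample definition (XML declaration omitted for brevity).
DEFINITION = """<DatasetDefinition>
    <Columns>
        <Column Name="patient" Type="text" Deidentify="pseudonymize"/>
        <Column Name="age" Type="numeric"/>
    </Columns>
    <Format Delimiter=',' NullQualifier='NA' TextQualifier='"' Header='true'/>
</DatasetDefinition>
"""

CSV_DATA = 'patient,age\n"patient1",63\n"patient2",NA\n'


def load_definition(xml_text):
    """Return ([(column name, column type), ...], {format attribute: value})."""
    root = ET.fromstring(xml_text)
    columns = [(c.get("Name"), c.get("Type"))
               for c in root.findall("./Columns/Column")]
    fmt_el = root.find("Format")
    fmt = dict(fmt_el.attrib) if fmt_el is not None else {}
    return columns, fmt


def read_rows(csv_text, columns, fmt):
    """Parse the CSV text into dicts keyed by the column names of the definition."""
    reader = csv.reader(
        io.StringIO(csv_text),
        delimiter=fmt.get("Delimiter", ","),
        quotechar=fmt.get("TextQualifier", '"'),
    )
    rows = list(reader)
    # Assumption: a header row is present unless Header says otherwise.
    if fmt.get("Header", "true").lower() == "true":
        rows = rows[1:]
    null = fmt.get("NullQualifier")
    names = [name for name, _ in columns]
    # Values equal to the null qualifier become None; this sketch does not
    # distinguish a quoted "NA" from an unquoted NA.
    return [{n: (None if v == null else v) for n, v in zip(names, row)}
            for row in rows]


columns, fmt = load_definition(DEFINITION)
for row in read_rows(CSV_DATA, columns, fmt):
    print(row)
```

All values are returned as strings; converting them according to the Type attribute is left out of the sketch.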

7.1.2. Deidentify actions

For each column, a deidentify action may be specified (if omitted, ‘keep’ is assumed). The list of supported actions is as follows:

Action        Description
Keep          The column and its values are kept as-is.
Drop          The column and its values are not visible in the output. The values are only recoverable via the key file if ‘StoreInKeyfile’ is set to ‘true’.
Pseudonymize  The values are replaced with pseudonyms (either generated by the de-id server or provided by the user in the column with de-id action ‘pseudonym’), and are recoverable via the key file.
Pseudonym     The column and its values are visible in the output, and the values in the column with de-id action ‘pseudonymize’ are replaced with these values.
Anonymize     The values are anonymized with the provided function. The values are only recoverable via the key file if ‘StoreInKeyfile’ is set to ‘true’. See below for the full set of supported functions.

The default ‘blank’ anonymization function can be used by specifying ‘anonymize’ as the deidentify action.

Anonymization functions without arguments can be used by specifying ‘anonymize:function’, e.g. ‘anonymize:round’.

Anonymization functions with arguments can be used by specifying ‘anonymize:function(arguments)’, e.g. ‘anonymize:numbergroup(5)’ and ‘anonymize:agegroup(20161122,10)’.
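The three notations for the deidentify action can be illustrated with a small parser. The Python sketch below splits a Deidentify value into an action, an anonymization function name, and a list of argument strings. The function parse_deidentify and its regular expression are hypothetical helpers written for this illustration; they are not Aperture's implementation and may accept or reject values differently from the product.

```python
import re

# action, optionally followed by ":function" and optionally "(arguments)"
_ACTION_RE = re.compile(
    r"^(?P<action>[a-z]+)"
    r"(?::(?P<func>[a-z_][a-z_0-9]*)(?:\((?P<args>[^()]*)\))?)?$"
)

SIMPLE_ACTIONS = {"keep", "drop", "pseudonym", "pseudonymize"}


def parse_deidentify(value=None):
    """Return (action, function, arguments) for a Deidentify attribute value."""
    if value is None:
        # An omitted attribute means 'keep'.
        return ("keep", None, [])
    match = _ACTION_RE.match(value)
    if match is None:
        raise ValueError(f"malformed deidentify action: {value!r}")
    action, func, args = match.group("action", "func", "args")
    if action in SIMPLE_ACTIONS:
        if func is not None:
            raise ValueError(f"{action!r} does not take a function")
        return (action, None, [])
    if action == "anonymize":
        arg_list = [a.strip() for a in args.split(",")] if args else []
        # A bare 'anonymize' selects the default 'blank' function.
        return ("anonymize", func or "blank", arg_list)
    raise ValueError(f"unknown deidentify action: {value!r}")


for example in ("anonymize", "anonymize:round", "anonymize:agegroup(20161122,10)"):
    print(example, "->", parse_deidentify(example))
```

The arguments are kept as strings because their meaning depends on the anonymization function; the sketch does not check whether a function name is one that Aperture supports.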

7.1.3. XML Schema Definition

The dataset definition file must adhere to the following XML Schema Definition (XSD):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://aridhia-mgrid.com/ddf/2"
           targetNamespace="http://aridhia-mgrid.com/ddf/2"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
    <xs:simpleType name="DeidentifyAction">
        <xs:union>
            <xs:simpleType>
                <xs:restriction base="xs:string">
                    <xs:enumeration value="drop"/>
                    <xs:enumeration value="keep"/>
                    <xs:enumeration value="pseudonym"/>
                    <xs:enumeration value="pseudonymize"/>
                </xs:restriction>
            </xs:simpleType>
            <xs:simpleType>
                <xs:restriction base="xs:string">
                    <xs:pattern value="anonymize:[a-z_]([a-z_()0-9])+"/>
                </xs:restriction>
            </xs:simpleType>
        </xs:union>
    </xs:simpleType>
    <xs:complexType name="Column">
        <xs:attribute name="Name" type="xs:string" use="required"/>
        <xs:attribute name="Type" type="xs:string" use="required"/>
        <xs:attribute name="Description" type="xs:string"/>
        <xs:attribute name="Deidentify" type="DeidentifyAction"/>
        <xs:attribute name="StoreInKeyfile" type="xs:boolean"/>
    </xs:complexType>
    <xs:complexType name="Columns">
        <xs:sequence>
            <xs:element name="Column" type="Column" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="Format">
        <xs:attribute name="Delimiter" type="xs:string"/>
        <xs:attribute name="NullQualifier" type="xs:string"/>
        <xs:attribute name="TextQualifier" type="xs:string"/>
        <xs:attribute name="Encoding" type="xs:string"/>
        <xs:attribute name="Header" type="xs:string"/>
        <xs:attribute name="HeaderCase" type="xs:string"/>
        <xs:attribute name="DateFormat" type="xs:string"/>
        <xs:attribute name="CopyToFile" type="xs:boolean" default="true"/>
    </xs:complexType>
    <xs:complexType name="CatalogueIdentifier">
        <xs:attribute name="uuid" type="xs:string" use="required"/>
        <xs:attribute name="href" type="xs:string"/>
    </xs:complexType>
    <xs:complexType name="SuccessAction">
        <xs:attribute name="RemovefromUpload" type="xs:string" default="True"/>
    </xs:complexType>
    <xs:complexType name="FailAction">
        <xs:attribute name="RemovefromUpload" type="xs:string" default="False"/>
    </xs:complexType>
    <xs:simpleType name="CreateAction">
        <xs:restriction base="xs:string">
            <xs:enumeration value="create"/>
            <xs:enumeration value="append"/>
        </xs:restriction>
    </xs:simpleType>
    <xs:complexType name="DatasetDefinitionType">
        <xs:all>
            <xs:element name="Title" type="xs:string" minOccurs="0"/>
            <xs:element name="Url" type="xs:string" minOccurs="0"/>
            <xs:element name="Description" type="xs:string" minOccurs="0"/>
            <xs:element name="AuthorizationReference" type="xs:string" minOccurs="0"/>
            <xs:element name="PrivacyDisclaimer" minOccurs="0"/>
            <xs:element name="Columns" type="Columns"/>
            <xs:element name="Format" type="Format" minOccurs="0"/>
            <xs:element name="CatalogueIdentifier" type="CatalogueIdentifier" minOccurs="0"/>
            <xs:element name="Success" type="SuccessAction" minOccurs="0"/>
            <xs:element name="Fail" type="FailAction" minOccurs="0"/>
        </xs:all>
        <xs:attribute name="SchemaVersion" default="1.0"/>
        <xs:attribute name="TableName" type="xs:string"/>
        <xs:attribute name="Action" type="CreateAction" default="create"/>
    </xs:complexType>
    <xs:element name="DatasetDefinition" type="DatasetDefinitionType"/>
    <xs:element name="TableDefinition" type="DatasetDefinitionType">
        <xs:annotation>
            <xs:documentation>
        TableDefinition is deprecated. Use only for backwards compatibility with
        existing files.
      </xs:documentation>
        </xs:annotation>
    </xs:element>
</xs:schema>
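Full validation against this schema requires an XSD-capable library, which the Python standard library does not include. As a rough pre-flight check, the sketch below tests only a few of the constraints expressed in the schema: the root element name, the allowed values of Action, the presence of exactly one Columns element with at least one Column, and the required Name and Type attributes. The function check_definition is a hypothetical helper, it ignores XML namespaces, and it is not a substitute for real schema validation.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def check_definition(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    if local(root.tag) not in ("DatasetDefinition", "TableDefinition"):
        problems.append(f"unexpected root element {local(root.tag)!r}")
    # Action defaults to 'create' when omitted.
    if root.get("Action", "create") not in ("create", "append"):
        problems.append("Action must be 'create' or 'append'")
    columns = [el for el in root if local(el.tag) == "Columns"]
    if len(columns) != 1:
        problems.append("exactly one Columns element is required")
        return problems
    cols = [el for el in columns[0] if local(el.tag) == "Column"]
    if not cols:
        problems.append("Columns must contain at least one Column")
    for index, col in enumerate(cols, start=1):
        for attr in ("Name", "Type"):
            if col.get(attr) is None:
                problems.append(
                    f"Column {index} is missing required attribute {attr}")
    return problems


sample = ('<DatasetDefinition Action="replace"><Columns>'
          '<Column Name="age"/></Columns></DatasetDefinition>')
for problem in check_definition(sample):
    print(problem)
```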

7.2. File Import

Aperture also allows unstructured files to be imported. These files are stored as-is, i.e. without modification or validity checks. An unstructured file can be anything that helps the user to work with the data in the Aperture project, e.g. a regular text file, an image, or a presentation file.