5. Source

5.1. Introduction

Queries are defined within a Project and retrieve their data from a Source. Query Builder needs to be configured with where the source can be located, and what the structure of that source is.

5.2. Limit, mask and deidentify

Queries that are in the process of being built will be sampled by the web interface. This sampling is subjected to limiting, masking and deidentification.

  • Limit defines how many rows are sought in the sample
  • If mask is True: a count distinct of each column is returned. This masks the output and in effect only shows the number of distinct rows returned.
  • If mask is False: return the limited sample columns, rewriting any with a deid action to ‘deidentified’.

There are two important things to note here:

  1. The deidentify action is not applied in the sample, the columns to be deidentified are simply renamed to ‘deidentified’. This ensures that only those identifiers that were delivered to the user in the final resultset are recorded in the deidentification server.
  2. The final resultset is subject to deidentification if any column is marked with a deid action. Limiting and masking are not applied for queries that are exported.

5.3. Settings

SOURCE

The database type, either psql for PostgreSQL or mssql for Microsoft SQL Server.

SOURCE_LIMIT

The maximum number of rows to be returned when running a limited query. This limited query is used to sample the output for a query as it is defined in the user interface.

SOURCE_MASK

A boolean switch that determines if during query construction the output of the query should be masked. If True returns only a count of distinct rows in the output of the query.

SOURCE_WORLD

The world definition for the configured source. Should point to a yaml definition that specifies the structure of the schemas, tables, attributes and relationships in the source.

The (yaml) format is:

schemas:
  <schemaname>:
    tables:
      <tablename>:
        columns:
          - {name: <columnname>, type: <columntype>,
              [deid: pseudonymize],
              [identifies: population]}
        fk:
          <childschemaname>.<childtablename>:
            - [<columnname>, <childcolumnname>]
        keys: [<columnname>]
  • Multiple schemas, tables, columns, foreign keys, and keys are supported.
  • The source defines the columntypes, typically all default source columntypes are supported.
  • Per column a deid action can be defined. Only allowed action at this time is pseudonymize.
  • Each table column can contain a population statement that explains to the Query Builder what type of population is identified by this attribute. At query result calculation time, Query Builder will use these identity statements to determine the number of population members in the current query.
  • Foreign keys are registered at their target. So the table defining the key attribute gets to register child tables that point to it.
  • Multiple foreign keys with other child tables can be expressed by issueing multiple childtablename statements.
  • Multiple columns under one foreign key are considered compound foreign keys.
  • Global keys (think patient ids, instance ids) can be mentioned in the keys directive.
  • Global keys across different tables are matched by name.

World file is validated for internal consistancy on server startup. At an installation a first version can be generated using the Invoke inspect command.

SOURCE_SERVER

Source database hostname.

SOURCE_USERNAME

Source database username. The Query Builder source database account needs read-only access to the source schemas as defined in the SOURCE_WORLD.

SOURCE_PASSWORD

Source database password.

SOURCE_DATABASE

Source database name.