5. Source

5.1. Introduction

Queries are defined within a Project and retrieve their data from a Source. QueryBuilder needs to be configured with where the source can be located, and what the structure of that source is.

5.2. Limit, mask and deidentify

Queries that are in the process of being built will be sampled by the web interface. This sampling is subjected to limiting, masking and deidentification.

  • Limit defines how many rows are sought in the sample

  • If mask is True: a count distinct of each column is returned. This masks the output and in effect only shows the number of distinct rows returned.

  • If mask is False: return the limited sample columns, rewriting any with a deid action to ‘deidentified’.

There are two important things to note here:

  1. The deidentify action is not applied in the sample, the columns to be deidentified are simply renamed to ‘deidentified’. This ensures that only those identifiers that were delivered to the user in the final resultset are recorded in the deidentification server.

  2. The final resultset is subject to deidentification if any column is marked with a deid action. Limiting and masking are not applied for queries that are exported.

5.3. Join mechanism

A Query consists of a number of projections and filters. The projections are columns of source tables. Source tables are inner joined together using the relationships defined in the world yaml.

It is possible that two tables needed for the query are not connected together. In order to make the join between the tables a third joining table may be necessary.

The join algorithm operates as follows:

  • Determine the minimal set of partitions for all source tables in the requested projection.

    • A partition is a set of tables that can be joined together via direct connections (inner join relationships)

  • If multiple partitions result, we need a transitive table between the partitions in order to join them.

    • The shortest join path (in terms of the number of tables) between partitions is established, and intermediate tables + conditions are added to the first partition.

    • Then the second partition is added to the first (it is now reachable via the transitive table/join), and this process is repeated until one partition remains.

In general care must be taken in the definition of the world yaml. Join paths can be minimized by correctly identifying bridge tables and taking care to include enough of them to speed up query execution speed and reduce total join path lengths.

5.4. Settings

SOURCE

The database type, either psql for PostgreSQL or mssql for Microsoft SQL Server.

SOURCE_LIMIT

The maximum number of rows to be returned when running a limited query. This limited query is used to sample the output for a query as it is defined in the user interface.

SOURCE_MASK

A boolean switch that determines if during query construction the output of the query should be masked. If True returns only a count of distinct rows in the output of the query.

SOURCE_PROHIBIT_PII_FILTERS

A boolean switch that prohibits table attributes that are marked with a deid attribute (and are assumed personally identifyable information) to be used in default filter expressions. This restricts the possibility to use information contained in that attribute to formulate a filtering condition.

Note that these attributes are always allowed – regardless of this configuration setting – in dataset filter conditions (in dataset, not in dataset) to facilitate dataset joining.

SOURCE_WORLD

The world definition for the configured source. Should point to a yaml definition that specifies the structure of the schemas, tables, attributes and relationships in the source.

The (yaml) format is:

schemas:
  <schemaname>:
    tables:
      <tablename>:
        columns:
          - {name: <columnname>, type: <columntype>,
              [deid: pseudonymize],
              [identifies: population]}
        fk:
          <childschemaname>.<childtablename>:
            - [<columnname>, <childcolumnname>]
        keys: [<columnname>]
  • Multiple schemas, tables, columns, foreign keys, and keys are supported.

  • The source defines the columntypes, typically all default source columntypes are supported.

  • Per column a deid action can be defined. Only allowed action at this time is pseudonymize.

  • Each table column can contain a population statement that explains to the QueryBuilder what type of population is identified by this attribute. At query result calculation time, QueryBuilder will use these identity statements to determine the number of population members in the current query.

  • Foreign keys are registered at their target. So the table defining the key attribute gets to register child tables that point to it.

  • Multiple foreign keys with other child tables can be expressed by issueing multiple childtablename statements.

  • Multiple columns under one foreign key are considered compound foreign keys.

  • Global keys (think patient ids, instance ids) can be mentioned in the keys directive.

  • Global keys across different tables are matched by name.

World file is validated for internal consistancy on server startup. At an installation a first version can be generated using the Invoke inspect command.

SOURCE_SERVER

Source database hostname.

SOURCE_USERNAME

Source database username. The QueryBuilder source database account needs read-only access to the source schemas as defined in the SOURCE_WORLD.

SOURCE_PASSWORD

Source database password.

SOURCE_DATABASE

Source database name.

SOURCE_CURSOR_SIZE

Determines the amount of rows sent in each batch from Postgres to QueryBuilder.