5. Source¶

5.1. Introduction¶

Queries are defined within a Project and retrieve their data from a Source. QueryBuilder needs to be configured with where the source can be located, and what the structure of that source is.

5.2. Limit, mask and deidentify¶

Queries that are in the process of being built will be sampled by the web interface. This sampling is subjected to limiting, masking and deidentification.

Limit defines how many rows are sought in the sample
If mask is True: a count distinct of each column is returned. This masks the output and in effect only shows the number of distinct rows returned.
If mask is False: return the limited sample columns, rewriting any with a deid action to ‘deidentified’.

There are two important things to note here:

The deidentify action is not applied in the sample, the columns to be deidentified are simply renamed to ‘deidentified’. This ensures that only those identifiers that were delivered to the user in the final resultset are recorded in the deidentification server.
The final resultset is subject to deidentification if any column is marked with a deid action. Limiting and masking are not applied for queries that are exported.

5.3. Join mechanism¶

A Query consists of a number of projections and filters. The projections are columns of source tables. Source tables are inner joined together using the relationships defined in the world yaml.

It is possible that two tables needed for the query are not connected together. In order to make the join between the tables a third joining table may be necessary.

The join algorithm operates as follows:

Determine the minimal set of partitions for all source tables in the requested projection.
- A partition is a set of tables that can be joined together via direct connections (inner join relationships)
If multiple partitions result, we need a transitive table between the partitions in order to join them.
- The shortest join path (in terms of the number of tables) between partitions is established, and intermediate tables + conditions are added to the first partition.
- Then the second partition is added to the first (it is now reachable via the transitive table/join), and this process is repeated until one partition remains.

5.3.1. Tips for world definition¶

In general care must be taken in the definition of the world yaml. Join paths can be minimized by correctly identifying bridge tables and taking care to include enough of them to speed up query execution speed and reduce total join path lengths.

Make sure there are there are little superfluous join paths. All tables need to be reachable, but if they can be reached via two routes, the algorithm might take one or the other depending the start conditions of the query.

Note

Although you cannot influence the route it takes, it will take the same route given the same query, so as not to break the principle of least astonishment.

Tables that are essential in joining should provide rows in common with their join partners. If they do not, they will constrain any join that traverses them, leading to loss of rows in the resultset. See Unequal joins for a detailed example.

An invoke command is available that draws the currently defined world.yaml as a graphviz dot file. This can then be converted into an image using dot outside the container.

mgrid@91177921ef3e:/opt/mgrid/querybuilder$ inv app.inspect.joinpaths 2>world.dot

5.3.2. Unequal joins¶

The relationships in the world yaml define our possible join paths. A join path could look like this Person.id=Address.person_id, which would signify that we can join rows in Person with rows in Address when their individual person id attributes match.

You can think of a joined row in the following manner; if Person.id and Address.person_id match, the joined row has all attributes from Person followed by all attributes of Address. So simply put we will get a table that contains a PersonAddress.

What happens when our Address table does not include an address for each Person? At that point the PersonAddress will not contain the person either. Here Address has constrained the query result.

What happens if we have multiple Address entries for each person? At that point PersonAddress will contain for each person with multiple addresses multiple PersonAddress pairs. Here Address increases the query result.

5.4. Settings¶

See Configuration chapter for more info, specifically the settings starting with SOURCE_.

The (yaml) format for SOURCE_WORLD is:

schemas:
  <schemaname>:
    tables:
      <tablename>:
        columns:
          - {name: <columnname>, type: <columntype>,
              [deid: pseudonymize],
              [identifies: population]}
        fk:
          <childschemaname>.<childtablename>:
            - [<columnname>, <childcolumnname>]
        keys: [<columnname>]

Multiple schemas, tables, columns, foreign keys, and keys are supported.
The source defines the columntypes, typically all default source columntypes are supported.
Per column a deid action can be defined. Only allowed action at this time is pseudonymize.
Each table column can contain a population statement that explains to the QueryBuilder what type of population is identified by this attribute. At query result calculation time, QueryBuilder will use these identity statements to determine the number of population members in the current query.
Foreign keys are registered at their target. So the table defining the key attribute gets to register child tables that point to it.
Multiple foreign keys with other child tables can be expressed by issueing multiple childtablename statements.
Multiple columns under one foreign key are considered compound foreign keys.
Global keys (think patient ids, instance ids) can be mentioned in the keys directive.
Global keys across different tables are matched by name.

World file is validated for internal consistancy on server startup. At an installation a first version can be generated using the Invoke inspect command.