4. Installation

Note

To be able to install XFM you need a token (defined as ${REPO_TOKEN} below). This token provides access to the MGRID package repository. You can request a token by sending an e-mail to support@mgrid.net.

4.1. Quickstart

For a single-node install, you can execute:

$ REPO_TOKEN=<YOUR_TOKEN>
$ curl -s http://${REPO_TOKEN}:@www.mgrid.net/quickstart/install.sh | sudo bash

This installs and sets up a base XFM configuration that accepts CDA R2 messages with RIM version 2.33R1 Normative Edition 2011. The RIM database is optimized for high message rates.

If successful, you can access the XFM command line tools by running:

$ su - xfmadmin

For available commands and testing, see Using xfmctl.

4.2. Detailed Procedure

XFM runs on a set of (virtual) machines, and is bundled with a command line tool (xfmctl) to help set up and manage an XFM deployment.

xfmctl is installed on a machine that manages the XFM deployment; this machine should have network access to the target machines.

Install xfmctl and its dependencies (xfmctl is written in Python and it is recommended to run it in a virtualenv environment).

First set some environment variables:

export REPO_TOKEN=<YOUR_TOKEN>
export ADMIN=xfmadmin
export ADMIN_HOME=/home/$ADMIN
export RELEASE=$(cat /etc/redhat-release | grep -oE '[0-9]+\.[0-9]+' | cut -d"." -f1)
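
As an optional sanity check (not part of the original procedure), verify that the major release number was extracted correctly; on a CentOS/RHEL 7 system this should print 7:

# Optional check: print the detected Enterprise Linux major version
echo "$RELEASE"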

Then add the MGRID package repository:

curl -s https://${REPO_TOKEN}:@packagecloud.io/install/repositories/mgrid/mgrid3/script.rpm.sh | sudo bash

This adds a Yum repository to the system, so that the xfmctl package becomes available for installation.
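
As an optional check, you can confirm that the repository was added. The exact repository id is an assumption here (packagecloud repositories are typically named after the user and repository, e.g. mgrid_mgrid3):

# List configured Yum repositories and look for the MGRID entry
yum repolist | grep -i mgrid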

Add the install group and user:

groupadd $ADMIN && useradd -d $ADMIN_HOME -m -g $ADMIN $ADMIN

Install the XFM command line tools (xfmctl). This requires the Extra Packages for Enterprise Linux (EPEL) repository.

yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-${RELEASE}.noarch.rpm
yum install -y mgridxfm3-xfmctl
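
If the installation succeeded, the package should be visible to rpm (an optional check):

# Confirm the xfmctl package is installed
rpm -q mgridxfm3-xfmctl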

The next command switches to the $ADMIN user, and preserves the environment variables (-E switch).

sudo -E -u $ADMIN bash
cd $ADMIN_HOME

Create SSH keys for key-based authentication (this allows xfmctl to connect to the nodes).

ssh-keygen -t rsa -b 2048 -f $ADMIN_HOME/.ssh/xfm_rsa -P ""
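
This creates a private/public key pair without a passphrase. An optional check that both files exist:

# The private key (xfm_rsa) and public key (xfm_rsa.pub) should both be present
ls -l $ADMIN_HOME/.ssh/xfm_rsa $ADMIN_HOME/.ssh/xfm_rsa.pub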

Create the virtual environment:

virtualenv ${ADMIN_HOME}/pyenv
source ${ADMIN_HOME}/pyenv/bin/activate
echo source ${ADMIN_HOME}/pyenv/bin/activate >> ${ADMIN_HOME}/.bashrc
pip install /opt/mgrid/xfm3-xfmctl/xfmctl-*.tar.gz

If successful, the xfmctl command is available in your virtualenv. The activation command is also added to the .bashrc file, so the virtualenv is activated automatically when logging in as $ADMIN. Before xfmctl can be used, it needs a configuration file.
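
As an optional check (assuming the virtualenv is still active), confirm that the xfmctl command is on the PATH:

# Should print a path inside ${ADMIN_HOME}/pyenv
command -v xfmctl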

4.3. xfmctl Configuration

xfmctl reads its configuration from a file xfm.json (expected in the directory where xfmctl is run).

xfmctl supports several target environments through so-called backends: it can set up and manage XFM nodes on existing cloud infrastructure, on virtual machines, or in Docker containers. The backends are:

  • plain: This backend is the most basic and does not create XFM instances itself (i.e. the XFM instances are already brought up by some other, possibly manual, process). xfmctl only needs to know the instance names and how to access them using Secure Shell (SSH).
  • amazon: For the amazon backend xfmctl uses the EC2 web service to create and manage instances. A valid Amazon EC2 account is required. When an instance is available, xfmctl also connects to it using SSH.

The xfmctl distribution contains sample configuration files for each backend; see Example configuration file for an example single-node configuration for the plain backend. Available settings are:

  • xfm

    • backend Selected backend. For the selected backend a configuration section backend_<NAME> must exist; for example backend_plain.
    • config XFM deployment configuration. Site-specific components (e.g., custom workers, messaging configuration, parser definitions) can be packaged, and selected with this configuration item. base is the default XFM configuration.
    • repo Name of the MGRID package repository. Default is mgrid3.
    • repotoken Access token of the MGRID package repository.
    • sshkey Path to the private key used to access XFM instances.
    • persistence (true, false) Set whether message persistence should be enabled for all messages (i.e. the broker stores messages to disk before sending acknowledgements). Enable this option when messages should survive broker restarts or failures; when persistence is disabled, messages are kept in memory. Note that writing messages to disk affects performance.
    • enable_metrics (true, false) Set whether XFM components should send metrics to the metrics server.
  • gateway

    • hostname Hostname used by nodes (e.g., ingesters, command-line tools) to access the gateway RabbitMQ instance.
    • username RabbitMQ username for messaging with the gateway.
    • password RabbitMQ password for messaging with the gateway.
  • broker

    • hostname Hostname used by nodes (e.g., ingesters, transformers, loaders, command-line tools) to access the broker RabbitMQ instance.
    • username RabbitMQ username for messaging with the broker.
    • password RabbitMQ password for messaging with the broker.
  • ingester

    • procspernode The number of processes to start on a node/instance.
    • prefetch The number of messages that are prefetched from the gateway.
    • flowcontrol Subcategory for flow control settings.
      • threshold Number of messages that are queued for transformers and loaders (summed) before flow control is activated (to limit intake from the gateway).
      • period Period in milliseconds for querying queue sizes. This determines how often the actual queue sizes are checked against the threshold (because of this polling mechanism it is possible that the combined queue sizes exceed the configured threshold).
  • transformer

    • procspernode The number of processes to start on a node/instance.
    • prefetch The number of messages that are prefetched from the broker.
    • json Settings for the JSON transformer (not used in this tutorial).
      • partitions Number of database table partitions.
      • table Name of the destination table.
      • column Name of the destination table column.
  • loader

    • procspernode The number of processes to start on a node/instance.

    • group Subcategory for group settings.

      • timeout Timeout in milliseconds for grouping (aggregating) messages. If fewer messages than the group size are received before the timeout, the (partial) group is processed. This prevents messages from stalling.

      • size_low, size_high The loaders choose a random group size at startup to avoid running in lockstep when uploading to the data lake. These parameters set the lower and upper bounds of the chosen group size. The prefetch size is chosen as 2 times the group size; for example, with size_low 50 and size_high 100, each loader picks a group size between 50 and 100 messages, giving a prefetch between 100 and 200.

      • pond Settings for the pond databases.

        • pgversion PostgreSQL database version to use; 9.4 or 9.5.
        • port Listening port of the pond database server.
  • metrics Metrics backend (Graphite)

    • hostname Hostname used by nodes (e.g., ingesters, loaders, command-line tools) to access the metrics server.
    • port Port for sending metric data. Note that this is the server port as used by clients.
    • secret Key for accessing the metrics web API.
  • lake

    • pgversion PostgreSQL database version to use; 9.4 or 9.5.
    • datadir Directory of the lake database files.
    • hostname Hostname of the lake database.
    • port Port of the lake database.
    • name Name of the lake database.
    • username Username to access the lake database.
    • password Password to access the lake database. Note this is used verbatim in a pgpass file, so : and \ should be escaped.
  • backend_plain

    • username Username used to access an instance.

    • hosts Key-value pairs of XFM instances. The key is the role name, and the value is a list of IP addresses or hostnames of the nodes belonging to that role (nodes should only have a single role). Typically each role represents a group of nodes with the same configuration profile, such as all ingester nodes.

      The role name is used to determine the configuration profile (i.e. installation instructions).

      In its base configuration XFM contains the following roles (this can be extended through site-specific configurations):

      • gateway Gateway broker (RabbitMQ). Should contain at most 1 entry.
      • broker Messaging broker (RabbitMQ). Should contain at most 1 entry.
      • ingester Nodes running Ingesters. Can contain 1 or more entries.
      • transformer Nodes running Transformers. Can contain 1 or more entries.
      • loader Nodes running Loaders. Can contain 1 or more entries.
      • lake Lake database (PostgreSQL). Should contain at most 1 entry.
      • rabbitmq Combination of gateway and broker on a single RabbitMQ instance. Should contain at most 1 entry.
      • worker Combination of Ingesters, Transformers and Loaders. Can contain 1 or more entries.
      • singlenolake Combination of all components on a single node except the Lake. Should contain at most 1 entry.
      • singlenode Combination of all components on a single node. Should contain at most 1 entry.
  • backend_amazon

    • imageid Identifier of the Amazon Machine Image (AMI) to use.
    • username Username used to access an instance (typically ec2-user).
    • keyname Name of the key pair.
    • cert Path to the certificate to access the Amazon AWS EC2 endpoint.
    • securitygroup Security group to use for an instance.
    • management, gateway, broker, ingester, transformer, loader
      • sizeid The size identifier of an instance hardware configuration (e.g., t1.micro).

The SSH key created in a previous step ($ADMIN_HOME/.ssh/xfm_rsa) should be referenced as the sshkey setting in the xfm.json configuration file; adjust the sshkey path in the example configuration accordingly.
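
Before running xfmctl, it can be useful to check that xfm.json is syntactically valid JSON and that the configured key file exists. This is an optional check, not part of the original procedure; it assumes the file is in the current directory and that sshkey points to the key created earlier:

# Validate the JSON syntax of the configuration file
python -m json.tool xfm.json > /dev/null && echo "xfm.json: valid JSON"

# Check that the private key exists (path assumed from the earlier ssh-keygen step)
test -f $ADMIN_HOME/.ssh/xfm_rsa && echo "SSH key found"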

4.4. Installing nodes

Before xfmctl can start installation of a node, it must be able to access it using key-based authentication. To copy the created key to the node, do:

$ ssh-copy-id -i $ADMIN_HOME/.ssh/xfm_rsa <USERNAME>@<HOSTNAME>

Substitute <USERNAME> with the username from the backend's username setting in xfm.json, and <HOSTNAME> with each hostname (or IP address) listed in the backend's hosts section. To list the configured addresses run:

$ xfmctl list
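
For example, using the single-node example configuration shown later in this chapter (backend_plain with host 127.0.0.1 and username $ADMIN), the copy command would look like this; substitute your own values for other setups:

$ ssh-copy-id -i $ADMIN_HOME/.ssh/xfm_rsa $ADMIN@127.0.0.1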

Now xfmctl should have access to the node. To test, do:

$ xfmctl --hosts=<HOSTNAME> -- cat /etc/redhat-release

When successful, it should print the distribution version, without prompting for a username or password.

xfmctl needs to know on which nodes to run its commands. Above the --hosts parameter was used to select individual nodes, but the --roles parameter can be used to run commands on all nodes listed for that role in the backend hosts section in xfm.json. Multiple roles can be provided, separated with commas, for example:

$ xfmctl --roles=ingester,transformer,loader -- cat /etc/redhat-release

Now that xfmctl can access the nodes, it should be given privileged access so that it can make system-wide changes to the nodes (e.g., install and configure software).

To enable privileged access using xfmctl, execute the following command for all roles (prompts for the root password on each node):

xfmctl --roles=singlenode -- su -c "\"mkdir -p /etc/sudoers.d && echo $ADMIN 'ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/999_xfm\""
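
To verify that passwordless sudo now works (an optional check using the command pass-through shown earlier), the following should complete without prompting for a password:

# Non-interactive sudo should succeed on each node of the role
xfmctl --roles=singlenode -- sudo -n true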

When ready, a node can be installed:

$ xfmctl --roles=singlenode setup update_config update

The installation commands are as follows:

  • setup
  • update_config Copies the settings in xfm.json to a node so that they are available during installation.

  • update Installs or updates a node. The actual steps performed depend on the role of the node as configured in the backend hosts section in xfm.json.

4.5. Amazon backend: Creating nodes

When using the Amazon backend, there is an additional command, create. This command requires a role parameter; as was already seen when creating the management server instance, the role is passed after a colon:

$ xfmctl create:management

Note that while the create command returns after the instance is running with networking enabled, it can take some additional time before access using SSH is possible. If subsequent update commands time out, it often helps to wait a bit and retry.
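
A simple way to wait for SSH access is to retry a trivial command until it succeeds. This is a minimal sketch using only the documented command pass-through; it assumes a role named management exists in xfm.json and that xfmctl returns a non-zero exit status when the remote command fails:

# Retry until the management node accepts SSH connections from xfmctl
until xfmctl --roles=management -- true; do
  echo "SSH not ready yet, retrying in 15 seconds..."
  sleep 15
done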

4.5.1. Updating configuration

When the management server was set up, the configuration from xfm.json was uploaded. When this file changes (e.g., to change the prefetch of the ingesters), the instances should be updated to reflect the changes:

$ xfmctl --roles=ingester update_config update

4.5.2. Example configuration file

{
  "xfm": {
    "backend": "plain",
    "config": "base",
    "repo": "mgrid3",
    "repotoken": "${REPO_TOKEN}",
    "sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
    "persistence": false,
    "enable_metrics": false
  },
  "gateway": {
    "hostname": "localhost",
    "username": "xfm",
    "password": "tr4nz00m"
  },
  "broker": {
    "hostname": "localhost",
    "username": "xfm",
    "password": "tr4nz00m"
  },
  "ingester": {
    "procspernode": 1,
    "prefetch": 50,
    "flowcontrol": {
      "threshold": 5000,
      "period": 2000
    }
  },
  "transformer": {
    "procspernode": 1,
    "prefetch": 50,
    "json": {
      "partitions": 400,
      "table": "document",
      "column": "document"
    }
  },
  "loader": {
    "procspernode": 1,
    "group": {
      "timeout": 1000,
      "size_low": 50,
      "size_high": 100
    },
    "pond": {
      "pgversion": "9.4",
      "port": 5433
    }
  },
  "metrics": {
    "hostname": "localhost",
    "port": 2003,
    "secret": "verysecret"
  },
  "lake": {
    "pgversion": "9.4",
    "datadir": "/var/lib/pgsql/9.4/lake",
    "hostname": "localhost",
    "port": 5432,
    "name": "lake",
    "username": "xfmuser",
    "password": "lake"
  },
  "backend_plain": {
    "username": "${ADMIN}",
    "hosts": {
      "singlenode": [
        "127.0.0.1"
      ]
    }
  }
}
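
Note that the example above contains shell-style placeholders (${REPO_TOKEN}, ${ADMIN_HOME}, ${ADMIN}) that must hold real values before the file is used. Whether xfmctl expands them itself is not stated here, so one option is to substitute them up front from the exported environment variables, for example with envsubst; this workflow and the template filename are assumptions, not part of the original procedure:

# Fill in the exported environment variables and write the final configuration
envsubst < xfm.json.template > xfm.json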

4.6. Multinode install

Below is an example of how to install multiple nodes:

  • 1 Management server
  • 1 RabbitMQ message broker
  • 1 Ingester node
  • 2 Transformer nodes
  • 2 Loader nodes
  • 1 Lake node

Add all hosts to xfm.json (see xfmctl Configuration for an explanation of the settings):

{
  "xfm": {
    "backend": "plain",
    "config": "base",
    "repo": "mgrid3",
    "repotoken": "${REPO_TOKEN}",
    "sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
    "persistence": false,
    "enable_metrics": false
  },
  "gateway": {
    "hostname": "192.168.1.110",
    "username": "xfm",
    "password": "tr4nz00m"
  },
  "broker": {
    "hostname": "192.168.1.110",
    "username": "xfm",
    "password": "tr4nz00m"
  },
  "ingester": {
    "procspernode": 1,
    "prefetch": 50,
    "flowcontrol": {
      "threshold": 5000,
      "period": 2000
    }
  },
  "transformer": {
    "procspernode": 1,
    "prefetch": 50,
    "json": {
      "partitions": 400,
      "table": "document",
      "column": "document"
    }
  },
  "loader": {
    "procspernode": 1,
    "group": {
      "timeout": 1000,
      "size_low": 50,
      "size_high": 100
    },
    "pond": {
      "pgversion": "9.4",
      "port": 5433
    }
  },
  "metrics": {
    "hostname": "192.168.1.110",
    "port": 2003,
    "secret": "${METRICS_API_SECRET}"
  },
  "lake": {
    "pgversion": "9.4",
    "datadir": "/var/lib/pgsql/9.4/lake",
    "hostname": "localhost",
    "port": 5432,
    "name": "lake",
    "username": "xfmuser",
    "password": "lake"
  },
  "backend_plain": {
    "username": "${ADMIN}",
    "hosts": {
      "rabbitmq": [
        "192.168.1.110"
      ],
      "ingester": [
        "192.168.1.120"
      ],
      "transformer": [
        "192.168.1.130",
        "192.168.1.131"
      ],
      "loader": [
        "192.168.1.140",
        "192.168.1.141"
      ],
      "lake": [
        "192.168.1.150"
      ]
    }
  }
}

After preparing each node for installation (see Installing nodes), start the installation:

xfmctl --parallel --roles=rabbitmq,lake,ingester,transformer,loader setup update_config update

The --parallel switch is optional but allows installation of multiple nodes in parallel.

When successful, the host-based access configuration of the lake should be edited to allow loader access. To do this, edit the file /etc/xfm/lake_hba.conf on the lake node; the entry format follows PostgreSQL's pg_hba.conf. For example, for the loaders in the xfm.json above (assuming a CIDR mask length of 24):

host lake xfmuser 192.168.1.0/24 trust
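
One way to apply this change without logging in to the lake node manually is to reuse the xfmctl command pass-through. The quoting below mirrors the sudoers step above and is a sketch, not a verified command; the lake database may also need to reload its configuration before the new rule takes effect:

# Append the access rule on the lake node (sketch; quoting follows the sudoers example above)
xfmctl --roles=lake -- sudo bash -c "\"echo 'host lake xfmuser 192.168.1.0/24 trust' >> /etc/xfm/lake_hba.conf\""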