4. Installation¶
Note
To install XFM you need a token (referred to as ${REPO_TOKEN} below). This token provides access to the MGRID package repository. You can request a token by sending an e-mail to support@mgrid.net.
4.1. Quickstart¶
For a single node install, you can execute:
$ REPO_TOKEN=<YOUR_TOKEN>
$ curl -s http://${REPO_TOKEN}:@www.mgrid.net/quickstart/install.sh | sudo bash
This will install and set up a base XFM configuration, which accepts CDA R2 messages with RIM version 2.33R1 Normative Edition 2011. The RIM database is optimized for high message rates.
If successful, you can access the XFM command line tools by running:
$ su - xfmadmin
For available commands and testing, see Using xfmctl.
4.2. Detailed Procedure¶
XFM runs on a set of (virtual) machines, and is bundled with a command line tool (xfmctl) to help set up and manage an XFM deployment. xfmctl is installed on a machine that manages the XFM deployment, and should have network access to the target machines.
Install xfmctl and its dependencies (xfmctl is written in Python, and it is recommended to run it in a virtualenv environment).
First set some environment variables:
export REPO_TOKEN=<YOUR_TOKEN>
export ADMIN=xfmadmin
export ADMIN_HOME=/home/$ADMIN
export RELEASE=$(cat /etc/redhat-release | grep -oE '[0-9]+\.[0-9]+' | cut -d"." -f1)
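The RELEASE pipeline above extracts the distribution's major version from /etc/redhat-release. Applied to a sample release string (the CentOS 7 string below is an assumption for illustration), it behaves like this:

```shell
# Hypothetical /etc/redhat-release content (assumed CentOS 7 string):
release_string="CentOS Linux release 7.9.2009 (Core)"
# Same extraction as above: take the first "major.minor" match, keep the major part
RELEASE=$(echo "$release_string" | grep -oE '[0-9]+\.[0-9]+' | cut -d"." -f1)
echo "$RELEASE"   # prints 7
```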
Then add the MGRID package repository:
curl -s https://${REPO_TOKEN}:@packagecloud.io/install/repositories/mgrid/mgrid3/script.rpm.sh | sudo bash
This adds a Yum repository to the system, so the xfmctl package becomes available for installation.
Add the install group and user:
groupadd $ADMIN && useradd -d $ADMIN_HOME -m -g $ADMIN $ADMIN
Install the XFM command line tools (xfmctl
). This requires the Extra
Packages for Enterprise Linux (EPEL) repository.
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-${RELEASE}.noarch.rpm
yum install -y mgridxfm3-xfmctl
The next command switches to the $ADMIN user, and preserves the environment variables (-E switch).
sudo -E -u $ADMIN bash
cd $ADMIN_HOME
Create SSH keys for key-based authentication (allows xfmctl to connect to the nodes).
ssh-keygen -t rsa -b 2048 -f $ADMIN_HOME/.ssh/xfm_rsa -P ""
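To confirm the key pair was created as intended, ssh-keygen can print its bit length and fingerprint. A sketch using a scratch directory instead of $ADMIN_HOME:

```shell
tmp=$(mktemp -d)
# Generate a test key pair with the same parameters as above (-q suppresses output)
ssh-keygen -t rsa -b 2048 -f "$tmp/xfm_rsa" -P "" -q
ls "$tmp"                             # xfm_rsa (private key) and xfm_rsa.pub (public key)
ssh-keygen -l -f "$tmp/xfm_rsa.pub"   # first field is the bit length: 2048
```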
Create the virtual environment:
virtualenv ${ADMIN_HOME}/pyenv
source ${ADMIN_HOME}/pyenv/bin/activate
echo source ${ADMIN_HOME}/pyenv/bin/activate >> ${ADMIN_HOME}/.bashrc
pip install /opt/mgrid/xfm3-xfmctl/xfmctl-*.tar.gz
If successful, the xfmctl command is available in your virtualenv. It is also added to the .bashrc file, so the virtualenv is activated when logging in as $ADMIN. Before it can be used it needs a configuration file.
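On systems where the virtualenv tool is not packaged, Python 3's built-in venv module can create an equivalent environment. A sketch (using a scratch path for illustration; in practice use ${ADMIN_HOME}/pyenv as above):

```shell
# Create and activate a virtual environment with the stdlib venv module
env_dir=$(mktemp -d)/pyenv
python3 -m venv "$env_dir"
. "$env_dir/bin/activate"
python -c 'import sys; print(sys.prefix)'   # prints the environment directory
deactivate
```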
4.3. xfmctl Configuration¶
xfmctl reads its configuration from a file xfm.json (expected in the directory where xfmctl is run).
xfmctl supports several target environments through so-called backends; it can set up and manage XFM nodes using existing cloud infrastructure, interact directly with virtual machines, or use Docker containers. The backends are:
- plain: This backend is the most basic and does not create XFM instances itself (i.e. the XFM instances are already brought up by some other, possibly manual, process). xfmctl only needs to know the instance names and how to access them using Secure Shell (SSH).
- amazon: For the amazon backend xfmctl uses the EC2 web service to create and manage instances. A valid Amazon EC2 account is required. When an instance is available, xfmctl also connects to it using SSH.
The xfmctl distribution contains sample configuration files for each backend; see Example configuration file for an example single node configuration for the plain backend. Available settings are:
xfm
- backend: Selected backend. For the selected backend a configuration section backend_<NAME> must exist; for example backend_plain.
- config: XFM deployment configuration. Site-specific components (e.g., custom workers, messaging configuration, parser definitions) can be packaged, and selected with this configuration item. base is the default XFM configuration.
- repo: Name of the MGRID package repository. Default is mgrid3.
- repotoken: Access token of the MGRID package repository.
- sshkey: Path to the private key used to access XFM instances.
- persistence: (true, false) Set whether message persistence should be enabled for all messages (i.e. the broker stores messages to disk before sending acknowledgements). Enable this option when messages should survive broker restarts or failures; when persistence is disabled, messages are kept in memory. Note that writing messages to disk affects performance.
- enable_metrics: (true, false) Set whether XFM components should send metrics to the metrics server.
gateway
- hostname: Hostname used by nodes (e.g., ingesters, command-line tools) to access the gateway RabbitMQ instance.
- username: RabbitMQ username for messaging with the gateway.
- password: RabbitMQ password for messaging with the gateway.
broker
- hostname: Hostname used by nodes (e.g., ingesters, transformers, loaders, command-line tools) to access the broker RabbitMQ instance.
- username: RabbitMQ username for messaging with the broker.
- password: RabbitMQ password for messaging with the broker.
ingester
- procspernode: The number of processes to start on a node/instance.
- prefetch: The number of messages that are prefetched from the gateway.
- flowcontrol: Subcategory for flow control settings.
  - threshold: Number of messages that are queued for transformers and loaders (summed) before flow control is activated (to limit intake from the gateway).
  - period: Period in milliseconds for querying queue sizes. This determines how often the actual queue sizes are checked against the threshold (because of this polling mechanism it is possible that the combined queue sizes exceed the configured threshold).
transformer
- procspernode: The number of processes to start on a node/instance.
- prefetch: The number of messages that are prefetched from the broker.
- json: Settings for the JSON transformer (not used in this tutorial).
  - partitions: Number of database table partitions.
  - table: Name of the destination table.
  - column: Name of the destination table column.
loader
- procspernode: The number of processes to start on a node/instance.
- group: Subcategory for group settings.
  - timeout: Timeout in milliseconds for grouping (aggregating) messages. If fewer messages than the group size are received before the timeout, the (partial) group is processed. This avoids stalling of messages.
  - size_low, size_high: The loaders choose a random group size at startup to avoid running in lockstep when uploading towards the data lake. These parameters control the lower and upper bound of the chosen group size. The prefetch size is chosen as 2 times the group size.
- pond: Settings for the pond databases.
  - pgversion: PostgreSQL database version to use; 9.4 or 9.5.
  - port: Listening port of the pond database server.
metrics
Metrics backend (Graphite).
- hostname: Hostname used by nodes (e.g., ingesters, loaders, command-line tools) to access the metrics server.
- port: Port for sending metric data. Note that this is the server port as used by clients.
- secret: Key for accessing the metrics web API.
lake
- pgversion: PostgreSQL database version to use; 9.4 or 9.5.
- datadir: Directory of the lake database files.
- hostname: Hostname of the lake database.
- port: Port of the lake database.
- name: Name of the lake database.
- username: Username to access the lake database.
- password: Password to access the lake database. Note this is used verbatim in a pgpass file, so : and \ should be escaped.
backend_plain
- username: Username used to access an instance.
- hosts: Key-value pairs of XFM instances. The key is the role name, and the value is a list of IP addresses or hostnames of the nodes belonging to that role (nodes should only have a single role). Typically each role represents a group of nodes with the same configuration profile, such as all ingester nodes. The role name is used to determine the configuration profile (i.e. installation instructions).
In its base configuration XFM contains the following roles (this can be extended through site-specific configurations):
- gateway: Gateway broker (RabbitMQ). Should contain at most 1 entry.
- broker: Messaging broker (RabbitMQ). Should contain at most 1 entry.
- ingester: Nodes running Ingesters. Can contain 1 or more entries.
- transformer: Nodes running Transformers. Can contain 1 or more entries.
- loader: Nodes running Loaders. Can contain 1 or more entries.
- lake: Lake database (PostgreSQL). Should contain at most 1 entry.
- rabbitmq: Combination of gateway and broker on a single RabbitMQ instance. Should contain at most 1 entry.
- worker: Combination of Ingesters, Transformers and Loaders. Can contain 1 or more entries.
- singlenolake: Combination of all components on a single node except the Lake. Should contain at most 1 entry.
- singlenode: Combination of all components on a single node. Should contain at most 1 entry.
backend_amazon
- imageid: Identifier of the Amazon Machine Image (AMI) to use.
- username: Username used to access an instance (typically ec2-user).
- keyname: Name of the key pair.
- cert: Path to the certificate to access the Amazon AWS EC2 endpoint.
- securitygroup: Security group to use for an instance.
- management, gateway, broker, ingester, transformer, loader:
  - sizeid: The size identifier of an instance hardware configuration (e.g., t1.micro).
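For reference, a backend_amazon section might look as follows (all values below are illustrative assumptions, not defaults):

```json
"backend_amazon": {
    "imageid": "ami-0123456789abcdef0",
    "username": "ec2-user",
    "keyname": "xfm-keypair",
    "cert": "/home/xfmadmin/.aws/cert.pem",
    "securitygroup": "xfm-nodes",
    "management": { "sizeid": "t1.micro" },
    "gateway": { "sizeid": "t1.micro" },
    "broker": { "sizeid": "t1.micro" },
    "ingester": { "sizeid": "t1.micro" },
    "transformer": { "sizeid": "t1.micro" },
    "loader": { "sizeid": "t1.micro" }
}
```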
The created SSH key should be used as the sshkey in the xfm.json configuration file (as created in a previous step).
4.4. Installing nodes¶
Before xfmctl can start installation of a node, it must be able to access it using key-based authentication. To copy the created key to the node, do:
$ ssh-copy-id -i $ADMIN_HOME/.ssh/xfm_rsa <USERNAME>@<HOSTNAME>
Substitute <USERNAME> with the username used in the backend username in xfm.json, and <HOSTNAME> for each hostname (or IP address) used in the backend hosts section. To list the configured addresses run:
$ xfmctl list
Now xfmctl should have access to the node. To test, do:
$ xfmctl --hosts=<HOSTNAME> -- cat /etc/redhat-release
When successful, it should print the distribution version, without prompting for a username or password.
xfmctl needs to know on which nodes to run its commands. Above, the --hosts parameter was used to select individual nodes, but the --roles parameter can be used to run commands on all nodes listed for that role in the backend hosts section in xfm.json. Multiple roles can be provided, separated with commas, for example:
$ xfmctl --roles=ingester,transformer,loader -- cat /etc/redhat-release
Now that xfmctl can access the nodes, it should be allowed privileged access, to be able to make system-wide changes to the nodes (e.g., install and configure software).
To enable privileged access using xfmctl, execute the following command for all roles (prompts for the root password on each node):
xfmctl --roles=singlenode -- su -c "\"mkdir -p /etc/sudoers.d && echo $ADMIN 'ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/999_xfm\""
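The nested quoting in this command is easy to get wrong; the sudoers fragment it writes on each node can be previewed locally (a sketch using a scratch directory instead of /etc):

```shell
ADMIN=xfmadmin                       # matches the $ADMIN variable set earlier
tmp=$(mktemp -d)
mkdir -p "$tmp/sudoers.d"
# Same redirection the remote su -c command performs, minus the SSH/quoting layers
echo $ADMIN 'ALL=(ALL) NOPASSWD: ALL' > "$tmp/sudoers.d/999_xfm"
cat "$tmp/sudoers.d/999_xfm"         # prints: xfmadmin ALL=(ALL) NOPASSWD: ALL
```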
When ready, a node can be installed:
$ xfmctl --roles=singlenode setup update_config update
The installation commands are as follows:
setup
Adds the repositories to the node that are needed to install XFM dependencies in addition to the base RedHat and CentOS repositories, installs the XFM bootstrap package and configures Puppet. Installed repositories are:
- MGRID
- Puppet
- PostgreSQL (9.4 and 9.5)
- Extra Packages for Enterprise Linux (EPEL)
- Software Collections (SCL), only for RedHat/CentOS 6.
update_config
Copies the settings in xfm.json to a node so that they are available during installation.
update
Installs or updates a node. The actual steps performed depend on the role of the node as configured in the backend hosts section in xfm.json.
4.5. Amazon backend: Creating nodes¶
When using the Amazon backend, there is an additional command, create. This command requires a role parameter. As was already seen in creating the management server instance, this is done by passing the role after a colon:
$ xfmctl create:management
Note that while the create command returns after the instance is running with networking enabled, it can take some additional time before access using SSH is possible. If subsequent update commands time out, it often helps to wait a bit and retry.
4.5.1. Updating configuration¶
In setting up the management server the configuration from xfm.json was uploaded. When this file changes (e.g., to change the prefetch of the ingesters), the instances should be updated to reflect the changes:
$ xfmctl --roles=ingester update_config update
4.5.2. Example configuration file¶
{
"xfm": {
"backend": "plain",
"config": "base",
"repo": "mgrid3",
"repotoken": "${REPO_TOKEN}",
"sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
"persistence": false,
"enable_metrics": false
},
"gateway": {
"hostname": "localhost",
"username": "xfm",
"password": "tr4nz00m"
},
"broker": {
"hostname": "localhost",
"username": "xfm",
"password": "tr4nz00m"
},
"ingester": {
"procspernode": 1,
"prefetch": 50,
"flowcontrol": {
"threshold": 5000,
"period": 2000
}
},
"transformer": {
"procspernode": 1,
"prefetch": 50,
"json": {
"partitions": 400,
"table": "document",
"column": "document"
}
},
"loader": {
"procspernode": 1,
"group": {
"timeout": 1000,
"size_low": 50,
"size_high": 100
},
"pond": {
"pgversion": "9.4",
"port": 5433
}
},
"metrics": {
"hostname": "localhost",
"port": 2003,
"secret": "verysecret"
},
"lake": {
"pgversion": "9.4",
"datadir": "/var/lib/pgsql/9.4/lake",
"hostname": "localhost",
"port": 5432,
"name": "lake",
"username": "xfmuser",
"password": "lake"
},
"backend_plain": {
"username": "${ADMIN}",
"hosts": {
"singlenode": [
"127.0.0.1"
]
}
}
}
4.6. Multinode install¶
Below is an example of how to install multiple nodes:
- 1 Management server
- 1 RabbitMQ message broker
- 1 Ingester node
- 1 Transformer node
- 1 Loader node
- 1 Lake node
Add all hosts to xfm.json (see xfmctl Configuration for an explanation of the settings):
{
"xfm": {
"backend": "plain",
"config": "base",
"repo": "mgrid3",
"repotoken": "${REPO_TOKEN}",
"sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
"persistence": false,
"enable_metrics": false
},
"gateway": {
"hostname": "192.168.1.110",
"username": "xfm",
"password": "tr4nz00m"
},
"broker": {
"hostname": "192.168.1.110",
"username": "xfm",
"password": "tr4nz00m"
},
"ingester": {
"procspernode": 1,
"prefetch": 50,
"flowcontrol": {
"threshold": 5000,
"period": 2000
}
},
"transformer": {
"procspernode": 1,
"prefetch": 50,
"json": {
"partitions": 400,
"table": "document",
"column": "document"
}
},
"loader": {
"procspernode": 1,
"group": {
"timeout": 1000,
"size_low": 50,
"size_high": 100
},
"pond": {
"pgversion": "9.4",
"port": 5433
}
},
"metrics": {
"hostname": "192.168.1.110",
"port": 2003,
"secret": "${METRICS_API_SECRET}"
},
"lake": {
"pgversion": "9.4",
"datadir": "/var/lib/pgsql/9.4/lake",
"hostname": "localhost",
"port": 5432,
"name": "lake",
"username": "xfmuser",
"password": "lake"
},
"backend_plain": {
"username": "${ADMIN}",
"hosts": {
"rabbitmq": [
"192.168.1.110"
],
"ingester": [
"192.168.1.120"
],
"transformer": [
"192.168.1.130",
"192.168.1.131"
],
"loader": [
"192.168.1.140",
"192.168.1.141"
],
"lake": [
"192.168.1.150"
]
}
}
}
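Since xfmctl reads xfm.json at startup, checking the file for JSON syntax errors before running commands avoids confusing failures. A sketch using Python's built-in json.tool module (the minimal file below is for illustration only):

```shell
# Validate a (minimal, illustrative) xfm.json with the stdlib JSON parser
cat > /tmp/xfm-check.json <<'EOF'
{"xfm": {"backend": "plain", "config": "base"}}
EOF
python3 -m json.tool /tmp/xfm-check.json > /dev/null && echo "valid JSON"
```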
After preparing each node for installation (see Installing nodes), start the installation:
xfmctl --parallel --roles=rabbitmq,lake,ingester,transformer,loader setup update_config update
The --parallel switch is optional but allows installation of multiple nodes in parallel.
When successful, the host-based access configuration of the lake should be edited to allow loader access. To do this, edit the file /etc/xfm/lake_hba.conf on the lake node. For example, for the loaders in the xfm.json above (we assume a CIDR mask length of 24; see here for details):
host lake xfmuser 192.168.1.0/24 trust