.. _running:

Running COBALT
===============

This document provides a comprehensive guide to running COBALT observations. It covers the preparation steps, workflow details, and instructions for both single-node and multi-node setups.

.. note::

   If you encounter any problems during the preparation process, refer to the :ref:`Common Issues <common-issues>` section for troubleshooting tips.

**Contents:**

1. **Preparation**: Learn how to set up a COBALT environment and load the required modules. See :ref:`Preparation <preperation>`.
2. **Workflow**: Understand the sequence of scripts and processes involved in running a COBALT observation. See :ref:`Workflow <workflow>`.
3. **Running an Observation (Single Node)**: Step-by-step instructions for running an observation on a single node. See :ref:`Running an Observation <running-single-node>`.
4. **Running an Observation (Multi-Node)**: Instructions for running an observation across multiple nodes. See :ref:`Running a multi-node Observation <running-multi-node>`.
5. **Common Issues**: Troubleshooting tips for common problems encountered while running COBALT. See :ref:`Common Issues <common-issues>`.

Preparation
-----------

.. _preperation:

For convenience, the COBALT repository provides a bash script that loads all required modules. If this script is sourced from the :code:`.bashrc` file, the modules are loaded automatically whenever a new shell is opened.

**To use this script:**

.. code-block:: bash

   git clone https://git.astron.nl/cobalt/cobalt-installation
   vi ~/.bashrc
   # Now, add the following line to the end of the file:
   source /data/cobalt/spack/activate_cobalt.sh

Workflow
--------

.. _workflow:

Running a COBALT observation involves several scripts that interact with each other. The `GPUProc/{src,test}` directories contain the scripts used to start observations. The call chain is as follows:

..              OnlineControl (MAC)
..               |        |       ^
..               V        |       |
..    startBGL.sh -> runObservation.sh -> mpirun.sh -> [mpirun] rtcp
..         ^        V       ^         ^
..         |    stopBGL.sh -/         |
..         |                          |
..    tstartBGL.sh             tMACfeedback.sh
..                             tProductionParsets.sh
..                             testParset.sh

.. graphviz::

   digraph observation_flow {
       rankdir=LR;

       // Primary Nodes (Main Steps)
       node [shape=box, style=filled, fillcolor="#438dd5", fontcolor=white, fontsize=14];
       "OnlineControl (MAC)";
       "startBGL.sh";
       "runObservation.sh";
       "mpirun.sh";
       "[mpirun] rtcp";

       // Intermediate Nodes
       node [fillcolor="#5a9bd4", shape=ellipse, fontsize=14];
       "stopBGL.sh";
       "tMACfeedback.sh";
       "tProductionParsets.sh";
       "testParset.sh";

       // Connections
       "runObservation.sh" -> "OnlineControl (MAC)";
       "startBGL.sh" -> "runObservation.sh";
       "stopBGL.sh" -> "runObservation.sh";
       "runObservation.sh" -> "mpirun.sh";
       "mpirun.sh" -> "[mpirun] rtcp";
       "OnlineControl (MAC)" -> "stopBGL.sh";
       "tMACfeedback.sh" -> "runObservation.sh";
       "testParset.sh" -> "runObservation.sh";
       "tProductionParsets.sh" -> "runObservation.sh";
   }

The central thread is the main call chain, with the following roles and responsibilities:

.. rst-class:: enumerated-list

1. `startBGL.sh`

   **Syntax:**

   .. rst-class:: small-code

   .. code-block:: bash

      ./startBGL.sh 1 2 3 $PARSET $OBSID

   **Description:**

   - Starts the observation in the background by calling `runObservation.sh`.
   - Augments the parset with keys from `$LOFARROOT/etc/parset-additions.d/*`.
   - Redirects output to `$LOFARROOT/var/log/rtcp-$OBSID.log`.
   - Tracks the PID for `stopBGL.sh`.
   - Informs OnlineControl when the observation starts.

2. `runObservation.sh`

   **Syntax:**

   .. rst-class:: small-code

   .. code-block:: bash

      ./runObservation.sh $PARSET

   **Description:**

   - Runs the observation in the foreground and calls `mpirun.sh`.
   - Optionally augments the parset with COBALT-specific settings.
   - Can be forced to run on localhost only, with a given number of MPI processes (e.g., 4), using `-l 4`.
   - Copies `Observation$OBSID_feedback` to OnlineControl (ccu001).
   - Reports `ABORT` or `FINISHED` to OnlineControl (can be suppressed with `-F`).
   - Creates a PID file for :code:`stopBGL.sh`.

3. `mpirun.sh`

   **Syntax:**

   .. rst-class:: small-code

   .. code-block:: bash

      mpirun.sh -x LOFARROOT=$LOFARROOT \
                -H `mpi_node_list -n $PARSET` `which rtcp` \
                $PARSET

   **Description:**

   - Acts as `mpirun`, but wraps the currently selected MPI library (OpenMPI, MVAPICH2, or no MPI).
   - Starts the COBALT software by launching the `rtcp` executable with the parset.

4. `stopBGL.sh`

   **Syntax:**

   .. rst-class:: small-code

   .. code-block:: bash

      ./stopBGL.sh 1 $OBSID

   **Description:**

   - Stops the corresponding observation using the saved PID.
   - Signals a running `runObservation.sh` to finish or abort.

5. `rtcp`

   **Description:**

   - The actual program that performs the correlation/beamforming.

Running an Observation
----------------------

.. _running-single-node:

To run an observation in COBALT, a so-called *parset* file must be provided. The parset is either generated by the Radio Observatory software (the Scheduler) or hand-crafted. Once a parset file has been selected, a new observation run can be started by following the steps below.

.. note::

   For simplicity, the following example assumes that COBALT is run on a single node, i.e., `localhost`. For a multi-node setup, please refer to the :ref:`Running an Observation (Multi-Node) <running-multi-node>` section.

1. Set the :code:`$PARSET` variable so that it points to the correct file, i.e.:

   .. rst-class:: small-code

   .. code-block:: bash

      export PARSET=/path/to/parset

2. Set the :code:`$LOFARROOT` environment variable to the root directory of the COBALT installation. For instance, if COBALT is installed in :code:`cobalt`, it can be set by running:

   .. rst-class:: small-code

   .. code-block:: bash

      export LOFARROOT=/home/[user]/cobalt

   For simplicity, the `bin` directory can be added to the :code:`$PATH` as well:

   .. rst-class:: small-code

   .. code-block:: bash

      export PATH=$LOFARROOT/build/gnucxx11_CEP4_optarch/bin:$PATH

3. Prepare the LOFARROOT subdirectories:

   .. rst-class:: small-code

   .. code-block:: bash

      mkdir -p $LOFARROOT/nfs/parset/
      mkdir -p $LOFARROOT/var/log/
      mkdir -p $LOFARROOT/var/run/
      mkdir -p $LOFARROOT/nfs/feedback/

4. To start an observation, use the following command format:

   .. rst-class:: small-code

   .. code-block:: bash

      pkill outputProc  # Ensure that no stale outputProc processes are running
      runObservation.sh -l -P [PID_file] -c [PIPE_file] [PARSET] [OBSERVATION_ID]

   Note that the `-l` and `-P` flags are optional. The `-l` flag is used to run the observation on localhost, while the `-P` flag is used to create a PID file.

   **Example:**

   This command starts an observation using the specified parset file and observation ID, while storing the process ID in the given PID file and using the specified pipe for communication. The observation ID must match the one in the parset file.

   .. rst-class:: small-code

   .. code-block:: bash

      runObservation.sh -l -P cobalt/var/run/rtcp-5.pid -c cobalt/var/run/rtcp-5.pipe ../../../CobaltTest/test/tManyPartTABOutput.parset 123882

   **Arguments:**

   - :code:`-l`: Run solely on localhost using a specified number of MPI processes. This is useful for isolated testing.
   - :code:`-P cobalt/var/run/rtcp-5.pid`: Specifies the path to the PID file where the process ID of the observation will be stored.
   - :code:`-c cobalt/var/run/rtcp-5.pipe`: Specifies the path to the named pipe used for communication.
   - :code:`../../../CobaltTest/test/tManyPartTABOutput.parset`: The parset file that defines the observation parameters.
   - :code:`123882`: The observation ID, which uniquely identifies the observation.

All available command line options for `runObservation.sh` are listed below:

.. list-table:: `runObservation.sh` Command Line Options
   :widths: 15 85
   :header-rows: 1

   * - Option
     - Description
   * - :code:`-A`
     - Do NOT augment parset.
   * - :code:`-B`
     - Do NOT add broken antenna information.
   * - :code:`-C`
     - Run with check tool specified in environment variable :code:`LOFAR_CHECKTOOL`.
   * - :code:`-F`
     - Do NOT send data points to a PVSS gateway.
   * - :code:`-P`
     - Create PID file.
   * - :code:`-d`
     - Dummy run: don't execute anything.
   * - :code:`-l`
     - Run solely on localhost using :code:`nprocs` MPI processes (isolated test).
   * - :code:`-p`
     - Enable profiling. This enforces:

       - Sequential kernel execution.
       - One SubbandProc/GPU (no overlap between kernels and GPU transfers).
       - Performance statistics collection.
   * - :code:`-o`
     - Add option :code:`KEY=VALUE` to the parset.
   * - :code:`-x`
     - Propagate environment variable :code:`KEY=VALUE`.

.. After a parset file has been selected, change :code:`$PARSET` to point to the correct file.

.. **Key Notes:**

.. * The :code:`mpi_node_list` utility extracts the list of hosts that :code:`rtcp` should run on by extracting the :code:`Cobalt.Hardware.Node[x].host` keys and joining them with commas.
.. * :code:`LOFARROOT` is propagated to ensure :code:`rtcp` can find the installation.
.. * :code:`which rtcp` expands "rtcp" into its full path, preventing the :code:`$PATH` in the login shell from redirecting to a different executable.

.. **Command Line Parameters:**

.. .. list-table::
..    :widths: 15 85
..    :stub-columns: 1

..    * - :code:`-p`
..      - Enable profiling. This enforces:

..        - Sequential kernel execution
..        - One SubbandProc/GPU (no overlap between kernels and GPU transfers)
..        - Performance statistics collection

.. Configuration
.. =============

.. The sections below describe key parset parameters that configure the input/output of COBALT:

.. * Mapping of stations to antenna fields
.. * Selection of COBALT hardware to use
.. * Configuration of each antenna field
.. * Configuration of output

Running a multi-node Observation
--------------------------------

.. _running-multi-node:

`WIP`

Common Issues
-------------

.. _common-issues:

.. list-table::
   :widths: 1 1 1
   :header-rows: 1

   * - Problem
     - Cause
     - Solution
   * - :code:`mpirun` hangs
     - The host's key fingerprint has probably changed, or has not yet been added to the known hosts file.
     - Log in to the host manually over SSH and accept the key fingerprint. Then run the job again.
   * - `No such file or directory` or related errors
     - Currently, the `runObservation.sh` script is not able to create all required directories in the `LOFARROOT` directory.
     - Create the required directories manually (also see :ref:`Running an Observation <running-single-node>`).
   * - `Cannot open IERSeop97 table`
     - This error is related to `casacore `_.
     - Create a file :code:`~/.casarc` with the following content: :code:`measures.directory: /var/software/casa-measures`
   * - `The MPI_Alloc_mem() function was called before MPI_INIT was invoked.`
     - MPI is not initialized.
     - Ensure that COBALT was built with MPI support, i.e. `USE_MPI` is set to `ON` in the CMake configuration.
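
Two of the fixes in the table above (creating the missing ``LOFARROOT`` subdirectories and adding the casacore measures entry) can be scripted. The following is a minimal sketch only: the function names ``prepare_lofarroot`` and ``ensure_casarc`` are hypothetical helpers and not part of COBALT, and the directory list and the measures path are simply the ones quoted in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with COBALT) for two of the
# "Common Issues" fixes described above.

# Create the LOFARROOT subdirectories listed under "Running an Observation".
# Argument 1: the LOFARROOT directory.
prepare_lofarroot() {
    local root="${1:-}" dir
    if [ -z "$root" ]; then
        echo "usage: prepare_lofarroot <LOFARROOT>" >&2
        return 1
    fi
    for dir in nfs/parset nfs/feedback var/log var/run; do
        mkdir -p "$root/$dir" || return 1
    done
}

# Append a measures.directory entry to a casarc file unless one is
# already present, so repeated calls do not add duplicate lines.
# Argument 1: path to the casarc file; argument 2: measures directory.
ensure_casarc() {
    local casarc="$1" measures_dir="$2"
    if ! grep -q '^measures\.directory:' "$casarc" 2>/dev/null; then
        echo "measures.directory: $measures_dir" >> "$casarc"
    fi
}

# Example invocation (commented out so that sourcing this file has no
# side effects):
#   prepare_lofarroot "$LOFARROOT"
#   ensure_casarc "$HOME/.casarc" /var/software/casa-measures
```

Both helpers are written to be safe to re-run: ``mkdir -p`` does not fail on existing directories, and the casarc entry is only appended when no ``measures.directory`` line exists yet.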