How To’s

Run the simplest MSNoise run ever

This recipe is a kind of “let’s check this data rapidly” workflow:

msnoise db init --tech 1

msnoise config set startdate=2019-01-01
msnoise config set enddate=2019-02-01
msnoise config set overlap=0.5
msnoise config set mov_stack=1,5,10

msnoise scan_archive --path /path/to/archive --recursively
msnoise populate --fromDA
msnoise new_jobs --init

msnoise admin # add 1 filter in the Filter table
# or
msnoise db execute "insert into filters (ref, low, mwcs_low, high, mwcs_high, rms_threshold, mwcs_wlen, mwcs_step, used) values (1, 0.1, 0.1, 1.0, 1.0, 0.0, 12.0, 4.0, 1)"

msnoise compute_cc
msnoise stack -r
msnoise reset STACK
msnoise stack -m
msnoise compute_mwcs
msnoise compute_dtt
msnoise plot dvv

Run MSNoise using lots of cores on an HPC

Avoid Database I/O by using the hpc flag

With MSNoise 1.6, most of the API calls have been cleaned of unnecessary database queries: for example, stack() used to run a SELECT on the database for each call, which is useless since configuration parameters are not supposed to change during the execution of the code. This modification allows running MSNoise on an HPC infrastructure with a remote central MySQL database.

The new configuration parameter hpc flags whether MSNoise is running in High Performance mode. If set to Y, the jobs processed at each step are marked “D”one when finished, but the next jobtype of the workflow is not created automatically. This removes a lot of SELECT/UPDATE/INSERT actions on the database and makes the whole process much faster (one INSERT instead of tons of SELECT/UPDATE/INSERT).
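
Switching to this mode is a normal configuration change:

msnoise config set hpc=Y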

Commands and actions with hpc = N:

  • msnoise new_jobs: creates the CC jobs

  • msnoise compute_cc: processes the CC jobs and creates the STACK jobs

  • msnoise stack -m: processes the STACK jobs and creates the MWCS jobs

  • etc…

Commands and actions with hpc = Y:

  • msnoise new_jobs: creates the CC jobs

  • msnoise compute_cc: processes the CC jobs

  • msnoise new_jobs --hpc CC:STACK: creates the STACK jobs based on the CC jobs marked “D”one

  • msnoise stack -m: processes the STACK jobs

  • msnoise new_jobs --hpc STACK:MWCS: creates the MWCS jobs based on the STACK jobs marked “D”one

  • etc…

Set up the HPC

To avoid having to rewrite MSNoise around MPI or other parallel computing tools, I decided to go “simple”, and this actually works. The only limitation of the following is that you need a strong MySQL server machine that accepts hundreds or thousands of connections. In my case, the MySQL server is running on a computing blade, and its my.cnf is configured to allow 1000 users/connections and to listen on all its IPs.
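
For reference, the relevant my.cnf lines look roughly like this (the file location and exact section layout depend on your MySQL version and distribution, so treat this as a sketch):

[mysqld]
max_connections = 1000     # accept many simultaneous MSNoise workers
bind-address    = 0.0.0.0  # listen on all interfaces, not only on localhost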

The easiest set up (maybe not your sysadmin’s preferred one, please check) is the following (a command sketch is given after the list):

  • install miniconda in your home directory and make miniconda’s python executable your default python (I add the paths to .profile).

  • Then install the requirements and finally MSNoise.

  • As usual, create a project folder and run msnoise db init there; choose MySQL and provide the hostname of the machine running the MySQL server.
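
A sketch of those steps (installer URL, install prefix and package list are only illustrative, adapt them to the official installation instructions):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.profile
source ~/.profile

conda install -c conda-forge numpy scipy pandas matplotlib sqlalchemy obspy flask flask-admin markdown folium pymysql
pip install msnoise

mkdir ~/msnoise_project && cd ~/msnoise_project
msnoise db init    # choose MySQL, give the hostname of the MySQL server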

At that point, your project is ready. I usually request an interactive node on the HPC for running msnoise populate and msnoise scan_archive. Our job scheduler is PBS, so this command

qsub -I -l walltime=02:00:00 -l select=1:ncpus=16:mem=1g

requests an interactive node with 16 CPUs and 1 GB of RAM for 2 hours. Once connected, check that the python version is correct (or source .profile again). Because we requested 16 cores, we can run msnoise -t 16 scan_archive --init.
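
In practice, the session on the interactive node then looks like this (the project path is a placeholder):

source ~/.profile
cd /path/to/project
msnoise -t 16 scan_archive --init
msnoise populate --fromDA
msnoise new_jobs --init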

Depending on the server configuration, you may be able to run msnoise admin on the login node and access it via hostname:5000 in your browser. If not, the easiest way to set up the config is to run msnoise config set <parameter>=<value> from the console. To add filters, use one of the following:

  • in the Admin

  • using MySQL Workbench connected to your MySQL server

  • using a command such as msnoise db execute "insert into filters (ref, low, mwcs_low, high, mwcs_high, rms_threshold, mwcs_wlen, mwcs_step, used) values (1, 0.1, 0.1, 1.0, 1.0, 0.0, 12.0, 4.0, 1)"

  • using msnoise db dump, editing the filters table in CSV format, then running msnoise db import filters --force

Once done, the project is set up and should run. Again, test that all goes well on an interactive node.
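
A quick sanity check from that node is, for example:

msnoise info       # prints general information about the project (and thus exercises the database connection)
msnoise info -j    # shows the job counts per jobtype and flag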

To run on N cores in parallel, we have the advantage that, e.g. for the CC jobs, the day-jobs are independent. We can thus request an “Array” of single cores, which is usually quite easy to get on HPCs (most users run heavily parallel codes and request large numbers of “connected” cores, while we can run “shared”).

The job file in my PBS case looks like this for computing the CC:

#!/bin/bash
#PBS -N MSNoise_PDF_CC
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=1:mem=1g
#PBS -l place=shared
#PBS -J 1-400
cd /scratch-a/thomas/2019_PDF
source /space/hpc-home/thomas/.profile
msnoise compute_cc

This requests an array of 400 single-core jobs, each with 1 GB of RAM. My .profile file contains:

# added by Miniconda3 installer
export PATH="/home/thomas/miniconda3/bin:$PATH"
export MPLBACKEND="Agg"

The last line is important: compute nodes are usually “head-less”, and matplotlib (and packages relying on it) would fail if they expect a GUI-capable system.

To submit this job, run qsub qc.job. PBS usually routes stdout and stderr to files in the current directory; make sure to check them if jobs seem to have failed. If all goes well, calling msnoise info -j repeatedly from the login or interactive node’s console should show the evolution of the Todo, In Progress and Done jobs.
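
Instead of re-typing the command, a standard watch loop works just as well (the 60-second interval is arbitrary):

watch -n 60 msnoise info -j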

Note

HPC experts are welcome to suggest, comment, etc… It’s a quick’n’dirty solution, but it works for me!

Reprocess data

When starting to use MSNoise, one will most probably need to re-run different parts of the workflow more than once. By default, MSNoise is designed to only process “what’s new”, which is the opposite of what is wanted here. The following cases cover most of the re-run scenarios:

When adding a new filter

If new filters are added to the filters list in the Configurator, all CC jobs have to be reprocessed, but not for the filters that already exist. The recipe is:

  • Add a new filter, be sure to mark ‘used’=1

  • Set all other filters ‘used’ value to 0

  • Reset the flag of the CC jobs from ‘D’one to ‘T’odo: run msnoise reset CC --all

  • Run msnoise compute_cc

  • Run next commands if needed (stack, mwcs, dtt)

  • Set back the other filters ‘used’ value to 1

The compute_cc step will only compute the CCs for the new filter(s) and output the results in the STACKS/ folder, in a sub-folder named after the zero-padded filter ID. For example: STACKS/01 for ‘filter id’=1, STACKS/02 for ‘filter id’=2, etc.

When changing the REF

When changing the REF window (ref_begin and ref_end), the REF stack has to be re-computed. The configuration change itself is a normal config set call, for example (the dates shown are purely illustrative):
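
msnoise config set ref_begin=2019-01-01
msnoise config set ref_end=2019-06-30

Then reset and re-run the REF stack: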

msnoise reset STACK --all
msnoise stack -r

The REF will then be re-output, and you should probably also reset the MWCS jobs to recompute the daily correlations against this new REF:

msnoise reset MWCS --all
msnoise compute_mwcs

When changing the MWCS parameters

If the MWCS parameters are changed in the database, all MWCS jobs need to be reprocessed:

msnoise reset MWCS --all
msnoise compute_mwcs

should do the trick.

When changing the dt/t parameters

msnoise reset DTT --all
msnoise compute_dtt
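
The parameter change itself is done beforehand in the admin or from the console, for example (the parameter name and value below are just one illustration; check the Config table for the full list of dtt_* settings):

msnoise config set dtt_minlag=10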

Recompute only specific days

If you want to recompute CC jobs only after a certain date, for whatever reason:

msnoise reset CC --rule="day>='2019-01-01'"

SQL experts can also use the msnoise db execute command (with caution!):

msnoise db execute "update jobs set flag='T' where jobtype='CC' and day>='2019-01-01'"

If you want to only reprocess one day:

msnoise reset CC --rule="day='2019-01-15'"

Define your own data structure for the waveform archive

The data_structure.py file contains the known data-archive formats. If another data format needs to be defined, this is done in a custom.py file placed in the current project folder. As an illustration only, such a file could look like the following sketch (the exact variable name MSNoise expects should be checked against data_structure.py and the scan_archive code):
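
# custom.py -- hypothetical sketch, placed in the MSNoise project folder.
# The fields (YEAR, DAY, NET, STA, LOC, CHAN, TYPE) follow the convention of the
# built-in templates in data_structure.py; adapt the template to your own
# folder/file naming and check the variable name expected by scan_archive.
data_structure = "YEAR/NET/STA/NET.STA.LOC.CHAN.YEAR.DAY"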

See also

Check the “Populate Station Table” step in the Workflow documentation.

How to have MSNoise work with 2+ data structures at the same time

In this case, the easiest solution is to scan the archive(s) with the “Lazy Mode”:

msnoise scan_archive --path /path/to/archive1/ --recursively
msnoise scan_archive --path /path/to/archive2/ --recursively

etc.

Remember to either manually fill in the station table, or run:

msnoise populate --fromDA

How to duplicate/dump the MSNoise configuration

To export all tables of the current database, run

msnoise db dump

This will create as many CSV files as there are tables in the database.

Then, in a new location, init a new MSNoise project and import the tables one by one:

msnoise db init
msnoise db import config --force
msnoise db import stations --force
msnoise db import filters --force
msnoise db import data_availability --force
msnoise db import jobs --force

Testing the Dependencies

Once installed, you should be able to import the python packages in a python console. MSNoise comes with a little script called bugreport.py that can be useful for checking whether you have all the required packages (+ some extras).

The usage is as follows:

$ msnoise bugreport -h

usage: msnoise bugreport [-h] [-s] [-m] [-e] [-a]

Helps determining what didn't work

optional arguments:
  -h, --help     show this help message and exit
  -s, --sys      Outputs System info
  -m, --modules  Outputs Python Modules Presence/Version
  -e, --env      Outputs System Environment Variables
  -a, --all      Outputs all of the above

On my Windows machine, the execution of

$ msnoise bugreport -s -m

results in:

************* Computer Report *************

----------------+SYSTEM+-------------------
Windows
PC1577-as
10
10.0.17134
AMD64
Intel64 Family 6 Model 158 Stepping 9, GenuineIntel

----------------+PYTHON+-------------------
Python:3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]

This script is at d:\pythonforsource\msnoise_stack\msnoise\msnoise\bugreport.py

---------------+MODULES+-------------------

Required:
[X] setuptools: 41.2.0
[X] numpy: 1.15.4
[X] scipy: 1.3.0
[X] pandas: 0.25.0
[X] matplotlib: 3.1.1
[X] sqlalchemy: 1.3.8
[X] obspy: 1.1.0
[X] click: 7.0
[X] pymysql: 0.9.3
[X] flask: 1.1.1
[X] flask_admin: 1.5.3
[X] markdown: 3.1.1
[X] wtforms: 2.2.1
[X] folium: 0.10.0
[X] jinja2: 2.10.1

Only necessary if you plan to build the doc locally:
[X] sphinx: 2.2.0
[X] sphinx_bootstrap_theme: 0.7.1

Graphical Backends: (at least one is required)
[ ] wx: not found
[ ] pyqt: not found
[ ] PyQt4: not found
[X] PyQt5: present (no version)
[ ] PySide: not found

Not required, just checking:
[X] json: 2.0.9
[X] psutil: 5.6.3
[ ] reportlab: not found
[ ] configobj: not found
[X] pkg_resources: present (no version)
[ ] paramiko: not found
[X] ctypes: 1.1.0
[X] pyparsing: 2.4.2
[X] distutils: 3.7.3
[X] IPython: 7.7.0
[ ] vtk: not found
[ ] enable: not found
[ ] traitsui: not found
[ ] traits: not found
[ ] scikits.samplerate: not found

The [X] marks the presence of the module. In the case above, PyQt4 is missing, but that’s not a problem because PyQt5 is present. The “not required” packages are checked for information only; those packages can be useful for reporting / hacking / rendering the data.