seisflows.preprocess.pyaflowa

The Pyaflowa preprocessing module for waveform gathering, preprocessing and misfit quantification. We use the name ‘Pyaflowa’ to avoid any potential name overlaps with the actual Pyatoa package.

Classes

Pyaflowa

Pyaflowa Preprocess [Preprocess Base]

Module Contents

class seisflows.preprocess.pyaflowa.Pyaflowa(min_period=1.0, max_period=10.0, preproc_toggles=None, pyflex_parameters=None, pyadjoint_parameters=None, fix_windows=False, revalidate=False, adj_src_type='cc_traveltime', plot_waveforms=True, preprocess_log_level='DEBUG', export_datasets=True, export_figures=True, export_log_files=True, workdir=os.getcwd(), path_preprocess=None, path_solver=None, path_data=None, path_output=None, obs_data_format='SAC', syn_data_format='ASCII', data_case='data', components=None, start=None, ntask=1, nproc=1, source_prefix=None, **kwargs)

Pyaflowa Preprocess [Preprocess Base]

Preprocessing and misfit quantification using Python’s Adjoint Tomography Operations Assistant (Pyatoa)

Parameters

type min_period:

float

param min_period:

Minimum filter corner in units of seconds. Bandpass filter if set with max_period, highpass filter if set without max_period, no filtering if neither min_period nor max_period is set

type preproc_toggles:

dict

param preproc_toggles:

A dictionary with keys that represent toggles that allow the User to turn on/off the default preprocessing steps. Corresponding values should be ‘True’ to toggle on, and ‘False’ for off. All toggles are set ‘True’ by default.

  • ‘standardize’: resamples and trims time series to match. Turn off if your ‘obs’ and ‘syn’ data are already the same length. See Pyatoa.manager.standardize()

  • ‘preprocess’: Detrend, taper, filter and normalize (optional). Turn off if your data are synthetics that do not need filtering, or if your data are already preprocessed

  • ‘window’: misfit windowing using PyFlex. Turn off if you want to compute adjoint sources on the entire trace.
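As an illustration of how a partial toggle dictionary might be combined with the all-True defaults described above, here is a minimal sketch. The helper `resolve_toggles` and the constant `DEFAULT_TOGGLES` are hypothetical names used only for this example; they are not part of the SeisFlows or Pyatoa API.

```python
# Hypothetical helper (not SeisFlows API): merge user toggles over the
# documented defaults, where every preprocessing step is on by default.
DEFAULT_TOGGLES = {"standardize": True, "preprocess": True, "window": True}


def resolve_toggles(preproc_toggles=None):
    """Return a full toggle dict, with user-supplied values taking priority."""
    toggles = dict(DEFAULT_TOGGLES)
    toggles.update(preproc_toggles or {})
    return toggles


# Example: data are already filtered, so only the 'preprocess' step is skipped
toggles = resolve_toggles({"preprocess": False})
print(toggles)
```

Passing only the keys to be changed keeps a parameter file short, since any step left unmentioned stays on.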

type pyflex_parameters:

dict

param pyflex_parameters:

overwrite for Pyflex parameters defined in the Pyflex.Config object. Incorrectly defined argument names will raise a TypeError. See Pyflex docs for detailed parameter defs: http://adjtomo.github.io/pyflex/#config-object

type pyadjoint_parameters:

dict

param pyadjoint_parameters:

overwrite for Pyadjoint parameters defined in the Pyadjoint.Config object for the given adj_src_type. Incorrectly defined argument names will raise a TypeError. See Pyadjoint docs for detailed parameter definitions: https://adjtomo.github.io/pyadjoint/
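The TypeError mentioned for pyflex_parameters and pyadjoint_parameters is the usual Python behaviour when a dictionary is unpacked into a constructor that does not accept one of its keys. The sketch below shows this with a stand-in class; `StandInConfig` and its arguments are assumptions made for illustration and do not reproduce the real Pyflex or Pyadjoint Config signatures.

```python
class StandInConfig:
    """Stand-in for a Config-like class; NOT the real Pyflex/Pyadjoint API."""

    def __init__(self, min_period, max_period, stalta_waterlevel=0.07):
        self.min_period = min_period
        self.max_period = max_period
        self.stalta_waterlevel = stalta_waterlevel


# A correctly named key overwrites the default value
config = StandInConfig(1.0, 10.0, **{"stalta_waterlevel": 0.1})

# A misspelled key is rejected by Python itself with a TypeError
try:
    StandInConfig(1.0, 10.0, **{"stalta_waterlevl": 0.1})
    raised = False
except TypeError as err:
    raised = True
    print(f"rejected: {err}")
```

Because the error is raised at construction time, a typo in a parameter name is caught before any waveform is processed.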

type fix_windows:

bool or str

param fix_windows:

How to handle misfit windows at each function evaluation. Options to re-use misfit windows collected during an inversion, available options: [True, False, ‘ITER’, ‘ONCE’, ‘OFF’]

  • True: Re-use windows after first evaluation (i01s00);

  • False: Calculate new windows each evaluation;

  • ‘ITER’: Calculate new windows at first evaluation of each iteration (e.g., i01s00… i02s00…)

  • ‘ONCE’: Calculate new windows at first evaluation of the workflow, i.e., at self.par.BEGIN

type revalidate:

bool

param revalidate:

Only used if fix_windows is True, ITER or ONCE. Windows that are retrieved from datasets will be revalidated against parameters in the Config object (ccshift, dlna, etc.), and rejected if they fall outside defined bounds. If False, windows will not be revalidated. Caution is advised with this parameter as event misfit may be artificially reduced by removing a significant number of windows, which is possible when revalidate is True

type adj_src_type:

str

param adj_src_type:

Adjoint source type to evaluate misfit, defined by Pyadjoint. See pyadjoint.config.ADJSRC_TYPES for detailed options list

  • ‘waveform’: waveform misfit function

  • ‘convolution’: convolution misfit function

  • ‘exponentiated_phase’: exponentiated phase from Yuan et al. 2020

  • ‘cc_traveltime’: cross-correlation traveltime misfit

  • ‘multitaper’: multitaper misfit function

type plot_waveforms:

bool

param plot_waveforms:

plot waveform figures and source receiver maps during the preprocessing stage. Maps require metadata, and if they are not provided then only waveforms + windows + adjoint sources will be plotted

type preprocess_log_level:

str

param preprocess_log_level:

Log level to set Pyatoa, Pyflex, Pyadjoint. Available: [‘null’: no logging, ‘warning’: warnings only, ‘info’: task tracking, ‘debug’: log all small details (recommended)]

type export_datasets:

bool

param export_datasets:

periodically save the output ASDFDataSets which contain data, metadata and results collected during the preprocessing procedure

type export_figures:

bool

param export_figures:

periodically save the output basemaps and data-synthetic waveform comparison figures

type export_log_files:

bool

param export_log_files:

periodically save log files created by Pyatoa

Paths

type path_preprocess:

str

param path_preprocess:

scratch path for preprocessing related steps

***

min_period = 1.0
max_period = 10.0
fix_windows = False
revalidate = False
adj_src_type = 'cc_traveltime'
plot_waveforms = True
preprocess_log_level = 'DEBUG'
pyflex_parameters
pyadjoint_parameters
export_datasets = True
export_figures = True
export_log_files = True
path
syn_data_format = ''
obs_data_format = ''
source_prefix = None
_data_case = ''
_start = None
_ntask = 1
_nproc = 1
_source_prefix = None
_syn_acceptable_data_formats = ['ASCII']
_acceptable_source_prefixes = ['SOURCE', 'FORCESOLUTION', 'CMTSOLUTION']
_acceptable_fix_windows = ['ITER', 'ONCE', True, False]
_inv = None
_config = None
_fix_windows = False
check()

Checks Parameter and Path files, will be run at the start of a Seisflows workflow to ensure that things are set appropriately.

setup()

Sets up data preprocessing machinery by establishing an internally defined directory structure that will be used to store the outputs of the preprocessing workflow

static ftag(config)

Create a re-usable file tag from the Config object as multiple functions will use this tag for file naming and file discovery.

Parameters:

config (pyatoa.core.config.Config) – Configuration object that must contain the ‘event_id’, iteration and step count
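A minimal sketch of what such a tag could look like, assuming the `i01`/`s00` style seen elsewhere on this page. The exact tag format, and the attribute names `iteration` and `step_count` on the stand-in object, are assumptions made for illustration; consult the source for the format actually used.

```python
from types import SimpleNamespace


def ftag(config):
    """Build a file tag from event ID, iteration and step count (assumed format)."""
    return f"{config.event_id}_i{config.iteration:02d}_s{config.step_count:02d}"


# SimpleNamespace stands in for a pyatoa Config object in this sketch
config = SimpleNamespace(event_id="EVENT001", iteration=1, step_count=0)
tag = ftag(config)
print(tag)
```

Building the tag in one place means that the functions which write files and the functions which later search for them cannot drift apart in their naming.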

quantify_misfit(source_name=None, save_residuals=None, export_residuals=None, save_adjsrcs=None, components=None, iteration=1, step_count=0, _serial=False, **kwargs)

Main processing function to be called by Workflow module. Generates total misfit and adjoint sources for a given event with name source_name.

Note

Meant to be called by workflow.evaluate_objective_function and run on system using system.run() to get access to compute nodes.

Parameters:
  • source_name (str) – name of the event to quantify misfit for. If not given, will attempt to gather event id from the given task id which is assigned by system.run()

  • save_residuals (str) – if not None, path to write misfit/residuals to

  • export_residuals (str) – export all residuals (data-synthetic misfit) that are generated by the external solver to path_output. If False, residuals stored in scratch may be discarded at any time in the workflow

  • save_adjsrcs (str) – if not None, path to write adjoint sources to

  • components (list) – optional list of components to process, e.g., [‘Z’, ‘N’]. Traces whose component is not in this list are skipped during preprocessing, and the adjoint sources for those components will be 0. If None, all available components will be considered.

  • iteration (int) – current iteration of the workflow, information should be provided by workflow module if we are running an inversion. Defaults to 1 if not given (1st iteration)

  • step_count (int) – current step count of the line search. Information should be provided by the optimize module if we are running an inversion. Defaults to 0 if not given (1st evaluation)

  • _serial (bool) – debug function to turn preprocessing to a serial task whereas it is normally a multiprocessed parallel task
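The serial/parallel switch described by `_serial` can be sketched with the standard library as follows. This is an illustration only: `process_pair` is a placeholder that returns dummy values, and `quantify_all` and its arguments are hypothetical names, not the module's internals.

```python
from concurrent.futures import ProcessPoolExecutor


def process_pair(pair):
    """Placeholder for single-pair processing; returns (misfit, n_windows)."""
    obs_fid, syn_fid = pair
    # Dummy values so the sketch runs; real work would read and compare data
    return 0.5, 1


def quantify_all(pairs, nproc=2, serial=False):
    """Process every (obs, syn) pair and return summed misfit and window count."""
    if serial:
        # Serial mode: easier to debug because tracebacks are not hidden
        # inside worker processes
        results = [process_pair(pair) for pair in pairs]
    else:
        with ProcessPoolExecutor(max_workers=nproc) as executor:
            results = list(executor.map(process_pair, pairs))
    total_misfit = sum(misfit for misfit, _ in results)
    total_windows = sum(nwin for _, nwin in results)
    return total_misfit, total_windows


if __name__ == "__main__":
    pairs = [("AA.S001.HHZ.sac", "AA.S001.BXZ.semd"),
             ("AA.S002.HHZ.sac", "AA.S002.BXZ.semd")]
    print(quantify_all(pairs, serial=True))
```

The worker function is defined at module level so that it can be pickled and sent to worker processes when `serial` is False.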

_setup_quantify_misfit(source_name, save_adjsrcs=None, components=None)

Gather a list of filenames of matching waveform IDs that can be run through the misfit quantification step, and generate empty adjoint sources so that Solver knows which components are zero’d out.

Parameters:
  • source_name (str) – the name of the source to process

  • components (list) – optional list of components to process, e.g., [‘Z’, ‘N’]. Traces whose component is not in this list are skipped during preprocessing, and the adjoint sources for those components will be 0. If None, all available components will be considered.

Return type:

list of tuples

Returns:

[(observed filename, synthetic filename)]. tuples will contain filenames for matching stations + component for obs and syn
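The matching step can be sketched as below, covering only the pairing of filenames and not the generation of empty adjoint sources. The sketch assumes that both observed and synthetic files are named `NET.STA.CHA.ext` and that the component is the last letter of the channel code; the function name `match_pairs` and that naming scheme are assumptions made for illustration.

```python
import os


def match_pairs(obs_fids, syn_fids, components=None):
    """Pair obs and syn files that share network, station and component."""

    def key(fid):
        # Assumed naming: NET.STA.CHA.ext, component is last channel letter
        net, sta, cha = os.path.basename(fid).split(".")[:3]
        return net, sta, cha[-1].upper()

    syn_by_key = {key(fid): fid for fid in syn_fids}
    pairs = []
    for obs_fid in sorted(obs_fids):
        k = key(obs_fid)
        # Skip components the user did not ask for
        if components is not None and k[2] not in components:
            continue
        # Keep only observations that have a matching synthetic
        if k in syn_by_key:
            pairs.append((obs_fid, syn_by_key[k]))
    return pairs


obs = ["obs/AA.S001.HHZ.sac", "obs/AA.S001.HHN.sac", "obs/AA.S002.HHZ.sac"]
syn = ["syn/AA.S001.BXZ.semd", "syn/AA.S001.BXN.semd"]
print(match_pairs(obs, syn, components=["Z"]))
```

Matching on the component letter rather than the full channel code lets observed and synthetic data differ in their band and instrument codes.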

_instantiate_manager(obs_fid, syn_fid, config)

Convenience function to return a Manager object that is filled with the required data and metadata. This is defined as its own function primarily for debugging purposes as it allows the User to quickly retrieve an object ready for processing

_quantify_misfit_single(obs_fid, syn_fid, config, save_adjsrcs=False)

Main Pyatoa processing function to quantify misfit and generate adjoint sources.

Run misfit quantification for a single event-station pair. Gather data, preprocess, window and measure data, save the adjoint source if requested, and then return the total misfit and the collected windows for the station.

Parameters:
  • obs_fid (str) – filename for the observed waveform to be processed

  • syn_fid (str) – filename for the synthetic waveform to be processed

  • config (pyatoa.core.config.Config) – Config object that defines all the processing parameters required by the Pyatoa workflow

  • save_adjsrcs (str) – path to directory where adjoint sources should be saved. Filenames will be generated automatically by Pyatoa to fit the naming schema required by SPECFEM. If False, no adjoint sources will be saved. They of course can be saved manually later using Pyatoa + PyASDF

finalize()

Run serial finalization tasks at the end of a given iteration. These tasks are specific to Pyatoa, used to store figures and data in the more permanent output/ directory. Scratch files are deleted during this operation to free up disk space.

_check_fixed_windows(iteration, step_count)

Determine how to handle re-using misfit windows during an inversion workflow. Log messages let the User know whether or not misfit windows will be re-used throughout an inversion.

  • True: Always fix windows except for i01s00 because we don’t have any windows for the first function evaluation

  • False: Don’t fix windows, always choose a new set of windows

  • ‘ITER’: Pick windows only on the initial step count (0th) for each iteration. WARNING - does not work well with Thrifty Inversion because the 0th step count is usually skipped

  • ‘ONCE’: Pick new windows on the first function evaluation and then fix windows. Useful for when parameters have changed, e.g. filter bounds

Parameters:
  • iteration (int) – The current iteration of the SeisFlows3 workflow, within SeisFlows3 this is defined by optimize.iter

  • step_count (int) – Current line search step count within the SeisFlows3 workflow. Within SeisFlows3 this is defined by optimize.line_search.step_count

Return type:

tuple (bool or None, str)

Returns:

(bool on whether to use windows from the previous step or None if fix window turned off, a message that can be sent to the logger)
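The decision rules listed above can be written out as a small standalone function. This is a sketch of the documented behaviour, not the module's code: the function name, the `start` argument (standing in for the first iteration of the workflow) and the message strings are assumptions, and for simplicity it returns False rather than None when window fixing is turned off.

```python
def check_fixed_windows(fix_windows, iteration, step_count, start=1):
    """Return (fix, msg): whether to re-use windows, plus a log message."""
    if isinstance(fix_windows, str):
        option = fix_windows.upper()
        if option == "ITER":
            # New windows only at the 0th step count of each iteration
            fix = step_count != 0
        elif option == "ONCE":
            # New windows only at the first evaluation of the workflow
            fix = not (iteration == start and step_count == 0)
        else:
            raise ValueError(f"unknown fix_windows option: {fix_windows!r}")
    elif fix_windows is True:
        # No windows exist yet for the very first evaluation (i01s00)
        fix = not (iteration == 1 and step_count == 0)
    else:
        fix = False
    msg = "re-using windows" if fix else "selecting new windows"
    return fix, msg


for it, step in [(1, 0), (1, 1), (2, 0)]:
    print(f"i{it:02d}s{step:02d}", check_fixed_windows("ITER", it, step))
```

Returning the message alongside the flag lets the caller decide where to log it, which matters when the caller is a worker process with its own log file.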

_config_auxiliary_logger(fid)

Create a log file to track processing of a given source-receiver pair. Because each station is processed asynchronously, we don’t want them to log to the main file at the same time, otherwise we get a random mixing of log messages. Instead we have them log to temporary files, which are combined at the end of the processing script in serial.

Parameters:

fid (str) – full path and filename for logger that will be configured

Return type:

logging.Logger

Returns:

a logger which does NOT log to stdout and only logs to the given file defined by fid
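A file-only logger of this kind can be built with the standard `logging` module, as in the sketch below. The function name, the `name` argument and the message format are assumptions made for illustration rather than the module's own choices.

```python
import logging


def config_auxiliary_logger(fid, name="aux"):
    """Return a logger that writes only to the file `fid`, never to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Keep records out of the root logger's handlers (e.g., a stdout handler)
    logger.propagate = False
    # Drop handlers left over from a previous call with the same logger name
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    handler = logging.FileHandler(fid, mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s | %(message)s"))
    logger.addHandler(handler)
    return logger
```

Setting `propagate` to False is what prevents the records from also reaching handlers attached higher up the logger hierarchy.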

_finalize_logging(config, total_windows, total_misfit)

Each source-receiver pair has made its own log file. This function collects these files and writes their content back into the main log. This is a lot of IO but should be okay since the files are small.

Note

This was the most foolproof method for having multiple parallel processes write to the same file. I played around with StringIO buffers and file locks, but they became overly complicated and ultimately did not work how I wanted them to. This function trades filecount and IO overhead for simplicity.

Warning

The assumption here is that the number of source-receiver pairs is manageable (in the thousands). If we start reaching file count limits on the cluster then this method for logging may have to be re-thought. See link for example: https://stackless.readthedocs.io/en/3.7-slp/howto/logging-cookbook.html#using-concurrent-futures-processpoolexecutor

Parameters:
  • config (pyatoa.core.config.Config) – Config object that will be queried for iteration, step count and event ID information

  • total_windows (int) – total number of windows collected for a given source. This will be written to the final log message

  • total_misfit (float) – total misfit for a given source. This will be written to the final log message
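The collect-and-append approach can be sketched as follows. The function name, the `*.tmp.log` naming pattern, the summary line format and the deletion of temporary files after merging are all assumptions made for this example.

```python
import glob
import os


def finalize_logging(log_dir, main_log, total_windows, total_misfit):
    """Append each temporary log to `main_log`, then delete the temporary."""
    with open(main_log, "a") as main:
        # Sorted so that the merged log has a reproducible order
        for fid in sorted(glob.glob(os.path.join(log_dir, "*.tmp.log"))):
            with open(fid) as f:
                main.write(f.read())
            os.remove(fid)
        # Final summary line for the source
        main.write(f"total windows: {total_windows}, "
                   f"total misfit: {total_misfit:.3f}\n")
```

Because this runs in serial after all workers have finished, only one process ever writes to the main log and no file locking is needed.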