seisflows.preprocess.pyaflowa
The Pyaflowa preprocessing module for waveform gathering, preprocessing and misfit quantification. We use the name ‘Pyaflowa’ to avoid any potential name overlaps with the actual Pyatoa package.
Classes
Pyaflowa Preprocess [Preprocess Base]
Module Contents
- class seisflows.preprocess.pyaflowa.Pyaflowa(min_period=1.0, max_period=10.0, preproc_toggles=None, pyflex_parameters=None, pyadjoint_parameters=None, fix_windows=False, revalidate=False, adj_src_type='cc_traveltime', plot_waveforms=True, preprocess_log_level='DEBUG', export_datasets=True, export_figures=True, export_log_files=True, workdir=os.getcwd(), path_preprocess=None, path_solver=None, path_data=None, path_output=None, obs_data_format='SAC', syn_data_format='ASCII', data_case='data', components=None, start=None, ntask=1, nproc=1, source_prefix=None, **kwargs)
Pyaflowa Preprocess [Preprocess Base]
Preprocessing and misfit quantification using Python’s Adjoint Tomography Operations Assistant (Pyatoa)
Parameters
- type min_period:
float
- param min_period:
Minimum filter corner in unit seconds. Bandpass filter if set with max_period, highpass filter if set without max_period, no filtering if neither min_period nor max_period is set
- type preproc_toggles:
dict
- param preproc_toggles:
A dictionary with keys that represent toggles that allow the User to turn on/off the default preprocessing steps. Corresponding values should be ‘True’ to toggle on, and ‘False’ for off. All toggles are set ‘True’ by default.
- ‘standardize’: resamples and trims time series to match. Turn off if your ‘obs’ and ‘syn’ data are already the same length. See Pyatoa.manager.standardize()
- ‘preprocess’: Detrend, taper, filter and normalize (optional). Turn off if your data are synthetics that do not need filtering, or if your data are already preprocessed
- ‘window’: misfit windowing using PyFlex. Turn off if you want to compute adjoint sources on the entire trace.
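As an illustration, a toggle dictionary that switches windowing off could be overlaid on the defaults as in the sketch below. Only the three key names come from the description above; the helper `resolve_toggles`, the `DEFAULT_TOGGLES` constant and the rejection of unknown keys are assumptions made for this example, not the Pyaflowa implementation.

```python
# Default state described above: every preprocessing step is switched on
DEFAULT_TOGGLES = {"standardize": True, "preprocess": True, "window": True}


def resolve_toggles(preproc_toggles=None):
    """Overlay user-supplied toggles on the defaults (hypothetical helper)."""
    toggles = dict(DEFAULT_TOGGLES)
    for key, value in (preproc_toggles or {}).items():
        if key not in DEFAULT_TOGGLES:
            # Assumption: unknown keys are treated as an error in this sketch
            raise KeyError(f"unknown preprocessing toggle: {key!r}")
        toggles[key] = bool(value)
    return toggles


# Compute adjoint sources on the entire trace by turning windowing off
toggles = resolve_toggles({"window": False})
```

Passing None, or omitting the argument, leaves all three steps enabled.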
- type pyflex_parameters:
dict
- param pyflex_parameters:
overwrite for Pyflex parameters defined in the Pyflex.Config object. Incorrectly defined argument names will raise a TypeError. See Pyflex docs for detailed parameter defs: http://adjtomo.github.io/pyflex/#config-object
- type pyadjoint_parameters:
dict
- param pyadjoint_parameters:
overwrite for Pyadjoint parameters defined in the Pyadjoint.Config object for the given adj_src_type. Incorrectly defined argument names will raise a TypeError. See Pyadjoint docs for detailed parameter definitions: https://adjtomo.github.io/pyadjoint/
- type fix_windows:
bool or str
- param fix_windows:
How to handle misfit windows at each function evaluation, including options to re-use misfit windows collected during an inversion. Available options: [True, False, ‘ITER’, ‘ONCE’, ‘OFF’]
- True: Re-use windows after first evaluation (i01s00)
- False: Calculate new windows each evaluation
- ‘ITER’: Calculate new windows at first evaluation of each iteration (e.g., i01s00… i02s00…)
- ‘ONCE’: Calculate new windows at first evaluation of the workflow, i.e., at self.par.BEGIN
- type revalidate:
bool
- param revalidate:
Only used if fix_windows is True, ITER or ONCE. Windows that are retrieved from datasets will be revalidated against parameters in the Config object (ccshift, dlna, etc.), and rejected if they fall outside defined bounds. If False, windows will not be revalidated. Caution is advised with this parameter: when revalidate is True, a significant number of windows may be removed, which can artificially reduce the event misfit
- type adj_src_type:
str
- param adj_src_type:
Adjoint source type to evaluate misfit, defined by Pyadjoint. See pyadjoint.config.ADJSRC_TYPES for detailed options list
- ‘waveform’: waveform misfit function
- ‘convolution’: convolution misfit function
- ‘exponentiated_phase’: exponentiated phase from Yuan et al. 2020
- ‘cc_traveltime’: cross-correlation traveltime misfit
- ‘multitaper’: multitaper misfit function
- type plot_waveforms:
bool
- param plot_waveforms:
plot waveform figures and source receiver maps during the preprocessing stage. Maps require metadata, and if they are not provided then only waveforms + windows + adjoint sources will be plotted
- type preprocess_log_level:
str
- param preprocess_log_level:
Log level to set Pyatoa, Pyflex, Pyadjoint. Available: [‘null’: no logging, ‘warning’: warnings only, ‘info’: task tracking, ‘debug’: log all small details (recommended)]
- type export_datasets:
bool
- param export_datasets:
periodically save the output ASDFDataSets which contain data, metadata and results collected during the preprocessing procedure
- type export_figures:
bool
- param export_figures:
periodically save the output basemaps and data-synthetic waveform comparison figures
- type export_log_files:
bool
- param export_log_files:
periodically save log files created by Pyatoa
Paths
- type path_preprocess:
str
- param path_preprocess:
scratch path for preprocessing related steps
- min_period = 1.0
- max_period = 10.0
- fix_windows = False
- revalidate = False
- adj_src_type = 'cc_traveltime'
- plot_waveforms = True
- preprocess_log_level = 'DEBUG'
- pyflex_parameters
- pyadjoint_parameters
- export_datasets = True
- export_figures = True
- export_log_files = True
- path
- syn_data_format = ''
- obs_data_format = ''
- source_prefix = None
- _data_case = ''
- _start = None
- _ntask = 1
- _nproc = 1
- _source_prefix = None
- _syn_acceptable_data_formats = ['ASCII']
- _acceptable_source_prefixes = ['SOURCE', 'FORCESOLUTION', 'CMTSOLUTION']
- _acceptable_fix_windows = ['ITER', 'ONCE', True, False]
- _inv = None
- _config = None
- _fix_windows = False
- check()
Checks Parameter and Path files. Run at the start of a SeisFlows workflow to ensure that things are set appropriately.
- setup()
Sets up data preprocessing machinery by establishing an internally defined directory structure that will be used to store the outputs of the preprocessing workflow
- static ftag(config)
Create a re-usable file tag from the Config object as multiple functions will use this tag for file naming and file discovery.
- Parameters:
config (pyatoa.core.config.Config) – Configuration object that must contain the ‘event_id’, iteration and step count
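A file tag of this kind might be built as in the sketch below. The tag layout (`<event_id>_iNNsNN`, mirroring the i01s00 notation used on this page) is an assumption, and `SimpleNamespace` is only a stand-in for pyatoa.core.config.Config with hypothetical attribute names `event_id`, `iteration` and `step_count`.

```python
from types import SimpleNamespace


def ftag(config):
    """Build a file tag such as 'event001_i01s00' (assumed format)."""
    return f"{config.event_id}_i{config.iteration:02d}s{config.step_count:02d}"


# Stand-in for a Config object carrying only the fields this sketch needs
config = SimpleNamespace(event_id="event001", iteration=1, step_count=0)
tag = ftag(config)
```

Because the tag depends only on the event ID, iteration and step count, any function holding the same Config can regenerate it for file naming and file discovery.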
- quantify_misfit(source_name=None, save_residuals=None, export_residuals=None, save_adjsrcs=None, components=None, iteration=1, step_count=0, _serial=False, **kwargs)
Main processing function to be called by Workflow module. Generates total misfit and adjoint sources for a given event with name source_name.
Note
Meant to be called by workflow.evaluate_objective_function and run on system using system.run() to get access to compute nodes.
- Parameters:
source_name (str) – name of the event to quantify misfit for. If not given, will attempt to gather event id from the given task id which is assigned by system.run()
save_residuals (str) – if not None, path to write misfit/residuals to
export_residuals (str) – export all residuals (data-synthetic misfit) that are generated by the external solver to path_output. If False, residuals stored in scratch may be discarded at any time in the workflow
save_adjsrcs (str) – if not None, path to write adjoint sources to
components (list) – optional list of components to process, e.g., [‘Z’, ‘N’]. Traces whose components are not in this list are skipped during preprocessing, and the adjoint sources for those components will be 0. If None, all available components will be considered.
iteration (int) – current iteration of the workflow, information should be provided by workflow module if we are running an inversion. Defaults to 1 if not given (1st iteration)
step_count (int) – current step count of the line search. Information should be provided by the optimize module if we are running an inversion. Defaults to 0 if not given (1st evaluation)
_serial (bool) – debug function to turn preprocessing to a serial task whereas it is normally a multiprocessed parallel task
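The serial/parallel switch controlled by `_serial` can be sketched with the standard library as follows. This is not the Pyaflowa code: `measure_pair` is a hypothetical placeholder for the per-station measurement, its return values are dummies, and the use of `ProcessPoolExecutor` is an assumption about how the parallel branch could be written.

```python
from concurrent.futures import ProcessPoolExecutor


def measure_pair(pair):
    """Placeholder for one obs/syn measurement; returns (misfit, n_windows)."""
    obs_fid, syn_fid = pair
    return 1.0, 2  # dummy values for illustration only


def quantify(pairs, serial=False, max_workers=None):
    """Sum misfit and window counts over all obs/syn pairs."""
    if serial:
        # Debug path: process pairs one at a time in the current process
        results = [measure_pair(pair) for pair in pairs]
    else:
        # Normal path: distribute pairs over worker processes
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            results = list(executor.map(measure_pair, pairs))
    total_misfit = sum(misfit for misfit, _ in results)
    total_windows = sum(nwin for _, nwin in results)
    return total_misfit, total_windows
```

Running with `serial=True` keeps tracebacks and debugger breakpoints in a single process, which is the usual reason for such a switch.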
- _setup_quantify_misfit(source_name, save_adjsrcs=None, components=None)
Gather a list of filenames of matching waveform IDs that can be run through the misfit quantification step, and generate empty adjoint sources so that Solver knows which components are zero’d out.
- Parameters:
source_name (str) – the name of the source to process
components (list) – optional list of components to process, e.g., [‘Z’, ‘N’]. Traces whose components are not in this list are skipped during preprocessing, and the adjoint sources for those components will be 0. If None, all available components will be considered.
- Return type:
list of tuples
- Returns:
[(observed filename, synthetic filename)]. tuples will contain filenames for matching stations + component for obs and syn
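The matching step can be pictured with the sketch below. The filename convention (`NN.SSS.CCC.*`, with the component taken as the last character of the channel code) and both helper names are assumptions for illustration; the actual file naming depends on the chosen data formats.

```python
import os


def waveform_id(filename):
    """Return (network, station, component) from a 'NN.SSS.CCC.*' filename."""
    net, sta, cha = os.path.basename(filename).split(".")[:3]
    return net, sta, cha[-1].upper()


def match_waveforms(obs_files, syn_files, components=None):
    """Pair observed and synthetic files that share station and component."""
    syn_by_id = {waveform_id(f): f for f in syn_files}
    pairs = []
    for obs in sorted(obs_files):
        key = waveform_id(obs)
        if components is not None and key[2] not in components:
            continue  # component was not requested by the user
        if key in syn_by_id:
            pairs.append((obs, syn_by_id[key]))
    return pairs


obs = ["NZ.BFZ.HHZ.sac", "NZ.BFZ.HHN.sac", "NZ.KNZ.HHZ.sac"]
syn = ["NZ.BFZ.BXZ.semd", "NZ.BFZ.BXN.semd"]
pairs = match_waveforms(obs, syn, components=["Z"])
```

Here only the vertical component of station BFZ has both an observed and a synthetic file, so it is the only pair returned.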
- _instantiate_manager(obs_fid, syn_fid, config)
Convenience function to return a Manager object that is filled with the required data and metadata. This is defined as its own function primarily for debugging purposes, as it allows the User to quickly retrieve an object ready for processing
- _quantify_misfit_single(obs_fid, syn_fid, config, save_adjsrcs=False)
Main Pyatoa processing function to quantify misfit and generate adjoint sources.
Run misfit quantification for a single event-station pair. Gather data, preprocess, window and measure data, save adjoint source if requested, and then return the total misfit and the collected windows for the station.
- Parameters:
obs_fid (str) – filename for the observed waveform to be processed
syn_fid (str) – filename for the synthetic waveform to be processed
config (pyatoa.core.config.Config) – Config object that defines all the processing parameters required by the Pyatoa workflow
save_adjsrcs (str) – path to directory where adjoint sources should be saved. Filenames will be generated automatically by Pyatoa to fit the naming schema required by SPECFEM. If False, no adjoint sources will be saved. They of course can be saved manually later using Pyatoa + PyASDF
- finalize()
Run serial finalization tasks at the end of a given iteration. These tasks are specific to Pyatoa, used to store figures and data in the more permanent output/ directory. Scratch files are deleted during this operation to free up disk space.
- _check_fixed_windows(iteration, step_count)
Determine how to address re-using misfit windows during an inversion workflow. Emit log messages to let the User know whether or not misfit windows will be re-used throughout an inversion.
- True: Always fix windows except for i01s00 because we don’t have any windows for the first function evaluation
- False: Don’t fix windows, always choose a new set of windows
- Iter: Pick windows only on the initial step count (0th) for each iteration. WARNING - does not work well with Thrifty Inversion because the 0th step count is usually skipped
- Once: Pick new windows on the first function evaluation and then fix windows. Useful for when parameters have changed, e.g. filter bounds
- Parameters:
iteration (int) – The current iteration of the SeisFlows3 workflow, within SeisFlows3 this is defined by optimize.iter
step_count (int) – Current line search step count within the SeisFlows3 workflow. Within SeisFlows3 this is defined by optimize.line_search.step_count
- Return type:
tuple (bool or None, str)
- Returns:
(bool on whether to use windows from the previous step or None if fix window turned off, a message that can be sent to the logger)
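The decision rules listed for this method can be sketched as a standalone function. This is an illustrative reading of the description, not the Pyaflowa source: the function name, the `start` argument (standing in for the first iteration of the workflow) and the message wording are all assumptions.

```python
def check_fixed_windows(fix_windows, iteration, step_count, start=1):
    """Return (fix, message) for one function evaluation.

    `fix` is True to re-use windows, False to pick new ones, and None when
    window fixing is turned off. `start` is an assumed stand-in for the
    first iteration of the workflow.
    """
    tag = f"i{iteration:02d}s{step_count:02d}"
    if fix_windows is False:
        return None, f"{tag}: fix_windows is off, picking new windows"
    if fix_windows is True:
        # No windows exist yet at the very first evaluation (i01s00)
        fix = not (iteration == 1 and step_count == 0)
    elif str(fix_windows).upper() == "ITER":
        # New windows only at the 0th step count of each iteration
        fix = step_count != 0
    elif str(fix_windows).upper() == "ONCE":
        # New windows only at the first evaluation of the workflow
        fix = not (iteration == start and step_count == 0)
    else:
        raise ValueError(f"unknown fix_windows option: {fix_windows!r}")
    action = "re-using windows" if fix else "picking new windows"
    return fix, f"{tag}: fix_windows={fix_windows!r}, {action}"
```

For example, `check_fixed_windows("ITER", 2, 0)` selects new windows, while `check_fixed_windows("ITER", 2, 1)` re-uses those picked at the start of the iteration.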
- _config_auxiliary_logger(fid)
Create a log file to track processing of a given source-receiver pair. Because each station is processed asynchronously, we don’t want them to log to the main file at the same time, otherwise we get a random mixing of log messages. Instead we have them log to temporary files, which are combined at the end of the processing script in serial.
- Parameters:
fid (str) – full path and filename for logger that will be configured
- Return type:
logging.Logger
- Returns:
a logger which does NOT log to stdout and only logs to the given file defined by fid
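A file-only logger of this kind can be built with the standard `logging` module, as in the minimal sketch below. The logger naming scheme and the message format are assumptions; only the behaviour (write to `fid`, never to stdout) follows the description above.

```python
import logging


def config_auxiliary_logger(fid):
    """Return a logger that writes only to the file `fid`."""
    logger = logging.getLogger(f"aux.{fid}")
    logger.setLevel(logging.DEBUG)
    # Do not pass records up to parent loggers, which may print to stdout
    logger.propagate = False
    # Drop handlers left over from an earlier call with the same `fid`
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    handler = logging.FileHandler(fid, mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s | %(message)s"))
    logger.addHandler(handler)
    return logger
```

Setting `propagate` to False is what keeps these records out of the main log until the temporary files are combined.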
- _finalize_logging(config, total_windows, total_misfit)
Each source-receiver pair has made its own log file. This function collects these files and writes their content back into the main log. This is a lot of IO but should be okay since the files are small.
Note
This was the most foolproof method for having multiple parallel processes write to the same file. I played around with StringIO buffers and file locks, but they became overly complicated and ultimately did not work how I wanted them to. This function trades filecount and IO overhead for simplicity.
Warning
The assumption here is that the number of source-receiver pairs is manageable (in the thousands). If we start reaching file count limits on the cluster then this method for logging may have to be re-thought. See link for example: https://stackless.readthedocs.io/en/3.7-slp/howto/logging-cookbook.html#using-concurrent-futures-processpoolexecutor
- Parameters:
config (pyatoa.core.config.Config) – Config object that will be queried for iteration, step count and event ID information
total_windows (int) – total number of windows collected for a given source. this will be written to the final log message
total_misfit (float) – total misfit for a given source. this will be written to the final log message
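The collect-and-append approach described for this method can be sketched as follows. The function signature, the `*.log` pattern and the deletion of temporary files after merging are assumptions for this example rather than the behaviour of the actual method.

```python
import glob
import os


def finalize_logging(log_dir, main_log, pattern="*.log"):
    """Append each temporary log in `log_dir` to `main_log`, then delete it."""
    merged = 0
    with open(main_log, "a") as main:
        for fid in sorted(glob.glob(os.path.join(log_dir, pattern))):
            if os.path.abspath(fid) == os.path.abspath(main_log):
                continue  # never merge the main log into itself
            with open(fid) as f:
                main.write(f.read())
            os.remove(fid)  # assumption: temporary files are discarded
            merged += 1
    return merged
```

Because this runs in serial after all workers have finished, the main log receives each pair's messages as one contiguous block.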