aurora.pipelines package

Submodules

aurora.pipelines.fourier_coefficients module

Supporting code for building the FC level of the mth5

Here are the parameters that are defined via the mt_metadata fourier coefficients structures:

  • "anti_alias_filter": "default"
  • "bands"
  • "decimation.factor": 4.0
  • "decimation.level": 2
  • "decimation.method": "default"
  • "decimation.sample_rate": 0.0625
  • "extra_pre_fft_detrend_type": "linear"
  • "prewhitening_type": "first difference"
  • "window.clock_zero_type": "ignore"
  • "window.num_samples": 128
  • "window.overlap": 32
  • "window.type": "boxcar"
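As a point of reference, these parameters map onto attributes of the mt_metadata object and can be set programmatically. A minimal sketch follows; the attribute names mirror the dotted keys above and the import path follows the FCDecimation reference later on this page, so treat it as an assumption rather than verified API:

from mt_metadata.transfer_functions.processing.fourier_coefficients import Decimation

fc_dec = Decimation()                          # a.k.a. FCDecimation
fc_dec.decimation.factor = 4.0
fc_dec.decimation.level = 2
fc_dec.decimation.sample_rate = 0.0625
fc_dec.prewhitening_type = "first difference"
fc_dec.window.type = "boxcar"
fc_dec.window.num_samples = 128
fc_dec.window.overlap = 32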

Creating the decimations config requires a decision about decimation factors and the number of levels. We have been getting this from the EMTF band setup file by default. It is desirable to continue supporting this; note, however, that the EMTF band setup file is really about processing, and not about making STFTs.

For the record, here is the legacy decimation config from EMTF, a.k.a. decset.cfg:

4     0      # of decimation level, & decimation offset
128  32.   1   0   0   7   4   32   1 1.0
128  32.   4   0   0   7   4   32   4 .2154  .1911   .1307   .0705
128  32.   4   0   0   7   4   32   4 .2154  .1911   .1307   .0705
128  32.   4   0   0   7   4   32   4 .2154  .1911   .1307   .0705

This essentially corresponds to a "Decimations Group", which is a list of decimations. Related to the generation of FCs is the ARMA prewhitening (Issue #60), which was controlled in EMTF with pwset.cfg:

4    5       # of decimation level, # of channels
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3

Note 1: Assumes application of cascading decimation, and that the decimated data will be accessed from the previous decimation level.

Note 2: We can encounter cases where some runs can be decimated and others cannot. We need a way to handle this. For example, a short run may not yield any data from a later decimation level. An attempt to handle this has been made in the TF Kernel by adding an is_valid_dataset column, associated with each run-decimation level pair.

Note 3: This point in the loop marks the interface between _generation_ of the FCs and their _usage_. In future, the code above this comment would be pushed into create_fourier_coefficients() and the code below it would access those FCs and execute compute_transfer_function().

aurora.pipelines.fourier_coefficients.add_fcs_to_mth5(m: MTH5, fc_decimations: list | None = None) None[source]

Add Fourier Coefficient Levels to an existing MTH5.

Notes:

  • This module computes the FCs differently than the legacy aurora pipeline. It uses scipy.signal.spectrogram. There is a test in Aurora to confirm that the results are equivalent if we are not using fancy pre-whitening.

  • Nomenclature: “usssr_grouper” is the output of a group-by on unique {survey, station, sample_rate} tuples.

Parameters:
  • m (MTH5 object) – The mth5 file, open in append mode.

  • fc_decimations (Union[str, None, List]) – Specifies the scheme to use for decimating the time series when building the FC layer. None: use the default (something like four decimation levels, each decimating by 4). String: controlled vocabulary (values are a work in progress) that will allow custom definition of the fc_decimations for some common cases. For example, if you have already stored decimated time series, you may want only the zeroth decimation for each run, because each decimated time series lives under its own run container and will get its own FCs. This is experimental. List: (UNTESTED) the user has thought about the decimations that they want to create and is passing them explicitly. This will probably need to be a dictionary instead, since it would get redefined at each sample rate.
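Example usage, as a minimal sketch; the file name is hypothetical and the file is assumed to already contain time series:

from mth5.mth5 import MTH5
from aurora.pipelines.fourier_coefficients import add_fcs_to_mth5

m = MTH5()
m.open_mth5("example.h5", mode="a")       # must be open in append mode
add_fcs_to_mth5(m, fc_decimations=None)   # None -> use the default decimation scheme
m.close_mth5()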

aurora.pipelines.fourier_coefficients.fc_decimations_creator(initial_sample_rate: float, decimation_factors: list | None = None, max_levels: int | None = 6, time_period: TimePeriod | None = None) list[source]

Creates mt_metadata FCDecimation objects that parameterize Fourier coefficient decimation levels.

Note 1: This does not yet work through the assignment of which bands to keep. Refer to mt_metadata.transfer_functions.processing.Processing.assign_bands() to see how this was done in the past.

Parameters:
  • initial_sample_rate (float) – Sample rate of the “level0” data – usually the sample rate during field acquisition.

  • decimation_factors (list (or other iterable)) – The decimation factors that will be applied at each FC decimation level

  • max_levels (int) – The maximum number of decimation levels to allow

  • time_period

Returns:

fc_decimations – Each element of the list is an object of type mt_metadata.transfer_functions.processing.fourier_coefficients.Decimation (a.k.a. FCDecimation). The order of the list corresponds to the order of the cascading decimation; no decimation levels are omitted. This could be changed in future by using a dict instead of a list, e.g. decimation_factors = dict(zip(np.arange(max_levels), decimation_factors)).

Return type:

list
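Example usage, as a sketch; the sample rate and decimation factors are illustrative only:

from aurora.pipelines.fourier_coefficients import fc_decimations_creator

fc_decimations = fc_decimations_creator(
    initial_sample_rate=1.0,
    decimation_factors=[1, 4, 4, 4],
)
print(len(fc_decimations))   # one FCDecimation object per decimation level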

aurora.pipelines.fourier_coefficients.get_degenerate_fc_decimation(sample_rate: float) list[source]

Makes a default fc_decimation list. WIP. This "degenerate" config will only operate on the first decimation level. It is useful for testing, but could also be used in future if an MTH5 already stored time series in decimation levels as separate runs.

Parameters:

sample_rate (float) – The sample rate associated with the time series to convert to a Spectrogram

Returns:

output – List has only one element which is of type mt_metadata.transfer_functions.processing.fourier_coefficients.Decimation.

Return type:

list

aurora.pipelines.fourier_coefficients.read_back_fcs(m: MTH5 | Path | str, mode='r')[source]

This is mostly a helper function for tests. It was used as a sanity check while debugging the FC files, and also is a good example for how to access the data at each level for each channel.

The Time axis of the FC array will change from level to level, but the frequency axis will stay the same shape (for now – storing all fcs by default)

Parameters:

m (MTH5, pathlib.Path, or str) – An open MTH5 object, or the path to an h5 file from which the FCs will be read

aurora.pipelines.helpers module

Helper functions for processing pipelines. These may be reorganized later into other modules.

aurora.pipelines.helpers.initialize_config(processing_config: Processing | str | Path) Processing[source]

Helper function to return an initialized processing config.

Parameters:

processing_config (Union[Processing, str, pathlib.Path]) – Either an instance of the Processing class or a path to a JSON file in which a Processing object is stored.

Returns:

config – Object that contains the processing parameters

Return type:

mt_metadata.transfer_functions.processing.aurora.Processing
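Example usage, as a sketch; the JSON path is hypothetical, and a Processing instance may be passed instead:

from aurora.pipelines.helpers import initialize_config

config = initialize_config("processing_config.json")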

aurora.pipelines.process_mth5 module

This module contains the main methods used in processing mth5 objects to transfer functions.

The main function is called process_mth5. This function was recently changed to process_mth5_legacy, so that process_mth5 can be repurposed for other TF estimation schemes. The "legacy" version corresponds to aurora default processing.

Notes on process_mth5_legacy: Note 1: process_mth5 assumes application of cascading decimation, and that the decimated data will be accessed from the previous decimation level. This should be revisited. It may make more sense to have a get_decimation_level() interface that provides an option of applying decimation or loading pre-decimated data. This will be addressed via creation of the FC layer inside mth5.

Note 2: We can encounter cases where some runs can be decimated and others cannot. We need a way to handle this. For example, a short run may not yield any data from a later decimation level. An attempt to handle this has been made in the TF Kernel by adding an is_valid_dataset column, associated with each run-decimation level pair.

Note 3: This point in the loop marks the interface between _generation_ of the FCs and their _usage_. In future, the code above this comment would be pushed into create_fourier_coefficients() and the code below it would access those FCs and execute compute_transfer_function(). This would also be an appropriate place to add a feature extraction layer, and to compute weights for the FCs.

aurora.pipelines.process_mth5.append_chunk_to_stfts(stfts: dict, chunk: Dataset, remote: bool) dict[source]

Aggregate one STFT into a larger dictionary that tracks all the STFTs

Parameters:
  • stfts (dict) – has keys “local” and “remote”.

  • chunk (xr.Dataset) – The data to append to the dictionary

  • remote (bool) – If True, append the chunk to stfts[“remote”], else append to stfts[“local”]

Returns:

stfts – Same as input but now has new chunk appended to it.

Return type:

dict

aurora.pipelines.process_mth5.get_spectrogams(tfk, i_dec_level, units='MT')[source]

Given a decimation level id, loads a dictionary of all spectrograms from information in tfk. TODO: Make this a method of TFK.

Parameters:
  • tfk (TransferFunctionKernel) –

  • i_dec_level (integer) – The decimation level of the spectrograms.

  • units (str) – “MT” or “SI”, likely to be deprecated

Returns:

stfts – The short time Fourier transforms for the decimation level, as a dictionary. Keys are "local" and "remote". Values are lists, with one xr.Dataset per run.

Return type:

dict

aurora.pipelines.process_mth5.load_stft_obj_from_mth5(i_dec_level: int, row: Series, run_obj: RunGroup, channels: list | None = None) Dataset[source]

Load stft_obj from mth5 (instead of compute)

Note #1: See note #1 in time_series.frequency_band_helpers.extract_band

Parameters:
  • i_dec_level (int) – The decimation level where the data are stored within the Fourier Coefficient group

  • row (pandas.core.series.Series) – A row of the TFK.dataset_df

  • run_obj (mth5.groups.run.RunGroup) – The original time-domain run associated with the data to load

Returns:

stft_chunk – An STFT from mth5.

Return type:

xr.Dataset

aurora.pipelines.process_mth5.make_stft_objects(processing_config, i_dec_level, run_obj, run_xrds, units='MT')[source]

Operates on a “per-run” basis. Applies STFT to all time series in the input run.

This method could be modified in a multiple station code so that it doesn’t care if the station is “local” or “remote” but rather uses scale factors keyed by station_id (WIP - issue #329)

Parameters:
  • processing_config (mt_metadata.transfer_functions.processing.aurora.Processing) – Metadata about the processing to be applied

  • i_dec_level (int) – The decimation level to process

  • run_obj (mth5.groups.master_station_run_channel.RunGroup) – The run to transform to stft

  • run_xrds (xarray.core.dataset.Dataset) – The data time series from the run to transform

  • units (str) – expects “MT”. May change so that this is the only accepted set of units

  • station_id (str) – To be deprecated, this information is contained in the run_obj as run_obj.station_group.metadata.id

Returns:

stft_obj – Time series of calibrated Fourier coefficients for each channel in the run

Return type:

xarray.core.dataset.Dataset

aurora.pipelines.process_mth5.merge_stfts(stfts: dict, tfk: TransferFunctionKernel)[source]

Applies concatenation along the time axis to multiple arrays of STFTs from different runs. At the TF estimation level we treat all the FCs in one array. This builds the array for both the local and the remote STFTs.

Parameters:
  • stfts (dict) – The dict is keyed by “local” and “remote”. Each value is a list of STFTs (one list for local and one for remote)

  • tfk (TransferFunctionKernel) – Just here to let us know if there is a remote reference to merge or not.

Returns:

local_merged_stft_obj, remote_merged_stft_obj – Both are xr.Datasets

Return type:

Tuple
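Conceptually, the merge is a concatenation of the per-run STFTs along the time axis. A sketch follows; stfts is the input dict described above, and the dimension name "time" is an assumption:

import xarray as xr

local_merged_stft_obj = xr.concat(stfts["local"], dim="time")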

aurora.pipelines.process_mth5.process_mth5(config, tfk_dataset=None, units='MT', show_plot=False, z_file_path=None, return_collection=False, processing_type='legacy')[source]

This is a pass-through method that routes the config and tfk_dataset to MT data processing. It currently only supports legacy aurora processing.

Parameters:
  • config (mt_metadata.transfer_functions.processing.aurora.Processing or path to json) – All processing parameters

  • tfk_dataset (aurora.tf_kernel.dataset.Dataset or None) – Specifies what datasets to process according to config

  • units (string) – “MT” or “SI”. To be deprecated once data have units embedded

  • show_plot (boolean) – Only used for dev

  • z_file_path (string or pathlib.Path) – Target path for a z_file output if desired

  • return_collection (boolean) – If False, return an mt_metadata TF object; if True, return an aurora.transfer_function.transfer_function_collection.TransferFunctionCollection.

  • processing_type (string) – Controlled vocabulary; must be one of ["legacy"]. This is not really supported now, but the idea is that in future the config and tfk_dataset can be passed to another processing method if desired.

Returns:

tf_obj – The transfer function object

Return type:

TransferFunctionCollection or mt_metadata.transfer_functions.TF
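Example usage, as a sketch; building the Processing config and the KernelDataset is not shown, and the z-file path is hypothetical:

from aurora.pipelines.process_mth5 import process_mth5

# config and kernel_dataset are assumed to have been built beforehand
tf_obj = process_mth5(
    config,                       # Processing object or path to its JSON
    tfk_dataset=kernel_dataset,   # the runs to process
    z_file_path="station.zss",    # optional z-file output
)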

aurora.pipelines.process_mth5.process_mth5_legacy(config, tfk_dataset=None, units='MT', show_plot=False, z_file_path=None, return_collection=False)[source]

This is the main method used to transform a processing_config and a kernel_dataset into a transfer function estimate.

Parameters:
  • config (mt_metadata.transfer_functions.processing.aurora.Processing or path to json) – All processing parameters

  • tfk_dataset (aurora.tf_kernel.dataset.Dataset or None) – Specifies what datasets to process according to config

  • units (string) – “MT” or “SI”. To be deprecated once data have units embedded

  • show_plot (boolean) – Only used for dev

  • z_file_path (string or pathlib.Path) – Target path for a z_file output if desired

  • return_collection (boolean) – If False, return an mt_metadata TF object; if True, return an aurora.transfer_function.transfer_function_collection.TransferFunctionCollection.

Returns:

  • tf_collection (TransferFunctionCollection or mt_metadata TF) – The transfer function object

  • tf_cls (mt_metadata.transfer_functions.TF) – TF object

aurora.pipelines.process_mth5.process_tf_decimation_level(config: Processing, i_dec_level: int, local_stft_obj: Dataset, remote_stft_obj: Dataset | None, units='MT')[source]

Processing pipeline for a single decimation_level

This method can be single station or remote based on the processing config.

TODO: Add a check that the processing config sample rates agree with the data sample rates; otherwise raise an Exception. TODO: Add units to local_stft_obj, remote_stft_obj.

Parameters:
  • config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) – Config for a single decimation level

  • i_dec_level (int) – Decimation level id. TODO: could we pack this into the decimation level as an attr?

  • local_stft_obj (xarray.core.dataset.Dataset) – The time series of Fourier coefficients from the local station

  • remote_stft_obj (xarray.core.dataset.Dataset or None) – The time series of Fourier coefficients from the remote station

  • units (str) – one of [“MT”,”SI”]

Returns:

transfer_function_obj – The transfer function values packed into an object

Return type:

aurora.transfer_function.TTFZ.TTFZ

aurora.pipelines.process_mth5.save_fourier_coefficients(dec_level_config, row, run_obj, stft_obj) None[source]

Optionally saves the stft object into the MTH5. Note that the dec_level_config must have its save_fcs attr set to True to actually save the data. WIP

Note #1: Logic for building FC layers: If the processing config decimation_level.save_fcs_type = "h5" and fc_levels_already_exist is False, then open in append mode; otherwise open in read mode. We should support a flag, force_rebuild_fcs, normally False. This flag is only needed when save_fcs_type == "h5". If True, then we open in append mode, regardless of fc_levels_already_exist. The task of setting mode="a" or mode="r" can be handled by tfk (maybe in tfk.validate()).

Parameters:
  • dec_level_config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) – The information about decimation level associated with row, run, stft_obj

  • row (pd.Series) – A row of the TFK.dataset_df

  • run_obj (mth5.groups.run.RunGroup) – The run object associated with the STFTs.

  • stft_obj (xr.Dataset) – The data to pack into the mth5.

aurora.pipelines.process_mth5.triage_issue_289(local_stfts: list, remote_stfts: list)[source]

Takes STFT objects in and returns them after shape-checking and making sure they are the same shape. WIP: Timing Error Workaround; see Aurora Issue #289. Seems associated with getting one fewer sample than expected from the edge of a run.

Returns:

(local_stfts, remote_stfts) – The original input arguments, shape-matched.

Return type:

Tuple

aurora.pipelines.run_summary module

aurora.pipelines.time_series_helpers module

Collection of modules used in processing pipeline that operate on time series

aurora.pipelines.time_series_helpers.apply_prewhitening(decimation_obj, run_xrds_input)[source]

Applies pre-whitening to time series to avoid spectral leakage when FFT is applied.

If “first difference”, may want to consider clipping first and last sample from the differentiated time series.

Parameters:
  • decimation_obj (mt_metadata.transfer_functions.processing.aurora.DecimationLevel) – Information about how the decimation level is to be processed

  • run_xrds_input (xarray.core.dataset.Dataset) – Time series to be pre-whitened

Returns:

run_xrds – pre-whitened time series

Return type:

xarray.core.dataset.Dataset
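For the "first difference" option, the operation amounts to a first difference of each channel along the time axis. The following is a standalone sketch with a toy dataset, not the aurora implementation:

import numpy as np
import xarray as xr

run_xrds = xr.Dataset({"ex": ("time", np.random.randn(1024))})
run_xrds_whitened = run_xrds.diff("time")   # first difference; one sample shorter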

aurora.pipelines.time_series_helpers.apply_recoloring(decimation_obj, stft_obj)[source]

Inverts the pre-whitening operation in frequency domain.

Parameters:
  • decimation_obj (mt_metadata.transfer_functions.processing.fourier_coefficients.decimation.Decimation) – Information about how the decimation level is to be processed

  • stft_obj (xarray.core.dataset.Dataset) – Time series of Fourier coefficients to be recoloured

Returns:

stft_obj – Recolored time series of Fourier coefficients

Return type:

xarray.core.dataset.Dataset
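Recoloring divides each Fourier coefficient by the frequency response of the prewhitening filter. For first-difference prewhitening at sample rate fs, that response is 1 - exp(-2*pi*1j*f/fs). The following is a sketch of the concept only, not the aurora code:

import numpy as np

fs = 1.0                                        # sample rate (assumed)
f = np.fft.rfftfreq(128, d=1.0 / fs)            # harmonics of a 128-sample window
response = 1.0 - np.exp(-2j * np.pi * f / fs)   # first-difference frequency response
# recoloring is then: stft_recolored = stft_obj / response
# note response == 0 at DC, which is how inf/nan can appear in the DC harmonic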

aurora.pipelines.time_series_helpers.calibrate_stft_obj(stft_obj, run_obj, units='MT', channel_scale_factors=None)[source]

Calibrates frequency domain data into MT units.

Development Notes:

The calibration often raises a runtime warning because the DC term of the calibration response is 0. TODO: It would be nice to suppress this, maybe by only calibrating the non-DC terms and directly assigning np.nan to the DC component when the DC response is zero.

Parameters:
  • stft_obj (xarray.core.dataset.Dataset) – Time series of Fourier coefficients to be calibrated

  • run_obj (mth5.groups.master_station_run_channel.RunGroup) – Provides information about filters for calibration

  • units (string) – usually “MT”, contemplating supporting “SI”

  • channel_scale_factors (dict or None) – Keyed by channel; supports a single scalar to apply to that channel's data. Useful for debugging. Should not be used in production and should throw a warning if it is not None.

Returns:

stft_obj – Time series of calibrated Fourier coefficients

Return type:

xarray.core.dataset.Dataset

aurora.pipelines.time_series_helpers.nan_to_mean(xrds: Dataset) Dataset[source]

Set Nan values in xr.Dataset to the mean value per channel.

Parameters:

xrds (xr.Dataset) – Time series data

Returns:

run_xrds – The same as the input time series but NaN values are replaced by the mean of the time series (per channel).

Return type:

xr.Dataset

aurora.pipelines.time_series_helpers.prototype_decimate(config: Decimation, run_xrds: Dataset) Dataset[source]

Basically a wrapper for scipy.signal.decimate. Takes an input time series (as an xarray Dataset) and a Decimation config object, and returns a decimated version of the input time series.

TODO: Consider moving this function into time_series/decimate.py. TODO: Consider replacing the downsampled_time_axis with a rolling mean, or something that takes the average value of the time, not the window start. TODO: Compare outputs with scipy resample_poly, which also has an FIR AAF and appears faster. TODO: Add handling for the case that could occur when the sliced time axis has a different length than the decimated data – see mth5 issue #217 https://github.com/kujaku11/mth5/issues/217

Parameters:
  • config (mt_metadata.transfer_functions.processing.aurora.Decimation) –

  • run_xrds (xr.Dataset) – Originally from mth5.timeseries.run_ts.RunTS.dataset, but possibly decimated multiple times

Returns:

xr_ds – Decimated version of the input run_xrds

Return type:

xr.Dataset
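Conceptually, the wrapper does something like the following standalone sketch (not the aurora implementation):

import numpy as np
import scipy.signal as ssig

fs = 1.0                            # original sample rate (assumed)
t = np.arange(4096) / fs            # time axis
x = np.random.randn(4096)           # one channel of time series
q = 4                               # decimation factor
x_dec = ssig.decimate(x, q)         # anti-alias filter + downsample
t_dec = t[::q]                      # downsampled time axis (window-start convention)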

aurora.pipelines.time_series_helpers.run_ts_to_stft(decimation_obj, run_xrds_orig)[source]

Converts a runts object into a time series of Fourier coefficients. Similar to run_ts_to_stft_scipy, but in this implementation operations on individual windows are possible (for example pre-whitening per time window via ARMA filtering).

Parameters:
  • decimation_obj (mt_metadata.transfer_functions.processing.aurora.DecimationLevel) – Information about how the decimation level is to be processed

  • run_xrds_orig (xarray.core.dataset.Dataset) – Normally extracted from mth5.RunTS

Returns:

stft_obj – Note that the STFT object may have inf/nan in the DC harmonic, introduced by recoloring. This really doesn’t matter since we don’t use the DC harmonic for anything.

Return type:

xarray.core.dataset.Dataset

aurora.pipelines.time_series_helpers.run_ts_to_stft_scipy(decimation_obj, run_xrds_orig)[source]

Converts a runts object into a time series of Fourier coefficients. This method uses scipy.signal.spectrogram.

Parameters:
  • decimation_obj (mt_metadata.transfer_functions.processing.aurora.DecimationLevel) – Information about how the decimation level is to be processed

  • run_xrds_orig (xarray.core.dataset.Dataset) – Time series to be processed

Returns:

stft_obj – Time series of Fourier coefficients

Return type:

xarray.core.dataset.Dataset
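The underlying call is roughly as follows; this standalone sketch uses window parameters that mirror the defaults listed at the top of this page:

import numpy as np
import scipy.signal as ssig

fs = 1.0                            # sample rate (assumed)
x = np.random.randn(4096)           # one channel of time series
f, t, fcs = ssig.spectrogram(
    x,
    fs=fs,
    window="boxcar",
    nperseg=128,
    noverlap=32,
    detrend="linear",
    mode="complex",                 # keep complex Fourier coefficients, not power
)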

aurora.pipelines.time_series_helpers.truncate_to_clock_zero(decimation_obj, run_xrds)[source]

Compute the time interval between the first data sample and the clock zero. Identify the first sample in the xarray time series that corresponds to a window start sample.

Parameters:
  • decimation_obj (mt_metadata.transfer_functions.processing.aurora.DecimationLevel) – Information about how the decimation level is to be processed

  • run_xrds (xarray.core.dataset.Dataset) – normally extracted from mth5.RunTS

Returns:

run_xrds – same as the input time series, but possibly slightly shortened

Return type:

xarray.core.dataset.Dataset

aurora.pipelines.time_series_helpers.validate_sample_rate(run_ts, expected_sample_rate, tol=0.0001)[source]

Check that the sample rate of a run_ts is the expected value, and warn if not.

Parameters:
  • run_ts (mth5.timeseries.run_ts.RunTS) – Time series object with data and metadata.

  • expected_sample_rate (float) – The sample rate the time series is expected to have. Normally taken from the processing config

aurora.pipelines.transfer_function_helpers module

This module contains helper methods that are used during transfer function processing.

Development Notes: Note #1: Repeatedly applying edf_weights seems to have no effect at all. Tested 2024-01-18 and found that test_compare in synthetic passed whether this was commented out or not. TODO: Confirm this is a one-and-done, and add documentation about why this is so.

aurora.pipelines.transfer_function_helpers.apply_weights(X, Y, RR, W, segment=False, dropna=False)[source]

Applies data weights (W) to each of X, Y, RR. If a weight is zero, we set the corresponding data to NaN and optionally drop it.

Parameters:
  • X (xarray.core.dataset.Dataset) –

  • Y (xarray.core.dataset.Dataset) –

  • RR (xarray.core.dataset.Dataset or None) –

  • W (numpy array) – The Weights to apply to the data

  • segment (bool) – If True the weights may need to be reshaped.

  • dropna (bool) – Whether or not to drop zero-weighted data. If true, we drop the nans.

Returns:

X, Y, RR – Same as input but with weights applied and (optionally) nan dropped.

Return type:

tuple

aurora.pipelines.transfer_function_helpers.drop_nans(X: Dataset, Y: Dataset, RR: Dataset | None) tuple[source]

Drops NaN from the input xarrays. Just a helper intended to enhance readability.

Development Notes:

TODO: document the implications of dropna on index of xarray for other weights

Parameters:
  • X (xr.Dataset) – The input data for regression

  • Y (xr.Dataset) – The output data for regression

  • RR (Union[xr.Dataset, None]) – The remote reference data for regression

Returns:

X, Y, RR – Returns the input arguments with NaN dropped from the xarrays.

Return type:

tuple

aurora.pipelines.transfer_function_helpers.get_estimator_class(estimation_engine: str) RegressionEstimator[source]
Parameters:

estimation_engine (str) – One of the keys in the ESTIMATOR_LIBRARY, designates the method that will be used to estimate the transfer function

Returns:

estimator_class – The class that will do the TF estimation

Return type:

aurora.transfer_function.regression.base.RegressionEstimator

aurora.pipelines.transfer_function_helpers.process_transfer_functions(dec_level_config, local_stft_obj, remote_stft_obj, transfer_function_obj, segment_weights=[], channel_weights=None)[source]

This is the main tf_processing method. It is based on TTFestBand.m

Note #1: Although it is advantageous to execute the regression channel-by-channel vs. all-at-once, we need to keep the all-at-once to get residual covariances (see issue #87)

Note #2: Consider placing the segment weight logic in its own module, with the various functions in a dictionary. Possibly all segment weights can be combined (as a product), as in the following pseudocode:

W = ones
for wt_style in segment_weights:
  fcn = wt_functions[wt_style]
  w = fcn(X, Y, RR)
  W *= w
return W

TODO: Consider pushing the nan-handling into the band extraction as a kwarg.

Parameters:
  • dec_level_config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) –

  • local_stft_obj (xarray.core.dataset.Dataset) –

  • remote_stft_obj (xarray.core.dataset.Dataset or None) –

  • transfer_function_obj (aurora.transfer_function.TTFZ.TTFZ) – The transfer function container ready to receive values in this method.

  • segment_weights (numpy array or list of strings) – A 1D array which should be of the same length as the time axis of the STFT objects. If these weights are zero anywhere, we drop that segment across all channels. If it is a list of strings, each string corresponds to a weighting algorithm to be applied, one of ["jackknife_jj84", "multiple_coherence", "simple_coherence"].

  • channel_weights (numpy array or None) –

Returns:

transfer_function_obj

Return type:

aurora.transfer_function.TTFZ.TTFZ

aurora.pipelines.transfer_function_helpers.set_up_iter_control(config)[source]

Initializes an IterControl object based on values in the processing config.

Development Notes: TODO: Review: maybe better to just make this the __init__ method of the IterControl object, iter_control = IterControl(config)

Parameters:

config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) –

Returns:

iter_control – Object with parameters about iteration control in regression

Return type:

aurora.transfer_function.regression.iter_control.IterControl

aurora.pipelines.transfer_function_helpers.stack_fcs(X, Y, RR)[source]

Reshape 2D arrays of frequency and time to 1D.

Notes: When the data for a frequency band are extracted from the Spectrogram, each channel is a 2D array, one axis is time (the time of the window that was FFT-ed) and the other axis is frequency. However if we make no distinction between the harmonics (bins) within a band in regression, then all the FCs for each channel can be put into a 1D array. This method performs that reshaping (ravelling) operation. It is not important how we unravel the FCs but it is important that the same indexing scheme is used for X, Y and RR.

TODO: Consider having this take a list and return a list, rather than X, Y, RR.

TODO: Consider decorating this with @dataset_or_dataarray

Parameters:
  • X (xarray.core.dataset.Dataset) –

  • Y (xarray.core.dataset.Dataset) –

  • RR (xarray.core.dataset.Dataset or None) –

Returns:

X, Y, RR

Return type:

Same as input but with stacked time and frequency dimensions
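In xarray terms, the ravelling can be expressed with stack. The following is a toy sketch; the dimension names "time" and "frequency" and the new name "observation" are assumptions:

import numpy as np
import xarray as xr

X = xr.Dataset({"hx": (("time", "frequency"), np.random.randn(10, 5))})
X_stacked = X.stack(observation=("time", "frequency"))   # 2D -> 1D per channel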

aurora.pipelines.transfer_function_kernel module

This module contains the TransferFunctionKernel class, which is the main object that links the KernelDataset to the Processing configuration.

class aurora.pipelines.transfer_function_kernel.TransferFunctionKernel(dataset=None, config=None)[source]

Bases: object

Attributes:
all_fcs_already_exist

Return True if all FCs needed to process data already exist in the mth5s

config

Returns the processing config object

dataset

returns the KernelDataset object

dataset_df

returns the KernelDataset dataframe

kernel_dataset

returns the KernelDataset object

mth5_objs

Returns self._mth5_objs

processing_config

Returns the processing config object

processing_summary

Returns the processing summary object – creates if it doesn’t yet exist.

processing_type

A description of the processing, will get passed to TF object,

Methods

apply_clock_zero(dec_level_config)

get clock-zero from data if needed

check_if_fcs_already_exist()

Fills out the "fc" column of dataset dataframe with True/False.

export_tf_collection(tf_collection)

Assign transfer_function, residual_covariance, inverse_signal_power, station, survey

get_mth5_file_open_mode()

check the mode of an open mth5 (read, write, append)

initialize_mth5s()

returns a dict of open mth5 objects, keyed by station_id

is_valid_dataset(row, i_dec)

Given a row from the RunSummary, answer the question:

make_processing_summary()

Create the processing summary table.

memory_check()

Checks if a RAM issue should be anticipated.

show_processing_summary([omit_columns])

Prints the processing summary table via logger.

update_dataset_df(i_dec_level)

This function has two different modes.

valid_decimations()

Get the decimation levels that are valid.

validate()

apply all validators

validate_decimation_scheme_and_dataset_compatability([...])

Checks that the decimation_scheme and dataset are compatible.

validate_processing()

Do some Validation checks.

validate_save_fc_settings()

Update save_fc values in the config to be appropriate.

property all_fcs_already_exist: bool

Return True if all FCs needed to process data already exist in the mth5s

apply_clock_zero(dec_level_config)[source]

get clock-zero from data if needed

Parameters:

dec_level_config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) –

Returns:

dec_level_config – The modified DecimationLevel with clock-zero information set.

Return type:

mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel

check_if_fcs_already_exist()[source]

Fills out the “fc” column of dataset dataframe with True/False.

If all FC Levels for a given station-run are already built, mark True otherwise False.

Iterates over the processing summary_df, grouping by unique “Survey-Station-Run”s. (Could also iterate over kernel_dataset.dataframe, to get the groupby).

Note 1: Because decimation is a cascading operation, we avoid the case where some (valid) decimation levels exist in the mth5 FC archive and others do not. The maximum granularity is the "station-run" level. For a given run, either all relevant FCs are in the h5 or we treat it as if none of them are. To support variations at the decimation level, an appropriate way to address this would be to store decimated time series in the archive as well (they would simply be runs with different sample rates, and some extra filters).

Note #2: run_sub_df may have multiple rows, even though the run id is unique. This could happen for example when you have a long run at the local station, but multiple (say two) shorter runs at the reference station. In that case, the processing summary will have a separate row for the intersection of the long run with each of the remote runs. We ignore this for now, selecting only the first element of the run_sub_df, under the assumption that FCs have been created for the entire run, or not at all. This assumption can be relaxed in future by using the time_period attribute of the FC layer. For now, we proceed with the all-or-none logic. That is, if a [‘survey’, ‘station’, ‘run’,] has FCs, assume that the FCs are present for the entire run. We assign the “fc” column of dataset_df to have the same boolean value for all rows of same [‘survey’, ‘station’, ‘run’] .

Returns: None

Modifies self.dataset_df inplace, assigning bools to the “fc” column

property config

Returns the processing config object

property dataset

returns the KernelDataset object

property dataset_df: DataFrame

returns the KernelDataset dataframe

export_tf_collection(tf_collection)[source]

Assign transfer_function, residual_covariance, inverse_signal_power, station, survey

Parameters:

tf_collection (aurora.transfer_function.transfer_function_collection.TransferFunctionCollection) – Contains TF estimates, covariance, and signal power values

Returns:

tf_cls – Transfer function container

Return type:

mt_metadata.transfer_functions.core.TF

get_mth5_file_open_mode() str[source]

check the mode of an open mth5 (read, write, append)

initialize_mth5s()[source]

returns a dict of open mth5 objects, keyed by station_id

A future version of this for multiple station processing may need nested dict with [survey_id][station]

Returns:

mth5_objs – Keyed by station id. local station id : mth5.mth5.MTH5; remote station id : mth5.mth5.MTH5

Return type:

dict

is_valid_dataset(row, i_dec) bool[source]
Given a row from the RunSummary, answer the question:

“Will this decimation level yield a valid dataset?”

Parameters:
  • row (pandas.core.series.Series) – Row of the self._dataset_df (corresponding to a run that will be processed)

  • i_dec (integer) – refers to decimation level

Returns:

is_valid – Whether the (run, decimation_level) pair associated with this row yields a valid dataset

Return type:

Bool

property kernel_dataset

returns the KernelDataset object

make_processing_summary()[source]

Create the processing summary table.

  • Melt the decimation levels over the run summary.

  • Add columns to estimate the number of FFT windows for each row.

Returns:

processing_summary – One row per run-decimation pair

Return type:

pd.DataFrame

memory_check() None[source]

Checks if a RAM issue should be anticipated.

Notes: Requires an estimate of available RAM and an estimate of the dataset size. Available RAM is taken from psutil; the dataset size is the number of samples times the number of bytes per sample. Bits per sample is estimated to be 64 by default (8 bytes).
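A rough version of that estimate, as a sketch; the sample count is hypothetical:

import psutil

num_samples = 10_000_000
bytes_per_sample = 8                               # 64-bit samples
dataset_bytes = num_samples * bytes_per_sample
available_bytes = psutil.virtual_memory().available
ram_fraction = dataset_bytes / available_bytes
print(f"dataset would occupy {ram_fraction:.1%} of available RAM")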

property mth5_objs: dict

Returns self._mth5_objs

property processing_config

Returns the processing config object

property processing_summary

Returns the processing summary object – creates if it doesn’t yet exist.

property processing_type

A description of the processing; will get passed to the TF object and can be used for the Z-file.

Could add a version or a hashtag to this. Could also check dataset_df: if remote.all == False, append "Single Station".

show_processing_summary(omit_columns=('mth5_path', 'channel_scale_factors', 'start', 'end', 'input_channels', 'output_channels', 'num_samples_overlap', 'num_samples_advance', 'run_dataarray'))[source]

Prints the processing summary table via logger.

Parameters:

omit_columns (tuple) – List of columns to omit when showing channel summary (used to keep table small).

update_dataset_df(i_dec_level)[source]

This function has two different modes. The first mode initializes values in the array, and could be placed into TFKDataset.initialize_time_series_data(). The second mode decimates. The function is kept in pipelines because it calls time series operations.

Notes: 1. When assigning xarrays to dataframe cells, df dislikes xr.Dataset, so we convert to DataArray before assignment

Parameters:
  • i_dec_level (int) – decimation level id, indexed from zero

  • config (mt_metadata.transfer_functions.processing.aurora.decimation_level.DecimationLevel) – decimation level config

Returns:

dataset_df – Same df that was input to the function, but now with additional columns

Return type:

pd.DataFrame

valid_decimations()[source]

Get the decimation levels that are valid. This is used when iterating over decimation levels in the processing; we do not want invalid levels to get processed (they will fail).

Returns:

dec_levels – Decimations from the config that are valid.

Return type:

list

validate()[source]

apply all validators

validate_decimation_scheme_and_dataset_compatability(min_num_stft_windows=None)[source]

Checks that the decimation_scheme and dataset are compatible. Marks as invalid any rows that will fail to process based on incompatibility.

Refers to issue #182 (and #103, and possibly #196 and #233). Determine if there exist (one or more) runs that will yield decimated datasets that have too few samples to be passed to the STFT algorithm.

Strategy for handling this: Mark as invalid any rows of the processing summary that do not yield long enough time series to window. This way all other rows, with decimations up to the invalid cases, will still process.

WCGW: If the runs are “chunked” we could encounter a situation where the chunk fails to yield a deep decimation level, yet the level could be produced if the chunk size were larger. In general, handling this seems a bit complicated. We will ignore this for now and assume that the user will select a chunk size that is appropriate to the decimation scheme, i.e. use large chunks for deep decimations.

A general solution: return a log that tells the user, for each run and decimation level, how many STFT windows it yielded. This conjures the notion of (run, decimation_level) pairs.

validate_processing()[source]

Do some Validation checks. WIP.

Things that are validated:

  1. The default estimation engine from the json file is "RME_RR", which is fine (we expect in general to do more RR processing than SS), but if there is only one station (no remote) then RME_RR should be replaced by default with "RME". Also, if there is only one station, set the reference channels to [].

  2. Make sure the local station id is defined (correctly, from the kernel dataset).

validate_save_fc_settings()[source]

Update save_fc values in the config to be appropriate. - If all FCs exist, set save_fc to False.

aurora.pipelines.transfer_function_kernel.mth5_has_fcs(m, survey_id, station_id, run_id, remote, processing_config, **kwargs)[source]

Checks if all needed fc-levels for survey-station-run are present under processing_config

Note #1: At this point in the logic, it is established that there are FCs associated with run_id, and that there are at least as many FC decimation levels as we require per the processing config. The next step is to assert whether it is True that the existing FCs conform to the recipe in the processing config.

kwargs are here as a pass through to the decorator … we pass mode=”r”,”a”,”w”

Parameters:
  • m

  • survey_id

  • station_id

  • run_id

  • dataset_df

aurora.pipelines.transfer_function_kernel.station_obj_from_row(row)[source]

Access the station object. Note: the if/else could be avoided by replacing the text string "none" with a None object in the survey column.

Parameters:

row (pd.Series) – A row of tfk.dataset_df

Return type:

station_obj

Module contents