Source code for fouriertransform.crosstable

'''
This module contains the CrossTable class.
'''

from __future__ import(
	division,
	print_function,
	)

__docformat__ = 'restructuredtext en'
__all__ = ['CrossTable']

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

#import exceptions
from .exceptions import(
	FormulaError,
	LengthError,
	SampleError
	)

#import helper functions
from .crosstable_helper import *


[docs]class CrossTable(object): __doc__ = ''' Class for generating formula tables for FT-ICR MS sample sets, generating summary tables and statistics, and correlating formula abundances with environmental measurements. Parameters ---------- intensities : 2D array-like Either 2D np.ndarray or pd.Dataframe of all formula intensities for all samples. drop_int_above : none, int, or float Tells the method whether to drop formulae that represent greater than some amount of the total intensity -- i.e. fliers. The value of `drop_int_above` is the percent of total intensity (i.e. the relative abundance) value above which peaks will be dropped. Defaults to `None`. drop_high_OC : Boolean If `True`, formulae with O/C ratio above 1.0 will be dropped. Defaults to `True`. drop_high_HC : Boolean If `True`, formulae with H/C ratio above 2.0 will be dropped. Defaults to `True`. formulae : list List of formula names. If `None`, pulls formula names from `intensities` index values. Defaults to `None`. rescale : `None` or str String telling the method how to rescale peak intensities, either 'fraction', 'max_peak', or None. If 'fraction', scales intensities such that the sum for *each sample* is equal to 100. If 'max_peak', scales intensities such that the largest peak *in the entire sample set* is equal to 100. sam_names : `None` or list List of sample names. If `None`, pulls sample names from column names of `intensities`. Defaults to `None`. Warnings -------- UserWarning If `sam_names` = `None` and `intensities` does not contain sample names, then names each sample according to column number. Notes ----- When calculating correlation van Krevelen plots, `CrossTable` scaling should be set to 'fraction'. This way, it is the relative abundance within a given sample that is correlated with environmental parameters. References ---------- EnviroOrg_ software, National High Magnetic Field Laboratory, Tallahassee, FL. Koch and Dittmar (2006), Rapid Comm. Mass. Spec., 20, 926-932. LaRowe and van Cappellen (2011), Geochim. Cosmochim. Ac., 67, 2030-2042. Legendre and Legendre (2012), Numerical Ecology, Elsevier, 2nd Ed. Santi-Temkiv et al. (2013), PLoS One, doi:10.1371/journal.pone.0053550. .. _EnviroOrg: https://nationalmaglab.org/user-facilities/icr/icr-software Examples -------- Generating an arbitrary bare-bones cross table containing made-up formulae:: #import modules import fouriertransform as ft import numpy as np import pandas as pd #generate arbitrary data intensities = np.ones([2,2]) formulae = ['C1H1O1N0S0P0', 'C2H2O1N0S1P0'] sam_names = ['sample_1','sample_2'] #create crosstable ct = ft.CrossTable( intensities, formulae = formulae, rescale = None, sam_names = sam_names) Importing all EnviroOrg files contained within a directory and combining to for a cross table:: #input directory name dir_name = '~/Desktop/EO_directory' #create crosstable ct = ft.CrossTable.from_eo( dir_name, file_names = 'all', #can replace with list of names for subset rescale = 'fraction', drop_int_above = 5, #drops fliers above 5 pct total intensity drop_high_HC = True, #drops peaks with H/C > 2 drop_high_OC = True) #drops peaks with O/C > 1 Generating a summary table for the samples contained within the ``CrossTable`` instance `ct`:: #make the summary table sum_tab = ct.generate_summary() #save as csv sum_tab.to_csv('~/Desktop/summary_table.csv') Plotting the cross table data in van Krevelen space. You can plot either intensities or class for a single sample, the difference between two samples, or correlations between relative intensities across all samples and some other environmental parameter (e.g. bulk carbon isotopes). For all plotting methods, matplotlib scatterplot keyword arguments can be used. Plotting a single sample's intensity. Here, I'm taking the log of intensities to see a wider dynamic range:: #import additional modules import matplotlib.pyplot as plt #extract sample name s = 'sample_1' #make axis fig, ax = plt.subplots(1,1) #make van krevelen plot ax = ct.plot_sample_vk( s, ax = ax, plot_type = 'intensity', log = True, edgecolor = 'w', s = 40, cmap = 'YlGnBu') Plotting compounds by class for a single sample:: #make axis fig, ax = plt.subplots(1,1) #make van krevelen plot ax = ct.plot_sample_vk( s, ax = ax plot_type = 'class', edgecolor = 'w', s = 40) Plotting the difference between two samples. Formulae that are present in 'sample_1' but not present in 'sample_2' will be plotted and color coded according to class:: #extract second sample name s2 = 'sample_2' #make axis fig, ax = plt.subplots(1,1) #make van krevelen plot ax = ct.plot_difference_vk( s, s2, ax = ax, edgecolor = 'w', s = 40) Plotting correlations with some independently measured environmental parameter, here called `env_param`. You can either plot correlations as Pearson or Spearman coefficients, and you can decide what fraction of the total samples a formula must be contained in in order to be considered for correlation. You can also determine the significance cutoff (alpha) and can only consider certain compound classes (i.e. CHO, CHON, CHOS) rather than all compounds. Resulting plot only retains statistically significant formulae. Resulting correlation coefficients and p-values for statistically significant formulae are also stored in 'stats'. An example:: #make axis fig, ax = plt.subplots(1,2) #make van krevelen plot for all formulae ax[0], stats = ct.plot_correlation_vk( env_param, ax = ax[0], corr_type = 'Spearman', f = 1, alpha = 0.05, cmpd_cls = 'all', edgecolor = 'w', s = 40, cmap = 'coolwarm', vmin = -1, vmax = 1) #make van krevelen plot for CHON formulae only ax[1], stats = ct.plot_correlation_vk( env_param, ax = ax[1], corr_type = 'Spearman', f = 1, alpha = 0.05, cmpd_cls = 'CHON', edgecolor = 'w', s = 40, cmap = 'coolwarm', vmin = -1, vmax = 1) Any figure can then be saved to the drive according to:: fig.savefig('~/Desktop/van_krevelen_figure.pdf') **Attributes** AImod : pd.Series Series of modified AI value for each formula, with index as formula composition. Length `nF`. cmpd_cat : pd.Series Series of category of each formula (e.g. aliphatic, condensed aromatic high oxygen content, etc.), with index as formula composition. Length `nF`. cmpd_class : pd.Series Series of class of each formula (i.e. CHO, CHON, CHOS, CHOP, CHONS), with index as formula composition. Length `nF`. cmpd_mass : pd.Series Series of the mass of each formula, with index as formula composition. Length `nF`. formulae : list List of strings containing each molecular formula, in the format: C(1-99)H(1-99)O(1-99)N(0-9)S(0-9)P(0-9). Length `nF`. intensities : pd.DataFrame DataFrame containing the MS intensities of each formula within each sample. Shape [`nF` x `nS`]. nF : int Number of total detected formulae in dataset. NOSC : pd.Series Series of the nominal oxidation state of carbon (NOSC) for each formula, with index as formula composition. Length `nS`. nS : int Number of samples in dataset. sam_names : list List of strings containing each sample name. Length `nS`. ''' def __init__( self, intensities, drop_int_above = None, #new v0.0.3 drop_high_OC = True, #new v0.0.3 drop_high_HC = True, #new v0.0.3 formulae = None, rescale = None, sam_names = None): #check that intensities and formulae are in right format ints_all, forms_all, sams = _check_int( intensities, formulae = formulae, sam_names = sam_names) #new in v0.0.3: #drop high H/C, high O/C, and fliers comps, forms, ints = _drop_forms( forms_all, ints_all, drop_int_above = drop_int_above, drop_high_HC = drop_high_HC, drop_high_OC = drop_high_OC) nF, nS = np.shape(ints) _check_forms(forms) #rescale intensities if necessary if rescale in ['fraction', 'Fraction']: m = ints.sum() ints = 100*ints/m elif rescale in ['max_peak', 'Max_peak', 'Max_Peak']: m = ints.max() ints = ints*100/m elif rescale is not None: raise ValueError( 'Rescale value of %r is not recognized. Must be "fraction",' ' "max_peak", or "None".' % rescale) #generate a chemical composition table #comps = _gen_chem_comp(forms) #store results self.intensities = ints self.sam_names = sams self.formulae = forms self.nF = nF self.nS = nS self._chem_comp = comps #calculate compound mass, category, class, NOSC, and AImod self.cmpd_mass = _calc_mass(self) self.AImod = _calc_AImod(self) self.NOSC = _calc_NOSC(self) self.cmpd_class = _calc_class(self) self.cmpd_cat = _calc_category(self) #define classmethod to generate instance from EnviroOrg output files
[docs] @classmethod def from_eo( cls, dir_path, drop_int_above = None, drop_high_OC = True, drop_high_HC = True, file_names = 'all', rescale = None): ''' Classmethod to generate a ``CrossTable`` instance by inputting EnviroOrg files generated for each sample from a given sample set. Parameters ---------- dir_path : str String containing the (absolute) path pointing to the directory containing files to be imported. drop_int_above : none, int, or float Tells the method whether to drop formulae that represent greater than some amount of the total intensity -- i.e. fliers. The value of `drop_int_above` is the percent of total intensity (i.e. the relative abundance) value above which peaks will be dropped. Defaults to `None`. drop_high_OC : Boolean If `True`, formulae with O/C ratio above 1.0 will be dropped. Defaults to `True`. drop_high_HC : Boolean If `True`, formulae with H/C ratio above 2.0 will be dropped. Defaults to `True`. file_names : str or list Either a list of strings containing the filenames to be imported or the string 'all'. If 'all', method will automatically import all files within the provided directory. Defaults to 'all'. rescale : str or None Tells the method if and how to rescale peak intensities for cross- sample comparisons. Either `None`, 'fraction', or 'max_peak'. If 'fraction', scales each sample such that the sum of intensities for a given sample is equal to unity. If 'max_peak', scales such that the most intense peak detected in each sample is equal to 100 and all other peaks are scaled accordingly. Defaults to `None`. Raises ------ FileError If the directory path does not exist. FileError If the inputted file does not exist or is not in .csv format. ValueError If the value for `rescale` is not 'fraction', 'max_peak', or `None`. ValueError If `drop_int_above` is not `None`, float, or int. Notes ----- This function requires that inputted files are in .csv format. File names will be used as sample names in resulting cross table (omitting the .csv extension), so using shortened filenames is encouraged for brevity. References ---------- EnviroOrg_ software, National High Magnetic Field Laboratory, Tallahassee, FL. .. _EnviroOrg: https://nationalmaglab.org/user-facilities/icr/icr-software ''' #combine all files into a single dataframe intensities = _combine_EO_samples( dir_path, file_names) #return CrossTable instance return cls( intensities, drop_int_above = drop_int_above, drop_high_OC = drop_high_OC, drop_high_HC = drop_high_HC, formulae = None, rescale = rescale, sam_names = None)
#define method for generating a summary table
[docs] def generate_summary(self): ''' Method for generating a summary table of the ``CrossTable`` instance, including relative compound class and category abundances (both by formula number and by MS intensity), total formulae detected, and average m/z (both by formula number and by MS intensity). Returns ------- sum_df : pd.DataFrame Dataframe of the summary statistics, including total formulae, average mass, percent compound class, and percent compound category (all but total formulae calculate both by peak number and by relative intensity) for each sample in the ``CrossTable`` instance. References ---------- Santi-Temkiv et al. (2013), PLoS One, doi:10.1371/journal.pone.0053550. ''' sum_df = _gen_sum_tab(self) return sum_df
#define method for plotting a sample van Krevelen
[docs] def plot_sample_vk( self, sam_name, ax = None, plot_type = 'class', log = True, **kwargs): ''' Method for generating a van Krevelen plot for a given sample within the sample set. Can color-code points either according to compound class (i.e. CHO, CHON, CHOS, CHOP, CHONS) or according to peak intensity. Parameters ---------- sam_name : str String of the sample name to be plotted. Must be contained within the ``CrossTable`` instance. ax : None or plt.axis Axis to plot on. If `None`, creates an axis. plot_type : str Type of plot to make, either 'class' for plotting color-coded compound classes or 'intensity' for plotting peak intensities. log : boolean For intensity plots, tells the method whether or not to perform a log transform on intensities. If there are many low-abundance compounds, log transform will better show their presence. Returns ------- ax : plt.axis Axis containing the van Krevelen plot of interest. Raises ------ SampleError If sam_name is not present within the ``CrossTable`` instance. ValueError If `type` is not 'class' or 'intensity'. See Also -------- plot_difference_vk Function for plotting presence/absense differences between two samples. plot_correlation_vk Function for plotting correlation coefficients for individual chemical formulae relative to a given environmental controlling parameter. ''' #check for errors if sam_name not in self.sam_names: raise SampleError( 'Sample name %r is not contained within the CrossTable!' % sam_name) #make axis if necessary if ax is None: fig, ax = plt.subplots(1,1) #set axis labels and title ax.set_title(sam_name + ' ' + plot_type) ax.set_ylabel('H/C') ax.set_xlabel('O/C') #calculate H/C and O/C for formulae contained within the sample ints = self.intensities[self.intensities[sam_name] > 0][sam_name] ind = ints.index HC = self._chem_comp.loc[ind,'H']/self._chem_comp.loc[ind,'C'] OC = self._chem_comp.loc[ind,'O']/self._chem_comp.loc[ind,'C'] #if plot_type is 'intensities', log if necessary and plot if plot_type in ['intensity','Intensity']: #log transform if necessary if log: c = np.log10(ints) lab = r'$log_{10}$ peak intensity' else: c = ints lab = 'peak intensity' #sort by ascending intensity ind_sort = np.argsort(c) #plot results vk = ax.scatter( OC[ind_sort], HC[ind_sort], c = c[ind_sort].values, **kwargs) #add colorbar cbar = fig.colorbar(vk, label = lab) #if plot type is 'class', extract classes and plot elif plot_type in ['class', 'Class']: #extract all unique classes ccls = self.cmpd_class.unique() classes = self.cmpd_class[ind] #make a list of colors, extending to all possible classes colors = [ [.071, 0, .40], #CHO [.431, .082, .035], #CHON [.769, .455, 0], #CHOS [0, .561, .569], #CHOP [.133, .376, .035], #CHONS ] #loop through and plot for i, cc in enumerate(ccls): #extract indices ci = classes[classes == cc].index #plot ax.scatter( OC[ci], HC[ci], c = colors[i], label = cc, **kwargs) #add legend ax.legend( loc = 'best', framealpha = 1, scatterpoints = 1, markerscale = 2, fontsize = 'small') #raise error if type is not class of intensity else: raise ValueError( 'Plot type %r not recognized, must be "class" or "intensity"' % plot_type) return ax
#define method for plotting a sample van Krevelen
[docs] def plot_difference_vk( self, sam_name1, sam_name2, ax = None, plot_type = 'class', int_denom = 'max', shared = False, **kwargs): ''' Method for generating a van Krevelen plot of the difference between two samples within a given sample set. Can color-code points either according to compound class (i.e. CHO, CHON, CHOS) or according to peak intensity. Parameters ---------- sam_name1 : str String of the name of the sample containing all formulae of interest. (i.e. formulae present in this sample but not in `sam_name2` will be plotted). Must be contained within the ``CrossTable`` instance. sam_name2 : str String of the name of the sample to be subtracted from `sam_name1`. Formulae absent from this sample but present in `sam_name1` will be plotted. Must be contained within the ``CrossTable`` instance. ax : None or plt.axis Axis to plot on. If `None`, creates an axis. plot_type : str Type of plot to make, either 'class' for plotting color-coded compound classes or 'intensity' for plotting peak intensities. If 'intensity', plotted values are the difference in relative intensity between `sam_name1` and `sam_name2`, divided by a normalizing denominator (defined below): x = (r_1 - r_2) / denom i.e. `sam_name1` is the "final" sample and `sam_name2` is the "initila" sample. int_denom : str Normalizing denominator for 'intensity' difference plots, either 'max' or 'sum'. If 'max', divides by the maximum relative intensity in either `sam_name` or `sam_name2`. If 'sum', divides by the sum in both samples. shared : boolean For 'intensity' difference plots, if `True`, only considers formulae that are present in both samples. Defaults to `False`. Returns ------- ax : plt.axis Axis containing the van Krevelen plot of interest. Raises ------ SampleError If sam_name is not present within the ``CrossTable`` instance. See Also -------- plot_correlation_vk Function for plotting correlation coefficients for individual chemical formulae relative to a given environmental controlling parameter. plot_sample_vk Function for plotting a van Krevelen diagram for a single sample, color coded either by peak intensity or by compound class. ''' #check that sample names exist if sam_name1 not in self.sam_names: raise SampleError( 'Sample name %r is not contained within the CrossTable!' % sam_name1) elif sam_name2 not in self.sam_names: raise SampleError( 'Sample name %r is not contained within the CrossTable!' % sam_name2) #make axis if necessary if ax is None: fig, ax = plt.subplots(1,1) #set axis labels and title ax.set_title(sam_name1 + ' - ' + sam_name2 + ' compound classes') ax.set_ylabel('H/C') ax.set_xlabel('O/C') #if plot type is 'class', extract differences and plot by class if plot_type in ['class', 'Class']: #calculate index in sam_name1 but missing from sam_name2 ind = self.intensities[ (self.intensities[sam_name1] > 0) & (self.intensities[sam_name2] == 0) ].index #calculate H/C and O/C for formulae contained within the sample HC = self._chem_comp.loc[ind,'H']/self._chem_comp.loc[ind,'C'] OC = self._chem_comp.loc[ind,'O']/self._chem_comp.loc[ind,'C'] #extract all unique classes ccls = self.cmpd_class.unique() classes = self.cmpd_class[ind] #make a list of colors, extending to all possible classes colors = [ [.071, 0, .40], #CHO [.431, .082, .035], #CHON [.769, .455, 0], #CHOS [0, .561, .569], #CHOP [.133, .376, .035], #CHONS ] #loop through and plot for i, cc in enumerate(ccls): #extract indices ci = classes[classes == cc].index #plot ax.scatter( OC[ci], HC[ci], c = colors[i], label = cc, **kwargs) #add legend ax.legend( loc = 'best', framealpha = 1, scatterpoints = 1, markerscale = 2, fontsize = 'small') #if plot_type is 'intensities', calculate the difference in intensity # and plot elif plot_type in ['intensity','Intensity']: #calculate index in either sample #if shared, only formulae that are in both if shared: ind = self.intensities[ (self.intensities[sam_name1] > 0) & (self.intensities[sam_name2] > 0) ].index else: ind = self.intensities[ (self.intensities[sam_name1] > 0) | (self.intensities[sam_name2] > 0) ].index #calculate H/C and O/C for formulae contained within the sample HC = self._chem_comp.loc[ind,'H']/self._chem_comp.loc[ind,'C'] OC = self._chem_comp.loc[ind,'O']/self._chem_comp.loc[ind,'C'] #calculate the relative intensities within each sample ints = self.intensities.loc[ind, [sam_name1, sam_name2]] if int_denom in ['sum','Sum']: #calculate the normalized difference c = (ints[sam_name1] - ints[sam_name2]) / ints.sum(axis = 1) lab = 'peak intensity difference (sum normalized)' elif int_denom in ['max','Max']: #calculate the normalized difference c = (ints[sam_name1] - ints[sam_name2]) / ints.max(axis = 1) lab = 'peak intensity difference (max normalized)' #raise error if denom type is not max or sum else: raise ValueError( 'int_denom %r not recognized, must be "max" or "sum"' % int_denom) #sort by ascending intensity ind_sort = np.argsort(np.abs(c)) #plot results vk = ax.scatter( OC[ind_sort], HC[ind_sort], c = c[ind_sort].values, **kwargs) #add colorbar cbar = fig.colorbar(vk, label = lab) #raise error if type is not class of intensity else: raise ValueError( 'Plot type %r not recognized, must be "class" or "intensity"' % plot_type) return ax
#define method for plotting a sample van Krevelen
[docs] def plot_correlation_vk( self, env_param, alpha = 0.05, ax = None, cmpd_cls = 'all', corr_type = 'Pearson', f = 1.0, **kwargs): ''' Method for generating a van Krevelen plot to compare individual formula relative abundances against environmental parameters of interest. Can handle peaks that are only present in a subset of samples and can correlate using a range of statistical methods. Parameters ---------- env_param : list, array, or pd.Series List, array, or series of values for a particular environmental variable. If list or array, assumes values are in the same order as sample names in the ``CrossTable`` instance. Length `nS`. If Series with sample names as index, only considers samples present in the index for correlation (i.e. can handle missing samples). Samples with NaN values will also be dropped. alpha : float The significance value to use for retaining statistically significant formulae for plotting, must be between 0 and 1. Defaults to `0.05`. ax : None or plt.axis Axis to plot on. If `None`, creates an axis. Defaults to `None`. cmpd_cls: str String saying which compound class to use for correlations. Must be one of: all, \n CHO, \n CHON, \n CHOS \n Defaults to 'all'. corr_type : str String saying which statistical method to use for correlation. Currently accepts: Pearson, \n Spearman \n Defaults to 'Pearson'. f : float The fraction of total samples in which a formula must be present in order to be considered for correlation, ranging between 0.0 and 1.0. If some samples are missing env_param data, then f is the fraction of retained samples in which a formula must be present. Defaults to 1. Returns ------- ax : plt.axis Axis containing the van Krevelen plot of interest. stats : df.DataFrame Dataframe of resulting r (for Pearson) or rho (for Spearman) values (collectively termed 'corr') and corresponding p-values for all formulae that are included in the plot. Raises ------ TypeError If `env_param` is not list, np.ndarray, or pd.Series. LengthError If `env_param` is not of length `nS` (if list or array only). ValueError If `fraction_present` is not between 0 and 1. LengthError If number of samples is less than 3. ValueError If `corr_type` is not 'Pearson' or 'Spearman'. ValueError If `cmpd_cls` is not 'all', 'CHO', 'CHON', or 'CHOS'. TypeError If `intensities` is not ``pd.DataFrame``. TypeError If `env_param` is not ``pd.Series``. TypeError If `alpha` is not float between 0 and 1. SampleError If sample names in `env_param` and `intensities` do not match. See Also -------- plot_difference_vk Function for plotting presence/absense differences between two samples. plot_sample_vk Function for plotting a van Krevelen diagram for a single sample, color coded either by peak intensity or by compound class. References ---------- Legendre and Legendre (2012), Numerical Ecology, Elsevier, 2nd Ed. ''' #ensure env_param is the right type stype = type(env_param).__name__ l = len(env_param) if stype == 'Series': #drop missing values if they exist env_param.dropna(inplace = True) elif stype in ['list', 'ndarray']: #ensure right length if l != self.nS: raise LengthError( 'when list or array, `env_param` must contain the same' ' number of samples as the ``CrossTable`` object.') env_param = pd.Series(env_param, index = self.sam_names) else: raise TypeError( 'env_param is type %r. Must be list, np.ndarray, or pd.Series' % s) #ensure fraction present between 0 and 1 if type(f) not in [float, int] or f < 0 or f > 1: raise ValueError( 'f must be int or float between 0 and 1!') #ensure cmpd_cls is the right string if cmpd_cls not in ['all','CHO','CHON','CHOS']: raise ValueError( 'cmpd_cls value of %r not recognized. Must be one of "all", ' ' "CHO", "CHON", or "CHOS"' % cmpd_cls) #make axis if necessary if ax is None: fig, ax = plt.subplots(1,1) #set axis labels and title title = 'Env. corr. (' + corr_type + \ ', f = ' + str(f) + \ ', alpha = ' + str(alpha) + \ ', cmpds = ' + cmpd_cls + ')' ax.set_title(title) ax.set_ylabel('H/C') ax.set_xlabel('O/C') #retain only intensities from samples that exist in env_param # i.e. drop missing samples sams = env_param.index ints = self.intensities[sams] nS = len(sams) #extract indices of peaks present in >= f fraction of samples i = ints[(ints > 0).sum(axis = 1) >= f * nS].index #new in v.0.0.4! #only retain compounds within the selected class classes = self.cmpd_class[i] if cmpd_cls is not 'all': ind = classes[classes == cmpd_cls].index else: ind = i #store number of retained formulae nRet = len(ind) #calculate correlations and only keep significantly correlated forms ind_sig, rhos, pvals = _calc_corr( ints.loc[ind,:], env_param, alpha = alpha, corr_type = corr_type) nSig = len(ind_sig) #retain significant formulae and sort by ascending rho c = rhos[ind_sig] lab = corr_type + 'corr. coef.' #new in v.0.0.4! #plots highest ABSOLUTE rho values on top! ind_sort = np.abs(c).argsort() #calculate H/C and O/C HC = self._chem_comp.loc[ind_sig,'H']/self._chem_comp.loc[ind_sig,'C'] OC = self._chem_comp.loc[ind_sig,'O']/self._chem_comp.loc[ind_sig,'C'] #plot results vk = ax.scatter( OC[ind_sort], HC[ind_sort], c = c[ind_sort].values, **kwargs) #add colorbar cbar = ax.figure.colorbar(vk, label = lab) #calculate summary statistics and plot as text mu = rhos[ind_sig].abs().mean() sig = rhos[ind_sig].abs().std() pct_sig = 100*nSig / nRet l1 = r'$n_{sig}$ = %.0f (%.1f %%); ' % (nSig, pct_sig) l2 = r'$\mu_{\| \rho \|}$ = %.2f; ' % mu l3 = r'$\sigma_{\| \rho \|}$ = %.2f' % sig text = l1 + l2 + l3 #make text ax.text(0.05, 0.10, text, transform = ax.transAxes, horizontalalignment = 'left', verticalalignment = 'top') #set ranges ax.set_ylim([0,2]) ax.set_xlim([0,1]) #new in version 0.0.4! #make statistics dataframe stats = pd.DataFrame( index = ind_sig, columns = ['corr','pvals']) stats['corr'] = rhos[ind_sig] stats['pvals'] = pvals[ind_sig] return ax, stats