Tuesday, 13 September 2016

New selection tool in Cluster analysis tab

A new feature just added to the Biodiverse cluster analysis tab is the ability to control the colour of branches on the tree and the cells that contain them.  This is perhaps most useful if you want to colour your Biodiverse plot to match some pre-existing map (and is the reason some users requested it).

[[ UPDATE 21-Aug-2017.  This feature has been renamed as User defined in the GUI, so wherever you see Multiselect below, you will now see User defined.  ]]

In a nutshell, users can now switch to the Multiselect mode using the combo box where the lists are selected (and the default is still Cluster).  Once there they can choose a colour or accept the system generated default, click on a branch and watch the branch and all of its descendants and the associated groups (cells) plot in that colour.

The multiselect mode is turned on by selecting it in the lower left combo box.  


Users can assign colours to any branch in the tree to colour its descendants and the associated groups. In this example the red clade has also had a sub-clade cleared of colour (note the black branches and the highlighted cells that are not coloured).

Once the branch is selected the default colour changes to the next colour in the palette (unless you turn it off using the button to the left of the brush).  Repeatedly clicking on the branch will cycle through the palette, so if you missed the colour then just keep clicking until it goes around.  The palette in use at the moment has nine colours (it is the 9-colour paired palette from http://colorbrewer2.org).

You can also uncolour branches by selecting the brush icon to change to clear mode.  When in this mode, the mouse icon will change to a brush when a branch is hovered over to remind users what will happen when they click.

There is also little need to fear mis-clicks, as users can undo and redo selections.  Simply press the "u" key on the keyboard to undo one click, and repeat to keep going back.  If you over-do it then you can press "r" to redo and reinstate a branch colour.  Note that the redo list is reset as soon as you colour a branch.
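The undo and redo behaviour described above can be modelled with two stacks. The following is a minimal Python sketch for illustration only, not the Biodiverse implementation; the class and method names are hypothetical.

```python
class BranchColourHistory:
    """Hypothetical undo/redo tracker for branch colours (not Biodiverse code)."""

    def __init__(self):
        self.colours = {}     # branch name -> current colour (absent = uncoloured)
        self.undo_stack = []  # entries are (branch, previous colour, new colour)
        self.redo_stack = []

    def colour_branch(self, branch, colour):
        """Colour a branch (or clear it with colour=None) and record the click."""
        previous = self.colours.get(branch)
        self._apply(branch, colour)
        self.undo_stack.append((branch, previous, colour))
        self.redo_stack.clear()  # a new click resets the redo list

    def undo(self):
        """Equivalent of pressing "u": revert the most recent click."""
        if not self.undo_stack:
            return
        branch, previous, new = self.undo_stack.pop()
        self._apply(branch, previous)
        self.redo_stack.append((branch, previous, new))

    def redo(self):
        """Equivalent of pressing "r": reinstate the last undone click."""
        if not self.redo_stack:
            return
        branch, previous, new = self.redo_stack.pop()
        self._apply(branch, new)
        self.undo_stack.append((branch, previous, new))

    def _apply(self, branch, colour):
        if colour is None:
            self.colours.pop(branch, None)
        else:
            self.colours[branch] = colour
```

Clearing the redo stack on every new click is what makes the redo list reset as soon as a branch is coloured.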


The colour selection uses the same colour selector window as for the shapefile overlays and cell outline colours.

The colour selector can be used to specify your own colours.


Unfortunately the eyedropper selector does not work well on Windows, as it can only select colours from open Biodiverse windows.  This is a limitation of the system.  The workaround is to use a colour selector tool to copy the colour specification to the clipboard and then paste it into the Color name box in the selector window.   A list of possible tools is in this superuser.com question (with the caveat that I have not tested any).

You can also type colour names into the Color name box, and the small sample I tested of the colours at these URLs worked (mostly).  DarkGoldenRod or LemonChiffon anyone?
http://www.w3schools.com/colors/colors_names.asp
https://en.wikipedia.org/wiki/X11_color_names


Shawn Laffan
12-Sep-2016


For more details about Biodiverse, see http://purl.org/biodiverse 

For the full list of changes in the 1.99 series (leading to version 2) see https://purl.org/biodiverse/wiki/ReleaseNotes 

To see what else Biodiverse has been used for, see https://purl.org/biodiverse/wiki/PublicationsList


You can also join the Biodiverse-users mailing list at http://groups.google.com/group/Biodiverse-users



Tuesday, 30 August 2016

Easier to use randomisation results

This is an updated version of a previous post.  The key difference is that the significance classification described previously was too confusing, as values could be positive or negative and became more significant as they approached zero.  Instead, Biodiverse now provides relative ranks which can easily be converted to significant/not significant for any alpha cutoff.  This change should not affect many users, as the 1.99_003 release containing it was never announced...

One of the issues users face with the randomisations in Biodiverse is what to do with them once they have been run.

A key point is that the results are stored on the other analysis objects themselves, as extra indices and lists.  The index names themselves are a bit cryptic, but are consistent, and there is a description of what they mean here:  https://purl.org/biodiverse/wiki/AnalysisTypes#where-do-the-randomisation-results-go-and-what-do-they-mean

Even then, it can be difficult working out which of your groups have index scores that are significantly different from the set of randomised results.  This is because the data are plotted as a continuum (which in turn is because it uses the same plotting process as the original index scores).  The first image below is an example of this plotting.

One can easily export the data and work with them in a GIS or stats package, but any tied values need to be factored in for lower tail tests, for example as used in the CANAPE process.

With the next version of Biodiverse this categorisation will become a little easier.  Biodiverse will automatically calculate rank-relative positions that can be easily converted into significance levels.  These are stored in new lists that can be displayed and exported in the same way as any other data.

As an example, imagine you have run a randomisation analysis for a BaseData containing a spatial analysis in which you calculated phylogenetic endemism.  Assume that the randomisation's name is rr (not a good name, but it's convenient to type here), so the spatial analysis will now have three lists you can plot.  The first two are the same as ever:  SPATIAL_RESULTS contains the observed results for each group (cell), and rr>>SPATIAL_RESULTS contains indices to track the randomisations for each index in SPATIAL_RESULTS.  For example, for PE_WE there will be C_PE_WE, Q_PE_WE, T_PE_WE and P_PE_WE collating, respectively, the number of times observed PE_WE was higher than that generated using the randomised data, the number of times observed PE_WE was compared against the scores from the randomised data, the number of times the observed and random scores were tied, and the proportion of iterations in which the observed score was higher than the random scores (P_PE_WE = C_PE_WE / Q_PE_WE).
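As a worked illustration of how these four indices relate, here is a short Python sketch.  The counts are made up for a single group, and the function name is hypothetical; only the index names and the P = C / Q relationship come from the description above.

```python
def proportion_higher(c, q):
    """Return P = C / Q, or None when no comparisons were made."""
    if q == 0:
        return None
    return c / q


# Made-up counts for one group after 999 randomisation iterations
c_pe_we = 975  # times the observed PE_WE was higher than the randomised value
q_pe_we = 999  # times the observed PE_WE was compared
t_pe_we = 3    # times the observed and randomised values were tied

p_pe_we = proportion_higher(c_pe_we, q_pe_we)
print(f"P_PE_WE = {p_pe_we:.4f} (ties: {t_pe_we})")
```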

The new list is rr>>p_rank>>SPATIAL_RESULTS.  This contains a set of results using the same names as the original indices in SPATIAL_RESULTS, but converted to their rank-relative positions.  Importantly, the lower tail ranks take into account any ties in the comparisons, thus simplifying any code that uses these results.  Also, any value that would be considered not significant at alpha=0.05 (one tailed, high or low) is converted to undef (null).  This makes any plots of the results clearer within Biodiverse so one can more easily see which groups would pass a one-tailed high or low test.
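Converting an exported rank to a significance class might look like the Python sketch below.  It assumes the ranks are proportions between 0 and 1, that undef values are read in as None, and that the comparisons are strict; the function name is hypothetical and this is not Biodiverse code.

```python
def classify_rank(p_rank, alpha=0.05, two_tailed=False):
    """Classify a rank-relative position as "high", "low" or "not significant".

    Assumes p_rank is a proportion in [0, 1], or None where Biodiverse stored
    undef.  A two-tailed test splits alpha evenly between the two tails.
    """
    if p_rank is None:
        return "not significant"
    tail = alpha / 2 if two_tailed else alpha
    if p_rank > 1 - tail:
        return "high"
    if p_rank < tail:
        return "low"
    return "not significant"
```

Because values that fail the one-tailed alpha=0.05 test are already stored as undef, stricter cutoffs such as a two-tailed alpha=0.05 or alpha=0.01 can be applied directly to the exported values.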

An example plot is in the second image below.

All the ranks have been combined into the same list to reduce the number of indices and lists generated, and thus use less memory and disk space (an index cannot be simultaneously significantly high and low so there is no overlap).  If you are interested in a one tailed test for high values then ignore the low values, and vice-versa.  The values can be easily separated after exporting the results.
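Separating the two tails after export takes only a few lines.  This is a minimal Python sketch, assuming the exported ranks are proportions between 0 and 1 keyed by group name, with non-significant groups read in as None.  The group names and values are made up.

```python
def split_tails(ranks):
    """Split a combined rank list into high-tail and low-tail dictionaries."""
    high = {g: r for g, r in ranks.items() if r is not None and r > 0.5}
    low = {g: r for g, r in ranks.items() if r is not None and r < 0.5}
    return high, low


# Made-up export: group name -> rank for PE_WE
ranks = {"cell_a": 0.99, "cell_b": None, "cell_c": 0.02, "cell_d": 0.97}
high, low = split_tails(ranks)
print(sorted(high))  # groups to keep for a one-tailed high test
print(sorted(low))   # groups to keep for a one-tailed low test
```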

Currently the results are plotted in the same manner as any data, but there are plans to allow users to overlay the randomisation significance results over the top of the observed results, for example by masking out any non-significant scores for a given threshold.

The current system plots all scores, regardless of whether they pass a threshold or not.  This is useful, but is difficult to interpret when looking for significance against the randomisation.

The new randomisation list contains indices for the rank-relative positions of the observed values against the randomly generated values. These can then be used for one and two tailed significance tests.  The plotting could be improved, e.g. in this case it appears there are only two values, but this is simply due to the colour scaling.  However, it works well enough now for exploration - proper maps can also be made using a GIS or stats package.


To try it out you will need the 1.99_004 release (or later).  It can be accessed from https://github.com/shawnlaffan/biodiverse/wiki/Downloads

This is a new implementation, so any feedback about usability would be very useful.  

Shawn Laffan
29-Aug-2016




Monday, 29 August 2016

New, more efficient file format

Users of Biodiverse will perhaps be familiar with what is called the "native" format for basedata, trees, matrices and projects.  These are the .bds, .bts, .bms and .bps files that are created when you save these objects.

The reality is that the "native" format is just a serialisation format in which all the various parts of the Perl data structures that make up an object (e.g. a tree) are converted to a format that can be written to disk and then re-read at a later date, possibly on another computer.

While the format we have been using (called Storable) is stable and has done a good job over the years, a newer, more efficient format called Sereal is now available.  Version 2 of Biodiverse will use this new format by default.

The main reason for shifting to the Sereal format is efficiency: saving files is faster, and the file sizes are smaller.  See details here: http://blog.booking.com/sereal-a-binary-data-serialization-format.html 

These size and speed improvements will not be very noticeable for small files, but it can all add up when one is working with tens of thousands of groups (e.g. cells) and thousands of labels (e.g. species) across hundreds of spatial and cluster analyses.  A quick experiment with such a data set resulted in a greatly reduced file size (~1.6GB to ~750MB), with the time taken to save to file reducing from 30s to 12s.  The file load times were about the same at ~20s.  (Admittedly this was not a very scientific experiment, but the results were consistent across multiple runs).

What do users need to be aware of?  The main thing is that files created in Biodiverse version 2 will not, by default, be backwards compatible.  This means that Biodiverse version 1.1 or earlier will not be able to open files that version 2 has saved in its default (Sereal) format.  However, the "save as" dialogues have the option to save to the old format so you can maintain compatibility with older versions if you are in a mixed environment.

Also, any file in the old format that is loaded into Biodiverse version 2 will still be saved using the old format unless the user explicitly saves it to the new format.

If you want to test the new file format then it will be available in the 1.99_004 development release which should be coming out within the next week.


Shawn Laffan
29-Aug-2016



Saturday, 6 August 2016

Biodiverse now categorises your randomisation results

[[ 2016-08-30:  This post has been superseded - the categorisation was not sufficiently clear.  See this followup post for how the system works now ]]


One of the issues users face with the randomisations in Biodiverse is what to do with them once they have been run.

One key point is that the results are stored on the other analysis objects themselves, as extra indices and lists.  The index names themselves are a bit cryptic, but are consistent, and there is a description of what they mean here:  https://purl.org/biodiverse/wiki/AnalysisTypes#where-do-the-randomisation-results-go-and-what-do-they-mean

Even then, it can be difficult working out which of your groups have index scores that are significantly different from the randomisation.  This is because the data are plotted as a continuum (which in turn is because it uses the same plotting process as the original index scores).  The first image below is an example of this plotting.

One can easily export the data and work with them in a GIS or stats package, but any tied values need to be factored in for lower tail tests, for example as used in the CANAPE process.

With the next version of Biodiverse this categorisation will become a little easier.  Biodiverse will automatically categorise your randomisation results into significance levels, putting the results into new lists on the objects that can be displayed and exported in the same way as any other data.

As an example, imagine you have run a randomisation analysis for a BaseData containing a spatial analysis in which you calculated phylogenetic endemism.  Assume that the randomisation's name is rr (not a good name, but it's convenient to type here), so the spatial analysis will now have three lists you can plot.  The first two are the same as ever:  SPATIAL_RESULTS contains the observed results for each group (cell), and rr>>SPATIAL_RESULTS contains indices to track the randomisations for each index in SPATIAL_RESULTS.  For example, for PE_WE there will be C_PE_WE, Q_PE_WE, T_PE_WE and P_PE_WE collating, respectively, the number of times observed PE_WE was higher than that generated using the randomised data, the number of times observed PE_WE was compared against the scores from the randomised data, the number of times the observed and random scores were tied, and the proportion of iterations in which the observed score was higher than the random scores (P_PE_WE = C_PE_WE / Q_PE_WE).

The new list is rr>>sig>>SPATIAL_RESULTS.  This contains a set of categorisations for one and two tailed tests for each index found in SPATIAL_RESULTS.  The lower tail tests take into account any ties in the comparisons.  An example plot is in the second image below.

For the PE_WE example, one has SIG_1TAIL_PE_WE and SIG_2TAIL_PE_WE.  SIG_1TAIL_PE_WE is a one-tailed test for higher or lower than expected.  It has a value of 0.01 if it is significantly high at alpha=0.01, 0.05 if high for alpha=0.05, -0.01 if it is significantly low at alpha=0.01, and -0.05 for low at alpha=0.05.  If it is not significant then it has a null (undefined) value.

SIG_2TAIL_PE_WE has the same numbers, but for a two tailed test.  Values of -0.05 and 0.05 are low or high for a two tailed alpha=0.05, i.e. the observed scores are in the outer 5% of the distribution of random scores (<2.5% or >97.5%, respectively), while those with -0.01 or 0.01 are in the outer 1% and significant at alpha=0.01.
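The categorisation rules in the two paragraphs above can be sketched in Python as follows.  This is an illustration only: the function name is hypothetical, and the way ties enter the lower tail test (counting ties together with the "higher" count) is an assumption, since the post only says that ties are taken into account.

```python
def sig_category(c, q, t, two_tailed=False):
    """Return 0.01, 0.05, -0.01, -0.05 or None, following the SIG_* scheme.

    c, q and t are the C_*, Q_* and T_* counts for one index in one group.
    Assumption: ties count against the observed value in the lower tail test.
    """
    if q == 0:
        return None
    p_high = c / q
    p_low = (c + t) / q  # assumed handling of ties for the lower tail
    for alpha in (0.01, 0.05):  # check the stricter level first
        tail = alpha / 2 if two_tailed else alpha
        if p_high > 1 - tail:
            return alpha
        if p_low < tail:
            return -alpha
    return None
```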

The upper and lower one-tailed tests have been combined into the same list to reduce the number of indices and lists generated, and thus use less memory and disk space (an index cannot be both significantly high and low so there is no overlap).  If you are interested in a one tailed test for high values then ignore the low values, and vice-versa.  The values can be easily separated after exporting the results.

Currently the results are plotted in the same manner as any data, but there are plans to allow users to overlay the randomisation significance results over the top of the observed results, for example by masking out any non-significant scores.

The current system plots all scores, regardless of whether they pass a threshold or not.  This is useful, but is difficult to interpret when looking for significance against the randomisation.

The new categorisation filters out any score that is not significant at an alpha level of either 0.05 or 0.01 (here the one-tailed results are plotted, so negative values are significantly low).  The plotting could be improved, but it will work well enough now for exploration - proper maps can also be made using a GIS or stats package.  

This is a new implementation, so any feedback about usability would be very useful.  

Shawn Laffan
06-Aug-2016



Saturday, 7 May 2016

Biodiverse now includes species richness estimation indices

One of the new features in the upcoming version 2 release of Biodiverse will be the ability to calculate indices of species richness estimation, as well as their associated confidence intervals and useful metadata.

If you are impatient and want to try them now then you can use a development release from the 1.99 series.   https://purl.org/biodiverse/wiki/Downloads#development-release

The indices included are Chao1, Chao2, the Abundance Coverage Estimator (ACE) and the Incidence Coverage Estimator (ICE).  Chao1 and ACE are abundance based and use the label sample counts as the abundances, while Chao2 and ICE are incidence based and use the number of occupied groups (cells) by each label in the sample as the incidences.  More detailed explanations of these indices and references are given in the help pages for the EstimateS software.  That site also includes formulae in its appendices B and C.
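To give a feel for what an abundance-based estimator does, here is a minimal Python sketch of the textbook Chao1 estimate, which uses the numbers of singletons (f1) and doubletons (f2).  It is not the Biodiverse code, and the values Biodiverse reports may differ because it is calibrated against SpadeR and EstimateS, which apply corrections not shown here.  The function name and sample data are made up.

```python
def chao1(abundances):
    """Textbook Chao1 richness estimate from per-species sample counts."""
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)                    # observed richness
    f1 = sum(1 for a in counts if a == 1)  # singletons
    f2 = sum(1 for a in counts if a == 2)  # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # No doubletons: use the bias-corrected form to avoid dividing by zero
    return s_obs + f1 * (f1 - 1) / 2


# Made-up sample: three singletons, two doubletons, two more common species
print(chao1([1, 1, 1, 2, 2, 5, 10]))
```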

Four species richness estimation calculations are now in Biodiverse.  

Example results for the Acacia data set used in Mishler et al. (2014).  

Links with SpadeR and EstimateS

The calculations have been calibrated to match the SpadeR package for R, cross-referencing with EstimateS as needed.  For those wondering about reproducible results, a test driven development approach was used.  In this approach, the results from SpadeR for a given input data set were set as tests in Biodiverse and the Biodiverse code was then checked and updated until it reproduced the expected values.  The tests remain in place so we can readily identify if a change in another part of Biodiverse affects these calculations.

There are different formulae for the Chao variance and confidence intervals when the sample is missing either singletons or doubletons (species with only one or two samples/incidences).  In these cases Biodiverse follows the logic given in the EstimateS documentation.  The CHAO1_META and CHAO2_META result lists record which formulae were used for the estimate (CHAO_FORMULA index), the variance (VARIANCE_FORMULA index) and confidence interval formula (CI_FORMULA index), with the numbers corresponding to those given in the EstimateS documentation.

As with EstimateS, Biodiverse falls back to using Chao1 or Chao2 in cases where ACE or ICE, respectively, cannot be calculated.  These cases are broader than in EstimateS and include
  1. Where all rare species are singletons/uniques.
  2. Where none of the species are singletons/uniques.
  3. Where none of the species are rare/infrequent.

Biodiverse returns an undefined value for the ACE and ICE richness estimates when:
  1. There are no species.
  2. All the samples are uniques/singletons.
  3. (For ICE only) There is only one group in the neighbour sets (this avoids a divide by zero error in the sample size correction).

The calculation of ICE in SpadeR includes a correction for the number of sample units (groups).  In Biodiverse this is calculated as the number of non-empty groups.  For most users this is not an issue since they do not have empty groups but, in cases where there are such groups, all the other indices that use labels will ignore them.  This follows the logic that empties are unsampled, as opposed to sampled but empty.
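The fallback and undefined cases listed above for ACE can be summarised as a small decision function.  This Python sketch is an illustration under stated assumptions: the function name is hypothetical, "rare" is taken to mean a count at or below the threshold of 10, and the undefined cases are checked before the fallback cases.  It only reports which path would be taken and does not compute the estimate.

```python
def ace_strategy(abundances, rare_threshold=10):
    """Report which calculation path the ACE rules above would select.

    Returns "undefined", "chao1" (the fallback) or "ace".
    Assumption: a species is rare when its count is <= rare_threshold.
    """
    counts = [a for a in abundances if a > 0]
    if not counts:
        return "undefined"  # no species
    if all(a == 1 for a in counts):
        return "undefined"  # every sample is a singleton
    rare = [a for a in counts if a <= rare_threshold]
    singletons = [a for a in rare if a == 1]
    if not rare or not singletons or len(singletons) == len(rare):
        return "chao1"      # ACE cannot be calculated, fall back to Chao1
    return "ace"
```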

Does having a second neighbour set affect the results? 

In Biodiverse, all the species richness estimation indices are calculated using the union of the two neighbour sets.  The results will therefore be the same whether you have one neighbour set specifying an analysis window (e.g. sp_circle (radius => 500000)) or two neighbour sets that result in the same overall set of groups (e.g. neighbour set 1 is sp_self_only() and neighbour set 2 is sp_circle(radius => 500000)).   (If you are unsure what the term "neighbour set" means in Biodiverse, it is the set of groups used in a calculation.  Usually they are contiguous sets of groups around the processing group, but they can be arbitrarily complex.  More details are given in the spatial conditions help.)
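The equivalence is easy to see with plain sets.  In this Python sketch the group names are made up, the function name is hypothetical, and the circular window is assumed to include the processing group itself.

```python
def neighbour_union(set1, set2=frozenset()):
    """Return the groups used in the calculation: the union of both sets."""
    return set(set1) | set(set2)


processing_group = "g1"
# Made-up groups falling inside the circular window around g1 (g1 included)
circle = {"g1", "g2", "g3", "g4"}

one_set = neighbour_union(circle)                       # window only
two_sets = neighbour_union({processing_group}, circle)  # self only, then window
print(one_set == two_sets)
```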

Future changes

The ACE and ICE indices currently use 10 as the threshold for rarity.  In a future version of Biodiverse this will be controllable by the user, but we need to change some of the GUI infrastructure first.  

Lists of which species were rare in the samples can also be returned if there is need.

The improved Chao indices described in Chiu et al. (2014) are also listed for implementation under issue #592.  The Jackknife indices could also be added if there is a need for them.  


I want to try it now

As noted above, these indices will be in the forthcoming version 2 release of Biodiverse, but at the time of writing they are available in the 1.99_002 development release.  https://purl.org/biodiverse/wiki/Downloads#development-release


Shawn Laffan, 07-May-2016


For the full list of changes in the 2.0 release see https://purl.org/biodiverse/wiki/ReleaseNotes#version-2


Tuesday, 19 April 2016

More CANAPE - how to restrict your cluster analysis to groups with significant endemism

Two previous posts described more of the background to the CANAPE analysis and how to do it yourself.

[[  2017-08-10 UPDATE - the second part of the condition used an incorrect index name, so the clustering only applied to the palaeo and some of the mixed cells. ]]


What was not covered in those posts was the final step in the Mishler et al. (2014) paper, in which the cells classified as palaeo-, neo- or mixed-endemism under the CANAPE scheme were clustered to identify groupings with shared branches from the tree.

The aim of this post is to fill that gap.


The approach is conceptually simple.  We just need to constrain the cluster analysis to only use cells that are identified as having significant phylogenetic endemism.  Doing so takes just a few steps, and a bit of typing.


Analyses in Biodiverse can be constrained to work only on a subset of groups (cells) using a definition query (the same as a "where clause" used in database queries).  The way it is currently done is a little more involved than is ideal, largely due to the amount of typing needed, but is not too complex to get working.  (One day we will develop a spatial conditions builder interface to make spatial conditions easier).

So, imagine you have run a spatial analysis called "phylo_end" in which you have told the system to calculate the indices under the  Phylogenetic Endemism and Relative Phylogenetic Endemism, type 2 calculations.  You then ran 999 iterations of a randomisation called CANAPE to assess how often the observed index values were higher than random.  The spatial results object now has an additional list called CANAPE>>SPATIAL_RESULTS which contains the counts of how often the observed value was higher than the randomised values, how many iterations were applied for each index, how many ties there were, and the proportion that were higher than random.  (See the help for more details about the naming scheme).


An example of the setup for the randomisation.
The randomisations have been completed so now the spatial analysis has an output list called CANAPE>>SPATIAL_RESULTS.  This list contains indices that summarise how often the observed values were higher than (or tied with) those calculated using the randomised data.


The next step is to start a new cluster analysis and set our definition query.  The selected basedata has a randomisation output, so you will be warned about possible issues with synchronisation.  This is only a potential problem if you run more iterations of the randomisation, which is not the case here so you can safely ignore the warning.

You will see this warning when you open a new cluster analysis, but can ignore it in this case.  


Set the Metric to be PHYLO_JACCARD (or whatever metric you like - it is your analysis after all) and enter the text below in the Definition Query box (you will need to expand it to enter text).  If your output and randomisation names differ then edit the text as appropriate.

sp_get_spatial_output_list_value (
    output => 'phylo_end', 
    list   => 'CANAPE>>SPATIAL_RESULTS', 
    index  => 'P_PE_WE'
) > 0.95
|| 
sp_get_spatial_output_list_value (
    output => 'phylo_end', 
    list   => 'CANAPE>>SPATIAL_RESULTS', 
    index  => 'P_PHYLO_RPE_NULL2'
) > 0.95



What this does is use the sp_get_spatial_output_list_value() subroutine to access the randomisation indices.  If either of the PE scores for a group, using the observed or alternate trees, is higher than the randomised values in more than 95% of iterations (these are the P_PE_WE and P_PHYLO_RPE_NULL2 indices) then that group will pass the test. Only those groups (cells) that pass the test will be considered for the cluster analysis.
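The same test can be applied outside Biodiverse, for example to check which groups an exported CANAPE>>SPATIAL_RESULTS list would let through.  This is a Python sketch with made-up group names and values; the function name is hypothetical, and treating missing or undefined values as failing the test is an assumption.

```python
def passes_definition_query(indices, threshold=0.95):
    """Mimic the definition query above for one group's randomisation indices.

    Assumption: missing or None values are treated as failing the test.
    """
    p_pe = indices.get("P_PE_WE") or 0
    p_rpe_null = indices.get("P_PHYLO_RPE_NULL2") or 0
    return p_pe > threshold or p_rpe_null > threshold


# Made-up exported values for three groups
exported = {
    "cell_a": {"P_PE_WE": 0.98, "P_PHYLO_RPE_NULL2": 0.40},
    "cell_b": {"P_PE_WE": 0.60, "P_PHYLO_RPE_NULL2": 0.97},
    "cell_c": {"P_PE_WE": 0.60, "P_PHYLO_RPE_NULL2": 0.50},
}
to_cluster = [g for g, idx in exported.items() if passes_definition_query(idx)]
print(to_cluster)
```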

Example of the cluster analysis setup, with a definition query specified.  Only those groups (cells) that pass the test will be used in the cluster analysis.
(The text in the image is different from the main text only in whitespace and formatting.  The code is the same).    

Once the analysis is run it will look something like the image below.
Completed cluster analysis.  Cells that failed the definition query were not considered for clustering and are plotted in grey.  
And there it is, a cluster analysis of the CANAPE regions to identify which ones have similar sets of branches from the tree.

Of course, definition queries are extremely versatile and can be used for all sorts of applications.  You might, for example, want to cluster within a region defined by a shapefile but consider geographic ranges from across a broader study region.

It is also worth noting that spatial analyses can use definition queries, but they affect the analysis differently.  Calculations will only be completed for groups that pass the definition query, but those groups can have neighbours that fail the definition query.  This allows one to more easily analyse data with larger window sizes but with a guard area (data on the edges of the analysis region which, if not considered, could cause biased results).


Shawn Laffan, 19-Apr-2016



Tuesday, 26 January 2016

More on tree visualisations in Biodiverse

Here's a link to a recent blog post in the Methods in Ecology and Evolution blog:  https://methodsblog.wordpress.com/2016/01/22/biodiverse/

It provides some more details on the tree visualisations described in a previous blog post.  http://biodiverse-analysis-software.blogspot.com.au/2014/10/new-tree-plots-in-biodiverse.html

It also has some details on the recently published range-weighted turnover paper (in early view at the time of writing):

Laffan, S.W., Rosauer, D.F., Di Virgilio, G., Miller, J.T., Gonzales-Orozco, C., Knerr, N., Thornhill, A. & Mishler, B.D. (in press) Range-weighted metrics of species and phylogenetic turnover can better resolve biogeographic breaks and boundaries. Methods in Ecology and Evolution.