bio_analytics.pl -- Computational biology data analytics.

Collects a number of biological data analytics tasks. This library provides tools for the bio_db served data, empowering downstream analyses of user's experimental data.

Installation and loading:

?- pack_install(bio_analytics).  % also installs pack(lib). other dependencies are at load time via lib
?- use_module(library(lib)).
?- lib(bio_analytics).

The library comes with one experimental dataset: data/silac/bt.csv which is used in the example fiiles in directory examples/ .

author
- nicos angelopoulos
version
- 0.1 2019/4/22
- 0.2 2019/5/08
 gene_family(+Family, +Org, -Symbols)
If Family is a known family alias for organism Org, it is expanded to a list of its constituent gene symbols.

Currently 2 organisms are supported: human and mouse.
Family can be a gene ontology term (atom of the form GO:XXXXXX).
If family is a list of numbers or atoms that map to numbers, then they are taken to be Entrez ids which are converted to gene symbols in Symbols.

If family is a list of symbols, it is passed on to Symbols.

Listens to debug(gene_family).

Known families:

autophagy/human
from: http://autophagy.lu/clustering/index.html
 Located family as the bio_analytics gene collection: autophagy
 ?- gene_family( autophagy, human, Auto ), length( Auto, Len ).
 Auto = ['AMBRA1', 'APOL1', 'ARNT', 'ARSA', 'ARSB', 'ATF4', 'ATF6', 'ATG10', 'ATG12'|...],
 Len = 232.

 ?- debug( gene_family ).
 ?- gene_family( 375, hs, Symbs ), length( Symbs, Len ).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.
 
 ?- gene_family( 'GO:0000375', hs, Symbs ), length( Symbs, Len ).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.

 ?- gene_family( [55626, 8542, 405], hs, Auto ).
 Converted input from Entrezes to Symbols.
 Auto = ['AMBRA1', 'APOL1', 'ARNT'].

 ?- gene_family( unknown, hs, Auto ).
 ERROR: Unhandled exception: gene_family(cannot_find_input_family_in_the_known_ones(unknown,[autophagy]))

Family datasets are in pack(bio_analytics/data/families).

author
- nicos angelopoulos
version
- 0:1 2019/3/5, from old code
To be done
- error reporting via print_message/2
 go_org_symbol(+Org, +Go, -Symb)
Get symbols of Go id for Organism.

Go can be either a GO: atom or an integer.

[debug]  ?- go_org_symbol( mouse, 2, Symbs ), write(Symbs), nl, fail.
Akt3
Mef2a
Mgme1
Mpv17
Mrpl15
Mrpl17
Mrpl39
Msto1
Opa1
Slc25a33
Slc25a36
Tymp
false.

[debug]  ?- go_org_symbol( hs, 'GO:0000002', Symbs ), write(Symbs), nl, fail.
AKT3
LONP1
MEF2A
MGME1
MPV17
MSTO1
OPA1
PIF1
SLC25A33
SLC25A36
SLC25A4
TYMP
false.
author
- nicos angelopoulos
version
- 0:1 2019/04/07
 go_org_symbols(+GoT, +Org, -Symbols)
Gets the symbols belonging to a GO term.
?- go_org_symbols( 'GO:0000375', human, Symbs ), length( Symbs, Len ).
Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP'|...],
Len = 26.

?- go_org_symbols( 'GO:0000375', mouse, Symbs ), length( Symbs, Len ).
Symbs = ['Dbr1', 'Rbm17', 'Rnu4atac', 'Rnu6atac', 'Scaf11', 'Sf3a2', 'Slu7', 'Srrm1', 'Srsf10'|...],
Len = 10.

?- go_org_symbols( 375, human, Symbs ), length( Symbs, Len ).
Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
Len = 26.
author
- nicos angelopoulos
version
- 0.1 2019/4/7
See also
- go_org_symb/3, just a findall of
- go_symbols_reach/3 for a version that travells the ontology
 go_symbols_reach(+GoT, -Symbols, +Opts)
Gets the symbols belonging to a GO term. Descents to GO child relations, which by default are includes (reverse of is_a) and consists_of (reverse of part_of) to pick up Symbols recursively.

Opts

organism(Org=hs)
also known is: mouse. Note that human, also resolves to hs
descent(Desc=true)
whether to collect symbols from descendant GO terms
as_child_includes(Inc=true)
collect from edge_gont_include/2
as_child_consists_of(Cns=true)
collect from edge_gont_consists_of/2
as_child_regulates(Reg=false)
collect from edge_gont_regulates/2
as_child_negatively_regulates(Reg=false)
collect from edge_gont_negatively_regulates/2
as_child_positively_regulates(Reg=false)
collect from edge_gont_positively_regulates/2
debug(Dbg=false)
see options_append/3

Listens to debug(go_symbols_reach).

?- go_symbols_reach( 'GO:0000375', Symbs, [] ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.

?- go_symbols_reach( 'GO:0000375', Symbs, organism(mouse) ), length( Symbs, Len ).
Symbs = ['4930595M18Rik', 'Aar2', 'Aqr', 'Bud13', 'Bud31', 'Casc3', 'Cdc40', 'Cdc5l', 'Cdk13'|...].
Len = 190.

?- go_symbols_reach( 375, Symbs, true ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.
author
- nicos angelopoulos
version
- 0.1 2015/7/26
- 0.2 2019/4/7 added organism, moved to new pack
 go_string_graph(+GoTerm, -Graph, +Opts)
Plot the STRING graph of all the Symbols in a GO term.

The Graph can be saved via options to wgraph_plot/2.

Opts are passed to symbols_string_graph/3, go_term_symbols/3 and wgraph_plot/2.

Opts

organism(Org=hs)
organism for the operation
plot(Plot=false)
whether to plot the graph
save(Save=false)
whether to save the graph (passed to wgraph_plot/2
stem_type(Sty=go_name)
constructs stem for the output filenames
?- go_string_graph( 'GO:0016601', G, true ).
?- go_string_graph( 'GO:0016601', G, plot(true) ).
author
- nicos angelopoulos
version
- 0.1 2016/4/12
To be done
- make sure underlying options are compatible.
 symbols_string_graph(+Symbols, -Graph, +Opts)
Create the string database Graph between Symbols.

Opts

include_orphans(Orph=true)
set to false to exclude orphans from Graph
organism(Org=hs)
which organism do the gene symbols come from
minw(500)
minimum weight (0 =< x =< 999) - not checked
sort_pairs(Spairs=true)
set to false to leave order of edges dependant on order of Symbols
sort_graph(Sort=true)
set to false for not sorting the results
?- Got = 'GO:0043552', gene_family( Got, hs, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['AMBRA1', 'ATG14', 'CCL19', 'CCL21', 'CCR7'|...],
SymbsLen = 31,
Graph = ['TNFAIP8L3', 'AMBRA1'-'ATG14':981, 'AMBRA1'-'PIK3R4':974|...],
GraphLen = 235.

% the following is flawed as it uses Got from mouse and tries to build the graph in human STRING...
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = Graph, Graph = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8', 'Fgf2', 'Fgfr3', 'Fgr'|...],
SymbsLen = GraphLen, GraphLen = 28.

% this the correct way for running the first query for mouse:
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, organism(mouse) ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8'|...],
SymbsLen = 28,
Graph = ['Ambra1', 'Cdc42', 'Lyn', 'Nod2', 'Pdgfra', 'Sh3glb1', 'Tnfaip8l3', 'Vav3', ... : ...|...],
GraphLen = 117.
author
- nicos angelopoulos
version
- 0.1 2016/1/18
- 0.2 2019/4/8, added organism to incorporate mouse
 exp_diffex(+Exp, -DEPrs, -NonDEPrs, +Opts)
Select a sublist of significanntly differenentially expressed genes in an experiment. Exp is in the form of a mtx/1 matrix. Both selected (DEPrs) and rejected (NonDEPrs) are in the form of Gene-Ev, where Ev is the expression value for Gene in Exp.

We use ev as short for expression value, to avoid confusion with exp which short for experiment.

Opts

?- lib(mtx),
   absolute_file_name(pack('bio_analytics/data/silac/bt.csv'), CsvF),
   mtx(CsvF, Mtx, convert(true)),
   assert(mtx_data(Mtx)).

CsvF = '/usr/local/users/nicos/local/git/lib/swipl/pack/bio_analytics/data/silac/bt.csv',
Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', ...), row(..., ..., ..., ...)|...].

?- lib(debug_call),
   debug(testo),
   mtx_data(Mtx),
   debug_call(testo, dims, mtx/Mtx),
   exp_diffex(Mtx, DEPrs, NonDEPrs, []),
   debug_call(testo, length, de/DEPrs),
   debug_call(testo, length, nde/NonDEPrs).

% Dimensions for matrix,  (mtx) nR: 1245, nC: 4.
% Length for list, de: 426.
% Length for list, nde: 818.

DEPrs = ['CLEC1B'- -4.80614893171469, 'LGALS1'- -2.065877096524, ... - ...|...],
NonDEPrs = ['CNN2'- -0.69348026828097, 'CXCR4'-0.73039667221395, ... - ...|...].

?-
    mtx_data(Mtx),
    exp_diffex(Mtx, DEPrs, NonDEPrs, exp_pv_cut(0.01)),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 286.
% Length for list, nde: 958.

?-
    mtx_data(Mtx),
    Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf)],
    exp_diffex(Mtx, DEPrs, NonDEPrs, Opts),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 581.
% Length for list, nde: 663.
author
- nicos angelopoulos
version
- 0.1 2019/5/2
 exp_gene_family_string_graph(+Exp, +Family, -Graph, +Opts)
Generate and possibly plot the STRING graph of a known gene family as affected in a biological Exp_eriment. Each family gene is placed of the following states according to wether it was identified in Exp and how it was modified (in relation to background conditions):

Note that if de-reguation trumps identification. That is there currently no distiction between genes that seen to be both significantly de-regulated and identified versus just those that are simply significantly de-regulated.

Opts

exp_ev_log(EvLog=true)
are the expression values (ev) log values ? (also passed to exp_diffex/4)
faint_factor(Fctr=1.2)
faint factor for non-significant nodes
include_non_present(IncP=true)
false excludes non-indentified family genes
include_non_significant(IncS=true)
false excluded non-significant family genes
node_colours(Clrs=clrs(red, green, yellow, khaki4, grey))
colours for the different types of nodes
organism(Org=hs)
organism in which Family is looked in (and experiment was performed)
plot(Plot=true)
whether to plot the graph via wgraph_plot/2
wgraph_plot_opts(WgOpts=[])
options that override this predicates' options in the call to wgraph_plot/2
wgraph_plot_defs(WgDefs=[])
options that are overriden by this predicates' options in the call to wgraph_plot/2

Options are passed to a number of other predicates.

See [pack('bio_analytics/examples/bt.pl')].

?- lib(real), lib(mtx),
   absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   Ppts = [ vjust= -1, node_size(3), mode="fruchtermanreingold", format(svg), stem(bt)],
   Opts = [ exp_ev_cut_let(inf), exp_ev_cut_get(-inf),
            include_non_present(false), include_non_significant(false),
            minw(200), wgraph_plot_opts(Ppts) ],
   exp_gene_family_string_graph( CsvF, autophagy, G, Opts ).

Produces file: bt.svg

author
- nicos angelopoulos
version
- 0.1 2019/4/15
See also
- exp_diffex/4
- gene_family/3
- wgraph_plot/2
 exp_go_over_string_graphs(+Exp, ?GoOver, ?Dir, -Opts)
Create string graphs for all the over-represented terms in GoOver.

Opts

go_id_clm(GoIdClm=1)
column id for over represented GO terms
dir_postfix(Psfx=go_strings)
postfix for outputs directory (when Dir is a variable)

Options are passed to exp_gene_family_string_graph/4.

?- debug(real).
?- debug(exp_go_over_string_graphs).
?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), Exp ),
   exp_go_over_string_graphs( Exp, Gov, Dir, [] ).

% Sending to R: pltv <- ggnet2(lp_adj,vjust = -1,size = 3,label = pl_v_1,color = pl_v_2,edge.size = pl_v_3,edge.color = "#BEAED4")
author
- nicos angelopoulos
version
- 0.1 2019/5/5
See also
- exp_go_over/3
- exp_gene_family_string_graph/4
 bio_analytics_version(+Vers, +Date)
Version and release date.
?- bio_analytics_version(V,D).
V = 0:2:0,
D = date(2019,5,8).

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

 exp_go_over(Arg1, Arg2, Arg3)