================================================
Tutorial 2: Artificial 1D data
================================================

:Author: Kota Kasahara

The mDCC analysis method can be generalized to the analysis of any multi-dimensional numerical data. In this tutorial, we repurpose the mDCC analysis method for a non-MD application, using simple artificial data.

------------------------------------
Preparation
------------------------------------

Artificial data
------------------------------------

Two time series of 1D data, v0 and v1, are generated using R.

.. image:: images/tuto02_data.png
   :scale: 50

The data v0 is composed of three sine curves: 1) the first half, 2) the last half, and 3) a transition between them. The phases of sine functions 1 and 2 are opposite, and function 3 smoothly concatenates the two. The data v1 is a single sine function, identical to the first half of v0. Thus, the first half of v0 is positively correlated with v1, and the last half is negatively correlated with v1.

To generate the data table, execute the following commands in R::

  gen.sin <- function(A, B, L, x){ A * sin( B + 2*pi/L * x) }
  steps <- 1:10000
  v01 <- gen.sin(0.05, 10, 150, steps) + 1
  v02 <- gen.sin(-0.05, 10, 150, steps) + 1.5
  grad <- (gen.sin(1, 0, 400, 1:400) + 1)[101:300]/2
  w01 <- c(rep(1,4900), grad, rep(0,4900))
  w02 <- 1-w01
  v0 <- v01*w01 + v02*w02
  v1 <- gen.sin(0.05, 10, 150, steps) + 1
  dat <- data.frame(v0, v1)
  write.table(dat, "dat.txt", quote=F)

*dat.txt* is a tab-separated table::

  v0 v1
  1 0.971065970962453 0.971065970962453
  2 0.969383757588155 0.969383757588155
  3 0.967755255506516 0.967755255506516
  4 0.966183321663557 0.966183321663557
  ..
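For readers who prefer Python, the same series can be generated with numpy. This is a sketch equivalent to the R commands above; the variable names simply mirror the R script::

.. code-block:: python

  import numpy as np

  def gen_sin(A, B, L, x):
      # Same form as the R helper: A * sin(B + 2*pi/L * x)
      return A * np.sin(B + 2 * np.pi / L * x)

  steps = np.arange(1, 10001)
  v01 = gen_sin(0.05, 10, 150, steps) + 1      # first-half sine
  v02 = gen_sin(-0.05, 10, 150, steps) + 1.5   # opposite-phase sine
  # Smooth weight that switches from 1 to 0 over steps 4901..5100
  # (R's 1-origin slice [101:300] becomes Python's 0-origin [100:300])
  grad = (gen_sin(1, 0, 400, np.arange(1, 401)) + 1)[100:300] / 2
  w01 = np.concatenate([np.ones(4900), grad, np.zeros(4900)])
  v0 = v01 * w01 + v02 * (1 - w01)
  v1 = gen_sin(0.05, 10, 150, steps) + 1

By construction, the first half of v0 coincides with v1 exactly, and the last half mirrors it around a shifted baseline.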
Converting the tab-separated table into the mDCC original format
----------------------------------------------------------------

The data table should be converted into the original binary format::

  python ${MDCCTOOLS}/bin/convert_trajectory.py \
    -i dat.txt \
    --i-format tsv \
    -o traj.trrmdcc \
    --ignore-col 1 \
    --ignore-row 1 \
    --dim 1

* *-i* option specifies the input data table.
* *--i-format* option indicates the data format of the -i file. It should be "tsv" or "trr".
* *-o* option indicates the output file name.
* *--ignore-col, --ignore-row* options indicate how many columns and rows are skipped in the -i tsv table, respectively.
* *--dim* option specifies the dimension of the data.

------------------------------------
Pattern recognition
------------------------------------

Execute the mdcc_learn program
------------------------------------

The *mdcc_learn* program is executed twice: once for the entity v0, and once for v1. The input file for v0 is::

  -feature 1 0 x
  -n-mixed-element 5
  -fn-data-table traj.trrmdcc
  -format-data-table mdcc
  -fn-out-gaussian mdcclearn_out.txt.0

The other is::

  -feature 1 1 x
  -n-mixed-element 5
  -fn-data-table traj.trrmdcc
  -format-data-table mdcc
  -fn-out-gaussian mdcclearn_out.txt.1

* *-fn-data-table* specifies the file name of the trajectory data.
* *-format-data-table* specifies the file format of -fn-data-table. It should be "mdcc" or "tsv".
* *-fn-out-gaussian* specifies the output file name. This file describes the inferred Gaussian mixture parameters.
* *-feature* requires three arguments.

  * The first is always 1. This field is reserved for future developments.
  * The second specifies the 0-origin ID of the entity (atom) in the data file.
  * The third is the name of this feature. Any string is acceptable.

* *-n-mixed-element* is the number of Gaussian functions in the Gaussian mixture model.
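Conceptually, *mdcc_learn* infers a Gaussian mixture model over each entity's coordinate distribution. The snippet below is a minimal expectation-maximization sketch of that idea in plain numpy, fitting two components to bimodal data resembling v0. It only illustrates the model being inferred; it is not the actual *mdcc_learn* algorithm, and all names here are illustrative::

.. code-block:: python

  import numpy as np

  def fit_gmm_1d(x, k, n_iter=200):
      # Plain EM for a 1D Gaussian mixture: weights pi, means mu, variances var
      mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
      var = np.full(k, x.var())
      pi = np.full(k, 1.0 / k)
      for _ in range(n_iter):
          # E-step: responsibility of each component for each sample
          p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
          r = p / p.sum(axis=1, keepdims=True)
          # M-step: re-estimate parameters from the responsibilities
          nk = r.sum(axis=0)
          pi = nk / len(x)
          mu = (r * x[:, None]).sum(axis=0) / nk
          var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
      return pi, mu, var

  rng = np.random.default_rng(0)
  # Bimodal sample mimicking v0: two states around 1.0 and 1.5
  x = np.concatenate([rng.normal(1.0, 0.05, 5000), rng.normal(1.5, 0.05, 5000)])
  pi, mu, var = fit_gmm_1d(x, 2)

With well-separated modes, EM recovers means near 1.0 and 1.5 with weights near 0.5 each, which is the kind of result reported in *crd_mdcclearn_gauss.txt* later in this tutorial.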
Then, *mdcc_learn* is executed from the Linux shell::

  ${MDCCTOOLS}/bin/mdcc_learn -fn-cfg mdcclearn_v0.cfg
  ${MDCCTOOLS}/bin/mdcc_learn -fn-cfg mdcclearn_v1.cfg

Integrating the results of all entities
----------------------------------------------

After that, the results of the two entities are concatenated, and global IDs are assigned to all Gaussian functions. Execute the command from the shell::

  python ${MDCCTOOLS}/bin/mdcclearn_result_summary.py \
    --dir-mdcclearn ./ \
    --pref-mdcclearn mdcclearn_out.txt. \
    -o crd_mdcclearn_gauss.txt \
    --dim 1 --min-pi 0.01

* *--min-pi 0.01* means that Gaussian functions whose mixture weight is less than 0.01 are eliminated.

The files named *mdcclearn_out.txt.\** in the directory are merged into a single file, *crd_mdcclearn_gauss.txt*.

-------------------------------------------
Assigning the trajectory on the patterns
-------------------------------------------

Next, *mdcc_assign* is executed for the two entities, using the following settings.

mdccassign_v0.cfg::

  -mode assign-mdcctraj
  -target-column 0
  -skip-header-gaussian 1
  -fn-gaussians crd_mdcclearn_gauss.txt
  -fn-data-table traj.trrmdcc
  -fn-result assign.dat.0
  -gmm-type 0
  -format-output binary

mdccassign_v1.cfg::

  -mode assign-mdcctraj
  -target-column 1
  -skip-header-gaussian 1
  -fn-gaussians crd_mdcclearn_gauss.txt
  -fn-data-table traj.trrmdcc
  -fn-result assign.dat.1
  -gmm-type 1
  -format-output binary

* *-mode* specifies "assign-mdcctraj" or "assign-table". The latter is for a .tsv input file.
* *-target-column* indicates the 0-origin ID of the entity (atom) in the input data file.
* *-skip-header-gaussian* specifies the number of lines to be skipped in the Gaussian definition file (-fn-gaussians).
* *-fn-gaussians* specifies the file name of the Gaussian definition file obtained in the previous step.
* *-fn-result* specifies the output file name.
* *-gmm-type* specifies the ID of the Gaussian mixture, corresponding to the second column in -fn-gaussians.
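The assignment step labels each frame with the Gaussian function that best explains it. Below is a minimal sketch of that decision in numpy, using the v0 parameters from the *crd_mdcclearn_gauss.txt* file shown in the Results section, and assuming *sigma11* is a variance (the conclusion is the same either way, since the two modes are well separated)::

.. code-block:: python

  import numpy as np

  # pi, mu1, sigma11 for the two v0 Gaussians (gc_id 0 and 1) from crd_mdcclearn_gauss.txt
  pi = np.array([0.496002, 0.49639])
  mu = np.array([1.49895, 1.00054])
  var = np.array([0.00199557, 0.00170489])

  def assign(x):
      # Weighted density of each Gaussian at x; the frame is assigned to the argmax
      p = pi * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
      return int(np.argmax(p))

A frame near 1.0 is assigned to Gaussian 1 (the state shared with v1), while a frame near 1.5 is assigned to Gaussian 0.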
Execute the commands from the shell::

  ${MDCCTOOLS}/bin/mdcc_assign -fn-cfg mdccassign_v0.cfg
  ${MDCCTOOLS}/bin/mdcc_assign -fn-cfg mdccassign_v1.cfg

-------------------------------------------
Calculating the mDCC and DCC
-------------------------------------------

The mDCC and DCC are calculated with the following commands.

For mDCC::

  python2.7 ${MDCCTOOLS}/bin/cal_mdcc.py \
    --gaussian crd_mdcclearn_gauss.txt \
    --pref-assign assign.dat. \
    --suff-assign "" \
    --o-mdcc mdcc.txt \
    --min-corr 0.0 \
    --select-id 0-1 \
    --fn-crd-bin traj.trrmdcc \
    --assign-binary

For DCC::

  python2.7 ${MDCCTOOLS}/bin/cal_mdcc.py \
    --o-dcc dcc.txt \
    --min-corr 0.0 \
    --select-id 0-1 \
    --fn-crd-bin traj.trrmdcc

* *--select-id* specifies the 0-origin entity IDs to be analyzed. A range of IDs is specified by joining the first and last IDs with '\-'. IDs and ranges can also be enumerated, separated by ',', e.g., "1-10,12,14,16-18".
* *--pref-assign, --suff-assign* indicate the prefix and suffix of the *mdcc_assign* output files. The files must be named with the prefix, the element ID, and the suffix, e.g., "prefix.0.suffix", "prefix.1.suffix", ...
* *--assign-binary* is required for binary *mdcc_assign* output.
* *--o-mdcc, --o-dcc* are the output file names.
* *--fn-crd-bin* is the input binary file.
* *--min-corr* indicates the minimum correlation coefficient for output.

-------------------------------------------
Results
-------------------------------------------

crd_mdcclearn_gauss.txt::

  gc_id.int element_id.int pi.float mu1.float sigma11.float
  0 0 0.496002 1.49895 0.00199557
  1 0 0.49639 1.00054 0.00170489
  2 1 1 0.99981 0.00145084

The *mdcc_learn* program found two Gaussian functions for the data v0 and one for v1.

mdcc.txt::

  0 2 0 1 -0.90211646445 0.500272653972 0.499140297242
  1 2 0 1 0.875547919424 0.499727346028 0.000772943337186

* The 1st and 2nd columns indicate the pair of Gaussian IDs.
* The 3rd and 4th columns indicate the element IDs.
* The 5th column indicates the mDCC value.
* The 6th column indicates the simultaneous probability of the pair of Gaussians.
* The 7th column indicates the distance between the means of the Gaussian functions.

dcc.txt::

  0 1 0.00363216 0.24999

* The 1st and 2nd columns indicate the pair of element IDs.
* The 3rd column indicates the DCC value.
* The 4th column indicates the distance between the means.

As expected, these results show that v1 is positively correlated with Gaussian 1 of v0 but negatively correlated with Gaussian 0 of v0. In contrast, the conventional DCC shows no correlation between v0 and v1. With this protocol, any kind of multi-dimensional numerical data can be analyzed with this tool kit.
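The contrast between DCC and mDCC can be checked directly on the artificial data: a plain Pearson correlation over the whole series (the quantity behind DCC) is nearly zero, while correlations computed separately within each state (the idea behind mDCC) are close to +1 and -1. A numpy sketch, regenerating the data from the Preparation section::

.. code-block:: python

  import numpy as np

  def gen_sin(A, B, L, x):
      return A * np.sin(B + 2 * np.pi / L * x)

  steps = np.arange(1, 10001)
  grad = (gen_sin(1, 0, 400, np.arange(1, 401)) + 1)[100:300] / 2
  w = np.concatenate([np.ones(4900), grad, np.zeros(4900)])
  v0 = (gen_sin(0.05, 10, 150, steps) + 1) * w \
      + (gen_sin(-0.05, 10, 150, steps) + 1.5) * (1 - w)
  v1 = gen_sin(0.05, 10, 150, steps) + 1

  dcc = np.corrcoef(v0, v1)[0, 1]                    # whole series: near zero
  r_first = np.corrcoef(v0[:4900], v1[:4900])[0, 1]  # state of Gaussian 1: near +1
  r_last = np.corrcoef(v0[5100:], v1[5100:])[0, 1]   # state of Gaussian 0: near -1

The strong but opposite within-state correlations cancel in the global average, which is why the single-Gaussian DCC reports essentially no correlation while the mDCC resolves both.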