Package 'ctsfeatures'

Title: Analyzing Categorical Time Series
Description: An implementation of several functions for feature extraction in categorical time series datasets. Specifically, some features related to marginal distributions and serial dependence patterns can be computed. These features can be used to feed clustering and classification algorithms for categorical time series, among others. The package also includes some interesting datasets containing biological sequences. Practitioners from a broad variety of fields could benefit from the general framework provided by 'ctsfeatures'.
Authors: Angel Lopez-Oriona [aut, cre], Jose A. Vilar [aut]
Maintainer: Angel Lopez-Oriona <[email protected]>
License: GPL-2
Version: 1.2.2
Built: 2025-02-23 03:01:36 UTC
Source: https://github.com/cran/ctsfeatures

Help Index


Constructs the binarized time series associated with a given categorical time series

Description

binarization constructs the binarized time series associated with a given categorical time series.

Usage

binarization(series)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, the function constructs the binarized time series, which is defined as $\overline{\boldsymbol Y}_t=\{\overline{\boldsymbol Y}_1, \ldots, \overline{\boldsymbol Y}_T\}$, with $\overline{\boldsymbol Y}_k=(\overline{Y}_{k,1}, \ldots, \overline{Y}_{k,r})^\top$ such that $\overline{Y}_{k,i}=1$ if $\overline{X}_k=i$ and $\overline{Y}_{k,i}=0$ otherwise ($k=1,\ldots,T$, $i=1,\ldots,r$). The binarized series is returned as a matrix whose rows represent time observations and whose columns represent the categories of the original series.
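The construction can be sketched in base R as follows. This is only an illustration under stated assumptions: the input is a plain factor rather than a tsibble, and binarize_factor is a hypothetical helper, not the package's binarization function.

```r
# Hypothetical helper: one-hot encode a factor (not the package's binarization()).
binarize_factor <- function(x) {
  stopifnot(is.factor(x))
  # Compare every observation with every level: a T x r indicator matrix.
  y <- outer(as.integer(x), seq_along(levels(x)), FUN = "==") * 1L
  colnames(y) <- levels(x)
  y
}

# Toy DNA-like series with four categories (illustrative data).
x <- factor(c("a", "c", "g", "a", "t"), levels = c("a", "c", "g", "t"))
y <- binarize_factor(x)
```

Each row of y contains exactly one 1, located in the column of the category observed at that time.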

Value

The binarized time series.

Author(s)

Ángel López-Oriona, José A. Vilar

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
binarized_series <- binarization(sequence_1) # Constructing the binarized
# time series for the first CTS in dataset GeneticSequences

Computes several features associated with a categorical time series

Description

calculate_features computes several features associated with a categorical time series or between a categorical and a real-valued time series

Usage

calculate_features(series, n_series = NULL, lag = 1, type = NULL)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

n_series

A real-valued time series.

lag

The considered lag (default is 1).

type

String indicating the feature one wishes to compute.

Details

Assume we have a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, with $\widehat{p}_i$ being the natural estimate of the marginal probability of the $i$th category, and $\widehat{p}_{ij}(l)$ being the natural estimate of the joint probability for categories $i$ and $j$ at lag $l$, $i,j=1, \ldots, r$. Assume also that we have a real-valued time series of length $T$, $\overline{Z}_t=\{\overline{Z}_1,\ldots, \overline{Z}_T\}$. The function computes the following quantities depending on the argument type:

  • If type=gini_index, the function computes the estimated Gini index, $\widehat{g}=\frac{r}{r-1}(1-\sum_{i=1}^{r}\widehat{p}_i^2)$.

  • If type=entropy, the function computes the estimated entropy, $\widehat{e}=\frac{-1}{\ln(r)}\sum_{i=1}^{r}\widehat{p}_i\ln \widehat{p}_i$.

  • If type=chebycheff_dispersion, the function computes the estimated Chebycheff dispersion, $\widehat{c}=\frac{r}{r-1}(1-\max_i\widehat{p}_i)$.

  • If type=gk_tau, the function computes the estimated Goodman and Kruskal's tau, $\widehat{\tau}(l)=\frac{\sum_{i,j=1}^{r}\frac{\widehat{p}_{ij}(l)^2}{\widehat{p}_j}-\sum_{i=1}^r\widehat{p}_i^2}{1-\sum_{i=1}^r\widehat{p}_i^2}$.

  • If type=gk_lambda, the function computes the estimated Goodman and Kruskal's lambda, $\widehat{\lambda}(l)=\frac{\sum_{j=1}^{r}\max_i\widehat{p}_{ij}(l)-\max_i\widehat{p}_i}{1-\max_i\widehat{p}_i}$.

  • If type=uncertainty_coefficient, the function computes the estimated uncertainty coefficient, $\widehat{u}(l)=-\frac{\sum_{i, j=1}^{r}\widehat{p}_{ij}(l)\ln\big(\frac{\widehat{p}_{ij}(l)}{\widehat{p}_i\widehat{p}_j}\big)}{\sum_{i=1}^{r}\widehat{p}_i\ln \widehat{p}_i}$.

  • If type=pearson_measure, the function computes the estimated Pearson measure, $\widehat{X}_T^2(l)=T\sum_{i,j=1}^{r}\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}$.

  • If type=phi2_measure, the function computes the estimated Phi2 measure, $\widehat{\Phi}^2(l)=\frac{\widehat{X}_T^2(l)}{T}$.

  • If type=sakoda_measure, the function computes the estimated Sakoda measure, $\widehat{p}^*(l)=\sqrt{\frac{r\widehat{\Phi}^2(l)}{(r-1)(1+\widehat{\Phi}^2(l))}}$.

  • If type=cramers_vi, the function computes the estimated Cramer's vi, $\widehat{v}(l)=\sqrt{\frac{1}{r-1}\sum_{i,j=1}^r\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}}$.

  • If type=cohens_kappa, the function computes the estimated Cohen's kappa, $\widehat{\kappa}(l)=\frac{\sum_{j=1}^{r}(\widehat{p}_{jj}(l)-\widehat{p}_j^2)}{1-\sum_{i=1}^r\widehat{p}_i^2}$.

  • If type=total_correlation, the function computes the estimated sum $\widehat{\Psi}(l)=\frac{1}{r^2}\sum_{i,j=1}^{r}\widehat{\psi}_{ij}(l)^2$, where $\widehat{\psi}_{ij}(l)$ is the estimated correlation $\widehat{Corr}(Y_{t, i}, Y_{t-l, j})$, $i,j=1,\ldots,r$, and $\overline{\boldsymbol Y}_t=\{\overline{\boldsymbol Y}_1, \ldots, \overline{\boldsymbol Y}_T\}$, with $\overline{\boldsymbol Y}_k=(\overline{Y}_{k,1}, \ldots, \overline{Y}_{k,r})^\top$, is the binarized time series of $\overline{X}_t$.

  • If type=spectral_envelope, the function computes the estimated spectral envelope.

  • If type=total_mixed_correlation_1, the function computes the estimated total mixed l-correlation given by

    $$\widehat{\Psi}_1(l)=\frac{1}{r}\sum_{i=1}^{r}\widehat{\psi}_{i}(l)^2,$$

    where $\widehat{\psi}_{i}(l)=\widehat{Corr}(Y_{t,i}, Z_{t-l})$ and $\overline{\boldsymbol Y}_t=\{\overline{\boldsymbol Y}_1, \ldots, \overline{\boldsymbol Y}_T\}$, with $\overline{\boldsymbol Y}_k=(\overline{Y}_{k,1}, \ldots, \overline{Y}_{k,r})^\top$, is the binarized time series of $\overline{X}_t$.

  • If type=total_mixed_correlation_2, the function computes the estimated total mixed q-correlation given by

    $$\widehat{\Psi}_2(l)=\frac{1}{r}\sum_{i=1}^{r}\int_{0}^{1}\widehat{\psi}^\rho_{i}(l)^2d\rho,$$

    where $\widehat{\psi}_{i}^\rho(l)=\widehat{Corr}\big(Y_{t,i}, I(Z_{t-l}\leq q_{Z_t}(\rho)) \big)$, $\overline{\boldsymbol Y}_t=\{\overline{\boldsymbol Y}_1, \ldots, \overline{\boldsymbol Y}_T\}$, with $\overline{\boldsymbol Y}_k=(\overline{Y}_{k,1}, \ldots, \overline{Y}_{k,r})^\top$, is the binarized time series of $\overline{X}_t$, $\rho \in (0, 1)$ is a probability level, $I(\cdot)$ is the indicator function and $q_{Z_t}$ is the quantile function of the corresponding real-valued process.
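Several of the formulas above can be checked by hand with a few lines of base R. The sketch below is illustrative only: it works on a plain factor instead of a tsibble, the toy series and all variable names are assumptions, and it is not the package's implementation of calculate_features.

```r
# Toy series coded as a factor (illustrative data, not from the package).
x <- factor(c("a", "b", "a", "c", "b", "a", "a", "c", "b", "a"),
            levels = c("a", "b", "c"))
r <- nlevels(x)
n <- length(x)
lag_l <- 1

# Marginal estimates p_i = N_i / T.
p <- as.numeric(table(x)) / n

# Joint estimates p_ij(l) = N_ij(l) / (T - l) for pairs (X_t, X_{t-l}) = (i, j).
p_joint <- unclass(table(x[(lag_l + 1):n], x[1:(n - lag_l)])) / (n - lag_l)

# Features of the marginal distribution.
gini <- r / (r - 1) * (1 - sum(p^2))
entropy <- -sum(p[p > 0] * log(p[p > 0])) / log(r)
chebycheff <- r / (r - 1) * (1 - max(p))

# Serial dependence features at the chosen lag (every category must occur).
indep <- outer(p, p)
phi2 <- sum((p_joint - indep)^2 / indep)
cramers_v <- sqrt(phi2 / (r - 1))
cohens_kappa <- sum(diag(p_joint) - p^2) / (1 - sum(p^2))
```

For this toy series the marginal estimates are (0.5, 0.3, 0.2), so the Gini index is 1.5 × 0.62 = 0.93 and the Chebycheff dispersion is 1.5 × 0.5 = 0.75.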

Value

The corresponding feature.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH, Göb R (2008). “Measuring serial dependence in categorical time series.” AStA Advances in Statistical Analysis, 92, 71–89.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
uc <- calculate_features(series = sequence_1, type = 'uncertainty_coefficient' )
# Computing the uncertainty coefficient
# for the first series in dataset GeneticSequences
se <- calculate_features(series = sequence_1, type = 'spectral_envelope' )
# Computing the spectral envelope
# for the first series in dataset GeneticSequences

Computes the relative frequency of motifs in a categorical time series

Description

calculate_motifs computes the motifs of a categorical time series

Usage

calculate_motifs(series, motif_length)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

motif_length

The length of the motif.

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, and a motif length $L$, the function returns an array of $r^L$ elements, with the element in the position $(i_1, i_2, \ldots, i_L)$ being the relative frequency of the motif “$i_1i_2 \cdots i_L$” in the corresponding time series.
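A minimal base R sketch of this idea is given below. It assumes a plain factor as input, overlapping windows, and normalization by the number of windows; motif_frequencies is a hypothetical helper, and the package function may differ in these details.

```r
# Hypothetical helper: relative frequencies of all motifs of a given length.
motif_frequencies <- function(x, motif_length) {
  stopifnot(is.factor(x), motif_length >= 1, motif_length <= length(x))
  n_windows <- length(x) - motif_length + 1
  # Element k holds the k-th symbol of every overlapping window.
  windows <- lapply(seq_len(motif_length),
                    function(k) x[k:(k + n_windows - 1)])
  # Cross-tabulating the positions gives an array with motif_length dimensions.
  counts <- do.call(table, unname(windows))
  unclass(counts) / n_windows
}

x <- factor(c("a", "b", "a", "b", "a", "c"), levels = c("a", "b", "c"))
freq <- motif_frequencies(x, motif_length = 2)
```

Here the five windows are ab, ba, ab, ba and ac, so freq["a", "b"] and freq["b", "a"] are both 0.4 and freq["a", "c"] is 0.2.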

Value

Returns an array with the relative frequency of motifs in a categorical time series.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Lonardi JLEKS, Patel P (2002). “Finding motifs in time series.” In Proc. of the 2nd Workshop on Temporal Data Mining, 53–68.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
calculate_motifs(sequence_1, motif_length = 3)
# Computing the relative frequencies of motifs of length 3 for the first
# series in dataset GeneticSequences

Computes several subfeatures associated with a categorical time series

Description

calculate_subfeatures computes several subfeatures associated with a categorical time series or between a categorical and a real-valued time series.

Usage

calculate_subfeatures(series, n_series, lag = 1, type = NULL)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

n_series

A real-valued time series.

lag

The considered lag (default is 1).

type

String indicating the subfeature one wishes to compute.

Details

Assume we have a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, with $\widehat{p}_i$ being the natural estimate of the marginal probability of the $i$th category, and $\widehat{p}_{ij}(l)$ being the natural estimate of the joint probability for categories $i$ and $j$ at lag $l$, $i,j=1, \ldots, r$. Assume also that we have a real-valued time series of length $T$, $\overline{Z}_t=\{\overline{Z}_1,\ldots, \overline{Z}_T\}$. The function computes the following subfeatures depending on the argument type:

  • If type=entropy, the function computes the subfeatures associated with the estimated entropy, $\widehat{p}_i\ln(\widehat{p}_i)$, $i=1,2, \ldots,r$.

  • If type=gk_tau, the function computes the subfeatures associated with the estimated Goodman and Kruskal's tau, $\frac{\widehat{p}_{ij}(l)^2}{\widehat{p}_j}$, $i,j=1,2, \ldots,r$.

  • If type=gk_lambda, the function computes the subfeatures associated with the estimated Goodman and Kruskal's lambda, $\max_i\widehat{p}_{ij}(l)$, $j=1,2, \ldots,r$.

  • If type=uncertainty_coefficient, the function computes the subfeatures associated with the estimated uncertainty coefficient, $\widehat{p}_{ij}(l)\ln\Big(\frac{\widehat{p}_{ij}(l)}{\widehat{p}_i\widehat{p}_j}\Big)$, $i,j=1,2, \ldots,r$.

  • If type=pearson_measure, the function computes the subfeatures associated with the estimated Pearson measure, $\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}$, $i,j=1,2, \ldots,r$.

  • If type=phi2_measure, the function computes the subfeatures associated with the estimated Phi2 measure, $\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}$, $i,j=1,2, \ldots,r$.

  • If type=sakoda_measure, the function computes the subfeatures associated with the estimated Sakoda measure, $\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}$, $i,j=1,2, \ldots,r$.

  • If type=cramers_vi, the function computes the subfeatures associated with the estimated Cramer's vi, $\frac{(\widehat{p}_{ij}(l)-\widehat{p}_i\widehat{p}_j)^2}{\widehat{p}_i\widehat{p}_j}$, $i,j=1,2, \ldots,r$.

  • If type=cohens_kappa, the function computes the subfeatures associated with the estimated Cohen's kappa, $\widehat{p}_{ii}(l)-\widehat{p}_i^2$, $i=1,2, \ldots,r$.

  • If type=total_correlation, the function computes the subfeatures associated with the total correlation, $\widehat{\psi}_{ij}(l)$, $i,j=1,2, \ldots,r$ (see type='total_correlation' in the function calculate_features).

  • If type=total_mixed_correlation_1, the function computes the subfeatures associated with the total mixed l-correlation, $\widehat{\psi}_{i}(l)$, $i=1,2, \ldots,r$ (see type='total_mixed_correlation_1' in the function calculate_features).

  • If type=total_mixed_correlation_2, the function computes the subfeatures associated with the total mixed q-correlation, $\int_{0}^{1}\widehat{\psi}^\rho_{i}(l)^2d\rho$, $i=1,2, \ldots,r$ (see type='total_mixed_correlation_2' in the function calculate_features).
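The relation between subfeatures and features can be illustrated with a short base R sketch. It assumes a plain factor as input and uses illustrative variable names; it is not the package's implementation. Summing the Pearson-type terms gives the Phi2 measure, and multiplying that sum by the series length gives the Pearson measure.

```r
# Toy series coded as a factor (illustrative data).
x <- factor(c("a", "b", "a", "c", "b", "a", "a", "c", "b", "a"),
            levels = c("a", "b", "c"))
n <- length(x)
lag_l <- 1
p <- as.numeric(table(x)) / n
p_joint <- unclass(table(x[(lag_l + 1):n], x[1:(n - lag_l)])) / (n - lag_l)

# r x r matrix of Pearson-type terms, one per pair of categories (i, j).
sub_pearson <- (p_joint - outer(p, p))^2 / outer(p, p)

# Length-r vector of Cohen's kappa terms, one per category.
sub_kappa <- diag(p_joint) - p^2

# Aggregating the subfeatures recovers the corresponding features.
phi2 <- sum(sub_pearson)
pearson <- n * phi2
kappa <- sum(sub_kappa) / (1 - sum(p^2))
```

Keeping the individual terms, rather than only their sum, preserves information about which categories or pairs of categories drive the dependence.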

Value

The corresponding subfeature

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH, Göb R (2008). “Measuring serial dependence in categorical time series.” AStA Advances in Statistical Analysis, 92, 71–89.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
suc <- calculate_subfeatures(series = sequence_1, type = 'uncertainty_coefficient')
# Computing the subfeatures associated with the uncertainty coefficient
# for the first series in dataset GeneticSequences
scv <- calculate_subfeatures(series = sequence_1, type = 'cramers_vi' )
# Computing the subfeatures associated with the cramers vi
# for the first series in dataset GeneticSequences

Computes the conditional probabilities of a categorical time series

Description

conditional_probabilities returns a matrix with the conditional probabilities of a categorical time series

Usage

conditional_probabilities(series, lag = 1)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

lag

The considered lag (default is 1).

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, the function computes the matrix $\widehat{\boldsymbol P}^c(l) = \big(\widehat{p}^c_{ij}(l)\big)_{1 \le i, j \le r}$, with $\widehat{p}^c_{ij}(l)=\frac{TN_{ij}(l)}{(T-l)N_i}$, where $N_i$ is the number of elements equal to $i$ in the realization $\overline{X}_t$ and $N_{ij}(l)$ is the number of pairs $(\overline{X}_t, \overline{X}_{t-l})=(i,j)$ in the realization $\overline{X}_t$.
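The formula can be sketched in base R as follows, assuming a plain factor as input. The helper conditional_probs is hypothetical and only mirrors the formula stated above; it is not the package's conditional_probabilities function.

```r
# Hypothetical helper implementing T * N_ij(l) / ((T - l) * N_i).
conditional_probs <- function(x, lag_l = 1) {
  stopifnot(is.factor(x), lag_l >= 1, lag_l < length(x))
  n <- length(x)
  n_i <- as.numeric(table(x))                                  # N_i
  n_ij <- unclass(table(x[(lag_l + 1):n], x[1:(n - lag_l)]))   # N_ij(l)
  # The vector n_i is recycled down the columns, so row i is divided by N_i.
  n * n_ij / ((n - lag_l) * n_i)
}

x <- factor(c("a", "b", "a", "c", "b", "a", "a", "c", "b", "a"),
            levels = c("a", "b", "c"))
cp <- conditional_probs(x)
```

In this toy series the pair (a, b) occurs three times and a occurs five times, so cp["a", "b"] = 10 × 3 / (9 × 5) = 2/3. Because $N_i$ is counted over the whole series and $N_{ij}(l)$ over $T-l$ pairs, rows of the matrix need not sum exactly to one in finite samples.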

Value

A matrix with the conditional probabilities.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH, Göb R (2008). “Measuring serial dependence in categorical time series.” AStA Advances in Statistical Analysis, 92, 71–89.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
matrix_cp <- conditional_probabilities(series = sequence_1) # Computing the matrix of
# conditional probabilities for the first series in dataset GeneticSequences

GeneticSequences

Description

Categorical time series (CTS) of DNA sequences from different viruses

Usage

data(GeneticSequences)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 32 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 32 time series taking four categorical values (DNA bases). The column Class is formed by integers from 1 to 4, indicating that there are 4 different classes in the database. Each class is associated with a different family of viruses. For more information, see López-Oriona et al. (2023).

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.


Computes the joint probabilities of a categorical time series

Description

joint_probabilities returns a matrix with the joint probabilities of a categorical time series

Usage

joint_probabilities(series, lag = 1)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

lag

The considered lag (default is 1).

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, the function computes the matrix $\widehat{\boldsymbol P}(l) = \big(\widehat{p}_{ij}(l)\big)_{1 \le i, j \le r}$, with $\widehat{p}_{ij}(l)=\frac{N_{ij}(l)}{T-l}$, where $N_{ij}(l)$ is the number of pairs $(\overline{X}_t, \overline{X}_{t-l})=(i,j)$ in the realization $\overline{X}_t$.

Value

A matrix with the joint probabilities.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH, Göb R (2008). “Measuring serial dependence in categorical time series.” AStA Advances in Statistical Analysis, 92, 71–89.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
matrix_jp <- joint_probabilities(series = sequence_1) # Computing the matrix of
# joint probabilities for the first series in dataset GeneticSequences

Computes the marginal probabilities of a categorical time series

Description

marginal_probabilities returns a vector with the marginal probabilities of a categorical time series

Usage

marginal_probabilities(series)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, the function computes the vector $\widehat{\boldsymbol p} =(\widehat{p}_1, \ldots, \widehat{p}_r)$, with $\widehat{p}_i=\frac{N_i}{T}$, where $N_i$ is the number of elements equal to $i$ in the realization $\overline{X}_t$.

Value

A vector with the marginal probabilities.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH, Göb R (2008). “Measuring serial dependence in categorical time series.” AStA Advances in Statistical Analysis, 92, 71–89.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
vector_mp <- marginal_probabilities(series = sequence_1) # Computing the vector of
# marginal probabilities for the first series in dataset GeneticSequences

Constructs a control chart for the cycle lengths of a categorical series

Description

plot_ccc constructs a control chart for the cycle lengths of a categorical series

Usage

plot_ccc(
  series,
  mu_t,
  lcl_t,
  ucl_t,
  plot = TRUE,
  title = "Control chart (cycles)",
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

mu_t

The mean of the process measuring the cycle lengths.

lcl_t

The lower control limit.

ucl_t

The upper control limit.

plot

Logical. If plot = TRUE (default), returns the control chart. Otherwise, returns the standardized statistic.

title

The title of the graph.

...

Additional parameters for the function.

Details

Constructs a control chart of a CTS based on cycle lengths. The chart is based on the standardized statistic $T_t=T_t^{(L)}+T_t^{(U)}$, with $T_t^{(L)}=\min \left(0, \frac{C_t-\mu_t}{\left|LCL_t-\mu_t\right|}\right)$ and $T_t^{(U)}=\max \left(0, \frac{C_t-\mu_t}{\left|UCL_t-\mu_t\right|}\right)$, where $C_t$ expresses the length of a cycle ending with a specific category, $\mu_t$ denotes the mean of $C_t$ and $LCL_t$ and $UCL_t$ are lower and upper individual control limits, respectively. Note that an out-of-control alarm is signalled if $T_t<-1$ or $T_t>1$.
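The standardized statistic can be sketched in base R as below. The cycle lengths, mean and control limits are made-up numbers, and cycle_statistic is a hypothetical helper that takes the cycle lengths $C_t$ as given; how plot_ccc derives them from the series is not reproduced here.

```r
# Hypothetical helper: standardized statistic T_t = T_t^(L) + T_t^(U).
cycle_statistic <- function(c_t, mu_t, lcl_t, ucl_t) {
  t_lower <- pmin(0, (c_t - mu_t) / abs(lcl_t - mu_t))
  t_upper <- pmax(0, (c_t - mu_t) / abs(ucl_t - mu_t))
  t_lower + t_upper
}

# Made-up cycle lengths with constant mean and control limits.
c_t <- c(4, 1, 9, 5)
t_stat <- cycle_statistic(c_t, mu_t = 4, lcl_t = 2, ucl_t = 8)

# An alarm is raised when the statistic leaves the band [-1, 1].
alarm <- t_stat < -1 | t_stat > 1
```

With these numbers the statistic is 0, -1.5, 1.25 and 0.25, so the second and third cycles trigger an alarm. Dividing by the distance to the relevant limit is what allows asymmetric limits to share the common band [-1, 1].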

Value

If plot = TRUE (default), represents the control chart for the cycle lengths. Otherwise, the function returns a matrix with the values of the standardized statistic for each time t.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2008). “Visual analysis of categorical time series.” Statistical Methodology, 5(1), 56–71.

Examples

sequence_1 <- SyntheticData1[which(SyntheticData1$Series==1),]
cycle_cc <- plot_ccc(series = sequence_1, mu_t = c(1, 1.5, 1),
lcl_t = rep(10, 600), ucl_t = rep(10, 600)) # Representing
# a control chart for the cycle lengths
cycle_cc <- plot_ccc(series = sequence_1, mu_t = c(1, 1.5, 1),
lcl_t = rep(10, 600), ucl_t = rep(10, 600), plot = FALSE) # Computing the
# corresponding standardized statistic

Constructs a serial dependence plot based on Cohen's kappa

Description

plot_cohen constructs a serial dependence plot of a categorical time series based on Cohen's kappa

Usage

plot_cohen(
  series,
  max_lag = 10,
  alpha = 0.05,
  plot = TRUE,
  title = "Serial dependence plot",
  bar_width = 0.12,
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

max_lag

The maximum lag represented in the plot (default is 10).

alpha

The significance level for the corresponding hypothesis test (default is 0.05).

plot

Logical. If plot = TRUE (default), returns the serial dependence plot. Otherwise, returns a list with the values of Cohen's kappa, the critical value and the corresponding p-values.

title

The title of the graph.

bar_width

The width of the corresponding bars.

...

Additional parameters for the function.

Details

Constructs a serial dependence plot based on Cohen's kappa, $\widehat{\kappa}(l)$, for several lags. A dashed line is incorporated indicating the critical value of the test based on the following asymptotic approximation (under the i.i.d. assumption):

$$\sqrt{\frac{T}{V(\widehat{\boldsymbol p})}}\bigg(\widehat{\kappa}(l)+\frac{1}{T}\bigg)\sim N\big(0, 1\big),$$

where $T$ is the series length, $\widehat{\boldsymbol p}=(\widehat{p}_1, \ldots, \widehat{p}_r)$ is the vector of estimated marginal probabilities for the $r$ categories of the series and $V(\widehat{\boldsymbol p})=1-\frac{1+2\sum_{i=1}^{r}\widehat{p}_i^3-3\sum_{i=1}^{r}\widehat{p}_i^2}{(1-\sum_{i=1}^{r}\widehat{p}_i^2)^2}$.
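The test behind the dashed line can be sketched in base R as follows. This is a sketch under stated assumptions rather than the package's code: the input is a plain factor, kappa_test is a hypothetical helper, and the two-sided p-value and critical value are assumptions, since the text above does not state whether the test is one- or two-sided.

```r
# Hypothetical helper: Cohen's kappa at one lag with its asymptotic test.
kappa_test <- function(x, lag_l = 1, alpha = 0.05) {
  n <- length(x)
  p <- as.numeric(table(x)) / n
  p_joint <- unclass(table(x[(lag_l + 1):n], x[1:(n - lag_l)])) / (n - lag_l)
  kappa <- sum(diag(p_joint) - p^2) / (1 - sum(p^2))
  # V(p) as defined above.
  v <- 1 - (1 + 2 * sum(p^3) - 3 * sum(p^2)) / (1 - sum(p^2))^2
  z <- sqrt(n / v) * (kappa + 1 / n)
  list(kappa = kappa,
       statistic = z,
       p_value = 2 * pnorm(-abs(z)),         # two-sided (assumption)
       z_critical = qnorm(1 - alpha / 2))    # two-sided (assumption)
}

# Simulated i.i.d. series, so no serial dependence is expected.
set.seed(1)
x <- factor(sample(c("a", "c", "g", "t"), 300, replace = TRUE),
            levels = c("a", "c", "g", "t"))
res <- kappa_test(x, lag_l = 1)
```

For equiprobable categories the variance term reduces to $V=1/(r-1)$, which gives a quick sanity check of the formula.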

Value

If plot = TRUE (default), returns the serial dependence plot based on Cohen's kappa. Otherwise, the function returns a list with the values of Cohen's kappa, the critical value and the corresponding p-values.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2011). “Empirical measures of signed serial dependence in categorical time series.” Journal of Statistical Computation and Simulation, 81(4), 411–429.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
plot_ck <- plot_cohen(series = sequence_1, max_lag = 3) # Representing
# the serial dependence plot
list_ck <- plot_cohen(series = sequence_1, max_lag = 3, plot = FALSE) # Obtaining
# the values of Cohen's kappa, the critical value and the p-values

Constructs a serial dependence plot based on Cramer's vi

Description

plot_cramer constructs a serial dependence plot of a categorical time series based on Cramer's vi

Usage

plot_cramer(
  series,
  max_lag = 10,
  alpha = 0.05,
  plot = TRUE,
  title = "Serial dependence plot",
  bar_width = 0.12,
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

max_lag

The maximum lag represented in the plot (default is 10).

alpha

The significance level for the corresponding hypothesis test (default is 0.05).

plot

Logical. If plot = TRUE (default), returns the serial dependence plot. Otherwise, returns a list with the values of Cramer's vi, the critical value and the corresponding p-values.

title

The title of the graph.

bar_width

The width of the corresponding bars.

...

Additional parameters for the function.

Details

Constructs a serial dependence plot based on Cramer's vi, $\widehat{v}(l)$, for several lags. A dashed line is incorporated indicating the critical value of the test based on the following asymptotic approximation (under the i.i.d. assumption):

$$T(r-1)\widehat{v}(l)^2 \sim\chi^2_{(r-1)^2},$$

where $T$ is the series length and $r$ is the number of categories in the time series.
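A base R sketch of this test is shown below. It assumes a plain factor in which every category occurs at least once; cramer_test is a hypothetical helper, not the package's code, and expressing the critical value on the scale of $\widehat{v}(l)$ is an assumption about how the dashed line is drawn.

```r
# Hypothetical helper: Cramer's v at one lag with its chi-squared test.
cramer_test <- function(x, lag_l = 1, alpha = 0.05) {
  n <- length(x)
  r <- nlevels(x)
  p <- as.numeric(table(x)) / n
  p_joint <- unclass(table(x[(lag_l + 1):n], x[1:(n - lag_l)])) / (n - lag_l)
  indep <- outer(p, p)
  v <- sqrt(sum((p_joint - indep)^2 / indep) / (r - 1))
  statistic <- n * (r - 1) * v^2
  df <- (r - 1)^2
  list(v = v,
       statistic = statistic,
       p_value = pchisq(statistic, df, lower.tail = FALSE),
       # Critical value mapped back to the scale of v (assumption).
       v_critical = sqrt(qchisq(1 - alpha, df) / (n * (r - 1))))
}

# Simulated i.i.d. series with three categories.
set.seed(2)
x <- factor(sample(c("a", "b", "c"), 300, replace = TRUE),
            levels = c("a", "b", "c"))
res <- cramer_test(x, lag_l = 1)
```

Values of res$v above res$v_critical would be flagged as significant dependence at level alpha under this sketch.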

Value

If plot = TRUE (default), returns the serial dependence plot based on Cramer's vi. Otherwise, the function returns a list with the values of Cramer's vi, the critical value and the corresponding p-values.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2013). “Serial dependence of NDARMA processes.” Computational Statistics and Data Analysis, 68, 213–238.

Examples

sequence_1 <- SyntheticData1[which(SyntheticData1$Series==1),]
plot_cv <- plot_cramer(series = sequence_1, max_lag = 3) # Representing
# the serial dependence plot
list_cv <- plot_cramer(series = sequence_1, max_lag = 3, plot = FALSE) # Obtaining
# the values of Cramer's vi, the critical value and the p-values

Constructs a categorical time series plot

Description

plot_cts constructs a categorical time series plot

Usage

plot_cts(series, title = "Time series plot")

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

title

The title of the graph.

Details

Constructs a categorical time series plot for a given CTS.

Value

The categorical time series plot.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2018). An introduction to discrete-valued time series. John Wiley and Sons.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
time_series_plot <- plot_cts(series = sequence_1) # Constructs a categorical
# time series plot for the first 50 observations of the first  time series in
# dataset GeneticSequences

Constructs the IFS circle transformation of a categorical time series

Description

plot_ifsct constructs the IFS circle transformation of a categorical time series.

Usage

plot_ifsct(series, alpha, beta, title = "IFS circle transformation", ...)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

alpha

Parameter alpha in the circle transformation.

beta

Parameter beta in the circle transformation.

title

The title of the graph.

...

Additional parameters for the function.

Details

Constructs the IFS circle transformation for a given CTS, which is useful to identify cycles of arbitrary length.

Value

The IFS circle transformation.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2008). “Visual analysis of categorical time series.” Statistical Methodology, 5(1), 56–71.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
ct <- plot_ifsct(sequence_1, alpha = 0.1, beta = 0.1) # Constructing the IFS circle transformation
# for the first CTS in dataset GeneticSequences

Constructs a control chart for the marginal distribution of a categorical series

Description

plot_mcc constructs a control chart for the marginal distribution of a categorical series

Usage

plot_mcc(
  series,
  c,
  sigma,
  lambda = 0.99,
  k = 3.3,
  min_max = FALSE,
  plot = TRUE,
  title = "Control chart (marginal)",
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

c

The hypothetical marginal distribution.

sigma

A matrix containing the variances for each category (columns) and each time t (rows).

lambda

The constant lambda to construct the EWMA estimator.

k

The constant k to construct the k sigma limits.

min_max

Logical. If min_max = FALSE (default), the standard control chart for the marginal distribution is plotted. Otherwise, the reduced control chart is plotted, i.e., only the minimum and maximum values of the standardized statistics (with respect to the set of categories) are considered.

plot

Logical. If plot = TRUE (default), returns the control chart. Otherwise, returns the standardized statistics or their maximum and minimum value for each time t.

title

The title of the graph.

...

Additional parameters for the function.

Details

Constructs a control chart of a CTS with range $\mathcal{V}=\{1, \ldots, r\}$ based on the marginal distribution. The chart relies on the standardized statistic $T_{t, i}=\frac{\hat{\pi}_{t, i}^{(\lambda)}-p_i}{k \cdot \sigma_{t, i}}$, where the $\hat{\pi}_{t, i}^{(\lambda)}$, $i=1,\ldots,r$, are the components of the EWMA estimator of the marginal distribution, $p_i$ is the marginal probability of category $i$, $\sigma_{t,i}$ is the variance of $\hat{\pi}_{t, i}^{(\lambda)}$ and $k$ is a constant set by the user. If min_max = TRUE, then only the statistics $T_t^{\min }=\min_{i \in \mathcal{V}} T_{t, i}$ and $T_t^{\max }=\max_{i \in \mathcal{V}} T_{t, i}$ are plotted. An out-of-control alarm is signalled if the statistics are below -1 or above 1.
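The standardization can be sketched in base R as below. Two points are assumptions that the text above does not specify: the EWMA recursion is taken to be lambda times the previous estimate plus (1 - lambda) times the current binarized observation, and the recursion is started at the hypothetical marginal distribution. The helper ewma_statistics is hypothetical and is not the package's plot_mcc.

```r
# Hypothetical helper: standardized EWMA statistics T_{t,i}.
ewma_statistics <- function(y, p, sigma, lambda = 0.99, k = 3.3) {
  # y: T x r binarized matrix; p: hypothetical marginals; sigma: T x r matrix.
  pi_hat <- matrix(NA_real_, nrow(y), ncol(y))
  prev <- p                                          # start value (assumption)
  for (i in seq_len(nrow(y))) {
    prev <- lambda * prev + (1 - lambda) * y[i, ]    # EWMA recursion (assumption)
    pi_hat[i, ] <- prev
  }
  # Subtract p_i from column i, then scale by k * sigma_{t,i}.
  sweep(pi_hat, 2, p) / (k * sigma)
}

# Toy binary series and made-up design values.
x <- factor(c("a", "a", "b", "a"), levels = c("a", "b"))
y <- outer(as.integer(x), seq_len(nlevels(x)), "==") * 1
p <- c(0.5, 0.5)
sigma <- matrix(1, nrow(y), ncol(y))
t_stat <- ewma_statistics(y, p, sigma, lambda = 0.5, k = 1)

# Reduced chart: keep only the row-wise minimum and maximum.
t_min <- apply(t_stat, 1, min)
t_max <- apply(t_stat, 1, max)
```

The reduced chart shows two lines instead of $r$, which is easier to read when the number of categories is large.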

Value

If plot = TRUE (default), represents the control chart for the marginal distribution. Otherwise, the function returns a matrix with the values of the standardized statistics for each time t.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2008). “Visual analysis of categorical time series.” Statistical Methodology, 5(1), 56–71.

Examples

sequence_1 <- SyntheticData1[which(SyntheticData1$Series == 1), ]
cycle_cc <- plot_ccc(series = sequence_1, mu_t = c(1, 1.5, 1),
                     lcl_t = rep(10, 600), ucl_t = rep(10, 600))
# Representing a control chart for the marginal distribution
cycle_md <- plot_mcc(series = sequence_1, c = c(0.3, 0.3, 0.4),
                     sigma = matrix(rep(c(1, 1, 1), 600), nrow = 600))
# Computing the corresponding standardized statistic
cycle_md <- plot_mcc(series = sequence_1, c = c(0.3, 0.3, 0.4),
                     sigma = matrix(rep(c(1, 1, 1), 600), nrow = 600), plot = FALSE)

Constructs the pattern histogram associated with a given category of a categorical time series

Description

plot_ph constructs the pattern histogram associated with a given category of a categorical time series.

Usage

plot_ph(
  series,
  category,
  plot = TRUE,
  title = paste0("Pattern histogram (", category, ")"),
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

category

The selected category.

plot

Logical. If plot = TRUE (default), returns the pattern histogram. Otherwise, returns the frequencies of cycle lengths associated with the corresponding category.

title

The title of the graph.

...

Additional parameters for the function.

Details

Constructs the pattern histogram for a specific category of a CTS. This graph represents the frequencies of the cycles for the corresponding category according to their length.
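The idea behind the cycle-length frequencies can be sketched in base R. This is an illustration rather than the package's code: the helper name cycle_length_freq is hypothetical, and a cycle length is assumed here to be the time distance between two consecutive occurrences of the selected category.

```r
# Hypothetical helper: frequencies of cycle lengths for one category.
# Assumption: a cycle length is the gap between consecutive occurrences.
cycle_length_freq <- function(x, category) {
  pos <- which(x == category)  # time indices where the category occurs
  table(diff(pos))             # how often each gap length appears
}

x <- factor(c("a", "c", "a", "a", "g", "t", "a", "c", "a"))
freq <- cycle_length_freq(x, "a")
# Category 'a' occurs at times 1, 3, 4, 7, 9, so the gaps are 2, 1, 3, 2
```

Passing freq to barplot() gives a display in the spirit of the pattern histogram.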

Value

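If plot = TRUE (default), the pattern histogram. Otherwise, the frequencies of cycle lengths associated with the selected category.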

Author(s)

Ángel López-Oriona, José A. Vilar

References

Weiß CH (2008). “Visual analysis of categorical time series.” Statistical Methodology, 5(1), 56–71.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series == 1), ]
# Constructing the pattern histogram for the first CTS in dataset
# GeneticSequences concerning the category 'a'
ph <- plot_ph(sequence_1, category = 'a')
# Obtaining the frequencies of cycle lengths
cycle_lengths <- plot_ph(sequence_1, category = 'a', plot = FALSE)

Constructs the rate evolution graph for a categorical time series

Description

plot_reg constructs the rate evolution graph proposed by Ribler (1997).

Usage

plot_reg(
  series,
  title = "Rate evolution graph",
  linear_fit = FALSE,
  cat_res = NULL,
  ...
)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

title

The title of the graph.

linear_fit

Logical. If TRUE, the corresponding least squares lines are added to the graph.

cat_res

If this parameter is set to one of the categories of the series, then the function returns a graph of residuals for the linear model associated with that category. Default is NULL.

...

Additional parameters for the function.

Details

Given a CTS of length $T$ with range $\mathcal{V}=\{1, 2, \ldots, r\}$, $\overline{X}_t=\{\overline{X}_1,\ldots, \overline{X}_T\}$, and the corresponding binarized time series, $\overline{\boldsymbol Y}_t=\{\overline{\boldsymbol Y}_1, \ldots, \overline{\boldsymbol Y}_T\}$, the function constructs the rate evolution graph. Specifically, consider the series of cumulated sums given by $\overline{\boldsymbol C}_t=\{\overline{\boldsymbol C}_1, \ldots, \overline{\boldsymbol C}_T\}$, with $\overline{\boldsymbol C}_k=\sum_{s=1}^{k}\overline{\boldsymbol Y}_s$, $k=1,\ldots,T$. The rate evolution graph displays a standard time series plot for each one of the components of $\overline{\boldsymbol C}_t$ simultaneously in one graph.
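The cumulated sums behind the graph can be reproduced in a few lines of base R. The toy series and the object names Y and C below are chosen for illustration only and do not come from the package.

```r
# Toy categorical series with four categories
x <- factor(c("a", "c", "a", "g", "a", "t", "c", "a"),
            levels = c("a", "c", "g", "t"))

# Binarized series: rows are time points, columns are categories
Y <- outer(as.integer(x), seq_len(nlevels(x)), "==") * 1
colnames(Y) <- levels(x)

# Cumulated sums C_k = Y_1 + ... + Y_k, computed column by column
C <- apply(Y, 2, cumsum)
```

Each column of C is a nondecreasing step function whose slope reflects how often the category occurs; matplot(C, type = "s") draws all of them in one graph.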

Value

The rate evolution graph.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Ribler RL (1997). Visualizing categorical time series data with applications to computer and communications network traces. Ph.D. thesis, Virginia Polytechnic Institute and State University.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series == 1), ]
# Constructing the rate evolution graph for the first time series
# in dataset GeneticSequences
reg <- plot_reg(sequence_1)

Represents the spectral envelope of a categorical time series

Description

plot_se represents the spectral envelope of a categorical time series.

Usage

plot_se(series)

Arguments

series

An object of type tsibble (see R package tsibble), whose column named Value contains the values of the corresponding CTS. This column must be of class factor and its levels must be determined by the range of the CTS.

Details

The function represents the spectral envelope of a categorical time series.

Value

Returns a plot of the spectral envelope.

Author(s)

Ángel López-Oriona, José A. Vilar

References

Stoffer DS, Tyler DE, McDougall AJ (1993). “Spectral analysis for categorical time series: Scaling and the spectral envelope.” Biometrika, 80(3), 611–622.

Examples

sequence_1 <- GeneticSequences[which(GeneticSequences$Series==1),]
plot_se(sequence_1)
# Representing the spectral envelope for the first series in dataset
# GeneticSequences

ProteinSequences

Description

Categorical time series (CTS) of protein sequences from different species.

Usage

data(ProteinSequences)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 40 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 40 time series taking 20 categorical values (amino-acids). The column Class is formed by integers from 1 to 4, indicating that there are 4 different classes in the database. Each class is associated with a different family of viruses. For more information, see López-Oriona et al. (2023).

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.


SleepStages

Description

Categorical time series (CTS) of sleep stages from different subjects.

Usage

data(SleepStages)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 62 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 62 time series taking six categorical values (sleep stages). The column Class is formed by the integers 1 and 2, indicating that there are 2 different classes in the database. Each class is associated with a sleep disorder (class 1 refers to nocturnal frontal lobe epilepsy, while class 2 refers to REM behavior disorder). For more information, see López-Oriona et al. (2023).

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.


SyntheticData1

Description

Synthetic dataset containing 80 CTS generated from four different generating processes.

Usage

data(SyntheticData1)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 80 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 80 time series of length 600 taking three categorical values. Series 1-20, 21-40, 41-60 and 61-80 were generated from Markov chains with different matrices of transition probabilities (see Scenario 1 in López-Oriona et al. (2023)). Therefore, there are 4 different classes in the dataset.
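To make the generating mechanism concrete, the following base R sketch simulates one three-category Markov chain of length 600. The helper name simulate_mc, the uniform initial distribution and the transition matrix P are hypothetical; the matrices actually used for the dataset are those of Scenario 1 in the reference and are not reproduced here.

```r
# Hypothetical helper: simulate a first-order Markov chain with
# transition matrix P (rows sum to 1) and initial distribution init.
simulate_mc <- function(n, P, init = rep(1 / nrow(P), nrow(P))) {
  stopifnot(nrow(P) == ncol(P), all(abs(rowSums(P) - 1) < 1e-8))
  states <- integer(n)
  states[1] <- sample.int(nrow(P), 1, prob = init)
  for (t in seq_len(n)[-1]) {
    # Next state is drawn from the row of P indexed by the current state
    states[t] <- sample.int(nrow(P), 1, prob = P[states[t - 1], ])
  }
  factor(states, levels = seq_len(nrow(P)))
}

set.seed(123)
# Illustrative transition matrix (an assumption, not the one in the paper)
P <- matrix(c(0.6, 0.2, 0.2,
              0.2, 0.6, 0.2,
              0.2, 0.2, 0.6), nrow = 3, byrow = TRUE)
x <- simulate_mc(600, P)
```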

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.


SyntheticData2

Description

Synthetic dataset containing 80 CTS generated from four different generating processes.

Usage

data(SyntheticData2)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 80 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 80 time series of length 600 taking three categorical values. Series 1-20, 21-40, 41-60 and 61-80 were generated from Hidden Markov Models with different matrices of transition and emission probabilities (see Scenario 2 in López-Oriona et al. (2023)). Therefore, there are 4 different classes in the dataset.
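A Hidden Markov Model of this kind can be sketched in base R as follows. The helper name simulate_hmm, the number of hidden states and the matrices A (transition) and B (emission) are hypothetical; the settings of Scenario 2 in the reference are not reproduced here.

```r
# Hypothetical helper: simulate observations from a Hidden Markov Model.
# A: transition matrix of the hidden chain; B: emission matrix whose row h
# gives the distribution of the observed category in hidden state h.
simulate_hmm <- function(n, A, B, init = rep(1 / nrow(A), nrow(A))) {
  stopifnot(nrow(A) == ncol(A), nrow(B) == nrow(A),
            all(abs(rowSums(A) - 1) < 1e-8),
            all(abs(rowSums(B) - 1) < 1e-8))
  hidden <- integer(n)
  obs <- integer(n)
  for (t in seq_len(n)) {
    prob <- if (t == 1) init else A[hidden[t - 1], ]
    hidden[t] <- sample.int(nrow(A), 1, prob = prob)
    obs[t] <- sample.int(ncol(B), 1, prob = B[hidden[t], ])
  }
  factor(obs, levels = seq_len(ncol(B)))
}

set.seed(123)
# Illustrative matrices (assumptions, not the ones in the paper)
A <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
B <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
x <- simulate_hmm(600, A, B)
```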

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.


SyntheticData3

Description

Synthetic dataset containing 80 CTS generated from four different generating processes.

Usage

data(SyntheticData3)

Format

A tsibble with four columns, which are:

Value

The categorical values of the time series in the dataset.

Series

Integer values indicating the considered time series (there are 80 time series in the dataset).

Time

Integer values indicating the temporal indexes of the observations.

Class

Integer values indicating the class of each time series.

Details

The column Value is the concatenation of 80 time series of length 600 taking three categorical values. Series 1-20, 21-40, 41-60 and 61-80 were generated from NDARMA processes with different orders and vectors of coefficients (see Scenario 3 in López-Oriona et al. (2023)). Therefore, there are 4 different classes in the dataset.
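For readers unfamiliar with NDARMA processes, the sketch below simulates one in base R. At each time point exactly one source is selected at random: one of the last p observations (with probabilities phi) or one of the innovations at lags 0 to q (with probabilities beta). The helper name simulate_ndarma, the burn-in length, the marginal distribution and the coefficients are hypothetical; the orders and coefficients of Scenario 3 in the reference are not reproduced here.

```r
# Hypothetical helper: simulate an NDARMA(p, q) process.
# phi: weights of lags 1..p of the series; beta: weights of lags 0..q of
# the i.i.d. innovations. All weights together must sum to 1.
simulate_ndarma <- function(n, marginal, phi = numeric(0), beta = 1,
                            burn = 100) {
  p <- length(phi)
  q <- length(beta) - 1
  stopifnot(q >= 0, abs(sum(phi) + sum(beta) - 1) < 1e-8)
  m <- max(p, q)
  total <- n + burn + m
  r <- length(marginal)
  eps <- sample.int(r, total, replace = TRUE, prob = marginal)
  x <- eps  # the first m values are initialised with innovations
  for (t in (m + 1):total) {
    # Pick one source: j <= p copies X_{t-j}, otherwise eps at lag j - p - 1
    j <- sample.int(p + q + 1, 1, prob = c(phi, beta))
    x[t] <- if (j <= p) x[t - j] else eps[t - (j - p - 1)]
  }
  factor(tail(x, n), levels = seq_len(r))
}

set.seed(123)
# Illustrative DAR(1) case: copy the previous value with probability 0.7
x <- simulate_ndarma(600, marginal = c(0.3, 0.3, 0.4),
                     phi = 0.7, beta = 0.3)
```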

References

López-Oriona Á, Vilar JA, D’Urso P (2023). “Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences.” Information Sciences, 624, 467–492.