clustomit {clusteval}

    ClustOmit - Cluster Stability Evaluation via Cluster Omission
    Package: clusteval
    Version: 0.1

    Description

    We provide an implementation of the ClustOmit statistic, which is an approach to evaluating the stability of a clustering determined by a clustering algorithm. As discussed by Hennig (2007), arguably a stable clustering is one in which a perturbation of the original data should yield a similar clustering. However, if a perturbation of the data yields a large change in the clustering, the original clustering is considered unstable. The ClustOmit statistic provides an approach to detecting instability via a stratified, nonparametric resampling scheme. We determine the stability of the clustering via the similarity statistic specified (by default, the Jaccard coefficient).

    Usage

    clustomit(x, num_clusters, cluster_method,
              similarity = c("jaccard", "rand"),
              weighted_mean = TRUE, num_reps = 50,
              num_cores = getOption("mc.cores", 2), ...)

    Arguments

    x
    data matrix with n observations (rows) and p features (columns)
    num_clusters
    the number of clusters to find with the clustering algorithm specified in cluster_method
    cluster_method
    a character string or a function specifying the clustering algorithm that will be used. The method specified is matched with the match.fun function. The function given should return only clustering labels for each observation in the matrix x.
    similarity
    the similarity statistic that is used to compare the original clustering (after a single cluster and its observations have been omitted) to its resampled counterpart. Currently, we have implemented the Jaccard and Rand similarity statistics and use the Jaccard statistic by default.
    weighted_mean
    logical value. Should the aggregate similarity score for each bootstrap replication be weighted by the number of observations in each of the observed clusters? By default, yes (i.e., TRUE).
    num_reps
    the number of bootstrap replicates to draw for each omitted cluster
    num_cores
    the number of cores to use. If 1 core is specified, then lapply is used without parallelization. See the mc.cores argument in mclapply for more details.
    ...
    additional arguments passed to the function specified in cluster_method
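To make the two options for similarity concrete, the following base R sketch computes the Jaccard and Rand statistics from pairwise co-memberships. It assumes the usual pair-counting definitions; the helper names (comembership_counts, jaccard_sketch, rand_sketch) are hypothetical and this is not the package's internal code.

```r
# Count how pairs of observations are treated by two clusterings.
# Hypothetical helper for illustration; not part of clusteval.
comembership_counts <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  # TRUE where a pair of observations shares a cluster, over all pairs i < j
  same1 <- outer(labels1, labels1, "==")
  same2 <- outer(labels2, labels2, "==")
  upper <- upper.tri(same1)
  same1 <- same1[upper]
  same2 <- same2[upper]
  c(n11 = sum(same1 & same2),    # together in both clusterings
    n10 = sum(same1 & !same2),   # together only in the first
    n01 = sum(!same1 & same2),   # together only in the second
    n00 = sum(!same1 & !same2))  # apart in both
}

# Jaccard: agreeing "together" pairs over pairs together in at least one clustering
jaccard_sketch <- function(labels1, labels2) {
  n <- comembership_counts(labels1, labels2)
  unname(n["n11"] / (n["n11"] + n["n10"] + n["n01"]))
}

# Rand: all agreeing pairs over all pairs
rand_sketch <- function(labels1, labels2) {
  n <- comembership_counts(labels1, labels2)
  unname((n["n11"] + n["n00"]) / sum(n))
}

a <- c(1, 1, 2, 2, 3)
b <- c(1, 1, 2, 3, 3)
jaccard_sketch(a, b)  # 1/3
rand_sketch(a, b)     # 0.8
```

Both statistics depend only on which observations are grouped together, so they are unaffected by a relabeling of the clusters. The Jaccard statistic ignores pairs that are apart in both clusterings, which is why it is lower than the Rand statistic in this small example.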

    Details

    To compute the ClustOmit statistic, we first cluster the data given in x into num_clusters clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn and all of the observations in that cluster. For the omitted cluster, we resample from the remaining observations and cluster the resampled observations into num_clusters - 1 clusters again using the clustering algorithm specified in cluster_method. Next, we compute the similarity between the cluster labels of the original data set and the cluster labels of the bootstrapped sample. We approximate the sampling distribution of the ClustOmit statistic using a stratified, nonparametric bootstrapping scheme and use the apparent variability in the approximated sampling distribution as a diagnostic tool for further evaluation of the proposed clusters. By default, we utilize the Jaccard similarity coefficient in the calculation of the ClustOmit statistic to provide a clear interpretation of cluster assessment. The technical details of the ClustOmit statistic can be found in our forthcoming publication entitled "Cluster Stability Evaluation of Gene Expression Data."
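The procedure described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the toy data, the helper names (cluster_labels, jaccard_pairs, omit_and_resample), and the choice to resample with replacement within each remaining cluster as the "stratified" step are all assumptions made for the example.

```r
set.seed(42)
# Toy data (assumed for illustration): three well-separated 2-D Gaussian groups
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

# Clustering function that returns only the labels
cluster_labels <- function(x, num_clusters) {
  kmeans(x, centers = num_clusters, nstart = 10)$cluster
}

# Jaccard coefficient over pairs of observations
jaccard_pairs <- function(l1, l2) {
  pairs <- upper.tri(matrix(0, length(l1), length(l1)))
  s1 <- outer(l1, l1, "==")[pairs]
  s2 <- outer(l2, l2, "==")[pairs]
  sum(s1 & s2) / sum(s1 | s2)
}

# Omit one cluster, then bootstrap the remaining observations num_reps times
omit_and_resample <- function(x, obs_labels, omitted, num_reps = 20) {
  keep <- which(obs_labels != omitted)
  strata <- split(keep, obs_labels[keep])  # row indices per remaining cluster
  vapply(seq_len(num_reps), function(r) {
    # Stratified bootstrap: resample with replacement within each stratum
    idx <- unlist(lapply(strata, function(i) {
      i[sample.int(length(i), replace = TRUE)]
    }), use.names = FALSE)
    # Recluster the resampled rows into one fewer cluster than the original
    boot_labels <- cluster_labels(x[idx, , drop = FALSE], length(strata))
    # Compare with the original labels of the same resampled rows
    jaccard_pairs(obs_labels[idx], boot_labels)
  }, numeric(1))
}

obs_labels <- cluster_labels(x, num_clusters = 3)
boot_scores <- lapply(sort(unique(obs_labels)), function(k) {
  omit_and_resample(x, obs_labels, omitted = k)
})
sapply(boot_scores, mean)  # mean similarity for each omitted cluster
```

Scores close to 1 with little spread across replicates suggest that the remaining clusters are recovered regardless of which cluster is omitted; a wide spread would flag the clustering for closer inspection.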

    The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. The guidelines from Fisher and Van Ness (1971) establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology to evaluate the cluster omission admissibility condition from Fisher and Van Ness (1971).

    We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least two arguments:

    x
    matrix of observations to cluster
    num_clusters
    the number of clusters to find
    ...
    additional arguments that can be passed on

    Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
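As a further illustration of the interface described above, here is a hedged sketch of a wrapper around hierarchical clustering that accepts x, num_clusters, and ..., and returns only the labels. The name hclust_wrapper and its linkage and distance parameters are hypothetical choices for this example; they are not part of clusteval.

```r
# Wrapper around stats::hclust that returns one integer label per row of x.
# 'linkage' and 'distance' are illustrative parameter names; '...' is
# forwarded to dist() (for example, p for the Minkowski distance).
hclust_wrapper <- function(x, num_clusters, linkage = "average",
                           distance = "euclidean", ...) {
  tree <- hclust(dist(x, method = distance, ...), method = linkage)
  as.integer(cutree(tree, k = num_clusters))
}

x <- as.matrix(iris[, 1:4])
labels <- hclust_wrapper(x, num_clusters = 3, linkage = "complete")
table(labels)

# With clusteval installed, the wrapper could presumably be supplied as
# cluster_method, with extra arguments forwarded through '...':
# clustomit(x = x, num_clusters = 3, cluster_method = hclust_wrapper,
#           linkage = "complete", num_cores = 1)
```

Distinct parameter names are used for the linkage and the distance because both hclust and dist have an argument called method, and forwarding a bare method through ... would be ambiguous.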

    Values

    object of class clustomit, which contains a named list with elements

    boot_aggregate:
    vector of the aggregated similarity statistics for each bootstrap replicate
    boot_similarity:
    list containing the bootstrapped similarity scores for each cluster omitted
    obs_clusters:
    the clustering labels determined for the observations in x
    num_clusters:
    the number of clusters found
    similarity:
    the similarity statistic used for comparison between the original clustering and the resampled clusterings

    References

    Fisher, L. and Van Ness, J. (1971), Admissible Clustering Procedures, _Biometrika_, 58, 1, 91-104.

    Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320

    Examples

    library(clusteval)

    # First, we create a wrapper function for the K-means clustering algorithm
    # that returns only the clustering labels for each observation (row) in x.
    kmeans_wrapper <- function(x, num_clusters, num_starts = 10, ...) {
      kmeans(x = x, centers = num_clusters, nstart = num_starts, ...)$cluster
    }

    # For this example, we generate five multivariate normal populations with the
    # sim_data function.
    x <- sim_data("normal", delta = 1.5, seed = 42)$x

    # cluster_method may be given as a character string ...
    clustomit_out <- clustomit(x = x, num_clusters = 4,
                               cluster_method = "kmeans_wrapper", num_cores = 1)
    # ... or as the function itself.
    clustomit_out2 <- clustomit(x = x, num_clusters = 5,
                                cluster_method = kmeans_wrapper, num_cores = 1)

    Documentation reproduced from package clusteval, version 0.1. License: MIT
