﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>EURASIP Journal on Bioinformatics and Systems Biology</title><link>http://www.hindawi.com</link><description>The latest articles from Hindawi Publishing Corporation</description><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright><item><title>Algorithms and Complexity Analyses for Control of Singleton Attractors in Boolean Networks</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/521407</link><description>A Boolean network (BN) is a mathematical model of genetic networks. We propose several algorithms for control of singleton attractors in BN. We theoretically estimate the average-case time complexities of the proposed algorithms, and confirm them by computer experiments. The results suggest the importance of gene ordering. Especially, setting internal nodes ahead yields shorter computational time than setting external nodes ahead in various types of algorithms. We also present a heuristic algorithm which does not look for the optimal solution but for the solution whose computational time is shorter than that of the exact algorithms.</description><Author>Morihiro Hayashida, Takeyuki Tamura, Tatsuya Akutsu, Shu-Qin Zhang, and Wai-Ki Ching</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Inference of Boolean Networks Using Sensitivity Regularization</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/780541</link><description>The inference of genetic regulatory networks from global measurements of gene expressions is an important problem in computational biology. Recent studies suggest that such dynamical molecular systems are poised at a critical phase transition between an ordered and a disordered phase, affording the ability to balance stability and adaptability while coordinating complex macroscopic behavior. 
We investigate whether incorporating this dynamical system-wide property as an assumption in the inference process is beneficial in terms of reducing the inference error of the designed network. Using Boolean networks, for which there are well-defined notions of ordered, critical, and chaotic dynamical regimes as well as well-studied inference procedures, we analyze the expected inference error relative to deviations in the networks' dynamical regimes from the assumption of criticality. We demonstrate that taking criticality into account via a penalty term in the inference procedure improves the accuracy of prediction both in terms of state transitions and network wiring, particularly for small sample sizes.</description><Author>Wenbin Liu, Harri L&amp;#228;hdesm&amp;#228;ki, Edward R. Dougherty, and Ilya Shmulevich</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Using Temporal Correlation in Factor Analysis for Reconstructing Transcription Factor Activities</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/172840</link><description>Two-level gene regulatory networks consist of the transcription factors (TFs) in the top level and their regulated genes in the second level. The expression profiles of the regulated genes are the observed high-throughput data given by experiments such as microarrays. The activity profiles of the TFs are treated as hidden variables as well as the connectivity matrix that indicates the regulatory relationships of TFs with their regulated genes. Factor analysis (FA) as well as other methods, such as the network component algorithm, has been suggested for reconstructing gene regulatory networks and also for predicting TF activities. They have been applied to E. coli and yeast data with the assumption that these datasets consist of identical and independently distributed samples. 
Thus, the main drawback of these algorithms is that they ignore any time correlation existing within the TF profiles. In this paper, we extend previously studied FA algorithms to include time correlation within the transcription factors. At the same time, we consider connectivity matrices that are sparse in order to capture the sparsity present in gene regulatory networks. The TF activity profiles obtained by this approach are significantly smoother than profiles from previous FA algorithms. The periodicities in profiles from yeast expression data become prominent in our reconstruction. Moreover, the strength of the correlation between time points is estimated and can be used to assess the suitability of the experimental time interval.</description><Author>Iosifina Pournara and Lorenz Wernisch</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Gene Regulatory Network Reconstruction Using Conditional Mutual Information</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/253894</link><description>The inference of gene regulatory networks from expression data is an important area of research that provides insight into the inner workings of a biological system. The relevance-network-based approaches provide a simple and easily scalable solution to the understanding of interactions between genes. Up until now, most work based on relevance networks has focused on the discovery of direct regulation using the correlation coefficient or mutual information. However, some of the more complicated interactions such as interactive regulation and coregulation are not easily detected. In this work, we propose a relevance network model for gene regulatory network inference which employs both mutual information and conditional mutual information to determine the interactions between genes. 
For this purpose, we propose a conditional mutual information estimator based on adaptive partitioning which allows us to condition on both discrete and continuous random variables. We provide experimental results that demonstrate that the
proposed regulatory network inference algorithm can provide better performance when the target network contains coregulated and interactively regulated genes.</description><Author>Kuo-Ching Liang and Xiaodong Wang</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Detecting Periodic Genes from Irregularly Sampled Gene Expressions: A Comparison Study</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/769293</link><description>Time series microarray measurements of gene expressions have been exploited to discover genes involved in cell cycles. Due to experimental constraints, most
microarray observations are obtained through irregular sampling. In this paper, three
popular spectral analysis schemes, namely, Lomb-Scargle, Capon, and missing-data
amplitude and phase estimation (MAPES), are compared in terms of their ability
and efficiency to recover periodically expressed genes. Based on in silico experiments for microarray measurements of Saccharomyces cerevisiae, Lomb-Scargle is found to be the most efficacious scheme. 149 genes are then identified to be periodically expressed in the Drosophila melanogaster data set.</description><Author>Wentao Zhao, Kwadwo Agyepong, Erchin Serpedin, and Edward R. Dougherty</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Recovering Genetic Regulatory Networks from Chromatin Immunoprecipitation and Steady-State Microarray Data</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/248747</link><description>Recent advances in high-throughput DNA microarrays and chromatin immunoprecipitation (ChIP) assays have enabled the learning of the structure and functionality of genetic regulatory networks. In light of these heterogeneous data sets, this paper proposes a novel approach for reconstruction of genetic regulatory networks
based on the posterior probabilities of gene regulations. Built within the
framework of Bayesian statistics and computational Monte Carlo techniques, the
proposed approach avoids the dichotomy of classifying gene interactions as either
connected or disconnected, thereby significantly reducing the inference errors. Simulation results corroborate the superior performance of the proposed approach
relative to the existing state-of-the-art algorithms. A genetic regulatory network
for Saccharomyces cerevisiae is inferred based on the published real data sets, and biologically meaningful results are discussed.</description><Author>Wentao Zhao, Erchin Serpedin, and Edward R. Dougherty</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Optimal Constrained Stationary Intervention in Gene Regulatory Networks</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/620767</link><description>A key objective of gene network modeling
is to develop intervention strategies to alter regulatory
dynamics in such a way as to reduce the likelihood of
undesirable phenotypes. Optimal stationary intervention
policies have been developed for gene regulation in the
framework of probabilistic Boolean networks in a number
of settings. To mitigate the possibility of detrimental side
effects, for instance, in the treatment of cancer, it may
be desirable to limit the expected number of treatments
beneath some bound. This paper formulates a general constraint
approach for optimal therapeutic intervention by
suitably adapting the reward function and then applies this
formulation to bound the expected number of treatments.
A mutated mammalian cell cycle is considered as a case
study.</description><Author>Babak Faryabi, Golnaz Vahedi, Jean-Francois Chamberland, Aniruddha Datta, and Edward R. Dougherty</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/235451</link><description>This paper presents a novel feature vector based on physicochemical properties of amino acids for predicting protein structural classes. The proposed method is divided into three stages. First, a discrete time series representation of protein sequences using a physicochemical scale is provided. Next, a wavelet-based time-series technique is proposed for extracting features from the mapped amino acid sequence, and a fixed-length feature vector for classification is constructed. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.</description><Author>Ravi Gupta, Ankush Mittal, and Kuldip Singh</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Which Is Better: Holdout or Full-Sample Classifier Design?</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/297945</link><description>Is it better to design a classifier and estimate its error on the full sample or to design a
classifier on a training subset and estimate its error on the holdout test subset? Full-sample
design provides the better classifier; nevertheless, one might choose holdout with the hope of better error estimation. A conservative criterion to decide the best course is to aim at a classifier whose error is less than a given bound. Then the choice between full-sample and holdout designs depends on which possesses the smaller expected bound. Using this criterion, we examine the choice between holdout and several full-sample error estimators using covariance models and a patient-data model. Full-sample design consistently outperforms holdout design. The relation between the two designs is revealed via a decomposition of the expected bound into the sum of the expected true error and the expected conditional standard deviation of the true error.</description><Author>Marcel Brun, Qian Xu, and Edward R. Dougherty</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Bayesian Hierarchical Model for Estimating Gene Expression Intensity Using Multiple Scanned Microarrays</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/231950</link><description>We propose a method for improving the quality of signal from DNA microarrays by using several scans at varying scanner sensitivities. A Bayesian latent intensity model is introduced for the analysis of such data. The method improves the accuracy with which expressions can be measured in all ranges and extends the dynamic range of measured gene expression at the high end. Our method is generic and can be applied to data from any organism, for imaging with any scanner that allows varying the laser power, and for extraction with any image analysis software. 
Results from a self-self hybridization data set illustrate an improved precision in the estimation of the expression of genes compared to what can be achieved by applying standard methods and using only a single scan.</description><Author>Rashi Gupta, Elja Arjas, Sangita Kulathinal, Andrew Thomas, and Petri Auvinen</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Combining Evidence, Specificity, and Proximity towards the Normalization of Gene Ontology Terms in Text</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/342746</link><description>Structured information provided by manual annotation of proteins with Gene Ontology concepts represents a high-quality reliable data source for the research community. However, a limited scope of proteins is annotated due to the amount of human resources required to fully annotate each individual gene product from the literature. We introduce a novel method for automatic identification of GO terms in natural language text. The method takes into consideration several features: (1) the evidence
for a GO term given by the words occurring in text, (2) the proximity between the
words, and (3) the specificity of the GO terms based on their information content.
The method has been evaluated on the BioCreAtIvE corpus and has been compared to
current state-of-the-art methods. The precision reached 0.34 at a recall of 0.34 for the
identified terms at rank 1. In our analysis, we observe that the identification of GO
terms in the &amp;#8220;cellular component&amp;#8221; subbranch of GO is more accurate than for terms from the other two subbranches. This observation is explained by the average number of words forming the terminology over the different subbranches.</description><Author>S. Gaudan, A. Jimeno Yepes, V. Lee, and D. Rebholz-Schuhmann</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Inference of Gene Regulatory Networks Based on a Universal Minimum Description Length</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2008/482090</link><description>The Boolean network paradigm is a simple and effective way to interpret genomic systems, but discovering the structure of these networks remains a difficult task. The minimum description length (MDL) principle has already been used for inferring genetic regulatory networks from time-series expression data and has proven useful for recovering the directed connections in Boolean networks. However, the existing method uses an ad hoc measure of description length that necessitates a tuning parameter for artificially balancing the model and error costs and, as a result, directly conflicts with the MDL principle's implied universality. In order to surpass this difficulty, we propose a novel MDL-based method in which the description length is a theoretical measure derived from a universal normalized maximum likelihood model. The search space is reduced by applying an implementable analogue of Kolmogorov&amp;#39;s structure function. The performance of the proposed method is demonstrated on random synthetic networks, for which it is shown to improve upon previously published network inference algorithms with respect to both speed and accuracy. 
Finally, it is applied to time-series Drosophila gene expression measurements.</description><Author>John Dougherty, Ioan Tabus, and Jaakko Astola</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Information Theoretic Methods for Bioinformatics</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/79128</link><description /><Author>Jorma Rissanen, Peter Gr&amp;#252;nwald, Jukka Heikkonen, Petri Myllym&amp;#228;ki, Teemu Roos, and Juho Rousu</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>I&amp;#x03BA;B, NF-&amp;#x03BA;B Regulation Model: Simulation Analysis of Small Number of Molecules</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/25250</link><description>The regulation of I&amp;#x03BA;B, NF-&amp;#x03BA;B is of foremost interest in biology as the transcription
factor NF-&amp;#x03BA;B has multiple target genes. We have recast a previously published model by Hoffmann et al. (2002) of I&amp;#x03BA;B, NF-&amp;#x03BA;B mathematically as discrete reaction systems. We
have used a stochastic algorithm to compare the results when large and small
numbers of molecules are available in a finite volume for each protein. Our results for a small
number of molecules show that with continuous presence of stimulation, nuclear NF-&amp;#x03BA;B oscillates continuously in every individual cell rather than damping, which was observed in cell population results. This characteristic of the system is missed when averaged behavior is studied.</description><Author>Anamika Sarkar, Marina Meila, and Robert B. Franza</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/90947</link><description>Typical problems in bioinformatics involve large discrete datasets. Therefore, in order
to apply statistical methods in such domains, it is important to develop efficient algorithms
suitable for discrete data. The minimum description length (MDL) principle is a theoretically
well-founded, general framework for performing statistical inference. The mathematical
formalization of MDL is based on the normalized maximum likelihood (NML) distribution,
which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.</description><Author>Petri Kontkanen, Hannes Wettig, and Petri Myllym&amp;#228;ki</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Aligning Sequences by Minimum Description Length</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/72936</link><description>This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code
length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.</description><Author>John S. Conery</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/13853</link><description>Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Coupled with the complexity underlying tissue-specific gene expression, there are several motifs that are putatively responsible for expression in a certain cell type. This has important implications in understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine
its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. Firstly, we propose the use of directed information for such classification constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity of any sequence of interest. Such analysis yields several novel interesting motifs that merit further experimental characterization. Furthermore, this approach leads
to a principled framework for the prospective examination of any chosen motif as a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable the large-scale investigation of the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.</description><Author>Arvind Rao, Alfred O. Hero III, David J. States, and James Douglas Engel</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Extraction of Protein Interaction Data: A Comparative Analysis of Methods in Use</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/53096</link><description>Several natural language processing tools, both commercial and freely available, are used to extract protein interactions from publications. Methods used by these tools range from pattern matching to dynamic programming, each with its own recall and precision rates. A methodical survey of these tools, keeping in mind the minimum interaction information a researcher would need, in comparison to manual analysis has not been carried out. We compared data generated using some of the selected NLP tools with manually curated protein interaction data (PathArt and IMaps) to determine the recall and precision rates. The rates were found to be lower than the published scores when a normalized definition for interaction is considered. Each data point captured wrongly or not picked up by the tool was analyzed. Our evaluation brings forth critical failures of NLP tools and provides pointers for the development of an ideal NLP tool.</description><Author>Hena Jose, Thangavel Vadivukarasi, and Jyothi Devakumar</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/14741</link><description>Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5&amp;#x2032; untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI&amp;#39;s combined DNA index system (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats&amp;#8212;an application of importance in genetic profiling.</description><Author>Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama, and Wojciech Szpankowski</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Question Processing and Clustering in INDOC: A Biomedical Question 
Answering System</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/28576</link><description>The exponential growth in the volume of publications in the biomedical domain has made it impossible for an individual to keep pace with the advances. Even though evidence-based medicine has gained wide acceptance, physicians are unable to access the relevant information in the required time, leaving most of the questions unanswered. This accentuates the need for fast and accurate biomedical question answering systems. In this paper, we introduce INDOC&amp;#8212;a biomedical question answering system based on novel ideas of indexing and extracting the answer to the questions posed. INDOC displays the results in clusters to help the user arrive at the most relevant set of documents quickly. Evaluation was done against the standard OHSUMED test collection. Our system achieves high accuracy and minimizes user effort.</description><Author>Parikshit Sondhi, Purushottam Raj, V. Vinod Kumar, and Ankush Mittal</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Compressing Proteomes: The Relevance of Medium Range Correlations</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/60723</link><description>We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at a short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained
in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.</description><Author>Dario Benedetto, Emanuele Caglioti, and Claudia Chica</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/43670</link><description>We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.</description><Author>Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin, and Andrew S. 
Torres</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/38473</link><description>The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous
studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional
settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples,  both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the
variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.</description><Author>Blaise Hanczar, Jianping Hua, and Edward R. Dougherty</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Genome-Wide Analysis of Intergenic Regions of Mycobacterium tuberculosis H37Rv Using Affymetrix GeneChips</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/23054</link><description>Sequencing the complete genome of Mycobacterium tuberculosis H37Rv is a major milestone in the genome project and it sheds new light in our fight with tuberculosis. The genome contains around 4000 genes (protein-coding sequences)
in the original genome annotation. A subsequent reannotation of the genome has added 80 more genes. However, we have found that the intergenic regions can exhibit expression signals, as evidenced by microarray hybridization. It is then reasonable to suspect that there are unidentified genes in these regions. We conducted a genome-wide analysis using the Affymetrix GeneChip to explore genes contained in the intergenic sequences of the M. tuberculosis H37Rv genome. A working criterion for potential protein-coding genes was based on bioinformatics, consisting of the gene structure, protein-coding potential, and presence of ortholog evidence. The bioinformatics criteria in conjunction with transcriptional evidence revealed potential genes with a specific function, such as a DNA-binding protein in the CopG family and a nickel-binding GTPase, as well as hypothetical proteins that had not been reported in the H37Rv genome. This study further demonstrated that microarray-based transcriptional evidence would facilitate genome-wide gene finding, and is also the first report concerning intergenic expression in the M. tuberculosis genome.</description><Author>Li M. Fu and Thomas M. Shinnick</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/87356</link><description>We investigate methods of estimating residue correlation within protein sequences. We begin by using mutual information (MI) of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally,
in family classification experiments, the modeling power of MIV was shown to be significantly better than the classic MI method, reaching the level where proteins can be classified without alignment information.</description><Author>Chris Hemmerich and Sun Kim</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Computational Methods for Estimation of Cell Cycle Phase Distributions of Yeast Cells</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/46150</link><description>Two computational methods for estimating the cell cycle phase distribution of
a budding yeast (Saccharomyces cerevisiae) cell population are presented. The first one is a nonparametric method that is based on the analysis of DNA content in the individual cells of the population. The DNA content is measured with a
fluorescence-activated cell sorter (FACS). The second method is based on budding
index analysis. An automated image analysis method is presented for the task
of detecting the cells and buds. The proposed methods can be used to obtain
quantitative information on the cell cycle phase distribution of a budding yeast
S. cerevisiae population. They therefore provide a solid basis for obtaining the complementary information needed in deconvolution of gene expression data. As a
case study, both methods are tested with data that were obtained in a time series
experiment with S. cerevisiae. The details of the time series experiment as well as the image and FACS data obtained in the experiment can be found in the online
additional material at http://www.cs.tut.fi/sgn/csb/yeastdistrib/.</description><Author>Antti Niemist&amp;#246;, Matti Nykter, Tommi Aho, Henna Jalovaara, Kalle Marjanen, Miika Ahdesm&amp;#228;ki, Pekka Ruusuvuori, Mikko Tiainen, Marja-Leena Linne, and Olli Yli-Harja</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Genetic Regulatory Networks</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/17321</link><description /><Author>Edward R. Dougherty, Tatsuya Akutsu, Paul Dan Cristea, and Ahmed H. Tewfik</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Gene Systems Network Inferred from Expression Profiles in Hepatocellular Carcinogenesis by Graphical Gaussian Model</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/47214</link><description>Hepatocellular carcinoma (HCC) in a liver with advanced-stage chronic hepatitis C (CHC) is induced by hepatitis C virus, which chronically infects about 170 million people worldwide. To elucidate the associations between gene groups in hepatocellular carcinogenesis, we analyzed the profiles of the genes characteristically expressed in the CHC and HCC cell stages by a statistical method for inferring the network between gene systems based on the graphical Gaussian model. A systematic evaluation of the inferred network in terms of the biological knowledge revealed that the inferred network was strongly involved in the known gene-gene interactions with high significance (P&amp;#x003C;10&amp;#x2212;4), and that the clusters characterized by different cancer-related responses were associated with those of the gene groups related to metabolic pathways and morphological events. 
Although some relationships in the network remain to be interpreted, the analyses revealed a snapshot of the orchestrated expression of cancer-related groups and some pathways related to metabolism and morphological events in hepatocellular carcinogenesis, and thus provide possible clues on the disease mechanism and insights that address the gap between molecular and clinical assessments.</description><Author>Sachiyo Aburatani, Fuyan Sun, Shigeru Saito, Masao Honda, Shu-ichi Kaneko, and Katsuhisa Horimoto</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. All rights reserved.</copyright></item><item><title>Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/61374</link><description>G + C composition at the third codon position (GC3) is widely reported to be correlated with synonymous codon usage bias. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we applied Shannon entropy from information theory to measure the degree of GC3 bias and that of synonymous codon usage bias of each gene. The strength of the correlation of GC3 with synonymous codon usage bias, quantified by a correlation coefficient, varied widely among bacterial genomes, ranging from &amp;#x2212;0.07 to 0.95. Previous analyses suggesting that the relationship between GC3 and synonymous codon usage bias is independent of species are thus inconsistent with the more detailed analyses obtained here for individual species.</description><Author>Haruo Suzuki, Rintaro Saito, and Masaru Tomita</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item><item><title>Gene Selection for Multiclass Prediction by Weighted Fisher Criterion</title><link>http://www.hindawi.com/GetArticle.aspx?doi=10.1155/2007/64628</link><description>Gene expression profiling has been widely used to study molecular signatures of many diseases and to develop molecular diagnostics for disease prediction. Gene selection, as an important step for improved diagnostics, screens tens of thousands of genes and identifies a small subset that discriminates between disease types. A two-step gene selection method is proposed to identify informative gene subsets for accurate classification of multiclass phenotypes. In the first step, individually discriminatory genes (IDGs) are identified by using a one-dimensional weighted Fisher criterion (wFC). In the second step, jointly discriminatory genes (JDGs) are selected by sequential search methods, based on their joint class separability measured by a multidimensional weighted Fisher criterion (wFC). The performance of the selected gene subsets for multiclass prediction is evaluated by artificial neural networks (ANNs) and/or support vector machines (SVMs). By applying the proposed IDG/JDG approach to two microarray studies, that is, small round blue cell tumors (SRBCTs) and muscular dystrophies (MDs), we successfully identified a much smaller yet efficient set of JDGs for diagnosing SRBCTs and MDs with high prediction accuracies (96.9&amp;#37; for SRBCTs and 92.3&amp;#37; for MDs, respectively). These experimental results demonstrated that the two-step gene selection method is able to identify a subset of highly discriminative genes for improved multiclass prediction.</description><Author>Jianhua Xuan, Yue Wang, Yibin Dong, Yuanjian Feng, Bin Wang, Javed Khan, Maria Bakay, Zuyi Wang, Lauren Pachman, Sara Winokur, Yi-Wen Chen, Robert Clarke, and Eric Hoffman</Author><copyright>&amp;#169; 2008, Hindawi Publishing Corporation. 
All rights reserved.</copyright></item></channel></rss>