DATASET (Gene) selection: Gene expression levels in ASD



Autism Datasets:

The experimental data used in this article were downloaded from the well-known
public repository Gene Expression Omnibus (GEO). Raw expression data generated
by the providers of three human ASD studies (GEO accession numbers GSE28475,
GSE38322 and GSE28521) were downloaded. Only the expression profiles from human
ASD cortex were used; these are presented in the table. Cortex samples are more
numerous than other sample types, such as cerebellum, in GSE38322 and GSE28521.
The focus will therefore be on the cortex samples from all three datasets, and
the other available samples will not be considered.




Study                      Tissue type        Number of samples
Chow et al.
Ginsberg et al. (2012)
Voineagu et al. (2011)
Voineagu et al. (2011)     Temporal cortex







Pre-processing of the data:

To clean the raw data, the expression profiles must go through pre-processing.
For that purpose, the “lumi” package in R, which is designed for Illumina
BeadChip arrays, will be used. The expression data for GSE is already log2
transformed and normalized; for consistency and comparability, the other raw
datasets will be processed in the same manner. All datasets will undergo
background correction in the lumi package and will be normalized within R.
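
The transform-and-normalize step can be illustrated in a few lines. The sketch below is not the lumi implementation. It assumes a small genes-by-samples matrix of positive raw intensities and applies a log2 transform followed by plain quantile normalization, which is one common choice for bead arrays. The function names, the offset and the toy values are hypothetical.

```python
import math

def log2_transform(matrix, offset=1.0):
    """Log2-transform raw intensities; the offset (an assumption) avoids log(0)."""
    return [[math.log2(value + offset) for value in row] for row in matrix]

def quantile_normalize(matrix):
    """Give every sample (column) the same empirical distribution.

    matrix is a list of gene rows, each holding one value per sample.
    Ties are broken by row order, which is a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    # Mean of the k-th smallest value across samples, for every rank k.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [sum(col[k] for col in sorted_columns) / n_samples
                  for k in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            normalized[g][s] = rank_means[rank]
    return normalized

# Toy matrix: 4 genes (rows) by 3 samples (columns), made-up intensities.
raw = [[120.0, 200.0, 90.0],
       [15.0, 30.0, 10.0],
       [900.0, 1500.0, 700.0],
       [60.0, 80.0, 55.0]]
normalized = quantile_normalize(log2_transform(raw))
```

After this step every sample column contains the same set of values, so differences between samples reflect gene ranks rather than overall array brightness.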


Feature (Gene) selection:

Gene expression levels in ASD fluctuate considerably among the datasets, and
since the sequences of several of these genes are highly variable, it is
difficult to identify the genes that are most relevant to autism. Feature
subset selection works by removing features that are irrelevant or redundant.
In a large sample, if all the gene expression levels in a dataset are
considered, noise (“the error in the variance of a measured variable”) builds
up. This results in noisy and complex data that are difficult to classify. In
addition, genes that show similar expression in all the datasets are not
considered useful, since they show no differentiation. To reduce noise, similar
genes are eliminated, because they carry high variance that will affect the
mean and median expression values of nearby genes and thus the subsequent
feature selection steps. Using the ratio of mean method, the elimination
process will result in a reduced set of genes.

The reduced set of genes will go through selection methods, mainly statistical
filters and wrapper-based algorithms. These methods are not yet decided, but
those listed in the table are expected to be used as feature selection methods
applied to the microarray data.

More than 17,000 genes are extracted from the datasets, and the 500 genes with
the highest variance are expected to be chosen for the next step.
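
A variance filter of this kind is straightforward to sketch. The example below assumes the expression data are held as a mapping from gene name to a list of per-sample values. The gene names, the values and the function name are hypothetical, and k would be 500 in the setting described above.

```python
from statistics import variance

def top_variance_genes(expression, k):
    """Return the k gene names with the highest sample variance."""
    ranked = sorted(expression,
                    key=lambda gene: variance(expression[gene]),
                    reverse=True)
    return ranked[:k]

# Toy data: three genes measured in four samples.
expression = {
    "GENE_A": [7.1, 7.0, 7.2, 7.1],   # nearly constant
    "GENE_B": [5.0, 9.0, 4.5, 9.5],   # strongly variable
    "GENE_C": [6.0, 6.8, 5.9, 7.1],   # moderately variable
}
selected = top_variance_genes(expression, k=2)
```

Here the nearly constant gene is dropped and the two more variable genes are kept, in order of decreasing variance.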





t-test feature selection: finds features with a maximal difference of mean
value between groups and a minimal variability within each group.

Correlation-based feature selection (CFS): finds features that are highly
correlated with the class but are uncorrelated with each other.

Genetic algorithms (GA): find the smaller set of features for which the
optimization criterion (classification accuracy) does not deteriorate.
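
As an illustration of the filter idea, a gene can be scored by the difference of its group means relative to the variability within each group. The sketch below uses the Welch t-statistic for this purpose, which is an assumption, since the final choice of method is still open. The labels "asd" and "control", the gene names and the values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch t-statistic: difference of means scaled by within-group variability.

    Assumes each group has at least two values and is not constant,
    otherwise the standard error would be undefined or zero.
    """
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

def rank_by_t(expression, labels):
    """Rank gene names by |t| between 'asd' and 'control' samples, largest first."""
    scores = {}
    for gene, values in expression.items():
        asd = [v for v, lab in zip(values, labels) if lab == "asd"]
        control = [v for v, lab in zip(values, labels) if lab == "control"]
        scores[gene] = abs(welch_t(asd, control))
    return sorted(scores, key=scores.get, reverse=True)

labels = ["asd", "asd", "asd", "control", "control", "control"]
expression = {
    "GENE_A": [8.1, 8.3, 8.2, 6.0, 6.2, 6.1],   # large shift, low spread
    "GENE_B": [7.0, 5.0, 9.0, 6.5, 7.5, 5.5],   # small shift, high spread
}
ranking = rank_by_t(expression, labels)
```

A gene with a clear shift between groups and little spread within them ranks above a gene whose group means are close and whose values scatter widely.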



Clustering methods:

Clustering is performed to identify natural or inherent groups in a dataset.
It divides the genes into groups that represent subtypes of ASD using measures
of similarity. The cluster analysis consists of two strategies: hierarchical
clustering and the K-means method.
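
For the K-means strategy, a minimal sketch of the standard assign-and-update loop is given here. It assumes points are lists of numbers compared by squared Euclidean distance, and it picks the initial centroids at random from the data with a fixed seed. All names and the toy points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Return (assignment, centroids) after running Lloyd-style K-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initial centroids drawn from the data
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                 # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Two well-separated toy groups of three points each.
points = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
          [10.0, 10.0], [10.4, 9.8], [9.7, 10.3]]
assignment, centroids = k_means(points, k=2)
```

Unlike hierarchical clustering, this method needs the number of clusters k in advance, and its result can depend on the initial centroids when groups are not well separated.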

Hierarchical clustering is one of the most widely used methods. It comes in
agglomerative (bottom-up) and divisive (top-down) forms. In the agglomerative
method, each data point initially forms a cluster of its own. At each step, the
two closest clusters are found and combined into a single cluster. In the
divisive method, all the data points start in a single cluster, which is
recursively split until each cluster contains a single data point.
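
The agglomerative procedure described above can be sketched as follows. The example assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and it stops once a requested number of clusters remains. These choices and all names are illustrative assumptions, not the settings of the study.

```python
from math import dist

def single_linkage(cluster_a, cluster_b, points):
    """Distance between two clusters: the smallest pairwise point distance."""
    return min(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]   # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # combine the closest pair
        del clusters[b]
    return clusters

# Two well-separated toy groups of three points each.
points = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
          [10.0, 10.0], [10.4, 9.8], [9.7, 10.3]]
clusters = agglomerative(points, n_clusters=2)
```

Swapping `min` for `max` in the linkage function would give complete linkage, which shows how the linkage choice enters the procedure.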

The type of clustering algorithm, the distance metric, the type of linkage or
inter-cluster distance (when appropriate), and the number of clusters must be
selected when employing these methods; guidance is taken from the work of
Drăghici S.

