DATASET (Gene) selection: Gene expression levels in ASD

DATASET COLLECTION

 

Autism Datasets:


The experimental data used in this article were downloaded from the public
Gene Expression Omnibus (GEO) repository. Raw expression data generated by the
providers for three human ASD studies (GEO accession numbers GSE28475, GSE38322
and GSE28521) were downloaded. Only the expression profiles from human ASD
cortex are used; these are presented in the table. In GSE38322 and GSE28521,
cortex samples outnumber samples from other tissues such as cerebellum. The
focus is therefore on the cortex samples from all three datasets, and the other
available samples will not be considered.

 

Dataset     Reference                Tissue type        Number of samples (ASD;Control)
GSE28475    Chow et al. (2012)       Cortex             20;22
GSE38322    Ginsberg et al. (2012)   Occipital cortex   4;6
GSE28521    Voineagu et al. (2011)   Frontal cortex     10;15
GSE28521    Voineagu et al. (2011)   Temporal cortex    10;12

 

 

 

METHODS

 

Pre-processing of the data:

To clean the raw data, the expression profiles must be pre-processed. For this
purpose, the “lumi” package in R, which is designed for Illumina BeadChip
arrays, will be used. The expression data for GSE is already log2 transformed
and normalized; for consistency and effective comparability, the other raw
datasets will be processed in the same manner. All datasets will go through
background correction in the lumi package and will be normalized within R.
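The two transformations named above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the lumi implementation: the choice of quantile normalization, the offset of 1 and the toy matrix are all assumptions made for illustration.

```python
import math


def log2_transform(matrix, offset=1.0):
    """Return log2(value + offset) per entry; the offset (assumed) avoids log2(0)."""
    return [[math.log2(v + offset) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Ties are broken by row order, a simplification of the usual tie averaging.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    # For each sample, the gene indices ordered from lowest to highest value.
    order = [
        sorted(range(n_genes), key=lambda g, s=s: matrix[g][s])
        for s in range(n_samples)
    ]
    # Reference distribution: mean across samples of the values at each rank.
    reference = [
        sum(matrix[order[s][rank]][s] for s in range(n_samples)) / n_samples
        for rank in range(n_genes)
    ]
    # Give every gene the reference value that matches its rank in each sample.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        for rank, g in enumerate(order[s]):
            normalized[g][s] = reference[rank]
    return normalized


# Toy raw intensities: 3 genes (rows) x 3 samples (columns), invented values.
raw = [
    [3.0, 7.0, 255.0],
    [15.0, 31.0, 7.0],
    [63.0, 127.0, 1.0],
]
logged = log2_transform(raw)
normalized = quantile_normalize(logged)
```

After normalization every sample column contains the same set of values, so samples from different studies share one distribution and differ only in how genes are ranked within them.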

 

Feature (Gene) selection:

Gene expression levels in ASD fluctuate considerably among the datasets, and
since the sequences of several of these genes are highly variable, it is
difficult to identify the genes that are most relevant to autism. Feature
subset selection works by removing features that are irrelevant or redundant.
In a large sample, if all the gene expression levels in a dataset are
considered, noise, defined as “the error in the variance of a measured
variable”, accumulates. This results in noisy and complex data that are
difficult to classify. In addition, genes that show similar expression in all
the datasets are considered not useful, since they show no differentiation. To
reduce noise, these similar genes are eliminated, because their high variance
would affect the mean and median expression values of nearby genes and thus the
subsequent feature selection steps. Using the ratio-of-means method, this
elimination will result in a reduced set of genes.

The reduced set of genes will then go through selection methods, mainly
statistical filters and wrapper-based algorithms. The methods have not yet been
finalized, but those listed in the table are expected to be used as feature
selection methods on the microarray data.

More than 17,000 genes are extracted from the datasets, and the 500 genes with
the highest variance are expected to be chosen for the next method.
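The variance-based pre-selection can be sketched as follows. This is a minimal Python illustration, assuming the expression data are held as a mapping from gene name to per-sample values; the function name, gene names and values are hypothetical.

```python
from statistics import variance


def top_variance_genes(expression, k=500):
    """Return the names of the k genes with the highest sample variance.

    expression: dict mapping gene name -> list of expression values,
    one value per sample (an assumed layout, for illustration only).
    """
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:k]


# Toy data: three hypothetical genes measured in four samples.
toy = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0],  # almost constant
    "GENE_B": [2.0, 8.0, 3.0, 9.0],  # highly variable
    "GENE_C": [6.0, 7.0, 6.0, 7.0],  # moderately variable
}
selected = top_variance_genes(toy, k=2)
```

With the real data, `k` would stay at 500 and the mapping would hold the 17,000 or so genes mentioned above.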

 

Method                                      Type      Description
t-test feature selection                    Filter    It finds features with a maximal difference of mean value between groups and a minimal variability within each group
Correlation-based feature selection (CFS)   Filter    It finds features that are highly correlated with the class but are uncorrelated with each other
Genetic algorithms (GA)                     Wrapper   They find the smaller set of features for which the optimization criterion (classification accuracy) does not deteriorate
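As an illustration of the first filter in the table, the sketch below ranks genes by the absolute value of a two-sample t-statistic. It assumes Welch's unequal-variance form, which the text does not specify, and all gene names and values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic: difference of group means over its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # No within-group variability: return 0 for equal means, else infinity.
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(a) - mean(b)) / se


def rank_genes_by_t(asd, control):
    """Order genes from most to least discriminative by |t|.

    asd, control: dicts mapping gene name -> expression values in that group.
    """
    scores = {g: abs(welch_t(asd[g], control[g])) for g in asd}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: GENE_A differs clearly between groups, GENE_B does not.
asd = {"GENE_A": [8.0, 8.2, 7.9, 8.1], "GENE_B": [5.0, 7.0, 4.0, 6.0]}
control = {"GENE_A": [6.0, 6.1, 5.9, 6.2], "GENE_B": [5.5, 6.5, 4.5, 6.0]}
ranking = rank_genes_by_t(asd, control)
```

A large |t| corresponds to the description in the table: a large difference between group means relative to the variability within each group.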

 

 

Clustering methods:

Clustering is performed to identify natural or inherent groupings of genes in
a dataset. Using measures of similarity, the clustering divides the genes into
groups that represent subtypes of ASD. The cluster analysis uses two
strategies: hierarchical clustering and the K-means method.
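A minimal sketch of the K-means strategy is given below, assuming Euclidean distance and a simplified initialization that takes the first k points as starting centroids (practical implementations usually use random or k-means++ starts). The two-dimensional points stand in for expression profiles and are invented.

```python
from math import dist


def kmeans(points, k, max_iterations=100):
    """Plain K-means: returns (labels, centroids) for a list of numeric tuples."""
    # Simplified, deterministic initialization (an assumption of this sketch).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy profiles forming two well-separated groups.
profiles = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(profiles, k=2)
```

Unlike hierarchical clustering, K-means needs the number of clusters in advance, and the result can depend on the starting centroids.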

Hierarchical clustering is one of the most widely used methods. It comes in
agglomerative (bottom-up) and divisive (top-down) forms. In the agglomerative
method, each data point initially forms a cluster of its own. At each step, the
two closest clusters are found and combined into a single cluster. In the
divisive method, all the data points start in a single cluster, which is
recursively split until each cluster contains a single data point.
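The agglomerative procedure can be sketched in a few lines. This is an unoptimized Python illustration that assumes Euclidean distance and stops at a requested number of clusters rather than building the full tree; the function name, the linkage options and the toy points are choices made for this example.

```python
from math import dist


def agglomerative(points, n_clusters, linkage="average"):
    """Bottom-up clustering; returns clusters as lists of point indices."""
    # Start with every point in a cluster of its own.
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average linkage

    while len(clusters) > n_clusters:
        # Find the two closest clusters under the chosen linkage.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Combine them into a single cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy profiles forming two well-separated groups.
profiles = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
clusters = agglomerative(profiles, n_clusters=2)
```

Switching `linkage` between "single", "complete" and "average" changes how the distance between two clusters is measured, which is one of the choices that must be fixed before clustering.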

The type of clustering algorithm, the distance metric, the type of linkage or
inter-cluster distance (when appropriate), and the number of clusters must be
selected when employing these methods; guidance is taken from the work of
Drăghici S ().
