A defect is found when the software does not fulfill one of
the properties of its specification [1]. The complexity of software
verification and the pressure to reduce time-to-market
promote the use of automated defect prediction methods.
Delivering a project on time and on budget
urges software project managers to optimize effort within
their limited human resources. Professional code testers and
inspectors should focus on the code entities that are more
error-prone. Here, a code entity may refer to code at different
granularities, such as procedures, classes, packages, or other units of code.
Early detection of defects prior to product release is very
important. The cost of repairing a software flaw increases
exponentially in later phases of the software evolution process.
After software deployment, fixing problems is not only much
more expensive than in the early phases, but it also
has a negative effect on the market
and customer satisfaction. Hence, it is necessary to pay
attention to quality assurance throughout the software
evolution process by means of defect prediction.
Automated software defect prediction has attracted great
attention in recent years and many prediction models have been
proposed for determining defective parts of software based on
machine learning algorithms. Moreover, there exist various
defect prediction approaches in the literature. Cross project
defect prediction has received increasing attention in the past
few years, but many researchers have neglected the
huge amount of data available in version control systems. Usually,
software projects have a long-term version history from which
we can build a repository of metrics, defects, fixes and causes.
Such a dataset is valuable for modeling and analyzing the
characteristics of software development during its evolution.
Normally, software metrics are extracted from the source code
available in version control systems [2]. Furthermore, ITSs
(Issue Tracking Systems) such as Bugzilla contain bug reports
submitted by users. By relating the bug reports in the ITS to the
versioning system, the corresponding defects for each code entity are
determined. These reported bugs are treated as labels, and the
extracted metrics are treated as the feature vector of each code
entity. Then a learner is trained using these samples as the
training set. In the last stage, the trained model can be used to
predict the defectiveness of each software entity whose defect
content is unknown.
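The pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used to build the dataset: the commit-message convention ("Bug 101"), the file names, the two metrics, and the 1-nearest-neighbour learner are all hypothetical and were chosen only to keep the example self-contained. Any classifier could stand in for the learner.

```python
import re
from math import dist

# Hypothetical inputs. Metrics come from the version control system,
# bug IDs from the issue tracking system (ITS).
metrics = {
    "Parser.java": (420.0, 35.0),  # (lines of code, cyclomatic complexity)
    "Lexer.java": (380.0, 30.0),
    "Util.java": (60.0, 4.0),
    "Config.java": (45.0, 3.0),
}
commits = [
    ("Fix for Bug 101: null token", ["Parser.java"]),
    ("Bug 102 - wrong offset", ["Lexer.java", "Parser.java"]),
    ("Refactor helpers", ["Util.java"]),
    ("Bug 999 mentioned but not in tracker", ["Config.java"]),
]
reported_bugs = {101, 102}

# Assumed convention: commit messages mention the ITS report as "Bug <id>".
BUG_ID = re.compile(r"\bBug\s+(\d+)", re.IGNORECASE)


def count_defects(commits, reported_bugs, entities):
    """Link commits to ITS reports; count distinct linked bugs per entity."""
    linked = {name: set() for name in entities}
    for message, files in commits:
        ids = {int(n) for n in BUG_ID.findall(message)} & reported_bugs
        for path in files:
            if path in linked:
                linked[path] |= ids
    return {name: len(ids) for name, ids in linked.items()}


def build_training_set(metrics, defects):
    """Pair each entity's feature vector with a binary defect label."""
    return [(vec, defects[name] > 0) for name, vec in metrics.items()]


def predict(training_set, vec):
    """1-nearest-neighbour stand-in for the trained learner."""
    _, label = min(training_set, key=lambda sample: dist(sample[0], vec))
    return label


defects = count_defects(commits, reported_bugs, metrics)
training_set = build_training_set(metrics, defects)
```

Calling `predict(training_set, (400.0, 33.0))` labels a large, complex entity by its closest neighbour in the training set. In practice the features would be scaled first, since raw lines of code dominate the Euclidean distance here.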
Using the code history of the project itself is more accurate than
using models built from the source code of other projects. Version-based
defect prediction aims to help software managers optimize
effort for the next release of the code. Assigning limited
human resources to code testing, and focusing on defect-prone
modules rather than defect-free modules, is an arduous task.
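As a rough illustration of how a version-based model could guide this allocation, the sketch below trains on one release and ranks the modules of the next release by predicted defect count. The module names, metric values, defect counts, the inspection budget, and the k-nearest-neighbour scoring are all assumptions made up for the example. They are not taken from the dataset or from any particular prediction model.

```python
from math import dist

# Hypothetical previous release:
# module -> ((lines of code, number of changes), known defect count)
release_1 = {
    "core": ((900.0, 40.0), 6),
    "net": ((700.0, 30.0), 4),
    "ui": ((300.0, 8.0), 1),
    "docs": ((100.0, 2.0), 0),
}
# Hypothetical next release: metrics are known, defect counts are not.
release_2 = {
    "core": (950.0, 42.0),
    "net": (550.0, 22.0),
    "ui": (320.0, 10.0),
    "plugin": (120.0, 3.0),
}


def predicted_defects(history, vec, k=2):
    """Mean defect count of the k most similar modules in the old release."""
    nearest = sorted(history.values(), key=lambda s: dist(s[0], vec))[:k]
    return sum(count for _, count in nearest) / k


def rank_for_inspection(history, upcoming, budget):
    """Pick the `budget` modules with the highest predicted defect count."""
    scores = {
        name: predicted_defects(history, vec) for name, vec in upcoming.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget], scores


to_inspect, scores = rank_for_inspection(release_1, release_2, budget=2)
```

With a budget of two modules, `to_inspect` contains the two modules whose nearest neighbours in the previous release had the most defects. A real study would replace the toy scoring function with a trained model and measure how many actual defects fall inside the inspected set.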
Understanding the ways in which information about
previous versions of a software system enhances the ability to
predict defects of source code is an interesting issue in the field
of software quality and defect prediction. To evaluate version-based
defect prediction models, the minimum requirement is
a benchmark dataset with an appropriate variety of code metrics
along with the number of defects in each source code module.
We provide such a baseline by gathering an extensive dataset
composed of several open-source systems. Our dataset contains