Business Accent
Search for the right data
A new generation of data mining algorithms, more affordable
prices for memory, processing, and storage, and an explosion in electronic biomedical
data have pushed many biostatisticians to take a harder look at data mining.
Once scorned as a way to twist and dredge data until it matches the desired
hypothesis, data mining has now proven to be an essential advantage in an increasingly
demanding and competitive market. George Varghese takes a look at data
mining in pharma.
Biostatisticians have often underestimated the potential of data mining algorithms.
However, recent advances have made everyone sit up and take notice of data
mining. To put it simply, data mining is the process of selecting, exploring,
modifying, and modelling large amounts of data to uncover previously unknown
patterns in them for a business advantage. In general, this means recognising
and interpreting unanticipated but valid, potentially useful, and understandable
patterns in a database or data repository.
Data mining sifts through huge amounts of information and extracts previously
unrealised information about molecular compounds, product portfolio, clinical
studies, and customer information that can be applied to accelerate drug discovery
and development. For example, you could match desirable biochemical specifications
with disease and patient characteristics, and select compounds with the highest
probability of a favourable safety and efficacy profile. One can also generate
and explore scientifically interesting new hypotheses, based on multiple markers
and interdependencies.
Pharmaceutical companies are increasingly turning to data mining, as they look
for ways to cut research and development costs, and compress the drug development
cycle, while producing more effective drugs for a wider range of diseases. By
the most conservative estimates, investment in statistical analysis and data
mining tools would have reached nearly $2 billion by 2005 (IDC), representing
a compound annual growth rate of more than 30 percent over five years.
Companies that engage and invest intelligently in data mining will be rewarded
with cost savings in research and development, reductions in drug development
cycle time, and improved returns on labour and capital investment. For example,
a major pharmaceutical research foundation sought a way to extract new intelligence
from a database, containing activity outcome-scores for more than 150,000 chemical
substances and results of more than 1,500 different assays performed on each
substance. With the amount of data they had to analyse, they needed a scalable
system to keep up with their growing database. The scalable and flexible solution
they adopted enabled the manager of medical chemistry, the IT project leader, and their
teams to find correlations between assay outcomes, discern the most discriminating
fragments leading to activity (using decision trees), and identify and derive
lead candidates faster.
Another pharmaceutical company sought a way to analyse large, disparate, external
databases on drug outcomes, faster and more cost-effectively. They needed to
distil specific information on the discriminating characteristics of a target patient
group for a cancer medication from large volumes of detail about hospitalised
patients, treatments, outcomes, and more. Using data mining, researchers were
able to integrate all external databases containing data on drug research and
drug outcomes, create new knowledge on existing and new drugs from that data,
and redistribute the knowledge to relevant departments.
Data mining applications
Data mining adds value for all phases of the drug introduction
process:
Discovery phase: Data mining can be applied to select
molecular compounds with a high propensity to result in a promising
new drug, identify patient and disease factors that may contribute
to knowledge of how to limit and control toxicity, and synthesise
large volumes of pharmacogenomic data.
Clinical trials: Data mining helps identify patients
most likely to benefit from an experimental drug and predicts the probability
of a treated patient experiencing an adverse reaction.
Investigational setting: Data mining can be applied
to produce the most favourable patient outcomes, evaluate drug economics across
multiple factors, and optimise manufacturing and marketing processes.
Data mining has proven particularly useful for researchers engaged in functional
genomics and analysis of gene expression data. This process provides a systematic
method to conduct genome sequence analysis for easy identification of different
genes or gene markers that may suggest why some individuals are at a higher
risk for developing certain diseases or for suffering from negative side effects.
Specialised data mining software can be very useful for exploratory data analysis,
dose-response exploration studies, drug and substance interactions, identifying
population segments for drug side effects, and for generating hypotheses that
can be explored in a subsequent study.
Best practices
A successful discovery data mining solution relies on an orderly process and
framework and the software or hardware architecture to support it. The overall
process should be transparent, standardised, repeatable, re-usable, preserve
statistical integrity, and yield consistent results that can be validated with
new databases. The credibility and usefulness of discovery data mining models
are enhanced if best practices are adopted and can be standardised for a particular
therapeutic drug or device. As a result, data mining steps must be organised
into a logical sequence of activities:
Establish hypothesis: The process flow begins with
the research question. The process should be oriented toward solving discovery
or clinical research questions rather than a data analysis problem. The objective
is not only to search for systematic patterns in the data warehouse but also
to produce solutions that can generate useful predictions or reveal usable information
to support decision-making on a specific discovery issue.
Prepare data: The data mining process must have access
to a data warehouse founded upon a common metadata model: a standardised,
structured setting to logically organise and link together data from disparate
sources, and to provide data sets in analysis-ready form. Ideally, the data
warehouse creates the data set to be mined by pre-processing the data to obtain
metadata information, cleansing the data, and enriching it if necessary.
Sample data to be mined: Although the original data
source could be mined, in many situations (especially when the database is
large), it is more prudent to reduce the entire database to a representative
sample using an appropriate sampling algorithm, such as simple random, stratified,
Nth (systematic), or cluster sampling.
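The four sampling schemes named above can be sketched in a few lines of Python. This is a minimal illustration rather than a production routine: the `compounds` records, the `site` and `active` fields, and all function names are hypothetical and do not come from any particular mining tool.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, seed=0):
    """Draw n records without replacement, each equally likely."""
    return random.Random(seed).sample(records, n)


def nth_sample(records, step, start=0):
    """Systematic ("Nth") sampling: keep every step-th record from start."""
    return records[start::step]


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum (at least one record each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


def cluster_sample(records, cluster_of, n_clusters, seed=0):
    """Pick whole clusters at random and keep every record inside them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_of(record)].append(record)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [record for name in chosen for record in clusters[name]]


# Hypothetical database: 1,000 compounds screened at five sites,
# with every tenth compound flagged as active.
compounds = [
    {"id": i, "site": f"site{i % 5}", "active": i % 10 == 0}
    for i in range(1000)
]

random_100 = simple_random_sample(compounds, 100)
every_10th = nth_sample(compounds, 10)
by_activity = stratified_sample(compounds, lambda c: c["active"], 0.1)
two_sites = cluster_sample(compounds, lambda c: c["site"], 2)
```

In this toy data every tenth compound is active, so `nth_sample` with a step of 10 returns only active compounds. That is a reminder that Nth sampling can mislead when the database has a periodic ordering, whereas the stratified sample keeps the original one-in-ten proportion.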
Create data partitions: Segregating the data mining
data set into training, validation, and test samples provides one way to validate
and compare the results of different statistical models. The training data set
is used to train the model for learning and discovering the underlying patterns
in the data. The validation data set is used to validate the model built with
the training data set, and the test data set is used to test the trained and
validated models for external validation and generalisation to new data sets.
Data partitioning can be accomplished by simple random sampling, stratified
or purposeful sampling, or user-defined sampling, whereby the user specifies
the partition variable and variable values.
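As a rough sketch of the partitioning step, the Python below shows a simple random split and a user-defined split. The 60/20/20 proportions, the `cohort` field, and the function names are assumptions chosen for illustration; the article does not prescribe them.

```python
import random


def random_partition(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle a copy of the records and cut it into three partitions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, test


def user_defined_partition(records, variable, assignment):
    """Route each record by the value of a user-chosen partition variable."""
    parts = {"training": [], "validation": [], "test": []}
    for record in records:
        parts[assignment[record[variable]]].append(record)
    return parts


# Hypothetical patient records tagged with an enrolment cohort.
patients = [{"id": i, "cohort": "ABC"[i % 3]} for i in range(1000)]

training, validation, test = random_partition(patients)
by_cohort = user_defined_partition(
    patients, "cohort", {"A": "training", "B": "validation", "C": "test"}
)
```

Fixing the seed keeps the split repeatable, which matters when results have to be reproduced for review.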
One of the goals of data mining is to find a model which
has the greatest potential to generate accurate predictions with a minimal degree
of error. That means building, testing, comparing, and assessing multiple models
in order to select the one that will deliver the highest research value.
Define and deploy mining models: The appropriate modelling
strategy for data mining depends on the research question being addressed:
- "Supervised" learning methods are appropriate
where there exists a target efficacy variable or an endpoint with known values
and about which predictions will be made based on the values of other variables
as input. Statistical methods in this category include multiple and logistic
regression, neural networks, decision trees, and discriminant analysis.
- "Unsupervised" learning methods are appropriate
when there is no target variable or endpoint with known
values, but input variables for modelling exist. Analytical methods of this
type include clustering (K-means) analysis, self-organising maps, principal
components analysis, association and sequential models, and factor analysis.
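To make the unsupervised case concrete, here is a bare-bones K-means sketch in plain Python. It assumes each compound is described by two hypothetical assay scores and has no target variable; the data, the `init` parameter, and the function names are illustrative only, and a real project would normally rely on an established statistics package.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and centroid updates until nothing changes."""
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[i])  # keep a centroid that lost all points
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical assay scores: two visibly separate groups of compounds.
scores = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

centroids, labels = kmeans(scores, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

The starting centroids are passed in explicitly here because K-means can settle on a poor grouping from an unlucky random start, so in practice it is usually run several times and the best result kept.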
Refine and disseminate mining models: The credibility
of the data mining and its output is crucial in a regulated environment, such
as the pharmaceutical industry. Therefore, discovery data mining is an iterative
process of design, modelling, testing, revising, and documenting. Bootstrapping
methods, n-fold cross validation or sensitivity analysis may be applied to assess
the robustness of results. Different data mining techniques may be used to analyse
the same problem scenario or build similar models, and the results compared
across different methods for consistency and validity. The models that best
address the research questions at hand are disseminated to users.
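The robustness checks mentioned above can be sketched as follows. The code shows how n-fold cross-validation splits might be generated and how a percentile bootstrap interval might be computed for any statistic. The function names, the 1,000 resamples, and the 95 percent level are assumptions made for illustration.

```python
import random
import statistics


def n_fold_splits(n_records, n_folds, seed=0):
    """Yield (training, held_out) index lists; each record is held out once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


def bootstrap_interval(values, statistic=statistics.mean,
                       n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, then read off the tails."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical response scores from ten patients.
responses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

splits = list(n_fold_splits(len(responses), n_folds=5))
low, high = bootstrap_interval(responses)
```

A model would be refitted on each `training` list and scored on the matching `held_out` list. A wide bootstrap interval, or scores that swing sharply between folds, suggests the result is not yet robust enough to disseminate.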
The mining tool or software should support this iterative process and the requirements
for transparency, validation, and a process flow that is repeatable and auditable.
The author is Head, Marketing and Alliance, Pharma and ITES,
SAS India