www.expresspharmaonline.com FORTNIGHTLY INSIGHT FOR PHARMA PROFESSIONALS
16-31 May 2006  

Business Accent

Search for the right data

A new generation of data mining algorithms, more affordable prices for memory, processing and storage, and an explosion in electronic biomedical data have pushed many biostatisticians to take a harder look at data mining. Once scorned as a way to twist and dredge data until it matched the desired hypothesis, data mining has now proven to be an essential advantage in an increasingly demanding and competitive market. George Varghese takes a look at data mining in pharma.

Biostatisticians have often underestimated the potential of data mining algorithms. However, recent advances have made everyone sit up and take notice. To put it simply, data mining is the process of selecting, exploring, modifying, and modelling large amounts of data to uncover previously unknown patterns for a business advantage. In general, this means recognising and interpreting unanticipated but valid, potentially useful, and understandable patterns in a database or data repository.

Data mining sifts through huge amounts of information and extracts previously unrecognised insights about molecular compounds, product portfolios, clinical studies, and customers that can be applied to accelerate drug discovery and development. For example, you could match desirable biochemical specifications with disease and patient characteristics, and select compounds with the highest probability of a favourable safety and efficacy profile. One can also generate and explore scientifically interesting new hypotheses, based on multiple markers and interdependencies.

Pharmaceutical companies are increasingly turning to data mining as they look for ways to cut research and development costs and compress the drug development cycle, while producing more effective drugs for a wider range of diseases. By the most conservative estimates, investment in statistical analysis and data mining tools was expected to reach nearly $2 billion by 2005 (IDC), a compound annual growth rate of more than 30 percent over five years.

Companies that engage and invest intelligently in data mining will be rewarded with cost savings in research and development, reductions in drug development cycle time, and improved returns on labour and capital investment. For example, a major pharmaceutical research foundation sought a way to extract new intelligence from a database containing activity outcome scores for more than 150,000 chemical substances and the results of more than 1,500 different assays performed on each substance. With that much data to analyse, the foundation needed a scalable system that could keep up with its growing database. A scalable and flexible data mining solution enabled the manager of medical chemistry, the IT project leader and their teams to find correlations between assay outcomes, discern the most discriminating fragments leading to activity (using decision trees), and identify and derive lead candidates faster.
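The fragment analysis mentioned above can be illustrated with a minimal sketch. It assumes each substance is encoded as a set of fragment flags plus an active/inactive label; the fragment names and the six records are hypothetical, and the code only computes the information gain a decision tree would use to choose its first split. It is not the foundation's actual system.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, fragment):
    """Entropy reduction obtained by splitting on presence of one fragment."""
    present = [y for x, y in zip(rows, labels) if x[fragment]]
    absent = [y for x, y in zip(rows, labels) if not x[fragment]]
    n = len(labels)
    weighted = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - weighted

def best_fragment(rows, labels):
    """Fragment a decision tree would split on first (highest information gain)."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

# Hypothetical substances: 1 means the fragment is present in the molecule.
compounds = [
    {"amide": 1, "nitro": 0, "phenyl": 1},
    {"amide": 1, "nitro": 1, "phenyl": 0},
    {"amide": 1, "nitro": 0, "phenyl": 0},
    {"amide": 0, "nitro": 1, "phenyl": 1},
    {"amide": 0, "nitro": 0, "phenyl": 1},
    {"amide": 0, "nitro": 1, "phenyl": 0},
]
# Hypothetical assay outcome for each substance, in the same order.
activity = ["active", "active", "active", "inactive", "inactive", "inactive"]

most_discriminating = best_fragment(compounds, activity)
```

In this toy data the "amide" flag separates active from inactive substances perfectly, so it is the fragment selected. A full decision tree would repeat the same choice recursively on each resulting subset.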

Another pharmaceutical company sought a way to analyse large, disparate, external databases on drug outcomes faster and more cost-effectively. It needed to distil specific information on the discriminating characteristics of a target patient group for a cancer medication from large volumes of detail about hospitalised patients, treatments, outcomes, and more. Using data mining, researchers were able to integrate all external databases containing data on drug research and drug outcomes, create new knowledge on existing and new drugs from that data, and redistribute the knowledge to relevant departments.

Data mining applications

Data mining adds value for all phases of the drug introduction process:

Discovery phase: Data mining can be applied to select molecular compounds with a high propensity to result in a promising new drug, identify patient and disease factors that may contribute to knowledge of how to limit and control toxicity, and synthesise large volumes of pharmacogenomic data.

Clinical trials: Data mining helps identify patients most likely to benefit from an experimental drug and predicts the probability of a treated patient experiencing an adverse reaction.

Investigational setting: Data mining can be applied to produce the most favourable patient outcomes, evaluate drug economics across multiple factors, and optimise manufacturing and marketing processes.

Data mining has proven particularly useful for researchers engaged in functional genomics and analysis of gene expression data. This process provides a systematic method to conduct genome sequence analysis for easy identification of different genes or gene markers that may suggest why some individuals are at a higher risk for developing certain diseases or for suffering from negative side effects.

Specialised data mining software can be very useful for exploratory data analysis, dose-response exploration studies, studying drug and substance interactions, identifying population segments at risk of drug side effects, and generating hypotheses that can be explored in a subsequent study.

Best practices

A successful discovery data mining solution relies on an orderly process and framework, and on the software or hardware architecture to support it. The overall process should be transparent, standardised, repeatable and re-usable; it should preserve statistical integrity and yield consistent results that can be validated with new databases. The credibility and usefulness of discovery data mining models are enhanced if best practices are adopted and can be standardised for a particular therapeutic drug or device. To that end, the data mining steps should be organised into a logical sequence of activities:

Establish hypothesis: The process flow begins with the research question. The process should be oriented toward solving discovery or clinical research questions rather than a data analysis problem. The objective is not only to search for systematic patterns in the data warehouse but also to produce solutions that can generate useful predictions or reveal usable information to support decision-making on a specific discovery issue.

Prepare data: The data mining process must have access to a data warehouse founded upon a common metadata model: a standardised, structured setting that logically organises and links together data from disparate sources, and provides data sets in analysis-ready form. Ideally, the data warehouse creates the data set to be mined, pre-processing the data to obtain metadata information, cleansing the data, and enriching it if necessary.
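As a rough illustration of the cleansing and linking described above, the sketch below normalises records from two sources and joins them on a shared key. The field names, the `patient_id` key and the sample records are all hypothetical; a real data warehouse would do far more (type checks, unit harmonisation, audit trails).

```python
def cleanse(record):
    """Normalise field names and text values; treat blank strings as missing."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key.strip().lower()] = value
    return cleaned

def link(primary, secondary, key):
    """Enrich each primary record with the matching secondary record, if any."""
    lookup = {row[key]: row for row in secondary}
    linked = []
    for row in primary:
        merged = dict(row)
        merged.update(lookup.get(row[key], {}))
        linked.append(merged)
    return linked

# Hypothetical extracts from two disparate sources.
patients = [
    {"Patient_ID": " P001 ", "Age": "54"},
    {"Patient_ID": "P002", "Age": " "},
]
outcomes = [
    {"patient_id": "P001", "outcome": "responder"},
]

analysis_ready = link(
    [cleanse(p) for p in patients],
    [cleanse(o) for o in outcomes],
    key="patient_id",
)
```

Here the blank age becomes an explicit missing value rather than a stray space, and the patient with no outcome record is kept, so that the decision on how to handle it stays with the analyst.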

Sample data to be mined: Although the original data source could be mined directly, in many situations (especially when the database is large) it is more prudent to reduce the entire database to a representative sample using an appropriate sampling algorithm, such as simple random, stratified, Nth-record (systematic), or cluster sampling.
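The four sampling schemes named above can be sketched in a few lines each. This is an illustration under simple assumptions: the records fit in memory, the `sex` and `site` fields are hypothetical, and a fixed seed is used only to make the draw repeatable.

```python
import random
from collections import defaultdict

def simple_random_sample(records, k, seed=0):
    """Draw k records, each equally likely, without replacement."""
    return random.Random(seed).sample(records, k)

def nth_sample(records, step, start=0):
    """Systematic sampling: keep every Nth record from a starting offset."""
    return records[start::step]

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from each stratum (at least one record each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

def cluster_sample(records, key, n_clusters, seed=0):
    """Pick whole clusters at random and keep every record inside them."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for record in records:
        clusters[key(record)].append(record)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [record for c in chosen for record in clusters[c]]

# Hypothetical patient table: 25 women, 75 men, spread over 5 sites.
patients = [
    {"id": i, "sex": "F" if i % 4 == 0 else "M", "site": i % 5}
    for i in range(100)
]

random_20 = simple_random_sample(patients, 20)
every_10th = nth_sample(patients, 10)
by_sex = stratified_sample(patients, key=lambda p: p["sex"], fraction=0.2)
two_sites = cluster_sample(patients, key=lambda p: p["site"], n_clusters=2)
```

The stratified draw keeps the 1:3 ratio of women to men found in the full table, which a small simple random sample is not guaranteed to do.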

Create data partitions: Segregating the data mining data set into training, validation, and test samples provides one way to validate and compare the results of different statistical models. The training data set is used to train the model to learn and discover the underlying patterns in the data. The validation data set is used to validate the model built with the training data set, and the test data set is used to check how well the trained and validated models generalise to new data. Data partitioning can be accomplished by simple random sampling, stratified or purposeful sampling, or user-defined sampling, in which the user specifies the partition variable and its values.
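A simple random partition can be sketched as follows. The 60/20/20 proportions and the function name are assumptions chosen for illustration; the article does not prescribe a split ratio.

```python
import random

def partition(records, train=0.6, validation=0.2, seed=0):
    """Shuffle once, then cut into training, validation and test samples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set

# Hypothetical record identifiers standing in for rows of the mining data set.
record_ids = list(range(100))
training_set, validation_set, test_set = partition(record_ids)
```

Because every record lands in exactly one sample, the test set stays untouched until the final check of how well the chosen model generalises.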

One of the goals of data mining is to find the model with the greatest potential to generate accurate predictions with a minimal degree of error. That means building, testing, comparing, and assessing multiple models in order to select the one that will deliver the highest research value.

Define and deploy mining models: The appropriate modelling strategy for data mining depends on the research question being addressed:

  • "Supervised" learning methods are appropriate where there is a target efficacy variable or endpoint with known values, about which predictions are made using the values of other variables as input. Statistical methods in this category include multiple and logistic regression, neural networks, decision trees, and discriminant analysis.
  • "Unsupervised" learning methods are appropriate where there is no target variable or endpoint with known values, but input variables for modelling exist. Analytical methods of this type include clustering (K-means) analysis, self-organising maps, principal components analysis, association and sequential models, and factor analysis.
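To make the unsupervised case concrete, here is a minimal K-means sketch in plain Python. It takes only input variables, with no target, and groups the records by proximity. The two measurements per record and the six data points are hypothetical, and production work would normally rely on a validated statistical package rather than hand-written code.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Basic K-means: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, clusters

# Hypothetical two-variable measurements forming two well-separated groups.
low_group = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2)]
high_group = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]

centroids, clusters = k_means(low_group + high_group, k=2)
```

No label says which record belongs to which group; the algorithm recovers the two groups from the input variables alone. That is the distinction from supervised methods, which need an endpoint with known values.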

Refine and disseminate mining models: The credibility of data mining and its output is crucial in a regulated environment such as the pharmaceutical industry. Therefore, discovery data mining is an iterative process of design, modelling, testing, revising, and documenting. Bootstrapping methods, n-fold cross-validation or sensitivity analysis may be applied to assess the robustness of results. Different data mining techniques may be used to analyse the same problem scenario or build similar models, and the results compared across methods for consistency and validity. The models that best address the research questions at hand are disseminated to users.
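The n-fold cross-validation mentioned above can be sketched as follows: the data are cut into n folds, and each fold is held out once while a model is fitted on the rest. The straight-line model, the dose and response values, and all function names are hypothetical choices for illustration only.

```python
import random

def n_fold_splits(n_records, n_folds, seed=0):
    """Yield (training indices, held-out indices) once for each fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out

def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, xs, ys):
    """Average squared difference between predicted and observed values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, n_folds):
    """Fit on n-1 folds, score on the held-out fold, and repeat for every fold."""
    errors = []
    for train_idx, test_idx in n_fold_splits(len(xs), n_folds):
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(mean_squared_error(
            model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return errors

# Hypothetical, perfectly linear dose-response data (response = 2 * dose + 1).
doses = [float(d) for d in range(1, 11)]
responses = [2.0 * d + 1.0 for d in doses]

fold_errors = cross_validate(doses, responses, n_folds=5)
```

With real, noisy data the spread of the per-fold errors is the quantity of interest: a model whose error varies widely from fold to fold is less robust than one whose error is stable.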

The mining tool or software should support this iterative process and the requirements for transparency, validation, and a process flow that is repeatable and auditable.

The author is Head, Marketing and Alliance, Pharma and ITES, SAS India

 


© Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.