# Proven Machine Learning Generated Predictions on Account-Level INJ Utilization

The following methodology leverages two Machine Learning algorithms to estimate the probability of a “significant drop” at the Veeva Account Level. Furthermore, the same data structure enables a prediction of network growth, which in turn allows for a sales growth forecast.

Stylized facts:

Given that we do not observe HCP-level sales, we must aggregate ChargeBack sales at the Account Level, for which Veeva Affiliation data (the so-called HCO Master) provide insights. See the graph below for a blurred visual of a typical Veeva Affiliation Data cluster:

Given that we do not observe the probability of a “significant drop,” we must estimate model parameters and splitting rules by fitting a parametric line to the natural log of the odds of the “significant drop” event.

Given the data structure provided by Affiliation Data, we forecast the Graph/Cluster growth to identify potential sales growth opportunities.

Output table:

The output table will look like the following:

The Logits:

As we wrap up the processing of the data, we need to start computing the Logits, our dependent variable in the Master Dataset. Before applying weights based on the variance of the error term, please bear in mind that the functional form of the Logit model is the following:

$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 X_i + u_i$$

The Logits, $L_i$, which are the log of the odds of the event of interest (in our case, units sold relative to the “significant drop” threshold), need to be computed at every level of Rebate Rate, Market Share, or Price, depending on which one we choose as the independent variable. Please keep in mind that in regression we estimate a conditional probability. In this case, it would be $P_i = P(Y = 1 \mid X_i)$, where $X_i$, the ith value of X, is the level of Rebate, Market Share, or Price.

Therefore, we need:

1. To define “significant drop” per cluster: for example, just below average units sold, 25% of units, or another rule of your choosing.
2. Once the “significant drop” threshold has been set for each cluster, the Logits can be calculated as follows:
• Pi = relative frequency of Units Sold greater than the “significant drop” threshold, per cluster.
• Compute the odds, Pi / (1 - Pi), for each level of rebates, market share, or price, per cluster.
• Take the natural log of the odds; the result is the Logit.
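The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the function name `grouped_logits`, the input layout, and the example counts are all hypothetical. It assumes that, for one cluster, we already have a count of observations meeting the event definition and a total count at each level of the chosen independent variable.

```python
import math

def grouped_logits(counts):
    """Compute (relative frequency, logit) per level for one cluster.

    counts: {level: (n_events, n_total)}, a hypothetical input layout.
    Levels with a relative frequency of exactly 0 or 1 are skipped,
    because the log of the odds is undefined there.
    """
    out = {}
    for level, (n_events, n_total) in counts.items():
        if n_total <= 0:
            continue
        p_hat = n_events / n_total            # relative frequency Pi
        if p_hat <= 0.0 or p_hat >= 1.0:
            continue
        odds = p_hat / (1.0 - p_hat)          # odds Pi / (1 - Pi)
        out[level] = (p_hat, math.log(odds))  # natural log of the odds
    return out

# Hypothetical counts per rebate-rate level for a single cluster
example_counts = {0.05: (8, 40), 0.10: (20, 40), 0.15: (30, 40)}
logits = grouped_logits(example_counts)
```

Skipping levels where the frequency is 0 or 1 is one possible design choice; another would be to apply a small continuity correction before taking logs.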

Logistic regression:

Via Weighted Least Squares (WLS), a special case of Generalized Least Squares (GLS), we will regress the Logits on rebates, market share, or price and any other variables we believe are influential.
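A minimal sketch of the weighted regression for a single independent variable follows. The weight formula is an assumption, since the text does not state it: for grouped Logit data a common choice is w_i = N_i · P_i · (1 - P_i), the inverse of the approximate error variance. The function names, the example levels, and the frequencies are hypothetical, and the weighted R^2 shown is only one possible reading of the R^2 criterion.

```python
import math

def wls_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x with weights w."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def weighted_r2(x, y, w, b0, b1):
    """Weighted R^2: 1 - weighted residual SS / weighted total SS."""
    sw = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    ss_res = sum(wi * (yi - (b0 + b1 * xi)) ** 2 for wi, xi, yi in zip(w, x, y))
    ss_tot = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    return 1.0 - ss_res / ss_tot

# Hypothetical grouped data for one cluster: rebate levels, group sizes,
# and observed relative frequencies of the event.
levels = [0.05, 0.10, 0.15, 0.20]
n_obs = [40, 40, 40, 40]
p_hat = [0.20, 0.40, 0.60, 0.75]

logits = [math.log(p / (1.0 - p)) for p in p_hat]
weights = [n * p * (1.0 - p) for n, p in zip(n_obs, p_hat)]  # assumed weights

b0, b1 = wls_fit(levels, logits, weights)
r2 = weighted_r2(levels, logits, weights, b0, b1)
```

With more than one regressor, the same idea carries over to a matrix formulation or to a statistics library's WLS routine.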

Once we have the regression parameters, the predicted Logits can be estimated at every level of Rebate Rate, Market Share, or Price. Taking the anti-log of a predicted Logit gives the odds, and the corresponding probability of the event “significant drop” follows as P = e^L / (1 + e^L). This last step yields the prediction of which clusters would drop below the “significant drop” threshold. The R^2 is the selection criterion for the Logit model.
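Converting a predicted Logit back to a probability can be sketched as below. The anti-log of the Logit is the odds e^L, and the probability is e^L / (1 + e^L), written here in a form that avoids overflow for large values. The coefficients and rebate-rate levels are hypothetical placeholders, not estimates from the data.

```python
import math

def logit_to_probability(logit):
    """Inverse logit: P = e^L / (1 + e^L), computed without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

def predict_probabilities(b0, b1, levels):
    """Predicted probability at each level from a fitted line b0 + b1 * x."""
    return {x: logit_to_probability(b0 + b1 * x) for x in levels}

# Hypothetical coefficients and rebate-rate levels
probs = predict_probabilities(b0=-2.0, b1=10.0, levels=[0.1, 0.2, 0.3])
```

Clusters can then be flagged by comparing each predicted probability against a cutoff chosen by the analyst.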

Random Forest:

The dependent variable of the RF model will also be the Logits calculated for the Logistic Model above. We will fit a Random Forest model, throwing the “Kitchen Sink” in as independent variables, as follows:

1. Fine-tune the model hyperparameters.
2. Decide the best fit by either Root Mean Squared Error or Mean Squared Error.
3. Predict the Logits by fixing the values to each level of Rebate Rate, Market Share, or Price.
4. Then, by taking the anti-log of the predicted Logit to obtain the odds and converting the odds to a probability, P = e^L / (1 + e^L), we will retrieve the corresponding probability of the event “significant drop.”
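The four steps above might look like the following sketch, assuming scikit-learn and NumPy are available. Everything about the data here is a synthetic stand-in for the Master Dataset: the column layout, the coefficients used to generate the Logits, and the hyperparameter grid are hypothetical. Step 3 is read as fixing the variable of interest at each level while leaving the other features as observed, then averaging the predicted Logits; that reading is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the Master Dataset (all columns hypothetical):
# column 0 = rebate rate, columns 1-2 = other "kitchen sink" features.
n = 200
X = np.column_stack([
    rng.uniform(0.0, 0.3, n),
    rng.normal(size=n),
    rng.normal(size=n),
])
y_logit = -2.0 + 12.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Steps 1-2: tune hyperparameters and select by cross-validated RMSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [3, None],
        "min_samples_leaf": [1, 5],
    },
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y_logit)
best_rf = search.best_estimator_

# Step 3: fix the rebate rate at each level and average the predicted Logits.
levels = [0.05, 0.15, 0.25]
mean_logit = {}
for level in levels:
    X_fixed = X.copy()
    X_fixed[:, 0] = level
    mean_logit[level] = float(best_rf.predict(X_fixed).mean())

# Step 4: convert each predicted Logit to a probability, P = 1 / (1 + e^-L).
prob = {level: 1.0 / (1.0 + np.exp(-l)) for level, l in mean_logit.items()}
```

Averaging the Logits before converting is one design choice; converting each prediction to a probability first and then averaging would give slightly different numbers.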