Monte Carlo experiments are a powerful statistical technique used to provide approximate answers to questions about complex problems that may include a stochastic component, mainly when analytic and numerical techniques fail to supply those answers, with an acceptable amount of effort, in an exact and/or complete manner. These simulation techniques are essentially based on controlled statistical sampling, and they have a wide range of applications including, among others, statistical mechanics, biology, games, combinatorial optimization, and engineering.
There are several ways to summarize the behaviour of a sample of uni- or multi-dimensional data, whether real or simulated. Each technique is well suited to highlighting certain structures in the data (means, variances, quantiles, etc.) and to testing different hypotheses about the underlying population.
It is commonplace, when simulation is used, to have several samples to analyze. Each sample could have been obtained, for example, from similar simulations of the same system run with different parameter values. What we want to know is, for example, which parameter values yield the most adequate model for representing a real system, or perhaps how the system behaves under different parameter values.
A simulation study must be carefully planned in order to obtain meaningful and useful results. It should always be remembered that this kind of study (nothing but an experiment with numbers) is very much like experiments with animals, crops, etc. From this viewpoint, Monte Carlo experiments have the advantage of being wholly controllable, which is usually not the case in other laboratories of the Applied Sciences.
Therefore, a Monte Carlo experiment should be planned according to the rules of Experimental Design. Many good ideas can be borrowed from this area of Statistics (see, for example, the book by Box et al. (1978) for a comprehensive introduction). We should identify the critical hypotheses in the model under consideration, and we should also obtain every output needed to isolate the effects of each factor, as well as of relevant groups of factors.
In principle, a simulation model has two components: one is given by the parameters and the interaction structure among the random variables (the input); the other is the response or output. There are several elements to be considered when planning the experiment, for example:
The first three questions are of general interest and applicable to every simulation or data analysis problem. A detailed discussion of these issues, specifically applied to simulation, can be found in the work by Kleijnen (1975); Gruber and Freimann (1986) also treat this problem, in the context of the comparison of estimators.
The two most relevant factors in stochastic simulation (whenever it is used as a study methodology in Statistics) are the sample size and the distributions from which the samples are drawn. Most works that use stochastic simulation in Statistics aim at comparing different techniques under different data distributions. Among these works we consider those devoted to the determination (or approximation) of the exact distribution of certain statistics.
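As a concrete illustration of that last use, the sketch below approximates, by simulation, the sampling distribution of a statistic; the choices made here (the sample median under an exponential parent distribution, the sample size, and the number of replications) are arbitrary and serve only as an example.

import numpy as np

# Approximate, by simulation, the sampling distribution of a statistic
# whose exact distribution is hard to obtain analytically.  The choice of
# statistic (the sample median) and of parent distribution (exponential)
# is only illustrative.
rng = np.random.default_rng(seed=12345)

M = 10_000        # number of replications
n = 25            # sample size

medians = np.empty(M)
for i in range(M):
    sample = rng.exponential(scale=1.0, size=n)
    medians[i] = np.median(sample)

# Summaries of the approximated sampling distribution.
print("mean    :", medians.mean())
print("std     :", medians.std(ddof=1))
print("5%, 95% :", np.quantile(medians, [0.05, 0.95]))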
Some common situations, when the interest lies in comparing the performance of different statistics, are:
The following questions could be answered in order to assess performance:
A quite general setup for a Monte Carlo experiment, suited to answering these questions, could be:
In order to make the study complete, the setup above could be modified or repeated for:
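Whatever the precise factors and design points chosen, a minimal sketch of such a setup could look as follows; it assumes, purely for illustration, that the goal is to compare two location estimators (the mean and the median, an arbitrary choice not taken from the text) across a few distributions and sample sizes, with M replications per combination.

import numpy as np

rng = np.random.default_rng(seed=2024)
M = 5_000                      # replications per design point

# Illustrative factors: the distributions and sample sizes are arbitrary choices.
distributions = {
    "normal": lambda n: rng.normal(loc=0.0, scale=1.0, size=n),
    "laplace": lambda n: rng.laplace(loc=0.0, scale=1.0, size=n),
    "t3 (heavy-tailed)": lambda n: rng.standard_t(df=3, size=n),
}
sample_sizes = [10, 30, 100]

# Estimators under comparison (both estimate the center of symmetry, 0).
estimators = {"mean": np.mean, "median": np.median}

for dist_name, draw in distributions.items():
    for n in sample_sizes:
        results = {name: np.empty(M) for name in estimators}
        for i in range(M):
            sample = draw(n)
            for name, stat in estimators.items():
                results[name][i] = stat(sample)
        # Mean squared error around the true center (0) as a performance summary.
        summary = {name: float(np.mean(vals**2)) for name, vals in results.items()}
        print(dist_name, n, summary)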
A first step towards increasing the accuracy of a given estimator could be to search for efficient techniques for computing the quantities involved, or to increase the number of replications (M) without excessive computational cost by using, for instance, faster generation techniques. For example, if the programs were developed in a high-level programming language, such as FORTRAN, the experimenter could try to rewrite them, provided this does not demand too much effort, in a lower-level language, such as C or even ASSEMBLER. This could yield faster generation and calculation... though formidable programming problems might appear, increasing the risk of bugs, mistakes, etc. The only rule we know about this is: be sensible!
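In the same spirit, the trade-off between implementation effort and speed can be sketched with present-day tools rather than the languages mentioned above: the example below compares a replication-by-replication loop with a vectorized computation of the same quantity; the functions and sizes are arbitrary and only meant to illustrate the point about faster generation and calculation.

import time
import numpy as np

rng = np.random.default_rng(seed=7)
M, n = 2_000, 1_000

def naive_means():
    # One replication at a time, accumulating results in a Python loop.
    out = np.empty(M)
    for i in range(M):
        sample = rng.normal(size=n)
        out[i] = sample.mean()
    return out

def vectorized_means():
    # All replications generated and reduced at once.
    samples = rng.normal(size=(M, n))
    return samples.mean(axis=1)

for fn in (naive_means, vectorized_means):
    t0 = time.perf_counter()
    fn()
    print(fn.__name__, f"{time.perf_counter() - t0:.3f} s")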
The tools devoted to improving the accuracy of simulation results are generically known as Variance Reduction Techniques. This name can be justified by considering what is usually done in a simulation experiment: let $F$ be a cumulative distribution function, let $X$ be a random variable with distribution given by $F$, and let $g$ be a measurable function.

Problem: estimate $\theta = E\left[g(X)\right]$ on the basis of $X_1, \dots, X_M$, independent identically distributed random variables with common distribution $F$.

The raw Monte Carlo estimator is
$$\hat{\theta} = \frac{1}{M} \sum_{i=1}^{M} g(X_i).$$

Variance reduction techniques consist of modifying this setup in order to reduce the variance of the estimator of $\theta$: for instance, by modifying the way in which the random variables are generated, or by incorporating analytical knowledge about the distribution $F$. In most problems, $\theta$ may be a vector, and $g$ and $F$ may have quite complicated forms; in such cases, only the use of some variance reduction technique would ensure dependable results.
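As a concrete illustration of this setup, the sketch below computes the raw Monte Carlo estimator and, for comparison, an estimator based on antithetic variates, one common way of modifying how the random variables are generated; the particular choices of $g$ and $F$ (here $g(x) = e^x$ and $F$ the uniform distribution on $[0,1]$, so that $\theta = e - 1$) are arbitrary and only meant as an example.

import numpy as np

# Illustrative choices (not from the text): F is Uniform(0, 1), g(x) = exp(x),
# so the true value is theta = E[g(X)] = e - 1.
rng = np.random.default_rng(seed=1)
M = 100_000

def g(x):
    return np.exp(x)

# Raw Monte Carlo estimator: average of g over M i.i.d. draws from F.
u = rng.uniform(size=M)
raw_estimate = g(u).mean()

# Antithetic variates: pair each U with 1 - U.  Since g is monotone,
# g(U) and g(1 - U) are negatively correlated, so their average has
# smaller variance than the average of two independent evaluations.
u_half = rng.uniform(size=M // 2)
antithetic_pairs = 0.5 * (g(u_half) + g(1.0 - u_half))
antithetic_estimate = antithetic_pairs.mean()

print("true value          :", np.e - 1)
print("raw Monte Carlo     :", raw_estimate)
print("antithetic variates :", antithetic_estimate)
print("var per draw, raw   :", g(u).var(ddof=1))
print("var per pair, anti. :", antithetic_pairs.var(ddof=1))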