5.3 Methods for Collecting Data (application)

3. Methods for Collecting Data (application)

Data constitute the foundation for statistical analysis. Data from the process which has to be analyzed can be collected by applying tools such as:

1. Check Sheets: These are the most common tool for collecting data. They permit the user to collect data from a process in an easy and systematic manner. (See the previous section)

2. Control Charts: These are graphs used to study how a process changes over time. Through control charts, current data can be compared to historical control limits to conclude whether the process is in control or is out of control, that is, displaying variation due to special causes. (To read more on control charts, see: Chapter 8: Six Sigma, Control)

3. Design of Experiments: This is a collection of methods for carrying out controlled, planned experiments on a process or product. Design of experiments involves undertaking a sequence of experiments that first look at a broad set of candidate variables, known as independent variables, and then narrow these down to a list of the vital few. The data are collected and studied to find out the effect that one variable or a number of variables have on the process.

4. Survey: Sample surveys collect data from various groups of people to gather information about their knowledge of, or opinion about, a product or process. (To read more on surveys, see: Chapter 2: Business Process Management)

5. Stratification: This is a way of separating data collected from various sources so that patterns in the data can be seen. (For more information on stratification, see the following section)

Coding Data

The data collected, no matter where they come from, need to be coded before they are entered for processing, analysis and reporting. Each item of data that goes for processing needs a numeric code attached to it. For example, if you ask customers whether they prefer a particular product, the answers could be ‘yes’, ‘no’ or ‘can’t say’. In this case you cannot use just a check mark or a similar symbol against the answer. A possible way to code it is ‘yes’=1, ‘no’=2, ‘can’t say’=3. While processing the data, you count the number of 1s, 2s, and 3s.

However, sometimes the data variables need not be coded. If you are using weight or age as a variable of interest, the age or weight itself can be used. Coding becomes necessary when the analysis does not take the values as they are. For example, when responses are grouped as “< 18 years”, “18 to 30 years”, “> 30 years” etc., you can use < 18 years = 1, 18 to 30 years = 2, and so on. Therefore, for each numeric variable to be analyzed, either the actual values or coded values are used.
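The two coding schemes described above can be sketched in a few lines of Python. This is only an illustration: the names `RESPONSE_CODES`, `code_responses` and `code_age` are hypothetical, the survey answers and ages are made up, and the code 3 for “> 30 years” is an assumed continuation of the “and so on” in the text.

```python
from collections import Counter

# Hypothetical code book; the numeric codes follow the example in the text.
RESPONSE_CODES = {"yes": 1, "no": 2, "can't say": 3}

def code_responses(responses):
    """Translate raw survey answers into their numeric codes."""
    return [RESPONSE_CODES[answer.strip().lower()] for answer in responses]

def code_age(age):
    """Code an age in years as <18 = 1, 18 to 30 = 2, >30 = 3 (3 is assumed)."""
    if age < 18:
        return 1
    if age <= 30:
        return 2
    return 3

# Made-up sample data for illustration only.
raw_answers = ["Yes", "no", "yes", "can't say", "No", "yes"]
coded = code_responses(raw_answers)
counts = Counter(coded)  # number of 1s, 2s and 3s recorded
age_codes = [code_age(age) for age in [15, 18, 30, 31, 64]]

print(coded)
print(dict(counts))
print(age_codes)
```

Keeping the code book in one dictionary means the same mapping is used at data entry and at reporting time.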

Binary Coding

When the data are qualitative, or when direct measurements are not available in the given data, attributes are used. To characterize such data, binary coding is sometimes used. If a certain characteristic or event that needs to be checked is present in the data, it is denoted by 1. If it is absent, it is denoted by 0. This can be shown by:

Χ (event) = {1, if the event is present, 0, if the event is absent}

For example, if workers who work for 8 hours a day are considered efficient, Χ (efficiency) will be 1 if a worker works for 8 hours every day and 0 if the worker works for less than 8 hours a day.
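The indicator Χ (event) can be expressed as a small function. The sketch below uses a hypothetical `indicator` helper and made-up hours for five workers; treating more than 8 hours as efficient as well is an assumption, since the text only covers exactly 8 hours and less than 8 hours.

```python
def indicator(event_present):
    """Return 1 if the event is present, 0 if it is absent."""
    return 1 if event_present else 0

# Hypothetical daily hours worked by five workers (illustrative values).
hours_worked = {"A": 8, "B": 7.5, "C": 8, "D": 6, "E": 8}

# X(efficiency) = 1 for a full 8-hour day, 0 otherwise.
# Assumption: more than 8 hours also counts as efficient.
efficiency = {worker: indicator(hours >= 8) for worker, hours in hours_worked.items()}

# Because the codes are 0 and 1, their mean is the share of efficient workers.
share_efficient = sum(efficiency.values()) / len(efficiency)

print(efficiency)
print(share_efficient)
```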

4. Methods for Assuring Data Accuracy

Capturing incorrect or unreliable data turns out to be an expensive affair, and at the same time slows down the decision making process. The following is a list of important factors for assuring data accuracy:

Bias related to tolerances or targets should be avoided while measuring or while recording readings from analog and digital displays.

When data occurs in time series, the order of the data captured should be noted.

If an item characteristic changes over time, the measurement should be recorded immediately and again after the item stabilizes.

Rounding should be avoided because it reduces the sensitivity of the measurement. Averages should be calculated to at least one more decimal place than the individual readings.

Data entry errors should be removed by filtering the data.

Guesswork should be avoided while removing errors, and objective statistical tests should be applied to spot outliers (observations that differ from the main trend in the data).

Every significant classification identifier must be recorded along with the data. This information might include time, gauge, auditor, operator, material, process change, etc.
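The advice to spot outliers with an objective rule rather than guesswork can be illustrated as follows. The text does not name a particular test, so this sketch assumes the common quartile-fence rule (values more than 1.5 interquartile ranges outside the quartiles); the function name `flag_outliers` and the readings are hypothetical. Positions are returned with the values so that the time order of the data is preserved.

```python
import statistics

def flag_outliers(readings, k=1.5):
    """Return (position, value) pairs lying outside the quartile fences."""
    q1, _, q3 = statistics.quantiles(readings, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, x) for i, x in enumerate(readings) if x < low or x > high]

# Made-up readings in the order they were captured.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.1, 14.8, 9.8]
outliers = flag_outliers(readings)

print(outliers)
```

A flagged value is a candidate for investigation, not an automatic deletion; the recorded operator, gauge and time help decide whether it is an entry error.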

It is necessary to select the sampling method according to how the sample data is going to be used. There is no strict norm as to which sampling plan will be employed for data collection and analysis; a decision has to be made on the basis of experience and needs of the data collector. The following is a guideline on a few sampling techniques. Every sampling method has been developed for some specific purpose.

Simple Random Sampling

In simple random sampling, each element in the sample space has an equal chance of getting selected in the sample. Hence the probability of any event can be determined by listing all the possible units in the sample space.

Simple random sampling is considered the simplest sampling technique and is often preferred because it saves time and cost. However, it requires the lot to be homogeneous. The sample must be representative of the lot; hence the emphasis is placed on following the sampling plan rather than on choosing individual units. The order in which units are drawn must follow a random plan.
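A minimal sketch of simple random sampling, using Python's standard `random` module. The lot of 100 numbered units, the sample size and the function name `simple_random_sample` are assumptions made for illustration; the seed is fixed only so that the draw can be repeated.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct units, each with an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, sample_size)

# Hypothetical lot of 100 numbered units.
lot = list(range(1, 101))
sample = simple_random_sample(lot, 10, seed=42)

print(sample)
```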

Stratified Sampling

When the distribution of samples is not homogeneous or proportional, stratified sampling is used. For instance, parts may have been produced on different machines, or under varying conditions. In this case, the total sample population is divided into homogeneous subgroups. These subgroups are called strata and this is followed by applying simple random sampling technique in each stratum. These strata are based on predetermined criteria such as size, weight, location, assembly line etc. Each unit in the sample space must be assigned to one stratum only.

The person using the sample data must be aware of the presence of stratified groups, and must state in the report that the interpretations are relevant only to the sample selected and may not represent the whole population.
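The two steps of stratified sampling, assigning every unit to exactly one stratum and then sampling randomly within each stratum, might look like this. Taking the same fraction from every stratum is an assumption of this sketch, and the machine labels, fraction and function name `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Split units into strata, then draw a simple random sample in each one."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # each unit lands in one stratum only
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical parts produced on two machines: 60 on M1 and 40 on M2.
parts = [("M1", i) for i in range(60)] + [("M2", i) for i in range(40)]
picked = stratified_sample(parts, stratum_of=lambda part: part[0], fraction=0.1, seed=1)

print({name: len(chosen) for name, chosen in picked.items()})
```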

Systematic Sampling

In this sampling technique, every nth element is selected from the sample space. The sampling interval, n, is calculated as:

n = Number in population / Number in sample

This technique is also referred to as interval sampling, as every nth unit is selected from the list that makes up the sample space.

For example, if there are 2000 units in the sample space and the required number of samples is 50, then 2000/50 = 40; hence every 40th unit will be selected.
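The worked example can be reproduced in code. In this sketch the function name `systematic_sample` and the optional `start` parameter are hypothetical; by default the selection begins with the nth unit, as in the text, although in practice the starting point is often chosen at random within the first interval.

```python
def systematic_sample(population, sample_size, start=None):
    """Select every n-th unit, where n = population size / sample size."""
    interval = len(population) // sample_size
    if start is None:
        start = interval  # begin with the n-th unit (counting from 1)
    chosen = population[start - 1::interval][:sample_size]
    return interval, chosen

# 2000 numbered units and a sample of 50, as in the example above.
units = list(range(1, 2001))
interval, chosen = systematic_sample(units, 50)

print(interval)
print(chosen[:3], chosen[-1])
```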

Clustered Sampling

In clustered sampling, all the units are grouped into clusters, and a number of clusters are selected randomly to represent the total population. All units within the selected clusters are then included in the sample. Ideally each cluster is internally heterogeneous, like a small copy of the population, while the clusters themselves are similar to one another.

The difference between cluster sampling and stratified sampling is that in cluster sampling each cluster is treated as the sampling unit, so the analysis is done on a population of clusters; in stratified sampling, the analysis is done on the elements within the strata.
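To make the contrast with stratified sampling concrete, the sketch below selects whole clusters at random and keeps every unit inside them. The boxes of parts, the number of clusters and the function name `cluster_sample` are made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters at random and include all of their units."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), n_clusters)  # the cluster is the sampling unit
    units = [unit for name in chosen_names for unit in clusters[name]]
    return chosen_names, units

# Hypothetical population: 8 boxes (clusters) of 5 parts each.
boxes = {f"box{b}": [f"box{b}-part{p}" for p in range(5)] for b in range(1, 9)}
names, sampled_units = cluster_sample(boxes, 3, seed=7)

print(names)
print(len(sampled_units))
```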

5. Descriptive Statistics

Descriptive statistics are used to explain the properties of distributions of data from samples. The following section describes the more frequently used descriptive statistical measures.

Measures of Central Tendency

The measures of central tendency are the various ways of describing the entire mass of data with a single central value. This central value is called an average. The two main objectives of the study of averages are:

i. to get a single value that describes the characteristic of the entire group.
ii. to facilitate comparison.

Three averages, the mean, the median and the mode, are of interest to Six Sigma.

Mean: Arithmetic mean or simply mean is the value obtained by the sum of all data values divided by the total number of data observations. It is the most widely used measure of central tendency.

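Population Mean

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

where $X_i$ is an observation and N is the population size

Sample Mean

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

where $X_i$ is an observation and n is the sample size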

Median: The median is the middle value of an ordered data set. One half of the items in the data set have a value equal to or smaller than the median, and one half have a value equal to or larger than the median. It splits the data into two parts. Note that when the number of data items is even, the median is the average of the middle two values.

Mode: The mode or modal value is that value in a series of data that occurs with the highest frequency. It is possible for data sets to have more than one mode.

While this statement is helpful in interpreting the mode, it cannot be applied safely to every distribution because of the erratic nature of sampling. Rather, the mode should be thought of as the value around which the data items are most closely concentrated. It is the value which has the highest frequency density in its immediate neighborhood.
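The three averages can be computed with Python's standard `statistics` module. The data set below is made up; it has an even number of items, so the median is the average of the two middle values, and it has two modes.

```python
import statistics

# Made-up data set with ten observations.
data = [4, 7, 7, 8, 9, 10, 11, 7, 9, 9]

mean = statistics.mean(data)        # sum of all values / number of values
median = statistics.median(data)    # average of the two middle values here
modes = statistics.multimode(data)  # every value with the highest frequency

print(mean, median, modes)
```

`statistics.multimode` is used instead of `statistics.mode` because it reports all modes when there is more than one.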

Measures of Dispersion

The measures of central tendency give one single figure that represents the entire data set. However, an average alone cannot give an adequate description of the data set, so it becomes necessary to describe the variability or dispersion of the observations. Measures of dispersion help in describing the spread of the data. Dispersion (also known as scatter, spread or variation) measures the extent to which the items vary from some central value.

The following are the main measures of dispersion:

Range: The range of a set of data values is the difference between the largest and smallest values.

R = Largest - Smallest

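Variance, Standard Deviation: The variance is the sum of squared deviations from the mean divided by the number of observations. (For a population this is the population size N; when the variance is estimated from a sample of size n, the divisor is usually n - 1.) The standard deviation is the square root of the variance.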

The Coefficient of Variation (COV): This is equal to the standard deviation divided by the mean and is expressed as a percentage.

Skewness: Skewness is a measure of the asymmetry of the distribution.

1. The normal distribution has a skewness of zero; zero signifies perfect symmetry.
2. Positive skewness signifies that the tail of the distribution is more extended on the side above mean.
3. Negative skewness signifies that the tail of the distribution is more extended on the side below mean.
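The measures of dispersion listed above can be computed as in the sketch below. The data sets are made up. Two assumptions are worth stating: the coefficient of variation is computed from the population standard deviation, and `skewness` is a hypothetical helper using the moment-based definition (third central moment divided by the second central moment raised to the power 1.5), which is one common definition among several.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Made-up readings for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # R = Largest - Smallest
pop_variance = statistics.pvariance(data)     # divides by the number of observations
sample_variance = statistics.variance(data)   # divides by n - 1
pop_std_dev = statistics.pstdev(data)         # square root of pop_variance
cov_percent = pop_std_dev / statistics.mean(data) * 100

symmetric = [1, 2, 3, 4, 5]
long_upper_tail = [1, 2, 2, 3, 10]
long_lower_tail = [-10, -3, -2, -2, -1]

print(data_range, pop_variance, pop_std_dev, cov_percent)
print(skewness(symmetric), skewness(long_upper_tail), skewness(long_lower_tail))
```

The symmetric set gives a skewness of zero, the set with a long tail above the mean gives a positive value, and its mirror image gives a negative value.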

Probability Distributions: Probability distributions are what relative frequency distributions approach when the number of observations is made very large. They describe the probabilities of the values of random variables. These random variables may be continuous or discrete in nature. For continuous random variables, the probability density function (p.d.f) is used, and for discrete random variables, the probability mass function (p.m.f) is used.

Probability Density Function (p.d.f): The p.d.f describes the behavior of a random variable in the continuous case. For normally distributed data, the p.d.f forms a bell-shaped curve.

When the random variables are normally distributed, the p.d.f is symmetric with mean = mode = median, meaning they are at the same point. Mathematically, if f(x) is a continuous function of the random variable x that is never negative, i.e., $f(x) \ge 0$, then f(x) is a p.d.f when

$\int_{-\infty}^{\infty} f(x)\,dx = 1$

That is, the total probability of a continuous distribution is 1.

Probability Mass Function (p.m.f): Similarly, if f(x) is a discrete function of the random variable x, which takes n values $x_1, \dots, x_n$, with $f(x_i) \ge 0$, then f(x) is a p.m.f when

$\sum_{i=1}^{n} f(x_i) = 1$

For a given value of x, there is only a single value of f(x) (denoted by the points in the x-axis in the graph above)
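As a small illustration of a p.m.f, the sketch below uses a fair six-sided die, which is an example chosen here and not taken from the text. Each value of x has exactly one probability f(x), none of the probabilities is negative, and they add up to 1. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# p.m.f of a fair six-sided die (an illustrative discrete random variable).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

all_non_negative = all(p >= 0 for p in pmf.values())
total_probability = sum(pmf.values())

def probability(event):
    """P(event) for a set of faces, obtained by summing the p.m.f."""
    return sum(pmf[face] for face in event)

p_even = probability({2, 4, 6})

print(all_non_negative, total_probability, p_even)
```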

Cumulative Distribution Function (c.d.f): The c.d.f represents the area under the probability distribution function to the left of x. The c.d.f is used for both continuous and discrete data. It is denoted by:

$F(x) = P(X \le x)$

And,

$F(x) = \int_{-\infty}^{x} f(t)\,dt$

where X is a continuous random variable with p.d.f f, and x is a value it can take.
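The relationship between the p.d.f and the c.d.f can be checked numerically. This sketch assumes a standard normal distribution, taken from Python's `statistics.NormalDist`, and a hypothetical `integrate` helper based on the trapezoidal rule; the lower limit of -8 stands in for minus infinity because the density beyond it is negligible.

```python
from statistics import NormalDist

def integrate(f, a, b, steps=10_000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / steps
    area = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        area += f(a + i * h)
    return area * h

standard_normal = NormalDist(mu=0.0, sigma=1.0)

LOWER = -8.0  # stands in for minus infinity
area_left_of_1 = integrate(standard_normal.pdf, LOWER, 1.0)  # area to the left of x = 1
total_area = integrate(standard_normal.pdf, LOWER, 8.0)      # should be close to 1

print(area_left_of_1, standard_normal.cdf(1.0))
print(total_area)
```

The integrated area to the left of x = 1 agrees closely with `standard_normal.cdf(1.0)`, and the total area is close to 1.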
