# Data Collection and Organization

## Introduction to data and information

All around us, we come across different kinds of information that help us make a decision, reach a conclusion or give someone a better suggestion.
If we look at the information around us and ask where it comes from, we find many sources in daily life, such as television, newspapers, websites and conversations with people.
What is collected from these sources is known as information. It can be of any type: train timings read from the chart at a railway station, the likely winner of a match judged from the scoreboard, the party likely to win an election judged from the results board, or the list of absent students in a school.
So is there anything behind the information, something from which it is generated?
The answer is data. Data is the basic building block on which information is built. Information sources collect data and process it into useful information that people can understand. Let’s take a deeper look at data and information with examples.

Example

Example 1

A teacher who maintains a daily attendance sheet for a class is collecting data about which students are present or absent each day. The data is collected for every student and written in the attendance sheet as present, absent or sick against each student’s roll number.

Table 1: Attendance sheet July, 2020

| Roll number | Attendance |
|---|---|
| 1 | Present |
| 2 | Present |
| 3 | Present |
| 4 | Absent |
| 5 | Absent |
| 6 | Present |
| 7 | Present |
| 8 | Sick |
| 9 | Present |
| 10 | Present |

So, the attendance sheet in Table 1 is filled with students’ data. This data helps the teacher or the school office draw further conclusions and analyse attendance trends, such as:

1. How many students attended school in the month of July?
2. How many students were absent in the month of July?
3. How many students were sick in the month of July?

These questions can be answered by observing and analysing the data in the attendance sheet, and the answers obtained are known as information.

Example

Example 2

Table 2: Student’s conveyance sheet July, 2020

| Roll number | Conveyance |
|---|---|
| 1 | School bus |
| 2 | By walk |
| 3 | School bus |
| 4 | School bus |
| 5 | By walk |
| 6 | Train |
| 7 | Train |
| 8 | Bus |
| 9 | Bus |
| 10 | By walk |

The data in Table 2 can help the teacher or the school office draw further conclusions and analyse trends, such as:

1. How many students travel to school by school bus?
2. Which conveyance do students use to reach school: a school bus, a train, or do they walk?

This information can help the school office take decisions or reach conclusions, and so give a better suggestion to the head of the school. An example of such a conclusion is the number of students who use the school bus.

In other words, this conclusion, which is the information, can help the head of the school decide to run more school buses if the number of students who use the bus exceeds the number of seats available.

So, in the above examples, we learned about data and information, and how information helps in taking decisions and drawing conclusions.
Next, we see how statistics handles data mathematically and how information is obtained from data.

## What is data in statistics?

Data is the plural form of datum, which is a Latin word. Data is a value that can be used to prepare information about something. The value can be a fact or a numerical figure.

Each value of data collected is called an observation. The data collected at the initial stage is called raw data.
In statistics, we study data that can be measured numerically.

Example

Here are the marks scored by 10 students in a maths test.
50, 45, 46, 30, 21, 25, 32, 49, 47, 18
Each of the marks 50, 45, 46, 30, 21, 25, 32, 49, 47 and 18 is called an observation.
The whole list of marks scored by the 10 students is data, and it remains raw data until it is placed in a table in a meaningful manner.

Therefore, from the above example, we can say that a collection of observations is called data.

Next, we see how the data is collected and organised in tables in statistics.

## Collection of data

Collection of data starts with gathering raw data from sources. A source can be anything, such as television, newspapers, websites or conversations with people. The raw data collected can be bulky and usually comes in an unorganised form.

Raw data is a mixture of unorganised values from which information cannot be extracted directly. When raw data is processed or cleaned, for example by sorting or refining it, it becomes usable data.

We can classify this data into two types, depending upon how we collect it:

### 1. Primary data

Data that is collected directly from experiments, questionnaires or interviews is called primary data.
In other words, primary data is firsthand data. It has not been used by anyone other than the person who collected it.

### 2. Secondary data

Secondary data is not firsthand data. It has already been collected and processed by someone else. This type of data is not collected by the person who uses it, but is obtained from published or unpublished sources.

## Organization of data

To get meaningful information from raw data, the data is processed. Processing involves cleaning the data, sorting it, and transferring it into tables or charts or representing it graphically. In a table, data is filled into columns, and each row can carry a sequence number.

Raw data collected for analysis is usually in an unorganised form at first. Unorganised data is difficult to use to get meaningful information.
So, to reach a meaningful conclusion, raw data is organised in a systematic manner. Some important and common ways to organise raw data are:

1. Alphabetical order or in a sequence
2. Ascending order
3. Descending order

These three ways simply rearrange the data: alphabetically or in sequence, or, for numerical data, in ascending or descending order.
Raw data arranged in ascending or descending order is called an array or arrayed data.
Below is an example of organising numerical data in ascending and descending order.

Example

Here is the list of marks obtained by 10 students in a maths test out of 50.
50, 45, 46, 30, 21, 25, 32, 49, 47, 18

The data is organised into two columns with column labels as Roll number and Marks obtained (out of 50).

Table 3: Marks obtained by students

| Roll number | Marks obtained (out of 50) |
|---|---|
| 1 | 50 |
| 2 | 45 |
| 3 | 46 |
| 4 | 30 |
| 5 | 21 |
| 6 | 25 |
| 7 | 32 |
| 8 | 49 |
| 9 | 47 |
| 10 | 18 |

So, this data can be arranged in ascending order as follows:
18, 21, 25, 30, 32, 45, 46, 47, 49, 50
It can be arranged in descending order as follows:
50, 49, 47, 46, 45, 32, 30, 25, 21, 18
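
The same arrangement can be tried out in a few lines of Python. This is only a minimal sketch: the variable names `marks`, `ascending` and `descending` are chosen here for illustration, and the built-in `sorted()` function does the ordering.

```python
# Marks obtained by 10 students in a maths test (out of 50), as raw data.
marks = [50, 45, 46, 30, 21, 25, 32, 49, 47, 18]

# sorted() returns a new list and leaves the raw data unchanged.
ascending = sorted(marks)
descending = sorted(marks, reverse=True)

print(ascending)
print(descending)
```

Because `sorted()` builds a new list, the raw data stays available in its original order, which is useful when the roll-number order still matters.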

## What is frequency in statistics?

In general, frequency means how many times an observation or an event occurs in a specific period or for a specific value. In statistics, it refers to the number of observations that occur for a particular value or range of values.
By definition, the number of times a particular entry or observation occurs is called its frequency.

Example

Consider 20 students, each of whom likes one fruit from cherry, apple, grapes, banana, pineapple and orange.
The fruits liked by the 20 students are cherry, apple, grapes, banana, pineapple, apple, cherry, banana, orange, apple, cherry, apple, cherry, grapes, grapes, orange, grapes, apple, apple and cherry respectively.
When the number of observations is large, it is preferable to represent the data in a table.
So, this data can be represented in table form as follows.

Table 4: Fruit liked by 20 students

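| Student number | Fruit |
|---|---|
| 1 | Cherry |
| 2 | Apple |
| 3 | Grapes |
| 4 | Banana |
| 5 | Pineapple |
| 6 | Apple |
| 7 | Cherry |
| 8 | Banana |
| 9 | Orange |
| 10 | Apple |
| 11 | Cherry |
| 12 | Apple |
| 13 | Cherry |
| 14 | Grapes |
| 15 | Grapes |
| 16 | Orange |
| 17 | Grapes |
| 18 | Apple |
| 19 | Apple |
| 20 | Cherry |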

So, Table 4 shows that one fruit can be liked by more than one student; that is, one fruit occurs in more than one observation.

This can be summarised in another table that shows how many students like each fruit.

Table 5: Number of students who like a specific fruit

| Fruit | Liked by total number of students |
|---|---|
| Cherry | 5 |
| Apple | 6 |
| Grapes | 4 |
| Bananas | 2 |
| Pineapple | 1 |
| Orange | 2 |

Table 5 provides information about the total number of students who like a specific fruit.
That is, cherry is liked by a total of 5 students in the class and, similarly, apple is liked by 6 students, grapes by 4 students, banana by 2 students, pineapple by 1 student and orange by 2 students.

Here, the fruit is called the variate and the number of students who like a particular fruit is called the frequency of the variate.

This example, which counts how many students like each fruit, is an example of frequency. Here, the task is to find how many students there are for a specific value, which is a fruit.
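
The step from Table 4 to Table 5 can be sketched in Python as follows. The list `fruits` simply repeats the 20 observations from the example, and `collections.Counter` from the standard library is used as one convenient way to count; the variable names are illustrative choices, not part of the lesson.

```python
from collections import Counter

# Fruit liked by each of the 20 students, in student-number order (Table 4).
fruits = [
    "cherry", "apple", "grapes", "banana", "pineapple",
    "apple", "cherry", "banana", "orange", "apple",
    "cherry", "apple", "cherry", "grapes", "grapes",
    "orange", "grapes", "apple", "apple", "cherry",
]

# Counter maps each variate (fruit) to its frequency (number of students).
frequency = Counter(fruits)

for fruit, count in frequency.items():
    print(f"{fruit}: {count}")
```

The frequencies add up to 20, the total number of observations, which is a quick check that no student was missed or counted twice.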

Table 5 can be revised to show frequency in terms of tally marks; such a table is known as a frequency distribution.
Let’s see how next.

## Tally marks in frequency distribution

Tally marks are used to show the frequency distribution of observations. A tally mark is written using the symbol “|”, a straight vertical line. A tally mark is a counter that records the frequency of an observation. A single tally mark “|” shows a count of one, and two tally marks “||” show a count of two.
We keep adding one tally mark as the count increases. When the count reaches five, we do not write a fifth tally mark after the first four. Instead, a line is drawn across the first four tally marks. This crossed group of five is written in this section as ~~||||~~.

Count = 1, tally marks = |
Count = 2, tally marks = ||
Count = 3, tally marks = |||
Count = 4, tally marks = ||||
Count = 5, tally marks = ~~||||~~
Count = 6, tally marks = ~~||||~~ |
Count = 7, tally marks = ~~||||~~ ||
Count = 8, tally marks = ~~||||~~ |||
Count = 9, tally marks = ~~||||~~ ||||
Count = 10, tally marks = ~~||||~~ ~~||||~~

### How are tally marks used?

In a frequency distribution table, the first column contains the name of the observation, such as marks, height or age. The second column contains the tally marks and the third column holds the count of tally marks.

Let’s revise Table 5 with tally marks.

Table 6: Frequency distribution of number of students who like a specific fruit

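| Fruit | Tally marks | Liked by total number of students |
|---|---|---|
| Cherry | ~~\|\|\|\|~~ | 5 |
| Apple | ~~\|\|\|\|~~ \| | 6 |
| Grapes | \|\|\|\| | 4 |
| Bananas | \|\| | 2 |
| Pineapple | \| | 1 |
| Orange | \|\| | 2 |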

Interpretation of Table 6

In Table 6, the first column has the name of the observation, which is the fruit’s name.
The second column has vertical lines, one for each occurrence of the value in the first column.
For each student who likes a fruit, a tally mark is written in the second column, next to the previous one. This process is repeated until all observations in the raw data have been counted.
If an observation occurs twice, two vertical lines are used, i.e. ||, and so on up to four lines.
When an observation occurs five times, four vertical lines are drawn and a fifth line is drawn diagonally across them, which counts as the fifth tally mark, i.e. ~~||||~~.
In the third column, we write the number of tally marks in each row.

Note

Tally marks are grouped in bundles of at most five.
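
The bundling rule can be expressed as a small Python function. This is a sketch under stated assumptions: `tally_marks` is a hypothetical helper name, and the default `bundle` string `~~||||~~` is only a plain-text stand-in for four strokes crossed by a fifth, since the crossed symbol cannot be typed directly.

```python
def tally_marks(count: int, bundle: str = "~~||||~~") -> str:
    """Return tally marks for a non-negative count, in bundles of five.

    `bundle` is an assumed text stand-in for a crossed group of five.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # Full bundles of five, then the leftover single strokes.
    full_bundles, remainder = divmod(count, 5)
    parts = [bundle] * full_bundles
    if remainder:
        parts.append("|" * remainder)
    return " ".join(parts)


for n in (1, 4, 5, 6, 10):
    print(n, tally_marks(n))
```

The function uses `divmod` so that the number of complete bundles and the number of remaining strokes come from a single division by five.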

## Types of frequency distribution

In a frequency distribution, raw data is represented in table form using tally marks so that the information contained in the raw data can be easily understood.
Frequency distributions are of two types.

1. Discrete frequency distribution
2. Continuous frequency distribution or grouped frequency distribution

### 1. Discrete frequency distribution

A discrete frequency distribution is used when each observation takes a discrete, single value. For example, Table 6, the frequency distribution of the number of students who like a specific fruit, is a discrete frequency distribution, because each value in the first column is the name of a fruit, which is a single value.

Example

1, 2, 6, 4, 5, 6, 4, 2, 3, 1, 2, 1, 2, 4, 3 are the numbers that appear on a die when it is thrown 15 times. This data can be entered in a discrete frequency distribution table, as shown below in Table 7.

Table 7: Discrete frequency distribution of number appeared on dice

| Number | Tally marks | Frequency |
|---|---|---|
| 1 | \|\|\| | 3 |
| 2 | \|\|\|\| | 4 |
| 3 | \|\| | 2 |
| 4 | \|\|\| | 3 |
| 5 | \| | 1 |
| 6 | \|\| | 2 |

Interpretation of Table 7

Table 7 tells how many times each number on the die appears, that is, the frequency of each number when the die is thrown. The frequency is shown by the tally marks in the second column and as a count in the third column.

1. Number 2 appears four times, which means 2 has the maximum frequency.
2. Number 5 appears only once, which means 5 has the minimum frequency.
3. Numbers 3 and 6 appear two times each, so 3 and 6 have the same frequency.
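
The readings taken from Table 7 can be checked with a short Python sketch. The names `throws`, `most_frequent` and `least_frequent` are illustrative, and `collections.Counter` is again just one possible way to count the values.

```python
from collections import Counter

# Numbers shown on the die over 15 throws.
throws = [1, 2, 6, 4, 5, 6, 4, 2, 3, 1, 2, 1, 2, 4, 3]

frequency = Counter(throws)

# List the distribution in order of the number on the die.
for number in sorted(frequency):
    print(number, frequency[number])

# Find which numbers have the maximum and minimum frequency.
highest = max(frequency.values())
lowest = min(frequency.values())
most_frequent = sorted(n for n, f in frequency.items() if f == highest)
least_frequent = sorted(n for n, f in frequency.items() if f == lowest)

print("maximum frequency:", most_frequent)
print("minimum frequency:", least_frequent)
```

The results are kept as lists because more than one number can share the maximum or minimum frequency, even though only one does in this data.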

Here, Table 7 has only 6 distinct values. But when the number of distinct observations is very large, it is best to group the observations and find the frequency of each group.
A continuous or grouped frequency distribution is one such method of grouping observations.
Let’s see next how it helps to find the frequencies of large data.

### 2. Continuous or grouped frequency distribution

As said earlier, a grouped frequency distribution is suitable when the number of observations is large. In a grouped frequency distribution, raw data is arranged into groups.

The groups are made in such a way that each group covers a range of values between a smallest and a greatest value, usually of equal size for every group. Each such group is called a class.

Example

The marks obtained by 40 students in the class test are as follows.
25, 35, 45, 20, 8, 10, 18, 28, 37, 48, 5, 49, 34, 37, 25, 31, 42, 47, 18, 19, 20, 12, 11, 9, 8, 14, 15, 17, 20, 24, 28, 30, 32, 34, 17, 23, 48, 40, 41, 17
As this raw data of marks is large, it can be arranged in groups as follows:

• 0 – 10
• 10 – 20
• 20 – 30
• 30 – 40
• 40 – 50

Here, each group spans a difference of 10; in the first group, 0 is the smallest value and 10 is the largest. Each of the groups 0 – 10, 10 – 20, 20 – 30, 30 – 40 and 40 – 50 is a class.
Every class has a lower class limit and an upper class limit. For example, in the class 0 – 10, 0 is the lower class limit and 10 is the upper class limit.
The difference between the upper class limit and the lower class limit is called the width or size of the class interval. The mid value of a class is called its class mark.

Table 8 below shows how a grouped frequency distribution table can be made from the above groups.

Table 8: Grouped frequency distribution of marks obtained by students in a class test

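| Marks | Tally marks | No. of students |
|---|---|---|
| 0 – 10 | \|\|\|\| | 4 |
| 10 – 20 | ~~\|\|\|\|~~ ~~\|\|\|\|~~ \| | 11 |
| 20 – 30 | ~~\|\|\|\|~~ \|\|\|\| | 9 |
| 30 – 40 | ~~\|\|\|\|~~ \|\|\| | 8 |
| 40 – 50 | ~~\|\|\|\|~~ \|\|\| | 8 |
| Total | | 40 |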

Interpretation of table 8

The value 10 appears as a limit in both classes 0 – 10 and 10 – 20, but it is not counted in both. It belongs to only one class, namely 10 – 20.
Similarly, 20 appears in both classes 10 – 20 and 20 – 30, but it belongs only to the class 20 – 30.

So, the class 0 – 10 includes the mark 0 but excludes 10.
Similarly, the class 10 – 20 includes the mark 10 but excludes 20.
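
The rule that a class includes its lower limit but excludes its upper limit can be sketched in Python to reproduce the counts of Table 8. The names `marks`, `classes` and `grouped` are illustrative. Note that under this rule a mark of exactly 50 would fall outside the class 40 – 50; no such mark occurs in this data.

```python
# Marks obtained by 40 students in the class test.
marks = [
    25, 35, 45, 20, 8, 10, 18, 28, 37, 48,
    5, 49, 34, 37, 25, 31, 42, 47, 18, 19,
    20, 12, 11, 9, 8, 14, 15, 17, 20, 24,
    28, 30, 32, 34, 17, 23, 48, 40, 41, 17,
]

# Classes as (lower limit, upper limit) pairs of width 10.
classes = [(lower, lower + 10) for lower in range(0, 50, 10)]

# A mark belongs to a class when lower <= mark < upper,
# so 10 is counted in 10-20 and not in 0-10.
grouped = {
    (lower, upper): sum(1 for m in marks if lower <= m < upper)
    for lower, upper in classes
}

for (lower, upper), count in grouped.items():
    print(f"{lower} - {upper}: {count}")

print("Total:", sum(grouped.values()))
```

Because every mark satisfies the condition for exactly one class, the class frequencies add up to the total number of students.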

In the class 0 – 10, 0 is the lower class limit and 10 is the upper class limit. Similarly, in the class 10 – 20, 10 is the lower class limit and 20 is the upper class limit.

The width or size of the class interval 0 – 10 is 10 – 0 = 10.
The class mark of the class 0 – 10 is $$\frac{0 + 10}{2} = 5$$.

Formula

Width of class = upper class limit – lower class limit

Class mark = $$\frac{\text{lower class limit} + \text{upper class limit}}{2}$$
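
The two formulas translate directly into code. In this minimal Python sketch, `class_width` and `class_mark` are hypothetical helper names, applied to the five classes used in the marks example.

```python
def class_width(lower: float, upper: float) -> float:
    """Width of class = upper class limit - lower class limit."""
    return upper - lower


def class_mark(lower: float, upper: float) -> float:
    """Class mark = (lower class limit + upper class limit) / 2."""
    return (lower + upper) / 2


for lower, upper in [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]:
    print(f"{lower} - {upper}: width {class_width(lower, upper)}, "
          f"class mark {class_mark(lower, upper)}")
```

Every class in this example has the same width, while the class mark moves up by 10 from one class to the next.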

Last updated on: 23-06-2024