Data Collection and Organization

Introduction to data and information

All around us, we come across different kinds of information that help us make a decision, reach a conclusion, or give someone a better suggestion.

What information do we see around us, and where does it come from? Daily life offers many sources of information, such as television, newspapers, websites and conversations with people.

What we gather from these sources is known as information. It can be of any type: train details read from the chart at a railway station, the likely winner of a match read from the scoreboard, the party most likely to win an election read from the results board, or details of which students are absent from a school.

So is there something behind information, something from which it is generated?

The answer is data. Data is the basic building block from which information is built. Information sources collect data and process it to provide useful information in a form that people can understand. Let’s take a closer look at data and information through examples.


Example 1

A teacher who maintains a daily attendance sheet for a class is collecting data about which students are present or absent each day. The data is collected for every student and written in the attendance sheet as absent, present or sick against the student’s name.

Table 1: Attendance sheet July, 2020

Roll number    Attendance

The attendance sheet in Table 1 is thus filled with data about the students. This data helps the teacher or the school office analyse attendance trends and draw conclusions, such as:

  1. How many students attended school in the month of July?
  2. How many students were absent in the month of July?
  3. How many students were sick in the month of July?

These questions can be answered by observing and analysing the data in the attendance sheet, and the answers obtained are known as information.


Example 2

Table 2: Student’s conveyance sheet July, 2020

Roll number    Conveyance
1              School bus
2              By walk
3              School bus
4              School bus
5              By walk
10             By walk

The data in Table 2 can help the teacher or the school office analyse how students travel and draw conclusions, such as:

  1. How many students use each type of conveyance?
  2. Which conveyance do students use to reach school: a school bus, a train, or walking?

This information can help the school office make decisions or reach conclusions, and so give a better suggestion to the head of the school. One such conclusion is the number of students who use the school bus.

In other words, this conclusion, which is the information, can help the head of the school decide to run more school buses if the number of students who use the bus exceeds the number of seats available on it.
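The step from data to information to decision can be sketched in a few lines of Python. This is only an illustration: the rows are those of Table 2, while `bus_seats` is a hypothetical seat count that does not come from the text.

```python
from collections import Counter

# Rows of Table 2: roll number -> conveyance.
conveyance = {
    1: "School bus",
    2: "By walk",
    3: "School bus",
    4: "School bus",
    5: "By walk",
    10: "By walk",
}

# Data -> information: count how many students use each conveyance.
counts = Counter(conveyance.values())
bus_users = counts["School bus"]

# Hypothetical number of seats on the bus (an assumption, not from Table 2).
bus_seats = 2

# Information -> decision: more buses are needed only if users exceed seats.
need_more_buses = bus_users > bus_seats

print(counts)
print("Run more buses:", need_more_buses)
```

With a different assumed value of `bus_seats`, the same information could lead to a different decision.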

The examples above showed what data and information are, and how information helps in making decisions and drawing conclusions.
Next, we see how statistics handles data mathematically and how information is obtained from data.

What is data in statistics?

Data is the plural form of datum, which is a Latin word. A piece of data is a value that can be used to prepare information about something. Such values can be facts or numerical figures.

Each value of data collected is called an observation. The data collected at the initial stage is called raw data.
In statistics, we study data that can be measured numerically.


Here is an example of data: the marks scored by 10 students in a maths test.
50, 45, 46, 30, 21, 25, 32, 49, 47, 18
Each of the marks 50, 45, 46, 30, 21, 25, 32, 49, 47 and 18 is called an observation.
The whole list of marks scored by the 10 students is data, and it remains raw data until it is placed in tables in a meaningful manner.

Therefore, from the above example, we can say that a collection of observations is called data.

Next, we see how the data is collected and organised in tables in statistics.

Collection of data

Collection of data starts with gathering raw data from sources. A source can be anything, such as television, newspapers, websites or conversations with people. The raw data collected can be bulky and usually comes in an unorganised form.

Raw data is a mixture of unorganised values from which information cannot be extracted directly. When raw data is processed or cleaned, for example by sorting or refining it, it becomes usable data.

We can classify data into two types, depending on how it is collected:

1. Primary data

Data that is collected directly through experiments, questionnaires or interviews is called primary data.
In other words, primary data is firsthand data. It has not been used by anyone other than the person who collected it.

2. Secondary data

Secondary data is not firsthand data. It has already been collected and processed by someone else. This type of data is not collected by the user personally but is obtained from published or unpublished sources.

Organization of data

To get meaningful information from raw data, the data is processed. Processing involves cleaning the data, which includes sorting it, transferring it into tables or charts, or representing it graphically. In a table, data is entered in columns, and each row can carry a sequence number.

Raw data collected for analysis is mostly unorganised at first. Unorganised data is difficult to use to get meaningful information.
So, to reach a meaningful conclusion, the raw data is organised in a systematic manner. Some important and common ways of organising raw data are:

  1. Alphabetical order or in a sequence
  2. Ascending order
  3. Descending order

These three ways of organising data simply rearrange it: alphabetically or in sequence, or in ascending or descending order if the data is numerical.
Raw data that has been put in ascending or descending order is called an array or arrayed data.
Below is an example of organising numerical data in ascending and descending order.


Here is the list of marks obtained by 10 students in a maths test, out of 50.
50, 45, 46, 30, 21, 25, 32, 49, 47, 18

The data is organised into two columns with column labels as Roll number and Marks obtained (out of 50).

Table 3: Marks obtained by students

Roll number    Marks obtained (out of 50)

So, this data can be arranged in ascending order as follows:
18, 21, 25, 30, 32, 45, 46, 47, 49, 50
In descending order, the data can be arranged as follows:
50, 49, 47, 46, 45, 32, 30, 25, 21, 18
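The two arrangements above can be reproduced with Python's built-in `sorted` function. This is a minimal sketch; the variable names are chosen only for illustration.

```python
# Raw data: marks obtained by 10 students in a maths test (out of 50).
marks = [50, 45, 46, 30, 21, 25, 32, 49, 47, 18]

# sorted() returns a new list and leaves the raw data unchanged.
ascending = sorted(marks)
descending = sorted(marks, reverse=True)

print(ascending)   # [18, 21, 25, 30, 32, 45, 46, 47, 49, 50]
print(descending)  # [50, 49, 47, 46, 45, 32, 30, 25, 21, 18]
```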

What is frequency in statistics?

In general, frequency means how many times an observation or an event occurs in a specific period or for a specific value. In statistics, it refers to the number of observations that take a particular value or fall in a range of values.
By definition, the number of times a particular entry or observation occurs is called its frequency.

To understand frequency distribution, let’s start with the following example.


Consider 20 students, each of whom likes one fruit from cherry, apple, grapes, banana, pineapple and orange.
The fruits liked by the 20 students are, respectively: cherry, apple, grapes, banana, pineapple, apple, cherry, banana, orange, apple, cherry, apple, cherry, grapes, grapes, orange, grapes, apple, apple and cherry.
When the number of observations is large, it is preferable to represent the data in table form.
So, this data can be represented in a table as follows.

Table 4: Fruit liked by 20 students

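Student Number    Fruit
1                 Cherry
2                 Apple
3                 Grapes
4                 Banana
5                 Pineapple
6                 Apple
7                 Cherry
8                 Banana
9                 Orange
10                Apple
11                Cherry
12                Apple
13                Cherry
14                Grapes
15                Grapes
16                Orange
17                Grapes
18                Apple
19                Apple
20                Cherry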

Table 4 shows that one fruit can be liked by more than one student; that is, the same fruit occurs in more than one observation.

The data can be summarised in another table that shows how many students like each fruit.

Table 5: Number of students who like a specific fruit

Fruit        Liked by total number of students
Cherry       5
Apple        6
Grapes       4
Banana       2
Pineapple    1
Orange       2

Table 5 gives the total number of students who like each fruit.
That is, cherry is liked by a total of 5 students in the class; similarly, apple is liked by 6 students, grapes by 4 students, banana by 2 students, pineapple by 1 student and orange by 2 students.

Here the fruit is called the variate, and the number of students who like a particular fruit is called the frequency of the variate.

This example, which counts how many students like each fruit, is an example of frequency: for each specific value (a fruit), we find the number of observations (students) with that value.
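The counting described above can be sketched with `collections.Counter` from the Python standard library. The list repeats the raw data of the 20 students; the variable names are illustrative only.

```python
from collections import Counter

# Raw data: the fruit liked by each of the 20 students, in order.
fruits = [
    "cherry", "apple", "grapes", "banana", "pineapple",
    "apple", "cherry", "banana", "orange", "apple",
    "cherry", "apple", "cherry", "grapes", "grapes",
    "orange", "grapes", "apple", "apple", "cherry",
]

# Each distinct fruit is a variate; its count is the frequency of the variate.
frequency = Counter(fruits)

for fruit, count in frequency.items():
    print(fruit, count)
```

The frequencies add up to 20, the total number of observations.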

Table 5 can be revised to show the frequency with tally marks; the resulting table is known as a frequency distribution.
Let’s see how.

Tally marks in frequency distribution

Tally marks are used to show the frequency distribution of an observation. A tally mark is a straight vertical line, written with the symbol “|”. It acts as a counter for the frequency of an observation: a single tally mark “|” shows a count of one, and two tally marks “||” show a count of two.
We keep adding one tally mark as the count increases. When the count reaches five, we do not write a fifth vertical mark after the four. Instead, a line is drawn across the first four tally marks.

Count = 1, tally marks = |
Count = 2, tally marks = ||
Count = 3, tally marks = |||
Count = 4, tally marks = ||||
Count = 5, tally marks = \(\bcancel{||||}\)
Count = 6, tally marks = \(\bcancel{||||}\)|
Count = 7, tally marks = \(\bcancel{||||}\)||
Count = 8, tally marks = \(\bcancel{||||}\)|||
Count = 9, tally marks = \(\bcancel{||||}\)||||
Count = 10, tally marks = \(\bcancel{||||}\)\(\bcancel{||||}\)
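The pattern in this list can be written as a small function. The sketch below assumes the same LaTeX notation as the list, with `\(\bcancel{||||}\)` standing for a complete bundle of five; `tally_marks` is a hypothetical helper name introduced only for this example.

```python
def tally_marks(count: int) -> str:
    """Return the tally marks for a non-negative count, in the notation used above."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # bundles = complete groups of five, singles = leftover strokes (0 to 4).
    bundles, singles = divmod(count, 5)
    # Each complete group of five is four strokes with a line drawn across them.
    return r"\(\bcancel{||||}\)" * bundles + "|" * singles


for n in range(1, 11):
    print(f"Count = {n}, tally marks = {tally_marks(n)}")
```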

How are tally marks used?

In a frequency distribution table, the first column contains the name of the observation, such as marks, height or age. The second column contains the tally marks, and the third column holds the count of the tally marks.

Let’s revise Table 5 with tally marks.

Table 6: Frequency distribution of number of students who like a specific fruit

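Fruit        Tally marks              Liked by total number of students
Cherry       \(\bcancel{||||}\)       5
Apple        \(\bcancel{||||}\) |     6
Grapes       ||||                     4
Banana       ||                       2
Pineapple    |                        1
Orange       ||                       2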

Interpretation of Table 6

In Table 6, the first column holds the name of the observation, which is the name of the fruit.
The second column holds the tally marks for the value in the first column.
For each student who likes a fruit, one tally mark is written in the second column, next to the previous one. This is repeated until all observations in the raw data have been counted.
If an observation occurs twice, two vertical lines are used, i.e. ||, and so on up to four lines.
When an observation occurs five times, four vertical lines are written and a fifth line is drawn diagonally across them, i.e. \(\bcancel{||||}\).
In the third column, we write the number of tally marks in each row.


Tally marks are grouped in bundles of at most five.

Types of frequency distribution

In a frequency distribution, raw data is represented in table form using tally marks so that the information contained in the raw data can be easily understood.
Frequency distributions are of two types.

  1. Discrete frequency distribution
  2. Continuous frequency distribution or grouped frequency distribution

1. Discrete frequency distribution

A discrete frequency distribution is used when each observation takes a discrete, single value. For example, Table 6, the frequency distribution of the number of students who like a specific fruit, is a discrete frequency distribution, because the value in the first column is the name of a fruit, which is a single value.


The numbers 1, 2, 6, 4, 5, 6, 4, 2, 3, 1, 2, 1, 2, 4, 3 appeared on a dice when it was thrown 15 times. This data can be entered in a discrete frequency distribution table, as shown in Table 7.

Table 7: Discrete frequency distribution of number appeared on dice

Number    Tally marks    Frequency
1         |||            3
2         ||||           4
3         ||             2
4         |||            3
5         |              1
6         ||             2

Interpretation of Table 7

Table 7 shows how many times each number appeared when the dice was thrown, that is, the frequency of each number. The tally marks in the second column show this information.

  1. Number 2 appears four times, which means 2 has the maximum frequency.
  2. Number 5 appears only once, which means 5 has the minimum frequency.
  3. Numbers 3 and 6 appear two times each, so 3 and 6 have the same frequency.
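These three readings can be checked directly against the 15 throws. The sketch below is illustrative; it uses `collections.Counter` from the standard library, and the variable names are assumptions made for the example.

```python
from collections import Counter

# Raw data: the numbers that appeared in 15 throws of the dice.
throws = [1, 2, 6, 4, 5, 6, 4, 2, 3, 1, 2, 1, 2, 4, 3]
frequency = Counter(throws)

# One row per number. No count reaches five here, so plain strokes suffice.
for number in sorted(frequency):
    print(number, "|" * frequency[number], frequency[number])

highest = max(frequency.values())
lowest = min(frequency.values())

# Numbers with the maximum and the minimum frequency.
most_frequent = sorted(n for n, f in frequency.items() if f == highest)
least_frequent = sorted(n for n, f in frequency.items() if f == lowest)

print("Maximum frequency:", most_frequent, highest)
print("Minimum frequency:", least_frequent, lowest)
```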

Here, the number of distinct values in Table 7 was only 6. But when the observations take a large number of different values, it is best to group the observations and find the frequency of each group.
A continuous or grouped frequency distribution is one such method of grouping the observations.
Let’s see next how it helps to find frequencies in large data.

2. Continuous or grouped frequency distribution

As mentioned earlier, a grouped frequency distribution is suitable when the number of observations is large. In a grouped frequency distribution, raw data is arranged into groups.

The groups are made so that each one covers a range of values, from a smallest to a greatest observation, usually with the same width for every group. Each such group of observations is called a class.


The marks obtained by 40 students in a class test are as follows.
25, 35, 45, 20, 8, 10, 18, 28, 37, 48, 5, 49, 34, 37, 25, 31, 42, 47, 18, 19, 20, 12, 11, 9, 8, 14, 15, 17, 20, 24, 28, 30, 32, 34, 17, 23, 48, 40, 41, 17
As this set of raw data is large, it can be arranged in groups as follows:

  • 0 – 10
  • 10 – 20
  • 20 – 30
  • 30 – 40
  • 40 – 50

Here, the groups have a width of 10; in the first group, 0 is the smallest value and 10 is the largest. Each group 0 – 10, 10 – 20, 20 – 30, 30 – 40, 40 – 50 is a class.
Every class has a lower class limit and an upper class limit. For example, in the class 0 – 10, 0 is the lower class limit and 10 is the upper class limit.
The difference between the upper class limit and the lower class limit is called the width or size of the class interval. The mid value of a class is called the class mark.

Table 8 below shows how a grouped frequency distribution table can be made from these groups.

Table 8: Grouped frequency distribution of marks obtained by students in a class test

Marks      Tally marks                                 No. of students
0 – 10     ||||                                        4
10 – 20    \(\bcancel{||||}\) \(\bcancel{||||}\) |     11
20 – 30    \(\bcancel{||||}\) ||||                     9
30 – 40    \(\bcancel{||||}\) |||                      8
40 – 50    \(\bcancel{||||}\) |||                      8

Interpretation of table 8

The value 10 appears as a limit of both classes 0 – 10 and 10 – 20, but it is not counted in both. It belongs to only one class, i.e. 10 – 20.
Similarly, 20 appears in both classes 10 – 20 and 20 – 30, but 20 belongs only to the class 20 – 30.

So, the class 0 – 10 includes the mark 0 but excludes 10.
Similarly, the class 10 – 20 includes the mark 10 but excludes 20.
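The inclusion rule above (lower limit included, upper limit excluded) can be sketched as a comparison `lower <= mark < upper`. The code below applies it to the 40 marks; the variable names are illustrative assumptions.

```python
# Raw data: marks obtained by 40 students in a class test.
marks = [
    25, 35, 45, 20, 8, 10, 18, 28, 37, 48,
    5, 49, 34, 37, 25, 31, 42, 47, 18, 19,
    20, 12, 11, 9, 8, 14, 15, 17, 20, 24,
    28, 30, 32, 34, 17, 23, 48, 40, 41, 17,
]

# Classes as (lower class limit, upper class limit) pairs.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]

# A mark belongs to a class when lower <= mark < upper,
# so 10 is counted in 10 - 20 and not in 0 - 10.
grouped = {
    (lower, upper): sum(1 for m in marks if lower <= m < upper)
    for lower, upper in classes
}

for (lower, upper), count in grouped.items():
    print(f"{lower} - {upper}: {count}")
```

Under this rule a mark of exactly 50 would not fall in any of the five classes; the data above contains no such mark.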

In the class 0 – 10, 0 is the lower class limit and 10 is the upper class limit. Similarly, in the class 10 – 20, 10 is the lower class limit and 20 is the upper class limit.

The width or size of the class interval 0 – 10 is 10 – 0 = 10.
The class mark of class 0 – 10 is \(\frac{0 + 10}{2} = 5\).


\(\text{Width of class} = \text{upper class limit} - \text{lower class limit}\)

\(\text{Class mark} = \frac{\text{lower class limit} + \text{upper class limit}}{2}\)
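As a quick check, the two formulas can be applied to each class of Table 8. This is a minimal sketch; `class_width` and `class_mark` are hypothetical helper names used only for illustration.

```python
def class_width(lower: float, upper: float) -> float:
    """Width of a class: upper class limit minus lower class limit."""
    return upper - lower


def class_mark(lower: float, upper: float) -> float:
    """Class mark: the mid value of the class."""
    return (lower + upper) / 2


for lower, upper in [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]:
    width = class_width(lower, upper)
    mark = class_mark(lower, upper)
    print(f"{lower} - {upper}: width {width}, class mark {mark}")
```

Every class in Table 8 has the same width, 10, while the class marks are 5, 15, 25, 35 and 45.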