A Simple Taxonomy of Data

I decided to document some of the projects and teamwork I was involved in during my Master's programme, especially those related to data mining, here on my blog. I have already written two articles, here and here. Today, however, I will be sharing an interesting article (well, maybe interesting to me alone 🤷‍♀️) on a simple taxonomy of data. This article is an extract from a text that was recommended to us during the course: Business Intelligence and Analytics: Systems for Decision Support, Tenth Edition (Ramesh Sharda, Dursun Delen, Efraim Turban). I found this classification interesting and very insightful, so I decided to share it because I know that a lot of people will also find it useful. Let's dive right into it 👇👇👇

Data refers to a collection of facts usually obtained as the result of experiences, observations, or experiments. Data may consist of numbers, letters, words, images, voice recordings, and so on as measurements of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge is derived. At the highest level of abstraction, one can classify data as structured and unstructured (or semistructured). Unstructured/semistructured data is composed of any combination of textual, imagery, voice, and web content. Structured data is what data mining algorithms use, and can be classified as categorical or numeric. The categorical data can be subdivided into nominal or ordinal data, whereas numeric data can be subdivided into interval or ratio. The diagram below shows a simple taxonomy of data in data mining.


👉A categorical variable represents a type of data that may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level. While the latter two variables may also be treated numerically by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups.
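
To make the age example concrete, here is a small Python sketch of my own (it is not from the textbook) that turns exact ages into age groups. The cut-off ages and the names `AGE_BOUNDS`, `AGE_LABELS` and `age_group` are assumptions I picked purely for illustration; the group labels are the ones used in the ordinal example further down.

```python
from bisect import bisect_right

# Hypothetical cut-off ages, chosen only for illustration.
# Each value is the first age that belongs to the NEXT group.
AGE_BOUNDS = [13, 30, 60]
AGE_LABELS = ["child", "young", "middle-aged", "elderly"]


def age_group(age):
    """Map an exact age in years to a coarse age-group label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many bounds are <= age, which is the group index.
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


ages = [4, 13, 29, 45, 71]
groups = [age_group(a) for a in ages]
print(groups)  # ['child', 'young', 'young', 'middle-aged', 'elderly']
```

Different cut-offs would give different groups, so in a real project the bounds should come from the problem at hand rather than from this sketch.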

👉Nominal data contain simple codes assigned to objects as labels; these codes are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced. Nominal data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad), or multinomial values having three or more possible values (e.g., brown/green/blue, white/black/Latino/Asian, single/married/divorced).
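
Because the codes 1, 2 and 3 carry no order, one common way to feed nominal data to an algorithm is one-hot encoding, where each category gets its own 0/1 column. The sketch below is my own illustration rather than something from the textbook, and the helper name `one_hot` is hypothetical.

```python
def one_hot(values):
    """One-hot encode a list of nominal labels.

    Returns the sorted list of categories and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


marital_status = ["single", "married", "divorced", "married"]
categories, encoded = one_hot(marital_status)

print(categories)  # ['divorced', 'married', 'single']
for row in encoded:
    print(row)
# [0, 0, 1]
# [0, 1, 0]
# [1, 0, 0]
# [0, 1, 0]
```

The columns are sorted alphabetically only to make the output reproducible; the order of the columns has no meaning, which is exactly the point of nominal data.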

👉Ordinal data contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high. Similar ordered relationships can be seen in variables such as age group (i.e., child, young, middle-aged, elderly) and educational level (i.e., high school, college, graduate school). Some data mining algorithms, such as ordinal multiple logistic regression, take into account this additional rank-order information to build a better classification model.
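
Here is a short sketch of my own (again, not from the textbook) showing what the rank order buys you. The mapping `CREDIT_RANK` follows the credit score example above; the sample list `scores` is made up for illustration.

```python
# Rank codes follow the credit score example: (1) low, (2) medium, (3) high.
CREDIT_RANK = {"low": 1, "medium": 2, "high": 3}

# Made-up sample values, for illustration only.
scores = ["medium", "low", "high", "medium", "high"]

# Integer codes that preserve the rank order.
codes = [CREDIT_RANK[s] for s in scores]

# Order-based operations such as max, sorting and the median make sense here.
best = max(scores, key=CREDIT_RANK.get)
ranked = sorted(scores, key=CREDIT_RANK.get)
median_label = ranked[len(ranked) // 2]

print(codes)               # [2, 1, 3, 2, 3]
print(best, median_label)  # high medium
```

Note that the gaps between the codes are not meaningful: "high" is ranked above "low", but nothing says it is three times as much, so averaging the codes would be a stretch.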

👉Numerical data represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in U.S. dollars), travel distance (in miles), and temperature (in Fahrenheit degrees). Numeric values representing a variable can be integer (taking only whole numbers) or real (also taking fractional values). Numeric data may also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values. Unlike a discrete variable, which represents finite, countable data, a continuous variable represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values.

👉Interval data are variables that can be measured on interval scales. A common example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure; that is, there is no absolute zero value.

👉Ratio data include measurement variables commonly found in the physical sciences and engineering. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value. For example, the Kelvin temperature scale has a non-arbitrary zero point of absolute zero, which is equal to -273.15 degrees Celsius. This zero point is non-arbitrary, because the particles that comprise matter at this temperature have zero kinetic energy.
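
The practical difference between interval and ratio scales shows up when you divide two values. In the sketch below (my own illustration, not from the textbook), differences agree on both temperature scales, but "twice as hot" only means something in Kelvin. The two sample temperatures are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (absolute zero is -273.15 degrees Celsius)."""
    return celsius + 273.15


# Two arbitrary sample temperatures in degrees Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale and match across both scales.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios depend on where zero sits, so only the Kelvin ratio is physically meaningful.
ratio_c = b_c / a_c
ratio_k = b_k / a_k

print(f"difference: {diff_c:.2f} C vs {diff_k:.2f} K")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

So 20 degrees Celsius is 10 degrees warmer than 10 degrees Celsius, but it is not "twice as warm": on the Kelvin scale the ratio is only about 1.035.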

Other data types, including textual, spatial, imagery, and voice, need to be converted into some form of categorical or numeric representation before they can be processed by data mining algorithms. 
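
As one example of such a conversion, text is often turned into word counts (a "bag of words"). This technique is my own choice of illustration and is not named in the textbook extract; the helper `bag_of_words` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["Data mining uses data", "Text is unstructured data"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)  # ['data', 'is', 'mining', 'text', 'unstructured', 'uses']
print(vectors)     # [[2, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]]
```

Each sentence is now a row of numbers, which is the kind of numeric representation a data mining algorithm can work with.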

Note: This article is an extract from the text Business Intelligence and Analytics: Systems for Decision Support, Tenth Edition (Ramesh Sharda, Dursun Delen, Efraim Turban). Hence, I do not take any credit for this piece. I just decided to share it here because it has been useful to me and I know it will be useful to other people as well. However, if you want to read more and expand your knowledge on this topic, you can get the textbook from Amazon or any bookstore around you.

I hope you love this article as much as I do 😉. Until next time...💋


