Chapter 7.2 - Big Data

Time Estimate: 45 minutes

7.2.1. Introduction and Goals

We live in the information age, with exponential growth of data. In 2010, Eric Schmidt, the CEO of Google, said, "There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." In 2019, the World Economic Forum estimated that "the entire digital universe is expected to reach 44 zettabytes by 2020."

How much is an exabyte or a zettabyte? Here is a visualization and a table from the same article at the World Economic Forum.
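To make these units concrete, the short Python sketch below converts between them. It assumes the decimal (SI) definitions, in which each unit is 1,000 times the previous one. The names UNITS and to_bytes are made up for this illustration and do not come from the article.

```python
# Decimal (SI) byte units: each unit is 1,000 times the previous one.
# (Assumption: decimal prefixes; some systems use powers of 1,024 instead.)
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]

def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    power = UNITS.index(unit)  # 0 for byte, 1 for kilobyte, and so on
    return amount * 1000 ** power

for unit in UNITS:
    print(f"1 {unit} = {to_bytes(1, unit):,} bytes")

# The 44 zettabytes quoted above, expressed in exabytes.
zettabytes_in_exabytes = to_bytes(44, "zettabyte") // to_bytes(1, "exabyte")
print(f"44 zettabytes = {zettabytes_in_exabytes:,} exabytes")
```

Operating systems and memory manufacturers sometimes use binary units instead, where each step is a factor of 1,024, so reported sizes can differ slightly from these figures.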

Learning Objectives: I will learn to

  • describe what information can be extracted from data

  • identify what qualifies as big data

  • describe challenges associated with processing big data sets

  • recognize both benefits and harms of using big data

Language Objectives: I will be able to

  • discuss privacy and security concerns related to a data set

  • use target vocabulary, such as megabyte, gigabyte, and terabyte, while describing the effects of big data, with the support of concept definitions from this lesson

7.2.2. Learning Activities

Big Data

We live in the era of Big Data, which refers to data sets that are too large to fit on a normal computer or be processed by a standard spreadsheet or database program. Large data sets are difficult to process using a single computer and may require parallel systems (multiple computers working together to run an algorithm). Scalability of systems is an important consideration when working with large data sets, as the computational capacity of a system affects how data sets can be processed and stored.
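The idea of a parallel system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: worker threads on one computer stand in for the separate computers a real system would use, the data set is a tiny made-up list of sentences, and the helper names count_words and split_into_chunks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(lines):
    """Work done by one worker: count the words in its share of the data."""
    return sum(len(line.split()) for line in lines)

def split_into_chunks(items, n_chunks):
    """Divide the data set into n_chunks roughly equal pieces."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical tiny data set standing in for a very large one.
data = ["big data is big"] * 1000 + ["data science extracts information"] * 500

chunks = split_into_chunks(data, n_chunks=4)

# Each worker processes one chunk; the partial results are then combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_words, chunks))

total_words = sum(partial_counts)
print("Words counted by each worker:", partial_counts)
print("Total words:", total_words)
```

The pattern is split, process, combine: because each chunk can be counted independently, adding more workers lets the same program handle a larger data set, which is what scalability means in this context.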

We will explore Big Data through a number of videos from the PBS documentary, The Human Face of Big Data. We will start with a short (2:31) video, Everything Is Quantifiable.

Everything is Quantifiable

Q-1: True or False: A Terabyte is equivalent to 1000 bytes.

A. True

B. False

Q-2: True or False: Big data only contains numeric data; it does not include text, images, or videos.

A. False

B. True

Q-3: The term Big Data refers to _________________.

A. data sets that are stored in the cloud

B. data sets that contain very large numbers

C. data sets that are owned by a big corporation

D. data sets that are too large and complex to download and process on a single computer

Data Science

The field of Data Science deals with extracting information from and visualizing the results of manipulating large data sets. The size of a data set affects the amount and quality of information that can be extracted from it. From this information, further analysis may yield knowledge or even wisdom. Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data. We often think of data, information, knowledge and wisdom forming a pyramid.

DIKW Pyramid
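The first step up the pyramid, from data to information, is the step a program can do most directly. The sketch below assumes a small made-up set of step-counter readings (the values and variable names are invented for illustration) and extracts two pieces of information from them.

```python
from collections import defaultdict

# Data: raw, unprocessed records (hypothetical readings invented for this sketch).
readings = [
    ("Mon", 4200), ("Mon", 3100),
    ("Tue", 2500), ("Tue", 2700),
    ("Wed", 6100), ("Wed", 5900),
]

# Information: patterns extracted from the data, here the total steps per day.
steps_per_day = defaultdict(int)
for day, steps in readings:
    steps_per_day[day] += steps

# More information: which day had the highest total.
most_active_day = max(steps_per_day, key=steps_per_day.get)

print(dict(steps_per_day))
print("Most active day:", most_active_day)
```

Moving further up the pyramid, to knowledge (why one day was more active) or wisdom (what to do about it), takes interpretation that this program does not attempt.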

Data provide opportunities for identifying trends, making connections, and addressing problems. Computing enables new methods of deriving information from data, driving monumental change across many disciplines, from art to business to science. Keep the DIKW pyramid in mind as you watch the short 3-minute video, Learning Revealed: Acquiring Language.

Acquiring Language

Q-4: Which of the following best matches statements from the video to the Data-Information-Knowledge-Wisdom pyramid?

A.

  • Information: The child said "water" most frequently in the kitchen and the bathroom

  • Knowledge: The child is likely to learn words heard in multiple locations

  • Data: The child said "Truck" for the first time at 11:45 on January 15, 2017

B.

  • Information: The child said "water" most frequently in the kitchen and the bathroom

  • Data: The child is likely to learn words heard in multiple locations

  • Knowledge: The child said "Truck" for the first time at 11:45 on January 15, 2017

C.

  • Data: The child said "water" most frequently in the kitchen and the bathroom

  • Knowledge: The child is likely to learn words heard in multiple locations

  • Information: The child said "Truck" for the first time at 11:45 on January 15, 2017

Q-5: What does “data science” refer to?

A. Data science refers to manipulating large data sets to gain information from them.

B. Data science refers to data published along with peer-reviewed scientific research.

C. Data science refers to scientific information that is gained from scientific experiments.

Impacts of Big Data

Careful analysis of data can help us solve many problems. Watch the following 4-minute video to see how tracking data on The Smallest Heartbeat can help save a child's life.

The Smallest Heartbeat

Bias in Data

The path from data to information to knowledge is not always straightforward. Bias can be introduced into the collection and analysis of data with dangerous results. Care must be taken when collecting and analyzing data. Problems of bias are often caused by the type or source of data that is being collected. Bias is not eliminated by simply collecting more data.
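A small simulation can show why collecting more data does not remove bias. The sketch below rests on invented assumptions: a population made of two equally sized groups with different typical values, and a data source that reaches the first group 90% of the time. All names and numbers are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: two equally sized groups with different values.
group_a = [random.gauss(50, 5) for _ in range(50_000)]
group_b = [random.gauss(70, 5) for _ in range(50_000)]
population = group_a + group_b
true_mean = sum(population) / len(population)

def biased_sample(n):
    """Collect n values from a source that reaches group A 90% of the time."""
    return [random.choice(group_a) if random.random() < 0.9
            else random.choice(group_b) for _ in range(n)]

# Collect more and more data from the same biased source.
for n in (100, 10_000, 100_000):
    sample = biased_sample(n)
    estimate = sum(sample) / n
    print(f"n={n:>7}: estimate={estimate:.1f}, true mean={true_mean:.1f}")
```

The population average sits near 60, but the estimates should settle near 52 however large the sample grows, because the source itself over-represents one group. Larger samples make the estimate more stable, not more accurate; the fix is a more representative source.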

Joy Buolamwini of the MIT Media Lab studies the impact of bias in face recognition systems. Watch the following video about her research.

Activity: 7.2.2.6 YouTube (TWWsW1w-BVo)

The following spoken word piece by Joy Buolamwini highlights how computer systems based on incomplete data misinterpret the images of iconic black women.

Activity: 7.2.2.7 YouTube (QxuyfWoVV98)

Q-8: True or False: When Joy Buolamwini says that current face recognition systems are "pale and male," she means that because the data used to train these systems consisted largely of white, male faces, these systems perform poorly for other faces.

A. True

B. False

Q-9: Based on Joy Buolamwini’s research, IBM retrained its system using a more diverse set of faces. How would you interpret the new results?

Retrained IBM's System

A. The bias in the system was nearly entirely removed by retraining.

B. Retraining the system made the bias worse.

C. Retraining did not improve the system.

Big Data Activity: Exploring Data Sets

Explore some examples of big data and find at least two data sets that interest you. Some ideas of where to find data sets are below. Then, answer the following reflection questions in your portfolio.

  1. What specifically were the types of data (text, sounds, transactions, etc.) included in the data set you chose?

  2. What new facts did you learn when exploring the data set? List at least 3 facts.

  3. Write a question you have about the data set you chose. Now, convert that question into a hypothesis (a statement) with your prediction about the data.

  4. Identify at least one security and/or privacy concern that is associated with the data in the data set you chose.

  5. If your data set included a visualization, explain the purpose of the visualization. How would you change or improve the visualization? If it did not include a visualization, describe one that you think would be useful in understanding the data.

Here are some websites where you can explore big data sets.

7.2.3. Summary

In this lesson, you learned how to:

Learning Objective DAT-2.A: Describe what information can be extracted from data.

  • Information is the collection of facts and patterns extracted from data.

  • Data provide opportunities for identifying trends, making connections, and addressing problems.

Learning Objective DAT-2.C: Identify the challenges associated with processing data.

  • Problems of bias are often created by the type or source of data being collected. Bias is not eliminated by simply collecting more data.

  • The size of a data set affects the amount of information that can be extracted from it.

  • Large data sets are difficult to process using a single computer and may require parallel systems.

  • Scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.

Learning Objective DAT-2.D: Extract information from data using a program.

  • Programs can be used to process data to acquire information.

  • Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data.

  • Programs such as spreadsheets help efficiently organize and find trends in information.

Learning Objective IOC-1.A: Explain how an effect of a computing innovation can be both beneficial and harmful.

  • Advances in computing have generated and increased creativity in other fields, such as medicine, engineering, communications, and the arts.

7.2.4. Self-Check

Sample AP CSP Exam Question

Q-10:

A. Deleting entries from data

B. Backing up data

C. Searching through data

D. Sorting data

7.2.5. Reflection: For Your Portfolio

Answer the following portfolio reflection questions as directed by your instructor. Questions are also available in this Google Doc, where you may use File > Make a Copy to make your own editable copy.

  1. Choose one of the data sets listed above in the Activity section or one that you find on your own and give a brief description of it. What specifically were the types of data (text, sounds, transactions, etc.) included in the data set you chose?

  2. What new facts did you learn when exploring the data set? List at least 3 facts.

  3. Write a question you have about the data set you chose. Now, convert that question into a hypothesis (a statement) with your prediction about the data.

  4. Identify at least one security and/or privacy concern that is associated with the data in the data set you chose.

  5. If your data set included a visualization, explain the purpose of the visualization. How would you change or improve the visualization? If it did not include a visualization, describe one that you think would be useful in understanding the data.

Portfolio Reflection Questions

Make a copy of this document in your Portfolio Assignments folder and answer these questions in the spaces below. Once complete, turn in this assignment according to the steps given by your teacher.

7.2 Big Data Curriculum Page

Answer the following questions:

1. Choose one of the data sets listed above in the Activity section or one that you find on your own and give a brief description of it. What specifically were the types of data (text, sounds, transactions, etc.) included in the data set you chose?

Answer

2. What new facts did you learn when exploring the data set? List at least 3 facts.

Answer

3. Write a question you have about the data set you chose. Now, convert that question into a hypothesis (a statement) with your prediction about the data.

(Hypotheses take the form of "If __________, then _________." For example, a hypothesis about the student debt data could be, "If the tuition costs are higher at an institution, then the student debt will be higher.")

Answer

4. Identify at least one security and/or privacy concern that is associated with the data in the data set you chose.

Answer

5. If your data set included a visualization, explain the purpose of the visualization. How would you change or improve the visualization? If it did not include a visualization, describe one that you think would be useful in understanding the data.

Answer
