Background / Purpose

We have built the class around a set of data-driven projects. During each project we will cover readings in visualization, data handling, and statistics. Excluding the first two days and the last two days of class, we will spend four days of class on each case study. You can find data for each project on the data page.

Case Study 1: Introduction

Background

Data intuition and insight are not rocket science. However, there is a science to data handling, visualization, and decision making with data. The material we cover this semester will introduce you to topics around this science.

Your Challenge

Make sure you understand the course objectives and that you are introduced to the material and tools of the course.

Deliverables

  1. Complete the quizzes for the first two days of class in Canvas.

Case Study 2: Lego my data

Background

Lego has become one of the most valuable toys on the planet. You can even download software and create your own sets from bulk toys (https://www.leocad.org/index.html).

We will be selling bulk LEGO in 1-lb bags. According to this LEGO selling guide, the more organized the bricks are, the better they will sell. While most eBay sellers just take a couple of photos and list the weight in pounds, we will be selling our bags of LEGO with data visualizations. If you would like more details about each brick you can read here.

Your Challenge

You will need to work with your team to enter your LEGO data and then create a set of visualizations that sell your product. Our sales material must describe our bag of LEGO to buyers who cannot handle or see the product (no pictures of your bag of LEGO).

With your team, record information for each LEGO brick in your bag (such as color). Then create data visualizations to answer questions you think the buyers might have. Some example questions are listed below.

  • How many slanted bricks do you have?
  • What is the distribution of colors available in your set?
  • What is the total area by color?
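
As a starting point for questions like these, the sketch below tallies bricks by color, sums stud area by color, and counts slanted bricks. The column names (color, shape, length_studs, width_studs) and the sample rows are assumptions made for illustration, not the layout of the class template; adapt them to whatever your team records.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample of a team's brick log; the column names are assumptions.
SAMPLE_CSV = """color,shape,length_studs,width_studs
red,brick,4,2
red,slope,2,2
blue,brick,2,2
blue,plate,6,1
yellow,slope,3,1
"""


def summarize_bricks(csv_text):
    """Return (brick count by color, stud area by color, number of slanted bricks)."""
    count_by_color = Counter()
    area_by_color = defaultdict(int)
    slanted = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        color = row["color"]
        count_by_color[color] += 1
        # Area is measured in studs: length times width of the brick footprint.
        area_by_color[color] += int(row["length_studs"]) * int(row["width_studs"])
        if row["shape"] == "slope":
            slanted += 1
    return count_by_color, dict(area_by_color), slanted


counts, areas, slanted = summarize_bricks(SAMPLE_CSV)
print("Bricks by color:", dict(counts))
print("Stud area by color:", areas)
print("Slanted bricks:", slanted)
```

Each of the three summaries maps directly to a chart: a bar chart of counts by color, a bar chart of area by color, and a single headline number for slanted bricks.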

Deliverables

All of your deliverables will be provided in a 6-slide Google Slides presentation (on-campus students add slides 7-9). Use the provided template. Data collection happens as a team. Each student from the team should have their own charts.

  1. Slide 1: Include the team that collected the data and a project title.
  2. Slide 2: Include a data snippet and a description of why you chose your variables.
  3. Slide 3: A polished bar chart that you feel provides the best description of your data.
  4. Slides 4-6: Three charts ordered by quality. Include a description for each chart.
  5. Slides 7-9: Three pictures of your LEGO spread out (on-campus only).
  6. A complete .csv file of your data set.

CS2 Template

Case Study 3: What is normal about marathons?

Background

In a nod to Greek history, the first marathon in 1896 commemorated the run of the soldier Pheidippides from a battlefield near the town of Marathon, Greece, to Athens in 490 B.C. For the 1908 London Olympics, the course was laid out from Windsor Castle to White City stadium, about 26 miles. An extra 385 yards was added inside the stadium to locate the finish line in front of the royal family’s viewing box.

Despite the success of that first race, it took 13 more years of arguing before the International Amateur Athletic Federation (IAAF) adopted the 1908 distance as the official marathon. Today, there are more than 500 organized marathons in 64 countries around the world each year, with more than 425,000 marathon finishers in the United States alone (reference).

Eric J. Allen published an article on marathon runners, and the authors have shared much of the data that we will use. We have reformatted their data and adjusted its size for use in our class. Their data has close to 10 million observations.

Your Challenge

You will need to work with your team to describe the finishing times and how they relate to the associated spatial and temporal information about the marathon. You will look at the individual marathon runners and the marathons themselves. You have been asked to create a short story describing marathons that could be published in a local newspaper the week before a marathon is held in the community.

You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.

Deliverables

  1. A short article with fewer than 500 words and 4-6 visualizations.
  2. Your article should introduce the reader to what a normal distribution is with the use of a visualization of the marathon data.
  3. Your article should contain at least one spatial, temporal, and variable graphic.
  4. Your article should contain one table that provides data summaries.
  5. The end of your article should contain one quote from a reader (like a comment on the article). The reader should be a spouse, parent, or friend.

CS3 Template

Case Study 4: What is a healthy child?

Background

Do you remember you or a sibling going to the doctor for a check-up and hearing which percentile the measured height and weight were in? The World Health Organization (WHO) gets some bad press during pandemics, but it does much more than manage pandemics. With the United States Centers for Disease Control (CDC), it defines the standards for those percentiles that you hear - https://www.cdc.gov/growthcharts/who_charts.htm.

Your Challenge

We have repeated measurements of weights and heights for children from multiple countries (United States, Netherlands, Bangladesh, Brazil, India, Nepal, Pakistan, Peru, South Africa, and Tanzania). We will use these measures to explain the health of the children in each study location and to explore children that have poor health over most of their measurements (note: subjects were part of a study of specific groups and are not representative of the country).

You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.

Deliverables

A short presentation (4-8 slides) that visualizes child health using the height measurements. Each presentation should include the following. Remember to find a story and weave your charts together into one cohesive story.

  1. One slide should explain what a z-score is and how it is calculated for our graphics.
  2. One slide should show height-for-age z-scores (HAZ) over time for a few healthy and a few unhealthy children of each gender, using the MAL-ED data.
  3. 1-2 slides about the health of the children at 365 days (1 year) for multiple countries.
    1. One chart should show the distribution of heights for children from at least 4 countries at ~365 days.
    2. One chart should visualize the health of the children at ~365 days for each country (height-for-age z-scores).
    3. Take the time to explain your concerns about the health of the children of the study based on their z-scores.
  4. One slide should show a plot of the heights of the Dutch children over time. Take the time to describe the key takeaways about their growth.
  5. Be creative with the remaining slides.
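
For the z-score slide, the basic idea can be sketched as follows: a z-score is the number of standard deviations a measurement sits from a reference mean, z = (x - mean) / sd. The reference mean and standard deviation below are made up for illustration and are not WHO values; the WHO growth standards publish their own age- and sex-specific references, so treat this as a sketch of the concept only.

```python
def z_score(value, ref_mean, ref_sd):
    """Number of standard deviations `value` lies from the reference mean."""
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return (value - ref_mean) / ref_sd


# Hypothetical reference for one age/sex group (NOT actual WHO values).
REF_MEAN_CM = 75.0
REF_SD_CM = 2.5

heights_cm = [70.0, 75.0, 77.5]
scores = [z_score(h, REF_MEAN_CM, REF_SD_CM) for h in heights_cm]

for height, score in zip(heights_cm, scores):
    print(f"height {height} cm -> z = {score:+.1f}")
```

A child exactly at the reference mean scores 0, and a child two standard deviations below it scores -2, which is why z-scores let you compare children of different ages and sexes on one scale.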

Case Study 5: Does your birthday make you better at sports?

Background

Matthew 25:29 explains that ‘whoever has will be given more, and they will have an abundance. Whoever does not have, even what they have will be taken from them’, which has led many researchers to describe how some get more and others get less as the ‘Matthew Effect’ in our society.

One way to evaluate this effect is to look at professional athletes. Malcolm Gladwell studies the birth dates of successful hockey players in a chapter of his 2008 non-fiction book Outliers to provide an example of the Matthew Effect. Please read the Matthew Effect chapter to get more background.

Hockey’s cutoff date is December 31st, but baseball’s cutoff date has historically been July 31st. Football’s cutoff is also July 31st, but football adds weight categories for older ages. Basketball’s AAU cutoff date is August 31st.

Your Challenge

Malcolm Gladwell’s chapter on the Matthew Effect is persuasive as a narrative. He even has a couple of tables. After reading his Matthew Effect chapter, your job is to create a few persuasive visualizations about the birthday distributions within each sport. In addition, you will need to compare the sport birthday distributions to the distribution of birthdays in the US population to verify that the athletes’ birthdays are in fact distributed differently.

You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.

Deliverables

  1. Build a 6-10 slide presentation that provides data visualizations to pair up with the text from Malcolm Gladwell’s chapter on the Matthew Effect.
  2. At least one slide should have a visualization that compares the US population of births to a sport of your preference with persuasive annotations added to the graphics.
  3. At least three slides should describe and show the statistical comparison you performed to provide justification for your inference.
  4. Your final slide should have your conclusions beyond the observed data.
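
One possible statistical comparison for these slides is a chi-square goodness-of-fit test: compare the observed count of athletes born in each period with the count expected if athletes followed the population's birth distribution. The sketch below groups births by quarter and uses invented numbers; the counts and population shares are placeholders, not values from the class data, and this test is one option among several.

```python
def chi_square_statistic(observed_counts, expected_shares):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed_counts) != len(expected_shares):
        raise ValueError("need one expected share per observed category")
    total = sum(observed_counts)
    statistic = 0.0
    for observed, share in zip(observed_counts, expected_shares):
        expected = total * share  # count expected under the population shares
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Placeholder data: athletes born in Q1..Q4 and population birth shares.
athlete_births = [40, 30, 20, 10]
population_shares = [0.25, 0.25, 0.25, 0.25]

stat = chi_square_statistic(athlete_births, population_shares)

# With 4 categories there are 3 degrees of freedom; the 5% critical
# value of the chi-square distribution is about 7.815.
CRITICAL_VALUE_DF3 = 7.815
print(f"chi-square = {stat:.2f}, exceeds 5% critical value: {stat > CRITICAL_VALUE_DF3}")
```

A statistic above the critical value suggests the athletes' birthdays are distributed differently from the population's; your slides should still explain what that difference looks like, not just report the number.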

Case Study 6: Catch me, if you can?

Background

You have recently been hired by the U.S. Internal Revenue Service (IRS) to catch corporate cheaters. You have been given three companies to investigate. You will need to decide if the IRS should build a legal case to investigate the institution for fraud.

  • Sino Forest Corporation: You have the values from the financial statement numbers of Sino Forest Corporation’s 2010 Report.
  • Government Entity: A dataset containing the card transactions for a government entity - 2010.
  • General Motors: The amounts paid to vendors for the 90 days preceding General Motors’ 2009 liquidation.

Your Challenge

You are responsible for reporting as much evidence as you can with the data provided for each institution above. The government entity has more available data than the other two, which will require you to dig deeper to find additional clues.

You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.

Deliverables

  1. An 8-12 slide presentation to your IRS managers on the case against each entity.
  2. At least one slide that shows the statistical test results from the analysis you performed.
  3. At least one slide per institution that visualizes their first digit distribution compared to Benford’s law.
  4. At least one slide for one of the institutions that compares the last digit distribution to what would be expected.
  5. Multiple visualizations of the Government Entity data to find other interesting insights. For each chart, please provide a follow-up question to ask the Government Entity about the data discrepancy you displayed.
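
Benford's law says the leading digit d of many naturally occurring amounts appears with probability log10(1 + 1/d), so 1 leads about 30.1% of the time and 9 about 4.6%. The sketch below computes those expected shares and the observed first-digit shares for a list of amounts. The amounts are invented for illustration and the helper names are hypothetical; substitute the values from the data set you are investigating.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(amount):
    """Leading nonzero digit of a nonzero number, ignoring its sign."""
    # Scientific notation puts the leading digit first, e.g. '5.2...e-02'.
    return int(f"{abs(amount):.15e}"[0])


def observed_shares(amounts):
    """Observed share of each leading digit among the nonzero amounts."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one nonzero amount")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Invented amounts for illustration only.
amounts = [123.45, 1890.0, 0.052, 27.0, 1.5, 940.0, 310.0, 16.0]

expected = benford_expected()
observed = observed_shares(amounts)
for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.3f}, Benford {expected[d]:.3f}")
```

Plotting the observed and expected shares side by side for each institution gives the first-digit comparison slide; large gaps are a prompt for follow-up questions, not proof of fraud on their own.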

Case Study 7: Can you help me with my data problem?

Background

Up to this point in our course, we have been protecting you against the difficulty of gathering and formatting data for use in visualization and analysis. Tools like Tableau and PowerBI can be used to manipulate data. Some business analysts will stay in Excel and use VBA or use DAX in PowerBI. We are still going to protect you, but you will have to guide us on your data needs.

Most of the time data scientists move to the programming languages of Python, R, or SQL to wrangle their data. Both PowerBI and Tableau allow all three languages to be used internally.

The Bill and Melinda Gates Foundation wants to eradicate Tuberculosis (TB). They have asked your team to use the World Health Organization’s report on TB to guide them on their next steps in fighting this disease.

Your Challenge

Address the following questions:

  • Which countries require our attention?
  • What age groups are of the most concern?
  • Are there differences between males and females?
  • What data science programming language should we use moving forward?

Deliverables

  1. A 5-8 slide presentation that addresses the questions above.
  2. Each question should have at least one graphic to support your answer.
  3. At least one slide highlighting the language choice for future work on the project.
  4. You should have an appendix slide that describes the wrangling that had to be done to the data.

Case Study 8: Do you want to be a data scientist?

Background

Data scientists need to have resumes that show strong experience in solving real-world problems. While employers know that their job candidates need experience, they often don’t know what experience they need and often list too many requirements.

Your Challenge

To finish the semester you will need to create a hypothetical resume that you could use upon graduation from BYU-I to apply for a data science or analytics position. Your resume will need to combine three elements: the coursework available, the real-work experience you can accumulate over the next few years before graduation, and the job requirements of currently available positions for undergraduate data scientists.

Deliverables

  1. A one-page dream data science resume.
  2. A second page of sources that you used to justify the content of your resume.
  3. A cover letter that describes what you have learned from this course.