We have built the class around a set of data-driven projects. During each project, we will cover readings in visualization, data handling, and statistics. Excluding the first two days and the last two days of class, we will spend four days of class on each case study. You can find data for each project on the data page.
Data intuition and insight are not rocket science. However, there is a science to data handling, visualization, and decision making with data. The material we cover this semester will introduce you to topics in this science.
Make sure you understand the course objectives and that you are introduced to the material and tools of the course.
LEGO has become one of the most valuable toys on the planet. You can even download software and create your own sets from bulk toys (https://www.leocad.org/index.html).
We will be selling bulk LEGO in 1-lb bags. According to this LEGO selling guide, the more organized the bricks are, the better they will sell. While most eBay sellers take a couple of photos and list the number of pounds to sell their product, we will be selling our bags of LEGO with data visualizations. If you would like more details about each brick you can read here.
You will need to work with your team to enter your LEGO data and then create a set of visualizations that sell your product. We will need to prepare our sales material so that it communicates what is in our bag of LEGO without buyers handling or seeing the product (no pictures of your bag of LEGO).
With your team, record information for each LEGO brick in your bag (such as color). Then create data visualizations to answer questions you think the buyers might have. Some example questions are listed below.
All of your deliverables will be provided in a six-slide Google Slides deck. Use the provided template. Data collection happens as a team. Each student on the team should have their own charts.
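As a starting point for the data entry and charting described above, here is a minimal Python sketch that counts bricks by one recorded attribute and prints a rough text bar chart. The brick records, the field names `color` and `studs`, and the helper names `count_by` and `text_bars` are all hypothetical; your team's data and the charting tool you use in class will differ.

```python
from collections import Counter

# Hypothetical records: one dict per brick, as a team might enter them.
bricks = [
    {"color": "red", "studs": 8},
    {"color": "blue", "studs": 4},
    {"color": "red", "studs": 2},
    {"color": "yellow", "studs": 8},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)


def text_bars(counts):
    """Return one text line per value, most common first."""
    return [f"{value:<8}{'#' * n} ({n})" for value, n in counts.most_common()]


for line in text_bars(count_by(bricks, "color")):
    print(line)
```

The same `count_by` call works for any field you record, so a buyer question such as "how many large bricks are in the bag?" becomes a count over a different column.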
In a nod to Greek history, the first marathon in 1896 commemorated the run of the soldier Pheidippides from a battlefield near the town of Marathon, Greece, to Athens in 490 B.C. For the 1908 London Olympics, the course was laid out from Windsor Castle to White City stadium, about 26 miles. An extra 385 yards was added inside the stadium to locate the finish line in front of the royal family’s viewing box.
Despite the success of that first race, it took 13 more years of arguing before the International Amateur Athletic Federation (IAAF) adopted the 1908 distance as the official marathon. Today, there are more than 500 organized marathons in 64 countries around the world each year, with more than 425,000 marathon finishers in the United States alone (reference).
Eric J. Allen published an article on marathon runners, and the authors have shared much of the data that we will use. We have formatted the data and adjusted its size for use in our class. Their data has close to 10 million observations.
You will need to work with your team to describe the finishing times and how they relate to the associated spatial and temporal information about the marathon. You will look at the individual marathon runners and the marathons themselves. You have been asked to create a short story that describes marathons and that could be published in a local newspaper the week before a marathon is held in its community.
You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.
Do you remember going to the doctor for a check-up, or going along with a sibling, and hearing a conversation about which percentile the measured height and weight fell in? The World Health Organization (WHO) gets some bad press during pandemics, but it does much more than manage pandemics. Together with the United States Centers for Disease Control and Prevention (CDC), it defines the standards for those percentiles: https://www.cdc.gov/growthcharts/who_charts.htm.
We have repeated measurements of weights and heights for children from multiple countries (United States, Netherlands, Bangladesh, Brazil, India, Nepal, Pakistan, Peru, South Africa, and Tanzania). We will use these measures to explain the health of the children in each study location and to explore children who have poor health over most of their measurements (note: subjects were part of a study of specific groups and are not representative of the country).
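To make the idea of a percentile concrete, the following Python sketch computes a simple percentile rank: the share of values in a comparison group that are at or below a given measurement. This is only an illustration. It is not the method behind the WHO and CDC growth charts, which are built from much larger reference samples, and the function name `percentile_rank` and the sample heights are made up for the example.

```python
def percentile_rank(value, reference):
    """Percent of reference values that are less than or equal to `value`."""
    if not reference:
        raise ValueError("reference must contain at least one value")
    at_or_below = sum(1 for x in reference if x <= value)
    return 100 * at_or_below / len(reference)


# Made-up heights (cm) for a comparison group of same-age children.
heights_cm = [70.1, 72.4, 73.0, 74.8, 75.5, 76.2, 77.9, 79.3]

print(percentile_rank(75.5, heights_cm))
```

A child who stays at a low percentile rank across most of their repeated measurements is the kind of case the project asks you to explore.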
You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.
A short presentation (4-8 slides) that visualizes child health using the height measurements. Each presentation should include the following. Remember to find a story and weave your charts together into one cohesive story.
Matthew 25:29 explains that _‘whoever has will be given more, and they will have an abundance. Whoever does not have, even what they have will be taken from them’_, which has led many researchers to describe how some get more and others get less as the ‘Matthew Effect’ in our society.
One way to evaluate this effect is to look at professional athletes. Malcolm Gladwell studies the birth dates of successful hockey players in a chapter of his 2008 non-fiction book Outliers to provide an example of the Matthew Effect. Please read the Matthew Effect chapter to get more background.
Hockey’s cutoff date is December 31st, but baseball’s cutoff date has historically been July 31st. Football’s cutoff is also July 31st, but football adds weight categories for older ages. Basketball’s AAU cutoff date is August 31st.
Malcolm Gladwell’s chapter on the Matthew Effect is persuasive as a narrative. He even has a couple of tables. After reading his Matthew Effect chapter, your job is to create a few persuasive visualizations about the birthday distributions within each sport. In addition, you will need to compare each sport's birthday distribution to the distribution of birthdays in the US population to verify that the athletes' birthdays are in fact distributed differently.
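One way to approach the comparison described above is to put both groups on the same scale, as the share of birthdays falling in each month, and then summarize the gap with a chi-square statistic. The Python sketch below assumes you have a birth month (1-12) for each athlete and monthly birth counts for the US population. The function names and the tiny input lists are hypothetical placeholders, not real data.

```python
from collections import Counter

MONTHS = range(1, 13)


def month_shares(counts):
    """Convert monthly counts into shares that sum to 1."""
    total = sum(counts.get(m, 0) for m in MONTHS)
    return {m: counts.get(m, 0) / total for m in MONTHS}


def chi_square_stat(observed_counts, expected_shares):
    """Chi-square statistic of observed counts against expected shares."""
    total = sum(observed_counts.get(m, 0) for m in MONTHS)
    stat = 0.0
    for m in MONTHS:
        expected = expected_shares[m] * total
        stat += (observed_counts.get(m, 0) - expected) ** 2 / expected
    return stat


# Placeholder inputs: athlete birth months and population counts by month.
athlete_months = [1, 1, 2, 1, 3, 2, 1, 4, 2, 1, 3, 12]
population_counts = {m: 100 for m in MONTHS}  # flat, for illustration only

athlete_counts = Counter(athlete_months)
population_shares = month_shares(population_counts)
print(chi_square_stat(athlete_counts, population_shares))
```

A larger statistic means the athletes' birth months sit further from the population pattern. Judging whether the gap is statistically meaningful requires comparing the statistic with a chi-square distribution with 11 degrees of freedom, and a side-by-side chart of the two sets of monthly shares will usually persuade a reader more than the number alone.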
You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.
You have recently been hired by the U.S. Internal Revenue Service (IRS) to catch corporate cheaters. You have been given three institutions to investigate. For each one, you will need to decide whether the IRS should build a legal case to investigate that institution for fraud.
You are responsible for reporting as much evidence as you can from the data provided for each institution above. The government entity has more available data than the other two, which will require you to dig deeper to find additional clues.
You can find varied data sources available for your use on the data page in Canvas. You will need to use more than one of the data sets provided but you are not expected to use them all.
Up to this point in our course, we have been shielding you from the difficulty of gathering and formatting data for use in visualization and analysis. Tools like Tableau and Power BI can be used to manipulate data. Some business analysts will stay in Excel and use VBA, or use DAX in Power BI. We are still going to shield you, but you will have to guide us on your data needs.
Most of the time, data scientists move to Python, R, or SQL to wrangle their data. Both Power BI and Tableau allow all three languages to be used internally.
The Bill and Melinda Gates Foundation wants to eradicate tuberculosis (TB). They have asked your team to use the World Health Organization’s report on TB to guide them on their next steps in fighting this disease.
Address the following questions:
Data scientists need to have resumes that show strong experience in solving real-world problems. While employers know that their job candidates need experience, they often don’t know what experience they need and often list too many requirements.
To finish the semester, you will need to create a hypothetical resume that you could use upon graduation from BYU-I to apply for a data science or analytics position. Your resume will need to combine three elements: the coursework available, the real-world experience you can accumulate over the next few years before graduation, and the job requirements of the currently available positions for undergraduate data scientists.