Unit 2 Task 3: Validation

Published

April 2, 2025

Background

How do you know if you have correctly programmed a query or an analysis? The process of confirming that your analysis or dataset is “correct” is called validation.

Imagine you are new on the job and received the assignment to run queries on a baseball dataset. Confirming you did things correctly is a good idea before presenting your data to others and letting them discover your mistake. Or maybe, it isn’t a matter of you doing something wrong as just a realization that the data source is flawed. Making that discovery is very important and makes you look really good; as opposed to someone else discovering it later - which can be a big hit to your credibility.

Consider all the various ways you might want to validate the query results. One way to validate, though certainly not the only way and not necessarily the best way, is to try to accomplish the same task in a different system or language. That’s the purpose of this assignment.

Client Request

For this project, you will validate the assignments you have done previously, but this time using Python pandas code instead of complicated SQL queries. You will still need to connect to the database to pull the data in, but all the filtering, calculating, joining, etc. should be done with Python pandas (or polars).

Data

Note

This project will use the Lahman Baseball Database. In order to complete this project, you will need to download the database and save it inside the DS250 folder (hopefully you have a folder dedicated to this class) next to the .qmd file you will be using for this assignment. Note: Right click the ‘Download’ link and select “Save Link As” to download the data to your computer.

Download: lahmansbaseballdb
Information: Lahman Data Dictionary
Setup Instructions: See SQL Setup

Readings

Complete these readings before we cover the material in class. This will help you retain the material and make the class period more engaging - not less.

In other words, if an assignment is due on Wednesday, we will cover the material needed to complete the assignment on Tuesday. Therefore, you should complete the readings on Monday (or anytime before class on Tuesday).

  • [P4DS: CH4 Data Transformation](https://aeturrell.github.io/python4DS/data-transform.html#)
  • [P4DS: CH22 Joins](https://aeturrell.github.io/python4DS/joins.html)

Optional

Questions and Tasks

Download the template for this task.

  1. Write an SQL query that pulls in the batting table. Then, with Python Pandas code create a dataframe that contains playerID, yearID, and batting average for players with at least 10 at bat that year. Sort the dataframe from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

  2. Now use Python Pandas to calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats over their entire career. Sort the dataframe from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

  3. Write an SQL query that pulls in the the salaries table and the collegeplaying table (and any other tables you might need) and store them in pandas dataframes. Combine the dataframes to create a list of baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID and teamID associated with each salary.

  4. Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Be creative! Write an SQL query to read in the table(s) you need, then use pandas to manipulate the data. Finally, make a graph using Lets-Plot to visualize the comparison. What do you learn?

Submission:

When you have completed the report, you will need to follow this process to submit your work:

  1. Have this assignment’s template/quarto file open in VS Code and nothing else
  2. Click Preview Button in VS Code in the top right of the screen
    1. This will render the project but also entire course work portfolio into HTML files for review
    2. Confirm everything displas as you would like it to
    3. How you see it will be how it is viewed for grading
    4. If there is an error in any cell of the quarto files, the rendering will stop and you will need to fix the error before rendering again (if you get stuck post your error in Slack)
  3. Once the report is confirmed close the preview and open a VS Code Terminal
    1. click Terminal in the top menu bar and then New Terminal
  4. Type the following in the terminal quarto render
    1. Then drag and drop this assignment into the terminal. This will add the file path to the terminal command
    2. Press Enter
  5. This will render the project into a HTML file in the same location as the .qmd file
  6. To locate the file in VS Code
    1. Right click on the file in the file explorer and select Reveal in File Explorer(Win) or Reveal in Finder(Mac)
  7. Upload HTML file into Canvas

Deliverables:

Use this task’s template to submit your Client Report.

Answers to the questions | tasks. Each should include a written description of your results, code cells with comments, charts and/or tables.

Note

Your report should be written in quarto markdown files and rendered to an HTML File. Upload the HTML file in Canvas. (Do not submit the .qmd file)

Back to top