Unit 2 Task 3: Flights - Missing Data & JSON
Background
Delayed flights are not something most people look forward to. In the best case scenario you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
Client Request
The JSON file for this project contains information on delays at 7 airports over 10 years. Your task is to clean the data, search for insights about flight delays, and communicate your results to the Client. The Client is a CEO of a flight booking app who is interested in the causes of flight delays and wants to know which airports have the worst delays. They also want to know the best month to fly if you want to avoid delays of any length.
Data
Every data science project should start with data, and our class projects are no different. Each project will have ‘URL’ and ‘Information’ links like the ones below. Right click the ‘URL’ link and select “Copy Link” to use it to import the data into your project. This is the preferred method because you will be publishing your report to GitHub. If you choose to download the data file to your computer instead, you will need to save it in the same folder as your .qmd file for it to work correctly in GitHub.
URL: JSON File
Information: Data Description
Subject Matter: Types of Delay
Readings
- Python Polars: The Definitive Guide, Chapter 6, Reading JSON and NDJSON
- Polars Cookbook, Chapter 5, Handling Missing Data
- Polars Users Guide, Using pl.when
Optional Alternative References
- Polars Cookbook, Chapter 2, Reading and writing JSON files
- Python Polars: The Definitive Guide, Chapter 4, Data Types -> Missing Values section brief summary
- Polars Users Guide, Missing Data
- Lambda Function
Questions
Convert values to missing. Start with a fresh copy of the flights dataset by reading it in from the link provided above (as opposed to starting with the dataset you ended with in the last assignment). Identify data values that are problematic, nonsensical, or impossible and change them to be missing. In your report, include one example record (one row) from your clean data in raw JSON format. Your example should display at least one missing value so that we can verify it was done correctly. (Note: NaN values are written as null in JSON.) Describe your process for finding values that needed to be changed, and how you changed them.
Filling in missing data. For all the “minutes_delayed” columns, fill in any missing values with the median of that column. For the “num_of” columns, fill in any missing values with that column’s mean.
Display a histogram for num_of_delays_carrier and for minutes_delayed_nas (2 separate histograms in total). Comment on the shape of each one. Also report the mean of each of these columns now that missing values have been filled in.
Submission / Deliverables:
Use this unit2_task3_template to create your Client Report. Answer the questions. Each answer should include a written description of your results, code cells with comments, charts and/or tables.
Your instructor will advise you, or it will be evident in Canvas, whether to submit a rendered .html file or a link to the rendered file on GitHub on gh-pages. (Do not submit the URL to the GitHub .qmd file.)
Here are some reminder instructions if you are using GitHub:
When you have completed the report and are ready to submit it, you will need to render the project into HTML files and publish it to GitHub pages. Follow these steps:
- Have this assignment’s template/quarto file open in VS Code and nothing else
- Click the Preview button in the top right of the screen in VS Code
  - This will render the project, along with your entire course work portfolio, into HTML files for review
  - Confirm everything displays as you would like it to
  - How you see it will be how it is viewed for grading
  - If there is an error in any cell of the quarto files, the rendering will stop and you will need to fix the error before rendering again (if you get stuck, post your error in Slack)
- Once the report is confirmed, close the preview and open the GitHub Desktop application
- Confirm you are in the correct repository in the top left corner of the screen
- Confirm you are on the correct branch, Main, in the top left corner of the screen (never change off the Main branch)
- Type a summary of the changes in the Summary box
- Click the blue Commit to main button in the bottom left corner
- Click the blue Push origin button in the middle right of the screen
  - This will push all your changes in the project .qmd file to GitHub
  - The publish.yml file will kick off an automated process to render the project into HTML files
  - The HTML files will be published to GitHub pages in the gh-pages branch
  - The URL to the published project will be in the deployment section in GitHub
- In GitHub Desktop, click Open in GitHub to navigate to the repository
- Click on the Actions tab and make sure there were no errors in the rendering process
- Click on the deployment section of the main page of the repository to find the URL
- Navigate to the URL and confirm it displays as you intended
- Copy the URL and submit it in Canvas