Unit 2 Core Task 1: Flights - Column Creation Advanced
Canvas: U2: Core Task 1 — Flights: Column Creation
Type: Core Task (1 pt, complete/incomplete)
Copilot: Allowed for syntax lookup; disallowed for answer generation.
Background
Delayed flights are not something most people look forward to. In the best-case scenario, you may only wait a few extra minutes for the plane to be cleaned. However, those few minutes can stretch into hours if a mechanical issue is discovered or a storm develops. Arriving hours late may result in you missing a connecting flight, a job interview, or your best friend’s wedding.
In 2003 the Bureau of Transportation Statistics (BTS) began collecting data on the causes of delayed flights. The categories they use are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft, and Security. You can visit the BTS website to read definitions of these categories.
Client Request
The JSON file for this project contains information on delays at 7 airports over 10 years. Your task is to clean the data, search for insights about flight delays, and communicate your results to the Client. The Client is a CEO of a flight booking app who is interested in the causes of flight delays and wants to know which airports have the worst delays. They also want to know the best month to fly if you want to avoid delays of any length.
Data
Every data science project should start with data, and our class projects are no different. Each project will have ‘URL’ and ‘Information’ links like the ones below. Right-click the ‘URL’ link and select “Copy Link” to use it to import the data into your project. This is the preferred method for getting data into your report, because you will be publishing your report to GitHub. If you choose to download the data file to your computer, you will need to save it in the same folder as your .qmd file for it to work correctly in GitHub.
URL: JSON File
Information: Data Description
Subject Matter: Types of Delay
Readings
- Python Polars: The Definitive Guide, Chapter 6, Reading JSON and NDJSON
- Polars Users Guide, Using pl.when
- Polars Cookbook, Chapter 1, Getting Started with Python Polars
Optional Alternative References
- Polars Cookbook, Chapter 2, Reading and writing JSON files
- Python Polars: The Definitive Guide, Chapter 10, Selecting and Creating Columns, especially from the Creating Columns section to the end
Core Questions
Skills: with_columns, expressions on real columns
Create a new ‘season’ column. Define ‘season’ as:
- Winter for January, Febuary (note the misspelling), March
- Spring for April, May, June
- Summer for July, August, September
- Fall for October, November, December
- All other rows can be assigned a value of ‘Unknown’
Divide every column whose name starts with num_of by 1000. Be sure to first filter out rows that contain problematic values in those columns, such as -999 or “1500+”. (No discussion needed for this question.)
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Using the columns you modified in the bullet above, your job is to create a new column that calculates the total number of flights (in 1,000’s) delayed by weather (both severe and mild). Show your work by printing the first 5 rows of the dataset. Use these three rules for your calculations:
- 100% of delayed flights in the Weather category are due to weather
- 30% of all delayed flights in the Late-Arriving category are due to weather
- From April to August, 40% of delayed flights in the NAS category are due to weather. In the remaining months, the proportion rises to 65%
Using the new weather variable calculated above, create a boxplot showing the distribution of flights delayed by weather for each season. Describe what you learn from this graph.
Submission / Deliverables:
Use this unit2_task1_template to create your Client Report. Answer the questions. Each answer should include a written description of your results, code cells with comments, charts and/or tables.
Your instructor will advise you — or it will be evident in Canvas — whether to submit a rendered .html file, or a link to the rendered file on GitHub Pages (gh-pages). Do not submit the URL to the GitHub .qmd file.
When you have completed the report and are ready to submit, render the project into HTML and publish it to GitHub Pages. Follow these steps:
- Have this assignment’s template/quarto file open in VS Code and nothing else
- Click the Preview button in VS Code (top right of the screen)
  - This renders the project so you can review it
  - Confirm everything displays as you would like it to
  - How you see it is how it is viewed for grading
  - If there is an error in any cell, the rendering stops and you will need to fix the error before rendering again (if you get stuck, post your error in Slack)
- Once the report is confirmed, close the preview and open the GitHub Desktop application
  - Confirm you are in the correct repository (top left corner)
  - Confirm you are on the Main branch (top left corner; never change off Main)
  - Type a summary of the changes in the Summary box
  - Click Commit to main (blue button, bottom left)
  - Click Push origin (blue button, middle right)
    - This pushes your changes to GitHub
    - The publish.yml workflow renders the project into HTML files
    - The HTML files are published to the gh-pages branch
- The URL of the published project is in the deployment section on GitHub
  - In GitHub Desktop, click Open in GitHub to navigate to the repository
  - Click the Actions tab and confirm there were no errors in rendering
  - Open the deployment section on the main repo page to find the URL
  - Navigate to the URL and confirm it displays as you intended
  - Copy the URL and submit it in Canvas
- In