Table of Information about Marathons

This data set is useful for seeing how goals affect what should be a unimodal distribution of finish times. The NYT published a good article on the topic: https://www.nytimes.com/2014/04/23/upshot/what-good-marathons-and-bad-investments-have-in-common.html?rref=upshot&_r=1
marathon
Author

DS 150

Published

January 25, 2024

Data details

The data set has 6,888 rows and 5 columns. The original data source is used to create our data, which is stored in our pins table. You can access this pin from a connection to posit.byui.edu using hathawayj/race_info.

This data is available to all.

Variable description

  • year: The year of the marathon
  • marathon: The name of the marathon
  • country: The country where the marathon occurred
  • finishers: The number of finishers at the marathon
  • mean_time: The average finish time in minutes
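The five variables above map naturally onto a typed record. The following is a minimal Python sketch, assuming the field types implied by the descriptions (integer year and finishers, floating-point mean_time). The class name RaceInfo and the helper parse_row are hypothetical, and the sample row uses made-up placeholder values rather than values from the data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaceInfo:
    """One row of race_info (hypothetical class name)."""
    year: int          # year of the marathon
    marathon: str      # name of the marathon
    country: str       # country where the marathon occurred
    finishers: int     # number of finishers at the marathon
    mean_time: float   # average finish time in minutes


def parse_row(row: dict) -> RaceInfo:
    """Convert a dict of raw values (for example strings) into a typed RaceInfo."""
    return RaceInfo(
        year=int(row["year"]),
        marathon=str(row["marathon"]),
        country=str(row["country"]),
        finishers=int(row["finishers"]),
        mean_time=float(row["mean_time"]),
    )


# Placeholder values for illustration only; not taken from the data set.
example = parse_row({
    "year": "2010",
    "marathon": "Example Marathon",
    "country": "Exampleland",
    "finishers": "100",
    "mean_time": "265.5",
})
print(example.mean_time / 60)  # mean finish time converted from minutes to hours
```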

Variable summary

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd       p0      p25   p50     p75     p100  hist
year                   0              1  2007.53     4.73  1970.00  2005.00  2008  2011.0   2013.0  ▁▁▁▂▇
finishers              0              1  1421.21  4166.05     1.00    87.75   224   842.5  50062.0  ▇▁▁▁▁
mean_time              0              1   270.47    35.52   139.07   250.85   263   280.0    614.5  ▁▇▁▁▁

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
marathon               0              1    6   64      0      1176           0
country                0              1    2   14      0        43           0
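The numeric columns in the summary above (n_missing, complete_rate, mean, sd, and the quantiles p0 to p100) can be recomputed for any numeric variable. The sketch below uses only the Python standard library. The function name numeric_summary is hypothetical, the input values are placeholders rather than real data, and the choice of the "inclusive" quantile method is an assumption about how the summary was produced.

```python
import statistics


def numeric_summary(values):
    """Summarise a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # method="inclusive" interpolates between observed values and treats the
    # minimum and maximum as the 0th and 100th percentiles.
    p25, p50, p75 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "n_missing": len(values) - len(present),
        "complete_rate": len(present) / len(values),
        "mean": statistics.fmean(present),
        "sd": statistics.stdev(present),  # sample standard deviation (n - 1)
        "p0": min(present),
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "p100": max(present),
    }


# Placeholder values for illustration only; not taken from the data set.
summary = numeric_summary([240.0, 250.0, 260.0, 270.0, None])
for name, value in summary.items():
    print(name, value)
```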
Explore generating code using R
pacman::p_load(pins, tidyverse, downloader, fs, glue, rvest, connectapi)

# Data origin. This works.
race_info <- read_html("https://faculty.chicagobooth.edu/george.wu/research/marathon/marathon_names.htm") %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[1]]

# Wrangle: promote the first row to column names, then drop it
colnames(race_info) <- as.character(unlist(race_info[1, ]))

race_info <- race_info[-1, ] %>%
  as_tibble() %>%
  rename_with(str_to_lower) %>%  # rename_all() is superseded in dplyr
  mutate(year = as.integer(year), finishers = as.integer(finishers),
         `mean time` = as.numeric(`mean time`)) %>%
  rename(mean_time = `mean time`)

# Publish the data as a parquet pin on the Connect server
board <- board_connect()
pin_write(board, race_info, type = "parquet")

# Assign the vanity URL data/race_info to the pin's content item
pin_name <- "race_info"
meta <- pin_meta(board, paste0("hathawayj/", pin_name))
client <- connect()
my_app <- content_item(client, meta$local$content_id)
set_vanity_url(my_app, paste0("data/", pin_name))
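For readers working in Python, the wrangle step in the R code above (promote the first row to the header, lower-case the names, cast the numeric columns, rename "mean time" to mean_time) can be sketched with the standard library alone. The function name wrangle, the capitalisation of the raw header, and the sample row are assumptions for illustration; the sample values are placeholders, not real data.

```python
def wrangle(rows):
    """Promote the first row to the header, lower-case names, cast types."""
    header = [name.lower() for name in rows[0]]
    # Rename "mean time" to "mean_time", mirroring the rename() call in R.
    header = ["mean_time" if name == "mean time" else name for name in header]
    casts = {"year": int, "finishers": int, "mean_time": float}
    records = []
    for raw in rows[1:]:
        record = dict(zip(header, raw))
        for column, cast in casts.items():
            record[column] = cast(record[column])
        records.append(record)
    return records


# Placeholder rows for illustration only; the header casing is assumed.
raw_rows = [
    ["Year", "Marathon", "Country", "Finishers", "Mean Time"],
    ["2010", "Example Marathon", "Exampleland", "100", "265.5"],
]
clean = wrangle(raw_rows)
print(clean[0])
```

One difference from the R version: as.integer() and as.numeric() return NA with a warning for unparseable values, while int() and float() raise ValueError, so malformed rows would need explicit handling here.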

Access data

This data is available to all.

Direct Download: race_info.parquet

R and Python Download:

URL Connections:

For public data, any user can connect and read the data using pins::board_connect_url() in R.

library(pins)
url_data <- "https://posit.byui.edu/data/race_info/"
board_url <- board_connect_url(c("dat" = url_data))
dat <- pin_read(board_url, "dat")

Use this custom function in Python to load the data into a pandas DataFrame.

import pandas as pd
import requests
from io import BytesIO

def read_url_pin(name):
  url = "https://posit.byui.edu/data/" + name + "/" + name + ".parquet"
  response = requests.get(url)
  if response.status_code == 200:
    parquet_content = BytesIO(response.content)
    pandas_dataframe = pd.read_parquet(parquet_content)
    return pandas_dataframe
  else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")
    return None

# Example usage:
pandas_df = read_url_pin("race_info")
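The URL pattern inside read_url_pin can be isolated into a small helper, which makes it easy to check the address before downloading anything. This is a sketch only: the helper name pin_parquet_url is hypothetical, and it assumes, as read_url_pin does, that the parquet file carries the same name as the pin.

```python
def pin_parquet_url(name, base="https://posit.byui.edu/data/"):
    """Build the direct-download URL for a public pin's parquet file."""
    # Same pattern as read_url_pin: <base>/<name>/<name>.parquet
    return f"{base.rstrip('/')}/{name}/{name}.parquet"


url = pin_parquet_url("race_info")
print(url)  # prints https://posit.byui.edu/data/race_info/race_info.parquet
```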

Authenticated Connection:

Our Connect server is https://posit.byui.edu; assign this URL to your CONNECT_SERVER environment variable. You must also create an API key and store it in your environment as CONNECT_API_KEY.

Read more about environment variables and the pins package to understand how these environment variables are stored and accessed in R and Python with pins.
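Before opening an authenticated connection, it can help to confirm that both environment variables are actually set. The following Python sketch does that with the standard library. The function name connect_settings is hypothetical, and the demo dictionary holds a placeholder key, not a real credential.

```python
import os


def connect_settings(env=os.environ):
    """Return (server, api_key), or raise if either variable is missing."""
    required = ("CONNECT_SERVER", "CONNECT_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["CONNECT_SERVER"], env["CONNECT_API_KEY"]


# Demo with an explicit mapping; the API key here is a placeholder.
demo_env = {
    "CONNECT_SERVER": "https://posit.byui.edu",
    "CONNECT_API_KEY": "placeholder-key",
}
server, api_key = connect_settings(demo_env)
print(server)
```

Calling connect_settings() with no argument reads the real environment, so a missing variable fails early with a clear message rather than as an authentication error later.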

# R
library(pins)
board <- board_connect(auth = "auto")
dat <- pin_read(board, "hathawayj/race_info")

# Python
import os
from pins import board_rsconnect
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv('CONNECT_API_KEY')
SERVER = os.getenv('CONNECT_SERVER')

board = board_rsconnect(server_url=SERVER, api_key=API_KEY)
dat = board.pin_read("hathawayj/race_info")