WINGJOCKEY

2024 SUMMER INTERN

2024/06/03 ~ 2024/08/09

Internship Goal

The primary goal of the Kinertia summer internship is to provide a holistic experience in using open source data: collecting it, organizing it, and analyzing it to create usable formats for web app development. The work spans various aspects of the development process, building a well-rounded skill set.

Key Focus Areas

1. Data Pipeline

Learning Objectives: Understanding and implementing the end-to-end process of data extraction, transformation, and loading (ETL) using Apache Airflow.

2. Open Source Data Validation

Learning Objectives: Ensuring the accuracy, reliability, and quality of open source data by using data science techniques to validate its usability.

3. Backend and Frontend Development

Learning Objectives: Grasping server-side development, including database management and server logic, as well as designing and implementing UI components, ensuring responsiveness and accessibility, and integrating frontend with backend services.

Internship Timeline

Bi-Weekly Summary

Week 1~2

1. Data Migration
Successfully set up Alembic for data migration and created initial revisions.

2. US-Tiles Server
Created a sample US-tiles server for visualizing map tiles.

3. ADS-B Data Filtering
Filtered ADS-B data for US flights and processed it for storage in Google Cloud and BigQuery.

4. PostGIS Setup
Integrated airspace and airport location data into PostGIS and PostgreSQL.

5. WingJockey Familiarization
Ran the Flask app and Leaflet HTML to understand the app’s workflow and began preparing the tiles server.

Week 3~4

1. Airflow Setup
Configured local Airflow environment and started automated ADS-B data processing.

2. Trace Data Split
Split flight trace data by altitude and timestamp differences for better visualization.

3. React and JS Learning
Learned JavaScript and React basics, implementing airport icons on maps.

4. Backend Connection
Established SSH configuration for VM and connected it to the backend.

5. Flight Path Features
Developed features for displaying flight paths more accurately in the app.

Week 5~6

1. Finalize ADS-B Pipeline
Completed and documented the ADS-B data pipeline and began backfilling.

2. NASR Data Pipeline
Finalized the NASR data pipeline with 28-day updates and backfill.

3. UI Enhancements
Organized the Figma UI design, adding flight features and sign-in flows.

4. Backend Cleanup
Renamed and cleaned up backend DAGs and updated query logic.

5. IFR Charts
Created and tested tile layers for IFR charts, setting up a repo for future automation.

Week 7~8

1. FAA Registration Pipeline
Refined the FAA registration pipeline, adding checksums and updating data frequency.

2. UI Development
Improved the Figma UI design and added new flight features.

3. Backend DAGs
Renamed and cleaned up backend DAGs, adding a Total Distance column.

4. Tile Layer Creation
Developed tile layers for IFR charts and tested their accuracy.

5. Looker Dashboard Prototype
Started prototyping a Looker Dashboard for visualizing trace data.

Week 9~10

1. IFR/VFR Pipeline
Finalized the IFR and VFR pipeline, creating DAGs and testing tile updates.

2. NASR Pipeline
Retested the NASR pipeline, adding a cleanup task and preparing for production deployment.

3. FAA Registration
Finalized the FAA registration pipeline, ensuring correct DAG functionality and generalizing functions.

4. WingJockey UI
Modified the Figma UI, adding new components and organizing the layout.

5. WingJockey Backend
Ran the backend code locally, debugging issues and making necessary modifications.

ADS-B data

DATA SOURCE INTRO

ADS-B stands for Automatic Dependent Surveillance - Broadcast. ADS-B is an aviation surveillance system in which an aircraft determines its position via satellite navigation or other sensors and periodically broadcasts it, allowing the position data to be stored for tracking purposes. ADSB.lol is a flight tracking network that aggregates data from volunteer feeders around the world. In this project, we use ADSB.lol open source public data to get the traces of each flight.

The GitHub repository “globe_history_2023” by user ADSB.lol hosts a database of aircraft tracked by ADS-B. It includes daily historical records of aircraft data, provided under the Open Database License. The repository is archived and now read-only. Each release corresponds to data from a specific day, collected by the ADS-B software, and is made possible by contributions from their feeders. Users can contribute by adding feeders to increase coverage. Each release contains one file per aircraft: a gzipped JSON file holding that aircraft's data for the day.
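
To illustrate, here is a minimal sketch of reading one per-aircraft file, assuming the readsb-style layout in which a base "timestamp" field anchors a "trace" array whose entries begin with a seconds offset, latitude, longitude, and altitude; only those first positions are relied on here.

    import gzip
    import json

    def read_trace(path):
        # Load one per-aircraft trace file from a globe_history release.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            data = json.load(f)
        base = data.get("timestamp", 0)  # UNIX-time anchor for the day
        points = []
        for entry in data.get("trace", []):
            # Assumed layout: [seconds offset, lat, lon, altitude, ...]
            offset, lat, lon, alt = entry[0], entry[1], entry[2], entry[3]
            points.append({
                "time": base + offset,   # absolute time of the fix
                "lat": lat,
                "lon": lon,
                "alt": alt,              # barometric altitude, or "ground"
            })
        return data.get("icao"), points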

OUR TASK & GOAL

1. Validate Data Usability
Validate whether we can rely on this free open-source data for daily flight trace information instead of paying for the FlightAware API.

2. Set filtering conditions to get the flights we want
We only want to keep US flights that flew within US territory, and to include only private aircraft, which are the target of our app.

3. Split the daily traces per takeoff and landing
The trace data downloaded from the adsb.lol repository is grouped by tail number per day. However, if an aircraft flew several times in a day, it’s hard to identify which part of the trace belongs to which flight. We need to devise logic to split the flights.

WHAT WE'VE DONE

To ensure that the data is reliable and relevant for our analysis, we check for completeness, accuracy, and consistency within the dataset. We cross-compare the flight trace data on adsb.lol with FlightAware. In addition, we perform data cleaning to remove any duplicates or irrelevant entries to maintain data integrity.

FlightAware also records the flight traces of each flight throughout the day. They have more feeders supplying their data, making it more accurate than what we observed on adsb.lol. Cross-comparing the data from both sources is one way to validate the reliability of adsb.lol. It turns out that adsb.lol captures most flight trace data with considerable confidence. However, its feeder coverage is lower, which might result in gaps in the trace data.

In addition, adsb.lol covers all aircraft worldwide, but we want to minimize the data stored in our cloud to avoid excessive data accumulation. Therefore, we aim to filter out irrelevant aircraft and flights. Our filtering conditions include: only keeping US aircraft, retaining traces only for flights within US boundaries, including only private flights (excluding commercial flights), and excluding certain aircraft based on their engine and aircraft type according to the FAA aircraft registration data.
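
A minimal sketch of these filtering conditions follows; the registration field names, exclusion sets, and crude bounding-box test are illustrative stand-ins, since the real checks use the FAA registration data and proper US boundary polygons.

    # Illustrative exclusion sets and a crude continental-US bounding box;
    # the real checks use FAA type codes and proper US boundary polygons.
    EXCLUDED_ENGINE_TYPES = {"turbojet"}
    EXCLUDED_AIRCRAFT_TYPES = {"balloon", "glider"}

    def in_us_bounds(lat, lon):
        return 24.0 <= lat <= 49.5 and -125.0 <= lon <= -66.0

    def keep_flight(points, registration):
        # `registration` is a hypothetical dict built from the FAA aircraft
        # registration data; the field names here are illustrative.
        if registration is None:
            return False
        if not registration["n_number"].startswith("N"):   # US aircraft only
            return False
        if registration["is_commercial"]:                  # private flights only
            return False
        if registration["engine_type"] in EXCLUDED_ENGINE_TYPES:
            return False
        if registration["aircraft_type"] in EXCLUDED_AIRCRAFT_TYPES:
            return False
        # Keep only traces that stayed within US boundaries.
        return all(in_us_bounds(p["lat"], p["lon"]) for p in points)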

The most challenging part might be the trace splitting. Without columns or indicators in the original data showing which records represent a flight’s landing or takeoff, splitting each flight becomes quite difficult. We began by examining several flights in our trace data and visualizing the altitude and timestamp using Google Sheets. This gave us a clearer picture of the flight traces. We also compared our data with FlightAware, which has a more precise flight-splitting logic.

During this process, we explored various solutions. Using altitude alone was impractical, as takeoff and landing altitudes can vary greatly between different airports. Similarly, relying on the duration of time without ADS-B signals was unreliable. In areas with sparse coverage, it might take over 10 minutes to receive data even when the aircraft is in flight. Flight speed was also not a viable criterion because the calculation, based on latitude and longitude differences divided by the time gap between data points, is affected by the density of ADS-B feeder coverage. In areas with dense coverage, signals might be received every second, causing significant variability in speed calculations.

Finally, we decided to determine whether each data point indicated the aircraft was ascending, descending, or stable. If an aircraft showed continuous ascent or stable altitude and the time gap from the previous data point exceeded 10 minutes, it likely indicated takeoff. Conversely, if an aircraft showed continuous descent or stable altitude with a similar time gap to the next data point, it likely indicated landing.
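
A simplified sketch of this splitting rule (the production logic runs as SQL in BigQuery, described later, and its thresholds differ):

    GAP_SECONDS = 10 * 60   # the 10-minute gap described above
    STABLE_FEET = 50        # illustrative threshold for "stable" altitude

    def split_flights(points):
        # `points` are dicts with "time" and numeric "alt", sorted by time
        # (on-ground fixes normalized to altitude 0 beforehand).
        flights, current = [], []
        for p in points:
            if current:
                gap = p["time"] - current[-1]["time"]
                delta = p["alt"] - current[-1]["alt"]
                # A long silent gap, followed by ascent or stable altitude,
                # suggests the previous flight landed and a new one took off.
                if gap > GAP_SECONDS and delta >= -STABLE_FEET:
                    flights.append(current)
                    current = []
            current.append(p)
        if current:
            flights.append(current)
        return flights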

This criterion allowed us to accurately split the trace data into distinct flights. Cross-referencing with FlightAware confirmed our method, showing over 90% accuracy. Although this method isn’t perfect, it brings us very close to our goal of accurately splitting flight traces.

RESULTS

Displaying the queried flight and showing the split flights for each day

FAA Aircraft Registration & NASR data

DATA SOURCE INTRO

FAA Aircraft Registration Data (Daily Updated)
The FAA Aircraft Registration database includes detailed information about all registered aircraft in the United States. This data encompasses ownership details, registration statuses, and tail number assignments, which are unique identifiers for each aircraft.

1. Filtering out Commercial Flights

2. Engine Type and Plane Models

3. Trace Data Quality

NASR Subscription Data (28 Days Updated)
The NASR (National Airspace System Resources) subscription provides comprehensive aeronautical data, crucial for flight planning and navigation. This data includes updates on airspace structure, navigation aids, airports, and other critical information.

1. Airports & Airports Runway

2. Airspaces Data

OUR TASK & GOAL

1. Validate data usability
Ensure the usability of the data and verify its quality before putting it into production.

2. Restructure and import the data
Perform basic data cleaning, import the data into Google Cloud Storage and BigQuery, and structure it for seamless downstream usage.

WHAT WE'VE DONE

The FAA data is provided in a complex format with detailed information in a zip file from their site. To extract the necessary data for our application, we refer to their data dictionary and guidelines. We also validate the data and check its update frequency through the website. Our goal is to store this data in a simple, easy-to-query format in cloud storage and BigQuery, avoiding the need to sift through massive files for future queries.

The FAA Aircraft Registration Database provides detailed information on registered aircraft, including ownership details, registration statuses, and unique identifiers. The data dictionary explains the structure and content of the dataset, offering definitions and descriptions of each field to help users understand and utilize the data effectively.

Airspace data are multi-polygons that incur a large network cost for our server, and that cost leads to a bad user experience: very slow rendering of airspaces on the map. Hence, for efficient querying, we use a hashed index for each unique airspace to load the data dynamically, and we migrate the data from PostgreSQL to GCS to take advantage of the browser's caching mechanism. This allows our web application server to quickly retrieve the relevant airspaces without querying the entire database whenever the app’s bounding box changes. This approach significantly reduces querying time and improves app performance. Similarly, we apply this logic to querying airports, loading them dynamically only when the bounding box changes.
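
To illustrate the idea, a hypothetical Flask endpoint for this bounding-box lookup might look like the following; the bucket URL is a placeholder and the spatial query helper is a stand-in for the real PostGIS/BigQuery lookup.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    GCS_BASE = "https://storage.googleapis.com/example-bucket/airspaces"  # placeholder

    def lookup_airspace_hashes(west, south, east, north):
        # Stand-in for the real spatial query (e.g. PostGIS ST_Intersects or
        # a BigQuery GIS filter) returning the hashed IDs of airspaces whose
        # polygons intersect the bounding box.
        raise NotImplementedError

    @app.route("/airspaces")
    def airspaces_in_view():
        # The client fetches each hashed GeoJSON file directly from GCS, so
        # the browser cache avoids re-downloading airspaces it has seen.
        west = float(request.args["west"])
        south = float(request.args["south"])
        east = float(request.args["east"])
        north = float(request.args["north"])
        hashes = lookup_airspace_hashes(west, south, east, north)
        return jsonify([f"{GCS_BASE}/{h}.geojson" for h in hashes])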

OUR RESULTS

Showcasing the dynamic loading of airports

Showcasing the dynamic loading of airspaces

VFR & IFR Tile Layer

WingJockey’s target users are pilots working toward VFR and IFR ratings, and they have been using physical charts during training. We need a way to integrate these charts onto our flight map, just like a tile layer.

However, both VFR and IFR charts are divided into sections, and it is not easy to combine them.

Tile Preprocessing:

We used a powerful geospatial library called GDAL to crop the individual chart sections and then combine them to generate tiles.

It is interesting how each chart comes in a different projection system, each representing a unique coordinate system. We need to reproject everything to the EPSG code best suited for web viewing, EPSG:3857 (Web Mercator); otherwise, you get results like the images below.

The result is that we now have our own tile server hosted on GCS for both IFR and VFR charts.
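
A sketch of this crop, reproject, mosaic, and tile flow using GDAL's Python bindings, with placeholder file names for the charts and the neatline cutline; the final step shells out to the stock gdal2tiles.py utility.

    import glob
    import subprocess
    from osgeo import gdal

    # 1. Crop each chart section to its neatline and reproject to Web
    #    Mercator (EPSG:3857), the projection web map libraries expect.
    for src in glob.glob("charts/*.tif"):
        gdal.Warp(
            src.replace(".tif", "_3857.tif"),
            src,
            dstSRS="EPSG:3857",
            cutlineDSName="neatlines.shp",   # placeholder cutline polygons
            cropToCutline=True,
        )

    # 2. Mosaic the reprojected sections into one virtual dataset.
    gdal.BuildVRT("charts_merged.vrt", glob.glob("charts/*_3857.tif"))

    # 3. Cut the mosaic into XYZ tiles, then upload the directory to GCS.
    subprocess.run(["gdal2tiles.py", "--xyz", "charts_merged.vrt", "tiles/"], check=True)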

Single Section VFR Chart
Without Cropping
Wrong Projection System

Data Pipeline

OUR TASK & GOAL

1. Automate ETL Process
We aim to automate the ETL (Extract, Transform, Load) process for the backend data required by the WingJockey app.

2. Optimize Data Storage and Querying
Our pipeline will include basic data preprocessing and store data in a format that minimizes storage capacity while maximizing querying speed and efficiency.

3. Maintain Clear Workflow
We strive to maintain a clear workflow for each data pipeline, making it easier to track, adjust, or revert processes in the future.

WHAT WE'VE DONE

1. ADS-B data import pipeline

schedule_interval = @daily

This pipeline automates the ETL process for ADS-B data for the WingJockey app. It downloads raw data, extracts flight traces, filters relevant information, and uploads the cleaned data to Google Cloud Storage (GCS) and BigQuery for efficient querying. The pipeline runs on a daily basis, since the adsb.lol GitHub repo is updated daily with all of the previous day's flight and trace data. The pipeline also maintains a clear workflow and removes temporary files to ensure smooth operation and easy future adjustments.
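
A minimal sketch of how such a DAG can be wired in Airflow; the DAG id, task names, and callables are illustrative rather than the production code.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Illustrative task callables; the real ones download the daily
    # globe_history release, filter the traces, and load GCS/BigQuery.
    def download_release(**ctx): ...
    def filter_traces(**ctx): ...
    def upload_to_gcs_and_bq(**ctx): ...
    def cleanup_tmp_files(**ctx): ...

    with DAG(
        dag_id="adsb_daily_import",          # hypothetical DAG id
        start_date=datetime(2024, 6, 1),
        schedule_interval="@daily",          # matches the repo's daily releases
        catchup=True,                        # enables backfilling past days
    ) as dag:
        download = PythonOperator(task_id="download", python_callable=download_release)
        filter_ = PythonOperator(task_id="filter", python_callable=filter_traces)
        upload = PythonOperator(task_id="upload", python_callable=upload_to_gcs_and_bq)
        cleanup = PythonOperator(task_id="cleanup", python_callable=cleanup_tmp_files)
        download >> filter_ >> upload >> cleanup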

2. Split Traces to Flight pipeline

schedule_interval = @daily

This Airflow DAG automates the transformation and loading (TL) process for ADS-B trace data. It includes tasks to create views and partitioned tables in BigQuery, focusing on splitting flight traces by identifying takeoff and landing events. The SQL scripts split daily flight traces by calculating altitude and timestamp differences, labeling each trace point with takeoff and landing events based on altitude changes, and grouping the points into individual flights. The final dataset is optimized before being stored back into Google Cloud Storage and BigQuery, allowing for fast retrieval, efficient querying, and analysis of flight data.
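
An illustrative sketch of this splitting logic in BigQuery SQL, with placeholder table names and thresholds mirroring the heuristic described earlier:

    # Illustrative BigQuery SQL for labeling flight starts; table names and
    # thresholds are placeholders, not the production scripts.
    SPLIT_SQL = """
    WITH deltas AS (
      SELECT
        tail_number,
        ts,
        alt,
        TIMESTAMP_DIFF(ts, LAG(ts) OVER w, SECOND) AS gap_s,
        alt - LAG(alt) OVER w AS alt_delta
      FROM `project.dataset.daily_traces`
      WINDOW w AS (PARTITION BY tail_number ORDER BY ts)
    )
    SELECT
      *,
      -- A new flight starts after a >10-minute gap with non-descending altitude.
      SUM(IF(gap_s > 600 AND alt_delta >= -50, 1, 0))
        OVER (PARTITION BY tail_number ORDER BY ts) AS flight_index
    FROM deltas
    """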

Additionally, the DAG computes the closest landing and takeoff airports for each flight. This is crucial in cases where there is insufficient data to pinpoint the exact airport, for example when the last recorded trace point lies near multiple airports. Including this computation in the pipeline helps users quickly see airport information in the system, allowing for automatic input or correction and enhancing data accuracy.
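
The nearest-airport step can be expressed with BigQuery GIS functions. The following query string is an illustrative sketch with placeholder table names; a production version would bound the search (e.g. with ST_DWITHIN) rather than cross-joining every airport.

    # Illustrative nearest-landing-airport query; the takeoff airport is
    # analogous, using each flight's first trace point instead of its last.
    CLOSEST_AIRPORT_SQL = """
    SELECT
      f.flight_id,
      ARRAY_AGG(a.airport_id
                ORDER BY ST_DISTANCE(ST_GEOGPOINT(f.last_lon, f.last_lat),
                                     ST_GEOGPOINT(a.lon, a.lat))
                LIMIT 1)[OFFSET(0)] AS closest_landing_airport
    FROM `project.dataset.split_flights` AS f
    CROSS JOIN `project.dataset.nasr_airports` AS a
    GROUP BY f.flight_id
    """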

3. FAA Aircraft Registration data import pipeline

schedule_interval = @daily

This Airflow DAG automates the dynamic import of FAA registration data, focusing on preventing duplicate entries in Google Cloud Storage (GCS) and BigQuery (BQ). The process begins with downloading and unzipping the data, followed by calculating a checksum to compare with the previously stored value. This ensures updates are processed only when there is new data, as the FAA registration data may not update daily or show significant differences with each update. If new data is detected, the files are converted to CSV, uploaded to GCS, and imported into BQ. Additionally, views are created in BQ to ensure only unique rows are retained, optimizing data retrieval and storage. Finally, the DAG cleans up temporary files to maintain efficiency.
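
A minimal sketch of the checksum gate, assuming the previous value is kept in GCS or a metadata table:

    import hashlib

    def file_checksum(path, chunk_size=1 << 20):
        # MD5 of the downloaded FAA registration archive, read in chunks.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def is_new_release(path, stored_checksum):
        # Skip the import when the archive matches the previously stored
        # checksum, since the FAA file does not necessarily change daily.
        return file_checksum(path) != stored_checksum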

4. FAA NASR subscription data import pipeline

schedule_interval = @daily

This Airflow DAG automates the process of dynamically importing NASR data upon subscription updates. It starts by checking for new data releases, downloading and unzipping the files if available. The DAG processes both CSV and shape files, uploading them to Google Cloud Storage (GCS) and importing them into BigQuery (BQ). For CSV files, it ensures they are placed in the correct directories and tables in BQ. For shape files, it converts them to newline-delimited GeoJSON format before uploading them to GCS and BQ. This process ensures efficient and up-to-date data handling for NASR datasets.
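
For the shape file conversion, GDAL's GeoJSONSeq driver writes one feature per line, which is the newline-delimited format BigQuery ingests. A sketch, where the reprojection to WGS84 is an assumption (BigQuery geographies expect lon/lat coordinates):

    from osgeo import gdal

    def shapefile_to_ndjson(src_shp, dst_path):
        # Convert a NASR shapefile to newline-delimited GeoJSON, one
        # feature per line, ready for a BigQuery geography load.
        gdal.VectorTranslate(
            dst_path,
            src_shp,
            format="GeoJSONSeq",
            dstSRS="EPSG:4326",   # assumed reprojection to lon/lat
            reproject=True,
        )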

CHALLENGES & RESULTS

Before this internship, we had almost no experience with data pipelines. While creating DAGs and running them on Airflow isn’t hard, the challenge lies in importing data efficiently and building maintainable pipelines that can be easily fixed when issues arise.

We had to carefully manage tasks and dependencies, and tailor our approach for different data types. For instance, ADS-B data updates regularly and changes daily, so we can download and store files with a date suffix. However, applying this method to aircraft registration and NASR data could result in duplicating data, increasing storage costs, and affecting query efficiency and app performance.

We also faced challenges with backfills, needing different logic for updates and backfills. Duplicating code felt inefficient, so we focused on writing clean, reusable functions. Through hands-on experience with writing DAGs, testing backfills, and monitoring pipelines, we gained valuable insights.

Now, we have these pipelines running daily for our app!

APP DEV

Backend & Frontend

OUR TASK & GOAL

1. Learn and Assist in Connecting Frontend and Backend Data for Visualization
Understand the processes involved in integrating frontend and backend systems to visualize data effectively.

2. Implement Airflow Data Pipeline Outputs into the Frontend
Utilize the data obtained from the Airflow data pipeline and integrate it into the frontend applications for improved functionality.

3. Learn Flask, React, and Ruby on Rails
Gain proficiency in Flask, React, and Ruby on Rails to enhance our capabilities in web development and backend integration.

WHAT WE'VE DONE

Split Flight Trace Search and Display

We went through the entire process of handling ADS-B data, starting from downloading and cleaning, to splitting traces, and finally storing it. This allows us to easily query the last 10 flights of any aircraft and successfully plot them on a map.
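
As an illustration, a hypothetical Flask endpoint for this query, with placeholder table and column names:

    from flask import Flask, jsonify
    from google.cloud import bigquery

    app = Flask(__name__)
    client = bigquery.Client()

    @app.route("/flights/<tail_number>")
    def last_flights(tail_number):
        # Placeholder table and column names for our split-flight storage.
        query = """
            SELECT flight_id, takeoff_time, landing_time
            FROM `project.dataset.split_flights`
            WHERE tail_number = @tail
            ORDER BY takeoff_time DESC
            LIMIT 10
        """
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter("tail", "STRING", tail_number),
        ])
        rows = client.query(query, job_config=job_config).result()
        return jsonify([dict(row.items()) for row in rows])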

Detailed Flight Info Display

The integration of FAA registration data and NASR subscription data allows us to retrieve detailed information about each flight and aircraft, such as aircraft details or takeoff and landing airports. This enables us to query and display comprehensive flight details quickly, thanks to our structured data storage format.

Enable the Display of VFR/IFR Chart Layer

We processed the extensive VFR and IFR tile files, and successfully integrated them into the app, enabling the display of these chart layers.

CHALLENGES AND RESULTS

This was the first time either of us had used React, Flask, or Ruby, making it quite challenging to understand web app development in a short period. By carefully following Cotter’s code, we replicated our own version and integrated the processed data, adding new features to the existing framework. Through hands-on practice, we learned a lot, and it was rewarding to see our work, from data processing to frontend display, come to life. The experience of reading code, running it locally, setting up environments, and finally seeing our work on the interface was incredibly satisfying.

Figma UI Design

TO BE CONTINUED....