Accessing GREGoR Data
A How-To for other small companies
One of the inherent difficulties of rare disease research is that it can be tricky to get access to relevant data. Because of this limitation, one of the resources we've been particularly excited about is the GREGoR Consortium, who have worked hard to make their large collection of unique rare disease cases available to researchers. As we went through the process of gaining access to this data, our team realized that the process can be significantly different for a small industry organization compared to a large academic institution.
We are one of the first startups (if not the first) to access this data, so we wanted to write up the process for others who want to make use of this world-class dataset.
The timeline below is roughly split into two parts:
Setting up all the necessary upstream accounts to get dbGaP approval. Most of the provided documentation assumes that these accounts have already been created by your institution. Any small group would have to go through this same process.
Debugging recent issues with Terra. We hope these exact bugs were unique to us, but anyone who runs into Terra issues during this process shouldn't hesitate to contact Terra Support, who were great to work with.
10/06/2023: It’s suggested we try to access GREGoR and we’re sent the link to AnVIL. The information we need for our specific use case is spread across a few different sites, so it takes a bit of time to understand what it takes to download the data.
10/11/2023: We determine that we need to apply for access to the data via dbGaP, which for a new startup requires several prerequisite steps. The first is getting a Unique Entity ID (UEI) through SAM.gov. We fill out the application.
10/19/2023: We receive a UEI. This allows us to register Lodestar as “Active” with SAM.gov.
10/23/2023: We submit our registration to SAM.gov. Filling out these applications takes a bit of time if you're unfamiliar with them.
10/31/2023: Our registration is activated. This allows us to create an eRA Commons account.
11/01/2023: We request creation of an eRA Commons account for Lodestar.
11/07/2023: Lodestar’s eRA Commons account is approved. After making individual accounts for our team members and associating them with Lodestar, we’re able to apply for access to the GREGoR data in dbGaP. The application process is moderately involved, including what amounts to a two-page proposal for what you want to do with the data.
11/14/2023: We submit our proposal to dbGaP.
11/16/2023: We receive a notice that our proposal has been rejected for three reasons. First, someone in the proposal was listed with an unaffiliated email address. This was fixed by paying for a new Gmail account. Second, we had “PI = IT Director”. Since we’re a small org, individuals wear multiple hats. This was fixed by swapping some hats around. Lastly, someone from NHGRI reaches out to us requesting more information about Lodestar as an organization, mostly to verify the legitimacy of the company. We’re able to answer the questions the same day.
11/17/2023: We resubmit our proposal.
11/29/2023: dbGaP approves our proposal and we gain access to the data in dbGaP. Unfortunately, this only gives us permission to view the data. We have no real ability to access it yet, since the data itself is accessed through Terra.
12/01/2023: Ben and I spend a lot of the day trying to view the data in the Terra workspace. We work through a number of steps, including associating external IDs and setting up billing for the project. We still aren’t able to see any data.
12/02/2023: Ben gets access to the data for the first time, but I don’t. We realize that only Ben’s role in the dbGaP application has the correct permissions. Ben’s ability to access the data lasts for about an hour before it disappears.
12/04/2023: We get reapproved by dbGaP with a modified proposal that allows me to see the workspace for the first time. My access to the data also lasts only a short time before it disappears.
12/07/2023: We begin working with the Terra Support Team on what we call “Intermittent access to workspaces”. At this point, we have access to the metadata in GREGoR for long enough to download it, but no access to the genotypic data (e.g., CRAMs and VCFs).
12/21/2023: We get an update that the Terra engineers were able to track down the root cause and determined that a bug was introduced in a recent release, causing intermittent access. Because of the holidays, the next release is scheduled for the new year.
01/16/2024: We determine that access to GREGoR is stable, but the VCFs and CRAMs continue to be inaccessible.
01/19/2024: We open a support ticket with the Terra Support Team called “Cannot access linked files in GREGoR dataset”. My copy of this email chain has 60 emails, so I’ll skip plenty of small back-and-forths below.
02/13/2024: We receive an update that the first bug fix has been released. Some data snapshots had been misplaced during a recent update and are now restored. While the update allows the internal Terra team to access the data, Ben and I still don’t have any ability to download data.
02/20/2024: We get another update that the permissions of the snapshots are likely the issue and it’s being investigated.
03/05/2024: Snapshot permissions are updated and released. Ben is now able to download the data, but I still can’t, for an unknown reason.
03/12/2024: I have a meeting for “AnVIL GREGoR access troubleshooting”. We see that requesting to download a file causes a combination of 400 and 401 errors, but the underlying issue is unclear.
03/15/2024: I have a second meeting with several engineers from the Broad Institute. In about 15 minutes, we determine that the 400 response is coming from a “Bad Request Header”. The “Bad Request Header” is caused by the fact that my Google profile picture is too large. Removing the profile picture allows me to access the GREGoR data.
Yep, the final issue blocking my access to GREGoR genomes was that my profile picture resolution was too high! A pretty fun end to the saga.
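For anyone debugging a similar 400 response, here is a minimal Python sketch of one way to check whether a request's headers might be too large. It is not the tooling the Broad engineers used, and we don't know the actual limit or the exact mechanism on the Terra side. The function names and the 8 KiB limit are assumptions for illustration (some common web servers default to header limits of roughly that size).

```python
from typing import Dict, List, Tuple

# Assumption: 8 KiB stands in for the server's header limit.
# The real limit that applied in our case is not known to us.
ASSUMED_LIMIT_BYTES = 8 * 1024


def header_line_size(name: str, value: str) -> int:
    """Approximate bytes one header takes on the wire: name, ': ', value, CRLF."""
    return len(f"{name}: {value}".encode("utf-8")) + 2  # +2 for the trailing CRLF


def summarize_headers(
    headers: Dict[str, str], limit: int = ASSUMED_LIMIT_BYTES
) -> Tuple[int, List[Tuple[str, int]], bool]:
    """Return (total bytes, per-header sizes sorted largest first, exceeds limit?)."""
    sizes = sorted(
        ((name, header_line_size(name, value)) for name, value in headers.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    total = sum(size for _, size in sizes)
    return total, sizes, total > limit


# Hypothetical example: one very long header value next to a normal one.
example_headers = {
    "Authorization": "Bearer " + "x" * 10_000,
    "Accept": "application/json",
}
total_bytes, per_header, too_big = summarize_headers(example_headers)
print(f"total header bytes: {total_bytes}, over assumed limit: {too_big}")
for header_name, size in per_header:
    print(f"  {header_name}: {size} bytes")
```

Sorting the headers by size makes the culprit obvious at a glance. In our case, that kind of view would have pointed at whatever header was carrying the profile data much sooner.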
If you're interested in accessing this data yourself, particularly as a small startup, feel free to contact us! We'd be more than happy to hear what you're working on and see how we can help!