Here we describe the step-by-step process of creating a dataset of 8,431 open-source Android apps.
The queries below are run against the bigquery-public-data:github_repos dataset in Google’s BigQuery.
All other steps can be followed with our open-source data collection tool.

    SELECT
      repo_name,
      path,
      id
    FROM
      [bigquery-public-data:github_repos.files]
    WHERE
      path LIKE
        '%AndroidManifest.xml'
github_repos.filesAndroidManifest.xml filesall_manifest_files    SELECT
      M.id as id,
      M.repo_name as repo_name,
      M.path as path,
      C.content as content
    FROM
      [all_manifest_files] AS M
    JOIN
      [bigquery-public-data:github_repos.contents] AS C
    ON
      M.id = C.id
    WHERE
      NOT C.binary
github_repos.contentsall_manifest_contents    SELECT
      id,
      repo_name,
      path,
      REGEXP_EXTRACT(
        content,
        r'(?is)<manifest[^>]*package=[\'"]([\w\.]*)[\'"]'
      ) AS package
    FROM
      [all_manifest_contents]
    HAVING
      package IS NOT null
all_manifest_contentspackage attributeall_package_names    SELECT
      package
    FROM
      [all_package_names]
    GROUP BY
      package
all_package_namesdistinct_package_nameshttps://play.google.com/store/apps/details?id=<package_name>200 is counted as verification for existence on
Google Play.403 Unauthorized is returned.
    With a mapping from package names to repositories from BigQuery and data from Google Play, we try to match a unique GitHub project to each package name.
To that end a list of package names with GitHub repositories the package occurs in is exported:
    SELECT
      package,
      COUNT(repo_name) AS repo_count,
      GROUP_CONCAT(repo_name) AS all_repos
    FROM
      [:manifest_files_in_github.all_package_names]
    WHERE
      package IN (
      SELECT
        package_name
      FROM
        [:manifest_files_in_github.package_names_200] )
    GROUP BY
      package
That list is used as input for the mapping heuristics.
This resulted in the final set of 8,431 open-source Android apps in 8,216 GitHub repositories.
The results of the data collection process are a list of 8,431 open-source Android apps with metadata from their Google Play pages and 8,216 GitHub repositories with the source code of those apps.
All this information is made available in two ways:

GooglePlayPage| Property | Type | Description | 
|---|---|---|
| docId | String | Identifier of an app, com.example.app. This property is always present. | 
    
| uri | String | The URI of the Google Play page. | 
| snapshotTimestamp | Long | POSIX timestamp when metadata from the Google Play entry was stored. | 
| title | String | Title of the app listing. | 
| appCategory | List of Strings | A list of categories such as “Tools”. | 
| promotionalDescription | String | Short description of the app. | 
| descriptionHtml | String | Description of the app in original language. | 
| translatedDescriptionHtml | String | Translation of descriptionHtml if available. | 
    
| versionCode | Int | Numeric value of the version of the app. | 
| versionString | String | Human readable version string. | 
| uploadDate | Long | POSIX timestamp of latest update of app. | 
| formattedAmount | String | Price of app (“Free” or “$1.66”) | 
| currencyCode | String | Three character currency code of price (“USD”) | 
| in-app purchases | String | Description of in-app purchases (“$3.19 per item”) | 
| installNotes | String | Either “Contains ads” or no value. | 
| starRating | Float | Average review between 0 and 5. May not be available if too few users have rated yet. | 
| numDownloads | String | Estimated number of downloads as displayed on Google Play (e.g “10,000+ downloads”). | 
| developerName | String | Name of developer. | 
| developerEmail | String | Email address of developer. | 
| developerWebsite | String | URI of website. | 
| targetSdkVersion | Int | Android SDK version the app targets. | 
| permissions | List of Strings | List of permission identifiers. | 
GitHubRepository| Property | Type | Description | 
|---|---|---|
| id | Long | Numerical identifier of this repository on GitHub. | 
| owner | String | Owner name at snapshot time. | 
| name | String | Repository name at snapshot time. | 
| snapshot | String | URI to clone of the repository. | 
| snapshotTimestamp | Long | POSIX timestamp when snapshot was taken. | 
| description | String | Short description of the repository. | 
| createdAt | Long | POSIX timestamp when repository has been created. | 
| forksCount | Int | Number of forks from this repository created with GitHub’s fork functionality. Other ways of forking, cloning locally and pushing to a new repostitory are not counted. | 
| stargazersCount | Int | Number of GitHub users having starred this repository. | 
| subscribersCount | Int | Number of GitHub subscribers. | 
| watchersCount | Int | Number of users watching this repository. | 
| networkCount | Int | Number of repositories forked from same source. | 
| ownerType | String | Account type of the owner, either “User” or “Organization”. | 
| parentId | Long | Id of parent repository if this is a fork, otherwise -1. | 
| sourceId | Long | Id of ancestor repository if this is a fork, otherwise -1. |