Here we describe the step-by-step process of creating a dataset of 8,431 open-source Android apps.
The queries below are run against the bigquery-public-data:github_repos
dataset in Google’s BigQuery.
All other steps can be followed with our open-source data collection tool.
SELECT
repo_name,
path,
id
FROM
[bigquery-public-data:github_repos.files]
WHERE
path LIKE
'%AndroidManifest.xml'
github_repos.files
AndroidManifest.xml
filesall_manifest_files
SELECT
M.id as id,
M.repo_name as repo_name,
M.path as path,
C.content as content
FROM
[all_manifest_files] AS M
JOIN
[bigquery-public-data:github_repos.contents] AS C
ON
M.id = C.id
WHERE
NOT C.binary
github_repos.contents
all_manifest_contents
SELECT
id,
repo_name,
path,
REGEXP_EXTRACT(
content,
r'(?is)<manifest[^>]*package=[\'"]([\w\.]*)[\'"]'
) AS package
FROM
[all_manifest_contents]
HAVING
package IS NOT null
all_manifest_contents
package
attributeall_package_names
SELECT
package
FROM
[all_package_names]
GROUP BY
package
all_package_names
distinct_package_names
https://play.google.com/store/apps/details?id=<package_name>
200
is counted as verification for existence on
Google Play.403 Unauthorized
is returned.
With a mapping from package names to repositories from BigQuery and data from Google Play, we try to match a unique GitHub project to each package name.
To that end a list of package names with GitHub repositories the package occurs in is exported:
SELECT
package,
COUNT(repo_name) AS repo_count,
GROUP_CONCAT(repo_name) AS all_repos
FROM
[:manifest_files_in_github.all_package_names]
WHERE
package IN (
SELECT
package_name
FROM
[:manifest_files_in_github.package_names_200] )
GROUP BY
package
That list is used as input for the mapping heuristics.
This resulted in the final set of 8,431 open-source Android apps in 8,216 GitHub repositories.
The results of the data collection process are a list of 8,431 open-source Android apps with metadata from their Google Play pages and 8,216 GitHub repositories with the source code of those apps.
All this information is made available in two ways:
GooglePlayPage
Property | Type | Description |
---|---|---|
docId | String | Identifier of an app, com.example.app . This property is always present. |
uri | String | The URI of the Google Play page. |
snapshotTimestamp | Long | POSIX timestamp when metadata from the Google Play entry was stored. |
title | String | Title of the app listing. |
appCategory | List of Strings | A list of categories such as “Tools”. |
promotionalDescription | String | Short description of the app. |
descriptionHtml | String | Description of the app in original language. |
translatedDescriptionHtml | String | Translation of descriptionHtml if available. |
versionCode | Int | Numeric value of the version of the app. |
versionString | String | Human readable version string. |
uploadDate | Long | POSIX timestamp of latest update of app. |
formattedAmount | String | Price of app (“Free” or “$1.66”) |
currencyCode | String | Three character currency code of price (“USD”) |
in-app purchases | String | Description of in-app purchases (“$3.19 per item”) |
installNotes | String | Either “Contains ads” or no value. |
starRating | Float | Average review between 0 and 5. May not be available if too few users have rated yet. |
numDownloads | String | Estimated number of downloads as displayed on Google Play (e.g “10,000+ downloads”). |
developerName | String | Name of developer. |
developerEmail | String | Email address of developer. |
developerWebsite | String | URI of website. |
targetSdkVersion | Int | Android SDK version the app targets. |
permissions | List of Strings | List of permission identifiers. |
GitHubRepository
Property | Type | Description |
---|---|---|
id | Long | Numerical identifier of this repository on GitHub. |
owner | String | Owner name at snapshot time. |
name | String | Repository name at snapshot time. |
snapshot | String | URI to clone of the repository. |
snapshotTimestamp | Long | POSIX timestamp when snapshot was taken. |
description | String | Short description of the repository. |
createdAt | Long | POSIX timestamp when repository has been created. |
forksCount | Int | Number of forks from this repository created with GitHub’s fork functionality. Other ways of forking, cloning locally and pushing to a new repostitory are not counted. |
stargazersCount | Int | Number of GitHub users having starred this repository. |
subscribersCount | Int | Number of GitHub subscribers. |
watchersCount | Int | Number of users watching this repository. |
networkCount | Int | Number of repositories forked from same source. |
ownerType | String | Account type of the owner, either “User” or “Organization”. |
parentId | Long | Id of parent repository if this is a fork, otherwise -1. |
sourceId | Long | Id of ancestor repository if this is a fork, otherwise -1. |