Here we describe the step-by-step process of creating a dataset of 8,431 open-source Android apps.
The queries below are run against the
bigquery-public-data:github_repos dataset in Google’s BigQuery.
All other steps can be followed with our open-source data collection tool.
SELECT repo_name, path, id FROM [bigquery-public-data:github_repos.files] WHERE path LIKE '%AndroidManifest.xml'
SELECT M.id as id, M.repo_name as repo_name, M.path as path, C.content as content FROM [all_manifest_files] AS M JOIN [bigquery-public-data:github_repos.contents] AS C ON M.id = C.id WHERE NOT C.binary
SELECT id, repo_name, path, REGEXP_EXTRACT( content, r'(?is)<manifest[^>]*package=[\'"]([\w\.]*)[\'"]' ) AS package FROM [all_manifest_contents] HAVING package IS NOT null
SELECT package FROM [all_package_names] GROUP BY package
200is counted as verification for existence on Google Play.
403 Unauthorizedis returned.
With a mapping from package names to repositories from BigQuery and data from Google Play, we try to match a unique GitHub project to each package name.
To that end a list of package names with GitHub repositories the package occurs in is exported:
SELECT package, COUNT(repo_name) AS repo_count, GROUP_CONCAT(repo_name) AS all_repos FROM [:manifest_files_in_github.all_package_names] WHERE package IN ( SELECT package_name FROM [:manifest_files_in_github.package_names_200] ) GROUP BY package
That list is used as input for the mapping heuristics.
This resulted in the final set of 8,431 open-source Android apps in 8,216 GitHub repositories.
The results of the data collection process are a list of 8,431 open-source Android apps with metadata from their Google Play pages and 8,216 GitHub repositories with the source code of those apps.
All this information is made available in two ways:
|docId||String||Identifier of an app,
|uri||String||The URI of the Google Play page.|
|snapshotTimestamp||Long||POSIX timestamp when metadata from the Google Play entry was stored.|
|title||String||Title of the app listing.|
|appCategory||List of Strings||A list of categories such as “Tools”.|
|promotionalDescription||String||Short description of the app.|
|descriptionHtml||String||Description of the app in original language.|
|versionCode||Int||Numeric value of the version of the app.|
|versionString||String||Human readable version string.|
|uploadDate||Long||POSIX timestamp of latest update of app.|
|formattedAmount||String||Price of app (“Free” or “$1.66”)|
|currencyCode||String||Three character currency code of price (“USD”)|
|in-app purchases||String||Description of in-app purchases (“$3.19 per item”)|
|installNotes||String||Either “Contains ads” or no value.|
|starRating||Float||Average review between 0 and 5. May not be available if too few users have rated yet.|
|numDownloads||String||Estimated number of downloads as displayed on Google Play (e.g “10,000+ downloads”).|
|developerName||String||Name of developer.|
|developerEmail||String||Email address of developer.|
|developerWebsite||String||URI of website.|
|targetSdkVersion||Int||Android SDK version the app targets.|
|permissions||List of Strings||List of permission identifiers.|
|id||Long||Numerical identifier of this repository on GitHub.|
|owner||String||Owner name at snapshot time.|
|name||String||Repository name at snapshot time.|
|snapshot||String||URI to clone of the repository.|
|snapshotTimestamp||Long||POSIX timestamp when snapshot was taken.|
|description||String||Short description of the repository.|
|createdAt||Long||POSIX timestamp when repository has been created.|
|forksCount||Int||Number of forks from this repository created with GitHub’s fork functionality. Other ways of forking, cloning locally and pushing to a new repostitory are not counted.|
|stargazersCount||Int||Number of GitHub users having starred this repository.|
|subscribersCount||Int||Number of GitHub subscribers.|
|watchersCount||Int||Number of users watching this repository.|
|networkCount||Int||Number of repositories forked from same source.|
|ownerType||String||Account type of the owner, either “User” or “Organization”.|
|parentId||Long||Id of parent repository if this is a fork, otherwise -1.|
|sourceId||Long||Id of ancestor repository if this is a fork, otherwise -1.|