The Docker image containing the graph database is based on the official Neo4j image. The only difference is that this image contains the dataset and an EXTENSION_SCRIPT (load_db.sh) which preloads the data when the container starts.
Pull the image from Docker Hub:
docker pull androidtimemachine/neo4j_open_source_android_apps
Then start a container, for example:
docker run --rm --detach=true \
--publish=7474:7474 --publish=7687:7687 \
androidtimemachine/neo4j_open_source_android_apps
This command starts a container from the image and exposes the ports used by Neo4j. The --rm option tells Docker to remove the container, including any data newly created inside it, once it has stopped running.
Map volumes into the container in order to persist state between executions:
docker run --rm --detach=true \
--publish=7474:7474 --publish=7687:7687 \
--volume=$HOME/neo4j/data:/data \
--volume=$HOME/neo4j/logs:/logs \
androidtimemachine/neo4j_open_source_android_apps
When running the container for the first time, data gets imported into the graph database. This can take several seconds. Subsequent starts with an existing database in a mapped volume skip the importing step.
You can access the Neo4j web interface at http://localhost:7474 and connect Bolt clients to bolt://localhost:7687.
Alternatively, a Cypher shell can be run with the bin/cypher-shell command once a container is running. Use the container ID returned by the docker run command or look it up with docker ps.
$ docker run <...> # As above
6455917a2532b0c9bc335f93568022bd66c6dd4208f16b29b7f8b14b9418238b
$ docker exec --interactive --tty 6455917a2532b0c9bc335f93568022bd66c6dd4208f16b29b7f8b14b9418238b bin/cypher-shell
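To avoid copying the container ID by hand, the lookup can be scripted. The following is a minimal sketch, assuming at least one container started from this image is running; the function names neo4j_container_id and open_cypher_shell are hypothetical helpers and not part of the image.

```shell
IMAGE="androidtimemachine/neo4j_open_source_android_apps"

# Print the ID of the first listed running container started from IMAGE.
neo4j_container_id() {
  docker ps --quiet --filter "ancestor=${IMAGE}" | head -n 1
}

# Open an interactive Cypher shell inside that container.
open_cypher_shell() {
  docker exec --interactive --tty "$(neo4j_container_id)" bin/cypher-shell
}

# Usage (requires a running container):
#   open_cypher_shell
```

If several containers from the same image are running, only the first one listed by docker ps is used.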
When logging in for the first time, a new password needs to be set. Log in with username neo4j and password neo4j to set a new password. This step can be skipped by setting a default password or disabling authentication.
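As a sketch of how the password step might be skipped: the official Neo4j image reads the NEO4J_AUTH environment variable, and since this image is based on it, the same variable is assumed to work here. The helper name start_neo4j is hypothetical, and <password> is a placeholder to replace with a password of your choice.

```shell
IMAGE="androidtimemachine/neo4j_open_source_android_apps"

# Start the container with a given NEO4J_AUTH value (assumed to be
# honoured as in the official Neo4j image):
#   none               disables authentication
#   neo4j/<password>   sets the initial password for user neo4j
start_neo4j() {
  docker run --rm --detach=true \
    --publish=7474:7474 --publish=7687:7687 \
    --env=NEO4J_AUTH="$1" \
    "${IMAGE}"
}

# Usage:
#   start_neo4j none
#   start_neo4j 'neo4j/<password>'
```

Disabling authentication is only reasonable for local experiments where the published ports are not reachable from other machines.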
Below we list a series of example queries that highlight how to explore data in the graph.
For some of these queries, the Neo4j plugin APOC is necessary. Install it by mapping it into the container as follows:
$ mkdir plugins
$ cd plugins
$ wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/3.3.0.1/apoc-3.3.0.1-all.jar
$ docker run --rm --detach=true \
--publish=7474:7474 --publish=7687:7687 \
--volume=$PWD:/plugins \
androidtimemachine/neo4j_open_source_android_apps
Select apps belonging to the Finance category with more than 10 commits in a given week.
WITH apoc.date.parse('2017-01-01', 's', 'yyyy-MM-dd')
as start,
apoc.date.parse('2017-01-08', 's', 'yyyy-MM-dd')
as end
MATCH (p:GooglePlayPage)<-[:PUBLISHED_AT]-
(a:App)-[:IMPLEMENTED_BY]->
(:GitHubRepository)<-[:BELONGS_TO]-
(:Commit)<-[c:COMMITS]-(:Contributor)
WHERE 'Finance' in p.appCategory
AND start <= c.timestamp < end
WITH a.id as package, SIZE(COLLECT(DISTINCT c)) as commitCount
WHERE commitCount > 10
RETURN package, commitCount
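With the 's' unit, apoc.date.parse returns Unix epoch seconds, so the comparison above implies that c.timestamp holds epoch seconds as well. If APOC is not available, the two bounds can be computed outside the database and pasted into the query as literals. A minimal sketch, assuming GNU date (the -d option behaves differently on BSD/macOS) and UTC as the time zone; to_epoch is a hypothetical helper name.

```shell
# Convert a yyyy-MM-dd date to Unix epoch seconds at midnight UTC.
to_epoch() {
  date -u -d "$1" +%s
}

to_epoch 2017-01-01   # prints 1483228800
to_epoch 2017-01-08   # prints 1483833600
```

The two values differ by 604800 seconds, which is exactly one week.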
Select contributors who worked on more than one app in a given month.
WITH apoc.date.parse('2017-07-01', 's', 'yyyy-MM-dd')
as start,
apoc.date.parse('2017-08-01', 's', 'yyyy-MM-dd')
as end
MATCH (app1:App)-[:IMPLEMENTED_BY]->
(:GitHubRepository)<-[:BELONGS_TO]-
(:Commit)<-[c1:COMMITS|AUTHORS]-
(c:Contributor)-[c2:COMMITS|AUTHORS]->
(:Commit)-[:BELONGS_TO]->
(:GitHubRepository)<-[:IMPLEMENTED_BY]-
(app2:App)
WHERE c.email <> 'noreply@github.com'
AND app1.id <> app2.id
AND start <= c1.timestamp < end
AND start <= c2.timestamp < end
RETURN DISTINCT c
LIMIT 20
Providing our dataset in containerized form allows future research to easily augment the data and combine it for new insights. The following is a very simple example showcasing this possibility. Assuming all commits have been tagged with the self-reported activity of developers, select all commits in which the developer is fixing a performance-related bug. For demonstration purposes, a very simple tagger is applied: it labels every commit whose message contains the word 'performance'. Ideally, tagging would be done with a more sophisticated model.
MATCH (c:Commit)
WHERE c.message CONTAINS 'performance'
SET c :PerformanceFix
Given this additional label, performance-related fixes can then be used in any query via the following snippet.
MATCH (c:Commit:PerformanceFix) RETURN c LIMIT 20
Metadata from GitHub and Google Play can be combined and compared. Both platforms have popularity measures such as star ratings. The following query returns these metrics for further analysis.
MATCH (r:GitHubRepository)<-[:IMPLEMENTED_BY]-
(a:App)-[:PUBLISHED_AT]->(p:GooglePlayPage)
RETURN a.id, p.starRating, r.forksCount,
r.stargazersCount, r.subscribersCount,
r.watchersCount, r.networkCount
LIMIT 20
Does a higher number of contributors relate to more successful apps? The following query returns the average review rating on Google Play and the number of contributors to the source code.
MATCH (c:Contributor)-[:AUTHORS|COMMITS]->
(:Commit)-[:BELONGS_TO]->
(:GitHubRepository)<-[:IMPLEMENTED_BY]-
(a:App)-[:PUBLISHED_AT]->(p:GooglePlayPage)
WITH p.starRating as rating, a.id as package,
SIZE(COLLECT(DISTINCT c)) as contribCount
RETURN package, rating, contribCount
LIMIT 20
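One way to start answering the question is a correlation coefficient over the returned rows. The sketch below is an illustration under stated assumptions and not part of the dataset tooling: it expects lines of the form rating,contribCount on standard input (for example, exported by hand from the query result with the LIMIT removed), and pearson is a hypothetical helper name.

```shell
# Pearson correlation of two comma-separated numeric columns read from stdin.
pearson() {
  awk -F, '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
      num = n * sxy - sx * sy
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      if (den == 0) { print "undefined"; exit }
      printf "%.3f\n", num / den
    }'
}

# Toy input, not real data: a perfectly linear relation prints 1.000.
printf '1,2\n2,4\n3,6\n' | pearson
```

A value near 1 or -1 would indicate a strong linear relation, a value near 0 none; the coefficient alone says nothing about causation.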