
A drawback and its solution

A drawback of DBLP Communities is the way it builds the collaboration networks. The DBLP server limits the number of requests that can be made to it within a given period of time. As a consequence, the collaboration network of an author with a large coauthor list cannot be built. The current version of DBLP Communities notifies the user about this drawback.

A solution to this problem has already been implemented and can be put into production, but it requires some funding, since hosting the database on a server is not free.

The solution is described below and sketched in Figure 9.5.

A stand-alone Spring Boot application downloads the XML file containing all the publications in the DBLP database. The application goes through every publication in the XML file and creates a node for every author in memory.

For every pair of coauthors an edge is made, which is simply a reference from one author object to another author object. As soon as 500 or more authors have been stored in memory, they are sent to the database in a single transaction. Authors and coauthor relationships that are not already present in the database are added, and the rest are discarded.
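The batching step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code of the actual application: the class names `Author` and `CoauthorBatcher` are hypothetical, the database transaction is replaced by a plain callback, and the batch size is a constructor parameter so that the threshold of 500 can be passed in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical in-memory author node; an edge is a reference to another Author.
final class Author {
    final String name;
    final Set<Author> coauthors = new HashSet<>();

    Author(String name) {
        this.name = name;
    }
}

// Collects authors in memory and hands them to a sink in batches.
final class CoauthorBatcher {
    private final int batchSize;
    // Assumption: the sink stands in for one database transaction.
    private final Consumer<List<Author>> sink;
    private final Map<String, Author> pending = new HashMap<>();

    CoauthorBatcher(int batchSize, Consumer<List<Author>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Registers one publication: every pair of its authors becomes an edge.
    void addPublication(List<String> authorNames) {
        List<Author> authors = new ArrayList<>();
        for (String name : authorNames) {
            authors.add(pending.computeIfAbsent(name, Author::new));
        }
        for (Author a : authors) {
            for (Author b : authors) {
                if (a != b) {
                    a.coauthors.add(b);
                }
            }
        }
        // Flush as soon as the batch size has been reached or exceeded.
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends all pending authors to the sink and empties the buffer.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending.values()));
        pending.clear();
    }

    public static void main(String[] args) {
        CoauthorBatcher batcher = new CoauthorBatcher(500,
                batch -> System.out.println("storing " + batch.size() + " authors"));
        batcher.addPublication(List.of("John Doe", "Jane Roe"));
        batcher.flush(); // write the remainder after the last publication
    }
}
```

In this sketch an author who reappears after a flush gets a fresh object, so the sink has to merge it with the record already stored. This corresponds to the step in which only authors and relationships not yet present in the database are added.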

The application utilizes Spring Data Neo4j to create a Neo4j database containing one node per author in the DBLP database and with an edge labeled "is coauthor of" between two nodes if their corresponding authors have published together. The database model is shown in Figure 9.6. The resulting database takes almost 430 MB of storage. The application can be put on a server and included in a task scheduler to be run once a month since the XML file of the DBLP database is updated once a month.

The Neo4j database will run in server mode on e.g. GrapheneDB (a service that hosts Neo4j databases for 50 dollars a month if you need 1 GB of storage) and be available through the standard Neo4j server mode REST API. The Neo4j server includes an unmanaged server extension that extends the standard REST API of Neo4j with an additional GET request of the form addressOfNeo4jDatabase/unmanaged/neighborhood/John Doe. The server responds with the collaboration network of John Doe in JSON format. DBLP Communities, in turn, consumes this resource with the help of Spring Data Rest, which automatically transforms the JSON content to a predesigned POJO.
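As an illustration of the GET request, the following sketch builds the request URI with the JDK's own HTTP classes instead of Spring. The class name `NeighborhoodClient`, the base address `http://localhost:7474` (the default HTTP port of a local Neo4j server) and the `Accept` header are assumptions made for the example. Only the path /unmanaged/neighborhood/ followed by the author name is taken from the description above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds the neighborhood request for one author.
final class NeighborhoodClient {

    // Builds <baseUrl>/unmanaged/neighborhood/<author>, percent-encoding the name.
    static URI neighborhoodUri(String baseUrl, String author) {
        // URLEncoder targets form encoding and emits '+' for spaces;
        // a path segment needs "%20" instead.
        String segment = URLEncoder.encode(author, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return URI.create(baseUrl + "/unmanaged/neighborhood/" + segment);
    }

    // Builds (but does not send) the GET request asking for JSON.
    static HttpRequest neighborhoodRequest(String baseUrl, String author) {
        return HttpRequest.newBuilder(neighborhoodUri(baseUrl, author))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Assumption: a Neo4j server reachable at localhost:7474.
        System.out.println(neighborhoodUri("http://localhost:7474", "John Doe"));
        // prints http://localhost:7474/unmanaged/neighborhood/John%20Doe
    }
}
```

The space in the author name has to be encoded because a raw space is not allowed in a URI path.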

If John Doe has k coauthors, then k + 1 GET requests for XML files to the DBLP server are replaced by one GET request to the Neo4j server for the neighborhood of John Doe. The Neo4j database already stores the entire collaboration network of the DBLP database and simply has to return the induced subgraph containing the neighborhood of John Doe, so a drastic speedup can be achieved for authors with many coauthors.
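To make the notion of the induced neighborhood subgraph concrete, here is a minimal sketch on an in-memory adjacency map. It does not reproduce the server extension itself. The class name `Neighborhood` and the example authors are hypothetical, and the graph is assumed to be undirected and stored symmetrically.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper computing the collaboration network of one author.
final class Neighborhood {

    // Returns the subgraph induced by the center and its direct coauthors:
    // all of these nodes, and every edge whose two endpoints are among them.
    static Map<String, Set<String>> induced(Map<String, Set<String>> graph, String center) {
        Set<String> nodes = new TreeSet<>(graph.getOrDefault(center, Set.of()));
        nodes.add(center);

        Map<String, Set<String>> result = new TreeMap<>();
        for (String node : nodes) {
            Set<String> kept = new TreeSet<>(graph.getOrDefault(node, Set.of()));
            kept.retainAll(nodes); // drop edges that leave the neighborhood
            result.put(node, kept);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = Map.of(
                "John Doe", Set.of("Jane Roe", "Bob Poe"),
                "Jane Roe", Set.of("John Doe", "Bob Poe"),
                "Bob Poe", Set.of("John Doe", "Jane Roe", "Eve Moe"),
                "Eve Moe", Set.of("Bob Poe"));
        // Eve Moe is not a coauthor of John Doe, so she and her edge are left out.
        System.out.println(induced(graph, "John Doe"));
    }
}
```

With the whole network stored in one place, this is a local lookup around a single node, whereas the current approach needs one request to the DBLP server for the author and one for each of the k coauthors.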

It should, however, be noted that a Neo4j database supporting weighted networks has not been created due to lack of time.

Figure 9.4: (a) The top of the results page. (b) The middle of the results page. (c) The bottom of the results page.

[Figure 9.5 diagram: a Glassfish 4 application server running DBLP Communities, connected over HTTP to a Neo4j server comprising an embedded web server, the REST API, the unmanaged server extension, and the Neo4j data store.]

Figure 9.5: The alternative application setup. DBLP Communities is backed by a Neo4j database running on a separate server. The REST API is extended with an unmanaged server extension to provide JSON export of collaboration networks.

[Figure 9.6 diagram: the nodes John Doe and Jane Roe joined by an IS_COAUTHOR_OF edge with the attribute count = x.]

Figure 9.6: The Neo4j database model. John Doe and Jane Roe are two fictional authors in the DBLP database and they have published x articles together. The count attribute is not implemented in the database due to lack of time.

Chapter 10

Conclusion and further research

10.1 Summary

In this thesis, implementations of the CNM and RECC algorithms have been presented together with a web service that offers community detection in the DBLP database using the RECC algorithm and lets the user set various parameters of this algorithm. Testing community detection algorithms is not altogether easy, as was shown in the chapter on the CNM algorithm: social networks are often hierarchically organized, with smaller communities living inside larger communities, and people have different opinions on how large the output communities should be. This was part of the motivation for making a web service that reveals the different partitions in the dendrogram found by the algorithm and can output communities of a certain size.

Both the modularity function used by the CNM algorithm and the edge clustering coefficient used by the RECC algorithm are simple functions and are therefore not able to help find the very best partitions into communities. However, they are able to find relatively good partitions and for many purposes they may be good enough. In the end it will be up to the knowledgeable user to evaluate the results on each network, and our web service may help achieve this.

10.2 Further research and application