Sparsity-technologies: high-performance graph database, data deduplication and bibliographic exploration


* Graph Database Use Case: SNA (Social Network Analysis)

May 7th, 2012

In this second installment of our use case series, we look at one of the most interesting scenarios for graph databases: Social Network Analysis (SNA).

DEX's high performance on huge volumes of data, its flexibility and its native graph model make it an excellent fit for Social Network Analysis.

More info? We have created a new section on the DEX site called Scenarios, with detailed explanations of the fields where graph databases are key and in which we have plenty of experience. We will be adding more, so stay tuned!

Do not forget to check the list of features we believe SNA must cover. We welcome your feedback! Please tell us what you think the requirements and goals of SNA are, and why graph databases could be a good solution.

If you think SNA is your area, we encourage you to evaluate DEX here, and do not hesitate to contact us at info@sparsity-technologies.com for additional support. Use our knowledge in the SNA field!

Read also the first release of the Use Case series: Bibliographic exploration


* How to use DEX algorithm package

March 26th, 2012

The latest version of DEX includes a helpful algorithm package that adds higher-level operations to the API.

Below is the list of algorithms, each explained with usage examples for the Java and .NET APIs:

Traversal algorithms
To traverse a graph is to visit its nodes. You can choose between the DFS and BFS techniques.

DFS (depth-first search) visits nodes starting at the root, selecting one neighbor and exploring as far as possible along each branch before backtracking.

BFS (breadth-first search) visits nodes starting at the root, exploring all of its neighbors first, then their neighbors, and so on.

With both techniques you can restrict the traversal to certain node types or navigate only through certain edge types.

Java example:
System.out.println("Traversal BFS");
// Create a new BFS traversal from the node "startingNode"
TraversalBFS bfs = new TraversalBFS(sess, startingNode);
// Allow the use of all the node types
bfs.addAllNodeTypes();
// Allow the use of all the edge types but only in outgoing direction
bfs.addAllEdgeTypes(EdgesDirection.Outgoing);
// Limit the depth to 3 hops from the starting node
bfs.setMaximumHops(3);
// Get the nodes
while (bfs.hasNext()) {
    long nodeid = bfs.next();
    int depth = bfs.getCurrentDepth();
    System.out.println("Node "+nodeid+" at depth "+depth+".");
}
// Close the traversal
bfs.close();

TraversalDFS is used in the same way.

.Net example:
System.Console.WriteLine("Traversal BFS");
// Create a new BFS traversal from the node "startingNode"
TraversalBFS bfs = new TraversalBFS(sess, startingNode);
// Allow the use of all the node types
bfs.AddAllNodeTypes();
// Allow the use of all the edge types but only in outgoing direction
bfs.AddAllEdgeTypes(EdgesDirection.Outgoing);
// Limit the depth to 3 hops from the starting node
bfs.SetMaximumHops(3);
// Get the nodes
while (bfs.HasNext()) {
    long nodeid = bfs.Next();
    int depth = bfs.GetCurrentDepth();
    System.Console.WriteLine("Node "+nodeid+" at depth "+depth+".");
}
// Close the traversal
bfs.Close();

TraversalDFS is used in the same way.
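To make the difference between the two visiting orders concrete, here is a minimal sketch in plain Java. It does not use the DEX API: the class name, the sample graph and the adjacency-map representation are all assumptions chosen for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: plain Java collections, not the DEX traversal classes.
final class TraversalOrderSketch {

    // BFS: visit all neighbors of a node before going one level deeper.
    static List<Integer> bfs(Map<Integer, List<Integer>> graph, int root) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            order.add(node);
            for (int next : graph.getOrDefault(node, List.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return order;
    }

    // DFS: follow one branch as far as possible before backtracking.
    static List<Integer> dfs(Map<Integer, List<Integer>> graph, int root) {
        List<Integer> order = new ArrayList<>();
        dfsVisit(graph, root, new HashSet<>(), order);
        return order;
    }

    private static void dfsVisit(Map<Integer, List<Integer>> graph, int node,
                                 Set<Integer> seen, List<Integer> order) {
        if (!seen.add(node)) {
            return;
        }
        order.add(node);
        for (int next : graph.getOrDefault(node, List.of())) {
            dfsVisit(graph, next, seen, order);
        }
    }

    // Hypothetical sample graph with outgoing edges 1->2, 1->3, 2->4, 3->5.
    static Map<Integer, List<Integer>> sampleGraph() {
        return Map.of(1, List.of(2, 3), 2, List.of(4), 3, List.of(5));
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = sampleGraph();
        System.out.println("BFS order: " + bfs(g, 1));
        System.out.println("DFS order: " + dfs(g, 1));
    }
}
```

On this sample graph BFS visits 1, 2, 3, 4, 5 (level by level), while DFS visits 1, 2, 4, 3, 5 (one branch at a time).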

Shortest path algorithms
These algorithms find the shortest way to travel from one node to another. The APIs offer two techniques: BFS and Dijkstra. Use Dijkstra when the edges carry weights that matter for the path; use BFS otherwise.

Java example:
System.out.println("SinglePairShortestPath BFS");
// Create a new unweighted shortest path from "startingNode" to "endingNode"
SinglePairShortestPathBFS spBFS = new SinglePairShortestPathBFS(sess, startingNode, endingNode);
// Allow the use of all the edge types in Any direction
spBFS.addAllEdgeTypes(EdgesDirection.Any);
// Allow the use of all the node types
spBFS.addAllNodeTypes();
// Calculate the shortest path
spBFS.run();
// Check the path if it exists
if (spBFS.exists()) {
    // Get the total path cost
    System.out.println("A shortest path exists with cost: "+spBFS.getCost()+".");
    // Get the path
    OIDList pathAsNodes = spBFS.getPathAsNodes();
    OIDListIterator pathIt = pathAsNodes.iterator();
    while (pathIt.hasNext()) {
        long nodeid = pathIt.next();
        System.out.println("Node: "+nodeid);
    }
} else {
    System.out.println("No path found");
}
// Close the shortest path
spBFS.close();

The Dijkstra variant is used analogously.

.Net example:
System.Console.WriteLine("SinglePairShortestPath BFS");
// Create a new unweighted shortest path from "startingNode" to "endingNode"
SinglePairShortestPathBFS spBFS = new SinglePairShortestPathBFS(sess, startingNode, endingNode);
// Allow the use of all the edge types in Any direction
spBFS.AddAllEdgeTypes(EdgesDirection.Any);
// Allow the use of all the node types
spBFS.AddAllNodeTypes();
// Calculate the shortest path
spBFS.Run();
// Check the path if it exists
if (spBFS.Exists()) {
    // Get the total path cost
    System.Console.WriteLine("A shortest path exists with cost: "+spBFS.GetCost()+".");
    // Get the path
    OIDList pathAsNodes = spBFS.GetPathAsNodes();
    OIDListIterator pathIt = pathAsNodes.Iterator();
    while (pathIt.HasNext()) {
        long nodeid = pathIt.Next();
        System.Console.WriteLine("Node: "+nodeid);
    }
} else {
    System.Console.WriteLine("No path found");
}
// Close the shortest path
spBFS.Close();

The Dijkstra variant is used analogously.
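The following plain Java sketch illustrates why the choice between BFS and Dijkstra matters when edges are weighted. It is not DEX code: the class name, the three-node graph and its weights are hypothetical, and the edge-list representation was picked only to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.PriorityQueue;

// Illustrative only: plain Java, not the DEX shortest path classes.
final class ShortestPathSketch {

    // Hypothetical weighted directed graph as {from, to, weight} triples.
    static final int[][] EDGES = { {0, 2, 10}, {0, 1, 1}, {1, 2, 2} };
    static final int NODES = 3;

    // BFS ignores weights: minimum number of hops, or -1 if unreachable.
    static int hopsBfs(int source, int target) {
        int[] hops = new int[NODES];
        Arrays.fill(hops, -1);
        hops[source] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            if (node == target) {
                return hops[node];
            }
            for (int[] e : EDGES) {
                if (e[0] == node && hops[e[1]] == -1) {
                    hops[e[1]] = hops[node] + 1;
                    queue.add(e[1]);
                }
            }
        }
        return -1;
    }

    // Dijkstra: minimum total weight, or -1 if unreachable.
    // Assumes non-negative weights.
    static long costDijkstra(int source, int target) {
        long[] dist = new long[NODES];
        Arrays.fill(dist, Long.MAX_VALUE);
        dist[source] = 0;
        // Queue entries are {distance, node}, ordered by distance.
        PriorityQueue<long[]> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        queue.add(new long[] {0, source});
        while (!queue.isEmpty()) {
            long[] top = queue.poll();
            int node = (int) top[1];
            if (top[0] > dist[node]) {
                continue; // stale entry, a shorter distance was already found
            }
            if (node == target) {
                return dist[node];
            }
            for (int[] e : EDGES) {
                if (e[0] == node && dist[node] + e[2] < dist[e[1]]) {
                    dist[e[1]] = dist[node] + e[2];
                    queue.add(new long[] {dist[e[1]], e[1]});
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Fewest hops 0 -> 2: " + hopsBfs(0, 2));
        System.out.println("Lowest cost 0 -> 2: " + costDijkstra(0, 2));
    }
}
```

Here the direct edge 0 -> 2 is the shortest path in hops (1 hop, weight 10), but the two-hop route through node 1 has the lowest total weight (3), which only the weighted search finds.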

Connected components algorithms
Connectivity describes to what degree a group of nodes is connected. With DEX you can find strongly connected components using Gabow's algorithm, or weakly connected components using DFS.

Java example:
System.out.println("Weak Connectivity DFS");
// Create a new WeakConnectivityDFS
WeakConnectivityDFS weakConnDFS = new WeakConnectivityDFS(sess);
// Allow the use of all the edge types
weakConnDFS.addAllEdgeTypes();
// Allow the use of all the node types
weakConnDFS.addAllNodeTypes();
// Don't set a materialized attribute
// Calculate the weakly connected components
weakConnDFS.run();
// Get the connected components
ConnectedComponents weakCC = weakConnDFS.getConnectedComponents();
long numWeakComponents = weakCC.getCount();
System.out.println("Weakly connected components: "+numWeakComponents);
for (long ii = 0; ii < weakCC.getCount(); ii++) {
    Objects ccNodes = weakCC.getNodes(ii);
    long numNodes = ccNodes.count();
    System.out.println("Connected component "+ii+" has "+numNodes+" nodes.");
    ccNodes.close();
}
// Close the connected components
weakCC.close();
// Close the WeakConnectivityDFS
weakConnDFS.close();

StrongConnectivityGabow is used analogously.

.Net example:
System.Console.WriteLine("Weak Connectivity DFS");
// Create a new WeakConnectivityDFS
WeakConnectivityDFS weakConnDFS = new WeakConnectivityDFS(sess);
// Allow the use of all the edge types
weakConnDFS.AddAllEdgeTypes();
// Allow the use of all the node types
weakConnDFS.AddAllNodeTypes();
// Don't set a materialized attribute
// Calculate the weakly connected components
weakConnDFS.Run();
// Get the connected components
ConnectedComponents weakCC = weakConnDFS.GetConnectedComponents();
long numWeakComponents = weakCC.GetCount();
System.Console.WriteLine("Weakly connected components: "+numWeakComponents);
for (long ii = 0; ii < weakCC.GetCount(); ii++) {
    Objects ccNodes = weakCC.GetNodes(ii);
    long numNodes = ccNodes.Count();
    System.Console.WriteLine("Connected component "+ii+" has "+numNodes+" nodes.");
    ccNodes.Close();
}
// Close the connected components
weakCC.Close();
// Close the WeakConnectivityDFS
weakConnDFS.Close();

StrongConnectivityGabow is used analogously.
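As a rough illustration of what "weakly connected" means, the sketch below labels components with a DFS that ignores edge direction. It is plain Java, not the DEX implementation: the class name, the five-node sample graph and the array-based labelling are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: plain Java, not the DEX connectivity classes.
final class WeakComponentsSketch {

    // Hypothetical directed edges {from, to}: 0->1, 2->1, 3->4.
    static final int[][] SAMPLE_EDGES = { {0, 1}, {2, 1}, {3, 4} };

    // Labels each node with a component index, treating edges as undirected.
    static int[] weakComponents(int nodeCount, int[][] directedEdges) {
        List<List<Integer>> adjacency = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] e : directedEdges) {
            adjacency.get(e[0]).add(e[1]);
            adjacency.get(e[1]).add(e[0]); // ignore direction
        }
        int[] component = new int[nodeCount];
        Arrays.fill(component, -1);
        int nextLabel = 0;
        for (int start = 0; start < nodeCount; start++) {
            if (component[start] != -1) {
                continue; // already reached from an earlier start node
            }
            // Iterative DFS from the first unlabelled node.
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            component[start] = nextLabel;
            while (!stack.isEmpty()) {
                int node = stack.pop();
                for (int neighbor : adjacency.get(node)) {
                    if (component[neighbor] == -1) {
                        component[neighbor] = nextLabel;
                        stack.push(neighbor);
                    }
                }
            }
            nextLabel++;
        }
        return component;
    }

    public static void main(String[] args) {
        int[] labels = weakComponents(5, SAMPLE_EDGES);
        System.out.println("Component per node: " + Arrays.toString(labels));
    }
}
```

On the sample graph, nodes 0, 1 and 2 fall into one weak component and nodes 3 and 4 into another. Because the sample has no directed cycles, a strong-connectivity algorithm would instead report five single-node components.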

Finally, do not forget to import the package when using the DEX algorithm classes above, by adding:

Java:
import com.sparsity.dex.algorithms.*;

.Net:
using com.sparsity.dex.algorithms;


* DEX Analytical Use Case Benchmark: Wikipedia

January 17th, 2011

As announced in the previous post, we would like to share an analytical use case that shows DEX's high performance. This time we look at how DEX responds to a set of queries on a single dataset, taking as a reference the results of another well-known open source graph database.

With this benchmark we would like to join the celebration of Wikipedia's 10th anniversary. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and has become the largest and most popular general reference work on the Internet, with 365 million readers. Our congratulations to everyone making Wikipedia possible!

For the benchmark we used all the Wikipedia articles written before January 2010. In particular, the loaded database contained 55M articles, 2.1M images and 321M references between articles.

With this benchmark we want to obtain the following information:

Loading times, including the generation of full index structures for the graph.
Graph database size
Response times for typical queries made to the loaded data, which include:
1. Query 1 (Q1): Finds the node with the maximum outdegree, that is, the one with the most relationships to other nodes, and then runs a BFS traversal of the graph starting from that node. More information about traversal algorithms is available in the graph algorithms post.
2. Query 2 (Q2): Finds the node with the maximum indegree, selects the nodes referencing that node and, with this new set, finds in turn those referencing every node in the set. In other words, it performs a 2-hop operation. Finally, the query ranks the nodes by number of references and returns the top 5.
3. Query 3 (Q3): Finds a pattern in the graph. The pattern looks for articles written in Catalan (CA) which are translated into English (EN) without some images from the original article.
4. Query 4 (Q4): Finds the number of articles and images for every language available.
5. Query 5 (Q5): Materializes the number of images for all the articles.
6. Query 6 (Q6): Deletes all the articles with no images from the database.
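To give an idea of the shape of a query like Q2, here is a small sketch in plain Java. It is not the benchmark code and does not use the DEX API. The sample references are hypothetical, and ranking second-hop nodes by how many first-hop nodes they reference is only one possible reading of the query description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: one possible reading of Q2 on an in-memory edge list.
final class TwoHopSketch {

    // Hypothetical reference edges {from, to}: article "from" references "to".
    static final int[][] REFERENCES = {
        {1, 0}, {2, 0}, {3, 0}, {4, 1}, {4, 2}, {5, 1}
    };

    // Node with the most incoming references (ties broken by lowest id).
    static int maxIndegreeNode() {
        Map<Integer, Integer> indegree = new HashMap<>();
        for (int[] e : REFERENCES) {
            indegree.merge(e[1], 1, Integer::sum);
        }
        int best = -1;
        int bestCount = -1;
        for (Map.Entry<Integer, Integer> entry : indegree.entrySet()) {
            int node = entry.getKey();
            int count = entry.getValue();
            if (count > bestCount || (count == bestCount && node < best)) {
                best = node;
                bestCount = count;
            }
        }
        return best;
    }

    // All nodes that reference at least one node in "targets".
    static Set<Integer> referrersOf(Set<Integer> targets) {
        Set<Integer> result = new HashSet<>();
        for (int[] e : REFERENCES) {
            if (targets.contains(e[1])) {
                result.add(e[0]);
            }
        }
        return result;
    }

    // Second-hop nodes ranked by how many first-hop nodes they reference.
    static List<Integer> topSecondHop(int limit) {
        Set<Integer> firstHop = referrersOf(Set.of(maxIndegreeNode()));
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] e : REFERENCES) {
            if (firstHop.contains(e[1])) {
                counts.merge(e[0], 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        System.out.println("Max indegree node: " + maxIndegreeNode());
        System.out.println("Top second-hop nodes: " + topSecondHop(5));
    }
}
```

In this sample, node 0 has the highest indegree, its referrers are 1, 2 and 3, and the second hop ranks node 4 (two references into that set) ahead of node 5 (one reference).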

See the results in the following table:

Notably, loading all the Wikipedia articles into DEX took only 2.25 hours, with a resulting database size of 16.98 GB. This shows that huge amounts of information can be loaded into DEX with no disk restrictions and in reasonable time. DEX came out ahead on all six queries: it was more than two orders of magnitude faster on every query except Q3, where it was still one order of magnitude faster.

In this benchmark DEX gives the best results in both loading time and query response time, making it an attractive option for solutions with big volumes of data that are cumbersome to analyze. Try DEX now and see this performance in action.


* Sparsity Technologies new headquarters at Parc UPC K2M

April 16th, 2012

Sparsity Technologies announces its new offices opening this April at the K2M (Knowledge to Market) building.

UPC Park was conceived with the mission of driving socioeconomic activity between UPC, public administration and companies, in order to promote research, innovation and the transfer of technological progress and results.

You can now find us on floor 0 (hall level), office 001a.

APTE also shares the news here: http://www.apte.org/es/noticia-parque529.cfm


* Graph Database Use Case: Bibliographic exploration

September 19th, 2011

Bibliographic exploration is an interesting use case for graph databases. It arises from the need to query huge bibliographic resources to obtain relevant information for researchers.

There are many questions that researchers try to ask of bibliographic resources, but the vast amount of heterogeneous information stored in them makes it difficult to obtain good and fast answers.

Articles, their authors and the keywords that best describe those articles are stored in bibliographic resources. This type of information is naturally linked: for instance, authors are linked with other authors by the articles they have written together, and at the same time articles may be connected with the keywords that are most relevant to them.

Graph databases are a good solution for storing huge amounts of strongly connected information. They store information the same way it is naturally connected; therefore answers can be retrieved directly, without having to join all the data as happens in traditional SQL databases.

The following figure is an example of how a Bibliographic resource may be stored in a Graph Database.

Capture from Bibex online demo

We can see in the illustration that authors are nodes in the graph, and they are connected by their collaboration in papers (edge). With a click on the edge of the graph you can obtain all the articles written together by both authors.

Capture from Bibex online demo

This type of query takes seconds to return a result in a graph database and could be relevant to new researchers, such as PhD students, or to researchers entering a new area who want to investigate authors, the papers they have written, who they have collaborated with and on what topics.

If you want to play with this type of query in a graph, visit the free Bibex social network demo. Bibex is able to process large quantities of data because it is powered by the DEX graph database. The online demo stores the information from DBLP, a well-known bibliographic resource, but it could use any other bibliographic source, or even combine several.

Another interesting aspect of storing bibliographic information in graph databases is the use of the citation metric. An article or an author can be considered to be of quality depending on both the number and the acknowledgment (quality) of the citations. Again, graph databases are well suited to working with this metric, since computing it only requires retrieving the neighbors of a given “article” node through edges of type “cite”:

// Once the DB is open and "graph" is available
int article = graph.findType("article");
int title = graph.findAttribute(article, "title");
long www = graph.findObject(title, new Value("The World-Wide Web."));
int cite = graph.findType("cite");
// Ingoing "cite" edges: the articles that cite this one
Objects citations = graph.neighbors(www, cite, EdgesDirection.Ingoing);
long articleQualityValue = citations.count();
citations.close();
// You should close the DB here
Example of code using DEX Graph Database Java API

Using citations we could answer questions like “Who is an authority in a certain topic?” or “Who is the most suitable reviewer for a certain paper?” The possibility to answer those new complex queries is what makes graph databases an excellent use case for bibliographic exploration.
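As a rough sketch of how a question like “Who is an authority?” could be answered from citation counts, the plain Java example below sums incoming citations per author. It does not use the DEX API, and everything in it is hypothetical: the class name, the sample articles and authors, and the choice of a simple citation sum as the authority score.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: in-memory maps stand in for "article", "author"
// and "cite" data; the scoring rule is an assumption.
final class CitationAuthoritySketch {

    // Hypothetical data: article -> author.
    static final Map<String, String> AUTHOR_OF = Map.of(
        "paperA", "alice",
        "paperB", "bob",
        "paperC", "alice",
        "paperD", "carol");

    // Hypothetical data: citing article -> cited articles.
    static final Map<String, List<String>> CITES = Map.of(
        "paperB", List.of("paperA"),
        "paperC", List.of("paperA", "paperB"),
        "paperD", List.of("paperA", "paperC"));

    // Number of incoming "cite" edges per article.
    static Map<String, Integer> citationCounts() {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> cited : CITES.values()) {
            for (String article : cited) {
                counts.merge(article, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sums the citation counts of each author's articles and
    // returns the author with the highest total.
    static String topAuthority() {
        Map<String, Integer> perAuthor = new HashMap<>();
        citationCounts().forEach((article, n) ->
            perAuthor.merge(AUTHOR_OF.get(article), n, Integer::sum));
        return perAuthor.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println("Citations per article: " + citationCounts());
        System.out.println("Top authority: " + topAuthority());
    }
}
```

A real metric would probably also weigh the quality of each citing article, as mentioned above; this sketch only counts citations.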

Let’s conclude with the big pros of using Graph Databases for Bibliographic exploration:

Data sources with bibliographic information are huge and strongly connected. Graph Databases can store billions of objects and are specially created to store linked information.
Bibliographic exploration is more interesting if it merges as many sources as possible. Graph Databases can store data with heterogeneous schemas, like bibliographic repositories, publishers, patents, or any other source of information.
Researchers need answers as quickly as possible, so that they can focus their efforts on their main research topic. Graph Databases can query connected data in a few seconds, even for complex queries.
New complex questions can be easily answered thanks to the ease with which graph databases navigate linked information.


* NEW Bibex demo

September 13th, 2011

We would like to announce the release of a complete free demo for the Bibex social network query.

Click here to launch Bibex demo

Bibex resolves multiple queries that help retrieve highly relevant information for researchers with very short response times. Bibex uses the DEX graph database to resolve questions like “who is the most important authority in a certain subject?” in a few seconds. Moreover, as Bibex results are shown as a network, it is easy to jump between articles, keywords and authors while navigating the answer. Read more information about Bibex here.

Bibex demo is able to resolve the “Social network” query available in Bibex.

You will be able to search for the social network of any author*, retrieve their curriculum, relationships & statistics. In addition you can jump to the publication source of each article with just one click. Check how Bibex resolves this query in our Bibex demo here.

The Bibex social network has an intuitive web interface. Once you click on this link, a search box will be shown. There you can search for any author name you can think of*. For instance, in the following example we search for Tim Berners-Lee. The search box has an autocomplete facility to help you find the exact author faster.

Once the search is performed, if there’s only one possible answer to your query, the author’s curriculum and social network are shown right away. If there are multiple answers, a list of possible authors appears on the left in alphabetical order, with the first author’s social network already loaded on the right side of the screen.

In that second case you must click on the name you were actually searching for to load the social network of the author.

The author’s curriculum on the left contains a list of all their publications. They can be sorted by date or alphabetically, and you can search inside the list as well. Selecting a publication and clicking on the “Go to URL” icon shown in the following picture takes you to the original source of the publication.

The author’s collaborations on the left contain a list of all the authors with whom they have co-written a publication. It is interesting to sort this list by number of collaborations, which shows who has the strongest relationships and may therefore also be of interest to you.

Finally, at the bottom of the left side there are statistics on the author’s productivity over time.

Another important part of the results is shown on the right side of the screen. The social network of the author can be navigated to discover all the relationships.

Double-clicking on another author’s name takes you to that author’s information and social network. In addition, clicking the edge between two authors reveals a list of the publications written collaboratively by both authors.

We hope you enjoy Bibex. Feel free to play and experience how smooth and quick the answers are.

*Bibex demo searches in the DBLP bibliographic database. The DBLP used for Bibex contains 999,053 authors and 2,740,244 articles.


* DEX Graph Database version 4.2 goes .NET

July 15th, 2011

Native .NET programming is now possible with the highest-performance graph database.

Now you can have all the scalability and performance of DEX graph database with a dedicated Microsoft .NET API in your secure and robust professional MS environment.

The new DEX release comes with a completely renovated Java API and a brand new .NET API for programmers of .NET languages.

DEX does not forget compatibility with applications that use previous versions of its API. That is why we not only provide an easy migration guide but also offer an API for DEX 4.2 that is compatible with previous versions. Nevertheless, we recommend that all DEX programmers start migrating to the new Java API: the process is quick and fairly painless, and it will guarantee the continuity of their applications.


* Daurum 5.0, services and online demo

June 23rd, 2011

We are proud to announce Daurum 5.0, the new version of our software.

Daurum 5.0, the deduplication and integration tool, is able to integrate data from multiple sources, linking millions of records and identifying potential duplicate records. A remarkable characteristic of this new version is its friendly web interface, which allows concurrent access by multiple users. With Daurum 5.0, users may have different roles, sharing their data and, most importantly, dividing tasks between them. Daurum 5.0 also allows users to create their own filters for data cleansing, using an integrated editor with preview options, or to choose one from the list of tested default filters.

Daurum 5.0 comes with exciting new possibilities. Now you can test our deduplication technology with an online demo, request our services, or buy a license of the software.

The Daurum online service demo is a free service we offer through our website to give a first glimpse of the results that can be achieved with Daurum 5.0 services or software. Test one typical scenario for a client database and see how duplicates are detected. Please read through our description of the demo, and the comparison table showing what is available in Daurum but not in the demo service.

We hope that the online service demo becomes a useful evaluation tool in order to understand how deduplication works.

We offer Daurum 5.0 in two brand new modalities:

If you need to clean your database of duplicate records, or if you have several databases to be merged while assuring the best quality for the resulting database, consider hiring Daurum services.

Daurum services prices are tailored to your database size and cleansing needs. This is the best option if you need one-shot deduplication.

If you want your own copy of Daurum 5.0 for perpetual use, we would recommend one of our licenses.

With a Daurum license, you will be able to deduplicate or merge databases as many times as you need. Create your own specific filters for the detection of duplicates, add your dictionaries and become the best expert in your database.

The license also includes professional installation support.

To see a little more about the potential of Daurum 5.0 services & software, with its complete functionalities, take a look at the following images:


* DAMA-UPC project RECOMANA using DEX winner of BDigital congress prizes

June 9th, 2011

We would like to congratulate DAMA-UPC, which has just been awarded the BDigital congress prize in the category of Research Institutes and Universities.

RECOMANA* uses the potential of graph databases, with the high performance and billions-of-objects storage offered by DEX, to analyze different types of information, from media content to school programs, in order to give education professionals a tool to obtain better and more specific information for teaching purposes.

The quantity of audiovisual content generated by media such as TV broadcasters and radio stations is huge and forms a significant body of knowledge that spans many areas. However, this content is not available to education professionals in an efficient way. In other words, teachers have to devote a significant amount of their time to obtaining videos or fragments of videos useful for their teaching purposes, without knowing whether they are suitable for their objective.

The final objective is for society to benefit from the RECOMANA project, allowing teachers and students to access digital information to improve the contents of the topics in primary and secondary education.

BDigital congress news: http://www.bdigitalglobalcongress.com/lang-en/2011/zyncro-invoxcontact-recomana-and-webdom-enlight-winners-of-the-prizes-awarded-at-the-bdigital-global-congress/

RECOMANA project video (in Catalan): http://www.youtube.com/user/BarcelonaDigital#p/c/20/jar83ywfEf0

*RECOMANA in Catalan means recommend, suggest


* NEW Dex technical support area

June 2nd, 2011

DEX has a new collaborative support area to share questions, suggestions or any request about DEX. We encourage your participation in the new online forum called DEX technical support area. We would like the DEX technical support area to be the meeting point for all DEX users. Share your questions with our technical staff and use your expertise in DEX to help other users.

To enter the area, go to the main page and enter the technical area using the button shown in the following images:

You can also use the DEX subsection Request info to access the technical area.

Once you are in the Technical Area, a list of the available forums will be shown. Click on the DEX Technical Support Forum to see the list of current topics.

If you are new to DEX, we highly recommend checking the complete, always up-to-date list of frequently asked questions (FAQ), which you will find at the top of the forum topics list. New questions from your requests in the forum will be included in the FAQ. Do not forget to check the list, you may find your answers there!


* Sparsity Technologies website now in Spanish & Catalan

May 26th, 2011

Now you can choose to read our website in Catalan or Spanish. All contents are accessible in the new multi-language platform. To change your language, just select it at the top of the page:

For now we are keeping the blog in English only. Please feel free to share your impressions or suggestions with us!

