
# Penn CIS 400 - Applications of Small World Graphs in Online Social Networks


Applications of Small-World Graphs in Online Social Networks

Daniel Khanna ([email protected])

This project explored the applicability of the small-world graph characteristic to an online social network. The theory of small degrees of separation between people has been thoroughly studied in offline social networks; as online social networks grow more and more massive, it is interesting to see whether this quality translates. My project focused on gathering information from an existing social network, LiveJournal, and running statistical analysis on the structure and connectivity of the resulting graph. Using this data, I created a decentralized search algorithm to test whether targets in the graph could be found quickly and with high probability using only local information.

Most of the related work on small-world graph analysis is relatively conjectural. A small-world graph is defined as one that has a small diameter and in which short paths from a source to a target can be found with high probability using only local information. Most research has revolved around finding an underlying structure for this type of graph and then demonstrating that the small-world quality is inherent in it. The best-known model, and the one of most use to my project, is the Kleinberg lattice structure. The lattice structure is fairly simple: it starts with a d-dimensional grid of vertices. Beyond their rigidly created short-range contacts, these vertices have a random distribution of long-range contacts, with the property that the probability a node has a given long-range contact is inversely proportional to the distance between the two vertices. Routing in this model is generally optimized when the expected number of long-range contacts is one for each distance from the node.
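The lattice construction described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name, the two-dimensional case, and the exponent parameter `r` are assumptions for the sketch (long-range contacts are drawn with probability proportional to `dist^-r`):

```python
import random

def build_kleinberg_grid(n, r=2, seed=0):
    """Build an n x n Kleinberg grid: each node keeps its short-range
    lattice neighbors and draws one long-range contact with probability
    proportional to dist(u, v)^-r (lattice/Manhattan distance)."""
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(n) for j in range(n)]
    node_set = set(nodes)

    def dist(u, v):
        # Lattice (Manhattan) distance between two grid points.
        return abs(u[0] - v[0]) + abs(u[1] - v[1])

    graph = {}
    for u in nodes:
        # Rigid short-range contacts: the immediate grid neighbors.
        short = [v for v in ((u[0] + 1, u[1]), (u[0] - 1, u[1]),
                             (u[0], u[1] + 1), (u[0], u[1] - 1))
                 if v in node_set]
        # One long-range contact, weighted inversely by distance^r.
        others = [v for v in nodes if v != u]
        weights = [dist(u, v) ** -r for v in others]
        long_range = rng.choices(others, weights=weights, k=1)[0]
        graph[u] = short + [long_range]
    return graph
```

The O(n^4) weight computation is fine for small demonstration grids; a real construction would sample long-range offsets directly.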
The classic algorithm for this model is the greedy approach, which simply compares each contact's distance from the target and extends the path with the closest vertex.

On the social network side, most of the research revolves around geographic routing. As is readily apparent in offline social networks, online social networks have been shown to exhibit a strong correlation between friendship and geographic proximity. Work such as "Geographic Routing in Social Networks" (see Bibliography) experimented with geographic routing. Although its findings showed a strong link between geography and friendship, its rank-based friendship model more accurately described the social network, in this case LiveJournal. Rank-based friendship estimates the number of people living closer to the target than the current node does, and uses this to determine a probable value for the path length between them. Although an interesting finding about social network connectivity, this result does not enable any more efficient routing.

The work done to this point does not utilize all the information a user makes available, much of which may suggest better paths to explore when searching for a target. My work went beyond conjecturing and testing structural qualities of online social networks: it took a sample of the graph and performed data analysis through graph traversal algorithms to give definitive statistical conclusions about the network.

The first portion of my project was a regular expression matcher, which would be crucial to page scraping. Although not as extensive as a standard regular expression engine (such as Perl's), it provided the functionality necessary to recognize patterns in HTML source code easily and efficiently. Since I was using LiveJournal as my data source, I needed to study the patterns in its HTML to work out how to extract the necessary information.
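The greedy routing rule described at the start of this section can be sketched as below. This is an illustrative version under assumed names: `graph` maps each vertex to its contacts, and `dist` is the lattice distance function available to every node as local information:

```python
def greedy_route(graph, dist, source, target, max_steps=1000):
    """Greedy decentralized search: at each step, forward to the contact
    closest to the target, using only locally available information."""
    path = [source]
    current = source
    while current != target and len(path) <= max_steps:
        # Pick the contact nearest the target by lattice distance.
        best = min(graph[current], key=lambda v: dist(v, target))
        if dist(best, target) >= dist(current, target):
            break  # no contact makes progress; greedy search is stuck
        path.append(best)
        current = best
    return path
```

On a pure grid (no long-range contacts) this simply walks toward the target one step at a time; long-range contacts are what let it skip across the lattice.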
Once this was completed, the next step was data extraction. I performed a breadth-first search of the LiveJournal graph, starting from a user with a relatively high number of initial contacts. The search continued until it had seen 200,000 user pages, which constituted about 1.5% of the entire LiveJournal user base. All of this data was written to a single file, since file input/output is significantly faster than opening a network input stream to retrieve information. With all the information on hand, I was able to create a graph in which users are vertices and listing another user as a friend forms an edge.

Most of the analysis consisted of breadth-first searches from various vertices, recording information along the way. I ran graph traversal programs that analyzed different aspects of the graph in relation to shared interests, listed communities, location, and listed schools. I also performed tests on the graph to discover hubs: vertices that appear with high frequency among shortest paths between other vertices. The information that resulted enabled conclusions to be drawn about the overall connectivity of the graph.

The final portion of the project was creating a decentralized search algorithm to find a target from a source in the graph. The path computed by the algorithm was completely dependent on the information gleaned from the previous analysis of the graph. From there it was essentially a matter of modifying the algorithm to reach the target more quickly, based on conclusions about expanding from a relatively small sample of the LiveJournal site to a graph containing all of this information.

The mechanisms by which I implemented these programs were fairly straightforward. The regular expression matcher was implemented by breaking down each regular expression and recursively finding matches on its sub-expressions.
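The breadth-first traversal underlying both the crawl and the later analysis can be sketched as below; this is a generic illustration (the function name and adjacency-dict representation are assumptions), recording each reachable user's hop distance from a starting vertex:

```python
from collections import deque

def bfs_distances(graph, source):
    """Breadth-first search from source over a friendship graph given as
    {user: [friends...]}, returning hop distances for reachable users."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:          # first visit = shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Running this from many starting vertices and aggregating the results yields diameter estimates and, by tracking which vertices shortest paths pass through, candidate hubs.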
Eventually the regular expression would be broken down into individual characters, constituting a match, and these matches were relayed back up the tree-like data structure to find overall matches. Most of the regular expressions I was looking for contained Kleene star closures. For example, to find an interest, the regular expression looked like this: 'http://www.livejournal.com/interests.bml?int=#', where the '#' symbol stands for any character. As it is impractical to search at every point for a match consisting of the rest of the HTML file, I limited the size of any regular expression match in my program.

The LiveJournal HTML patterns were crucial to the correctness of the data extraction. Although I was only extracting from
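A simplified matcher in this style might look as follows. This is a sketch, not the project's implementation: it supports only literal characters plus the '#' wildcard, which here is interpreted as a bounded run of arbitrary characters (the Kleene-closure behavior with the size limit the text describes); the function names and the `max_wild` cap are illustrative assumptions:

```python
def match_at(pattern, text, i, max_wild=80):
    """Return the end index of a match of pattern starting at text[i],
    or -1 if none. '#' matches any run of 0..max_wild characters, so a
    wildcard can never swallow the rest of the HTML file."""
    if not pattern:
        return i                      # pattern exhausted: match complete
    if pattern[0] == '#':
        # Try successively shorter wildcard spans (longest first), then
        # recursively match the rest of the pattern after the span.
        for span in range(min(max_wild, len(text) - i), -1, -1):
            end = match_at(pattern[1:], text, i + span, max_wild)
            if end != -1:
                return end
        return -1
    if i < len(text) and text[i] == pattern[0]:
        return match_at(pattern[1:], text, i + 1, max_wild)
    return -1

def find_all(pattern, text, max_wild=80):
    """Scan the text, collecting every substring that matches pattern."""
    hits = []
    for i in range(len(text) + 1):
        end = match_at(pattern, text, i, max_wild)
        if end != -1:
            hits.append(text[i:end])
    return hits
```

Anchoring the wildcard between literal context (here a closing quote) is what turns a pattern like the interests URL into an extractor for the interest name itself.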
