# The Wikipedia game

Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last
• 03-23-2009
ಠ_ಠ
The Wikipedia game

Post failures as well as successes

example:
http://en.wikipedia.org/wiki/Chertsey_Lock
http://en.wikipedia.org/wiki/City_of_London
http://en.wikipedia.org/wiki/World_War_II
• 03-23-2009
abachler
Its called the Matt Damon effect, all things link to Matt Damon (or Adolf Hitler) within 6 links.

Hitler is too easy, as anything you can connect to europe is automatically connected to hitler within 2 links (ww2, hitler) and by extension you can connect europe via continent, which connects to any other country which connects to anythign that hapened or exists in that country, so 6 links to hitler max.

Rice Pudding
Rice
China
WW2
Hitler

Matt Damon
Movies
Television
Hitler
• 03-23-2009
laserlight
Quote:

Originally Posted by abachler
Hitler is too easy, as anything you can connect to europe is automatically connected to hitler within 2 links

Doesn't that mean that "Hitler" is difficult, since the aim of the game is to "find an article on Wikipedia that doesn't lead to Adolf Hitler within 8 links"?
• 03-23-2009
Desolation
Quote:

Originally Posted by laserlight
Doesn't that mean that "Hitler" is difficult, since the aim of the game is to "find an article on Wikipedia that doesn't lead to Adolf Hitler within 8 links"?

Perhaps he meant that too easy to find = too hard not to find.
• 03-24-2009
brewbuck
This problem isn't interesting enough. What I want is for somebody to describe an algorithm which tries to find the shortest path between any pair of articles. A solution to that automatically solves the original problem, and might tell us something informative about Wikipedia.
• 03-24-2009
DavidP
Crap! I hate these NP-Complete kind of problems ;)
• 03-24-2009
Snafuist
Quote:

What I want is for somebody to describe an algorithm which tries to find the shortest path between any pair of articles.
Wikipedia is a graph G where each article is a node and each internal link constitutes a directed edge to another node. Dijkstra's algorithm finds the shortest paths from a given node to every other node in the graph. Thus, finding the shortest path for any pair of nodes is fairly simple: for each node n in G, call Dijkstra(G, n).

As Wikipedia isn't exactly sparse, Disjkstra has a running time of O(|V|^2 + |E|), where V is the set of nodes/vertices/articles and E is the set of edges. Note that |E| is smaller than |V|^2 in this case (|E|==|V|^2 if every article contains a link to every other article), so we can safely omit the |E| in our analysis. As we're calling Dijkstra as many times as there are nodes in G (i.e. |V| times), the overall complexity is O(|V|^3).

This is far from being NP-complete, although I wouldn't want to compute it for the approximately 2.8 million English Wikipedia articles.

Greets,
Philip
• 03-24-2009
CornedBee
Shortest path problems aren't NP-complete. Wikipedia has 2.8 million articles, and as a rough guess, 50 links per article on average. That's quite doable. For finding the shortest path between any single given pair of articles, you use Dijkstra's Shortest Path algorithm, which for a non-sparse graph (and Wikipedia definitely isn't sparse) has a worst case complexity of O(|V|² + |E|) == O(|V|²).

Edit: Meh. Snafuist beat me to it. Anyway, 2.8 million² == 7,84x10^12 (7.84 trillion) operations. Heh. Will take a while, I suppose.
• 03-24-2009
brewbuck
Quote:

Originally Posted by CornedBee
Shortest path problems aren't NP-complete. Wikipedia has 2.8 million articles, and as a rough guess, 50 links per article on average. That's quite doable. For finding the shortest path between any single given pair of articles, you use Dijkstra's Shortest Path algorithm, which for a non-sparse graph (and Wikipedia definitely isn't sparse) has a worst case complexity of O(|V|² + |E|) == O(|V|²).

Edit: Meh. Snafuist beat me to it. Anyway, 2.8 million² == 7,84x10^12 (7.84 trillion) operations. Heh. Will take a while, I suppose.

Well, we all know algorithms to find shortest paths. The problem is that we don't have access to the complete graph structure of Wikipedia. We can only go to a page, see the links there, and click on them.

So, working on the assumption that you do NOT know the edges from a node until you actually visit that node, what kind of heuristic could be used to navigate the graph to any arbitrary node? What method would a human use?
• 03-24-2009
laserlight
Quote:

Originally Posted by brewbuck
Well, we all know algorithms to find shortest paths. The problem is that we don't have access to the complete graph structure of Wikipedia. We can only go to a page, see the links there, and click on them.

So, working on the assumption that you do NOT know the edges from a node until you actually visit that node, what kind of heuristic could be used to navigate the graph to any arbitrary node? What method would a human use?

Is it a requirement to know the entire graph in order to use Dijkstra's algorithm? I had the impression that it can work even if the information about the next node's next nodes is not available until the next node is visited.
• 03-24-2009
Snafuist
Quote:

Originally Posted by brewbuck
So, working on the assumption that you do NOT know the edges from a node until you actually visit that node

...which you don't have to know in order to use Dijkstra (you probably meant what you said plus "do not know that a node exists until you've been there")...

Quote:

, what kind of heuristic could be used to navigate the graph to any arbitrary node?
Brute force, i.e. BFS or DFS.

What do you want to achieve? Start at a random node and minimize the hops to "Hitler" without full knowledge about the graph?

There are two possibilities:

1)
either the distance between any two nodes is < infinity. Then applying Dijkstra to one node reveals all other nodes (and you can call Dijkstra recursively on those to get all shortest paths between any two nodes)

2)
or the distance between two nodes equals infinity, i.e. the graph consists of subgraphs which are not connected with each other. Then the problem of finding all shortest paths can't be solved without knowing all nodes in advance, but at least any node outside of the "Hitler" subgraph satisfies the conditions from the OP

Greets,
Philip
• 03-24-2009
CornedBee
• 03-24-2009
Snafuist
Another approach to solve the original problem: Wikipedia features backlinks (i.e. for every article, it's possible to find out the articles that link to it). So going to the "Hitler" page, viewing its backlinks, visiting all these pages, viewing their back links and so on 8 times will reveal all articles that refer to "Hitler" in 8 hops. Every other article doesn't.

Greets,
Philip
• 03-24-2009
Cactus_Hugger
Quote:

As Wikipedia isn't exactly sparse
I'd argue that Wikipedia is very sparse! According to one site, there are "2,301,486 articles with 55,550,003 links between them". There are a maximum of 5,296,835,506,710 links with that many articles, or you have .001048% of the total possible links. (Ask yourself: Does any given article link to most other articles? No.) Given that number though, it's pretty amazing how quickly you can prove Godwin right.

Additionally, why not use BFS? BFS should find the shortest path, and in better running time too! Although you'll probably need quite a bit of mem memory for the queue & the visited list. (although you'll still need memory with Dijkstra...)
• 03-24-2009
DavidP
I don't think BFS or DFS would be good choices for such a problem. It's better to use a smart search when possible.

(PS My NP-Complete joke was just that...a joke).

Anyways, the best choice for the job would be the "Learning Real-time A* Algorithm" (not regular A* which is similar to Dijkstra's).

Learning Real-time A* works under the assumption that the entire graph isn't really "known"...and it finds the same solution as A* or Dijkstra's would in a few more iterations of the algorithm.
Show 80 post(s) from this thread on one page
Page 1 of 2 12 Last