# Thread: The Wikipedia game

1. ## The Wikipedia game

find an article on Wikipedia that doesn't lead to Adolf Hitler within 8 links

Post failures as well as successes

example:
http://en.wikipedia.org/wiki/Chertsey_Lock
http://en.wikipedia.org/wiki/City_of_London
http://en.wikipedia.org/wiki/World_War_II

2. Its called the Matt Damon effect, all things link to Matt Damon (or Adolf Hitler) within 6 links.

Hitler is too easy, as anything you can connect to europe is automatically connected to hitler within 2 links (ww2, hitler) and by extension you can connect europe via continent, which connects to any other country which connects to anythign that hapened or exists in that country, so 6 links to hitler max.

Rice Pudding
Rice
China
WW2
Hitler

Matt Damon
Movies
Television
Hitler

3. Originally Posted by abachler
Hitler is too easy, as anything you can connect to europe is automatically connected to hitler within 2 links
Doesn't that mean that "Hitler" is difficult, since the aim of the game is to "find an article on Wikipedia that doesn't lead to Adolf Hitler within 8 links"?

4. Originally Posted by laserlight
Doesn't that mean that "Hitler" is difficult, since the aim of the game is to "find an article on Wikipedia that doesn't lead to Adolf Hitler within 8 links"?
Perhaps he meant that too easy to find = too hard not to find.

5. This problem isn't interesting enough. What I want is for somebody to describe an algorithm which tries to find the shortest path between any pair of articles. A solution to that automatically solves the original problem, and might tell us something informative about Wikipedia.

6. Crap! I hate these NP-Complete kind of problems

7. What I want is for somebody to describe an algorithm which tries to find the shortest path between any pair of articles.
Wikipedia is a graph G where each article is a node and each internal link constitutes a directed edge to another node. Dijkstra's algorithm finds the shortest paths from a given node to every other node in the graph. Thus, finding the shortest path for any pair of nodes is fairly simple: for each node n in G, call Dijkstra(G, n).

As Wikipedia isn't exactly sparse, Disjkstra has a running time of O(|V|^2 + |E|), where V is the set of nodes/vertices/articles and E is the set of edges. Note that |E| is smaller than |V|^2 in this case (|E|==|V|^2 if every article contains a link to every other article), so we can safely omit the |E| in our analysis. As we're calling Dijkstra as many times as there are nodes in G (i.e. |V| times), the overall complexity is O(|V|^3).

This is far from being NP-complete, although I wouldn't want to compute it for the approximately 2.8 million English Wikipedia articles.

Greets,
Philip

8. Shortest path problems aren't NP-complete. Wikipedia has 2.8 million articles, and as a rough guess, 50 links per article on average. That's quite doable. For finding the shortest path between any single given pair of articles, you use Dijkstra's Shortest Path algorithm, which for a non-sparse graph (and Wikipedia definitely isn't sparse) has a worst case complexity of O(|V|² + |E|) == O(|V|²).

Edit: Meh. Snafuist beat me to it. Anyway, 2.8 million² == 7,84x10^12 (7.84 trillion) operations. Heh. Will take a while, I suppose.

9. Originally Posted by CornedBee
Shortest path problems aren't NP-complete. Wikipedia has 2.8 million articles, and as a rough guess, 50 links per article on average. That's quite doable. For finding the shortest path between any single given pair of articles, you use Dijkstra's Shortest Path algorithm, which for a non-sparse graph (and Wikipedia definitely isn't sparse) has a worst case complexity of O(|V|² + |E|) == O(|V|²).

Edit: Meh. Snafuist beat me to it. Anyway, 2.8 million² == 7,84x10^12 (7.84 trillion) operations. Heh. Will take a while, I suppose.
Well, we all know algorithms to find shortest paths. The problem is that we don't have access to the complete graph structure of Wikipedia. We can only go to a page, see the links there, and click on them.

So, working on the assumption that you do NOT know the edges from a node until you actually visit that node, what kind of heuristic could be used to navigate the graph to any arbitrary node? What method would a human use?

10. Originally Posted by brewbuck
Well, we all know algorithms to find shortest paths. The problem is that we don't have access to the complete graph structure of Wikipedia. We can only go to a page, see the links there, and click on them.

So, working on the assumption that you do NOT know the edges from a node until you actually visit that node, what kind of heuristic could be used to navigate the graph to any arbitrary node? What method would a human use?
Is it a requirement to know the entire graph in order to use Dijkstra's algorithm? I had the impression that it can work even if the information about the next node's next nodes is not available until the next node is visited.

11. Originally Posted by brewbuck
So, working on the assumption that you do NOT know the edges from a node until you actually visit that node
...which you don't have to know in order to use Dijkstra (you probably meant what you said plus "do not know that a node exists until you've been there")...

, what kind of heuristic could be used to navigate the graph to any arbitrary node?
Brute force, i.e. BFS or DFS.

What do you want to achieve? Start at a random node and minimize the hops to "Hitler" without full knowledge about the graph?

There are two possibilities:

1)
either the distance between any two nodes is < infinity. Then applying Dijkstra to one node reveals all other nodes (and you can call Dijkstra recursively on those to get all shortest paths between any two nodes)

2)
or the distance between two nodes equals infinity, i.e. the graph consists of subgraphs which are not connected with each other. Then the problem of finding all shortest paths can't be solved without knowing all nodes in advance, but at least any node outside of the "Hitler" subgraph satisfies the conditions from the OP

Greets,
Philip

12. Downloading and parsing the entire Wikipedia is linear, though. Sloooooooooow linear, probably.

13. Another approach to solve the original problem: Wikipedia features backlinks (i.e. for every article, it's possible to find out the articles that link to it). So going to the "Hitler" page, viewing its backlinks, visiting all these pages, viewing their back links and so on 8 times will reveal all articles that refer to "Hitler" in 8 hops. Every other article doesn't.

Greets,
Philip

14. As Wikipedia isn't exactly sparse
I'd argue that Wikipedia is very sparse! According to one site, there are "2,301,486 articles with 55,550,003 links between them". There are a maximum of 5,296,835,506,710 links with that many articles, or you have .001048% of the total possible links. (Ask yourself: Does any given article link to most other articles? No.) Given that number though, it's pretty amazing how quickly you can prove Godwin right.

Additionally, why not use BFS? BFS should find the shortest path, and in better running time too! Although you'll probably need quite a bit of mem memory for the queue & the visited list. (although you'll still need memory with Dijkstra...)

15. I don't think BFS or DFS would be good choices for such a problem. It's better to use a smart search when possible.

(PS My NP-Complete joke was just that...a joke).

Anyways, the best choice for the job would be the "Learning Real-time A* Algorithm" (not regular A* which is similar to Dijkstra's).

Learning Real-time A* works under the assumption that the entire graph isn't really "known"...and it finds the same solution as A* or Dijkstra's would in a few more iterations of the algorithm.

Popular pages Recent additions