Open data release to accompany
Conflict and Computation on Wikipedia: A Finite-State Machine Analysis of Editor Interactions
Simon DeDeo, Future Internet 2016, 8(x) [in press]

A dataset for the analysis of conflict and cooperation on Wikipedia, as well as a resource for the theory of symbolic time series analysis.

The release includes:
• the time series of revert (R) and non-revert (C) actions in the edit histories of the 62 most-edited pages of Wikipedia;
• best-fit hidden Markov models for each page, as found by SFIHMM, in SFIHMM-readable format;
• data on each edit, including user name, date of edit, and page hash.

dedeo_wikipedia_HMM.zip

== README ==
Release 0.1, 6 July 2016
http://bit.ly/wiki_hmm

Data Release to accompany
Conflict and Computation on Wikipedia: A Finite-State Machine Analysis of Editor Interactions
Simon DeDeo, Future Internet 2016, 8(x) [in press]
http://arxiv.org/abs/1512.04177

This data release contains (1) this README; (2) a metadata file, metadata.tsv; and (3) 62 × 3 files, three for each of the 62 pages in the data: a coarse-grained edit history, more fine-grained data on each edit, and the best-fit HMM found by SFIHMM.
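
The three per-page files can be located programmatically. The following is a minimal Python sketch; the naming pattern is inferred only from the "God" example in this README, the helper names are hypothetical, and a flat directory holding all files is an assumption about the unzipped layout. Because the preferred number of states differs from page to page, the HMM file is found by pattern rather than by a fixed name.

```python
import glob
from pathlib import Path


def timeseries_name(page):
    """Name of the coarse-grained C/R time series file for a page."""
    return f"{page}_timeseries.dat"


def edit_name(page):
    """Name of the per-edit detail file for a page."""
    return f"{page}_edit.tsv"


def hmm_name(page, n_states):
    """Name of the HMM file for a page with a known number of states."""
    return f"{page}_hmm_{n_states}states"


def find_hmm(data_dir, page):
    """List HMM files for a page when the number of states is not known.

    ASSUMPTION: all files sit directly inside data_dir.
    glob.escape guards against page names containing *, ? or [.
    """
    pattern = glob.escape(page) + "_hmm_*states"
    return sorted(Path(data_dir).glob(pattern))
```

For the page "God", `hmm_name("God", 8)` gives `God_hmm_8states`, the file name used in the example that follows.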

Example:

From metadata.tsv, we see that the Wikipedia page “God”, accessible at http://en.wikipedia.org/wiki/God, was created on 28 October 2001 at 6 o’clock in the morning UTC. Our data records a total of 10731 edits for this page.

God_timeseries.dat is an (SFIHMM-readable) file of the page’s C/R symbolic time series.
God_hmm_8states is the best-fit HMM for this page; using the Akaike information criterion (AIC), we prefer a model with 8 states.
Both God_timeseries.dat and God_hmm_8states can be read by SFIHMM for further analysis (Viterbi reconstruction, etc.); see
http://bit.ly/sfihmm
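
To make the AIC-based choice of state count concrete, here is a minimal Python sketch. It uses the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood. The parameter count below (transitions, emissions, and initial distribution for a two-symbol C/R alphabet) is an assumption and may differ from the way SFIHMM counts parameters; the function names are hypothetical.

```python
def hmm_free_parameters(n_states, n_symbols=2):
    """Free parameters of a discrete HMM.

    ASSUMPTION: counts transition rows, emission rows, and the initial
    state distribution, each row losing one degree of freedom because
    probabilities sum to 1. SFIHMM's own convention may differ.
    """
    transitions = n_states * (n_states - 1)
    emissions = n_states * (n_symbols - 1)
    initial = n_states - 1
    return transitions + emissions + initial


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(log_likelihoods, n_symbols=2):
    """Pick the state count with the lowest AIC.

    log_likelihoods maps number of states -> maximized log-likelihood.
    """
    return min(
        log_likelihoods,
        key=lambda n: aic(log_likelihoods[n], hmm_free_parameters(n, n_symbols)),
    )
```

Under this counting, an 8-state model over the symbols C and R has 71 free parameters, so it is preferred only when its gain in log-likelihood outweighs that penalty.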

God_edit.tsv contains more detailed data on each edit in the God_timeseries.dat file. For example, edit 34 was made on 14 December 2001 by user Larry Sanger; it has hash e4f88db3d6dd1dcd2774295fcf83c06ade3f37eb, and was a revert.
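
The page hash suggests how a C/R series can be derived from the per-edit data. The following is a minimal Python sketch under a stated assumption: an edit is labeled a revert (R) when its page hash equals the hash of some earlier version of the page, and a non-revert (C) otherwise. This README does not spell out the rule, so consult the paper for the definition actually used; the function name and the toy hashes are hypothetical.

```python
def label_reverts(hashes):
    """Turn a chronological list of page hashes into a C/R string.

    ASSUMPTION: an edit is a revert if it restores the page to any
    previously seen state, detected by an identical page hash.
    """
    seen = set()
    labels = []
    for page_hash in hashes:
        labels.append("R" if page_hash in seen else "C")
        seen.add(page_hash)
    return "".join(labels)


# Toy example with made-up hashes: the third edit restores the first version.
print(label_reverts(["a1", "b2", "a1", "c3"]))
```

A series produced this way can be compared against the revert labels in the edit file to check whether the assumed rule matches the released data.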

I hope this data is of help, either as a new way to explore Wikipedia, or as a source of real-world data for time series analysis. If you use this data in some way, I’d be grateful if you cited the original Future Internet paper (in press; available at http://arxiv.org/abs/1512.04177). For LaTeX users, see BibTeX below.

=== BibTeX ===

@Article{dedeo16,
AUTHOR = {DeDeo, Simon},
TITLE = {Conflict and Computation on {W}ikipedia: A Finite-State Machine Analysis of Editor Interactions},
JOURNAL = {Future Internet},
VOLUME = {8},
YEAR = {2016},
NUMBER = {3},
PAGES = {31},
URL = {http://www.mdpi.com/1999-5903/8/3/31},
ISSN = {1999-5903},
DOI = {10.3390/fi8030031}
}

Pages: George_W._Bush, United_States, Wikipedia, Michael_Jackson, Catholic_Church, Barack_Obama, World_War_II, Global_warming, 2006_Lebanon_War, Islam, Canada, Eminem, September_11_attacks, Paul_McCartney, Israel, Hurricane_Katrina, Xbox_360, Pink_Floyd, Iraq_War, Blackout_(Britney_Spears_album), Turkey, Super_Smash_Bros._Brawl, World_War_I, Gaza_War, Lost_(TV_series), Blink-182, Scientology, John_Kerry, Heroes_(TV_series), Australia, China, Bob_Dylan, Neighbours, The_Holocaust, Atheism, Hilary_Duff, Mexico, The_Dark_Knight_(film), France, John_F._Kennedy, Lindsay_Lohan, Girls'_Generation, Argentina, Virginia_Tech_massacre, RMS_Titanic, Russo-Georgian_War, Homosexuality, Circumcision, Hillary_Rodham_Clinton, Star_Trek, Shakira, Sweden, New_Zealand, Paris_Hilton, Wizards_of_Waverly_Place, Genghis_Khan, Cuba, Linux, Che_Guevara, Golf, IPhone, God.