# PART I: Creating the Study's Datasets

# 0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip; easy_install works, too.) Consider installing inside a virtualenv rather than with sudo for a cleaner installation experience. I also recommend running the code in an IPython Notebook.
* sudo pip install --upgrade turicreate
* sudo pip install --upgrade repoze.lru
* sudo pip install --upgrade networkx
* sudo pip install --upgrade pymongo



Please download the KDD Cup 2016 data and the project files from our GitHub repository. Throughout this study, we use the constants defined in consts.py. Please set DATASETS_AMINER_DIR, DATASETS_BASE_DIR, and SFRAMES_BASE_DIR to local directories where the datasets can be downloaded and the project's SFrames can be saved.
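As an illustration, the three path constants could be set up along the following lines. This is a minimal sketch and not the repository's actual consts.py: only the three constant names come from the project, while the `build_paths` helper and the directory layout under the chosen root are hypothetical.

```python
from pathlib import Path


def build_paths(root):
    """Return the three directory constants under `root` (hypothetical layout)."""
    root = Path(root).expanduser()
    paths = {
        "DATASETS_BASE_DIR": root / "datasets",
        "DATASETS_AMINER_DIR": root / "datasets" / "aminer",
        "SFRAMES_BASE_DIR": root / "sframes",
    }
    # Create the directories up front so later download and save steps do not fail.
    for directory in paths.values():
        directory.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `build_paths` with a directory of your choice returns a dictionary whose values can then be copied into consts.py.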

**Note: Creating the following SFrames requires considerable computing power and can run for long periods.**

# 1. Creating the SFrames

In this study, we used the following datasets:
* [The Microsoft Academic KDD Cup 2016 dataset](https://kddcup2016.azurewebsites.net/Data) - The Microsoft Academic KDD Cup Graph dataset (referred to as the MAG 2016 dataset) contains data on over 126 million papers. The main advantage of this dataset is that it has undergone several preprocessing iterations of author entity matching (each author is identified by an ID) and paper deduplication. Additionally, the dataset maps papers to their fields of study and includes the hierarchical structure and connections between the various fields of study.

* [AMiner dataset](https://aminer.org/open-academic-graph) - The AMiner dataset contains information on over 154 million papers collected by the AMiner team. The dataset contains papers' abstracts, ISSNs, ISBNs, and details on each paper.

* [SJR dataset](http://www.scimagojr.com/journalrank.php) - The SCImago Journal Rank open dataset (referred to as the SJR dataset) contains journal- and country-specific metric data starting from 1999. In this study, we used the SJR dataset to better understand how various journal metrics have changed over time.


## 1.1 The Microsoft Academic KDD Cup Dataset

The first step is to convert the dataset text files into SFrame objects. The conversion code is located under the SFrames creator directory and can be run as follows:

In [1]:
from create_mag_sframes import *
from configs import *
create_all_sframes() # running this can take considerable time

The code above creates a set of SFrames containing all of the dataset's data, including authors’ papers, keywords, fields of study, and more. It also constructs the Extended Papers SFrame, which contains various metadata on each paper in the dataset.
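As a rough illustration of what such a conversion step involves, the sketch below parses a headerless, tab-separated text file into named columns using only the standard library. It is not the code in create_mag_sframes: we assume here that the raw files are tab-separated with no header row, the `read_headerless_tsv` helper is hypothetical, the sample line is made up for illustration, and only the first five column names of the Extended Papers SFrame are used.

```python
import csv
import io

# First five column names, as they appear in the Extended Papers SFrame.
PAPER_COLUMNS = [
    "Paper ID",
    "Original paper title",
    "Normalized paper title",
    "Paper publish year",
    "Paper publish date",
]


def read_headerless_tsv(handle, columns):
    """Parse a tab-separated file with no header row into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Pad short rows so missing trailing fields become empty strings.
        fields = fields + [""] * (len(columns) - len(fields))
        rows.append(dict(zip(columns, fields)))
    return rows


# A tiny in-memory stand-in for a raw text file (illustrative sample line).
sample = io.StringIO("7DEADC9A\tA Set of Test Cases\ta set of test cases\t2008\t\n")
papers = read_headerless_tsv(sample, PAPER_COLUMNS)
print(papers[0]["Paper publish year"])
```

In the actual pipeline the resulting rows are stored as SFrames rather than Python lists, which is what makes the dataset's size manageable.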

In [2]:
import turicreate as tc  # needed for tc.load_sframe if not already imported
mag_sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
mag_sf

Paper ID,Original paper title,Normalized paper title,Paper publish year,Paper publish date
01B27BE8,Evaluating Polarity for Verbal Phraseological ...,evaluating polarity for verbal phraseological ...,2014,2014/11/16
027D0030,Automatic Monitoring the Content of Audio ...,automatic monitoring the content of audio ...,2012,2012/10/27
7CFE299E,Towards a set of Measures for Evaluating Software ...,towards a set of measures for evaluating software ...,2009,2009/11
59BEBE1C,Learning Probability Densities of Optimiza ...,learning probability densities of optimiza ...,2008,2008/10/27
5873C011,Towards a Model for an Immune System ...,towards a model for an immune system ...,2002,2002/04/22
7A1109E4,Approach Towards a Natural Language Anal ...,approach towards a natural language anal ...,2013,2013/11
0B00AFD8,Towards the creation of semantic models based on ...,towards the creation of semantic models based on ...,2012,2012/10/27
5C66D743,Comparison of Neural Networks and Support ...,comparison of neural networks and support ...,2009,2009/11/01
040121AE,Multiple Kernel Support Vector Machine Proble ...,multiple kernel support vector machine proble ...,2014,2014/11/16
7DEADC9A,A Set of Test Cases for Performance Measures in ...,a set of test cases for performance measures in ...,2008,

Paper Document Object Identifier (DOI) ...,Original venue name,Normalized venue name,Journal ID mapped to venue name ...,Conference ID mapped to venue name ...
10.1007/978-3-319-13647-9 _19 ...,mexican international conference on artificial ...,micai,,42D7146F
10.1007/978-3-642-37807-2 _11 ...,mexican international conference on artificial ...,micai,,42D7146F
10.1109/MICAI.2009.15,mexican international conference on artificial ...,micai,,42D7146F
10.1007/978-3-540-88636-5 _25 ...,mexican international conference on artificial ...,micai,,42D7146F
10.1007/3-540-46016-0_42,mexican international conference on artificial ...,micai,,42D7146F
,mexican international conference on artificial ...,micai,,42D7146F
10.1007/978-3-642-37807-2 _26 ...,mexican international conference on artificial ...,micai,,42D7146F
10.1007/978-3-642-05258-3 _42 ...,mexican international conference on artificial ...,micai,,42D7146F
10.1007/978-3-319-13650-9 _14 ...,mexican international conference on artificial ...,micai,,42D7146F
,mexican international conference on artificial ...,micai,,42D7146F

Paper rank,Ref Number,Total Citations by Year,Total Citations by Year without Self Citations ...,Authors List Sorted
19517,21,,,"[834A11E2, 7E8BA14F, 852B2668] ..."
19444,7,,,"[7DB8825E, 6936139F]"
18870,14,"{'2015': 10.0, '2014': 9.0, '2011': 3.0, '20 ...","{'2015': 8.0, '2014': 7.0, '2011': 1.0, '20 ...","[81867464, 8106CCE6, 7D20CE86, 7C6C6BB9] ..."
19444,7,,,"[807DCA23, 811B0352, 2779E3F4] ..."
19177,8,"{'2003': 1.0, '2006': 3.0, '2007': 3.0, '20 ...","{'2003': 1.0, '2006': 2.0, '2007': 2.0, '20 ...","[7F553272, 7F830ACE, 7E7F1E07] ..."
19555,0,,,[7EE331AF]
19476,10,,,"[80CF45DD, 814339C1, 7ED13F21, 7F45D6E4] ..."
19428,9,"{'2015': 1.0, '2014': 1.0} ...","{'2015': 1.0, '2014': 1.0} ...","[7E2F72E3, 45F06265]"
19468,9,,,"[7677E6C4, 7EBBEA7F, 7F312412, 776D4ECC, ..."
19394,9,"{'2015': 3.0, '2014': 2.0, '2013': 2.0, '20 ...","{'2015': 3.0, '2014': 2.0, '2013': 2.0, '20 ...","[7E792787, 7EC1BF2D, 7897839F] ..."

Keywords List,Field of study list,Field of study list names,Fields of study parent list (L0) ...
,,,
,,,
"[measures, software measurement, autonomy, ...","[0A9CB5A9, 0556B228, 03E623B0, 0ABCEA76, ...","[Measure, Software measurement, Autonomy, ...","[0271BC14, 0895A350, 0205A1DB, 07982D63] ..."
"[optimization problem, probability density] ...","[083736DA, 0BBED543]","[None, Probability density function] ...",[0205A1DB]
"[process algebra, process calculi, multi agent ...","[09A47029, 09A47029, 027A0232, 027A0232, ...","[Process calculus, Process calculus, Multi- ...",[0271BC14]
"[cognition, computational linguistics, grammars, ...","[0A2079AC, 093E8748, 03365AB6, 044294F0, ...","[Cognition, Computational linguistics, Rule-based ...","[0271BC14, 00F03FC7]"
"[computer aided design, cad, ontologies] ...","[07245C42, 0B9C400C, 09F001E0] ...","[Computer Aided Design, None, Ontology] ...",[0271BC14]
"[dynamic system, dynamic systems, neural network ...","[0AA68668, 0304C748, 0304C748, 0AA68668, ...","[Dynamical system, Artificial neural ...","[0271BC14, 0B0FEB68]"
,,,
[multiobjective optimization] ...,[04198571],[Multi-objective optimization] ...,[]

Fields of study parent list names (L0) ...,Fields of study parent list (L1) ...,Fields of study parent list names (L1) ...,Fields of study parent list (L2) ...
,,,
,,,
"[Computer Science, Sociology, Mathematics, ...","[0BE4BA29, 0765A2E4, 093C4716, 06E88D7C] ...","[Law, Data mining, Artificial intelligence, ...","[00F36ADC, 05A3DFDE]"
[Mathematics],[064E5072],[Statistics],[007E3B49]
[Computer Science],"[0C19BFCD, 0BE20181, 093C4716] ...","[Immunology, Programming language, Artificial ...",[027A0232]
"[Computer Science, Psychology] ...","[0BE20181, 0C2DB2A7]","[Programming language, Natural language ...","[00E4DDF6, 0C199D1F, 093E8748, 044294F0] ..."
[Computer Science],[093C4716],[Artificial intelligence],[07245C42]
"[Computer Science, Chemistry] ...",[0724DFBA],[Machine learning],"[0304C748, 097464D7]"
,,,
[],[07868074],[Mathematical optimization] ...,[02724C38]

Fields of study parent list names (L2) ...,Authors Number,Urls,Fields of study parent list (L3) ...
,3,[http://link.springer.com /content/pdf/10.1007% ...,
,2,"[http://dl.acm.org/citati on.cfm?id=2481834, ht ...",
"[Project management, Politics] ...",4,[http://ieeexplore.ieee.o rg/xpl/abstractAuthor ...,"[0059F32E, 0556B228, 03E623B0, 0ABCEA76, ..."
[Stochastic process],3,[http://dx.doi.org/10.100 7/978-3-540-88636-5_2 ...,[0BBED543]
[Multi-agent system],3,"[http://dl.acm.org/citati on.cfm?id=691909, htt ...","[09A47029, 0087AC0D]"
"[Speech synthesis, Machine translation, ...",1,[http://ieeexplore.ieee.o rg/lpdocs/epic03/wrap ...,"[0322F49A, 0A2079AC, 03365AB6, 041AB807] ..."
[Computer Aided Design],4,"[http://dl.acm.org/citati on.cfm?id=2481852, ht ...",[09F001E0]
"[Artificial neural network, Nonlinear ...",2,[http://adsabs.harvard.ed u/abs/2009LNCS.5845.. ...,"[00BB2E8D, 0AA68668, 2078A8D7] ..."
,5,[http://link.springer.com /content/pdf/10.1007% ...,
[Linear programming],3,[http://dx.doi.org/10.100 7/978-3-540-88636-5_4 ...,[04198571]

Fields of study parent list names (L3) ...
""
""
"[Software agent, Software measurement, Autonomy, ..."
[Probability density function] ...
"[Process calculus, Immune system] ..."
"[Feature extraction, Cognition, Rule-based ..."
[Ontology]
"[Support vector machine, Dynamical system, Cop ..."
""
[Multi-objective optimization] ...


In our study, we also analyzed how various author attributes, such as the number of published papers and the number of coauthors, have changed over time. To achieve this, we created an authors features SFrame using the following code:

In [3]:
from create_mag_authors_sframe import *
a = AuthorsFeaturesExtractor()

# This needs to run on a powerful server and can take considerable time
a_sf = a.get_authors_all_features_sframe()
a_sf # the SFrame can later be loaded using tc.load_sframe(AUTHROS_FEATURES_SFRAME)

Author ID,Papers by Years Dict,Coauthors by Years Dict,Affilation by Year Dict
00001F05,"{2010: ['5DA0F250'], 2013: ['7AF8ABFE']} ...","{2013: ['77CE16EC', '17B20BAE']} ...","{2010: [''], 2013: ['']}"
00002AD3,"{2009: ['7A0B348F'], 2010: ['795F56C6'], 2 ...","{2009: ['7FD1B86A', '7921EA7D', '05390F01', ...","{2009: [''], 2010: [''], 2012: [''], 2006: [''], ..."
00006A31,{2009: ['7CFAEB15']},"{2009: ['7E24B147', '7C3ED158', '79A6FF42', ...",{2009: ['']}
0000B5FA,"{2008: ['7714EB4E'], 2009: ['78B7C257'], 2 ...","{2008: ['54648B9B'], 2009: ['7ADCCDB0', ...","{2008: [''], 2009: [''], 2010: ['', '', ''], 2 ..."
0001CF9B,{2013: ['7C9BBC3A']},"{2013: ['852809D8', '77FE1F64', '80B5223D', ...",{2013: ['']}
00040294,"{2009: ['7892886A', '81263516', '81424AA7'], ...","{2009: ['75D367F6', '82C1C4DE', '80824D21', ...","{2009: ['', '', ''], 2011: ['', ''], 2004: ..."
00045553,"{1987: ['77F10A3D'], 1988: ['7836E5B8'], 1 ...","{1987: ['85CAEB12'], 1988: ['819D7046', ...","{1987: ['new york medical college'], 1988: [''], ..."
0004B8AF,{2010: ['77AFDEB4']},"{2010: ['77227437', '77CB65A7', '5EBF97A1', ...",{2010: ['']}
000510E2,"{2011: ['80790612'], 2012: ['76E5D7F2', ...","{2011: ['82D84635', '11F1B283'], 2012: ...","{2011: ['university of queensland'], 2012: ['', ..."
00063841,{2014: ['7790AFD4']},"{2014: ['853EBBF2', '7901305E', '0F71473E', ...",{2014: ['']}

Sequence Number by Year Dict ...,Author name,First name,Last name,Conference ID by Year Dict ...
"{2010.0: array('d', [1.0]), 2013.0: ...",nancy praill,nancy,praill,"{2010: [''], 2013: ['']}"
"{2009.0: array('d', [1.0]), 2010.0: ...",david s rebergen,david,rebergen,"{2009: [''], 2010: [''], 2012: [''], 2006: [''], ..."
"{2009.0: array('d', [6.0])} ...",b zelazowska,b,zelazowska,{2009: ['']}
"{2008.0: array('d', [1.0]), 2009.0: ...",lars goerigk,lars,goerigk,"{2008: [''], 2009: [''], 2010: ['', '', ''], 2 ..."
"{2013.0: array('d', [5.0])} ...",orlando lastres danguillecourt ...,orlando,danguillecourt,{2013: ['']}
"{2009.0: array('d', [4.0, 7.0, 6.0]), 2011.0: ...",ivani bisordi,ivani,bisordi,"{2009: ['', '', ''], 2011: ['', ''], 2004: ..."
"{1987.0: array('d', [1.0]), 1988.0: ...",miguel a pappolla,miguel,pappolla,"{1987: [''], 1988: [''], 1989: [''], 1990: ['', ..."
"{2010.0: array('d', [7.0])} ...",dong zaijie,dong,zaijie,{2010: ['']}
"{2011.0: array('d', [1.0]), 2012.0: ...",fairlie mcilwraith,fairlie,mcilwraith,"{2011: [''], 2012: ['', '', ''], 2013: [''], ..."
"{2014.0: array('d', [8.0])} ...",tiziano ponsetti,tiziano,ponsetti,{2014: ['']}

Journal ID by Year Dict,Venue by Year Dict,Gender Dict
"{2010: [''], 2013: ['']}","{2010: [''], 2013: ['']}","{'Gender': 'Female', 'Total Males': 2999, ..."
"{2009: ['0959867B'], 2010: ['0069C535'], 2 ...",{2009: ['Journal of Occupational and ...,"{'Gender': 'Male', 'Total Males': 3700247, 'Total ..."
{2009: ['036625C9']},{2009: ['Advances in Medical Sciences']} ...,"{'Gender': 'Unisex', 'Total Males': 536, ..."
"{2008: ['0A1986D0'], 2009: ['0ACEE946'], 2 ...","{2008: ['ChemPhysChem'], 2009: ['Physical ...","{'Gender': 'Male', 'Total Males': 12459, 'Total ..."
{2013: ['01F41F83']},{2013: ['International Journal of Energy ...,"{'Gender': 'Male', 'Total Males': 47535, 'Total ..."
"{2009: ['08826C6E', '0B483532', '05F694A1'], ...","{2009: ['Infection, Genetics and Evolution', ...","{'Gender': 'Female', 'Total Males': 0, 'Total ..."
"{1987: ['096E1E70'], 1988: ['03C89659'], 1 ...","{1987: ['Synapse'], 1988: ['Human Pathology'], ...","{'Gender': 'Male', 'Total Males': 173865, 'Total ..."
{2010: ['0B0C2E2F']},{2010: ['Aquaculture Research']} ...,"{'Gender': 'Male', 'Total Males': 317, 'Total ..."
"{2011: ['080AF648'], 2012: ['068E6FF5', '', ...","{2011: ['Drug and Alcohol Review'], 2012: ['Drug ...","{'Gender': 'Unisex', 'Total Males': 7, 'Total ..."
{2014: ['06FD8B4A']},{2014: ['Catalysis Today']} ...,"{'Gender': 'Male', 'Total Males': 37, 'Total ..."
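
To make the structure of the per-year dictionaries concrete, the sketch below builds simplified versions of the `Papers by Years Dict` and `Coauthors by Years Dict` columns in plain Python. It is not the `AuthorsFeaturesExtractor` implementation: the function name `build_author_year_dicts` and the input format are hypothetical, and the two sample records were constructed to be consistent with the first row of the output above (assuming that paper 5DA0F250 has a single author).

```python
from collections import defaultdict


def build_author_year_dicts(papers):
    """Build per-author, per-year paper and coauthor lists.

    `papers` is an iterable of (paper_id, year, author_ids) tuples,
    a hypothetical input format chosen for this sketch.
    """
    papers_by_year = defaultdict(lambda: defaultdict(list))
    coauthors_by_year = defaultdict(lambda: defaultdict(list))
    for paper_id, year, author_ids in papers:
        for author_id in author_ids:
            papers_by_year[author_id][year].append(paper_id)
            # Everyone else on the same paper counts as a coauthor for that year.
            for other_id in author_ids:
                if other_id != author_id:
                    coauthors_by_year[author_id][year].append(other_id)

    def to_plain(nested):
        return {author: dict(years) for author, years in nested.items()}

    return to_plain(papers_by_year), to_plain(coauthors_by_year)


sample_papers = [
    ("5DA0F250", 2010, ["00001F05"]),
    ("7AF8ABFE", 2013, ["00001F05", "77CE16EC", "17B20BAE"]),
]
papers_by_year, coauthors_by_year = build_author_year_dicts(sample_papers)
print(papers_by_year["00001F05"])
print(coauthors_by_year["00001F05"])
```

Years in which an author published alone simply do not appear in the coauthor dictionary, which matches the shape of the first row shown above.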


The authors features SFrame above contains various features of each author, constructed by analyzing the author’s papers that have at least 5 references. Note that it also contains a gender prediction for each author. This column was created by obtaining first-name gender statistics from the [SSA Baby Names](http://www.ssa.gov/oact/babynames/names.zip) and [WikiTree](https://www.wikitree.com/wiki/Help:Database_Dumps) datasets, which include over 115 thousand unique first names (see details in geneder_classifier.py).
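The general idea of predicting gender from first-name statistics can be sketched as follows. This is an illustration under stated assumptions, not the logic of the project's classifier: the function names, the 0.9 threshold, and the toy counts are all hypothetical. Only the three labels (`Male`, `Female`, `Unisex`) are taken from the output above.

```python
def classify_first_name(male_count, female_count, threshold=0.9):
    """Label a name from how often it was recorded as male or female.

    The threshold is a hypothetical choice for this sketch.
    """
    total = male_count + female_count
    if total == 0:
        return None  # no statistics available for this name
    if male_count / total >= threshold:
        return "Male"
    if female_count / total >= threshold:
        return "Female"
    return "Unisex"


def predict_gender(first_name, name_stats, threshold=0.9):
    """Look up a first name in a {name: (male_count, female_count)} table."""
    counts = name_stats.get(first_name.strip().lower())
    if counts is None:
        return None
    return classify_first_name(counts[0], counts[1], threshold)


# Hypothetical counts for illustration only; not taken from either dataset.
toy_stats = {"alex": (60, 40), "maria": (1, 99), "john": (995, 5)}
print(predict_gender("John", toy_stats))
```

A name that does not appear in the statistics table yields `None` here, so downstream analyses can decide how to treat authors without a prediction.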

## 1.2 The AMiner Dataset

After downloading the dataset from the [AMiner website](https://aminer.org/open-academic-graph), simply load it into an SFrame using the following code:

In [4]:
aminer_sf = tc.SFrame.read_json('%s/*.txt' % AMINER_DATA_DIR,  orient='lines')
aminer_sf # the SFrame can be accessed also by using tc.load_sframe(AMINER_PAPERS_SFRAME)

abstract,authors,doi,id
,"[{'name': 'G. Adam'}, {'name': 'K. Schreibe ...",10.1002/ange.19650770204,53e99784b7602d9701f3e130
,"[{'name': 'R. Farahbod'}, {'name': 'V. Gervasi'}, ...",,53e99784b7602d9701f3e131
The method to making technology roadmap is ...,"[{'name': 'MO Chou'}, {'name': 'CHEN Jiqing'}, ...",,53e99784b7602d9701f3e132
Drought is the first place in all the natural ...,"[{'name': 'Peijuan Wang'}, {'name': 'Jiahua ...",10.1109/IGARSS.2011.60495 03 ...,53e99784b7602d9701f3e133
Determination of total sugar can serve to ...,[{'org': 'Yantai Institute of Coastal ...,,53e99784b7602d9701f3e135
Resumen: Uno de los problemas que debemos ...,[{'name': 'CELSO VARGAS'}] ...,,53e99784b7602d9701f3e136
,"[{'name': 'D J Lum'}, {'name': 'V Upadhyay'}, ...",10.1111/j.1365-2559.2007. 02817.x ...,53e99784b7602d9701f3e137
This paper discussed the planning and design ...,[{'org': 'School of Resource and ...,,53e99784b7602d9701f3e139
Rough set is a mathematical tool to ...,,,53e99784b7602d9701f3e13a
,"[{'name': 'F THOUVENYPAISANT'}, ...",10.1016/S0221-0363(05)762 74-0 ...,53e99784b7602d9701f3e13b

isbn,issn,issue,keywords,lang,n_citation,page_end,page_start,pdf
,,2.0,,en,,95,94.0,
,,,,en,,,,
,,19.0,"[science and technology production, technology ...",zh,,95,90.0,
,,,"[canopy parameters, canopy spectrum, ...",en,,1933,1930.0,
,,7.0,"[metabolites, Jerusalem artichoke, total sugar, ...",zh,1.0,93+97,90.0,
,,,,en,,,,
,,5.0,,en,,707,704.0,
,,28.0,"[Planning and design method, Mountainous ...",zh,1.0,364,362.0,
,,11.0,"[Data Mining, Rough Set, Algorithm, Rules ...",zh,3.0,106,104.0,
,,10.0,,en,,1555,1555.0,

references,title,url,venue
"[53e9a6e6b7602d970301a47d , ...",1.4-N→N′-Acylwanderun g bei einem ...,[http://dx.doi.org/10.100 2/ange.19650770204] ...,Angewandte Chemie
"[53e9a1d0b7602d9702ac8f1b , ...",Design and Specification of the CoreASM Execution ...,,
,Practice Research on Technology Roadmap for ...,,Science and Technology Management Research ...
"[53e999c3b7602d970220b9b7 , ...",The relationship between canopy parameters and ...,[http://dx.doi.org/10.110 9/IGARSS.2011.6049503] ...,IGARSS
,The effect of metabolites on the determination of ...,,Food Science and Technology ...
,El Humanista y la Energía Nuclear ...,,
"[53e9b395b7602d9703e78794 , ...",Botryoid fibroepithelial polyp of the urinary ...,[http://dx.doi.org/10.111 1/j.1365-2559.2007.02 ...,Histopathology
,Planning and Design Method of Land ...,,Journal of Anhui Agricultural Sciences ...
,A Data Mining Based on Rough Set Theory ...,,Software Guide
,RI1 Embolisation des varices stomiales par ...,[http://dx.doi.org/10.101 6/S0221-0363(05)76274 ...,Journal De Radiologie

volume,year
77.0,1965.0
,
,2013.0
,2011.0
,2012.0
,2013.0
51.0,2007.0
,2012.0
,2012.0
86.0,2005.0


## 1.3 The SJR Dataset

First, we download all the journal ranking files from [the SJR website](http://www.scimagojr.com/journalrank.php).
Next, we use the following code to create a single SFrame with all the journal data:

In [5]:
from create_sjr_sframe import *
sjr_sf = create_sjr_sframe(SJR_FILES_DIR)
sjr_sf # the SFrame can also be accessed using tc.load_sframe(SJR_SFRAME)

Rank,Title,Type,SJR,SJR Best Quartile,H index,Total Docs.,Total Docs. (3years)
1,Astrophysical Journal Letters ...,journal,61.473,Q1,82,5,7
1,Astrophysical Journal Letters ...,journal,61.473,Q1,82,5,7
2,Annual Review of Biochemistry ...,journal,49.476,Q1,248,30,81
2,Annual Review of Biochemistry ...,journal,49.476,Q1,248,30,81
3,Cell,journal,41.978,Q1,616,354,1359
3,Cell,journal,41.978,Q1,616,354,1359
4,Annual Review of Immunology ...,journal,40.906,Q1,254,29,81
4,Annual Review of Immunology ...,journal,40.906,Q1,254,29,81
5,Annual Review of Cell and Developmental Biology ...,book serie,33.882,Q1,182,25,61
5,Annual Review of Cell and Developmental Biology ...,book serie,33.882,Q1,182,25,61

Total Refs.,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country
350,493,7,72.75,70.0,United Kingdom
350,493,7,72.75,70.0,United Kingdom
5913,3445,80,35.38,197.1,United States
5913,3445,80,35.38,197.1,United States
15870,47390,1328,34.36,44.83,United States
15870,47390,1328,34.36,44.83,United States
5236,4030,81,46.69,180.55,United States
5236,4030,81,46.69,180.55,United States
4134,1770,60,26.55,165.36,United States
4134,1770,60,26.55,165.36,United States

Year,Categories,ISSN
1999,,20418213
1999,,20418205
1999,,15454509
1999,,664154
1999,,928674
1999,,10974172
1999,,7320582
1999,,15453278
1999,,15308995
1999,,10810706
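
The output above lists each journal once per ISSN and tags every row with its year. The sketch below shows one way to combine per-year ranking files into a table of that shape. It is not the `create_sjr_sframe` implementation: the `combine_sjr_years` helper, the semicolon delimiter, and the comma-separated ISSN field are assumptions made for this sketch, and the sample file contains only three of the columns.

```python
import csv
import io


def combine_sjr_years(files_by_year, delimiter=";"):
    """Merge {year: csv_text} into one list of rows, one row per journal ISSN."""
    rows = []
    for year, text in sorted(files_by_year.items()):
        for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
            # A journal may list several ISSNs; emit one row for each of them.
            issns = [i.strip() for i in record["ISSN"].split(",") if i.strip()]
            for issn in issns:
                row = dict(record)
                row["Year"] = year
                row["ISSN"] = issn
                rows.append(row)
    return rows


# Minimal stand-in for one yearly file (assumed layout, three columns only).
sample_1999 = "Rank;Title;ISSN\n1;Astrophysical Journal Letters;20418213, 20418205\n"
combined = combine_sjr_years({1999: sample_1999})
print(len(combined))
```

Note that in this sketch a journal without any ISSN produces no rows at all, which may or may not be the desired behavior.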


## 1.4 Joint Datasets

The MAG and AMiner datasets have slightly different sets of features. While the MAG dataset contains data on each author with a unique author ID, the AMiner dataset contains additional data on each paper, including the paper's abstract and the paper's ISSN or ISBN. Additionally, the SJR dataset contains data about each journal's ranking.

To combine the data from the authors' publication records with the journals' rankings, we joined the datasets. First, we joined the MAG and AMiner datasets by matching DOI values, using the following code (see also create_mag_aminer_sframe.py):

In [6]:
import turicreate as tc
import turicreate.aggregate as agg  # provides agg.COUNT() used below

sf = tc.load_sframe(EXTENDED_PAPERS_SFRAME)
g1 = sf.groupby('Paper Document Object Identifier (DOI)', {'Count': agg.COUNT()})
s1 = set(g1[g1['Count'] > 1]['Paper Document Object Identifier (DOI)'])
sf = sf[sf['Paper Document Object Identifier (DOI)'].apply(lambda doi: doi not in s1 )]
sf.materialize()

sf2 = tc.load_sframe(AMINER_PAPERS_SFRAME)
g2 = sf2.groupby('doi', {'Count': agg.COUNT()})
s2 = set(g2[g2['Count'] > 1]['doi'])
sf2 = sf2[sf2['doi'].apply(lambda doi: doi not in s2 )]
sf2.materialize()

aminer_mag_sf = sf.join(sf2, {'Paper Document Object Identifier (DOI)': 'doi'})
aminer_mag_sf['title_len'] = aminer_mag_sf['title'].apply(lambda t: len(t))
aminer_mag_sf = aminer_mag_sf[aminer_mag_sf['title_len'] > 0]
aminer_mag_sf = aminer_mag_sf.rename({"Paper ID": "MAG Paper ID", "id": "Aminer Paper ID"})
aminer_mag_sf = aminer_mag_sf.remove_column('title_len')  # remove_column returns a new SFrame by default
aminer_mag_sf # this SFrame can be accessed using tc.load_sframe(AMINER_MAG_JOIN_SFRAME)

MAG Paper ID,Original paper title,Normalized paper title,Paper publish year,Paper publish date
7C15F682,"Ptychoptera deleta Novak, 1877 from the Early ...",ptychoptera deleta novak 1877 from the early ...,2011,2011
84A37D36,Sherborn’s foraminiferal studies ...,sherborn s foraminiferal studies and their ...,2016,2016/07/01
773B216E,A new species of hydrobiid snails ...,a new species of hydrobiid snails moll ...,2011,2011/10/19
77C44F83,Revision of the planthopper genus ...,revision of the planthopper genus ...,2014,2014/10/12
75233F3E,Female genitalia of Seasogonia Young from ...,female genitalia of seasogonia young from ...,2012,2012/11/01
7B5321C5,A taxonomic study on the genus Ettchellsia ...,a taxonomic study on the genus ettchellsia cam ...,2012,2012
3C77A5B8,"An Asiatic Chironomid in Brazil: morphology, DNA ...",an asiatic chironomid in brazil morphology dna ...,2015,2015/07/27
8051111A,A new species of Smicromorpha ...,a new species of smicromorpha hymenoptera ...,2009,2009/09/14
79BD8F37,Open exchange of scientific knowledge and ...,open exchange of scientific knowledge and ...,2014,2014/06/06
240B7EFF,"Four new species of Epicephala Meyrick, 1880 ...",four new species of epicephala meyrick 1880 ...,2015,2015/06/15

Paper Document Object Identifier (DOI) ...,Original venue name,Normalized venue name,Journal ID mapped to venue name ...,Conference ID mapped to venue name ...
10.3897/zookeys.130.1401,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.550.9863,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.138.1927,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.462.6657,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.164.2132,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.254.4182,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.514.9925,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.20.195,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.414.7717,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.508.9479,ZooKeys,zookeys,0BDFC074,

Paper rank,Ref. Number,Total Citations by Year,Total Citations by Year without Self Citations ...,Authors List Sorted
19382,7.0,{'2015': 1.0},{'2015': 1.0},"[855C02FD, 7B2C4199]"
19555,,,,[84CB5028]
19402,12.0,"{'2015': 3.0, '2014': 3.0, '2011': 1.0, '20 ...","{'2015': 1.0, '2014': 1.0, '2011': 1.0, '20 ...",[8439D30B]
19555,,,,"[84C55AD3, 7E095DC6]"
19427,6.0,,,"[805137B8, 80CA5307]"
19370,4.0,,,"[80F9A983, 7FFB2555]"
19555,,,,"[6118B891, 7FE3B9C3, 79CC73E7, 7C17044D] ..."
19157,3.0,"{'2015': 4.0, '2014': 3.0, '2013': 2.0, '20 ...","{'2015': 4.0, '2014': 3.0, '2013': 2.0, '20 ...","[85B81F06, 7E5ABA3D]"
19321,5.0,"{'2015': 3.0, '2014': 1.0} ...",{'2015': 2.0},"[7237B1F9, 7DAD3B1C, 78F96D88, 78A6ED0B] ..."
19555,,,,"[7CFFBD65, 862455E0, 7D640C0F] ..."

Keywords List,Field of study list,Field of study list names,Fields of study parent list (L0) ...
"[tertiary, biomedical research, neogene, ...","[009377C6, 0660586C, 01A380F9, 039D5C06] ...","[Tertiary, None, Neogene, Bioinformatics] ...",[]
,,,
"[biomedical research, bioinformatics] ...","[0660586C, 039D5C06]","[None, Bioinformatics]",[]
,,,
"[morphology, taxonomy]","[06A2C3F5, 037ECF39]","[Morphology, Taxonomy]",[052C8328]
"[taxonomy, bioinformatics, ...","[039D5C06, 037ECF39, 0660586C] ...","[Bioinformatics, Taxonomy, None] ...",[052C8328]
,,,
"[morphology, taxonomy]","[06A2C3F5, 037ECF39]","[Morphology, Taxonomy]",[052C8328]
"[taxonomy, intellectual property rights, ...","[037ECF39, 0215A9CE, 039D5C06, 0660586C] ...","[Taxonomy, Intellectual property, Bioinformat ...","[0895A350, 052C8328]"
,,,

Fields of study parent list names (L0) ...,Fields of study parent list (L1) ...,Fields of study parent list names (L1) ...,Fields of study parent list (L2) ...
[],"[090B39EA, 039D5C06]","[Paleontology, Bioinformatics] ...",[0683829C]
,,,
[],[039D5C06],[Bioinformatics],[]
,,,
[Biology],[027F4522],[Linguistics],"[037ECF39, 06A2C3F5]"
[Biology],[039D5C06],[Bioinformatics],[037ECF39]
,,,
[Biology],[027F4522],[Linguistics],"[037ECF39, 06A2C3F5]"
"[Sociology, Biology]","[0BE4BA29, 039D5C06]","[Law, Bioinformatics]",[037ECF39]
,,,

Fields of study parent list names (L2) ...,Authors Number,Urls,abstract
[Stratigraphy],2,[/pmc/articles/instance/3 260767/?report=abstra ...,The first fossil that was described in ...
,1,[http://zookeys.pensoft.n et/lib/ajax_srv/artic ...,
[],1,[http://bionames.org/refe rences/b38c8055b3453e ...,Anew minute valvatiform species belonging to the ...
,2,[http://www.researchgate. net/publication/27090 ...,"Chinese species in the genus Nycheuma Fennah, ..."
"[Taxonomy, Morphology]",2,[/pmc/articles/instance/3 272620/?report=abstra ...,"Seasogonia Young, 1986 is a sharpshooter genus ..."
[Taxonomy],2,[http://bionames.org/refe rences/4078cc33be92d7 ...,"Three new species of Ettchellsia Cameron, ..."
,4,[http://www.researchgate. net/publication/28071 ...,
"[Taxonomy, Morphology]",2,[http://www.cabdirect.org /abstracts/2009331573 ...,
[Taxonomy],4,[http://advocomplex.ch/fi les/Open%20exchange%2 ...,Background. The 7(th) Framework Programme for ...
,3,"[http://www.ncbi.nlm.nih. gov/pubmed/26167120, ...",

authors,Aminer Paper ID,isbn,issn,issue,keywords,lang
"[{'org': 'Institute of Biology, Pedagogical ...",55a4881d65ce31bc877df20d,,1313-2970,130.0,"[cheb basin, cypris formation, czech ...",en
[{'name': 'giles miller'}] ...,56d81fffdabfae2eeeb5906d,,,,,en
"[{'org': 'Department of Ecology and Systematics, ...",55a47a6865ce31bc877c5f09,,1313-2970,138.0,"[caenogastropoda, daphniola eptalophos sp. ...",en
[{'org': 'The Special Key Laboratory for ...,55a6b03265ce054aad712ec1,,1313-2989,462.0,"[delphacini, fulgoroidea, hemiptera, nycheuma, new ...",en
"[{'org': 'Institute of Entomology, Guizhou ...",55a4953165ceb7cb02d2d096,,1313-2970,164.0,"[auchenorrhyncha, cicadellinae, ...",en
"[{'org': 'Laboratory of Entomology, Faculty of ...",55a51ace65ceb7cb02e11911,,1313-2989,254.0,"[south east asia, taxonomy, parasitic ...",en
"[{'name': 'gizelle amora'}, {'name': 'neusa ...",56d81fffdabfae2eeeb59087,,,,,en
"[{'name': 'd c darling'}, {'name': 'norman f ...",53e9ac5cb7602d970362f86d,,,0.0,"[morphology, taxonomy]",en
"[{'org': 'Plazi, Zinggstrasse 16, 3007 ...",55a676ab65ce054aad68df46,,1313-2989,414.0,"[biodiversity knowledge, european copyright, ...",en
"[{'name': 'houhun li'}, {'name': 'zhibo wang'}, ...",56d82000dabfae2eeeb59093,,,,,en

n_citation,page_end,page_start,pdf,references,title,...
3.0,305.0,299.0,,"[53e99c60b7602d9702511119 , ...","Ptychoptera deleta Novák, 1877 from the ...",...
,,,,,Sherborn’s foraminiferal studies ...,...
,64.0,53.0,,"[56d8e4efdabfae2eee2affe9 , ...",A new species of hydrobiid snails ...,...
,57.0,47.0,,,Revision of the planthopper genus ...,...
,40.0,24.0,,"[56d92143dabfae2eee9f1671 , ...",Female genitalia of Seasogonia Young from ...,...
,108.0,99.0,,"[53e99ddab7602d970269b655 , ...",A taxonomic study on the genus Ettchellsia ...,...
,,,,,"An Asiatic Chironomid in Brazil: morphology, DNA ...",...
,,,,"[56d85c7cdabfae2eee5a5fe6 , ...",A new species of Smicromorpha ...,...
9.0,135.0,109.0,,"[55a4825f65ce31bc877d426d , ...",Open exchange of scientific knowledge and ...,...
,,,,,"Four new species of Epicephala Meyrick, 1880 ...",...
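
The logic of the DOI join (drop every DOI that occurs more than once on either side, then match the remaining rows) can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, the sketch also drops empty DOIs explicitly, and the `10.0000/dup` rows are made-up examples. The first MAG and AMiner rows reuse the IDs and DOI from the first row of the output above.

```python
from collections import Counter


def unique_key_rows(rows, key):
    """Keep only rows whose `key` value is non-empty and occurs exactly once."""
    counts = Counter(row.get(key) for row in rows)
    return [row for row in rows if row.get(key) and counts[row.get(key)] == 1]


def inner_join(left, right, left_key, right_key):
    """Join two lists of dicts on left_key == right_key (keys unique on the right)."""
    index = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[left_key])
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != right_key})
            joined.append(merged)
    return joined


mag_rows = [
    {"MAG Paper ID": "7C15F682", "DOI": "10.3897/zookeys.130.1401"},
    {"MAG Paper ID": "AAAA0001", "DOI": "10.0000/dup"},  # hypothetical duplicate
    {"MAG Paper ID": "AAAA0002", "DOI": "10.0000/dup"},  # hypothetical duplicate
]
aminer_rows = [
    {"Aminer Paper ID": "55a4881d65ce31bc877df20d", "doi": "10.3897/zookeys.130.1401"},
    {"Aminer Paper ID": "bbbb0001", "doi": "10.0000/dup"},  # hypothetical
]
joined = inner_join(
    unique_key_rows(mag_rows, "DOI"), unique_key_rows(aminer_rows, "doi"), "DOI", "doi"
)
print(joined)
```

Removing ambiguous DOIs before joining trades some coverage for the guarantee that each joined row pairs exactly one MAG paper with exactly one AMiner paper.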


Using the joined dataset, we obtained an SFrame with the joint metadata of 28.9 million papers. We can then join this SFrame with the SJR dataset.
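One detail of this join is that the ISSN values must be brought into a common form: in the outputs above, AMiner ISSNs contain a hyphen (e.g. 1313-2970), while SJR ISSNs do not (e.g. 20418213). The sketch below isolates that normalization step using the same regular expression idea as the join code. The function name `normalize_issn` is hypothetical, and the sketch deliberately mirrors the digits-only pattern, so an ISSN ending in the check character X loses that character.

```python
import re

# Two runs of digits separated by a hyphen, e.g. "1313-2970".
ISSN_PATTERN = re.compile(r"(\d+)-(\d+)")


def normalize_issn(raw):
    """Return the ISSN with its hyphen removed, or None if no ISSN is found."""
    if raw is None:
        return None
    match = ISSN_PATTERN.search(raw)
    return "".join(match.groups()) if match else None


print(normalize_issn("1313-2970"))
```

Some SJR ISSN values in the output above are shorter than eight digits (e.g. 664154), which suggests that leading zeros may be missing there. Whether this affects matching depends on the column types, which we have not verified here.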

In [8]:
import re
def create_aminer_mag_sjr_sframe(year):
    """
    Creates a unified SFrame of AMiner, MAG, and the SJR datasets
    :param year: year to use for SJR data
    :return: SFrame with AMiner, MAG, and SJR data
    :rtype: tc.SFrame
    """
    sf = tc.load_sframe(AMINER_MAG_JOIN_SFRAME)
    sf = sf[sf['issn'] != None]
    sf = sf[sf['issn'] != 'null']
    sf.materialize()
    r = re.compile(r"(\d+)-(\d+)")  # raw string avoids invalid escape sequence warnings
    sf['issn_str'] = sf['issn'].apply(lambda i: "".join(r.findall(i)[0]) if len(r.findall(i))> 0 else None)
    sf = sf[sf['issn_str'] != None]
    sjr_sf = tc.load_sframe(SJR_SFRAME)
    sjr_sf = sjr_sf[sjr_sf['Year'] == year]
    return sf.join(sjr_sf, on={'issn_str': "ISSN"})
create_aminer_mag_sjr_sframe(2015)

MAG Paper ID,Original paper title,Normalized paper title,Paper publish year,Paper publish date
7C15F682,"Ptychoptera deleta Novak, 1877 from the Early ...",ptychoptera deleta novak 1877 from the early ...,2011,2011
773B216E,A new species of hydrobiid snails ...,a new species of hydrobiid snails moll ...,2011,2011/10/19
77C44F83,Revision of the planthopper genus ...,revision of the planthopper genus ...,2014,2014/10/12
75233F3E,Female genitalia of Seasogonia Young from ...,female genitalia of seasogonia young from ...,2012,2012/11/01
7B5321C5,A taxonomic study on the genus Ettchellsia ...,a taxonomic study on the genus ettchellsia cam ...,2012,2012
79BD8F37,Open exchange of scientific knowledge and ...,open exchange of scientific knowledge and ...,2014,2014/06/06
778DE072,"First report on C-banding, fluorochrome ...",first report on c banding fluorochrome staining ...,2013,2013/07/30
76034B62,"Checklist of the families Scathophagidae, Fanni ...",checklist of the families scathophagidae fanniidae ...,2014,2014/09/19
781099DE,Checklist of the family Culicidae (Diptera) in ...,checklist of the family culicidae diptera in ...,2014,2014
770A539F,"Two new species of harvestmen (Opiliones, ...",two new species of harvestmen opiliones ...,2014,2014/08/14

Paper Document Object Identifier (DOI) ...,Original venue name,Normalized venue name,Journal ID mapped to venue name ...,Conference ID mapped to venue name ...
10.3897/zookeys.130.1401,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.138.1927,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.462.6657,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.164.2132,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.254.4182,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.414.7717,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.319.4265,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.441.7142,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.441.7743,ZooKeys,zookeys,0BDFC074,
10.3897/zookeys.434.7486,ZooKeys,zookeys,0BDFC074,

Paper rank,Ref. Number,Total Citations by Year,Total Citations by Year without Self Citations ...,Authors List Sorted
19382,7.0,{'2015': 1.0},{'2015': 1.0},"[855C02FD, 7B2C4199]"
19402,12.0,"{'2015': 3.0, '2014': 3.0, '2011': 1.0, '20 ...","{'2015': 1.0, '2014': 1.0, '2011': 1.0, '20 ...",[8439D30B]
19555,,,,"[84C55AD3, 7E095DC6]"
19427,6.0,,,"[805137B8, 80CA5307]"
19370,4.0,,,"[80F9A983, 7FFB2555]"
19321,5.0,"{'2015': 3.0, '2014': 1.0} ...",{'2015': 2.0},"[7237B1F9, 7DAD3B1C, 78F96D88, 78A6ED0B] ..."
19485,15.0,"{'2015': 1.0, '2014': 1.0} ...","{'2015': 1.0, '2014': 1.0} ...","[7FDB7566, 80418DAE]"
19404,5.0,,,"[78FFCF1E, 7CD385BA]"
19424,6.0,{'2015': 1.0},{'2015': 1.0},[7D2454E0]
19404,5.0,,,"[850AEC5F, 8306860B]"

Keywords List,Field of study list,Field of study list names,Fields of study parent list (L0) ...
"[tertiary, biomedical research, neogene, ...","[009377C6, 0660586C, 01A380F9, 039D5C06] ...","[Tertiary, None, Neogene, Bioinformatics] ...",[]
"[biomedical research, bioinformatics] ...","[0660586C, 039D5C06]","[None, Bioinformatics]",[]
,,,
"[morphology, taxonomy]","[06A2C3F5, 037ECF39]","[Morphology, Taxonomy]",[052C8328]
"[taxonomy, bioinformatics, ...","[039D5C06, 037ECF39, 0660586C] ...","[Bioinformatics, Taxonomy, None] ...",[052C8328]
"[taxonomy, intellectual property rights, ...","[037ECF39, 0215A9CE, 039D5C06, 0660586C] ...","[Taxonomy, Intellectual property, Bioinformat ...","[0895A350, 052C8328]"
"[biomedical research, bioinformatics] ...","[0660586C, 039D5C06]","[None, Bioinformatics]",[]
"[bioinformatics, biomedical research] ...","[039D5C06, 0660586C]","[Bioinformatics, None]",[]
"[bioinformatics, biomedical research] ...","[039D5C06, 0660586C]","[Bioinformatics, None]",[]
[taxonomy],[037ECF39],[Taxonomy],[052C8328]

Fields of study parent list names (L0) ...,Fields of study parent list (L1) ...,Fields of study parent list names (L1) ...,Fields of study parent list (L2) ...
[],"[090B39EA, 039D5C06]","[Paleontology, Bioinformatics] ...",[0683829C]
[],[039D5C06],[Bioinformatics],[]
,,,
[Biology],[027F4522],[Linguistics],"[037ECF39, 06A2C3F5]"
[Biology],[039D5C06],[Bioinformatics],[037ECF39]
"[Sociology, Biology]","[0BE4BA29, 039D5C06]","[Law, Bioinformatics]",[037ECF39]
[],[039D5C06],[Bioinformatics],[]
[],[039D5C06],[Bioinformatics],[]
[],[039D5C06],[Bioinformatics],[]
[Biology],[],[],[037ECF39]

Fields of study parent list names (L2) ...,Authors Number,Urls,abstract
[Stratigraphy],2,[/pmc/articles/instance/3 260767/?report=abstra ...,The first fossil that was described in ...
[],1,[http://bionames.org/refe rences/b38c8055b3453e ...,Anew minute valvatiform species belonging to the ...
,2,[http://www.researchgate. net/publication/27090 ...,"Chinese species in the genus Nycheuma Fennah, ..."
"[Taxonomy, Morphology]",2,[/pmc/articles/instance/3 272620/?report=abstra ...,"Seasogonia Young, 1986 is a sharpshooter genus ..."
[Taxonomy],2,[http://bionames.org/refe rences/4078cc33be92d7 ...,"Three new species of Ettchellsia Cameron, ..."
[Taxonomy],4,[http://advocomplex.ch/fi les/Open%20exchange%2 ...,Background. The 7(th) Framework Programme for ...
[],2,"[/pmc/articles/PMC3764527 /?report=abstract, ht ...",In spite of various cytogenetic works on ...
[],2,"[http://europepmc.org/abs tract/MED/25337032, h ...","A revised checklist of the Scathophagidae, ..."
[],1,[http://europepmc.org/art icles/PMC4200447?pdf= ...,A checklist of the Culicidae (Diptera) ...
[Taxonomy],2,[http://espace.library.cu rtin.edu.au/R?func=dbin- ...,Neopilionidae: Enantiobuninae) are ...

authors,Aminer Paper ID,isbn,issn,issue,keywords,lang
"[{'org': 'Institute of Biology, Pedagogical ...",55a4881d65ce31bc877df20d,,1313-2970,130,"[cheb basin, cypris formation, czech ...",en
"[{'org': 'Department of Ecology and Systematics, ...",55a47a6865ce31bc877c5f09,,1313-2970,138,"[caenogastropoda, daphniola eptalophos sp. ...",en
[{'org': 'The Special Key Laboratory for ...,55a6b03265ce054aad712ec1,,1313-2989,462,"[delphacini, fulgoroidea, hemiptera, nycheuma, new ...",en
"[{'org': 'Institute of Entomology, Guizhou ...",55a4953165ceb7cb02d2d096,,1313-2970,164,"[auchenorrhyncha, cicadellinae, ...",en
"[{'org': 'Laboratory of Entomology, Faculty of ...",55a51ace65ceb7cb02e11911,,1313-2989,254,"[south east asia, taxonomy, parasitic ...",en
"[{'org': 'Plazi, Zinggstrasse 16, 3007 ...",55a676ab65ce054aad68df46,,1313-2989,414,"[biodiversity knowledge, european copyright, ...",en
"[{'org': 'Department of Entomology, Y.S. Parmar ...",55a5ce0d65ce60f99bf5c02d,,1313-2989,319,"[c-banding, cma3, dapi, nor location, ...",en
"[{'org': 'Finnish Museum of Natural History, ...",55a69aa665ce054aad6d871f,,1313-2989,441,"[diptera, finland, species list, ...",en
"[{'org': 'Finnish Museum of Natural History, ...",55a69aa665ce054aad6d8706,,1313-2989,441,"[checklist, culicidae, diptera, finland, ...",en
[{'org': 'Dept of Environment and ...,55a688db65ce054aad6ae23a,,1313-2989,434,"[taxonomy, arachnids, cave biota] ...",en

n_citation,page_end,page_start,pdf,references,title,...
3.0,305,299,,"[53e99c60b7602d9702511119 , ...","Ptychoptera deleta Novák, 1877 from the ...",...
,64,53,,"[56d8e4efdabfae2eee2affe9 , ...",A new species of hydrobiid snails ...,...
,57,47,,,Revision of the planthopper genus ...,...
,40,24,,"[56d92143dabfae2eee9f1671 , ...",Female genitalia of Seasogonia Young from ...,...
,108,99,,"[53e99ddab7602d970269b655 , ...",A taxonomic study on the genus Ettchellsia ...,...
9.0,135,109,,"[55a4825f65ce31bc877d426d , ...",Open exchange of scientific knowledge and ...,...
,291,283,,"[56d8b9c0dabfae2eee2562bf , ...","First report on C-banding, fluorochrome ...",...
,367,347,,"[56d90738dabfae2eeefe3d12 , ...","Checklist of the families Scathophagidae, Fanni ...",...
,51,47,,"[56d8d4a0dabfae2eeec120ac , ...",Checklist of the family Culicidae (Diptera) in ...,...
,45,37,,"[56d88f6bdabfae2eeedadea4 , ...","Two new species of harvestmen (Opiliones, ...",...


# 2. Loading the Dataset to MongoDB

Turicreate's SFrame objects let us extract general statistics on how academic publication dynamics have changed over time, but they make more involved analyses, such as tracking the trends of a specific journal, challenging. To support these more complex queries, we need to load the dataset into a different framework. In this study, we chose MongoDB.
We installed MongoDB on Ubuntu 17.10 by following [these instructions](https://medium.com/gatemill/how-to-install-mongodb-3-6-on-ubuntu-17-10-ac0bc225e648). Once MongoDB is installed and running, remember to set a user and password, and update the MONGO_HOST and MONGO_PORT variables in consts.py (you can also adjust the connection to include username and password authentication).
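
If you enable authentication, the connection settings can be combined into a standard MongoDB connection URI. The following is a minimal sketch, assuming Python 3. The helper `build_mongo_uri` and the placeholder values are illustrative and are not part of the project's code; only the names MONGO_HOST and MONGO_PORT come from consts.py.

```python
from urllib.parse import quote_plus

# Placeholder values standing in for the settings defined in consts.py.
MONGO_HOST = "localhost"
MONGO_PORT = 27017


def build_mongo_uri(host, port, user=None, password=None):
    """Return a MongoDB connection URI, with credentials if both are given.

    Hypothetical helper for illustration; not part of mongo_connector.py.
    """
    if user and password:
        # Percent-escape credentials so characters such as '@' or ':' stay valid.
        return "mongodb://%s:%s@%s:%d/" % (
            quote_plus(user), quote_plus(password), host, port)
    return "mongodb://%s:%d/" % (host, port)


if __name__ == "__main__":
    print(build_mongo_uri(MONGO_HOST, MONGO_PORT))
    print(build_mongo_uri(MONGO_HOST, MONGO_PORT, "reader", "p@ss:word"))
```

The resulting string can be passed to `pymongo.MongoClient(uri)` in place of separate host and port arguments.
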
The next step is to load the SFrames created above into MongoDB collections using mongo_connector.py:

In [9]:
from mongo_connector import *
load_sframes()  # loads the SFrames into the local MongoDB instance
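
For readers curious about what such a load involves, the sketch below shows one common pattern: inserting rows in fixed-size batches so that a large table never has to be held in memory as a single list. It is not the implementation of `load_sframes()`. The helpers `iter_batches` and `insert_rows` and the in-memory stand-in collection are hypothetical names introduced here. It assumes rows arrive as dictionaries, which is what iterating over an SFrame yields.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_rows(collection, rows, batch_size=1000):
    """Insert dict rows into a collection-like object in batches.

    Returns the number of rows inserted. `collection` only needs an
    insert_many(list_of_dicts) method, as pymongo collections provide.
    """
    inserted = 0
    for batch in iter_batches(rows, batch_size):
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


class InMemoryCollection(object):
    """Stand-in for a MongoDB collection so the sketch runs without a server."""

    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)


if __name__ == "__main__":
    demo = InMemoryCollection()
    rows = ({"paper_id": i} for i in range(2500))
    print(insert_rows(demo, rows, batch_size=1000))
```

With a real database, `collection` would be a pymongo collection object and `rows` the SFrame itself.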

At the end of the loading process, six collections will have been loaded into the journals database.

In [10]:
MD.client.journals.collection_names()

[u'authoros_features',
 u'sjr_journals',
 u'aminer_mag_papers',
 u'fields_of_study_papers',
 u'papers_features',
 u'authors_features']
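
Note that `collection_names()` was deprecated in later PyMongo 3.x releases and removed in PyMongo 4.0, where `list_collection_names()` takes its place. The sketch below is one way to support both. The helper `list_collections` and the two stand-in classes are hypothetical and exist only so the example runs without a server.

```python
def list_collections(db):
    """Return sorted collection names from a pymongo-style database object."""
    # Check the class rather than the instance: PyMongo Database objects
    # treat unknown attribute names as collection lookups.
    if hasattr(type(db), "list_collection_names"):
        return sorted(db.list_collection_names())
    return sorted(db.collection_names())


class NewStyleDb(object):
    """Stand-in mimicking a database object from a recent PyMongo release."""

    def list_collection_names(self):
        return ["sjr_journals", "authors_features"]


class OldStyleDb(object):
    """Stand-in mimicking a database object from an older PyMongo release."""

    def collection_names(self):
        return ["sjr_journals", "authors_features"]


if __name__ == "__main__":
    print(list_collections(NewStyleDb()))
    print(list_collections(OldStyleDb()))
```

In the notebook, the call would be `list_collections(MD.client.journals)`.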

In the second part of the tutorial, we will demonstrate how these MongoDB collections can be used to calculate various statistics on paper collections, authors, journals, and research domains.
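
As a small illustration of the kind of journal-level query that MongoDB makes straightforward, the sketch below builds an aggregation pipeline that counts a journal's papers per year. The field names `journal_id` and `year` are assumptions made for this example, as is the helper `papers_per_year_pipeline`. The actual field names in the collections may differ, so adjust them to match the loaded documents.

```python
def papers_per_year_pipeline(journal_id, journal_field="journal_id",
                             year_field="year"):
    """Build a MongoDB aggregation pipeline counting papers per year.

    The default field names are hypothetical placeholders.
    """
    return [
        # Keep only the papers published in the requested journal.
        {"$match": {journal_field: journal_id}},
        # Group the remaining papers by year and count them.
        {"$group": {"_id": "$" + year_field, "n_papers": {"$sum": 1}}},
        # Return the years in ascending order.
        {"$sort": {"_id": 1}},
    ]


if __name__ == "__main__":
    # 0BDFC074 is the journal ID shown for ZooKeys in the table above.
    for stage in papers_per_year_pipeline("0BDFC074"):
        print(stage)
```

Assuming the field names match, the pipeline could be run with `MD.client.journals.aminer_mag_papers.aggregate(pipeline)`.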