Author Identification through Stylometric Analysis


  • Machine Learning


Mentors :

  • Mani Praneeth Chilukuri

Mentees :

  • 5


Every author (and for that matter, any person who writes) has a unique writing style, and this style shows up most clearly in the way they use small function words, such as articles, prepositions and conjunctions (I guess my own writing style is mostly long sentences made of comma-separated smaller sentences :P ). A survey of historical and current stylometric methods points out that function words are “used in a largely unconscious manner by the authors, and they are topic-independent.” In simple words, writing style is to some extent independent of the topic, and human readers can recognise it. We intend to make a program recognise it too. The proposed solution is to look at a few works by different authors, analyse them for some metric, and then test our program on other works to determine their author.
A little more information (which you can very well ignore if the paragraph above already grabbed your attention): We will start the project with three well-known approaches, viz. Mendenhall's Characteristic Curves of Composition, Kilgarriff's Chi-Squared Method and John Burrows' Delta Method. Then we shall move on to building a machine learning model, experimenting with different models and evaluating their accuracy, to arrive at the one that does the job best.
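To give a feel for what such a method looks like in code, here is a minimal sketch of the Delta idea: take the most frequent words of the combined corpus (in real texts these are mostly function words), turn each author's usage into z-scores, and pick the author whose z-scores lie closest to those of the disputed text. The regex tokenizer, the function names, the `n_features` default and the two toy "authors" are all assumptions made for illustration, not part of the project plan.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    # Crude lowercase word tokenizer (an assumption; a proper tokenizer could be swapped in).
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, features):
    # Share of the text taken up by each feature word.
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in features}


def burrows_delta(known_texts, unknown_text, n_features=30):
    """Return {author: delta}; the smallest delta marks the closest style."""
    if len(known_texts) < 2:
        raise ValueError("need at least two candidate authors")
    tokens = {a: tokenize(t) for a, t in known_texts.items()}

    # Features: the most frequent words of the combined corpus.
    corpus = Counter()
    for toks in tokens.values():
        corpus.update(toks)
    features = [w for w, _ in corpus.most_common(n_features)]

    # Mean and sample standard deviation of each feature across authors.
    freqs = {a: relative_freqs(toks, features) for a, toks in tokens.items()}
    mu = {w: mean([freqs[a][w] for a in freqs]) for w in features}
    sigma = {w: stdev([freqs[a][w] for a in freqs]) for w in features}

    # Words used identically by everyone carry no signal (and would divide by zero).
    features = [w for w in features if sigma[w] > 0]
    if not features:
        raise ValueError("no feature word varies between authors")

    def zscores(f):
        return {w: (f[w] - mu[w]) / sigma[w] for w in features}

    target = zscores(relative_freqs(tokenize(unknown_text), features))
    deltas = {}
    for a in freqs:
        z = zscores(freqs[a])
        # Delta: mean absolute difference between the two z-score profiles.
        deltas[a] = mean([abs(target[w] - z[w]) for w in features])
    return deltas


# Toy data, made up for this sketch: the two "authors" differ only in function words.
known = {
    "author_a": "the cat sat on the mat and the dog sat on the rug " * 5,
    "author_b": "a cat sat upon a mat but a dog sat upon a rug " * 5,
}
unknown = "the bird sat on the branch and the fox sat on the hill"
deltas = burrows_delta(known, unknown)
print(min(deltas, key=deltas.get), deltas)
```

Note that the content words (bird, branch, fox, hill) play no part in the decision; only the small words do, which is the whole point of the approach. Real use would need far longer texts and more candidate authors than this toy example.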
Prerequisites:
Any experience in Python will be appreciated but is not necessary. What you do need is enthusiasm to explore machine learning (it doesn't matter if it is your first time).


Tentative Timeline :

Week       Work
Week 1     Get used to Python and the nltk module.
Week 2     Collect data: gather written texts to analyse and pre-process them. Study the different stylometric tests.
Week 3-4   Implement these stylometric tests on the data we collected and pre-processed.
Week 5-6   Learn about logistic regression, feature extraction, bag-of-words, RNNs and Transformers. Build a model to classify the authorship of the texts.
Week 7     Evaluate our model using metrics such as F-scores. Analyse the results of our model, interpret them and draw conclusions, similar to what we did in week 4.
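For the evaluation step, the F-score of one author is the harmonic mean of precision and recall for that author, which works out to 2·TP / (2·TP + FP + FN). The sketch below computes it per author and then averages over authors (macro F1). The function names and the toy labels are hypothetical; in practice a library routine such as scikit-learn's `f1_score` would likely be used instead.

```python
from collections import Counter


def f1_per_author(true_labels, predicted_labels):
    """Return {author: F1} computed from two equally long label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1   # correctly attributed text
        else:
            fp[guess] += 1   # text wrongly credited to `guess`
            fn[truth] += 1   # text of `truth` that was missed
    scores = {}
    for author in set(true_labels) | set(predicted_labels):
        denom = 2 * tp[author] + fp[author] + fn[author]
        scores[author] = 2 * tp[author] / denom if denom else 0.0
    return scores


def macro_f1(true_labels, predicted_labels):
    # Unweighted average, so every author counts equally however many texts they have.
    scores = f1_per_author(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


# Toy labels, made up for this sketch.
truth = ["author_a", "author_a", "author_b", "author_b"]
guess = ["author_a", "author_b", "author_b", "author_b"]
scores = f1_per_author(truth, guess)
overall = macro_f1(truth, guess)
print(scores, overall)
```

Macro averaging is only one choice; if some authors have far more texts than others, a weighted average may describe the model better.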