A SEMINAR REPORT ON
SHORT TEXT CLASSIFICATION OF BBC NEWS

SUBMITTED TO
SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
In partial fulfillment for the Degree of
Master of Technology
(COMPUTER ENGINEERING)

BY
DIVYA KISHOR PUNWANTWAR
Exam No: 218M0065

VISHWAKARMA INSTITUTE OF INFORMATION TECHNOLOGY, PUNE
(An Autonomous Institute affiliated to Savitribai Phule Pune University)
COMPUTER ENGINEERING DEPARTMENT
April-May 2019

DEPARTMENT OF COMPUTER ENGINEERING
Vishwakarma Institute of Information Technology
(An Autonomous Institute affiliated to Savitribai Phule Pune University)

CERTIFICATE

This is to certify that the seminar-I report entitled

SHORT TEXT CLASSIFICATION OF BBC NEWS

Submitted by
Divya Kishor Punwantwar
Exam No: 218M0065

is a bonafide work carried out by the student under the supervision of Prof. Kirti H. Wanjale, and it is submitted towards the partial fulfillment of the requirement of Savitribai Phule Pune University, Pune for the award of the degree of Master of Technology (Computer Engineering).

Prof. Kirti H. Wanjale
Internal Guide
Department of Computer Engineering
VIIT, Pune

Dr. S. R. Sakhare
Head of Department
Department of Computer Engineering
VIIT, Pune

Dr. B. S. Karkare
Director, VIIT, Pune

Seal/Stamp of the College
Place: Pune
Date:

Abstract

Short text differs substantially from traditional long text documents: its brevity and conciseness obstruct the application of conventional machine learning and data mining algorithms to short text classification.
Following traditional artificial intelligence methods, short text classification can be divided into three steps: pre-processing, feature selection, and classifier comparison. For feature selection, we evaluated the performance and robustness of TF-IDF weighting, and we deliberately chose Naive Bayes as the classifier. We then compared and analysed the classifiers horizontally with each other and vertically across feature selections.

With the rapid growth in the volume of short text, how to classify it automatically and effectively in the information domain has become a problem that needs to be solved. Given the characteristics of short text, we propose Naive Bayes, a classification algorithm based on improving currently integrated classifiers. The traditional Naive Bayes classifier is used as the base classifier to train the classification models. Compared with several individual classifiers, our Naive Bayes method achieves excellent results on a variety of classification evaluation measures. On this basis, the BBC news dataset is classified using the Naive Bayes algorithm. Many people read BBC news, but each reader has different interests, such as technology, sports, business, politics, and entertainment.

Acknowledgement

It is a matter of great pleasure for me to submit this seminar report on "SHORT TEXT CLASSIFICATION OF BBC NEWS" as a part of the curriculum for the Master of Technology (Computer Engineering) of Savitribai Phule Pune University. I am thankful to my guide, Prof. Kirti H. Wanjale of the Computer Engineering Department, for constant encouragement and able guidance. I am also thankful to Dr. B. S. Karkare, Director of VIIT Pune, and Dr. S. R. Sakhare, Head of the Computer Department, for their valuable support.

I take this opportunity to express my deep sense of gratitude towards those who have helped me in various ways in preparing my seminar. Last but not least, I am thankful to my parents, who have encouraged and inspired me with their blessings.

Divya Kishor Punwantwar

Contents

Certificate
Abstract
Acknowledgement
1 Introduction
  1.1 Background
  1.2 Motivation and Social Impact
  1.3 Objectives and Outcomes
  1.4 Mathematical model of problem solved
2 Literature Survey
  2.1 Existing Techniques
    2.1.1 Technique 1: An Improved Information Retrieval Approach to Short Text Classification
    2.1.2 Technique 2: Short Text Classification Improved by Learning Multi-Granularity Topics
3 Implementation
  3.1 Flow of Work
  3.2 Data collection and Data sets
  3.3 Software requirement
  3.4 Results obtained
4 Results and Discussion
  4.1 Discussion on Result Obtained
  4.2 Comparison of Results (with other researchers)
5 Conclusion and Future Work
A Annexure
  A.1 Source Code and Screenshots
  A.2 Plagiarism report
Bibliography

List of Figures

Figure 1.1.1: Short Text Classification
Figure 1.4.1: Confusion Matrix
Figure 1.4.2: Accuracy
Figure 1.4.3: Precision and Recall
Figure 1.4.4: F1-Score

Chapter 1
Introduction

1.1 Background

Online social media and news have recently emerged as a medium of information sharing and communication. Blogging, status updates, social networking, watching the news, and video sharing are some of the ways in which people do this. Popular online social media such as Facebook, Orkut, and Twitter, and news sites such as BBC News, CNN, and FOX News, allow users to post or read short messages on their homepage. The former are often referred to as micro-blogging sites, and the message is called a status update.
Updates from the BBC channel are more commonly called news, and they fall into different categories of data. News is often related to some event and is organised by topic of interest, such as business, technology, entertainment, personal thoughts, and opinions. News can contain text, emotion, links, or a combination of these. News has recently gained a lot of importance due to its ability to disseminate information rapidly.

Figure 1.1.1: Short Text Classification

1.2 Motivation and Social Impact

The BBC news dataset is classified to make the news easier for users to find and understand. Classifying data manually into categories is easy only when the dataset is very small; for a dataset with a large number of items, manual categorization is not practical. Because classifying a large dataset by hand is clumsy and tricky, algorithms are used for short text classification. This work proposes to classify BBC news, which has multiple classes and multiple labels.

1.3 Objectives and Outcomes

The short text classification task consists of learning models for a given set of classes and applying these models to new, unseen documents to assign them a class. It is mainly a supervised classification task, in which a training set consisting of documents with already assigned classes is provided and a testing set is used to evaluate the models. Short text classification is shown in Figure 1.1.1, including the pre-processing steps, which consist of document representation and space reduction/feature extraction, and the learning/evaluation procedure, such as Naive Bayes. Great relevance has deservedly been given to learning procedures in short text classification. However, there must be a pre-processing stage before the learning process.
Pre-processing alters the input space used to represent the documents that are ultimately included in the training and testing sets, which machine learning algorithms use to learn classifiers that are then evaluated.

1.4 Mathematical model of problem solved

Confusion Matrix:

Figure 1.4.1: Confusion Matrix

The four cells of the confusion matrix are explained below using a disease test as the example:

• True Positive (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
• True Negative (TN): We predicted no, and they don't have the disease.
• False Positive (FP): We predicted yes, but they don't actually have the disease. This is also known as a "Type I error."
• False Negative (FN): We predicted no, but they actually do have the disease. This is also known as a "Type II error."

Accuracy:

Figure 1.4.2: Accuracy

Accuracy is a measure of the degree of closeness of a measured or calculated value to its actual value. For a classifier it is the fraction of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision and Recall:

Figure 1.4.3: Precision and Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations: Precision = TP / (TP + FP). Recall is the ratio of correctly predicted positive observations to all observations in the actual class (yes): Recall = TP / (TP + FN).

F1-Score:

Figure 1.4.4: F1-Score

The F1 Score is the harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall). Therefore, this score takes both false positives and false negatives into account.

Chapter 2
Literature Survey

2.1 Existing Techniques

2.1.1 Technique 1: An Improved Information Retrieval Approach to Short Text Classification

Twitter acts as an important medium of information sharing and communication. Because tweets are limited to 140 characters, they do not provide sufficient word occurrences, and classification methods that use traditional approaches such as Bag-of-Words have limitations. The proposed system uses an intuitive approach to determine the class labels with a set of features.
The system can classify incoming tweets into three generic categories: News, Movies, and Sports. These categories are diverse and cover most of the topics that people usually tweet about. Experimental results show that the proposed technique outperforms the existing models in terms of accuracy, precision, and recall.

2.1.2 Technique 2: Short Text Classification Improved by Learning Multi-Granularity Topics

Understanding the rapidly growing volume of short text is essential. Short text differs from traditional documents in its sparsity and shortness, which hinders the application of conventional text mining and machine learning algorithms. Two major approaches have been exploited to enrich the representation of short text. One is to fetch contextual information for a short text in order to directly add more text. The other is to derive latent topics from an existing large corpus and use them as features to enrich the representation of the short text.

The latter approach is elegant and efficient in most cases. However, topics of a single granularity are usually not sufficient to set up effective feature spaces. The authors move forward along this direction by proposing a method that leverages topics at multiple granularities, which can model short text more precisely.

Chapter 3
Implementation

3.1 Flow of Work

STEP 1: The features extracted for the classes are stored in files.
STEP 2: The BBC news item to be classified and the feature sets are fed into the system.
STEP 3: The news item is then disambiguated. Disambiguation involves tokenizing the news, making the tokens case-less, removing stop words, lemmatizing the tokens using WordNet, stemming the tokens, and finally Part-of-Speech (POS) tagging the stemmed tokens.
STEP 4: A loop executes over each word in the news item. A POS-tagged word is selected and all senses of that word are retrieved.
STEP 5: If a retrieved sense is not a noun or a verb, it is ignored and the loop skips to the next sense.
STEP 6: A loop runs over all other words in the same news item to find their senses.
STEP 7: The definitions of all the senses are then derived from WordNet.
STEP 8: The senses of the selected word are compared with the senses of the remaining words. An overall score is computed and the maximum score is kept for further processing.
STEP 9: The senses that give these maximum scores are returned.
STEP 10: Steps 4 to 9 are also executed on the feature sets.
STEP 11: The senses of the feature sets and of the words of the news item are then compared.
STEP 12: The feature set with the maximum similarity to the news item is considered the correct feature set. The class of that feature set is extracted and the news item is assigned to that class.

3.2 Data collection and Data sets

The dataset has one top-level folder named BBC, which contains five category folders: Entertainment, Technology, Business, Politics, and Sports. Each category folder contains the news files of that category, stored as the text of BBC news articles.

3.3 Software Requirements

1) Language Used: Python

Python is a high-level, interpreted, general-purpose programming language. It was created by Guido van Rossum and first released in 1991.

It is used for web development (server-side), software development, mathematics, system scripting, and more.

What can Python do? Python can be used on a server to create web applications. It can be used alongside other software to create workflows. Python can connect to database systems, and it can also read and modify files. Python can be used to handle big data and perform complex mathematics. Python can be used for rapid prototyping or for production-ready software development.

Why Python? Python works on different platforms, such as Windows, Mac, Linux, and Raspberry Pi.
It has a simple syntax that is similar to the English language, and that syntax allows developers to write programs in fewer lines than some other programming languages. It runs on an interpreter system, which means that code can be executed as soon as it is written, so prototyping can be very quick. Python can be written in a procedural way, a functional way, or an object-oriented way.

2) Platform: Jupyter Notebook

Jupyter Notebook is an open-source web application that allows users to create and share documents containing equations, visualizations, live code, and narrative text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and machine learning.

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results.

The Jupyter Notebook has two components:

Web application: a browser-based tool for interactive authoring of documents that combine explanatory text, mathematics, computations, and their rich media output.

Notebook documents: a representation of all content visible in the web application, including the inputs and outputs of the computations, explanatory text, images, mathematics, and rich media representations of objects.

3.4 Results obtained

• Labels and their counts:
• Testing data:
• Tfidf Vectorizer:
• Result as accuracy using the Naive Bayes algorithm, with precision, recall, f1-score, and support:

Chapter 4
Results and Discussion

4.1 Discussion on Result Obtained

• News items are harder to classify than a larger corpus of text.
Here, news is classified efficiently based on some attributes.
• Because of this classification, it is easy to find news related to a given topic.
• Short news text is hard to classify primarily because there are few word occurrences, which makes it difficult to capture the semantics of such messages.
• Hence, traditional approaches do not perform as well as expected when applied to news classification. The method used here is supervised, since it requires labelled news as its source of training data.
• Using the Naive Bayes algorithm, the classifier achieves an accuracy of about 96%.

4.2 Comparison of Results (with other researchers)

Existing work on short text classification targets Twitter tweets, with categories such as movies, news, and sports. That work uses the Naive Bayes algorithm for short text classification and reports an accuracy of 60%. In comparison, our short text classification of BBC news reaches almost 96% accuracy using the same Naive Bayes algorithm. Our dataset has the categories business, sport, politics, entertainment, and technology, each of which contains a set of text files.

Chapter 5
Conclusion and Future Work

For short text classification, we used the BBC news dataset, which contains several categories, each holding the text files related to that category. We classified the BBC news across these categories using the Naive Bayes algorithm and calculated the accuracy, precision, recall, f1-score, and support. Compared with existing short text classification, we obtained the highest accuracy, almost 96%.

Future work is to classify the BBC news using other classification algorithms, calculate precision, recall, support, and f1-score for each, and try to achieve the highest possible accuracy.

Appendix A
Code

A.1 Source Code and Screenshots

A.2 Plagiarism report

Bibliography

python-and-nltk-c52b92a7c73a
naive-bayes-with-python-421db3a72d34
and-implement-text-classification-in-python/
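
Illustrative sketch: the report describes TF-IDF weighting, a Naive Bayes classifier, and evaluation by accuracy, precision, recall, and F1-score. The following is a minimal sketch of how these pieces could fit together using only the Python standard library. It is not the code used to obtain the reported results. The class name TfidfNaiveBayes, the function names, the smoothed IDF formula, and the toy headlines are all assumptions made for illustration, and the tokenizer is a simplified stand-in for the pre-processing described in Chapter 3.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep alphabetic tokens only; a simplified stand-in for
    # fuller pre-processing (stop-word removal, lemmatizing, stemming).
    return re.findall(r"[a-z]+", text.lower())


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def fit(self, texts, labels, alpha=1.0):
        docs = [Counter(tokenize(t)) for t in texts]
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in doc)
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        weights = defaultdict(Counter)  # class -> term -> summed TF-IDF weight
        for doc, label in zip(docs, labels):
            for term, tf in doc.items():
                weights[label][term] += tf * self.idf[term]
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        self.log_like = {}
        for c in class_counts:
            total = sum(weights[c].values()) + alpha * len(self.idf)
            self.log_like[c] = {t: math.log((weights[c][t] + alpha) / total)
                                for t in self.idf}
        return self

    def predict(self, text):
        doc = Counter(tokenize(text))
        scores = {}
        for c, prior in self.log_prior.items():
            # Terms never seen in training are skipped.
            scores[c] = prior + sum(tf * self.idf[t] * self.log_like[c][t]
                                    for t, tf in doc.items() if t in self.idf)
        return max(scores, key=scores.get)


def classification_metrics(y_true, y_pred):
    """Return (accuracy, {class: (precision, recall, f1)})."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = (precision, recall, f1)
    return accuracy, per_class


# Hypothetical toy headlines, invented for illustration (not the BBC dataset).
train = [
    ("shares rise as bank profits grow", "business"),
    ("market shares fall after bank report", "business"),
    ("team wins the cup final match", "sport"),
    ("striker scores in the final match", "sport"),
    ("new phone software update released", "tech"),
    ("software firm launches new phone", "tech"),
]
test = [
    ("bank shares grow", "business"),
    ("cup match final", "sport"),
    ("phone software", "tech"),
]

model = TfidfNaiveBayes().fit([t for t, _ in train], [c for _, c in train])
predicted = [model.predict(t) for t, _ in test]
accuracy, per_class = classification_metrics([c for _, c in test], predicted)
print("accuracy:", accuracy)
for label, (p, r, f1) in per_class.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On real data the same flow would read the category folders of the dataset, hold out a testing split, and report the per-class measures. A library such as scikit-learn provides equivalent building blocks, but that substitution is likewise an assumption and not something this report specifies in detail.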