Introduction to NLP
Natural Language Processing (NLP) is the field of computer science concerned with developing algorithms for the analysis of human languages. Artificial Intelligence approaches (e.g., machine learning) have been used to solve many NLP tasks, such as parsing, POS tagging, named entity recognition, word sense disambiguation, document classification, machine translation, textual entailment, question answering, and summarization. Natural languages are notoriously difficult for machines to understand and model, mostly because of ambiguity (e.g., humor, sarcasm, puns), lack of clear structure, and diversity (e.g., models for English are not directly applicable to Chinese). Even so, in recent years we’re witnessing rapid progress in the field of NLP thanks to deep learning models, which are becoming more and more complex and better able to capture the subtleties of human languages.
Introduction to NLP
To do this, we first had to clean and understand the text. We had to find a way to split the data and form a data frame consisting of the following columns: News_Text, News_Number, News_Type. The data had many filler words that had to be removed, and some rows where the news number and type were missing. To clean the data, we removed the fillers using NLTK stop-word filtering. We then tokenized the data using word_tokenize from the nltk package. The next important step was to lemmatize/stem the data to strip tense from the words and normalize them. Even though it was a time-consuming process, the results were promising. XGBoost can handle an imbalanced dataset through the scale_pos_weight parameter. We can also adjust the probability threshold to increase sensitivity at the cost of a reasonable amount of specificity. Evaluation: This process is tricky for the training data set provided, as the data was highly imbalanced: the dependent feature/variable had imbalanced classes
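The two imbalance-handling ideas mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the labels and probabilities are made up, the helper names are hypothetical, and the negatives-to-positives ratio is a commonly used starting value for scale_pos_weight rather than the setting the authors necessarily chose.

```python
# Hypothetical labels (1 = minority class) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.15, 0.25, 0.40, 0.80]


def scale_pos_weight(labels):
    """Common heuristic: number of negatives divided by number of positives."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


def sensitivity_specificity(labels, probs, threshold):
    """Predict positive when prob >= threshold, then return (sensitivity, specificity)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


weight = scale_pos_weight(y_true)                        # 8 negatives / 2 positives
default = sensitivity_specificity(y_true, y_prob, 0.5)   # default cut-off
lowered = sensitivity_specificity(y_true, y_prob, 0.4)   # lower cut-off

print(weight, default, lowered)
```

On this toy data, lowering the threshold from 0.5 to 0.4 catches both positive examples but lets one false positive through, which is the sensitivity-for-specificity trade described above.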
We are back to participate in another Datathon hosted by Data Science Society. This time the theme is Text Analytics.
We will not be able to devote ourselves completely to the cause this time because of exams that start next week. But we’ll try to keep the article as simple and well detailed as possible, so that it will be helpful for any new Data Science enthusiast looking for a little help. So let's roll.
Propaganda is a form of communication aimed at influencing the attitude of a community toward some cause or position. It often presents facts selectively to encourage a particular synthesis. Disinformation damages the reputation of respectable news outlets and organisations, and is very bad for business. The objective of the Hackathon is to distinguish propaganda from non-propaganda news and to develop a model that can help with that venture. The other objectives of this work include detecting phrases that are propagandistic and identifying the type of propaganda used. The algorithms we draw on are Passive Aggressive, Multilayer Perceptron, Logistic Regression, AdaBoost, Decision Tree, Random Forest, KNN, SVM and Naive Bayes, used to detect potentially propagandistic and non-propagandistic sentences in a news article. For the evaluation, we calculate the F1 score to account for the class imbalance in the testing dataset. We have used the best model for detecting propagandistic and non-propagandistic articles and phrases, and also the type of propaganda.
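To see why the F1 score is a sensible choice on imbalanced data, here is a minimal sketch that computes it by hand. The labels and predictions are invented for illustration, and the function names are hypothetical; none of this reproduces the authors' results.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical test set: 8 non-propaganda (0) and 2 propaganda (1) sentences.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10                        # always predicts "non-propaganda"
some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]    # finds one of the two positives

accuracy_all_negative = accuracy(y_true, all_negative)
f1_all_negative = f1_score(y_true, all_negative)
accuracy_model = accuracy(y_true, some_model)
f1_model = f1_score(y_true, some_model)

print(accuracy_all_negative, f1_all_negative, accuracy_model, f1_model)
```

Both predictors reach the same accuracy here, yet the one that never flags propaganda scores an F1 of zero, so F1 separates them where accuracy cannot.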
News is the lifeline of human society: it underlines all the important events and influences public opinion like no other tool. But with the recent advent of electronic media, the sheer amount of news being churned out, and the current political climate, it’s hard to figure out what’s genuine news and what’s propaganda. This is where intelligent systems that can classify news articles and text fragments as propagandistic or non-propagandistic come into play. This Datathon is focused on developing such a system using various algorithms and methods to make these predictions. The levels of challenges are:
A System that is able to classify whether a news article is propaganda or not.
A System that is able to classify whether a sentence in an article is propaganda or not.
A System that is intelligently able to classify the propaganda technique used in the news piece.
In recent years, deceptive content such as fake news and fake reviews, also known as opinion spam, has increasingly become a danger for online users. Fake reviews have affected consumers and stores alike. Furthermore, the problem of fake news gained attention in 2016, especially in the aftermath of the last U.S. presidential elections. Fake reviews and fake news are closely related phenomena, as both consist of writing and spreading false information or beliefs. The opinion spam problem was formulated for the first time a few years ago, but it has quickly become a growing research area due to the abundance of user-generated content. It is now easy for anyone to write fake reviews or fake news on the web. The biggest challenge is the lack of an efficient way to tell the difference between a real review and a fake one; even humans are often unable to tell the difference. We are implementing 7 machine learning classification techniques here.
Due to the extreme divergence of social discussions happening in the political space, rumours and fake news are becoming an inferno, and it is difficult for any reader to tell them apart from the truth.
What are we going to achieve?
To detect propaganda at the article level and the sentence level, and to recognize its type.
Using supervised machine learning techniques, a model will be created to identify and flag false news propaganda.
Cell phones have become a necessity for many people throughout the world. The ability to keep in touch with family and business associates and to access email are only a few of the reasons for the increasing importance of cell phones. Today’s technically advanced cell phones are capable of not only receiving and placing phone calls, but also storing data, taking pictures, and even serving as walkie-talkies, to name just a few of the available options.
The dataset for The Telenor Case – What do Game of Thrones and Telecoms Have in Common? contains data on delays in networks (RAVENS). The delays range from 26/07/2018 to 05/08/2018. Each RAVEN_NAME represents a tower. There are 7847 unique RAVEN_NAMES across different networks such as 2G/3G/4G. There are 5 unique families.
To provide an optimum solution to the business problem, we solve it in two steps: (i) data analysis and coding in Python and (ii) time series model building in RStudio.
In the data analysis we found the number of delays (failures) of the RAVENS. We also found the top 10 RAVENS with and without fails, and we identified the family names and member names with the most and the fewest fails in the networks.
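The kind of counting described above can be sketched with the standard library alone. The rows below are invented, and the family labels and record layout are hypothetical stand-ins; only the RAVEN_NAME concept comes from the dataset description.

```python
from collections import Counter

# Hypothetical rows: (RAVEN_NAME, family, number of fails in that record).
records = [
    ("raven_a", "family_1", 3),
    ("raven_b", "family_2", 0),
    ("raven_a", "family_1", 2),
    ("raven_c", "family_1", 1),
    ("raven_b", "family_2", 0),
    ("raven_d", "family_3", 4),
]

fails_per_raven = Counter()
fails_per_family = Counter()
for raven, family, fails in records:
    fails_per_raven[raven] += fails
    fails_per_family[family] += fails

# Ten ravens with the most fails (fewer if the data has fewer ravens).
top_ravens = fails_per_raven.most_common(10)

# Ravens that never failed in the observed period.
ravens_without_fails = sorted(r for r, n in fails_per_raven.items() if n == 0)

# Family with the most fails overall.
worst_family, worst_count = fails_per_family.most_common(1)[0]

print(top_ravens, ravens_without_fails, worst_family, worst_count)
```

With a real export of the data, the same aggregation would more naturally be a group-by in a data frame library; the Counter version just keeps the sketch dependency-free.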
Prediction and forecasting are done by building time series models. As the name suggests, this involves working with time-based data (years, days, hours, minutes) to derive hidden insights and support informed decision making. Time series models are very useful when the data is serially correlated. To predict the next four days from the mobile data, we divided the data into train and test sets. We performed the time series analysis using ARIMA, simple exponential smoothing and Recurrent Neural Networks (RNN).
Finally, comparing the root mean square error of these algorithms, we found the RNN (Recurrent Neural Network) to be the best algorithm for predicting the coming days. The delays for the next four days were then predicted with the RNN. We have plotted graphs of the time series models for all the algorithms.
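The evaluation procedure described above (hold out the last four days, forecast them, compare models by root mean square error) can be illustrated with a small sketch. It is written in Python rather than R, uses an invented series, and compares only simple exponential smoothing against a training-mean baseline; the ARIMA and RNN models from the text are not reproduced here, and the smoothing factor of 0.5 is an arbitrary assumption.

```python
import math

# Hypothetical daily fail counts; the last four days are held out for testing.
series = [12, 15, 14, 16, 18, 17, 19, 21, 20, 22, 23]
train, test = series[:-4], series[-4:]


def simple_exponential_smoothing(values, alpha):
    """Return the final smoothed level, used as a flat forecast for future steps."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


ses_level = simple_exponential_smoothing(train, alpha=0.5)
ses_forecast = [ses_level] * len(test)

train_mean = sum(train) / len(train)
mean_forecast = [train_mean] * len(test)

# Lower RMSE on the held-out days means the better candidate.
scores = {
    "ses": rmse(test, ses_forecast),
    "train_mean": rmse(test, mean_forecast),
}
best = min(scores, key=scores.get)

print(scores, best)
```

Splitting chronologically rather than at random matters here, because shuffling a serially correlated series would leak future values into the training set.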
The objective of this analysis is to find the ravens that are not reaching their destination on time. This kind of analysis helps us scrutinize and understand which towers (ravens) require our utmost attention, so that we can address the causes that play a major role in the delays.
The dataset describes the networks between the towers (ravens). Land-based communication happens with the help of signals.
A cellular network or mobile network is a communication network where the last link is wireless. This wireless transmission is done by a tower, which comprises a transmitter and a receiver. The channel carries both data and voice transmission.
Every cellular network uses a different set of frequencies to avoid overlap and interference. Despite many precautions taken to maintain the setup, a few parameters still affect the transmission. These parameters can be classified as:
Interference between the frequencies
External Factors (Predators etc.)
For this, our first approach is to create a “Decision Model” that can add value to our business and help improve communication.
****** The tools that we are using for prediction are ******
1. Visual Analysis using different plots
2. Usage of ARMA (Auto-Regressive Moving Average model)
Using this Decision Model will help us forecast the failure rate of the Ravens for the next 4-7 days.
-> This dataset concerns a time series analysis of the failure rate of RAVENS sending messages from King's Landing to the North.
It draws an analogy between Telenor communication and Game of Thrones.
-> Sending ravens is one of the most fundamental parameters in mobile communications engineering.
For land-based mobile communications, the received raven variation is primarily the result of multipath fading caused by obstacles such as buildings (or clutter) or terrain irregularities; the distance between link end points; predatory animals, and interference among multiple transmissions, for example wars.
This inevitable raven variation is the cause of communication dropping, one of the most significant quality of service measures in operative communication. For this reason, various techniques and schemes are employed in the planning, design and optimization of raven networks to combat these propagation effects.
This normally covers the physical configuration of the network, which includes all aspects of network infrastructure deployment such as locations of base nests, additional food, sometimes guards, etc.
A typical example of these schemes and techniques is the use of models for flight prediction based on measured data.
Based on one month of data on flight fails, the participants have to perform a time-series analysis and predict the future number of fails.
In this article the mentors give some preliminary guidelines, advice and suggestions to the participants for the case. Every mentor should write their name and chat name at the beginning of their texts, so that there are no mix-ups with the other mentors. By the rules, it is essential to follow the CRISP-DM methodology (http://www.sv-europe.com/crisp-dm-methodology/). The DSS […]