Text Mining, Analytics & More: My Biased Opinions


Mar 3, 2016

Untold Truths About Text Mining and Web Mining in Practice

Truth #1: There is no such thing as "THE TOOL" for text mining

People often ask me, "So what tool do you use for your text analysis projects?" This is how it typically goes: I stare at them, not sure whether I should laugh or cry; they stare at me thinking I am crazy; and there are a few seconds of uncomfortable silence. I usually end up with a single tear in one eye and a half-hearted smile. Then the confused other person clarifies by saying, "Actually, what I meant was, do you use NLTK or GATE?" Now the tears start to pour. Well, the truth is, it all depends on the problem! There is no magic tool or set of tools. If I am scraping the Web, I might use something like Nutch or write a custom crawler (no, not in Python!) that does crawling and scraping at the same time. If I am building a text classifier, I may just use Mallet with simple stemming where needed. If I am doing entity linking, I might just use some similarity measures like Jaccard or cosine. It also depends on the programming language I am working in. If I am using Python, I may choose scikit-learn for text classification instead of Mallet. So, there really is no such thing as THE TOOL for text analysis - it always depends on what you are trying to solve and in which development environment. The idea of "THE TOOL" probably comes from some irresponsible advertising where you are promised that you can solve the world's text mining problems with this one tool.
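To make the entity linking remark concrete, here is a minimal sketch of Jaccard and cosine similarity over word tokens, using only the Python standard library. The tokenizer and the helper names (`tokenize`, `jaccard`, `cosine`, `best_match`) are my own illustrative assumptions, not part of any particular toolkit, and a real entity linker would need much more than this (normalization, candidate generation, context).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a naive assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: size of token-set intersection over size of union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity over raw term-frequency vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(mention, candidates, similarity=jaccard):
    """Hypothetical helper: return the candidate name most similar to the mention."""
    return max(candidates, key=lambda name: similarity(mention, name))


if __name__ == "__main__":
    names = [
        "University of Southern California",
        "University of Illinois at Urbana-Champaign",
    ]
    print(best_match("University of Illinois", names))
```

Which measure works better depends on the data: Jaccard ignores repeated tokens, while cosine over term frequencies rewards them, so it is worth trying both on a sample of your own mentions before committing to one.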

Truth #2: Text Mining is 20% engineering, 40% algorithms and 40% science/statistics

If you think text mining or web mining is just pure statistics or science where you can, for example, apply black-box machine learning and solve the entire problem, you are in for a big surprise. Many text mining projects start with intelligent algorithms (from focused crawling, to compact text representation, to graph search), with much of the code written from scratch, coupled with some statistical modeling to solve very specific parts of the problem (e.g. CRF-based web page segmentation). Developing the custom algorithms and tying the pieces together requires some level of engineering. This is why text mining and analytics is so appealing to some: there are always going to be new challenges in every new problem you work on. Whether it's engineering problems or the types of algorithms to use, it's always a fun and challenging journey. Even a change in domain can significantly change the scope of the problem.

Truth #3: Not all data scientists can work on text mining / web mining / NLP projects

Data science is a very broad term and it encompasses BI analysts, NLP experts and ML experts. The way data scientists are being trained today (outside academic programs) is that most of them get hands-on experience working on structured data using R or Python and are mostly trained in descriptive statistics and supervised machine learning. Text and language are a whole other ball game. The language-expert type of data scientist must have significant NLP knowledge as well as some knowledge of information retrieval, data mining and AI more broadly, plus creative algorithm development capabilities.

Now you might be thinking, so what is the whole point of GATE and NLTK and that Alchemy API, for god's sake! GATE, NLTK and Alchemy-type NLP APIs are just some tools to help you get started with text mining projects. These tools provide some core functionality such as stemming, sentence splitting, chunking, etc. You are not required to use these tools. If you find other off-the-shelf tools that are easy to integrate into your code, go ahead and use them! If you are already working with Python, then NLTK is always an option. The bottom line is, pick and choose and use *only* what you REALLY need, not something that looks or sounds cool, as this just adds unneeded overhead and makes debugging a living nightmare.

Feb 13, 2015

What is the Notion of NLP in Computer Science vs. Biomedical Informatics vs. Information Systems (MIS)?

Computer Science: In computer science research, when we talk about NLP methods we are referring to specific methods that use some linguistic or structural properties of text, often coupled with statistics, to solve targeted language and text mining problems. The NLP parts are those "customized methods" that we use to solve specific problems. Thus, the notion of NLP here is rather broad, as any novel approach that does not just re-use existing tools could be considered NLP or Computational Linguistics.

Biomedical Informatics: In the pure biomedical informatics world, however, I often notice that the idea of what NLP is can be quite narrow. People tend to refer to the usage of tools such as NegEx detection, POS taggers, parsers and stemmers as NLP. While the underlying technology in these tools can technically involve lightweight to heavy NLP, applying these tools to specific problems is not necessarily "NLP". It is more an application of NLP tools, or text processing using NLP tools. To be technically correct, only when these NLP tools are used in combination with additional learning methods or customized text mining algorithms to solve a specific problem would this probably qualify as actual NLP.

Information Systems (MIS): The concept of what qualifies as NLP (and text mining) in the information systems (MIS) world tends to be a lot more shallow and ill-defined than in biomedical informatics or computer science. Here, even counting word frequencies seems to be considered a form of NLP. In fact, even getting a binary classifier to work on text is considered fairly involved NLP. This may seem slightly exaggerated, but when you start reviewing papers from some of the MIS departments through TOIS or TKDE you will start to understand how ill-defined NLP can really be.

The point here is that there are different flavors of NLP both in research and in practice, and the notion of NLP varies from department to department. So when you submit obvious methods to conferences where the audience primarily consists of core computer scientists, you will get dinged really hard for the lack of originality in your proposed methods. At the same time, if you submit a really effective algorithm that does not use existing NLP tools to a biomedical informatics journal, again you may get dinged, but this time for not using existing tools or, as a friend of mine once told me, because "the reviewer had philosophical issues with my submission". Also, be warned that if a physician comes and tells you, "Hey, I am also doing NLP" (heard quite a few of those), don't be surprised if what he or she means is running some NLP tools on some patient data. And if someone from the IS department of a business school talks to you about offering a text mining course, it usually means they are teaching you how to use R or SAS to do some text processing and querying :D, not to mention they will also refer to this as "Big Data".

Sep 26, 2014

Super WoW, I loved KDD 2014!

I used to think KDD was a super hyped-up conference, but boy was I wrong, as I found out when I actually attended it. This was my first time attending KDD (2014) and I enjoyed quite a few of the sessions, both research and industry. The funny part is that people tend to think that industry sessions are much more interesting than the research sessions, but that was not the case at all. The research sessions were equally interesting. What really intrigued me was the fact that they adopted the theme "Data Science for Social Good" and had really good keynote and invited speakers to actually support this theme. Highlights of some of the invited talks and keynote sessions from KDD 2014:

Dr. Nigam Shah
-Talks about using data mining in medicine for finding actionable insights
-So you have made a discovery, what do you do with that information? How do you make this actionable?

Dr. Oren Etzioni
-Talks about the future - that it's way beyond deep learning and classification in general
-We need a more complete human-like knowledge base (common sense knowledge and reasoning)

Dr. Eric Horvitz
-Talks about predictions for potentially preventable outcomes and other projects in the healthcare domain

Dr. Eric Schadt
-Talks about some really cutting-edge work in personalized medicine; he thinks the future in all of this is HCI and visual analytics
-You have done all the analysis, so how do you present this information so that it can actually be interpreted and used by different groups of people?

I also attended a few of the tutorials, but unfortunately I found the tutorials not interesting at all. I guess the effectiveness of tutorials is one part material and one part presentation. I have limited patience for presentations that are a drag....but if you can sit through some of those there is quite a large selection of tutorials to choose from.

I think people should stop attending for-profit 'summits' like "Text Analytics Summit" and "Sentiment Analysis Summit" and start attending conferences like KDD and CHI that actually have a lot of substance and learning opportunities.

Jul 13, 2013

USC vs. UIUC ?

Well, this seems like a controversial topic, or perhaps it's like comparing apples and oranges, but here is my view of these two schools based on personal experience. Point to note: I completed my M.Sc. in Computer Science at USC and then went on to a PhD in Computer Science at UIUC.

Location: First of all, let's talk location. USC is located in absolute downtown Los Angeles....and this area, just as in most downtowns of big cities, is not quite known for being safe. You would often hear about muggings, robberies, etc. through campus announcements. But if you pay heed to safety precautions you should be just fine. It's just that the USC area is not an area where you would typically hang out after class or with friends, and don't even think about taking a stroll by yourself. I got by easily without being affected just by taking some common-sense safety steps. Now on a positive note, if you have a car while you are in school, then you have access to a lot of good food (Thai, Korean, Chinese, you name it) and entertainment! I truly enjoyed my time in LA, experiencing the big city traffic, the 'dangerousness' of the city and working at USC's ISI and ICT. Would I do it again? Most likely no, at least not near the downtown area. The USC campus itself is fairly small in comparison to other well-known universities. The student facilities and health care options are just so-so, not as good as at other big schools. Hey, it's a private school, they can do better! The weather in LA is nice and sunny all year round, slightly cool in the winter but very pleasant.

UIUC, on the other hand, is located in a small university town south of Chicago. It's about a two-and-a-half-hour drive from Chicago. The campus is huge and beautiful, I would say. In general the UIUC campus seems to be safe, though some parts of the campus seem to be safer than others. But of course you still have to be careful, especially at night. The winters are extremely cold; you cannot get by with your nice summer outfit. You do need to bundle up pretty well and plan ahead if you need to go out to classes or run errands (I stayed home for the most part - too weak for winters). Because UIUC is in a university town, you can only imagine how much entertainment the city has to offer - yes, limited. You would probably find more corn fields than movie theaters or restaurants. You could however drive up to Chicago or other nearby cities for more options. On the flip side, if you are actually working on your PhD, entertainment is probably the last thing on your priority list, so this should not be much of a concern at all. It's pretty easy to get medical care within the UIUC campus. They have a pretty big student health facility where you can get your immunizations, flu shots, etc.

CS Department: Moving on to the CS department. USC's CS department seems to leave students in the dark. If you have a problem or concerns, they tend not to work with you to resolve them. Instead they throw out standard 'canned' responses and insist on repeating the same message in different ways without really trying to get into details. You would probably be better off talking to an automated question answering system, if they even had one. In terms of funding, most of the funding is reserved for PhD students, either in the form of RA or TA positions. They have occasional merit-based funding or RA-type jobs for M.S. students, but that is totally dependent on professors and the availability of such jobs in other departments. Doing a CPT at USC was not very easy either. At the time I was there we had to go through a supposed advisor (Margery something) who approved the CPT and whom every student was afraid of. She was rude and shouted at students if the offer letter had mistakes or if she was irritated or not happy about something. Unprofessional alright, but it is what it is! I don't quite know if things have changed since I left. In summary, I would say that USC's CS/Engineering department is not very student oriented.

When I enrolled at UIUC I was pleasantly surprised by how much importance and effort the department puts into student happiness. They do everything in their capacity to see that their students succeed and get what they need. It almost feels like a family where they take you on as a member. I would especially commend Kathy Runck, who is very efficient at helping students and answering all sorts of administrative questions, as well as Mary Beth and Rhonda Kay, who always seem to get back to you almost immediately should you need their help. All in all I am very happy that I chose to join UIUC for my PhD, as this is when you can get into all sorts of administrative complications and messes. What also worked for me was the fact that I had an excellent advisor who was not just a good mentor, but also a very understanding person who works with you to see that you meet your future career plans. Another thing that I found out about the department is that once you graduate, it does not mean you are completely out of the picture. Once you have established connections and trust, these connections will always be there (so take advantage of them!). Funding is available for both M.S. and PhD students. For M.S. students it's mostly in the form of TAships. But for PhD students it can be scholarships, TAships or RAships.

Opportunities for research in NLP / Text Analytics: USC is very well known for its NLP research. They work on a variety of problems ranging from Machine Translation, to Question Answering, to Common Sense Reasoning. You will find a lot of well-known NLP researchers at USC's ISI: Jerry Hobbs, Eduard Hovy, Kevin Knight and Daniel Marcu, to name just a few. Here is a more complete list: http://nlg.isi.edu/nlpeople/. You will find some more NLP folks at USC's ICT as well. UIUC is also well known for NLP and research in all types of text analysis work; however, there are fewer faculty members in these areas. We have Dan Roth, Julia Hockenmaier, ChengXiang Zhai (who was my advisor), Roxana Girju, and Margaret M. Fleck. Apart from Prof. ChengXiang Zhai's, the NLP work at UIUC is more foundational and theoretical. Prof. Zhai works mostly on applied text analysis and management, covering topics in information retrieval, opinion mining and summarization, text summarization, bioinformatics, medical informatics and many more.

Ranking: In terms of ranking, both are top-tier schools: UIUC's engineering program ranks 5th and USC's ranks 9th. For the CS program in particular, UIUC still ranks 5th but USC ranks 20th. These rankings are based on U.S. News.