Computer Scientists Developing Technology To Improve Data Mining For Homeland Security
From online news articles to blogs, a massive amount of information is voluntarily being put before the public every day.
Some of this information may be valuable for protecting homeland security. However, to sift through this readily available content and summarize it for agencies like the Department of Homeland Security, analysts need to do more than sit at a computer, entering words like "al-Qaida" into Internet search engines.
That's why Kansas State University's William Hsu and other computer scientists who research data mining are part of a project to develop technology that makes automated Internet searches more useful and productive.
"We're helping to develop the next generation of Web search and crawling," Hsu said. "Our goal is to develop a research program that will help with homeland security. The Department of Homeland Security wants to pull information that's available to anyone in the public domain, like millions of articles from sources like CNN and Al-Jazeera, and monitor them for security."
Hsu is an associate professor of computer and information sciences, head of K-State's Laboratory for Knowledge Discovery in Databases, and co-principal investigator of a Department of Homeland Security-funded summer institute aimed at training future researchers in data sciences. The $2.4 million Data Sciences Summer Institute, headed by the University of Illinois along with K-State and the University of Texas at San Antonio, is titled "Multimodal Information Access and Synthesis." The Illinois-led cooperative is one of four such University Affiliate Centers nationwide.
Data mining is a way of processing vast amounts of information and putting it in multiple, useful formats. Hsu's data mining research at K-State includes applications in fields like genome analysis, nanoscale materials modeling and diagnostic medicine. The work at K-State that will benefit homeland security strives to resolve ambiguity in Internet searches. For instance, this would allow a search engine to differentiate between homeland security as a concept and Homeland Security as a government agency. Hsu said that one of the institute's projects aims to improve name recognition, a heavily studied problem in information extraction.
"The goal is to develop an automated system that can pick out al-Qaida as an organization, Kandahar as a place and Osama bin Laden as a person, based upon rules developed from previously seen documents," Hsu said. "Subcategories are a problem," he said. "'People' is a big tag. Is this a head of state? A celebrity? Someone who was interviewed?"
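To make the idea of tagging names with rules learned from earlier documents concrete, here is a minimal Python sketch. It is not the institute's system: the `TRAINING` list, the label names and both function names are hypothetical, and a simple lookup table stands in for whatever rules a real system would learn.

```python
import re
from collections import Counter, defaultdict

# Hypothetical labeled names standing in for "previously seen documents".
TRAINING = [
    ("al-Qaida", "ORGANIZATION"),
    ("Kandahar", "PLACE"),
    ("Osama bin Laden", "PERSON"),
    ("Department of Homeland Security", "ORGANIZATION"),
]

def learn_rules(examples):
    """Map each lowercased name to the label it was seen with most often."""
    counts = defaultdict(Counter)
    for name, label in examples:
        counts[name.lower()][label] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}

def tag_entities(text, rules):
    """Return (matched text, label) pairs in order of appearance."""
    taken = []  # character spans already claimed by a longer name
    found = []
    # Longer names are tried first so they win over shorter, overlapping ones.
    for name in sorted(rules, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue
            taken.append((m.start(), m.end()))
            found.append((m.start(), m.group(0), rules[name]))
    return [(surface, label) for _, surface, label in sorted(found)]

rules = learn_rules(TRAINING)
print(tag_entities("Reports from Kandahar mention Osama bin Laden and al-Qaida.", rules))
```

A lookup like this only recognizes names it has already seen, and it assigns one coarse label per name, so it cannot answer the subcategory questions Hsu raises about whether a person is a head of state, a celebrity or an interviewee.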
Data mining research at K-State and collaborating institutions is helping solve another problem with getting information off the Internet -- inefficient crawling. Hsu said search engines provide up-to-date results by first looking through vast numbers of Web pages and archiving them in a process called crawling. Hsu said the project leader, Kevin Chang at the University of Illinois, describes the problem with this process as "crawling in the dark -- you start somewhere and grab everything." Hsu said research in this area will lead to better searches whereby search engines can anticipate keywords, for instance. Search engines also could create virtual neighborhoods of information in which connections are made among bits of information based on the results of similar searches.
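One common alternative to "grabbing everything" is a focused, or best-first, crawl, in which the crawler visits the most promising known link next. The Python sketch below illustrates that general idea only, not the project's method. The tiny in-memory `PAGES` graph, its page names and the keyword scoring of link text are all hypothetical, and no network access is involved.

```python
import heapq

# Hypothetical in-memory "web": page name -> list of (link text, target page).
PAGES = {
    "portal": [("sports scores", "sports"), ("security alert", "alert")],
    "sports": [("more sports", "sports2")],
    "alert": [("local reaction to security threat", "reaction")],
    "sports2": [],
    "reaction": [],
}

def score(link_text, keywords):
    """Count how many times the keywords occur in a link's text."""
    words = link_text.lower().split()
    return sum(words.count(k) for k in keywords)

def focused_crawl(start, keywords, limit):
    """Visit up to `limit` pages, always following the best-scoring known link."""
    frontier = [(0, start)]  # min-heap of (negated score, page)
    seen = {start}
    order = []
    while frontier and len(order) < limit:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link_text, target in PAGES[page]:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(link_text, keywords), target))
    return order

print(focused_crawl("portal", ["security", "threat"], limit=3))
```

With a budget of three pages, this crawl goes from the portal to the alert page and then to the reaction page, whereas a crawl that simply follows links in the order found would spend one of its three visits on the sports page.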
Although text-based searches have their complications, Hsu said searching for images is even harder because searches rely on the words people use to describe the images, such as a photo caption. Data mining research at K-State and its partner institutions is leading to technology that will allow search engines to "look" through images from the Web. Hsu said search engines would sift through images that are automatically annotated, or marked up, to describe their contents. This would be done using tools that analyze the shape, border, color and orientation of objects, among many other features, to pick out, for instance, an image of George W. Bush in a press conference photo.
"Computers will figure out an image identity by 'seeing' a feature that all such images have in common," Hsu said.
The next generation of data mining research, Hsu said, will involve computer scientists working with social scientists. By scouring news articles and other public data, researchers can work on something called sentiment analysis.
"Sometimes Homeland Security just needs to know, for instance, what the local reaction is to a particular event such as a bomb threat or recent explosion," Hsu said.
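In its simplest form, sentiment analysis can be approximated by counting emotionally loaded words. The Python sketch below assumes two small, hypothetical word lists, `POSITIVE` and `NEGATIVE`. It is only an illustration of the concept, not the approach the researchers describe.

```python
import re

# Hypothetical word lists; a real system would need a far larger lexicon.
POSITIVE = {"calm", "relieved", "safe", "grateful"}
NEGATIVE = {"afraid", "angry", "panic", "worried"}

def sentiment_score(text):
    """Positive word count minus negative word count for one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Residents were worried and angry after the threat"))
print(sentiment_score("Neighbors felt safe and relieved"))
```

A score below zero suggests a negative reaction and a score above zero a positive one. Word counting misses negation, sarcasm and context, which is one reason this kind of work benefits from social scientists.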