-
Ester, M.; Sander, J.: Knowledge discovery in databases : Techniken und Anwendungen (2000)
0.02
0.019904286 = product of:
0.07961714 = sum of:
0.07961714 = weight(_text_:und in 2374) [ClassicSimilarity], result of:
0.07961714 = score(doc=2374,freq=14.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.51866364 = fieldWeight in 2374, product of:
3.7416575 = tf(freq=14.0), with freq of:
14.0 = termFreq=14.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.0625 = fieldNorm(doc=2374)
0.25 = coord(1/4)
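The score explanations in this list follow Lucene's ClassicSimilarity (TF-IDF) model: tf(freq) = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf × queryNorm, fieldWeight = tf × idf × fieldNorm, and the displayed score is coord × queryWeight × fieldWeight. A minimal sketch that recomputes the tree above (the helper name is illustrative, not part of Lucene's API):

```python
import math

def classic_similarity_score(freq, doc_freq, max_docs,
                             query_norm, field_norm, coord):
    """Recompute one branch of a Lucene ClassicSimilarity explanation."""
    tf = math.sqrt(freq)                                 # tf(freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))      # idf(docFreq, maxDocs)
    query_weight = idf * query_norm                      # queryWeight
    field_weight = tf * idf * field_norm                 # fieldWeight
    return coord * query_weight * field_weight           # coord * weight

# Values taken from the explanation above (_text_:und in doc 2374):
score = classic_similarity_score(freq=14.0, doc_freq=13141, max_docs=44421,
                                 query_norm=0.06921162, field_norm=0.0625,
                                 coord=0.25)
print(score)  # ~0.0199043, matching the reported 0.019904286 up to rounding
```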
- Abstract
- Knowledge Discovery in Databases (KDD) is a current research and application area of computer science. The goal of KDD is to autonomously discover decision-relevant but previously unknown relationships and connections within large volumes of data, and to present them to the analyst or end user in a clear, accessible form. The authors give a vivid account of the techniques and applications of this interdisciplinary field.
- Content
- Introduction.- Statistics and database fundamentals.- Classification.- Association rules.- Generalization and data cubes.- Spatial, text, web, and temporal data mining.- Outlook.
-
Analytische Informationssysteme : Data Warehouse, On-Line Analytical Processing, Data Mining (1998)
0.02
0.019904286 = product of:
0.07961714 = sum of:
0.07961714 = weight(_text_:und in 2380) [ClassicSimilarity], result of:
0.07961714 = score(doc=2380,freq=14.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.51866364 = fieldWeight in 2380, product of:
3.7416575 = tf(freq=14.0), with freq of:
14.0 = termFreq=14.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.0625 = fieldNorm(doc=2380)
0.25 = coord(1/4)
- Abstract
- Alongside operational information systems, information systems for the analytical tasks of specialists and executives are increasingly coming to the fore. In almost all companies, terms and concepts such as data warehouse, on-line analytical processing, and data mining are currently being discussed and the associated products evaluated. Against this background, this edited volume aims to provide an up-to-date overview of technologies, products, and trends. The contributions from industry and academia can offer practitioners valuable guidance when building and deploying such analytical information systems.
-
Fenstermacher, K.D.; Ginsburg, M.: Client-side monitoring for Web mining (2003)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 2611) [ClassicSimilarity], result of:
0.079129905 = score(doc=2611,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 2611, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=2611)
0.25 = coord(1/4)
- Abstract
- "Garbage in, garbage out" is a well-known phrase in computer analysis, and one that comes to mind when mining Web data to draw conclusions about Web users. The challenge is that data analysts wish to infer patterns of client-side behavior from server-side data. However, because only a fraction of the user's actions ever reaches the Web server, analysts must rely an incomplete data. In this paper, we propose a client-side monitoring system that is unobtrusive and supports flexible data collection. Moreover, the proposed framework encompasses client-side applications beyond the Web browser. Expanding monitoring beyond the browser to incorporate standard office productivity tools enables analysts to derive a much richer and more accurate picture of user behavior an the Web.
-
Benoit, G.: Data mining (2002)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 5296) [ClassicSimilarity], result of:
0.079129905 = score(doc=5296,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 5296, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=5296)
0.25 = coord(1/4)
- Abstract
- Data mining (DM) is a multistaged process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions. However, this definition may be applied equally to "knowledge discovery in databases" (KDD). Indeed, in the recent literature of DM and KDD, a source of confusion has emerged, making it difficult to determine the exact parameters of both. KDD is sometimes viewed as the broader discipline, of which data mining is merely a component, specifically pattern extraction, evaluation, and cleansing methods (Raghavan, Deogun, & Sever, 1998, p. 397). Thurasingham (1999, p. 2) remarked that "knowledge discovery," "pattern discovery," "data dredging," "information extraction," and "knowledge mining" are all employed as synonyms for DM. Trybula, in his ARIST chapter on text mining, observed that the "existing work [in KDD] is confusing because the terminology is inconsistent and poorly defined."
-
Dang, X.H.; Ong, K.-L.: Knowledge discovery in data streams (2009)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 816) [ClassicSimilarity], result of:
0.079129905 = score(doc=816,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 816, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=816)
0.25 = coord(1/4)
- Abstract
- Knowing what to do with the massive amount of data collected has always been an ongoing issue for many organizations. While data mining has been touted to be the solution, it has failed to deliver the impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world, i.e., they usually assume a static view of the data and a stable execution environment where resources are abundant. The reality, however, is that data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to truly deliver timely and relevant results. Recently, the processing of stream data has received much attention. What is interesting is that the methodology for designing stream-based algorithms may well be the solution to the above problem. In this entry, we discuss this issue and present an overview of recent works.
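As an illustration of the stream methodology this entry points to, not of its specific proposals: a one-pass, bounded-memory frequent-item counter in the style of the classic Misra-Gries algorithm, which embodies exactly the constraints (no static view of the data, scarce resources) that batch data mining algorithms ignore.

```python
def misra_gries(stream, k):
    """Approximate heavy hitters over a stream using at most k-1 counters.

    One pass, O(k) memory: items occurring more than n/k times in a
    stream of length n are guaranteed to survive in the counter table.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=3))  # 'a' dominates and survives: {'a': 3}
```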
-
Berendt, B.; Krause, B.; Kolbe-Nusser, S.: Intelligent scientific authoring tools : interactive data mining for constructive uses of citation networks (2010)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 226) [ClassicSimilarity], result of:
0.079129905 = score(doc=226,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 226, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=226)
0.25 = coord(1/4)
- Abstract
- Many powerful methods and tools exist for extracting meaning from scientific publications, their texts, and their citation links. However, existing proposals often neglect a fundamental aspect of learning: that understanding and learning require an active and constructive exploration of a domain. In this paper, we describe a new method and a tool that use data mining and interactivity to turn the typical search and retrieve dialogue, in which the user asks questions and a system gives answers, into a dialogue that also involves sense-making, in which the user has to become active by constructing a bibliography and a domain model of the search term(s). This model starts from an automatically generated and annotated clustering solution that is iteratively modified by users. The tool is part of an integrated authoring system covering all phases from search through reading and sense-making to writing. Two evaluation studies demonstrate the usability of this interactive and constructive approach, and they show that clusters and groups represent identifiable sub-topics.
-
Berry, M.W.; Esau, R.; Kiefer, B.: The use of text mining techniques in electronic discovery for legal matters (2012)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 1091) [ClassicSimilarity], result of:
0.079129905 = score(doc=1091,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 1091, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=1091)
0.25 = coord(1/4)
- Abstract
- Electronic discovery (eDiscovery) is the process of collecting and analyzing electronic documents to determine their relevance to a legal matter. Office technology has advanced and eased the requirements necessary to create a document. As such, the volume of data has outgrown the manual processes previously used to make relevance judgments. Methods of text mining and information retrieval have been put to use in eDiscovery to help tame the volume of data; however, the results have been uneven. This chapter looks at the historical bias of the collection process. The authors examine how tools like classifiers, latent semantic analysis, and non-negative matrix factorization deal with nuances of the collection process.
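As a hedged illustration of one of the tools the chapter names: non-negative matrix factorization applied to a TF-IDF term-document matrix, here with scikit-learn on a toy collection. The corpus and parameters are invented for the sketch; the chapter's own eDiscovery pipeline is not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "merger agreement draft attached for review",
    "lunch on friday? the usual place",
    "revised merger agreement with counsel comments",
    "friday works, see you at the usual place",
]

# Term-document weighting followed by a low-rank factorization:
# W (docs x topics) gives each document's topic mixture.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)

for doc, weights in zip(docs, W):
    print(weights.argmax(), doc)  # potentially responsive vs. chatter
```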
-
Biskri, I.; Rompré, L.: Using association rules for query reformulation (2012)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 1092) [ClassicSimilarity], result of:
0.079129905 = score(doc=1092,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 1092, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=1092)
0.25 = coord(1/4)
- Abstract
- In this paper the authors will present research on the combination of two methods of data mining: text classification and maximal association rules. Text classification has been the focus of interest of many researchers for a long time. However, the results take the form of lists of words (classes) that people often do not know what to do with. The use of maximal association rules brings a number of advantages: (1) the detection of dependencies and correlations between the relevant units of information (words) of different classes, and (2) the extraction of hidden, often relevant knowledge from a large volume of data. The authors will show how this combination can improve the process of information retrieval.
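To make the mechanism concrete, a minimal sketch of plain support/confidence association rules mined from per-document word sets; the maximal association rules used in the paper are a stronger variant and are not reproduced here.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_conf=0.9):
    """Mine simple word -> word rules from per-document word sets."""
    n = len(transactions)
    words = set().union(*transactions)
    support = {w: sum(w in t for t in transactions) / n for w in words}
    rules = []
    for a, b in combinations(sorted(words), 2):
        pair = sum(a in t and b in t for t in transactions) / n
        if pair < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            conf = pair / support[x]          # P(y | x)
            if conf >= min_conf:
                rules.append((x, y, pair, conf))
    return rules

docs = [{"mining", "association", "rules"},
        {"mining", "association", "query"},
        {"mining", "rules", "query"}]
for x, y, s, c in association_rules(docs):
    print(f"{x} -> {y} (support={s:.2f}, conf={c:.2f})")
```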
-
Organisciak, P.; Schmidt, B.M.; Downie, J.S.: Giving shape to large digital libraries through exploratory data analysis (2022)
0.02
0.019782476 = product of:
0.079129905 = sum of:
0.079129905 = weight(_text_:however in 1474) [ClassicSimilarity], result of:
0.079129905 = score(doc=1474,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.27530175 = fieldWeight in 1474, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.046875 = fieldNorm(doc=1474)
0.25 = coord(1/4)
- Abstract
- The emergence of large multi-institutional digital libraries has opened the door to aggregate-level examinations of the published word. Such large-scale analysis offers a new way to pursue traditional problems in the humanities and social sciences, using digital methods to ask routine questions of large corpora. However, inquiry into multiple centuries of books is constrained by the burdens of scale, where statistical inference is technically complex and limited by hurdles to access and flexibility. This work examines the role that exploratory data analysis and visualization tools may play in understanding large bibliographic datasets. We present one such tool, HathiTrust+Bookworm, which allows multifaceted exploration of the multimillion work HathiTrust Digital Library, and center it in the broader space of scholarly tools for exploratory data analysis.
-
Datentracking in der Wissenschaft : Aggregation und Verwendung bzw. Verkauf von Nutzungsdaten durch Wissenschaftsverlage. Ein Informationspapier des Ausschusses für Wissenschaftliche Bibliotheken und Informationssysteme der Deutschen Forschungsgemeinschaft (2021)
0.02
0.019545622 = product of:
0.07818249 = sum of:
0.07818249 = weight(_text_:und in 1249) [ClassicSimilarity], result of:
0.07818249 = score(doc=1249,freq=24.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.50931764 = fieldWeight in 1249, product of:
4.8989797 = tf(freq=24.0), with freq of:
24.0 = termFreq=24.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.046875 = fieldNorm(doc=1249)
0.25 = coord(1/4)
- Abstract
- This information paper describes the digital tracking of scholarly activities. Researchers use a multitude of digital information resources every day, for example literature and full-text databases. In doing so they frequently leave usage traces that reveal which content was searched for and used, how long they dwelled on it, and other kinds of scholarly activity. These usage traces can be recorded, aggregated, and reused or sold by the providers of the information resources. The paper outlines the transformation of academic publishers into data analytics businesses, points to the resulting consequences for science and its institutions, and names the types of data collection being employed. It thus serves above all to document current practices and is intended to stimulate discussion of their consequences for science. It is addressed to all researchers and to all actors in the scholarly landscape.
- Editor
- Deutsche Forschungsgemeinschaft / Ausschuss für Wissenschaftliche Bibliotheken und Informationssysteme
-
Lackes, R.; Tillmanns, C.: Data Mining für die Unternehmenspraxis : Entscheidungshilfen und Fallstudien mit führenden Softwarelösungen (2006)
0.02
0.018713508 = product of:
0.07485403 = sum of:
0.07485403 = weight(_text_:und in 2383) [ClassicSimilarity], result of:
0.07485403 = score(doc=2383,freq=22.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.48763448 = fieldWeight in 2383, product of:
4.690416 = tf(freq=22.0), with freq of:
22.0 = termFreq=22.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.046875 = fieldNorm(doc=2383)
0.25 = coord(1/4)
- Abstract
- The book is aimed at practitioners in companies who deal with the analysis of large data sets. After a short theoretical part, four case studies from the customer relationship management of a mail-order retailer are worked through. Eight leading software solutions were used: Intelligent Miner from IBM, Enterprise Miner from SAS, Clementine from SPSS, Knowledge Studio from Angoss, Delta Miner from Bissantz, Business Miner from Business Object, and the Data Engine from MIT. The case studies bring out the strengths and weaknesses of the individual solutions and demonstrate a methodologically sound approach to data mining. Both provide valuable decision support for selecting standard data mining software and for practical data analysis.
- Content
- Models, methods, and tools: Goals and structure of the study.- Fundamentals.- Planning and decision-making with data mining support.- Methods.- Functionality and handling of the software solutions. Case studies: Initial situation and data inventory in mail-order retail.- Customer segmentation.- Explaining regional marketing successes in new customer acquisition.- Forecasting customer lifetime value.- Selecting customers for a direct marketing campaign.- Which software solution for which decision?- Conclusion and market developments.
-
Analytische Informationssysteme : Data Warehouse, On-Line Analytical Processing, Data Mining (1999)
0.02
0.018618755 = product of:
0.07447502 = sum of:
0.07447502 = weight(_text_:und in 2381) [ClassicSimilarity], result of:
0.07447502 = score(doc=2381,freq=16.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.48516542 = fieldWeight in 2381, product of:
4.0 = tf(freq=16.0), with freq of:
16.0 = termFreq=16.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.0546875 = fieldNorm(doc=2381)
0.25 = coord(1/4)
- Abstract
- Alongside the operational information systems that support the handling of day-to-day business, information systems for the analytical tasks of specialists and executives are increasingly coming to the fore. In almost all companies, terms and concepts such as data warehouse, on-line analytical processing, and data mining are currently being discussed and the associated products evaluated. Against this background, this edited volume aims to provide an up-to-date overview of technologies, products, and trends. The contributions from industry and academia can offer practitioners valuable guidance when building and deploying such analytical information systems.
- Content
- Fundamentals.- Data warehouse.- On-line analytical processing.- Data mining.- Business and strategic aspects.
-
Drees, B.: Text und data mining : Herausforderungen und Möglichkeiten für Bibliotheken (2016)
0.02
0.017842628 = product of:
0.07137051 = sum of:
0.07137051 = weight(_text_:und in 4952) [ClassicSimilarity], result of:
0.07137051 = score(doc=4952,freq=20.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.4649412 = fieldWeight in 4952, product of:
4.472136 = tf(freq=20.0), with freq of:
20.0 = termFreq=20.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.046875 = fieldNorm(doc=4952)
0.25 = coord(1/4)
- Abstract
- Text and data mining (TDM) is gaining importance as a scientific method, confronting academic libraries with new challenges while also offering them new opportunities. This article gives an overview of TDM from a library perspective. The term text and data mining is discussed in the context of related terms, and the goals, tasks, and methods of TDM are explained and illustrated with exemplary TDM applications in science and research. Technical and legal problems and obstacles in the TDM context are also laid out. Finally, the relevance of TDM for libraries is shown, both in their role as information intermediaries and providers and as users of TDM methods. In addition, a survey of the operators of document servers at libraries in Germany on their current handling of TDM was conducted as part of this work; it shows that there is still considerable room for development. The research data underlying the article are published under the DOI 10.11588/data/10090.
-
Baumgartner, R.: Methoden und Werkzeuge zur Webdatenextraktion (2006)
0.02
0.01741625 = product of:
0.069665 = sum of:
0.069665 = weight(_text_:und in 808) [ClassicSimilarity], result of:
0.069665 = score(doc=808,freq=14.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.4538307 = fieldWeight in 808, product of:
3.7416575 = tf(freq=14.0), with freq of:
14.0 = termFreq=14.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.0546875 = fieldNorm(doc=808)
0.25 = coord(1/4)
- Abstract
- The World Wide Web can be regarded as the largest "database" known to us. Unfortunately, today's web is largely designed for presentation to human users and consists of very heterogeneous data holdings. Moreover, the web lacks the means to query information in a structured way, aggregated from different sources, which makes it unsuitable for automatic machine processing. To use web data effectively nonetheless, languages, methods, and tools for extracting and aggregating these data have been developed. This article gives an overview and categorization of different approaches to data extraction from the web. Several example scenarios in B2B data exchange, in the business intelligence domain, and in particular the generation of data for Semantic Web ontologies illustrate the effective use of these technologies.
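To make the wrapper idea concrete, a minimal sketch using only Python's standard library: it extracts (product, price) records from a hypothetical, fixed page layout. Real web extraction languages and tools of the kind surveyed in the article are far more expressive and robust than this.

```python
from html.parser import HTMLParser

class RecordExtractor(HTMLParser):
    """Pull (product, price) pairs out of an assumed page layout."""
    def __init__(self):
        super().__init__()
        self.field = None
        self.records = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("product", "price"):   # layout assumed for the sketch
            self.field = cls

    def handle_data(self, data):
        if self.field == "product":
            self.records.append([data.strip(), None])
        elif self.field == "price" and self.records:
            self.records[-1][1] = data.strip()
        self.field = None

html = ('<div><span class="product">Disk</span>'
        '<span class="price">49.90</span></div>')
parser = RecordExtractor()
parser.feed(html)
print(parser.records)  # [['Disk', '49.90']]
```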
-
Heyer, G.; Quasthoff, U.; Wittig, T.: Text Mining : Wissensrohstoff Text. Konzepte, Algorithmen, Ergebnisse (2006)
0.02
0.017237617 = product of:
0.06895047 = sum of:
0.06895047 = weight(_text_:und in 218) [ClassicSimilarity], result of:
0.06895047 = score(doc=218,freq=42.0), product of:
0.15350439 = queryWeight, product of:
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.06921162 = queryNorm
0.4491759 = fieldWeight in 218, product of:
6.4807405 = tf(freq=42.0), with freq of:
42.0 = termFreq=42.0
2.217899 = idf(docFreq=13141, maxDocs=44421)
0.03125 = fieldNorm(doc=218)
0.25 = coord(1/4)
- Abstract
- A large part of the world's knowledge exists as digital text on the Internet or in intranets. Today's search engines exploit this raw material only rudimentarily: they can recognize semantic relationships only to a limited degree. Everyone is waiting for the Semantic Web, in which the authors of texts insert the semantics themselves, but that will take a long time yet. There is, however, a technology that already makes it possible to analyze and prepare semantic relationships in raw text: the research field of text mining uses statistical and pattern-based methods to extract, process, and exploit knowledge from texts, laying the foundation for the search engines of the future. What comes to mind at the word "Stich"? Some think of tennis, others of the card game Skat. The different contexts can be determined automatically by text mining and displayed as word networks. Which terms most frequently appear to the left and right of the word "Festplatte"? Which word forms and proper names have newly entered the German language since 2001? Text mining answers these and many other questions. This first German textbook on the subject is aimed at students as well as practitioners with a background in computer science, business informatics, and/or linguistics who want to learn about the fundamentals, methods, and applications of text mining and who are looking for ideas for implementing their own applications. It is based on work carried out in recent years in the Automatic Language Processing group at the Institute of Computer Science of the University of Leipzig under the direction of Prof. Dr. Heyer. A wealth of practical examples of text mining concepts and algorithms gives the reader a comprehensive yet detailed understanding of the fundamentals and applications of text mining. Topics covered: knowledge and text; foundations of meaning analysis; text databases; language statistics; clustering; pattern analysis; hybrid methods; example applications; appendices on statistics and linguistic foundations. 360 pages, 54 figures, 58 tables, and 95 glossary entries, with a free e-learning course "Schnelleinstieg: Sprachstatistik"; an online certificate course with mentor and tutor support is planned to accompany the book.
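The neighbor question from the blurb (which terms appear most frequently to the left and right of "Festplatte"?) is simple co-occurrence statistics; a minimal sketch on a toy token sequence:

```python
from collections import Counter

def neighbor_counts(tokens, target):
    """Count the words immediately left and right of a target term."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right

text = ("die externe Festplatte ist voll , eine neue Festplatte "
        "kaufen , die alte Festplatte formatieren")
left, right = neighbor_counts(text.split(), "Festplatte")
print(left.most_common(3))   # e.g. [('externe', 1), ('neue', 1), ('alte', 1)]
print(right.most_common(3))  # e.g. [('ist', 1), ('kaufen', 1), ...]
```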
-
Shi, X.; Yang, C.C.: Mining related queries from Web search engine query logs using an improved association rule mining model (2007)
0.02
0.016485397 = product of:
0.06594159 = sum of:
0.06594159 = weight(_text_:however in 1597) [ClassicSimilarity], result of:
0.06594159 = score(doc=1597,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.22941813 = fieldWeight in 1597, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.0390625 = fieldNorm(doc=1597)
0.25 = coord(1/4)
- Abstract
- With the overwhelming volume of information, the task of finding relevant information on a given topic on the Web is becoming increasingly difficult. Web search engines hence become one of the most popular solutions available on the Web. However, it has never been easy for novice users to organize and represent their information needs using simple queries. Users have to keep modifying their input queries until they get expected results. Therefore, it is often desirable for search engines to give suggestions on related queries to users. Besides, by identifying those related queries, search engines can potentially perform optimizations on their systems, such as query expansion and file indexing. In this work we propose a method that suggests a list of related queries given an initial input query. The related queries are based on the query log of previously submitted queries by human users, which can be identified using an enhanced model of association rules. Users can utilize the suggested related queries to tune or redirect the search process. Our method not only discovers the related queries, but also ranks them according to the degree of their relatedness. Unlike many other rival techniques, it also performs reasonably well on less frequent input queries.
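A simplified sketch of the general idea: suggest queries that co-occur with an input query in user sessions and rank them by a confidence-style relatedness degree. The paper's enhanced association-rule model differs in its details; this is only the plain co-occurrence baseline.

```python
from collections import Counter
from itertools import combinations

def related_queries(sessions, query, min_support=2):
    """Rank queries co-occurring with `query` by P(other | query)."""
    pair_counts, query_counts = Counter(), Counter()
    for session in sessions:
        qs = set(session)
        query_counts.update(qs)
        pair_counts.update(frozenset(p) for p in combinations(sorted(qs), 2))
    scored = []
    for pair, count in pair_counts.items():
        if query in pair and count >= min_support:
            other = next(iter(pair - {query}))
            scored.append((other, count / query_counts[query]))
    return sorted(scored, key=lambda x: -x[1])

sessions = [["data mining", "kdd"], ["data mining", "text mining"],
            ["data mining", "kdd", "clustering"], ["kdd", "data mining"]]
print(related_queries(sessions, "data mining"))  # [('kdd', 0.75)]
```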
-
Liu, Y.; Zhang, M.; Cen, R.; Ru, L.; Ma, S.: Data cleansing for Web information retrieval using query independent features (2007)
0.02
0.016485397 = product of:
0.06594159 = sum of:
0.06594159 = weight(_text_:however in 1607) [ClassicSimilarity], result of:
0.06594159 = score(doc=1607,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.22941813 = fieldWeight in 1607, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.0390625 = fieldNorm(doc=1607)
0.25 = coord(1/4)
- Abstract
- Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous works used hyperlink analysis algorithms to solve this problem. However, little research has been focused on query-independent Web data cleansing for Web IR. In this paper, we first provide analysis of the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for reducing Web pages that are unlikely to be useful for user requests. We found that there exists a large proportion of low-quality Web pages in both the English and the Chinese Web page corpus, and retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in reducing a large portion of Web pages with a small loss in retrieval target pages. It makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance.
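A hedged sketch of the learning-based cleansing step: a classifier trained on query-independent page features to filter out pages unlikely to be retrieval targets. The feature set and model here are illustrative stand-ins, not the ones used in the paper.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical query-independent features per page:
# [document length, in-link count, URL depth, anchor-text count]
pages = [[5200, 340, 1, 120],   # portal-style retrieval target
         [90,   0,   6, 0],     # near-empty junk page
         [4100, 210, 2, 85],
         [120,  1,   5, 2]]
is_target = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(pages, is_target)

# Keep only pages the model deems likely retrieval targets:
corpus = [[3900, 150, 2, 60], [75, 0, 7, 1]]
print(clf.predict(corpus))  # e.g. [1 0] - the low-quality page is removed
```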
-
Wei, C.-P.; Lee, Y.-H.; Chiang, Y.-S.; Chen, C.-T.; Yang, C.C.C.: Exploiting temporal characteristics of features for effectively discovering event episodes from news corpora (2014)
0.02
0.016485397 = product of:
0.06594159 = sum of:
0.06594159 = weight(_text_:however in 2225) [ClassicSimilarity], result of:
0.06594159 = score(doc=2225,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.22941813 = fieldWeight in 2225, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.0390625 = fieldNorm(doc=2225)
0.25 = coord(1/4)
- Abstract
- An organization performing environmental scanning generally monitors or tracks various events concerning its external environment. One of the major resources for environmental scanning is online news documents, which are readily accessible on news websites or infomediaries. However, the proliferation of the World Wide Web, which increases information sources and improves information circulation, has vastly expanded the amount of information to be scanned. Thus, it is essential to develop an effective event episode discovery mechanism to organize news documents pertaining to an event of interest. In this study, we propose two new metrics, Term Frequency × Inverse Document Frequency_Tempo (TF×IDF_Tempo) and TF×Enhanced-IDF_Tempo, and develop a temporal-based event episode discovery (TEED) technique that uses the proposed metrics for feature selection and document representation. Using a traditional TF×IDF-based hierarchical agglomerative clustering technique as a performance benchmark, our empirical evaluation reveals that the proposed TEED technique outperforms its benchmark, as measured by cluster recall and cluster precision. In addition, the use of TF×Enhanced-IDF_Tempo significantly improves the effectiveness of event episode discovery when compared with the use of TF×IDF_Tempo.
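The benchmark named above, TF×IDF weighting followed by hierarchical agglomerative clustering, can be sketched in a few lines (toy corpus; the temporal TF×IDF_Tempo metrics themselves are not reproduced here).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

news = ["quake hits coast, tsunami warning issued",
        "aftershocks follow coastal quake",
        "election debate focuses on economy",
        "candidates clash over economy in debate"]

# TF-IDF document vectors, then average-link clustering on cosine
# distance (scikit-learn < 1.2 names this parameter `affinity`).
X = TfidfVectorizer(stop_words="english").fit_transform(news).toarray()
hac = AgglomerativeClustering(n_clusters=2, linkage="average",
                              metric="cosine")
print(hac.fit_predict(X))  # two event episodes, e.g. [0 0 1 1]
```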
-
Vaughan, L.; Chen, Y.: Data mining from web search queries : a comparison of Google trends and Baidu index (2015)
0.02
0.016485397 = product of:
0.06594159 = sum of:
0.06594159 = weight(_text_:however in 2605) [ClassicSimilarity], result of:
0.06594159 = score(doc=2605,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.22941813 = fieldWeight in 2605, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.0390625 = fieldNorm(doc=2605)
0.25 = coord(1/4)
- Abstract
- Numerous studies have explored the possibility of uncovering information from web search queries but few have examined the factors that affect web query data sources. We conducted a study that investigated this issue by comparing Google Trends and Baidu Index. Data from these two services are based on queries entered by users into Google and Baidu, two of the largest search engines in the world. We first compared the features and functions of the two services based on documents and extensive testing. We then carried out an empirical study that collected query volume data from the two sources. We found that data from both sources could be used to predict the quality of Chinese universities and companies. Despite the differences between the two services in terms of technology, such as differing methods of language processing, the search volume data from the two were highly correlated and combining the two data sources did not improve the predictive power of the data. However, there was a major difference between the two in terms of data availability. Baidu Index was able to provide more search volume data than Google Trends did. Our analysis showed that the disadvantage of Google Trends in this regard was due to Google's smaller user base in China. The implication of this finding goes beyond China. Google's user bases in many countries are smaller than that in China, so the search volume data related to those countries could result in the same issue as that related to China.
-
Gill, A.J.; Hinrichs-Krapels, S.; Blanke, T.; Grant, J.; Hedges, M.; Tanner, S.: Insight workflow : systematically combining human and computational methods to explore textual data (2017)
0.02
0.016485397 = product of:
0.06594159 = sum of:
0.06594159 = weight(_text_:however in 4682) [ClassicSimilarity], result of:
0.06594159 = score(doc=4682,freq=2.0), product of:
0.28742972 = queryWeight, product of:
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.06921162 = queryNorm
0.22941813 = fieldWeight in 4682, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
4.1529117 = idf(docFreq=1897, maxDocs=44421)
0.0390625 = fieldNorm(doc=4682)
0.25 = coord(1/4)
- Abstract
- Analyzing large quantities of real-world textual data has the potential to provide new insights for researchers. However, such data present challenges for both human and computational methods, requiring a diverse range of specialist skills, often shared across a number of individuals. In this paper we use the analysis of a real-world data set as our case study, and use this exploration as a demonstration of our "insight workflow," which we present for use and adaptation by other researchers. The data we use are impact case study documents collected as part of the UK Research Excellence Framework (REF), consisting of 6,679 documents and 6.25 million words; the analysis was commissioned by the Higher Education Funding Council for England (published as report HEFCE 2015). In our exploration and analysis we used a variety of techniques, ranging from keyword in context and frequency information to more sophisticated methods (topic modeling), with these automated techniques providing an empirical point of entry for in-depth and intensive human analysis. We present the 60 topics to demonstrate the output of our methods, and illustrate how the variety of analysis techniques can be combined to provide insights. We note potential limitations and propose future work.
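A miniature version of the workflow's computational entry point: topic modeling a handful of toy "impact case study" texts with scikit-learn's LDA, whose top terms then serve as starting points for in-depth human reading. The study's actual tooling, corpus, and parameters are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

cases = ["new therapy reduced patient waiting times in clinics",
         "policy advice informed parliamentary committee debate",
         "therapy trial improved patient outcomes nationwide",
         "committee cited the policy research in legislation"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(cases)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")  # empirical entry points for close reading
```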