About K12 Education: Knowledgebase Data

The Education Research Knowledgebase is the almost-physical embodiment of the research-based education system. It is one of the main benefits a school gets from “the system” in a semi-anarchic education system. The knowledgebase has two main attributes: data organized and accumulated in a database, discussed here, and methods to access the data, discussed elsewhere.

The more detailed information exists in a database, the more information can be analyzed by the public, and the more informed the public can be. The Internet is a fine example of that. Or is going to be some day. On the other hand, giving information to irresponsible people is harmful to the system, to the schools and teachers, and ultimately to the very people who get that information. A trivial example is giving a violent parent the name and address of the teacher who failed his kid on a test. This problem is wider than the violent fringes: giving raw information to the ignorant gives the unscrupulous a good opportunity to shape that information in any way useful to them. See any hatemonger for examples. Worse yet, “the ignorant” is a pretty large group. It may even include people we meet in the mirror. For example, in the past few months we got a lot of raw information about Swine Flu. Bottom line: Should I worry about it? Is it dangerous for me and my loved ones? It is not all that clear to me. I am too ignorant.

The Education Research Knowledgebase should be widely accessible, in order to maximize its usefulness. The more people work with the knowledgebase, the more diverse ideas and thinking can go into it and come out of it. How wide? All qualified researchers and educators? All researchers and educators? All? Let’s go with All, as a working assumption, and we’ll see if we need to scale it back a bit. If we manage to stay with “All”, it will be a good step towards taking education seriously as everybody’s business.

In the current thinking about privacy, we must minimize the public’s access to the identity of individual students. On the other hand, the fewer details are available, the less information can be analyzed. A database open to everybody needs to be carefully anonymized.
Let’s consider a few types of details:

Individual student names must and can be kept out of reach.

Individual teacher names: In the current climate, they will have to be unreachable, though a culture can be created where the teachers don’t mind, and parents considering a school demand the information.

School names: This item suffers a lot from the information-to-the-ignorant problem. One needs very keen critical thinking, coupled with considerable familiarity with statistics, to avoid pitfalls in comparing schools. For example, suppose School A has a 90% success rate on the SAT, and School B has a 70% success rate. Most people will conclude (and nearly all people will feel) that School A must be better than School B. But maybe School A just pre-tests students, and prevents the less successful from even trying the SAT? Maybe School B takes those students dumped by School A, and manages to get an amazing success rate of 70%? Simplistic analysis is a danger, but instead of giving up and keeping information away from “All”, we can try to meet this danger head on and remove it; see the discussion of access methods. The school name is also a problematic piece of information because it is partially identifying: it can be used to identify individual students, especially in small schools. But this would take some effort by a data-savvy person, and such people are known to find all kinds of information, such as medical information, credit information, etc., even without a publicly accessible database. Let’s go ahead with “All” a few more steps and see how far we get.
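To make the School A versus School B pitfall concrete, here is a small worked sketch. The enrollment and test-taker counts are made up for illustration (the text gives only the 90% and 70% figures), and the function name `success_rates` is hypothetical. It shows how the same school can look very different depending on whether success is measured per test-taker or per enrolled student.

```python
def success_rates(enrolled, test_takers, succeeded):
    """Return success measured two ways: per test-taker and per enrolled student."""
    return {
        "per_test_taker": succeeded / test_takers,
        "per_enrolled": succeeded / enrolled,
    }

# Hypothetical numbers, chosen only to reproduce the 90% and 70% in the text.
# School A pre-tests and lets only half of its students sit the SAT.
school_a = success_rates(enrolled=100, test_takers=50, succeeded=45)
# School B lets everybody sit the SAT.
school_b = success_rates(enrolled=100, test_takers=100, succeeded=70)

print("School A:", school_a)
print("School B:", school_b)
```

Under these assumed numbers, School A wins on the published per-test-taker rate, while School B is well ahead once every enrolled student is counted. Which measure a reader sees is exactly the kind of choice the access methods would have to make visible.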

Researcher details: Name, current activities, past activities, etc. Freely accessible. If a researcher wants to remain anonymous, it may be possible, but then the research has to be marked as suspicious…

Individual parameters: Date of Birth; Ethnicity; Address of residence; Medical history; Family circumstances, etc. These suffer from being partially identifying, and should be partially anonymized one way or another. For example, the date of birth can be rounded to the Month of Birth, and the Address can be replaced with a few measurements along lines such as affluence, altitude, air quality, noise level, etc. Alternatively, the types of analysis can be limited so that such partially identifying information will not be visible for groups smaller than 100 people.
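The two ideas in the paragraph above, rounding the date of birth to the month and hiding results for groups smaller than 100 people, can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (a dictionary with a `dob` field) and the function names `round_to_month` and `visible_group_counts` are hypothetical, and a real knowledgebase would need far more careful anonymization than this.

```python
from collections import Counter
from datetime import date

MIN_GROUP_SIZE = 100  # the threshold mentioned in the text


def round_to_month(dob: date) -> str:
    """Coarsen a date of birth to its year and month, dropping the day."""
    return f"{dob.year:04d}-{dob.month:02d}"


def visible_group_counts(records, key, min_group_size=MIN_GROUP_SIZE):
    """Count records per group, hiding any group smaller than min_group_size."""
    counts = Counter(key(record) for record in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Hypothetical student records: 120 born in March 2001, 3 born in April 2001.
students = [{"dob": date(2001, 3, (i % 28) + 1)} for i in range(120)]
students += [{"dob": date(2001, 4, 5)} for _ in range(3)]

# Only the March group is large enough to be shown; the April group is hidden.
print(visible_group_counts(students, key=lambda s: round_to_month(s["dob"])))
```

The small April group simply disappears from the output, so a query cannot be used to single out those three students by their month of birth.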

Test scores: Individual (anonymized); Association with school; Association with teacher (anonymized for now); Association with individual parameters etc. All can be freely accessible.

Individual choices, such as courses taken, associated with dates and correlated with any other information in the database. These are also partially identifying and can be partially anonymized, and then be made accessible.

Research dependability quotient: Calculated from details such as the number of times the research was repeated by different teams, the number of pupils participating in the research, scores of formal reviews, dependability of researchers, etc. Freely accessible.

Formal research details: Ah, there's the rub. Formalized, quantifiable details are essential. But the more formalized the data is, the more it loses the “juice”: those details that were not expected by the designers of the research or the designers of the database. And this is where most of the meaning of the research may be. I have been dancing around this issue for a while now, and there are no answers yet. Generally, one can guess that there will be semantic markings on such details, creating a semantic web. Some people are working on such projects.

Raw research details: Until we get the Formal Research Details engine right, and I don’t expect that in the 21st century, we will need to keep the raw details available. This again brings up real privacy issues, discussed earlier in the blog. It may be that here access will be restricted and recorded: it can be limited to generally qualified personnel, further restricted to preapproved people, or decided on a per-request basis, to minimize the possibility of anyone accessing the raw data recognizing specific students. Also, there could be different levels of rawness of the data: for example, video footage with blurred-out faces may be more freely accessible than the same video in the raw.

These are examples of items in the database. The full database design would take more time than that required to write a blog entry.

21st Century Education System

Saturday, October 17, 2009
